DAY 59-100 DAYS MLCODE: RL – Policy Gradient

January 8, 2019

In the previous blog we discussed the neural-network-based policy; in this blog we are going to discuss the Policy Gradient approach to RL.

When we play a game like Frozen Lake from the previous blog, the agent may reach the goal, but many steps are involved before it gets there. This means the reward the agent receives depends not only on the current action but also on the previous ones, and we don’t know exactly which of the agent’s steps helped it reach the goal. This is called the credit assignment problem.

To tackle this kind of problem, a common strategy is to evaluate an action based on the sum of all the rewards that come after it, after applying a discount rate. In the image below, the agent’s movements are highlighted in color. For simplicity, assume that every step yields a reward of 1 point. Then the total discounted reward credited to the first step is 1 + r*1 + r^2*1 + r^3*1.

Now think of two extreme cases: if the discount rate r is zero, the current reward is not influenced by future rewards at all; if the discount rate is 1, future rewards count just as much as the current reward. Typical values for the discount rate are 0.95 or 0.99.
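To make this concrete, here is a minimal sketch of how the discounted return could be computed for one episode. The function name discount_rewards and the use of NumPy are illustrative choices, not part of the original code:

import numpy as np

def discount_rewards(rewards, discount_rate):
    """Discounted cumulative reward for each step of one episode."""
    discounted = np.zeros(len(rewards))
    cumulative = 0.0
    for step in reversed(range(len(rewards))):
        cumulative = rewards[step] + cumulative * discount_rate
        discounted[step] = cumulative
    return discounted

# Example: three steps with reward 1 each and discount rate 0.95
print(discount_rewards([1, 1, 1], 0.95))  # -> [2.8525, 1.95, 1.0]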

Figure: Agent movement to reach the goal in the Frozen Lake game

If we play the game enough times, the system will be able to assign positive scores to the good actions and negative scores to the bad ones. This is how we can evaluate each action the agent performs.
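One common way to turn these discounted returns into positive and negative scores is to normalize them across many episodes: subtract the mean and divide by the standard deviation, so above-average actions end up with a positive score and below-average ones with a negative score. A minimal sketch, building on the discount_rewards helper above:

def discount_and_normalize_rewards(all_rewards, discount_rate):
    """Normalize the discounted rewards across all episodes."""
    all_discounted = [discount_rewards(rewards, discount_rate)
                      for rewards in all_rewards]
    flat = np.concatenate(all_discounted)
    mean, std = flat.mean(), flat.std()
    return [(discounted - mean) / std for discounted in all_discounted]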

Policy Gradient

The Policy Gradient algorithm optimizes the parameters of the policy by following the gradients towards higher rewards. REINFORCE is one of the classic classes of Policy Gradient algorithms; it was introduced in the paper “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”. Here is one of its common variants:

  • Let the neural network policy play the game several times and compute the gradients at each step, but don’t apply them yet.
  • After running these episodes, compute each action’s score (the discounted and normalized reward described above).
  • If the score is positive, the action was good and you want the agent to perform it more often, so keep the gradients as computed. If the score is negative, the action was bad and you don’t want the agent to repeat it, so apply the opposite gradients to make it less likely.
  • Finally, compute the mean of all the resulting gradient vectors and use it to perform a Gradient Descent step (see the training-loop sketch after the construction code below).

Now let’s create a simple example to demonstrate this using TensorFlow.

import tensorflow as tf

# 1. Specify the network architecture
input_nos = 4       # == env.observation_space.shape[0]
hidden_lyr_nos = 4  # number of hidden units
output_nos = 1      # only outputs the probability of accelerating left

learning_rate = 0.01

initializer = tf.variance_scaling_initializer()

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, input_nos])

hidden = tf.layers.dense(X, hidden_lyr_nos, activation=tf.nn.elu, kernel_initializer=initializer)
logits = tf.layers.dense(hidden, output_nos)
outputs = tf.nn.sigmoid(logits)  # probability of action 0 (left)

# 3. Select a random action based on the estimated probabilities
p_left_and_right = tf.concat(axis=1, values=[outputs, 1 - outputs])
action = tf.multinomial(tf.log(p_left_and_right), num_samples=1)
# Target probability: act as if the chosen action is the best possible one
y = 1. - tf.to_float(action)
cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(labels=y, logits=logits)
optimizer = tf.train.AdamOptimizer(learning_rate)
# Calculate the gradients (they will be applied later, after scoring)
grads_and_vars = optimizer.compute_gradients(cross_entropy)
gradients = [grad for grad, variable in grads_and_vars]
gradient_placeholders = []
grads_and_vars_feed = []
for grad, variable in grads_and_vars:
    gradient_placeholder = tf.placeholder(tf.float32, shape=None)
    gradient_placeholders.append(gradient_placeholder)
    grads_and_vars_feed.append((gradient_placeholder, variable))
# Apply the fed-in gradients as the training operation
training_op = optimizer.apply_gradients(grads_and_vars_feed)

init = tf.global_variables_initializer()
saver = tf.train.Saver()
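The code above only builds the graph (the construction phase). The actual REINFORCE loop plays several games, scores every action and feeds the mean gradients back into the placeholders. The sketch below shows roughly how that execution phase could look; it assumes a CartPole-style Gym environment and the two reward helpers from earlier in this post, and the hyperparameter values and checkpoint path are illustrative:

import gym

env = gym.make("CartPole-v0")

n_iterations = 250        # number of training iterations
n_games_per_update = 10   # update the policy every 10 episodes
n_max_steps = 1000        # max steps per episode
discount_rate = 0.95

with tf.Session() as sess:
    init.run()
    for iteration in range(n_iterations):
        all_rewards, all_gradients = [], []
        for game in range(n_games_per_update):
            current_rewards, current_gradients = [], []
            obs = env.reset()
            for step in range(n_max_steps):
                # Run the policy and record the gradients, but don't apply them yet
                action_val, gradients_val = sess.run(
                    [action, gradients],
                    feed_dict={X: obs.reshape(1, input_nos)})
                obs, reward, done, info = env.step(int(action_val[0][0]))
                current_rewards.append(reward)
                current_gradients.append(gradients_val)
                if done:
                    break
            all_rewards.append(current_rewards)
            all_gradients.append(current_gradients)

        # Score every action, then feed the reward-weighted mean gradient per variable
        all_rewards = discount_and_normalize_rewards(all_rewards, discount_rate)
        feed_dict = {}
        for var_index, gradient_placeholder in enumerate(gradient_placeholders):
            mean_gradients = np.mean(
                [reward * all_gradients[game_index][step][var_index]
                 for game_index, rewards in enumerate(all_rewards)
                 for step, reward in enumerate(rewards)],
                axis=0)
            feed_dict[gradient_placeholder] = mean_gradients
        sess.run(training_op, feed_dict=feed_dict)
    saver.save(sess, "./my_policy_net_pg.ckpt")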

The code above was adapted from the book Hands-On Machine Learning with Scikit-Learn and TensorFlow. You can find the entire code here.