DAY 58-100 DAYS MLCODE: RL Part 2


In the previous blog, we created a simple example of reinforcement learning using a simple hard-coded policy. In this blog, we'll use a neural network to decide the action.

Since the Frozen Lake grid has a shape of 4×4, the agent can be in exactly one of 16 cells at any time. That means our input will be a 1×16 one-hot encoded vector, and the network will output a score for each of the four possible actions.

import gym
import numpy as np
import tensorflow as tf

env = gym.make("FrozenLake-v0")  # the same environment used in the previous blog

# 1. Specify the network architecture
input_nos = 16   # size of the state space: one input per cell of the 4x4 grid
hidden_lyrs = 4  # number of units in the single hidden layer
output_nos = 4   # one output per possible action (left, down, right, up)
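
To make that one-hot input concrete, here is a small illustration (the state index 6 is just an example, and it reuses the np imported above) of what gets fed into the network:

# Illustration only: state 6 of the 4x4 grid as a 1x16 one-hot row vector
state = 6
one_hot_state = np.identity(16)[state:state+1]
print(one_hot_state.shape)  # (1, 16)
print(one_hot_state)        # all zeros except a 1 at index 6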

Initialize the weights with a variance-scaling initializer:

initializer = tf.variance_scaling_initializer()

Build the neural network with a single hidden layer of 4 units:

# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, input_nos])
hidden = tf.layers.dense(X, hidden_lyrs, activation=tf.nn.elu,
                         kernel_initializer=initializer)
outputs = tf.layers.dense(hidden, output_nos, activation=tf.nn.sigmoid,
                          kernel_initializer=initializer)

# 3. Select the action with the highest estimated score
action = tf.argmax(outputs, 1)
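
Note that tf.argmax always picks the highest-scoring action. If you instead want to pick an action at random in proportion to the network's estimates, one option (a sketch, not part of the original post's code) is to output raw logits and sample from them with tf.multinomial:

# Alternative (sketch): a stochastic policy that samples one action from the logits
logits = tf.layers.dense(hidden, output_nos, kernel_initializer=initializer)
sampled_action = tf.multinomial(logits, num_samples=1)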

Now run the model in the environment to see how it behaves (there is no training yet, so the weights keep their random initial values):

init = tf.global_variables_initializer()

n_max_steps = 1000
obs_steps = []
with tf.Session() as sess:
    init.run()
    obs = env.reset()
    for step in range(n_max_steps):
        obs_steps.append(obs)
        # one-hot encode the current state before feeding it to the network
        obs_one_hot = np.identity(16)[obs:obs+1]
        action_val = action.eval(feed_dict={X: obs_one_hot})
        obs, reward, done, info = env.step(action_val[0])
        if done:
            break

env.close()

Print the obs_steps array to see which states the agent visited. This model is not complete yet: it has no loss function, so nothing is actually being learned.

print(obs_steps)

Output: [0, 4, 8, 9, 13, 14, 14, 13, 9]
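
Since the model has no loss function yet, one common way to complete it (a hedged sketch, not from the original post; next_Q, the target computation, and the learning rate of 0.1 are assumptions) is to treat the four outputs as action-value estimates and regress them towards a Q-learning target fed in from outside:

# Possible completion (sketch): regress the outputs towards a Q-target.
# next_Q would be fed with reward + gamma * max(Q(next_state)) computed in Python.
next_Q = tf.placeholder(tf.float32, shape=[None, output_nos])
loss = tf.reduce_sum(tf.square(next_Q - outputs))
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
training_op = optimizer.minimize(loss)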

In conclusion, this is a simple example of reinforcement learning with a neural-network-based policy. You can find the entire code here.