DAY 58 - 100 DAYS OF ML CODE: RL Part 2
In the previous blog, we built a simple example of reinforcement learning using a hard-coded policy. In this blog, we'll use a neural network to decide the action.
Since FrozenLake is a 4*4 grid, the agent occupies exactly one of 16 cells at any time. That means our input will be a 1*16 one-hot encoded vector. From that input, the network produces one of four possible actions as output.
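To make the encoding concrete, here is a minimal NumPy sketch of how a state index turns into the 1*16 one-hot input (the state value 6 is just an arbitrary example cell):

```python
import numpy as np

# A hypothetical state index on the 4x4 FrozenLake grid (0..15).
state = 6

# np.identity(16) is a 16x16 identity matrix; slicing [state:state+1]
# keeps the result 2-D, i.e. shape (1, 16) -- the batch-of-one row
# the network's input placeholder expects.
one_hot = np.identity(16)[state:state + 1]

print(one_hot.shape)  # (1, 16)
print(one_hot[0, 6])  # 1.0 at the agent's cell, 0.0 everywhere else
```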
# 1. Specify the network architecture
import gym
import numpy as np
import tensorflow as tf

input_nos = 16 # number of possible states (one per grid cell)
hidden_lyrs = 4 # number of neurons in the hidden layer
output_nos = 4 # one score per possible action (left, down, right, up)
Initialize the weights with a variance-scaling initializer:
initializer = tf.variance_scaling_initializer()
Build the neural network with a single hidden layer of 4 neurons:
# 2. Build the neural network
X = tf.placeholder(tf.float32, shape=[None, input_nos])
hidden = tf.layers.dense(X, hidden_lyrs, activation=tf.nn.elu,
                         kernel_initializer=initializer)
outputs = tf.layers.dense(hidden, output_nos, activation=tf.nn.sigmoid,
                          kernel_initializer=initializer)
# 3. Select the action with the highest estimated score
action = tf.argmax(outputs, 1)
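Note that `argmax` is greedy: it always picks the highest-scoring action. If you instead want to sample an action in proportion to the estimated probabilities (which helps exploration), you could normalize the scores and draw from them. A minimal NumPy sketch, with made-up score values:

```python
import numpy as np

# Hypothetical sigmoid outputs for the 4 actions (made-up values).
scores = np.array([0.9, 0.2, 0.6, 0.3])

# Greedy choice, equivalent to tf.argmax(outputs, 1).
greedy_action = int(np.argmax(scores))  # always picks action 0 here

# Stochastic choice: normalize to a probability distribution and
# sample, so lower-scoring actions are still tried occasionally.
probs = scores / scores.sum()
rng = np.random.default_rng(0)
sampled_action = int(rng.choice(4, p=probs))
```

In a TF1 graph this kind of sampling is usually done with a multinomial-sampling op instead of `argmax`, but the idea is the same.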
Now run the model to see how it acts:
env = gym.make("FrozenLake-v0")
init = tf.global_variables_initializer()
n_max_steps = 1000
obs_steps = []

with tf.Session() as sess:
    init.run()
    obs = env.reset()
    for step in range(n_max_steps):
        obs_steps.append(obs)
        # One-hot encode the current state as a (1, 16) input
        obs_input = np.identity(16)[obs:obs + 1]
        action_val = action.eval(feed_dict={X: obs_input})
        obs, reward, done, info = env.step(action_val[0])
        if done:
            break
env.close()
Print the obs_steps array to see the path the agent took. The model is not complete yet, since it has no loss function and the weights are never updated.
print(obs_steps)
Output: [0, 4, 8, 9, 13, 14, 14, 13, 9]
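The missing piece is a loss. One common choice for this kind of setup is to treat the network outputs as Q-values and regress them toward a Bellman target. A hedged NumPy sketch of computing that target for a single step — the Q-values, reward, and discount factor below are all made-up illustration values, not output from the network above:

```python
import numpy as np

# Made-up Q-value estimates for the current and next state (4 actions each).
q_current = np.array([0.5, 0.1, 0.3, 0.2])
q_next = np.array([0.4, 0.6, 0.2, 0.1])

reward = 0.0       # FrozenLake pays reward 1 only on reaching the goal
gamma = 0.99       # discount factor (a typical choice)
action_taken = 0   # the action the agent just executed

# Bellman target: immediate reward plus discounted best next-state value.
target = reward + gamma * q_next.max()

# Squared error on the taken action's estimate; in TensorFlow this
# quantity would be minimized by an optimizer to update the weights.
loss = (target - q_current[action_taken]) ** 2
```

Wiring this up in the graph (a target placeholder, `tf.reduce_sum(tf.square(...))`, and a gradient-descent optimizer) is what turns the random walk above into actual learning.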
In conclusion, this is a simple example of reinforcement learning with a neural-network-based policy. You can find the entire code here.