DAY 57 - 100 DAYS OF ML CODE: REINFORCEMENT LEARNING


January 6, 2019 · 100-Days-Of-ML-Code

In the previous few blogs, we discussed Autoencoders; now we'll start working on Reinforcement Learning. As per Wikipedia, Reinforcement Learning is:

An area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward

Wikipedia

Reinforcement Learning falls in between supervised learning (where we have labeled data) and unsupervised learning (where we don't have labeled data). Its label is a sparse, time-delayed signal called the reward, and the objective of this type of learning is to maximize the cumulative reward by performing a set of actions.

In 2013, DeepMind released a paper called “Playing Atari with Deep Reinforcement Learning”, which demonstrated that a single system can learn to play Atari 2600 video games directly from the screen. In another paper, “Human-level control through deep reinforcement learning”, released by DeepMind in 2015, they applied the same model to 49 different games and achieved superhuman performance in about half of them.

Reinforcement Learning

Consider the example of the Frozen Lake game environment:

The Frozen Lake environment consists of a 4×4 grid of blocks, and each block can be one of the following:

  • Start Block
  • Goal Block
  • Safe Frozen Block
  • A Dangerous hole

The surface is described using a grid like the following:

 
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)

The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.

Now suppose you want your model to learn how to play the above game. We could create a neural network that takes the agent's state as input and produces an action as output, i.e. whether the agent should move left, right, up or down. Of course, training such a network with full supervision would take a lot of data and time; instead, the agent receives only occasional feedback that it did the right thing (the reward) and has to figure out everything else on its own. This is called reinforcement learning.

Before we start working on an example of Reinforcement Learning, we need to learn about Gym. Gym is a toolkit for developing and comparing reinforcement learning algorithms. Let's start with the basics of Gym.

Create a Gym environment for the selected game. In our case, it's Frozen Lake:

import gym
env = gym.make('FrozenLake-v0')

In the above example, the make function creates the environment. After the environment is created, we have to reset it to initialize the agent's state. The reset function returns the agent's first observation.

obs = env.reset()

The environment's render function displays the current state of the environment:

env.render()

Output:
SFFF
FHFH
FFFH
HFFG
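
Every interaction with the environment goes through the step function, which takes an action and returns a tuple (observation, reward, done, info). As a quick sketch of the reward structure described above, assuming the classic Gym API (Frozen Lake numbers its 16 grid cells 0 to 15 as discrete states), we can take one random action from the start state:

a = env.action_space.sample()          # pick a random action (0=left, 1=down, 2=right, 3=up)
obs, reward, done, info = env.step(a)
print(obs, reward, done)               # e.g. 4 0.0 False -- reward stays 0 until the goal is reached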

Let's try another example and create an environment for MsPacman version 0:

env2 = gym.make('MsPacman-v0')
obs2 = env2.reset()

We can check the shape of the observation returned by the reset function:

print(obs2.shape)

Output: (210, 160, 3)

This means that the observation is a 210×160 RGB image (3 color channels), i.e. a screenshot of the game. Let's plot it using the matplotlib library to view the environment:

import matplotlib.pyplot as plt

plt.figure(figsize=(5,4))
plt.imshow(obs2)
plt.axis("off")

MsPacman-v0 - initial state

We can also check the number of discrete actions via the action_space attribute, like below:

print("Frozen Lake Actions: " + str(env.action_space))   # For Frozen Lake
print("MsPacman Actions: " + str(env2.action_space))     # For MsPacman

Output:
Frozen Lake Actions: Discrete(4)
MsPacman Actions: Discrete(9)
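
In the same way, the observation_space attribute describes what the agent sees. As a quick sketch (the exact printed form varies slightly between Gym versions), Frozen Lake has 16 discrete states, one per grid cell, while MsPacman observes the raw 210×160×3 screen image:

print(env.observation_space)    # Discrete(16)
print(env2.observation_space)   # Box(210, 160, 3)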

Now let's perform some actions and watch the state of the MsPacman game change. The MsPacman environment has Discrete(9) actions, meaning that the possible actions are the integers 0 through 8, which represent the 9 possible positions of the joystick (0=center, 1=up, 2=right, 3=left, 4=down, 5=upper-right, 6=upper-left, 7=lower-right, 8=lower-left).
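
If you want to verify this mapping rather than take it on faith, the Atari environments in Gym expose the underlying action labels (a minimal sketch; the exact label strings come from the Arcade Learning Environment):

print(env2.unwrapped.get_action_meanings())
# e.g. ['NOOP', 'UP', 'RIGHT', 'LEFT', 'DOWN', 'UPRIGHT', 'UPLEFT', 'DOWNRIGHT', 'DOWNLEFT']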

for step in range(110):
    env2.step(5)   # upper-right
for step in range(40):
    env2.step(4)   # down

Plot the environment current observation:

plt.figure(figsize=(5,4))
plt.imshow(env2.render(mode="rgb_array"))
plt.axis("off")

MsPacman - current state after performing the actions

So that covers the basics of creating an environment and performing actions. Let's get back to our Frozen Lake example and create a simple model using reinforcement learning.

Our learning policy will be simple. Define a basic policy that chooses an action from the previous knowledge (the action table) plus some random noise, so the agent still explores:


import numpy as np

def basic_policy(env, previous_knowledge, counter, s):
    # Greedy action for state s, plus noise that decays with the episode counter
    return np.argmax(previous_knowledge[s, :] + np.random.randn(1, env.action_space.n) * (1. / (counter + 1)))
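
For example, in the first episode (counter = 0) the noise is scaled by 1.0, so the choice is nearly random; by episode 1000 the scale has dropped to roughly 0.001, so the policy is essentially the greedy argmax over the learned values.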

Now create a simple model. The long-term reward for a given action is equal to the immediate reward from the current action combined with the discounted reward from the best action taken in the following state. The target value used to update the action table is:

Q(s,a) = r + γ · max_a' Q(s',a')
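
For instance, with γ = 0.95, if a step returns an immediate reward r = 1 and the best estimated value of the next state is 0.5, the target becomes 1 + 0.95 × 0.5 = 1.475; the code below then nudges the current table entry towards that target with learning rate lr instead of overwriting it outright.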

# Initialize the action table with all zeros
actions = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
# Create a list to contain the total reward per episode
rList = []
for i in range(num_episodes):
    # Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    # The action-table learning algorithm
    while j < 99:
        j += 1
        # Choose an action greedily (with noise) from the action table
        a = basic_policy(env, actions, i, s)
        # Get new state and reward from the environment
        s1, r, d, _ = env.step(a)
        # Update the action table with the new knowledge
        actions[s, a] = actions[s, a] + lr * (r + y * np.max(actions[s1, :]) - actions[s, a])
        rAll += r
        s = s1
        if d == True:
            break
    rList.append(rAll)

Print the average reward per episode, i.e. the fraction of episodes in which the agent reached the goal:

print("Score over time: " + str(sum(rList)/num_episodes))

Output: Score over time: 0.0355
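
One simple way to sanity-check how much the table has actually learned is to run a few extra episodes that follow it greedily, without the exploration noise. A minimal sketch:

# Evaluate the learned action table with a purely greedy policy (no noise)
successes = 0
eval_episodes = 100
for _ in range(eval_episodes):
    s = env.reset()
    for _ in range(99):
        a = np.argmax(actions[s, :])   # always take the best-known action
        s, r, d, _ = env.step(a)
        if d:
            successes += r             # r is 1 only when the goal is reached
            break
print("Greedy success rate: " + str(successes / eval_episodes))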

In the next blog, we'll try to create the same model using a neural network.

In conclusion, reinforcement learning is a step towards general AI, and it can be applied to many real-life problems where we can define an environment and the agent is rewarded or penalized for the actions it takes. You can find today's code here.