DAY 57 - 100 DAYS OF ML CODE: REINFORCEMENT LEARNING
In the previous few blogs, we discussed Autoencoders; now we'll start working on Reinforcement Learning. As per Wikipedia, Reinforcement Learning is:
An area of machine learning concerned with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward
Wikipedia
Reinforcement Learning falls between supervised learning (where we have labeled data) and unsupervised learning (where we don't have labeled data). It has a sparse and time-delayed label, the reward, and the objective of this type of learning is to maximize the cumulative reward by performing a set of actions.
In 2013, DeepMind released a paper called "Playing Atari with Deep Reinforcement Learning", which demonstrated that a system can learn to play Atari 2600 video games. Another paper, "Human-level control through deep reinforcement learning", released by DeepMind in 2015, applied the same model to 49 different games and achieved superhuman performance in half of them.
Reinforcement Learning
Consider the example of the Frozen Lake game environment:
The Frozen Lake environment consists of a 4×4 grid of blocks, and each block can be one of the following:
- Start Block
- Goal Block
- Safe Frozen Block
- A Dangerous hole
The surface is described using a grid like the following:
SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)
The episode ends when you reach the goal or fall in a hole. You receive a reward of 1 if you reach the goal, and zero otherwise.
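To make the reward structure concrete, here is a small pure-Python sketch (an illustration, not Gym's implementation) that maps the grid layout above to per-state rewards and terminal flags:

```python
# Pure-Python sketch of the Frozen Lake layout (illustration only).
# 16 states in row-major order, matching the 4x4 grid above.
GRID = "SFFF" + "FHFH" + "FFFH" + "HFFG"

def reward(state):
    """Reward of 1 only when reaching the goal, 0 otherwise."""
    return 1.0 if GRID[state] == "G" else 0.0

def is_terminal(state):
    """The episode ends on the goal or in a hole."""
    return GRID[state] in ("G", "H")

print(reward(15))       # the goal in the bottom-right corner
print(is_terminal(5))   # state 5 is a hole
```

Note the sparse reward: 15 of the 16 states give nothing back, which is exactly what makes this kind of problem hard for a learner.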
Now suppose you want your model to learn to play the above game. Before we start working on an example of Reinforcement Learning, we need to learn about OpenAI Gym.
Create a Gym environment for the selected game; in our case, it's Frozen Lake:
import gym
env = gym.make('FrozenLake-v0')
In the above example, the make function creates the environment. After the environment is created, we have to reset it to initialize the agent's state; the reset function returns the agent's first observation.
obs = env.reset()
The environment's render function displays the current state of the environment:
env.render()
Output:
SFFF
FHFH
FFFH
HFFG
Let's try another example and create an environment for MsPacman version 0:
env2 = gym.make('MsPacman-v0')
obs2 = env2.reset()
We can check the shape of the observation returned by the reset function:
print(obs2.shape)
Output: (210, 160, 3)
This means the observation is a 210×160 image with 3 color channels. Let's plot it using the matplotlib library to view the environment:
import matplotlib.pyplot as plt
plt.figure(figsize=(5,4))
plt.imshow(obs2)
plt.axis("off")
We can also check the number of discrete actions by checking the action_space attribute, like below:
print("Frozen Lake Actions: " + str(env.action_space))  # For Frozen Lake
print("MsPacman Actions: " + str(env2.action_space))    # For MsPacman
Output:
Frozen Lake Actions: Discrete(4)
MsPacman Actions: Discrete(9)
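Conceptually, Discrete(n) simply represents the integer actions 0 through n-1. A minimal sketch of such a space (a toy illustration, not Gym's actual class) could look like this:

```python
import random

class Discrete:
    """Toy version of a discrete action space with n actions: 0..n-1."""
    def __init__(self, n):
        self.n = n
    def sample(self):
        # Pick a random valid action, as env.action_space.sample() does
        return random.randrange(self.n)
    def contains(self, a):
        return isinstance(a, int) and 0 <= a < self.n
    def __repr__(self):
        return "Discrete(%d)" % self.n

space = Discrete(4)
print(space)             # Discrete(4)
print(space.contains(3))
```

The sample method is handy for a random baseline agent: before any learning, you can drive an environment by feeding it `space.sample()` at every step.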
Now let's perform some actions and see the change in the state of the MsPacman game. The MsPacman environment has Discrete(9) actions, meaning the possible actions are integers 0 through 8, which represent the 9 possible positions of the joystick (0=center, 1=up, 2=right, 3=left, 4=down, 5=upper-right, 6=upper-left, 7=lower-right, 8=lower-left).
for step in range(110):
    env2.step(5)  # Upper right
for step in range(40):
    env2.step(4)  # Down
Plot the environment's current observation:
plt.figure(figsize=(5,4))
plt.imshow(env2.render(mode="rgb_array"))
plt.axis("off")
So that covers the basics of creating an environment and interacting with it. Our learning policy will be simple: choose the action greedily from the previously learned action table, with random noise added for exploration that decays as the episode counter grows.
import numpy as np

def basic_policy(env, previous_knowledge, counter, s):
    # Pick the action with the highest learned value for state s,
    # plus random noise that shrinks as the episode counter grows
    return np.argmax(previous_knowledge[s, :] + np.random.randn(1, env.action_space.n) * (1. / (counter + 1)))
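The noise term is what drives exploration: early episodes act almost randomly, while later episodes mostly exploit the learned values. The decay can be seen directly with a hypothetical 4-action row of the table:

```python
import numpy as np

np.random.seed(0)  # for reproducibility of this illustration

q_row = np.array([0.0, 0.5, 0.1, 0.2])  # hypothetical learned values for one state
n_actions = 4

for counter in (0, 10, 1000):
    scale = 1.0 / (counter + 1)  # the noise magnitude used by basic_policy
    noisy = q_row + np.random.randn(n_actions) * scale
    print(counter, scale, int(np.argmax(noisy)))
```

By episode 1000 the noise scale is about 0.001, far smaller than the gaps between the values in q_row, so the pick is effectively greedy.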
Now create a simple model. The long-term reward for a given action is equal to the immediate reward from the current action plus the expected reward from the best future action taken at the following state. Our equation to update the action table will be like below:
Q(s,a) = r + γ·max(Q(s',a'))
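Here Q(s,a) is the table entry for taking action a in state s, r is the immediate reward, γ (the variable y below) discounts future reward, and s' is the next state. In the training loop this target is blended in with a learning rate lr, i.e. Q(s,a) ← Q(s,a) + lr·(r + γ·max(Q(s',a')) − Q(s,a)). A tiny worked example with made-up numbers:

```python
import numpy as np

lr, y = 0.8, 0.95        # learning rate and discount factor, as used below
Q = np.zeros((16, 4))    # 16 states x 4 actions, as in Frozen Lake

# Suppose taking action 2 in state 14 reached the goal (state 15, reward 1)
s, a, r, s1 = 14, 2, 1.0, 15
Q[s, a] = Q[s, a] + lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
print(Q[s, a])  # 0.8 = 0 + 0.8 * (1 + 0.95 * 0 - 0)
```

On later visits the same update pulls Q(14,2) further toward 1, and γ·max(Q(s',a')) propagates the goal's reward backward into earlier states.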
#Initialize table with all zeros
actions = np.zeros([env.observation_space.n, env.action_space.n])
# Set learning parameters
lr = .8
y = .95
num_episodes = 2000
#create a list to contain total rewards per episode
rList = []
for i in range(num_episodes):
    #Reset environment and get first new observation
    s = env.reset()
    rAll = 0
    d = False
    j = 0
    #The Action-Table learning algorithm
    while j < 99:
        j += 1
        #Choose an action by greedily (with noise) picking from action table
        a = basic_policy(env, actions, i, s)
        #Get new state and reward from environment
        s1, r, d, _ = env.step(a)
        #Update Action-Table with new knowledge
        actions[s, a] = actions[s, a] + lr * (r + y * np.max(actions[s1, :]) - actions[s, a])
        rAll += r
        s = s1
        if d == True:
            break
    rList.append(rAll)
Print the average reward per episode, i.e. the fraction of episodes that reached the goal:
print("Score over time: " + str(sum(rList) / num_episodes))
Output: Score over time: 0.0355
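The same loop can be verified without Gym on a tiny hand-made environment. Below is a self-contained sketch (the corridor environment is invented for illustration): a deterministic line of 5 cells where the agent must walk right to reach a goal, trained with the identical noisy-greedy policy and table update:

```python
import numpy as np

np.random.seed(42)

N_STATES, N_ACTIONS = 5, 2  # corridor of 5 cells; actions: 0=left, 1=right

def step(s, a):
    """Deterministic corridor: the last cell gives reward 1 and ends the episode."""
    s1 = max(0, s - 1) if a == 0 else min(N_STATES - 1, s + 1)
    done = (s1 == N_STATES - 1)
    return s1, (1.0 if done else 0.0), done

Q = np.zeros((N_STATES, N_ACTIONS))
lr, y = 0.8, 0.95
rewards = []
for i in range(200):
    s, total, done, j = 0, 0.0, False, 0
    while not done and j < 50:
        j += 1
        # Greedy action with decaying exploration noise, as in basic_policy
        a = int(np.argmax(Q[s, :] + np.random.randn(N_ACTIONS) * (1.0 / (i + 1))))
        s1, r, done = step(s, a)
        Q[s, a] += lr * (r + y * np.max(Q[s1, :]) - Q[s, a])
        total += r
        s = s1
    rewards.append(total)

print(sum(rewards[-50:]) / 50)  # late episodes should almost always reach the goal
```

Because this environment is deterministic and dense (the goal is only 4 steps away), the table converges quickly and the late-episode score approaches 1, unlike the slippery Frozen Lake above where the score stays low.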
In the next blog, we'll try to create the same model using a neural network.
In conclusion, reinforcement learning is a step toward general AI, and it can be applied to several real-life problems where we can create an environment in which an agent is rewarded or penalized for the actions it takes. You can find today's code here.