DAY 60 OF 100 DAYS OF ML CODE: Markov Decision Process
In the previous blog, we discussed the REINFORCE algorithm; in this one, we cover the Markov Decision Process.
Markov Chain
The Markov process is named after the Russian mathematician Andrey Markov. It is a stochastic process that satisfies the Markov property ("memorylessness"): the future state of the process can be predicted from the current state alone, without using any prior history.
A Markov chain is a type of Markov process.
A diagram representing a two-state Markov process, with the states labeled E and A. Each number represents the probability of the Markov process changing from one state to another, in the direction indicated by the arrow. For example, if the Markov process is in state A, the probability that it changes to state E is 0.4, while the probability that it remains in state A is 0.6.
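Such a chain is easy to simulate: keep a transition matrix and repeatedly sample the next state from the row of the current state. Below is a minimal sketch of the two-state chain above. The row for state A (0.6 stay, 0.4 to E) comes from the diagram; the row for state E is an assumed value for illustration, since the caption only describes state A.

```python
import numpy as np

# States: 0 = A, 1 = E. Row i holds the probabilities of moving
# from state i to each state. The E row (0.7 to A, 0.3 stay) is
# an assumption, not taken from the diagram.
P = np.array([[0.6, 0.4],
              [0.7, 0.3]])

rng = np.random.default_rng(42)
state = 0                # start in state A
chain = [state]
for _ in range(10):
    # Memorylessness: the next state depends only on the current one.
    state = rng.choice(2, p=P[state])
    chain.append(int(state))
print(chain)             # a sample trajectory over {A, E}
```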
As per Wikipedia, a Markov decision process is:
A Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker.
Wikipedia
The Markov decision process was first introduced in 1957 by Richard Bellman. It is like a Markov chain, except that the agent can perform one of several possible actions in each state, and the transition probabilities depend on the chosen action. The aim is to maximize the reward accumulated while transitioning from state to state.
Above is an example of an MDP with three states (s0, s1 and s2) and three actions (a0, a1 and a2). If the agent starts in s0, taking action a0 gives it a 70% chance of staying in s0 (earning +10 points) and a 30% chance of moving to s1. From s1, the only way to reach s2 is action a2, which succeeds with certainty but costs 50 points; from s2, action a1 brings the agent back to s0 with 80% probability and a reward of +20 points.
Bellman discovered a way to calculate the optimal state value of a state s, noted V*(s), which is the sum of all the discounted rewards the agent can expect on average after reaching state s, assuming it acts optimally. Below is the equation of the optimal state value.
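In standard notation, the Bellman optimality equation for the optimal state value is:

```latex
V^*(s) = \max_a \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma \, V^*(s') \bigr] \quad \text{for all } s
```

where T(s, a, s') is the probability of reaching state s' from state s when taking action a, R(s, a, s') is the reward for that transition, and γ is the discount factor.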
The optimal state value helps us find the maximum reward, but it does not tell the agent which action to perform.
Bellman provided another equation that answers this question. It gives the optimal state-action values, commonly called Q-values. Below is the equation of the Q-values.
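In the same notation as the state-value equation above, the Bellman optimality equation for Q-values is:

```latex
Q^*(s, a) = \sum_{s'} T(s, a, s') \bigl[ R(s, a, s') + \gamma \, \max_{a'} Q^*(s', a') \bigr] \quad \text{for all } (s, a)
```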
Once we have the optimal Q-values, we can define the optimal policy: in each state, simply pick the action with the highest Q-value.
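Formally:

```latex
\pi^*(s) = \operatorname*{argmax}_a \, Q^*(s, a)
```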
Now let’s develop a small implementation of the above MDP.
Markov Decision Process implementation
Declare the transition probabilities from the image above. If the agent is in s0 and takes action a0, the probability is 0.7 of staying in s0, 0.3 of moving to s1, and 0.0 for s2; this is the first cell, T[0][0]. If the agent is in s0 and takes action a1, the probability of staying in s0 is 1.0 and all other probabilities are 0. A None entry means the action is not available in that state.
import numpy as np

T = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
The rewards are declared below, indexed the same way as the transition probabilities:
R = [
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+20, 0, 0], [0, 0, 0]],
]
And the possible actions are listed below. In state s0 the agent can take all three actions (a0, a1 and a2), in s1 only a0 and a2, and in s2 only a1.
a = [[0, 1, 2], [0, 2], [1]]
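A quick sanity check is worthwhile here: every defined row of T must be a valid probability distribution, i.e. sum to 1. A minimal sketch:

```python
import numpy as np

# Transition probabilities as declared above; None marks actions
# that are not available in a given state.
T = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]

# Each defined (state, action) row must sum to 1.
for s, rows in enumerate(T):
    for action, probs in enumerate(rows):
        if probs is not None:
            assert np.isclose(sum(probs), 1.0), (s, action)
print("all transition rows sum to 1")
```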
Define a class to represent the MDP environment:
class MDPEnvironment(object):
    def __init__(self, start_state=0):
        self.start_state = start_state
        self.reset()

    def reset(self):
        self.total_R = 0
        self.state = self.start_state

    def step(self, action):
        # Sample the next state from the transition probabilities,
        # then collect the reward for that transition.
        next_state = np.random.choice(range(3), p=T[self.state][action])
        reward = R[self.state][action][next_state]
        self.state = next_state
        self.total_R += reward
        return self.state, reward
Develop a routine for running episodes:
def run_episode(policy, n_steps, start_state=0, display=True):
    env = MDPEnvironment(start_state)
    if display:
        print("States (+R):", end=" ")
    for step in range(n_steps):
        if display:
            if step == 10:
                print("...", end=" ")
            elif step < 10:
                print(env.state, end=" ")
        action = policy(env.state)
        state, reward = env.step(action)
        if display and step < 10:
            if reward:
                print("({})".format(reward), end=" ")
    if display:
        print("Total R =", env.total_R)
    return env.total_R
Declare a policy that selects a random action:
def random_policy(state):
    return np.random.choice(a[state])
Run 1000 episodes using the random policy:
all_totals = []
policy = random_policy
print(policy.__name__)
for episode in range(1000):
    all_totals.append(run_episode(policy, n_steps=100, display=(episode < 5)))
print("Summary: mean={:.1f}, std={:.1f}, min={}, max={}".format(
    np.mean(all_totals), np.std(all_totals), np.min(all_totals), np.max(all_totals)))
print()
Output:
random_policy
States (+R): 0 1 1 1 (-50) 2 (20) 0 0 (10) 0 (10) 0 0 ... Total R = -300
States (+R): 0 0 0 1 (-50) 2 (20) 0 0 (10) 0 0 (10) 0 (10) … Total R = -220
States (+R): 0 0 0 0 0 0 0 (10) 0 0 0 (10) … Total R = -280
States (+R): 0 0 (10) 0 (10) 0 (10) 0 0 1 (-50) 2 2 1 … Total R = -310
States (+R): 0 0 (10) 0 0 0 0 0 (10) 0 0 0 … Total R = -240
Summary: mean=-226.1, std=112.989433, min=-640, max=90
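The random policy clearly loses points on average. As a follow-up, the optimal Q-values discussed earlier can be computed directly from the T, R, and a tables by iterating the Bellman Q-value equation until it converges, then reading off the optimal policy with argmax. Below is a sketch; the discount factor gamma = 0.90 and the iteration count are assumed values, not taken from the original post.

```python
import numpy as np

# The MDP tables defined earlier in the post.
T = [
    [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
    [None, [0.8, 0.1, 0.1], None],
]
R = [
    [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
    [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
    [[0, 0, 0], [+20, 0, 0], [0, 0, 0]],
]
a = [[0, 1, 2], [0, 2], [1]]

gamma = 0.90                      # assumed discount factor
Q = np.full((3, 3), -np.inf)      # -inf marks impossible actions
for s, actions in enumerate(a):
    Q[s, actions] = 0.0           # possible actions start at 0

# Q-value iteration: repeatedly apply the Bellman optimality equation.
for iteration in range(100):
    Q_prev = Q.copy()
    for s in range(3):
        for action in a[s]:
            Q[s, action] = sum(
                T[s][action][sp] * (R[s][action][sp] + gamma * Q_prev[sp].max())
                for sp in range(3)
            )

# The optimal policy picks the highest-Q action in each state.
optimal_policy = [int(np.argmax(Q[s])) for s in range(3)]
print("Q-values:\n", Q)
print("optimal policy:", optimal_policy)
```

Unlike the random policy, this policy needs no sampling at all: it is derived purely from the transition and reward tables.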
This is the end of this blog; we’ll use this knowledge in the next one. You can find the entire code here.
#100DaysofMLCode #MDP #ReinforcementLearning Markov Decision Process Q-Learning