DAY 60 - 100 DAYS MLCODE: Markov Decision Process

January 9, 2019 | 100-Days-Of-ML-Code

In the previous blog, we discussed the REINFORCE algorithm; in this blog we'll discuss the Markov Decision Process (MDP). This will help us understand other algorithms beyond Policy Gradient, which itself tries to optimize the policy directly to maximize the reward.

Markov Chain

The Markov process is named after the Russian mathematician Andrey Markov. It is a stochastic process that satisfies the Markov property ("memorylessness"): the future state of the process can be predicted from the current state alone, without using any of the prior history.

A Markov chain is a type of Markov process.

A diagram representing a two-state Markov process, with the states labeled E and A. Each number represents the probability of the Markov process changing from one state to another state, with the direction indicated by the arrow. For example, if the Markov process is in state A, then the probability that it changes to state E is 0.4, while the probability that it remains in state A is 0.6.

Markov Process with two states
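To make the memorylessness concrete, below is a minimal simulation of a two-state chain like the one above. The row for state A (0.6 stay, 0.4 to E) comes from the figure description; the probabilities out of E are not stated in the text, so the E row used here is only an assumed example.

import numpy as np

# States: 0 = A, 1 = E
# P[i][j] = probability of moving from state i to state j.
# The A row matches the figure above; the E row is an assumed example.
P = np.array([[0.6, 0.4],
              [0.7, 0.3]])

def simulate_chain(start=0, n_steps=10, seed=42):
    # Each step depends only on the current state (the Markov property)
    rng = np.random.default_rng(seed)
    state, path = start, [start]
    for _ in range(n_steps):
        state = rng.choice(2, p=P[state])
        path.append(state)
    return path

print(simulate_chain())  # e.g. [0, 0, 1, 0, ...]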

As per Wikipedia, a Markov Decision Process is:

Markov decision process (MDP) is a discrete time stochastic control process. It provides a mathematical framework for modeling decision making in situations where outcomes are partly random and partly under the control of a decision maker. 

Wikipedia

The Markov decision process was first introduced by Richard Bellman in 1957. It is like a Markov chain, except that the agent can perform one of several possible actions and the transition probabilities depend on the chosen action. The aim is to maximize the reward collected while transitioning from one state to another.

MDP with three states

Above is an example of an MDP with three states (S0, S1 and S2) and two actions (a0 and a1). If the agent starts in state S0, it can perform action a0, which has a 50% chance of bringing it back to the same state, or action a1, which takes it to state S2 with certainty. Now consider that the agent is in state S1: if it performs action a0, there is a 70% chance that it will be rewarded with +5 points.

Bellman discovered a way to calculate the optimal state value of a state s, noted V*(s), which is the sum of all the discounted rewards the agent can expect to collect, on average, after reaching state s and acting optimally from then on. Below is the equation of the optimal state value.

Optimal state value
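In standard notation, where T(s, a, s') is the probability of reaching state s' from state s via action a, R(s, a, s') is the corresponding reward and γ is the discount factor, the equation reads:

V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, V^*(s') \right] \qquad \text{for all states } s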

The optimal state value helps us find the maximum expected reward, but it does not tell the agent which action to perform.

Bellman provided another equation which answers that question. It is called the optimal state-action value, also known as the Q-value. Below is the equation of the Q-values.

Optimal state-action value
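Using the same notation as above, the Bellman optimality equation for the Q-values reads:

Q^*(s, a) = \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma \, \max_{a'} Q^*(s', a') \right] \qquad \text{for all } (s, a)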

Once we have the optimal Q-values, we can define the optimal policy: in each state, the agent simply picks the action with the highest Q-value, as written below.
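In standard notation, that is:

\pi^*(s) = \operatorname*{argmax}_{a} Q^*(s, a)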

MDP process example

Now let's develop a small implementation of the above MDP.

Markov Decision Process implementation

Declare the transition probabilities of the above image. If the agent is in state s0 and takes action a0, the probability of going to s0 is 0.7, to s1 is 0.3 and to s2 is 0.0; this is the first cell, T[0][0]. If the agent is in s0 and the action is a1, the probability of staying in s0 is 1.0 and all other probabilities are zero, so T[0][1] = [1.0, 0.0, 0.0], and so on. A None entry means that the action is not available in that state.

import numpy as np

# T[s][a][s'] = probability of moving from state s to state s' when taking action a
# None means the action is not available in that state
T = [
        [[0.7, 0.3, 0.0], [1.0, 0.0, 0.0], [0.8, 0.2, 0.0]], 
        [[0.0, 1.0, 0.0], None, [0.0, 0.0, 1.0]],
        [None, [0.8, 0.1, 0.1], None],
    ]

The rewards follow the same indexing as the transition probabilities: R[s][a][s'] is the reward received when moving from state s to state s' via action a.

R = [
        [[+10, 0, 0], [0, 0, 0], [0, 0, 0]],
        [[0, 0, 0], [0, 0, 0], [0, 0, -50]],
        [[0, 0, 0], [+20, 0, 0], [0, 0, 0]],
    ]

And the possible actions are listed below. In state s0 we have all three actions (a0, a1 and a2), in s1 we have a0 and a2, and in s2 we have only a1.

a = [[0, 1, 2], [0, 2], [1]]

Define a class to represent the MDP environment

class MDP_Environment(object):
    def __init__(self, start_state=0):
        self.start_state = start_state
        self.reset()
    def reset(self):
        # Put the agent back in the start state with zero cumulative reward
        self.total_R = 0
        self.state = self.start_state
    def step(self, action):
        # Sample the next state from the transition probabilities and collect the reward
        next_state = np.random.choice(range(3), p=T[self.state][action])
        reward = R[self.state][action][next_state]
        self.state = next_state
        self.total_R += reward
        return self.state, reward
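As a quick sanity check, we can step the environment by hand; the exact states and rewards will vary from run to run because the transitions are random:

env = MDP_Environment()
print(env.state)              # 0, the start state
state, reward = env.step(0)   # take action a0 from s0
print(state, reward)          # e.g. 0 10 or 1 0
print(env.total_R)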

Develop a routine for running episodes:

def run_episode(policy, n_steps, start_state=0, display=True):
    env = MDP_Environment(start_state)
    if display:
        print("States (+R):", end=" ")
    for step in range(n_steps):
        if display:
            if step == 10:
                print("...", end=" ")
            elif step < 10:
                print(env.state, end=" ")
        action = policy(env.state)
        state, reward = env.step(action)
        if display and step < 10:
            if reward:
                print("({})".format(reward), end=" ")
    if display:
        print("Total R =", env.total_R)
    return env.total_R

Declare a policy that selects a random action from the actions available in the current state:

def random_policy(state):
    return np.random.choice(a[state])

Run 1000 episodes using the random policy:

all_totals = []
policy = random_policy
print(policy.__name__)
for episode in range(1000):
    all_totals.append(run_episode(policy, n_steps=100, display=(episode<5)))
print("Summary: mean={:.1f}, std={:1f}, min={}, max={}".format(np.mean(all_totals), np.std(all_totals), np.min(all_totals), np.max(all_totals)))
print()

Output:
random_policy
States (+R): 0 1 1 1 (-50) 2 (20) 0 0 (10) 0 (10) 0 0 ... Total R = -300
States (+R): 0 0 0 1 (-50) 2 (20) 0 0 (10) 0 0 (10) 0 (10) ... Total R = -220
States (+R): 0 0 0 0 0 0 0 (10) 0 0 0 (10) ... Total R = -280
States (+R): 0 0 (10) 0 (10) 0 (10) 0 0 1 (-50) 2 2 1 ... Total R = -310
States (+R): 0 0 (10) 0 0 0 0 0 (10) 0 0 0 ... Total R = -240
Summary: mean=-226.1, std=112.989433, min=-640, max=90
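As a closing illustration of the Q-value equation above, here is a minimal Q-Value Iteration sketch that reuses the T, R and a structures defined earlier to compute the optimal Q-values and read off the greedy policy. The discount factor of 0.95 and the 50 iterations are arbitrary choices for this sketch.

gamma = 0.95                 # discount factor (arbitrary choice for this sketch)
n_states, n_actions = 3, 3

# Initialize the Q-values; impossible actions get -inf so argmax never picks them
Q = np.full((n_states, n_actions), -np.inf)
for s, actions in enumerate(a):
    Q[s, actions] = 0.0

# Q-Value Iteration: repeatedly apply the Bellman optimality equation for Q
for _ in range(50):
    Q_prev = Q.copy()
    for s in range(n_states):
        for action in a[s]:
            Q[s, action] = sum(
                T[s][action][sp] * (R[s][action][sp] + gamma * Q_prev[sp].max())
                for sp in range(n_states)
            )

print(Q)
print(np.argmax(Q, axis=1))  # greedy policy: best action index for each state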

This is the end of this blog; we'll use this knowledge in the next one. You can find the entire code here.