DAY 61-100 DAYS MLCODE: Multi-Armed Bandit Problem

January 10, 2019 100-Days-Of-ML-Code blog

In the previous blogs, we discussed Reinforcement Learning. In this blog, we will try to solve the Multi-Armed Bandit Problem using a simple reinforcement learning technique.

The Multi-Armed Bandit problem is a classic RL problem: it is about allocating resources so as to maximize the gain. Our goal for the multi-armed bandit problem is to find a strategy that selects the “best” bandit out of a group of bandits.

Let’s consider a 10-armed bandit, like 10 slot machines, where each arm has a probability of failure: the lower the failure probability, the higher the expected reward from that bandit.

import numpy as np
import tensorflow as tf  # TensorFlow 1.x API is used throughout

bandit_probs = [0.20, 0.40, 0.30, 0.10, 0.50,
                0.70, 0.80, 0.35, 0.75, 0.65]  # bandit probabilities of failure

In the above case, the fourth bandit (index 3) has the lowest failure probability (0.10), which means the highest success rate.
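
As a quick sanity check (not something the agent gets to see, just for our reference), we can ask NumPy which arm has the lowest failure probability:

best_arm = int(np.argmin(bandit_probs))  # index of the lowest failure probability
print(best_arm)                          # 3, i.e. the 4th bandit
print(bandit_probs[best_arm])            # 0.1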

Now let’s design our reward function: draw a random number from a standard normal distribution and compare it with the chosen bandit’s failure probability. If the draw is greater than the failure probability, the reward is 1; otherwise it is -1. The lower an arm’s failure probability, the more often the draw exceeds it, so better arms earn positive rewards more often.


def pull_bandit(action):
    # Draw a random number from a standard normal distribution.
    rndm_prb = np.random.randn(1)
    # If the draw is greater than the bandit's failure probability,
    if rndm_prb > bandit_probs[action]:
        # return a positive reward.
        return 1
    else:
        # return a negative reward.
        return -1
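
To see that the reward function behaves as intended, we can (purely as a sketch, outside of the agent) pull each arm many times and compare the average rewards; the arm with the lowest failure probability should come out on top. The number of test pulls here is just an illustrative value:

n_trials = 10000  # test pulls per arm (illustrative value)
avg_rewards = [np.mean([pull_bandit(a) for _ in range(n_trials)])
               for a in range(len(bandit_probs))]
print(np.argmax(avg_rewards))  # expected to be 3, the arm with failure probability 0.1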

The next step is to construct the RL model. Let’s define the weights and the action selection for the model. Since we have 10 arms, we initialize each arm’s weight to 1, and the action is then chosen as the arm with the largest weight.

num_bandits = len(bandit_probs)
weights = tf.Variable(tf.ones([num_bandits]))
selected_action = tf.argmax(weights, 0)

Declare the placeholders for the reward and the action:

reward_y = tf.placeholder(shape=[1],dtype=tf.float32)
action_X = tf.placeholder(shape=[1],dtype=tf.int32)

Now let’s feed the reward and the chosen action into the network to compute the loss, and use that loss to update the network. The loss is the standard policy-gradient loss: the negative log of the chosen arm’s weight, scaled by the received reward. A positive reward therefore pushes the chosen weight up, while a negative reward pushes it down.

# We feed the reward and chosen action into the network
# to compute the loss, and use it to update the network.
learning_rate = 0.001  # learning rate for gradient descent (value assumed, not given in the post)
responsible_weight = tf.slice(weights, action_X, [1])
loss = -(tf.log(responsible_weight) * reward_y)
optimizer = tf.train.GradientDescentOptimizer(learning_rate=learning_rate)
# The training operation minimizes the loss
training_op = optimizer.minimize(loss)
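
The training loop below also uses a few names that are not defined in the snippets above. A minimal setup could look like the following; the exploration rate e is an assumed value, and the episode count is inferred from the 20 progress lines (one every 100 episodes) in the output further down:

e = 0.1                                   # probability of taking a random (exploratory) action -- assumed value
episode_no = 2000                         # 20 progress prints x 100 episodes
total_reward = np.zeros(num_bandits)      # running reward tally per arm
init = tf.global_variables_initializer()  # op that initializes the graph variables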

Let’s train our model:

# Let's train the model
with tf.Session() as sess:
    sess.run(init)
    i = 0
    while i < episode_no:

        # Choose either a random action or one from our network.
        if np.random.rand(1) < e:
            action = np.random.randint(num_bandits)
        else:
            action = sess.run(selected_action)

        # Get our reward from picking one of the bandits.
        reward = pull_bandit(action)

        # Update the network.
        _, resp, ww = sess.run([training_op, responsible_weight, weights],
                               feed_dict={reward_y: [reward], action_X: [action]})

        # Update our running tally of scores.
        total_reward[action] += reward
        if i % 100 == 0:
            print("Running reward for the " + str(num_bandits) + " bandits: " + str(total_reward))
        i += 1
print("Best bandit arm as per Agent is :" + str(np.argmax(ww)+1) )
if np.argmax(ww) == np.argmax(-np.array(bandit_probs)):
    print ("Agent has successfully detected the best Bandit Arm!")
else:
    print ("Oops! wrong suggestion")

Output:
Running reward for the 10 bandits: [2. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
Running reward for the 10 bandits: [-1. -4. -4. -4. -4. -4. -4. -4. -3. -3.]
Running reward for the 10 bandits: [-2. -6. -5. -5. -6. -5. -5. -5. -5. -5.]
Running reward for the 10 bandits: [-7. -6. -5. -7. -7. -6. -7. -6. -6. -6.]
Running reward for the 10 bandits: [-11. -10. -10. -10. -10. -9. -9. -8. -9. -9.]
Running reward for the 10 bandits: [-11. -12. -12. -12. -12. -12. -13. -12. -12. -13.]
Running reward for the 10 bandits: [-15. -14. -15. -15. -15. -15. -14. -15. -15. -14.]
Running reward for the 10 bandits: [-21. -20. -19. -21. -20. -20. -20. -20. -20. -20.]
Running reward for the 10 bandits: [-24. -23. -23. -22. -23. -23. -23. -24. -23. -23.]
Running reward for the 10 bandits: [-27. -26. -27. -28. -26. -26. -26. -25. -26. -26.]
Running reward for the 10 bandits: [-32. -30. -28. -31. -30. -30. -30. -30. -30. -30.]
Running reward for the 10 bandits: [-33. -33. -32. -33. -32. -32. -32. -32. -32. -34.]
Running reward for the 10 bandits: [-38. -37. -37. -37. -36. -37. -36. -37. -36. -36.]
Running reward for the 10 bandits: [-42. -40. -40. -41. -40. -40. -40. -40. -38. -40.]
Running reward for the 10 bandits: [-41. -44. -45. -45. -43. -45. -44. -43. -43. -44.]
Running reward for the 10 bandits: [-47. -46. -46. -40. -46. -45. -45. -46. -45. -45.]
Running reward for the 10 bandits: [-47. -46. -47. -49. -46. -46. -47. -46. -45. -46.]
Running reward for the 10 bandits: [-53. -50. -46. -51. -49. -49. -48. -50. -49. -48.]
Running reward for the 10 bandits: [-62. -53. -54. -57. -54. -57. -54. -53. -52. -51.]
Running reward for the 10 bandits: [-62. -53. -54. -67. -54. -57. -53. -52. -52. -53.]
Best bandit arm as per Agent is :4
Agent has successfully detected the best Bandit Arm!
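
As an aside, the same policy-gradient update that the TensorFlow graph performs can be written in a few lines of plain NumPy. The sketch below is not the original code: the hyperparameter values are assumed, and it simply applies the gradient-descent step on -log(w) * reward by hand, reusing bandit_probs, num_bandits and pull_bandit from above.

weights_np = np.ones(num_bandits)      # one weight per arm, initialised to 1
totals = np.zeros(num_bandits)         # running reward tally
lr, eps, episodes = 0.001, 0.1, 2000   # assumed hyperparameters

for _ in range(episodes):
    # epsilon-greedy action selection
    if np.random.rand() < eps:
        action = np.random.randint(num_bandits)
    else:
        action = int(np.argmax(weights_np))
    reward = pull_bandit(action)
    # gradient descent on -log(w_a) * r is equivalent to w_a += lr * r / w_a
    weights_np[action] += lr * reward / weights_np[action]
    totals[action] += reward

print("Best bandit arm as per the NumPy agent:", np.argmax(weights_np) + 1)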

This is a very simple example of reinforcement learning using a simple policy-based agent. You can find the entire code here.