DAY 20-100 DAYS MLCODE: Decision Tree part 4

November 29, 2018 | 100-Days-Of-ML-Code

This is a continuation of our previous blog about training decision trees on the moons dataset. In this blog, we'll train models on much smaller subsets of the training data. In the previous blog we created 100,000 training instances of the moons dataset; now we are going to draw 1,000 random subsets of 100 instances each from the training set and train one tree per subset.
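For context, X_train, y_train, X_test and y_test below come from the previous post. Roughly, the dataset was generated and split like this (a minimal sketch; the noise level and split ratio are assumptions, not the exact values used there):

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

# Assumed parameters -- the previous post may have used different noise/test_size values.
X, y = make_moons(n_samples=100000, noise=0.4, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)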

Let's start by drawing the 1,000 subsets using Scikit-Learn's ShuffleSplit class:

from sklearn.model_selection import ShuffleSplit

trees_no = 1000      # number of subsets (and trees)
instances_no = 100   # training instances per subset

sub_sets = []

# Each split keeps instances_no samples as the "train" part; the rest is ignored.
shuffle = ShuffleSplit(n_splits=trees_no, test_size=len(X_train) - instances_no, random_state=42)
for train_index, test_index in shuffle.split(X_train):
    X_mini_train = X_train[train_index]
    y_mini_train = y_train[train_index]
    sub_sets.append((X_mini_train, y_mini_train))

The above code prepares the 1,000 subsets of training instances.
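We'll reuse grid_search_cv.best_estimator_ from yesterday's post. As a reminder, it was obtained roughly like this (a sketch; the exact parameter grid may differ from yesterday's blog):

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Assumed parameter grid -- see yesterday's post for the actual values.
params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(DecisionTreeClassifier(random_state=42), params, cv=3)
grid_search_cv.fit(X_train, y_train)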

Let's now train a tree on each subset using the best hyperparameter values found in yesterday's blog. Once the trees are trained, we measure the accuracy of each one on the test set.

import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score

# One copy of the best estimator per subset.
forest = [clone(grid_search_cv.best_estimator_) for _ in range(trees_no)]

accuracy_scores = []

for tree, (X_mini_train, y_mini_train) in zip(forest, sub_sets):
    tree.fit(X_mini_train, y_mini_train)
    y_pred = tree.predict(X_test)
    accuracy_scores.append(accuracy_score(y_test, y_pred))

np.mean(accuracy_scores)

Output: 0.8297994999999999

This is definitely lower than the accuracy of our previous decision tree classifier, since each tree sees only 100 training instances.

Instead of relying on the individual trees, let's collect the predictions of all 1,000 trees for each instance of the test set. Once we have all the predictions, we keep the most frequent one for each instance. This is majority voting: whichever predicted class receives the most votes from the trees (in our case, 1,000 of them) wins.

Let’s prepare the prediction list.

# Row i holds tree i's predictions for the whole test set.
Y_pred = np.empty([trees_no, len(X_test)], dtype=np.uint8)

for tree_index, tree in enumerate(forest):
    Y_pred[tree_index] = tree.predict(X_test)

Now that we have all the predictions, let's get the most frequent values using the mode() function from SciPy's stats module.

from scipy.stats import mode

# For each test instance (column), take the most common prediction and its vote count.
y_pred_max_votes, votes_no = mode(Y_pred, axis=0)

Accuracy

Now let's calculate the accuracy of the ensemble. Note that mode() returns a 2-D array of shape (1, len(X_test)), so we flatten it before scoring.

accuracy_score(y_test, y_pred_max_votes.reshape([-1]))

Output: 0.901

This is slightly better than yesterday's model. The way we divided the training set, trained many trees, and aggregated their predictions by voting is nothing other than the Random Forest algorithm: we have effectively trained our model using the Random Forest technique. In the next blog we'll discuss Random Forests in more detail.
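As a preview (a minimal sketch with assumed hyperparameters), Scikit-Learn bundles this whole bag-and-vote procedure into the RandomForestClassifier class:

from sklearn.ensemble import RandomForestClassifier

# Hyperparameters here are illustrative; the next blog will cover how to choose them.
rnd_clf = RandomForestClassifier(n_estimators=1000, max_leaf_nodes=16, n_jobs=-1, random_state=42)
rnd_clf.fit(X_train, y_train)
accuracy_score(y_test, rnd_clf.predict(X_test))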

In conclusion, if you split the training set into many small subsets, train a tree on each subset, and use the most-voted prediction as your final prediction, you get the technique known as Random Forest. You can find the entire code of yesterday's and today's blog here.