Day 19 - 100 Days of ML Code: Decision Tree Classifier

In the previous blog, we covered decision trees. In this blog, we will build a decision tree classifier with scikit-learn and use grid search to find the best values for its hyperparameters.

Let's create the moons dataset using the make_moons function of scikit-learn.

from sklearn.datasets import make_moons  #Import the dataset generator
X, y = make_moons(n_samples=10000, noise=0.35, random_state=42)  #Generate 10,000 data instances

Once the data is loaded, let's split it into training and test sets.

from sklearn.model_selection import train_test_split  #Import train_test_split to divide the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 42)  #Split the data

Print the sizes of the training and testing sets.

print(f"Training data: {len(X_train)}, Training labels: {len(y_train)}, Testing data: {len(X_test)}, Testing labels: {len(y_test)}")

Training data: 8000, Training labels: 8000, Testing data: 2000, Testing labels: 2000

Visualize the data

Let's plot the training data against the training labels. The x-axis is the first feature and the y-axis is the second feature.

import matplotlib.pyplot as plt
plt.title("Moon Dataset")
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train,
            edgecolors='k')  #Colour each point by its class label
plt.show()
[Figure: scatter plot of the moons dataset]

Let's train the decision tree classifier.

from sklearn.tree import DecisionTreeClassifier  #Import the Decision tree classifier class
dtree_clf = DecisionTreeClassifier()      #create model
dtree_clf.fit(X_train, y_train)        #Train the model

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
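Out of curiosity, we can check how complex this unconstrained tree has grown. The snippet below is a small sketch, not part of the original post, that reads the fitted tree_ attribute; the exact numbers will vary between runs since no random_state was set on the classifier.

print("Tree depth:", dtree_clf.tree_.max_depth)        #Depth of the fully grown tree (varies per run)
print("Number of nodes:", dtree_clf.tree_.node_count)  #Total number of nodes, internal plus leaves

An unconstrained tree on noisy data like this typically grows very deep, which is one reason to regularize it with grid search later on.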

Let’s plot the decision boundary of the classifier

import numpy as np
from matplotlib.colors import ListedColormap
x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.02),
                         np.arange(y_min, y_max, 0.02))
figure = plt.figure(figsize=(10, 10))
cm = plt.cm.RdBu
cm_bright = ListedColormap(['#FF0000', '#0000FF'])

# Plot the training points
plt.scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap=cm_bright,
           edgecolors='k')
# Plot the testing points
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap=cm_bright, alpha=0.6,
           edgecolors='k')
Z = dtree_clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]  #Class-1 probability over the grid
Z = Z.reshape(xx.shape)
plt.contourf(xx, yy, Z, cmap=cm, alpha=.2)  #Shade the decision regions
plt.show()
[Figure: decision boundary of the decision tree classifier]

Use grid search to find the best hyperparameters with the GridSearchCV class of scikit-learn. Let's search max_leaf_nodes from 2 to 100 and min_samples_split over [2, 3, 4].

from sklearn.model_selection import GridSearchCV

params = {'max_leaf_nodes': list(range(2, 100)), 'min_samples_split': [2, 3, 4]}
grid_search_cv = GridSearchCV(dtree_clf, params, n_jobs=-1, verbose=1)

grid_search_cv.fit(X_train, y_train)
Fitting 3 folds for each of 294 candidates, totalling 882 fits
[Parallel(n_jobs=-1)]: Done 882 out of 882 | elapsed:    6.6s finished
GridSearchCV(cv=None, error_score='raise',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99], 'min_samples_split': [2, 3, 4]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=1)

Let's look at the best estimator found by the grid search:

grid_search_cv.best_estimator_

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None, max_features=None, max_leaf_nodes=22, min_impurity_decrease=0.0, min_impurity_split=None, min_samples_leaf=1, min_samples_split=2, min_weight_fraction_leaf=0.0, presort=False, random_state=None, splitter='best')
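We can also read off just the winning parameter values and the mean cross-validated score. A quick sketch using the standard GridSearchCV attributes:

print(grid_search_cv.best_params_)  #e.g. {'max_leaf_nodes': 22, 'min_samples_split': 2}
print(grid_search_cv.best_score_)   #Mean cross-validated accuracy of the best setting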

Since GridSearchCV automatically retrains (refits) the model on the whole training set with the best hyperparameter setting, we can use it directly to measure accuracy. If you want to stop this automatic retraining, pass refit=False.
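For example, here is a minimal sketch of the refit=False variant, reusing the same params dictionary from above (grid_no_refit and final_clf are illustrative names, not from the original post). In that case GridSearchCV has no predict method, so the final model must be trained by hand:

grid_no_refit = GridSearchCV(DecisionTreeClassifier(), params, n_jobs=-1, verbose=1, refit=False)
grid_no_refit.fit(X_train, y_train)
#With refit=False there is no best_estimator_ and no predict(), only the search results
final_clf = DecisionTreeClassifier(**grid_no_refit.best_params_)  #Rebuild with the winning parameters
final_clf.fit(X_train, y_train)  #Train the final model ourselves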

Accuracy

Let's measure the performance of our model.

from sklearn.metrics import accuracy_score  #Import library
y_pred = grid_search_cv.predict(X_test)   #Run the prediction with the best estimator
accuracy_score(y_test, y_pred)     #Measure accuracy

Output: 0.895

Our model predicts with an accuracy of 89.5%, which does not look bad.
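As a quick sanity check, not in the original post, we could compare this against the unregularized tree trained earlier on the same test set; the tuned tree usually scores a bit higher because the unconstrained tree overfits the noisy moons data.

baseline_pred = dtree_clf.predict(X_test)  #The unregularized tree fitted earlier
print("Untuned tree accuracy:", accuracy_score(y_test, baseline_pred))
print("Tuned tree accuracy:", accuracy_score(y_test, y_pred))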

In conclusion, this blog showed a simple implementation of a decision tree classifier using scikit-learn. You can find the code here.