Saturday, November 28, 2020

Visualizing a Decision Tree in Python (with code included)

List of libraries required to be installed (if not already installed). Here installation is done through a Jupyter Notebook; from a terminal, use just "pip install library-name".

# import sys

# !{sys.executable} -m pip install numpy
# !{sys.executable} -m pip install pandas
# !{sys.executable} -m pip install scikit-learn
# !{sys.executable} -m pip install hvplot
# !{sys.executable} -m pip install six
# !{sys.executable} -m pip install pydotplus
# !{sys.executable} -m pip install graphviz

Importing libraries:

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import hvplot.pandas

About the dataset

Imagine that you are a medical researcher compiling data for a study. You have collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of five medications: Drug A, Drug B, Drug C, Drug X, or Drug Y.

Part of your job is to build a model to find out which drug might be appropriate for a future patient with the same illness. The features of this dataset are the Age, Sex, Blood Pressure, and Cholesterol of the patients, and the target is the drug that each patient responded to.

It is a sample of a multiclass classifier, and you can use the training part of the dataset to build a decision tree, and then use it to predict the class of an unknown patient, or to prescribe a drug to a new patient.

Let's see a sample of the dataset:

The shape of the dataset is (200, 6).
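A minimal sketch of loading and inspecting the data (the file name Drug.csv is taken from the pre-processing section below; adjust the path to wherever your copy lives):

my_data = pd.read_csv("Drug.csv")
print(my_data.head())    # first five rows of the dataset
print(my_data.shape)     # (200, 6)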

Pre-processing

Using my_data as the Drug.csv data read by pandas, declare the following variables:

  • X as the Feature Matrix (the data of my_data)
  • y as the response vector (the target)
  • Remove the column containing the target name, since it doesn't contain numeric values.

As you may figure out, some features in this dataset are categorical, such as Sex or BP. Unfortunately, sklearn decision trees do not handle categorical variables directly. But we can still convert these features to numerical values: pandas.get_dummies() converts categorical variables into dummy/indicator variables.

Now we can fill the target variable.
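A sketch of these steps, following the common approach for this dataset (the column names, including Na_to_K for the sixth column, are assumptions; sklearn's LabelEncoder is used here, though pandas.get_dummies() mentioned above would work too):

from sklearn import preprocessing

# Feature matrix: every column except the target "Drug"
X = my_data[["Age", "Sex", "BP", "Cholesterol", "Na_to_K"]].values

# Encode each categorical column as integers
le_sex = preprocessing.LabelEncoder()
X[:, 1] = le_sex.fit_transform(X[:, 1])     # e.g. F/M -> 0/1

le_BP = preprocessing.LabelEncoder()
X[:, 2] = le_BP.fit_transform(X[:, 2])      # e.g. HIGH/LOW/NORMAL -> 0/1/2

le_Chol = preprocessing.LabelEncoder()
X[:, 3] = le_Chol.fit_transform(X[:, 3])    # e.g. HIGH/NORMAL -> 0/1

# Fill the target (response) vector
y = my_data["Drug"]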

Setting up the Decision Tree


We will be using train/test split on our decision tree. Let's import train_test_split from sklearn.model_selection.


from sklearn.model_selection import train_test_split


Now train_test_split will return four different outputs. We will name them:
X_trainset, X_testset, y_trainset, y_testset

The train_test_split will need the parameters:
X, y, test_size=0.3, and random_state=3.

The X and y are the arrays required before the split, the test_size represents the ratio of the testing dataset, and the random_state ensures that we obtain the same splits.
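A sketch of the split with those parameters:

X_trainset, X_testset, y_trainset, y_testset = train_test_split(X, y, test_size=0.3, random_state=3)

# 70% of the 200 rows go to training, 30% to testing
print(X_trainset.shape, y_trainset.shape)   # (140, 5) (140,)
print(X_testset.shape, y_testset.shape)     # (60, 5) (60,)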

Modeling


We will first create an instance of the DecisionTreeClassifier called drugTree.
Inside the classifier, specify criterion="entropy" so that we can see the information gain at each node.

We can use the "gini" criterion as well; the result will be the same here. Gini is actually the default criterion for the decision tree classifier.
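A sketch of this step (the predict-and-score lines are an assumption based on the standard sklearn workflow; max_depth=4 anticipates the choice explained below):

drugTree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drugTree.fit(X_trainset, y_trainset)

# Predict on the test set and measure accuracy
from sklearn import metrics
predTree = drugTree.predict(X_testset)
print("DecisionTree's Accuracy:", metrics.accuracy_score(y_testset, predTree))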

From the graph below, we can see that the accuracy score is highest at max_depth=4 and remains constant thereafter, so we have used max_depth=4 in this case.
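A sketch of how the scores behind this graph can be computed (the depth range 1 to 10 is an assumption):

from sklearn import metrics

depths = range(1, 11)
test_scores, train_scores = [], []
for d in depths:
    model = DecisionTreeClassifier(criterion="entropy", max_depth=d)
    model.fit(X_trainset, y_trainset)
    test_scores.append(metrics.accuracy_score(y_testset, model.predict(X_testset)))
    train_scores.append(metrics.accuracy_score(y_trainset, model.predict(X_trainset)))

# Collect the scores in a DataFrame for plotting below
scores = pd.DataFrame({"max_depth": depths, "testset": test_scores, "trainset": train_scores})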

We can also plot max_depth vs. accuracy score for both the testset and the trainset:

For testset:
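Since hvplot.pandas was imported at the top, a plotting sketch using the scores DataFrame built above:

scores.hvplot.line(x="max_depth", y="testset", ylabel="accuracy score")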

For trainset:
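And the same for the training scores:

scores.hvplot.line(x="max_depth", y="trainset", ylabel="accuracy score")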

Now, plotting the graph of max_depth vs. accuracy score for both the trainset and the testset:
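Passing both columns overlays the two curves in one plot:

scores.hvplot.line(x="max_depth", y=["testset", "trainset"], ylabel="accuracy score")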

Visualization
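The post installs six, pydotplus, and graphviz, which suggests the classic export_graphviz + pydotplus route; a sketch along those lines (featureNames and the output file name drugtree.png are assumptions, and the Graphviz system package must be installed for the rendering step):

from six import StringIO
import pydotplus
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

featureNames = my_data.columns[0:5]   # all columns except the target
dot_data = StringIO()
tree.export_graphviz(drugTree,
                     feature_names=featureNames,
                     class_names=np.unique(y_trainset),
                     out_file=dot_data,
                     filled=True,
                     special_characters=True)

# Render the dot source to a PNG and display it
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png("drugtree.png")

img = mpimg.imread("drugtree.png")
plt.figure(figsize=(20, 15))
plt.imshow(img, interpolation="nearest")
plt.axis("off")
plt.show()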

The final image of the decision tree is given below:
[Figure: the rendered decision tree]
Another easy method to generate the decision tree:
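One such method is sklearn's built-in tree.plot_tree (whether this is the method the post used is an assumption):

import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))
tree.plot_tree(drugTree,
               feature_names=list(featureNames),
               class_names=list(np.unique(y_trainset)),
               filled=True)
plt.show()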

Drawing the decision path (a more readable form of the decision tree):
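A sketch using sklearn's export_text, which prints the tree as nested if/else rules (assuming this is the kind of decision path meant here):

from sklearn.tree import export_text

rules = export_text(drugTree, feature_names=list(featureNames))
print(rules)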

Important features for classification

Listing the features ranked by their importance for classifying the data.
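A sketch using the classifier's feature_importances_ attribute:

importances = pd.Series(drugTree.feature_importances_, index=featureNames)
print(importances.sort_values(ascending=False))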

The GitHub link to the program can be found here.
