First of all: Hello everyone!
My name is Moein Verkiani, but you're going to know me as Moligarch (or, if you're Persian, as Kian!). In this file we're going to do some hands-on ML exercises on the Iris dataset. By the end of these exercises you should also have a feel for what overfitting is and how to avoid it.
Now, let's prepare our environment for further operations:
- Import libraries
- Modify environment variable
- Define dataset
#prepare environment
%reset -f
import os
# Thread-count variables must be set before NumPy/MKL are imported,
# otherwise they have no effect.
os.environ["MKL_NUM_THREADS"] = "1"
os.environ["NUMEXPR_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
from sklearn import datasets, tree, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.preprocessing import StandardScaler, label_binarize
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report, roc_curve, auc
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold, LeaveOneOut, LeavePOut
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multiclass import OneVsRestClassifier
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn
import io
from itertools import cycle
iris_data = pd.read_csv('iris.csv')
iris = datasets.load_iris()
Xsk = iris['data']
ysk = iris['target']
print(type(Xsk), Xsk.shape, type(ysk), ysk.shape)
<class 'numpy.ndarray'> (150, 4) <class 'numpy.ndarray'> (150,)
In this part we're going to compute some basic statistics in order to understand our dataset better.
The Iris dataset is a collection of 4 features (petal/sepal length and width) and 1 target with 3 species:
- Setosa
- Versicolor
- Virginica

The statistics mentioned above are more meaningful if we compute them for each species separately.
#Calculate feature means, for each kind of flower
# load_iris returns samples grouped by species:
# rows 0-49 are Setosa, 50-99 Versicolor, 100-149 Virginica.
X_arr = np.asarray(Xsk)
setosa_mean = [np.mean(X_arr[:50, i]) for i in range(4)]
versicolor_mean = [np.mean(X_arr[50:100, i]) for i in range(4)]
virginica_mean = [np.mean(X_arr[100:150, i]) for i in range(4)]
species = {'setosa': setosa_mean, 'versicolor': versicolor_mean, 'virginica': virginica_mean}
Xmean_df = pd.DataFrame(species, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Mean\n', Xmean_df)
Features Mean
setosa versicolor virginica
sepal length 5.006 5.936 6.588
sepal width 3.428 2.770 2.974
petal length 1.462 4.260 5.552
petal width 0.246 1.326 2.026
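The positional slicing above works because load_iris keeps the samples grouped by species. As a cross-check, the same statistics can be computed from the CSV copy with a pandas groupby; a minimal sketch (note that pandas std/var use ddof=1, so the values differ slightly from the np.std/np.var tables, which default to ddof=0):
# Cross-check: group the CSV copy by species and aggregate all three statistics.
# pandas uses the sample (ddof=1) std/var, NumPy above used the population (ddof=0) form.
cols = ['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
species_stats = iris_data.groupby('variety')[cols].agg(['mean', 'std', 'var'])
print(species_stats)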
#Calculate Features Standard Deviation
setosa_std = [np.std(X_arr[:50, i]) for i in range(4)]
versicolor_std = [np.std(X_arr[50:100, i]) for i in range(4)]
virginica_std = [np.std(X_arr[100:150, i]) for i in range(4)]
X_std=[np.std(X_arr[:150, i]) for i in range(4)]
categ = {'Total':X_std, 'setosa': setosa_std, 'versicolor': versicolor_std, 'virginica': virginica_std}
Xstd_df=pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Standard Deviation\n',Xstd_df)
Features Standard Deviation
Total setosa versicolor virginica
sepal length 0.825301 0.348947 0.510983 0.629489
sepal width 0.434411 0.375255 0.310644 0.319255
petal length 1.759404 0.171919 0.465188 0.546348
petal width 0.759693 0.104326 0.195765 0.271890
#Calculate Features Variance
setosa_var = [np.var(X_arr[:50, i]) for i in range(4)]
versicolor_var = [np.var(X_arr[50:100, i]) for i in range(4)]
virginica_var = [np.var(X_arr[100:150, i]) for i in range(4)]
X_var=[np.var(X_arr[:150, i]) for i in range(4)]
categ = {'Total':X_var, 'setosa': setosa_var, 'versicolor': versicolor_var, 'virginica': virginica_var}
Xvar_df=pd.DataFrame(categ, index=['sepal length', 'sepal width', 'petal length', 'petal width'])
print('Features Variance\n',Xvar_df)
Features Variance
Total setosa versicolor virginica
sepal length 0.681122 0.121764 0.261104 0.396256
sepal width 0.188713 0.140816 0.096500 0.101924
petal length 3.095503 0.029556 0.216400 0.298496
petal width 0.577133 0.010884 0.038324 0.073924
When your data has different values, and even different measurement units, it can be difficult to compare them. What is kilograms compared to meters? Or altitude compared to time?
The answer to this problem is scaling. We can scale data into new values that are easier to compare.
The standardization method uses this formula:
z = (x - u) / s
where z is the new value, x is the original value, u is the mean, and s is the standard deviation.
scikit-learn does all of this with a single command:
scale = StandardScaler()
scaledX = scale.fit_transform(Xsk)
print(scaledX,scaledX.shape)
[[-9.00681170e-01  1.01900435e+00 -1.34022653e+00 -1.31544430e+00]
 [-1.14301691e+00 -1.31979479e-01 -1.34022653e+00 -1.31544430e+00]
 [-1.38535265e+00  3.28414053e-01 -1.39706395e+00 -1.31544430e+00]
 ...
 [ 4.32165405e-01  7.88807586e-01  9.33270550e-01  1.44883158e+00]
 [ 6.86617933e-02 -1.31979479e-01  7.62758269e-01  7.90670654e-01]] (150, 4)
(output truncated: 150 rows of standardized features)
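To see what StandardScaler does under the hood, we can apply the z-score formula manually with NumPy and check that it matches; a small sanity-check sketch (np.std defaults to the population standard deviation, which is also what StandardScaler uses):
# Manual z-score: subtract the per-column mean, divide by the
# per-column (population, ddof=0) standard deviation.
manual_scaled = (Xsk - Xsk.mean(axis=0)) / Xsk.std(axis=0)
print(np.allclose(manual_scaled, scaledX))  # expected: True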
print(type(iris_data))
<class 'pandas.core.frame.DataFrame'>
If you're trying to discuss or illustrate something to your colleagues, co-workers, managers, etc., you need to SHOW them what you mean! So although we know data talks everywhere, without data visualization you're only using 30-40% of your data's potential. It also helps you understand the relations within your data better (though not in every case, I believe!).
So let's dig deeper.
# set up a figure twice as wide as it is tall
fig = plt.figure(figsize=(12,6))
# =============
# First subplot
# =============
# set up the axes for the first plot
ax = fig.add_subplot(1, 2, 1, projection='3d')
x1 = Xsk[:,0]
x2 = Xsk[:,1]
ax.scatter(x1, x2, ysk, marker='o')
ax.set_xlabel('Sepal L')
ax.set_ylabel('Sepal W')
ax.set_zlabel('Category')
# ==============
# Second subplot
# ==============
# set up the axes for the second plot
ax = fig.add_subplot(1, 2, 2, projection='3d')
x3 = Xsk[:,2]
x4 = Xsk[:,3]
ax.scatter(x3, x4, ysk, marker='x')
ax.set_xlabel('Petal L')
ax.set_ylabel('Petal W')
ax.set_zlabel('Category')
plt.show()
#compare any feature with respect to all features
sn.pairplot(iris_data)
plt.hist(ysk, 25)
plt.title("Data Distribution")
plt.xlabel("Class")
plt.ylabel("Frequency")
plt.show()
sepal.length | sepal.width | petal.length | petal.width | variety |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
... | ... | ... | ... | ... | ... |
147 | 6.5 | 3.0 | 5.2 | 2.0 | Virginica |
148 | 6.2 | 3.4 | 5.4 | 2.3 | Virginica |
149 | 5.9 | 3.0 | 5.1 | 1.8 | Virginica |
iris_data.head()
sepal.length | sepal.width | petal.length | petal.width | variety | |
---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | Setosa |
1 | 4.9 | 3.0 | 1.4 | 0.2 | Setosa |
2 | 4.7 | 3.2 | 1.3 | 0.2 | Setosa |
3 | 4.6 | 3.1 | 1.5 | 0.2 | Setosa |
4 | 5.0 | 3.6 | 1.4 | 0.2 | Setosa |
iris_data.describe()
sepal.length | sepal.width | petal.length | petal.width | |
---|---|---|---|---|
count | 150.000000 | 150.000000 | 150.000000 | 150.000000 |
mean | 5.843333 | 3.057333 | 3.758000 | 1.199333 |
std | 0.828066 | 0.435866 | 1.765298 | 0.762238 |
min | 4.300000 | 2.000000 | 1.000000 | 0.100000 |
25% | 5.100000 | 2.800000 | 1.600000 | 0.300000 |
50% | 5.800000 | 3.000000 | 4.350000 | 1.300000 |
75% | 6.400000 | 3.300000 | 5.100000 | 1.800000 |
max | 7.900000 | 4.400000 | 6.900000 | 2.500000 |
iris_data.shape
(150, 5)
Classification in machine learning is the process of recognizing, understanding, and grouping objects and ideas into preset categories. It requires machine learning algorithms that learn how to assign a class label to examples from the problem domain. There are many different types of classification tasks you may encounter in machine learning, and specialized approaches to modeling may be used for each.
A decision tree is a flowchart-like model that can help you make decisions based on previous experience. We will use a confusion matrix to evaluate the accuracy of our model.
d = {'Setosa': 0, 'Versicolor': 1, 'Virginica': 2}
features=['sepal.length', 'sepal.width', 'petal.length', 'petal.width']
Xtree = iris_data[features]
ytree = iris_data['variety'].map(d)
# Left-align the DataFrame display (affects the notebook's rich display only,
# not the plain print below).
dfStyler = iris_data.style.set_properties(**{'text-align': 'left'})
dfStyler.set_table_styles([dict(selector='th', props=[('text-align', 'left')])])
print(iris_data)
sepal.length sepal.width petal.length petal.width variety
0 5.1 3.5 1.4 0.2 Setosa
1 4.9 3.0 1.4 0.2 Setosa
2 4.7 3.2 1.3 0.2 Setosa
3 4.6 3.1 1.5 0.2 Setosa
4 5.0 3.6 1.4 0.2 Setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Virginica
146 6.3 2.5 5.0 1.9 Virginica
147 6.5 3.0 5.2 2.0 Virginica
148 6.2 3.4 5.4 2.3 Virginica
149 5.9 3.0 5.1 1.8 Virginica
[150 rows x 5 columns]
dtree = tree.DecisionTreeClassifier()
dtree.fit(Xtree, ytree)
#Plot the tree
plt.figure(figsize=(15,10))
tree.plot_tree(dtree, feature_names=features, fontsize=10)
plt.show()
# Predict from a DataFrame with the same feature names to avoid the
# "X does not have valid feature names" warning.
print(dtree.predict(pd.DataFrame([[5.5, 4, 4, 1.5]], columns=features)))
[1]
A confusion matrix is a table used in classification problems to assess where the model's errors were made.
The rows represent the actual classes the outcomes should have been, while the columns represent the predictions we made. Using this table, it is easy to see which predictions are wrong.
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
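Beyond the heatmap, a per-class summary of precision, recall, and F1 is often useful. A minimal sketch using classification_report (already imported above) on the predictions from the previous cell; the target names follow the 0/1/2 mapping defined earlier:
# Per-class precision, recall and F1 for the decision-tree predictions above.
print(classification_report(y_test, predictions,
                            target_names=['Setosa', 'Versicolor', 'Virginica']))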
In classification, there are many different evaluation metrics. The most popular is accuracy, which measures how often the model is correct. It is a great metric because it is easy to understand, and getting the most correct guesses is often what we want. Still, there are cases where you should consider another evaluation metric.
Another common metric is AUC, the area under the receiver operating characteristic (ROC) curve. The ROC curve plots the true positive (TP) rate against the false positive (FP) rate at different classification thresholds. The thresholds are different probability cutoffs that separate the two classes in binary classification. AUC uses these probabilities to tell us how well a model separates the classes.
clf = tree.DecisionTreeClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xsk, ysk, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Decision Tree", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
# Binarize the output
y = label_binarize(ysk, classes = clf.classes_)
n_classes = y.shape[1]
# shuffle and split training and test sets
X_train, X_test, y_train, y_test = train_test_split(Xsk, y, test_size=0.33, random_state=0)
# Learn to predict each class against the other
classifier = OneVsRestClassifier(
clf
)
y_score = classifier.fit(X_train, y_train).predict_proba(X_test)
# Compute ROC curve and ROC area for each class
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
roc_auc[i] = auc(fpr[i], tpr[i])
# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])
plt.figure()
lw = 2
plt.plot(
fpr[2],
tpr[2],
color="darkorange",
lw=lw,
label="ROC curve (area = %0.2f)" % roc_auc[2],
)
plt.plot([0, 1], [0, 1], color="navy", lw=lw, linestyle="--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
#plt.title("Receiver operating characteristic example")
plt.legend(loc="lower right")
plt.show()
#fig.savefig('curves.png')
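The plot above only shows the curve for class 2 (Virginica). Since fpr, tpr, and roc_auc were computed for every class, we can overlay all of them, which is also where the so-far-unused cycle import comes in handy. A sketch (the color names are arbitrary choices):
# Overlay one ROC curve per class, plus the micro-average.
plt.figure()
colors = cycle(["aqua", "darkorange", "cornflowerblue"])
for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=2,
             label="class %d (area = %0.2f)" % (i, roc_auc[i]))
plt.plot(fpr["micro"], tpr["micro"], linestyle=":", lw=2,
         label="micro-average (area = %0.2f)" % roc_auc["micro"])
plt.plot([0, 1], [0, 1], color="navy", lw=2, linestyle="--")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend(loc="lower right")
plt.show()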
When adjusting models, we aim to increase overall model performance on unseen data. Hyperparameter tuning can lead to much better performance on test sets. However, optimizing parameters against the test set can lead to information leakage, causing the model to perform worse on unseen data. To correct for this we can perform cross validation (CV).
To better understand CV, we will apply several different methods to the iris dataset.
# K-Fold Cross Validation
clf = DecisionTreeClassifier(random_state=42)
k_folds = KFold(n_splits = 5)
scores = cross_val_score(clf, Xsk, ysk, cv = k_folds)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
Cross Validation Scores: [1. 1. 0.83333333 0.93333333 0.8 ]
Average CV Score: 0.9133333333333333
Number of CV Scores used in Average: 5
# Stratified K-Fold
clf = DecisionTreeClassifier(random_state=42)
sk_folds = StratifiedKFold(n_splits = 5)
scores = cross_val_score(clf, Xsk, ysk, cv = sk_folds)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
Cross Validation Scores: [0.96666667 0.96666667 0.9 0.93333333 1. ]
Average CV Score: 0.9533333333333334
Number of CV Scores used in Average: 5
#Leave One Out
X, y = datasets.load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=42)
loo = LeaveOneOut()
scores = cross_val_score(clf, X, y, cv = loo)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
Cross Validation Scores: [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1.
1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.
1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1.]
Average CV Score: 0.94
Number of CV Scores used in Average: 150
# Leave-P-Out
clf = DecisionTreeClassifier(random_state=42)
lpo = LeavePOut(p=2)
scores = cross_val_score(clf, Xsk, ysk, cv = lpo)
print("Cross Validation Scores: ", scores)
print("Average CV Score: ", scores.mean())
print("Number of CV Scores used in Average: ", len(scores))
Cross Validation Scores: [1. 1. 1. ... 1. 1. 1.]
Average CV Score: 0.9382997762863534
Number of CV Scores used in Average: 11175
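The 11175 scores are no accident: LeavePOut with p=2 trains one model for every possible pair of held-out samples, and C(150, 2) = 150 * 149 / 2 = 11175. This is why exhaustive leave-p-out quickly becomes impractical on larger datasets.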
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)
dtree = DecisionTreeClassifier(random_state = 22)
dtree.fit(X_train,y_train)
y_pred = dtree.predict(X_test)
print("Train data accuracy:",accuracy_score(y_true = y_train, y_pred = dtree.predict(X_train)))
print("Test data accuracy:",accuracy_score(y_true = y_test, y_pred = y_pred))
Train data accuracy: 1.0
Test data accuracy: 0.9210526315789473
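This gap, perfect accuracy on the training data but noticeably lower accuracy on the test data, is the signature of overfitting mentioned at the start: the unconstrained tree has memorized the training set. One common remedy is to limit the tree's depth; a minimal sketch (max_depth=3 is an illustrative choice here, not a tuned value):
# A shallower tree usually generalizes better on small datasets.
pruned = DecisionTreeClassifier(max_depth = 3, random_state = 22)
pruned.fit(X_train, y_train)
print("Train data accuracy:", accuracy_score(y_true = y_train, y_pred = pruned.predict(X_train)))
print("Test data accuracy:", accuracy_score(y_true = y_test, y_pred = pruned.predict(X_test)))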
from sklearn.ensemble import BaggingClassifier
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)
estimator_range = [2,4,6,8,10,12,14,16,18,20]
models = []
scores = []
for n_estimators in estimator_range:
# Create bagging classifier
clf = BaggingClassifier(n_estimators = n_estimators, random_state = 22)
# Fit the model
clf.fit(X_train, y_train)
# Append the model and score to their respective list
models.append(clf)
scores.append(accuracy_score(y_true = y_test, y_pred = clf.predict(X_test)))
# Generate the plot of scores against number of estimators
plt.figure(figsize=(9,6))
plt.plot(estimator_range, scores)
# Adjust labels and font (to make them visible)
plt.xlabel("n_estimators", fontsize = 18)
plt.ylabel("score", fontsize = 18)
plt.tick_params(labelsize = 16)
# Visualize plot
plt.show()
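The curve above plateaus quickly on a dataset this small, so the exact number of estimators is not critical; we'll go with n_estimators = 12, one of the better-scoring values here, in the next cell.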
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.25, random_state = 22)
clf = BaggingClassifier(n_estimators = 12, oob_score = True, random_state = 22)
clf.fit(X_train, y_train)
plt.figure(figsize=(15, 10))
plot_tree(clf.estimators_[0], feature_names = features, fontsize=14)
plt.show()
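Since the bagging classifier above was created with oob_score=True, we can also read off its out-of-bag estimate, an accuracy computed on the samples each bootstrap resample left out; a quick check:
# Out-of-bag estimate: validation "for free" from the bootstrap procedure.
print("OOB score:", clf.oob_score_)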
clf = svm.LinearSVC(max_iter=3080)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size = 0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("SVM", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
#report = classification_report(y_test, predictions, target_names = clf.classes_, labels=clf.classes_, zero_division = 0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_SVM.csv")
clf = RandomForestClassifier()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Random Forest", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
#report = classification_report(y_test, predictions, target_names = clf.classes_, labels=clf.classes_, zero_division = 0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_RF.csv")
logit = LogisticRegression(max_iter = 10000)
C = [0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2]
scores = []
# Note: each model is scored on the same data it was fitted on, so these
# numbers measure training fit, not generalization.
for choice in C:
    logit.set_params(C=choice)
    logit.fit(Xsk, ysk)
    scores.append(logit.score(Xsk, ysk))
print(scores)
[0.9666666666666667, 0.9666666666666667, 0.9733333333333334, 0.9733333333333334, 0.98, 0.98, 0.9866666666666667, 0.9866666666666667]
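Because the loop above scores each model on its own training data, the scores inevitably creep upward with weaker regularization. A fairer sketch scores each C with cross-validation instead (reusing cross_val_score from the imports):
# Score each regularization strength with 5-fold CV instead of training accuracy.
cv_scores = [cross_val_score(LogisticRegression(max_iter=10000, C=c),
                             Xsk, ysk, cv=5).mean() for c in C]
print(cv_scores)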
clf = LogisticRegression()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Logistic Regression", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
#report = classification_report(y_test, predictions, target_names = clf.classes_, labels=clf.classes_, zero_division = 0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_LR.csv")
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size=0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("Gaussian Naïve Bays", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
#report = classification_report(y_test, predictions, target_names = clf.classes_, labels=clf.classes_, zero_division = 0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_GNB.csv")
clf = KNeighborsClassifier(n_neighbors=1)
X_train, X_test, y_train, y_test = train_test_split(Xtree, ytree, test_size =0.33, random_state=0)
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
cm = confusion_matrix(y_test, predictions, labels=clf.classes_)
df_cfm = pd.DataFrame(cm, index = clf.classes_, columns = clf.classes_)
plt.figure(figsize = (10,7))
cfm_plot = sn.heatmap(df_cfm, annot=True)
plt.title("K-NN", fontsize = 22, fontweight="bold")
plt.xlabel("Predicted Label", fontsize = 22)
plt.ylabel("True Label", fontsize = 22)
sn.set(font_scale=1.4)
plt.show()
#report = classification_report(y_test, predictions, target_names = clf.classes_, labels=clf.classes_, zero_division = 0, output_dict=True)
#df = pd.DataFrame(report).transpose()
#df.to_csv("Report_KNN.csv")
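With n_neighbors=1, k-NN effectively memorizes the training set, which, like the unpruned tree earlier, invites overfitting. A quick sketch comparing a few values of k with cross-validation (the candidate k values are arbitrary illustrative choices):
# Compare several k values; larger k smooths the decision boundary.
for k in [1, 3, 5, 7, 9]:
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), Xsk, ysk, cv=5).mean()
    print("k = %d -> mean CV accuracy = %.3f" % (k, acc))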
Hierarchical clustering is an unsupervised learning method for clustering data points. The algorithm builds clusters by measuring the dissimilarities between data points. Unsupervised learning means that a model does not have to be trained, and we do not need a "target" variable. This method can be used on any data to visualize and interpret the relationship between individual data points.
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
fig = plt.figure(figsize=(15,5))
data_to_analyze = iris_data[['petal.length', 'petal.width']]
# =============
# First subplot
# =============
ax = fig.add_subplot(1, 2, 1)
groups = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='ward')
groups.fit_predict(data_to_analyze)
plt.scatter(iris_data['petal.length'] ,iris_data['petal.width'], c= groups.labels_, cmap='cool')
# =============
# Second subplot
# =============
ax = fig.add_subplot(1, 2, 2)
data_to_analyze = list(zip(iris_data['petal.length'], iris_data['petal.width']))
linkage_data = linkage(data_to_analyze, method='ward', metric='euclidean')
dendrogram(linkage_data)
plt.show()
from sklearn.cluster import KMeans
inertias = []
for i in range(1,11):
kmeans = KMeans(n_clusters=i)
kmeans.fit(data_to_analyze)
inertias.append(kmeans.inertia_)
plt.plot(range(1,11), inertias, marker='o')
plt.title('Elbow method')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
kmeans = KMeans(n_clusters=3)
kmeans.fit(data_to_analyze)
plt.scatter(iris_data['petal.length'], iris_data['petal.width'], c=kmeans.labels_, cmap='cool')
plt.show()
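Although clustering is unsupervised, Iris does come with labels, so we can quantify how well the k-means clusters (fitted on the petal features only) line up with the true species. A sanity-check sketch using the adjusted Rand index, which handles the fact that cluster IDs are arbitrary:
from sklearn.metrics import adjusted_rand_score

# 1.0 means the clusters match the species exactly; ~0 means random agreement.
print("Adjusted Rand index:", adjusted_rand_score(ysk, kmeans.labels_))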