Lab 1

Supervised Learning

In this lab, you will learn how to construct a simple machine learning model given a labelled dataset. We will be analysing the Indian Liver Patient Records dataset, and we will be predicting whether or not a patient has liver disease.

Data Exploration

In this step, we will analyze the data given to us. This gives us an idea of which features are important for determining liver disease.

import numpy as np
import pandas as pd

data = pd.read_csv("data.csv")  # Read the data CSV file
labels = data['Dataset']        # The labels: 1 = liver patient, 2 = no liver disease
data
Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
0 65 Female 0.7 0.1 187 16 18 6.8 3.3 0.90 1
1 62 Male 10.9 5.5 699 64 100 7.5 3.2 0.74 1
2 62 Male 7.3 4.1 490 60 68 7.0 3.3 0.89 1
3 58 Male 1.0 0.4 182 14 20 6.8 3.4 1.00 1
4 72 Male 3.9 2.0 195 27 59 7.3 2.4 0.40 1
5 46 Male 1.8 0.7 208 19 14 7.6 4.4 1.30 1
6 26 Female 0.9 0.2 154 16 12 7.0 3.5 1.00 1
7 29 Female 0.9 0.3 202 14 11 6.7 3.6 1.10 1
8 17 Male 0.9 0.3 202 22 19 7.4 4.1 1.20 2
9 55 Male 0.7 0.2 290 53 58 6.8 3.4 1.00 1
10 57 Male 0.6 0.1 210 51 59 5.9 2.7 0.80 1
11 72 Male 2.7 1.3 260 31 56 7.4 3.0 0.60 1
12 64 Male 0.9 0.3 310 61 58 7.0 3.4 0.90 2
13 74 Female 1.1 0.4 214 22 30 8.1 4.1 1.00 1
14 61 Male 0.7 0.2 145 53 41 5.8 2.7 0.87 1
15 25 Male 0.6 0.1 183 91 53 5.5 2.3 0.70 2
16 38 Male 1.8 0.8 342 168 441 7.6 4.4 1.30 1
17 33 Male 1.6 0.5 165 15 23 7.3 3.5 0.92 2
18 40 Female 0.9 0.3 293 232 245 6.8 3.1 0.80 1
19 40 Female 0.9 0.3 293 232 245 6.8 3.1 0.80 1
20 51 Male 2.2 1.0 610 17 28 7.3 2.6 0.55 1
21 51 Male 2.9 1.3 482 22 34 7.0 2.4 0.50 1
22 62 Male 6.8 3.0 542 116 66 6.4 3.1 0.90 1
23 40 Male 1.9 1.0 231 16 55 4.3 1.6 0.60 1
24 63 Male 0.9 0.2 194 52 45 6.0 3.9 1.85 2
25 34 Male 4.1 2.0 289 875 731 5.0 2.7 1.10 1
26 34 Male 4.1 2.0 289 875 731 5.0 2.7 1.10 1
27 34 Male 6.2 3.0 240 1680 850 7.2 4.0 1.20 1
28 20 Male 1.1 0.5 128 20 30 3.9 1.9 0.95 2
29 84 Female 0.7 0.2 188 13 21 6.0 3.2 1.10 2
... ... ... ... ... ... ... ... ... ... ... ...
553 46 Male 10.2 4.2 232 58 140 7.0 2.7 0.60 1
554 73 Male 1.8 0.9 220 20 43 6.5 3.0 0.80 1
555 55 Male 0.8 0.2 290 139 87 7.0 3.0 0.70 1
556 51 Male 0.7 0.1 180 25 27 6.1 3.1 1.00 1
557 51 Male 2.9 1.2 189 80 125 6.2 3.1 1.00 1
558 51 Male 4.0 2.5 275 382 330 7.5 4.0 1.10 1
559 26 Male 42.8 19.7 390 75 138 7.5 2.6 0.50 1
560 66 Male 15.2 7.7 356 321 562 6.5 2.2 0.40 1
561 66 Male 16.6 7.6 315 233 384 6.9 2.0 0.40 1
562 66 Male 17.3 8.5 388 173 367 7.8 2.6 0.50 1
563 64 Male 1.4 0.5 298 31 83 7.2 2.6 0.50 1
564 38 Female 0.6 0.1 165 22 34 5.9 2.9 0.90 2
565 43 Male 22.5 11.8 143 22 143 6.6 2.1 0.46 1
566 50 Female 1.0 0.3 191 22 31 7.8 4.0 1.00 2
567 52 Male 2.7 1.4 251 20 40 6.0 1.7 0.39 1
568 20 Female 16.7 8.4 200 91 101 6.9 3.5 1.02 1
569 16 Male 7.7 4.1 268 213 168 7.1 4.0 1.20 1
570 16 Male 2.6 1.2 236 131 90 5.4 2.6 0.90 1
571 90 Male 1.1 0.3 215 46 134 6.9 3.0 0.70 1
572 32 Male 15.6 9.5 134 54 125 5.6 4.0 2.50 1
573 32 Male 3.7 1.6 612 50 88 6.2 1.9 0.40 1
574 32 Male 12.1 6.0 515 48 92 6.6 2.4 0.50 1
575 32 Male 25.0 13.7 560 41 88 7.9 2.5 2.50 1
576 32 Male 15.0 8.2 289 58 80 5.3 2.2 0.70 1
577 32 Male 12.7 8.4 190 28 47 5.4 2.6 0.90 1
578 60 Male 0.5 0.1 500 20 34 5.9 1.6 0.37 2
579 40 Male 0.6 0.1 98 35 31 6.0 3.2 1.10 1
580 52 Male 0.8 0.2 245 48 49 6.4 3.2 1.00 1
581 31 Male 1.3 0.5 184 29 32 6.8 3.4 1.00 1
582 38 Male 1.0 0.3 216 21 24 7.3 4.4 1.50 2

583 rows × 11 columns

As we can see below, describe() summarizes the 10 numeric columns (9 features plus the Dataset label), each with widely different ranges. We can also observe that there are a total of 583 data points.

data.describe()
Age Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
count 583.000000 583.000000 583.000000 583.000000 583.000000 583.000000 583.000000 583.000000 579.000000 583.000000
mean 44.746141 3.298799 1.486106 290.576329 80.713551 109.910806 6.483190 3.141852 0.947064 1.286449
std 16.189833 6.209522 2.808498 242.937989 182.620356 288.918529 1.085451 0.795519 0.319592 0.452490
min 4.000000 0.400000 0.100000 63.000000 10.000000 10.000000 2.700000 0.900000 0.300000 1.000000
25% 33.000000 0.800000 0.200000 175.500000 23.000000 25.000000 5.800000 2.600000 0.700000 1.000000
50% 45.000000 1.000000 0.300000 208.000000 35.000000 42.000000 6.600000 3.100000 0.930000 1.000000
75% 58.000000 2.600000 1.300000 298.000000 60.500000 87.000000 7.200000 3.800000 1.100000 2.000000
max 90.000000 75.000000 19.700000 2110.000000 2000.000000 4929.000000 9.600000 5.500000 2.800000 2.000000
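Note also that Albumin_and_Globulin_Ratio has a count of 579 rather than 583, so the dataset contains a few missing values. A quick one-liner (not part of the graded tasks) confirms this:

data.isnull().sum()  # Albumin_and_Globulin_Ratio should report 4 missing entries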

Task 1: Querying the dataset

To get an idea of how the dataset is distributed, we can try querying it. For example:

no_patients = len(data[(data['Gender']=='Male') & (data['Age']<20)])
print("Number of patients who are male and are less than 20 years old: {}"
      .format(no_patients))
Number of patients who are male and are less than 20 years old: 29

Q1. Print the number of male patients and number of female patients

#TODO
no_males = len(data[data['Gender']=='Male'])
no_females = len(data[data['Gender']=='Female'])

print("Number of male patients: {}".format(no_males))
print("Number of female patients: {}".format(no_females))
Number of male patients: 441
Number of female patients: 142

Q2. Print the number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5

#TODO
no_patients = len(data[(data['Age']>50)&(data['Direct_Bilirubin']>0.5)])
print("Number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5: {}"
      .format(no_patients))
Number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5: 90

Q3. Print a dataframe of patients who are younger than 32 and have a level of Alkaline_Phosphotase below 200

#TODO
patients = data[(data['Age']<32)&(data['Alkaline_Phosphotase']<200)]
patients
Age Gender Total_Bilirubin Direct_Bilirubin Alkaline_Phosphotase Alamine_Aminotransferase Aspartate_Aminotransferase Total_Protiens Albumin Albumin_and_Globulin_Ratio Dataset
6 26 Female 0.9 0.2 154 16 12 7.0 3.5 1.00 1
15 25 Male 0.6 0.1 183 91 53 5.5 2.3 0.70 2
28 20 Male 1.1 0.5 128 20 30 3.9 1.9 0.95 2
36 17 Female 0.7 0.2 145 18 36 7.2 3.9 1.18 2
46 21 Male 3.9 1.8 150 36 27 6.8 3.9 1.34 1
60 31 Female 0.8 0.2 158 21 16 6.0 3.0 1.00 1
70 19 Female 0.7 0.2 186 166 397 5.5 3.0 1.20 1
75 29 Female 0.7 0.1 162 52 41 5.2 2.5 0.90 2
81 29 Male 1.0 0.3 75 25 26 5.1 2.9 1.30 1
100 27 Male 0.6 0.2 161 27 28 3.7 1.6 0.76 2
110 24 Female 0.7 0.2 188 11 10 5.5 2.3 0.71 2
112 27 Male 1.2 0.4 179 63 39 6.1 3.3 1.10 2
124 28 Male 0.6 0.1 177 36 29 6.9 4.1 1.40 2
132 18 Female 0.8 0.2 199 34 31 6.5 3.5 1.16 2
134 18 Male 1.8 0.7 178 35 36 6.8 3.6 1.10 1
173 31 Male 0.6 0.1 175 48 34 6.0 3.7 1.60 1
174 31 Male 0.6 0.1 175 48 34 6.0 3.7 1.60 1
175 31 Male 0.8 0.2 198 43 31 7.3 4.0 1.20 1
197 26 Female 0.6 0.2 142 12 32 5.7 2.4 0.75 1
203 21 Male 1.0 0.3 142 27 21 6.4 3.5 1.20 2
204 21 Male 0.7 0.2 135 27 26 6.4 3.3 1.00 2
210 28 Male 0.8 0.3 190 20 14 4.1 2.4 1.40 1
212 22 Male 2.7 1.0 160 82 127 5.5 3.1 1.20 2
225 26 Male 0.6 0.2 120 45 51 7.9 4.0 1.00 1
226 26 Male 1.3 0.4 173 38 62 8.0 4.0 1.00 1
269 26 Male 0.6 0.1 110 15 20 2.8 1.6 1.30 1
275 26 Male 1.9 0.8 180 22 19 8.2 4.1 1.00 2
293 23 Male 1.1 0.5 191 37 41 7.7 4.3 1.20 2
297 25 Female 0.9 0.3 159 24 25 6.9 4.4 1.70 2
298 31 Female 1.1 0.3 190 26 15 7.9 3.8 0.90 1
... ... ... ... ... ... ... ... ... ... ... ...
313 30 Female 0.8 0.2 158 25 22 7.9 4.5 1.30 2
314 26 Male 2.0 0.9 195 24 65 7.8 4.3 1.20 1
315 22 Male 0.9 0.3 179 18 21 6.7 3.7 1.20 2
320 30 Female 0.7 0.2 63 31 27 5.8 3.4 1.40 1
321 30 Female 0.8 0.2 198 30 58 5.2 2.8 1.10 1
327 24 Male 3.3 1.6 174 11 33 7.6 3.9 1.00 2
330 26 Male 2.0 0.9 157 54 68 6.1 2.7 0.80 1
335 13 Female 0.7 0.1 182 24 19 8.9 4.9 1.20 1
352 26 Female 0.7 0.2 144 36 33 8.2 4.3 1.10 1
355 19 Male 1.4 0.8 178 13 26 8.0 4.6 1.30 2
364 21 Male 0.8 0.2 183 33 57 6.8 3.5 1.00 2
373 25 Female 0.7 0.1 140 32 25 7.6 4.3 1.30 2
404 22 Male 0.8 0.2 198 20 26 6.8 3.9 1.30 1
421 26 Male 1.0 0.3 163 48 71 7.1 3.7 1.00 2
432 29 Male 0.7 0.2 165 55 87 7.5 4.6 1.58 1
434 30 Female 0.7 0.2 194 32 36 7.5 3.6 0.92 2
454 28 Male 0.6 0.2 159 15 16 7.0 3.5 1.00 2
455 21 Female 0.6 0.1 186 25 22 6.8 3.4 1.00 1
458 26 Male 6.8 3.2 140 37 19 3.6 0.9 0.30 1
463 25 Male 0.8 0.1 130 23 42 8.0 4.0 1.00 1
466 28 Female 0.6 0.1 137 22 16 4.9 1.9 0.60 2
467 28 Female 1.0 0.3 90 18 108 6.8 3.1 0.80 2
483 30 Male 0.8 0.2 182 46 57 7.8 4.3 1.20 2
491 27 Male 1.0 0.3 180 56 111 6.8 3.9 1.85 2
494 25 Male 0.7 0.2 185 196 401 6.5 3.9 1.50 1
496 24 Male 1.0 0.2 189 52 31 8.0 4.8 1.50 1
524 29 Male 0.8 0.2 156 12 15 6.8 3.7 1.10 2
530 22 Female 1.1 0.3 138 14 21 7.0 3.8 1.10 2
551 29 Male 1.2 0.4 160 20 22 6.2 3.0 0.90 2
581 31 Male 1.3 0.5 184 29 32 6.8 3.4 1.00 1

63 rows × 11 columns

Feel free to try out some other queries here:

#TODO
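For inspiration, here is one more illustrative query (the median threshold is an arbitrary choice):

low_agr = data[data['Albumin_and_Globulin_Ratio'] < data['Albumin_and_Globulin_Ratio'].median()]
print("Number of patients with a below-median Albumin_and_Globulin_Ratio: {}".format(len(low_agr)))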

Task 2: Data Visualization

Sometimes querying isn't enough, and you need to see the data visualized to understand more. Seaborn is a plotting library built on top of matplotlib and is extremely convenient to use. For example, the plot below shows a box plot of Alkaline_Phosphotase for all patients.

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")

sns.boxplot(x=data['Alkaline_Phosphotase']) #Box plot

plt.show()

Q4. Using seaborn, plot a scatter plot between Age and Total_Protiens (misspelled in the dataset)

#TODO
sns.regplot(x='Age',y='Total_Protiens',data=data)  # regplot draws the scatter and overlays a fitted regression line

plt.show()
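regplot fits and draws a regression line on top of the scatter. If you want the points only, regplot's fit_reg parameter can suppress the fit:

sns.regplot(x='Age', y='Total_Protiens', data=data, fit_reg=False)  # scatter only, no regression line
plt.show()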

Q5. Plot a grouped bar chart comparing the Alamine_Aminotransferase levels of patients with liver disease and patients without liver disease, categorized by gender. (Hint: use the hue property of barplot for gender, and check out https://seaborn.pydata.org/generated/seaborn.barplot.html)

#TODO
sns.barplot(x='Dataset',y='Alamine_Aminotransferase',hue='Gender',data=data)
plt.show()

Let's view the correlation heatmap for the different features for some more inspiration.

# Compute the correlation matrix
# (string columns such as Gender are dropped; on pandas >= 2.0 pass numeric_only=True)
corr = data.corr()

# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool)  # use the builtin bool: np.bool is removed in recent NumPy
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

plt.show()

Q6. You can try out any other plots here:

#TODO

plt.show()
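If you need an idea, a countplot of the class label split by gender shows how imbalanced the dataset is (purely illustrative):

sns.countplot(x='Dataset', hue='Gender', data=data)  # counts of each class, per gender
plt.show()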

Feature Selection and Scaling

Based on the knowledge we've gathered above, let's decide on the best features to include when creating a model. Using what you know about dimensionality and feature selection, pick an appropriate set of features from the dataset. There is no single right answer here.

Task 3: Feature Selection

Q7. Make a reduced dataset new_data by selecting only relevant columns from the original dataframe.

#TODO 

new_data = data[['Age','Alamine_Aminotransferase','Albumin']]

new_data.head()
Age Alamine_Aminotransferase Albumin
0 65 16 3.3
1 62 64 3.2
2 62 60 3.3
3 58 14 3.4
4 72 27 2.4
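If you would rather not pick columns by eye, sklearn can rank features for you. Below is a minimal sketch using SelectKBest with the ANOVA F-statistic; k=3 is an arbitrary choice, and the fillna step handles the 4 missing Albumin_and_Globulin_Ratio values:

from sklearn.feature_selection import SelectKBest, f_classif

features = data.drop(columns=['Gender', 'Dataset'])   # numeric features only
features = features.fillna(features.median())         # impute the few missing values

selector = SelectKBest(score_func=f_classif, k=3).fit(features, labels)
print(features.columns[selector.get_support()])       # the 3 highest-scoring columns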

Task 4: Create Training and Validation Data Split

Q8. Create a training and validation split of the data. Check out the train_test_split() function from sklearn to do this.

from sklearn.model_selection import train_test_split

#TODO
X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33,random_state=42)
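Since the classes are imbalanced (the mean of the Dataset column is about 1.29, i.e. roughly 71% of patients are in class 1), you may optionally pass stratify so both splits keep the same class proportions. A hedged variant (note that this changes the split, so the numbers further below would differ):

X_train, X_val, y_train, y_val = train_test_split(new_data, labels, test_size=0.33, random_state=42, stratify=labels)  # preserves class ratios in both splits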

Task 5: Feature Scaling

We always scale the features after splitting the dataset so that the validation data stays isolated: it acts as new, unseen data, and the scaler must not learn anything from it. Fit the scaler on the training split only, then reuse it to transform the validation split; fitting on the validation data would leak information and make the evaluation less trustworthy.

Q9. Although there are many methods to scale data, let's use MinMaxScaler from sklearn. Scale the training data.

from sklearn.preprocessing import MinMaxScaler

#TODO
scaler = MinMaxScaler()            #Instantiate the scaler
scaled_X_train = scaler.fit_transform(X_train)    #Fit and transform the data

scaled_X_train
array([[ 0.48051948,  0.01137725,  0.43478261],
       [ 0.64935065,  0.02754491,  0.45652174],
       [ 0.41558442,  0.01317365,  0.56521739],
       ..., 
       [ 0.37662338,  0.05149701,  0.86956522],
       [ 0.11688312,  0.01077844,  0.7826087 ],
       [ 0.11688312,  0.01556886,  0.7173913 ]])
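To make the fit-on-train / transform-on-validation discipline automatic, sklearn's Pipeline can bundle the scaler with a classifier so that scaling always happens correctly. A minimal sketch (the classifier choice anticipates Task 6; random_state=0 is arbitrary):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))  # scaler is fit on training data only
pipe.fit(X_train, y_train)
print(pipe.score(X_val, y_val))  # accuracy on the internally scaled validation set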

Task 6: Model Creation

Now we are finally ready to create a model and train it. Remember that this is a two-class classification problem. We need to select a classifier, not a regressor. Let's analyze two simple models, DecisionTreeClassifier and Gaussian Naive Bayes Classifier.

Q10. Instantiate and train a DecisionTreeClassifier on the given data

from sklearn.tree import DecisionTreeClassifier
#TODO
clf_dt = DecisionTreeClassifier().fit(scaled_X_train,y_train)    #Instantiate a DecisionTreeClassifier and fit it to the scaled training data
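An unconstrained decision tree can grow until it memorizes the training set. Limiting its depth is a common regularizer; a hedged variant (max_depth=4 and random_state=0 are arbitrary starting points):

clf_dt_pruned = DecisionTreeClassifier(max_depth=4, random_state=0).fit(scaled_X_train, y_train)  # shallower tree, less prone to overfitting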

Q11. Instantiate and train a GaussianNB on the given data

from sklearn.naive_bayes import GaussianNB
#TODO
clf_nb = GaussianNB().fit(scaled_X_train,y_train)                #Instantiate a GaussianNB model and fit it to the scaled training data

Model Evaluation

These models are now capable of 'predicting' whether a patient has liver disease. But we need to evaluate their performance. Since this is a two-class classification problem, we can use accuracy. However, let us also use some additional metrics for better analysis: precision, recall, and F1-score.

Task 7: Performance Metrics

Q12. Using the accuracy_score function, determine the accuracy of the two classifiers.

from sklearn.metrics import accuracy_score
#TODO
scaled_X_val = scaler.transform(X_val)    #Transform the validation set with the scaler already fitted on the training data (do not refit)

y_pred_dt = clf_dt.predict(scaled_X_val)
y_pred_nb = clf_nb.predict(scaled_X_val)

#Use accuracy_score to determine the accuracy of both classifiers;
#the convention is accuracy_score(y_true, y_pred)
acc_dt = accuracy_score(y_val,y_pred_dt)
acc_nb = accuracy_score(y_val,y_pred_nb)

print("The accuracy of Decision Tree: {}".format(acc_dt))
print("The accuracy of Gaussian Naive Bayes: {}".format(acc_nb))
The accuracy of Decision Tree: 0.6321243523316062
The accuracy of Gaussian Naive Bayes: 0.47668393782383417

Q13. Determine the precision and recall using precision_score and recall_score.

from sklearn.metrics import precision_score,recall_score
#TODO
prec_dt = precision_score(y_val,y_pred_dt)    #Argument order is (y_true, y_pred); swapping them silently exchanges precision and recall
prec_nb = precision_score(y_val,y_pred_nb)

recall_dt = recall_score(y_val,y_pred_dt)
recall_nb = recall_score(y_val,y_pred_nb)

print("The precision of Decision Tree: {}".format(prec_dt))
print("The precision of Gaussian Naive Bayes: {}".format(prec_nb))
print("The recall of Decision Tree: {}".format(recall_dt))
print("The recall of Gaussian Naive Bayes: {}".format(recall_nb))
The precision of Decision Tree: 0.7692307692307693
The precision of Gaussian Naive Bayes: 0.9
The recall of Decision Tree: 0.7092198581560284
The recall of Gaussian Naive Bayes: 0.3191489361702128
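Precision and recall both come straight from the confusion matrix, so it is worth printing it directly. In this dataset, label 1 is conventionally the liver-patient class:

from sklearn.metrics import confusion_matrix

print(confusion_matrix(y_val, y_pred_dt, labels=[1, 2]))  # rows: true class, columns: predicted class
print(confusion_matrix(y_val, y_pred_nb, labels=[1, 2]))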

Q14. Determine the F1-score of the two classifiers.

from sklearn.metrics import f1_score

f1_dt = f1_score(y_val,y_pred_dt)
f1_nb = f1_score(y_val,y_pred_nb)

print("The F1-score of Decision Tree: {}".format(f1_dt))
print("The F1-score of Gaussian Naive Bayes: {}".format(f1_nb))
The F1-score of Decision Tree: 0.7380073800738008
The F1-score of Gaussian Naive Bayes: 0.47120418848167545

We have officially solved a machine learning problem. However, building a truly effective model is another matter. Try to get the F1-score of either classifier up to 0.85 (85%).
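One way to approach this is a small grid search over the tree's hyperparameters, scored by F1. A sketch, with the understanding that the parameter grid below is merely illustrative:

from sklearn.model_selection import GridSearchCV

param_grid = {'max_depth': [3, 5, 7, None], 'min_samples_leaf': [1, 5, 10]}
grid = GridSearchCV(DecisionTreeClassifier(random_state=0), param_grid, scoring='f1', cv=5)  # 'f1' scores class 1 (liver patients)
grid.fit(scaled_X_train, y_train)
print(grid.best_params_, grid.best_score_)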