In this lab, you will learn how to construct a simple machine learning model given a labelled dataset. We will be analyzing the Indian Liver Patient Records dataset and predicting whether a patient has liver disease.
In this step, we will analyze the data given to us. This gives us an idea of which features are important for determining liver disease.
import numpy as np
import pandas as pd
data = pd.read_csv("data.csv") #Reading data csv file
labels = data['Dataset'] # Setting the labels
data
As we can see below, the numeric columns have widely different ranges. We can also observe that there are a total of 583 data points.
data.describe()
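It's also worth checking for missing values at this point, since describe() only summarizes the values it can see. A quick check using standard pandas:
data.isnull().sum() # Count of missing values in each column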
To get a sense of how the data is distributed, we can try querying the dataset. For example:
no_patients = len(data[(data['Gender']=='Male') & (data['Age']<20)])
print("Number of patients who are male and are less than 20 years old: {}"
.format(no_patients))
Q1. Print the number of male patients and number of female patients
#TODO
no_males = len(data[data['Gender']=='Male'])
no_females = len(data[data['Gender']=='Female'])
print("Number of male patients: {}".format(no_males))
print("Number of female patients: {}".format(no_females))
Q2. Print the number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5
#TODO
no_patients = len(data[(data['Age']>50)&(data['Direct_Bilirubin']>0.5)])
print("Number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5: {}"
.format(no_patients))
Q3. Print a dataframe of patients who are younger than 32 or have a level of Alkaline_Phosphotase below 200
#TODO
patients = data[(data['Age']<32)|(data['Alkaline_Phosphotase']<200)] # Note: | is element-wise OR, & is element-wise AND
patients
Feel free to try out some other queries here:
#TODO
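If you're stuck for ideas, here is one more possibility, a sketch using pandas groupby:
print(data.groupby('Dataset')['Total_Bilirubin'].mean()) # Average Total_Bilirubin within each label class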
Sometimes querying isn't enough, and you need to visualize the data to understand more. Seaborn is a plotting library built on top of matplotlib that is extremely convenient to use. For example, the plot below is a box plot of Alkaline_Phosphotase across all patients.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.boxplot(x=data['Alkaline_Phosphotase']) #Box plot
plt.show()
Q4. Using seaborn, plot a scatter plot between Age and Total_Protiens (misspelled in the dataset)
#TODO
sns.regplot(x='Age',y='Total_Protiens',data=data) # regplot draws the scatter plot along with a fitted regression line
plt.show()
Q5. Plot a grouped bar chart comparing the Alamine_Aminotransferase levels of patients with liver disease and patients without liver disease, categorized by gender. (Hint: use the hue property of barplot for gender, and check this out: https://seaborn.pydata.org/generated/seaborn.barplot.html)
#TODO
sns.barplot(x='Dataset',y='Alamine_Aminotransferase',hue='Gender',data=data)
plt.show()
Let's view the correlation heatmap for the different features for some more inspiration.
# Compute the correlation matrix
corr = data.corr(numeric_only=True) # Correlations between numeric columns only (numeric_only is required in newer pandas)
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) # np.bool was removed from newer NumPy versions; the built-in bool works everywhere
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
Q6. You can try out any other plots here:
#TODO
plt.show()
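If you need a starting point, one option (a sketch using seaborn's countplot) is the class balance across genders:
sns.countplot(x='Dataset',hue='Gender',data=data) # Number of patients in each class, split by gender
plt.show()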
Based on the knowledge we've gathered above, let's decide on the best features to include when creating a model. Using what you know about dimensionality and feature selection, pick an appropriate number of features for the dataset. There is no single right answer here.
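One simple, data-driven starting point (a sketch; keep in mind that correlation only captures linear relationships) is to rank the numeric features by the absolute value of their correlation with the label:
data.corr(numeric_only=True)['Dataset'].drop('Dataset').abs().sort_values(ascending=False) # Features most linearly related to the label, in descending order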
Q7. Make a reduced dataset new_data by selecting only relevant columns from the original dataframe.
#TODO
new_data = data[['Age','Alamine_Aminotransferase','Albumin']]
new_data.head()
Q8. Create a training and validation split of the data. Check out the train_test_split() function from sklearn to do this.
from sklearn.model_selection import train_test_split
#TODO
X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33,random_state=42)
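Note that if the classes are imbalanced, you may want a stratified split so that both sets preserve the label proportions. A sketch using the stratify parameter of train_test_split:
X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33,random_state=42,stratify=labels) # stratify keeps the class ratio equal in both splits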
We always scale the features after splitting the dataset, fitting the scaler on the training data only. The validation data acts as new, unseen data; if its statistics leak into the scaling step, our validation results will no longer reflect performance on truly unseen data.
Q9. Although there are many methods to scale data, let's use MinMaxScaler from sklearn. Scale the training data.
from sklearn.preprocessing import MinMaxScaler
#TODO
scaler = MinMaxScaler() #Instantiate the scaler
scaled_X_train = scaler.fit_transform(X_train) #Fit and transform the data
scaled_X_train
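As a sanity check, every scaled training feature should now lie in [0, 1], since MinMaxScaler maps each value x to (x - min) / (max - min) using the training minimum and maximum:
print(scaled_X_train.min(axis=0)) # Should be all zeros
print(scaled_X_train.max(axis=0)) # Should be all ones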
Now we are finally ready to create a model and train it. Remember that this is a two-class classification problem, so we need to select a classifier, not a regressor. Let's analyze two simple models: DecisionTreeClassifier and the Gaussian Naive Bayes classifier.
Q10. Instantiate and train a DecisionTreeClassifier on the given data
from sklearn.tree import DecisionTreeClassifier
#TODO
clf_dt = DecisionTreeClassifier().fit(scaled_X_train,y_train) #Instantiate a DecisionTreeClassifier model and fit it to the training data using the fit function
Q11. Instantiate and train a GaussianNB on the given data
from sklearn.naive_bayes import GaussianNB
#TODO
clf_nb = GaussianNB().fit(scaled_X_train,y_train) #Instantiate a GaussianNB model and fit it to the training data using the fit function
These models are now capable of 'predicting' whether a patient has liver disease. But we need to evaluate their performance. Since this is a two-class classification problem, we can use accuracy. However, let us also use some additional metrics for better analysis: precision, recall, and F1-score.
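As a refresher, all three of these metrics come from the confusion matrix. Here is a minimal sketch with made-up toy labels (using the same 1/2 label convention as the Dataset column), not the model outputs:
from sklearn.metrics import confusion_matrix
y_true_toy = [1,1,2,2,1,2] # Hypothetical ground truth
y_pred_toy = [1,2,2,2,1,1] # Hypothetical predictions
cm = confusion_matrix(y_true_toy,y_pred_toy,labels=[1,2])
tp,fn = cm[0] # Row 0: actual class 1, treated here as the positive class
fp,tn = cm[1] # Row 1: actual class 2
precision = tp/(tp+fp) # Of everything predicted positive, how much truly is
recall = tp/(tp+fn) # Of everything truly positive, how much we found
f1 = 2*precision*recall/(precision+recall) # Harmonic mean of precision and recall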
Q12. Using the accuracy_score function, determine the accuracy of the two classifiers.
from sklearn.metrics import accuracy_score
#TODO
scaled_X_val = scaler.transform(X_val) # Transform the validation set using the scaler already fitted on the training data (do not refit!)
y_pred_dt = clf_dt.predict(scaled_X_val)
y_pred_nb = clf_nb.predict(scaled_X_val)
#Use accuracy score and determine accuracy of both classifiers
acc_dt = accuracy_score(y_val,y_pred_dt) # The argument order is (y_true, y_pred)
acc_nb = accuracy_score(y_val,y_pred_nb)
print("The accuracy of Decision Tree: {:.2f} %".format(acc_dt*100))
print("The accuracy of Gaussian Naive Bayes: {:.2f} %".format(acc_nb*100))
Q13. Determine the precision and recall using precision_score and recall_score.
from sklearn.metrics import precision_score,recall_score
#TODO
prec_dt = precision_score(y_val,y_pred_dt) # (y_true, y_pred); unlike accuracy, precision and recall are not symmetric in their arguments
prec_nb = precision_score(y_val,y_pred_nb)
recall_dt = recall_score(y_val,y_pred_dt)
recall_nb = recall_score(y_val,y_pred_nb)
print("The precision of Decision Tree: {:.2f} %".format(prec_dt*100))
print("The precision of Gaussian Naive Bayes: {:.2f} %".format(prec_nb*100))
print("The recall of Decision Tree: {:.2f} %".format(recall_dt*100))
print("The recall of Gaussian Naive Bayes: {:.2f} %".format(recall_nb*100))
Q14. Determine the F1-score of the two classifiers.
from sklearn.metrics import f1_score
f1_dt = f1_score(y_val,y_pred_dt)
f1_nb = f1_score(y_val,y_pred_nb)
print("The F1-score of Decision Tree: {:.2f} %".format(f1_dt*100))
print("The F1-score of Gaussian Naive Bayes: {:.2f} %".format(f1_nb*100))
We have officially solved a machine learning problem. Making an effective model, however, is still an open question. Try to get the F1-score of either classifier up to 85%.
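One possible direction (a sketch, not the only answer) is to tune the decision tree's hyperparameters with cross-validated grid search; GridSearchCV is standard sklearn, and the grid values below are just starting guesses:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[2,3,5,10,None],'min_samples_leaf':[1,5,10,20]} # Candidate values, not tuned answers
search = GridSearchCV(DecisionTreeClassifier(random_state=42),param_grid,scoring='f1',cv=5)
search.fit(scaled_X_train,y_train) # Cross-validates every parameter combination on the training data
print(search.best_params_,search.best_score_)
You could also revisit Q7 with a different choice of features, or try other classifiers entirely.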