In this lab, you will learn how to construct a simple machine learning model given a labelled dataset. We will be analyzing the Indian Liver Patient Records dataset and predicting whether a patient has liver disease.
In this step, we will analyze the data given to us. This gives us an idea of which features are important for determining liver disease.
import numpy as np
import pandas as pd
data = pd.read_csv("data.csv") #Reading data csv file
labels = data['Dataset'] # Setting the labels
data
As we can see below, the numeric columns have widely different ranges. We can also observe that there are a total of 583 data points.
data.describe()
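It's also worth checking for missing values at this point, since describe() only summarizes the values it can see. A quick check using standard pandas:
data.isnull().sum() # Count of missing values in each column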
To get a sense of how the data is distributed, we can try querying the dataset. For example:
no_patients = len(data[(data['Gender']=='Male') & (data['Age']<20)])
print("Number of patients who are male and are less than 20 years old: {}"
.format(no_patients))
Q1. Print the number of male patients and number of female patients
#TODO
no_males = len(data[data['Gender']=='Male'])
no_females = len(data[data['Gender']=='Female'])
print("Number of male patients: {}".format(no_males))
print("Number of female patients: {}".format(no_females))
Q2. Print the number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5
#TODO
no_patients = len(data[(data['Age']>50)&(data['Direct_Bilirubin']>0.5)])
print("Number of patients who are older than 50 and have a level of Direct_Bilirubin above 0.5: {}"
.format(no_patients))
Q3. Print a dataframe of patients who are younger than 32 or have a level of Alkaline_Phosphotase below 200
#TODO
patients = data[(data['Age']<32)|(data['Alkaline_Phosphotase']<200)] # Note: | is element-wise OR, & is element-wise AND
patients
Feel free to try out some other queries here:
#TODO
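If you're stuck for ideas, here is one more possibility, a sketch using pandas groupby:
print(data.groupby('Dataset')['Total_Bilirubin'].mean()) # Average Total_Bilirubin within each label class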
Sometimes querying isn't enough, and you need to visualize the data to understand more. Seaborn is a plotting library built on top of matplotlib that is extremely convenient to use. For example, the plot below is a box plot of Alkaline_Phosphotase across all patients.
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.boxplot(x=data['Alkaline_Phosphotase']) #Box plot
plt.show()
Q4. Using seaborn, plot a scatter plot between Age and Total_Protiens (misspelled in the dataset)
#TODO
sns.regplot(x='Age',y='Total_Protiens',data=data) # regplot draws the scatter plot along with a fitted regression line
plt.show()
Q5. Plot a grouped bar chart comparing the Alamine_Aminotransferase levels of patients with liver disease and patients without liver disease, categorized by gender. (Hint: use the hue property of barplot for gender, and check this out: https://seaborn.pydata.org/generated/seaborn.barplot.html)
#TODO
sns.barplot(x='Dataset',y='Alamine_Aminotransferase',hue='Gender',data=data)
plt.show()
Let's view the correlation heatmap for the different features for some more inspiration.
# Compute the correlation matrix
corr = data.corr(numeric_only=True) # Correlations between numeric columns only (numeric_only is required in newer pandas)
# Generate a mask for the upper triangle
mask = np.zeros_like(corr, dtype=bool) # np.bool was removed from newer NumPy versions; the built-in bool works everywhere
mask[np.triu_indices_from(mask)] = True
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(11, 9))
# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5})
plt.show()
Q6. You can try out any other plots here:
#TODO
plt.show()
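If you need a starting point, one option (a sketch using seaborn's countplot) is the class balance across genders:
sns.countplot(x='Dataset',hue='Gender',data=data) # Number of patients in each class, split by gender
plt.show()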
Based on the knowledge we've gathered above, let's decide on the best features to include when creating a model. Using what you know about dimensionality and feature selection, pick an appropriate number of features for the dataset. There is no single right answer here.
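One simple, data-driven starting point (a sketch; keep in mind that correlation only captures linear relationships) is to rank the numeric features by the absolute value of their correlation with the label:
data.corr(numeric_only=True)['Dataset'].drop('Dataset').abs().sort_values(ascending=False) # Features most linearly related to the label, in descending order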
Q7. Make a reduced dataset new_data by selecting only relevant columns from the original dataframe.
#TODO
new_data = data[['Age','Alamine_Aminotransferase','Albumin']]
new_data.head()
Q8. Create a training and validation split of the data. Check out the train_test_split() function from sklearn to do this.
from sklearn.model_selection import train_test_split
#TODO
X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33,random_state=42)
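Note that if the classes are imbalanced, you may want a stratified split so that both sets preserve the label proportions. A sketch using the stratify parameter of train_test_split:
X_train,X_val,y_train,y_val = train_test_split(new_data,labels,test_size=0.33,random_state=42,stratify=labels) # stratify keeps the class ratio equal in both splits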
We always scale the features after splitting the dataset, fitting the scaler on the training data only. The validation data acts as new, unseen data; if its statistics leak into the scaling step, our validation results will no longer reflect performance on truly unseen data.
Q9. Although there are many methods to scale data, let's use MinMaxScaler from sklearn. Scale the training data.
from sklearn.preprocessing import MinMaxScaler
#TODO
scaler = MinMaxScaler() #Instantiate the scaler
scaled_X_train = scaler.fit_transform(X_train) #Fit and transform the data
scaled_X_train
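As a sanity check, every scaled training feature should now lie in [0, 1], since MinMaxScaler maps each value x to (x - min) / (max - min) using the training minimum and maximum:
print(scaled_X_train.min(axis=0)) # Should be all zeros
print(scaled_X_train.max(axis=0)) # Should be all ones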
Now we are finally ready to create a model and train it. Remember that this is a two-class classification problem, so we need to select a classifier, not a regressor. Let's analyze two simple models: DecisionTreeClassifier and the Gaussian Naive Bayes classifier.
Q10. Instantiate and train a DecisionTreeClassifier on the given data
from sklearn.tree import DecisionTreeClassifier
#TODO
clf_dt = DecisionTreeClassifier().fit(scaled_X_train,y_train) #Instantiate a DecisionTreeClassifier model and fit it to the training data using the fit function
Q11. Instantiate and train a GaussianNB on the given data
from sklearn.naive_bayes import GaussianNB
#TODO
clf_nb = GaussianNB().fit(scaled_X_train,y_train) #Instantiate a GaussianNB model and fit it to the training data using the fit function
These models are now capable of 'predicting' whether a patient has liver disease. But we need to evaluate their performance. Since this is a two-class classification problem, we can use accuracy. However, let us also use some additional metrics for better analysis: precision, recall, and F1-score.
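As a refresher, all three of these metrics come from the confusion matrix. Here is a minimal sketch with made-up toy labels (using the same 1/2 label convention as the Dataset column), not the model outputs:
from sklearn.metrics import confusion_matrix
y_true_toy = [1,1,2,2,1,2] # Hypothetical ground truth
y_pred_toy = [1,2,2,2,1,1] # Hypothetical predictions
cm = confusion_matrix(y_true_toy,y_pred_toy,labels=[1,2])
tp,fn = cm[0] # Row 0: actual class 1, treated here as the positive class
fp,tn = cm[1] # Row 1: actual class 2
precision = tp/(tp+fp) # Of everything predicted positive, how much truly is
recall = tp/(tp+fn) # Of everything truly positive, how much we found
f1 = 2*precision*recall/(precision+recall) # Harmonic mean of precision and recall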
Q12. Using the accuracy_score function, determine the accuracy of the two classifiers.
from sklearn.metrics import accuracy_score
#TODO
scaled_X_val = scaler.transform(X_val) # Transform the validation set using the scaler already fitted on the training data (do not refit!)
y_pred_dt = clf_dt.predict(scaled_X_val)
y_pred_nb = clf_nb.predict(scaled_X_val)
#Use accuracy score and determine accuracy of both classifiers
acc_dt = accuracy_score(y_val,y_pred_dt) # The argument order is (y_true, y_pred)
acc_nb = accuracy_score(y_val,y_pred_nb)
print("The accuracy of Decision Tree: {:.2f} %".format(acc_dt*100))
print("The accuracy of Gaussian Naive Bayes: {:.2f} %".format(acc_nb*100))
Q13. Determine the precision and recall using precision_score and recall_score.
from sklearn.metrics import precision_score,recall_score
#TODO
prec_dt = precision_score(y_val,y_pred_dt) # (y_true, y_pred); unlike accuracy, precision and recall are not symmetric in their arguments
prec_nb = precision_score(y_val,y_pred_nb)
recall_dt = recall_score(y_val,y_pred_dt)
recall_nb = recall_score(y_val,y_pred_nb)
print("The precision of Decision Tree: {:.2f} %".format(prec_dt*100))
print("The precision of Gaussian Naive Bayes: {:.2f} %".format(prec_nb*100))
print("The recall of Decision Tree: {:.2f} %".format(recall_dt*100))
print("The recall of Gaussian Naive Bayes: {:.2f} %".format(recall_nb*100))
Q14. Determine the F1-score of the two classifiers.
from sklearn.metrics import f1_score
f1_dt = f1_score(y_val,y_pred_dt)
f1_nb = f1_score(y_val,y_pred_nb)
print("The F1-score of Decision Tree: {:.2f} %".format(f1_dt*100))
print("The F1-score of Gaussian Naive Bayes: {:.2f} %".format(f1_nb*100))
We have officially solved a machine learning problem. Making an effective model, however, is still an open question. Try to get the F1-score of either classifier up to 85%.
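One possible direction (a sketch, not the only answer) is to tune the decision tree's hyperparameters with cross-validated grid search; GridSearchCV is standard sklearn, and the grid values below are just starting guesses:
from sklearn.model_selection import GridSearchCV
param_grid = {'max_depth':[2,3,5,10,None],'min_samples_leaf':[1,5,10,20]} # Candidate values, not tuned answers
search = GridSearchCV(DecisionTreeClassifier(random_state=42),param_grid,scoring='f1',cv=5)
search.fit(scaled_X_train,y_train) # Cross-validates every parameter combination on the training data
print(search.best_params_,search.best_score_)
You could also revisit Q7 with a different choice of features, or try other classifiers entirely.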