Pandas¶

In this notebook, we'll learn the basics of data analysis with the Python Pandas library.

Uploading the data¶

We're first going to get some data to play with. We're going to load the titanic dataset from the public link below.

import urllib

# Upload data from GitHub to notebook's local drive
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/titanic.csv"
response = urllib.request.urlopen(url)
html = response.read()
with open('titanic.csv', 'wb') as f:
    f.write(html)

# Checking if the data was uploaded
!ls -l

total 208
-rw-r--r-- 1 arrayslayer arrayslayer 24355 Jan 15 18:18 01_Python.ipynb
-rw-r--r-- 1 arrayslayer arrayslayer 16636 Jan 15 18:42 02_NumPy.ipynb
-rw-r--r-- 1 arrayslayer arrayslayer 61257 Jan 15 15:36 03_Pandas.ipynb
-rw-rw-r-- 1 arrayslayer arrayslayer 16609 Jan 15 17:24 Lab1.zip
-rw-r--r-- 1 arrayslayer arrayslayer 85153 Jan 15 18:44 titanic.csv

Loading the data¶

Now that we have some data to play with, let's load it into a Pandas dataframe. Pandas is a great Python library for data analysis.

import pandas as pd

# Read from CSV to Pandas DataFrame
df = pd.read_csv("titanic.csv", header=0)

pd.re

# First five items
df.head(n=10)

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	1
1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	1
2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	0
3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	0
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	0
5	1	Anderson, Mr. Harry	male	48.0000	0	0	19952	26.5500	E12	S	1
6	1	Andrews, Miss. Kornelia Theodosia	female	63.0000	1	0	13502	77.9583	D7	S	1
7	1	Andrews, Mr. Thomas Jr	male	39.0000	0	0	112050	0.0000	A36	S	0
8	1	Appleton, Mrs. Edward Dale (Charlotte Lamson)	female	53.0000	2	0	11769	51.4792	C101	S	1
9	1	Artagaveytia, Mr. Ramon	male	71.0000	0	0	PC 17609	49.5042	NaN	C	0

These are the diferent features:

pclass: class of travel
name: full name of the passenger
sex: gender
age: numerical age
sibsp: # of siblings/spouse aboard
parch: number of parents/child aboard
ticket: ticket number
fare: cost of the ticket
cabin: location of room
emarked: port that the passenger embarked at (C - Cherbourg, S - Southampton, Q = Queenstown)
survived: survial metric (0 - died, 1 - survived)

Exploratory analysis¶

We're going to use the Pandas library and see how we can explore and process our data.

# Describe features
df.describe()

	pclass	age	sibsp	parch	fare	survived
count	1309.000000	1046.000000	1309.000000	1309.000000	1308.000000	1309.000000
mean	2.294882	29.881135	0.498854	0.385027	33.295479	0.381971
std	0.837836	14.413500	1.041658	0.865560	51.758668	0.486055
min	1.000000	0.166700	0.000000	0.000000	0.000000	0.000000
25%	2.000000	21.000000	0.000000	0.000000	7.895800	0.000000
50%	3.000000	28.000000	0.000000	0.000000	14.454200	0.000000
75%	3.000000	39.000000	1.000000	0.000000	31.275000	1.000000
max	3.000000	80.000000	8.000000	9.000000	512.329200	1.000000

%matplotlib inline

# Histograms
df["age"].hist()

<matplotlib.axes._subplots.AxesSubplot at 0x7fe4f424c940>

# Unique values
df["embarked"].unique()

array(['S', 'C', nan, 'Q'], dtype=object)

# Selecting data by feature
df["name"].head()

0                      Allen, Miss. Elisabeth Walton
1                     Allison, Master. Hudson Trevor
2                       Allison, Miss. Helen Loraine
3               Allison, Mr. Hudson Joshua Creighton
4    Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
Name: name, dtype: object

# Filtering
df[df["sex"]=="female"].head() # only the female data appear

	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
0	1	Allen, Miss. Elisabeth Walton	female	29.0	0	0	24160	211.3375	B5	S	1
2	1	Allison, Miss. Helen Loraine	female	2.0	1	2	113781	151.5500	C22 C26	S	0
4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0	1	2	113781	151.5500	C22 C26	S	0
6	1	Andrews, Miss. Kornelia Theodosia	female	63.0	1	0	13502	77.9583	D7	S	1
8	1	Appleton, Mrs. Edward Dale (Charlotte Lamson)	female	53.0	2	0	11769	51.4792	C101	S	1

# Sorting
df.sort_values("age", ascending=False).head()

	pclass	name	sex	age	sibsp	ticket	fare	cabin	embarked	survived
14	1	Barkworth, Mr. Algernon Henry Wilson	male	80.0	0	27042	30.0000	A23	S	1
61	1	Cavendish, Mrs. Tyrell William (Julia Florence...	female	76.0	1	19877	78.8500	C46	S	1
1235	3	Svensson, Mr. Johan	male	74.0	0	347060	7.7750	NaN	S	0
135	1	Goldschmidt, Mr. George B	male	71.0	0	PC 17754	34.6542	A5	C	0
9	1	Artagaveytia, Mr. Ramon	male	71.0	0	PC 17609	49.5042	NaN	C	0

# Grouping
survived_group = df.groupby("survived")
survived_group.mean()

	pclass	age	sibsp	parch	fare
survived
0	2.500618	30.545369	0.521632	0.328801	23.353831
1	1.962000	28.918228	0.462000	0.476000	49.361184

# Selecting row
df.iloc[0, :] # iloc gets rows (or columns) at particular positions in the index (so it only takes integers)

pclass                                  1
name        Allen, Miss. Elisabeth Walton
sex                                female
age                                    29
sibsp                                   0
parch                                   0
ticket                              24160
fare                              211.338
cabin                                  B5
embarked                                S
survived                                1
Name: 0, dtype: object

# Selecting specific value
df.iloc[0, 1]

'Allen, Miss. Elisabeth Walton'

# Selecting by index
df.loc[0] # loc gets rows (or columns) with particular labels from the index

pclass                                  1
name        Allen, Miss. Elisabeth Walton
sex                                female
age                                    29
sibsp                                   0
parch                                   0
ticket                              24160
fare                              211.338
cabin                                  B5
embarked                                S
survived                                1
Name: 0, dtype: object

Preprocessing¶

# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()

	pclass	name	sex	age	ticket	fare	cabin	embarked	survived
9	1	Artagaveytia, Mr. Ramon	male	71.0	PC 17609	49.5042	NaN	C	0
13	1	Barber, Miss. Ellen "Nellie"	female	26.0	19877	78.8500	NaN	S	1
15	1	Baumann, Mr. John D	male	NaN	PC 17318	25.9250	NaN	S	0
23	1	Bidois, Miss. Rosalie	female	42.0	PC 17757	227.5250	NaN	C	1
25	1	Birnbaum, Mr. Jakob	male	25.0	13905	26.0000	NaN	C	0

# Drop rows with Nan values
df = df.dropna() # removes rows with any NaN values
df = df.reset_index() # reset's row indexes in case any rows were dropped
df.head()

	index	pclass	name	sex	age	sibsp	parch	ticket	fare	cabin	embarked	survived
0	0	1	Allen, Miss. Elisabeth Walton	female	29.0000	0	0	24160	211.3375	B5	S	1
1	1	1	Allison, Master. Hudson Trevor	male	0.9167	1	2	113781	151.5500	C22 C26	S	1
2	2	1	Allison, Miss. Helen Loraine	female	2.0000	1	2	113781	151.5500	C22 C26	S	0
3	3	1	Allison, Mr. Hudson Joshua Creighton	male	30.0000	1	2	113781	151.5500	C22 C26	S	0
4	4	1	Allison, Mrs. Hudson J C (Bessie Waldo Daniels)	female	25.0000	1	2	113781	151.5500	C22 C26	S	0

# Dropping multiple columns
df = df.drop(["name", "cabin", "ticket"], axis=1) # we won't use text features for our initial basic models
df.head()

	index	pclass	sex	age	sibsp	parch	fare	embarked	survived
0	0	1	female	29.0000	0	0	211.3375	S	1
1	1	1	male	0.9167	1	2	151.5500	S	1
2	2	1	female	2.0000	1	2	151.5500	S	0
3	3	1	male	30.0000	1	2	151.5500	S	0
4	4	1	female	25.0000	1	2	151.5500	S	0

# Map feature values
df['sex'] = df['sex'].map( {'female': 0, 'male': 1} ).astype(int)
df["embarked"] = df['embarked'].dropna().map( {'S':0, 'C':1, 'Q':2} ).astype(int)
df.head()

	index	pclass	sex	age	sibsp	parch	fare	survived
0	0	1	0	29.0000	0	0	211.3375	1
1	1	1	1	0.9167	1	2	151.5500	1
2	2	1	0	2.0000	1	2	151.5500	0
3	3	1	1	30.0000	1	2	151.5500	0
4	4	1	0	25.0000	1	2	151.5500	0

Feature engineering¶

# Lambda expressions to create new features
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size

df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()

	index	pclass	sex	age	sibsp	parch	fare	survived	family_size
0	0	1	0	29.0000	0	0	211.3375	1	0
1	1	1	1	0.9167	1	2	151.5500	1	3
2	2	1	0	2.0000	1	2	151.5500	0	3
3	3	1	1	30.0000	1	2	151.5500	0	3
4	4	1	0	25.0000	1	2	151.5500	0	3

df['family'] = df['sibsp'] + df['parch']

# Reorganize headers
df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]
df.head()

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-25-df1bb562d56b> in <module>()
      1 # Reorganize headers
----> 2 df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]
      3 df.head()

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2680         if isinstance(key, (Series, np.ndarray, Index, list)):
   2681             # either boolean or fancy integer index
-> 2682             return self._getitem_array(key)
   2683         elif isinstance(key, DataFrame):
   2684             return self._getitem_frame(key)

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in _getitem_array(self, key)
   2724             return self._take(indexer, axis=0)
   2725         else:
-> 2726             indexer = self.loc._convert_to_indexer(key, axis=1)
   2727             return self._take(indexer, axis=1)
   2728 

~/anaconda3/lib/python3.7/site-packages/pandas/core/indexing.py in _convert_to_indexer(self, obj, axis, is_setter)
   1325                 if mask.any():
   1326                     raise KeyError('{mask} not in index'
-> 1327                                    .format(mask=objarr[mask]))
   1328 
   1329                 return com._values_from_object(indexer)

KeyError: "['family_size'] not in index"

Saving data¶

# Saving dataframe to CSV
df.to_csv("processed_titanic.csv", index=False)

# See your saved file
!ls -l

total 96
-rw-r--r-- 1 root root  6975 Oct  8 17:49 processed_titanic.csv
drwxr-xr-x 2 root root  4096 Sep 28 23:32 sample_data
-rw-r--r-- 1 root root 85153 Oct  8 17:48 titanic.csv