In this notebook, we'll learn the basics of data analysis with the Python Pandas library.
First, we need some data to play with. We'll download the Titanic dataset from the public link below.
import urllib.request
# Download data from GitHub to the notebook's local drive
url = "https://raw.githubusercontent.com/GokuMohandas/practicalAI/master/data/titanic.csv"
response = urllib.request.urlopen(url)
data = response.read()  # raw bytes of the CSV file
with open('titanic.csv', 'wb') as f:
    f.write(data)
# Check that the file was downloaded
!ls -l
Now that we have some data to play with, let's load it into a Pandas dataframe. Pandas is a great Python library for data analysis.
import pandas as pd
# Read from CSV to Pandas DataFrame
df = pd.read_csv("titanic.csv", header=0)
# First ten rows
df.head(n=10)
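These are the different features (columns) in the dataset: pclass (passenger class), name, sex, age, sibsp (number of siblings/spouses aboard), parch (number of parents/children aboard), ticket, fare, cabin, embarked (port of embarkation: C, Q, or S), and survived (whether the passenger survived).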
We're going to use the Pandas library and see how we can explore and process our data.
# Describe features
df.describe()
%matplotlib inline
# Histograms
df["age"].hist()
# Unique values
df["embarked"].unique()
# Selecting data by feature
df["name"].head()
# Filtering
df[df["sex"]=="female"].head() # only the female data appear
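Filters can also be combined. The sketch below uses a small made-up DataFrame (the values are hypothetical, not taken from the Titanic file) to show how two conditions are joined with `&`. Each condition needs its own parentheses.

```python
import pandas as pd

# Toy data (hypothetical values) with the same column names as the Titanic set
toy = pd.DataFrame({
    "sex": ["female", "male", "female"],
    "age": [29.0, 2.0, 50.0],
})

# Each comparison yields a boolean Series; & combines them element-wise
mask = (toy["sex"] == "female") & (toy["age"] > 30)
subset = toy[mask]  # keeps only rows where the mask is True
print(subset)
```

Use `|` for "or" and `~` to negate a condition.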
# Sorting
df.sort_values("age", ascending=False).head()
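One detail worth knowing when sorting a column that has missing values: by default, `sort_values` places NaN entries at the end, whichever direction you sort in. A small sketch with made-up ages:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), including one missing age
toy = pd.DataFrame({"age": [22.0, np.nan, 80.0]})

# Descending sort; the NaN row still ends up last (na_position defaults to "last")
oldest_first = toy.sort_values("age", ascending=False)
print(oldest_first)
```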
# Grouping
survived_group = df.groupby("survived")
survived_group.mean(numeric_only=True) # average only the numeric columns; text columns such as name cannot be averaged
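To make the grouping step easier to follow, here is a minimal sketch on a made-up four-row DataFrame (hypothetical values). It groups by `survived` and averages the numeric columns within each group.

```python
import pandas as pd

# Toy data (hypothetical values): two survivors, two non-survivors
toy = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [20.0, 40.0, 30.0, 60.0],
    "name": ["a", "b", "c", "d"],
})

# One output row per distinct value of "survived";
# numeric_only=True leaves the text column "name" out of the result
means = toy.groupby("survived").mean(numeric_only=True)
print(means)
```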
# Selecting row
df.iloc[0, :] # iloc gets rows (or columns) at particular positions in the index (so it only takes integers)
# Selecting specific value
df.iloc[0, 1]
# Selecting by index
df.loc[0] # loc gets rows (or columns) with particular labels from the index
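With the default index (0, 1, 2, ...), `iloc[0]` and `loc[0]` return the same row, which hides the difference between them. The sketch below uses a made-up DataFrame whose index labels are 10, 20, 30 (hypothetical values) so that position and label no longer coincide.

```python
import pandas as pd

# Toy frame (hypothetical data) whose index labels differ from row positions
toy = pd.DataFrame(
    {"age": [29.0, 2.0, 30.0], "fare": [211.3, 151.5, 151.5]},
    index=[10, 20, 30],
)

first_by_position = toy.iloc[0]  # row at integer position 0
first_by_label = toy.loc[10]     # row whose index label is 10
print(first_by_position.equals(first_by_label))

# There is no row labelled 0, so a label lookup for 0 fails
try:
    toy.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False
print(label_zero_exists)
```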
# Rows with at least one NaN value
df[pd.isnull(df).any(axis=1)].head()
# Drop rows with NaN values
df = df.dropna() # removes rows with any NaN values
df = df.reset_index(drop=True) # renumbers the rows from 0; drop=True avoids keeping the old index as a new column
df.head()
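The effect of dropping rows and then resetting the index is easier to see on a tiny example. This is a sketch with made-up values; it passes `drop=True` to `reset_index` so the old index is discarded rather than kept as an extra column.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with missing entries in two different rows
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 40.0],
    "cabin": ["C85", "E46", None, "B20"],
})

has_nan = toy.isnull().any(axis=1)          # True for rows with at least one NaN
clean = toy.dropna()                        # surviving rows keep their old labels
renumbered = clean.reset_index(drop=True)   # labels restart at 0

print(clean.index.tolist())
print(renumbered.index.tolist())
```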
# Dropping multiple columns
df = df.drop(["name", "cabin", "ticket"], axis=1) # we won't use text features for our initial basic models
df.head()
# Map feature values
df["sex"] = df["sex"].map({"female": 0, "male": 1}).astype(int)
df["embarked"] = df["embarked"].map({"S": 0, "C": 1, "Q": 2}).astype(int) # rows with NaN were already dropped above
df.head()
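A caveat when mapping values with a dictionary: any value that is not a key in the dictionary becomes NaN, and `astype(int)` then fails. The sketch below illustrates this with a made-up Series; the value "X" is a hypothetical unexpected code, not one that appears in the Titanic data.

```python
import pandas as pd

# Toy Series; "X" is a hypothetical value missing from the mapping
ports = pd.Series(["S", "C", "Q", "X"])
codes = ports.map({"S": 0, "C": 1, "Q": 2})
print(codes)  # the unmapped entry is NaN, so the dtype is float

# Converting a Series that contains NaN to int raises an error
try:
    codes.astype(int)
    cast_ok = True
except ValueError:
    cast_ok = False
print(cast_ok)
```

Checking `codes.isna().any()` before casting is one way to catch unmapped values early.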
# Lambda expressions to create new features
def get_family_size(sibsp, parch):
    family_size = sibsp + parch
    return family_size
df["family_size"] = df[["sibsp", "parch"]].apply(lambda x: get_family_size(x["sibsp"], x["parch"]), axis=1)
df.head()
# Same result without apply: add the two columns directly (vectorized)
df["family"] = df["sibsp"] + df["parch"]
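To confirm that the row-wise `apply` and the direct column addition agree, here is a small sketch on made-up counts (hypothetical values). For simple arithmetic like this, adding the columns directly is usually the shorter and faster option.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"sibsp": [1, 0, 3], "parch": [0, 2, 1]})

# Row-wise: the lambda is called once per row
via_apply = toy.apply(lambda row: row["sibsp"] + row["parch"], axis=1)

# Column-wise: one vectorized addition over whole columns
via_columns = toy["sibsp"] + toy["parch"]

print((via_apply == via_columns).all())
```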
# Reorder the columns (any column not listed here is dropped)
df = df[['pclass', 'sex', 'age', 'sibsp', 'parch', 'family_size', 'fare', 'embarked', 'survived']]
df.head()
# Saving dataframe to CSV
df.to_csv("processed_titanic.csv", index=False)
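The `index=False` argument keeps the row index out of the file. The sketch below shows this on a made-up DataFrame (hypothetical values), writing to an in-memory buffer instead of a file so it leaves nothing on disk, then reading the text back.

```python
import io

import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({"pclass": [1, 3], "fare": [211.5, 7.25]})

# Write CSV text to an in-memory buffer, without the row index
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Reading it back gives the same columns, with no extra index column
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.columns.tolist())
```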
# See your saved file
!ls -l