Titanic
This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.
Titanic¶
The goal of this activity is to predict which passengers survived the Titanic shipwreck. It uses the well-known Kaggle Titanic dataset, a staple of ML challenges.
Here is a description of this dataset:
Variable | Definition | Key |
---|---|---|
PassengerId | Passenger ID | |
Survived | Survival | 0 = No, 1 = Yes |
Pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
Name | Last and first names | |
Sex | Sex | |
Age | Age in years | |
SibSp | # of siblings / spouses aboard the Titanic | |
Parch | # of parents / children aboard the Titanic | |
Ticket | Ticket number | |
Fare | Passenger fare | |
Cabin | Cabin number | |
Embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Environment setup¶
Question¶
Import the necessary packages.
# YOUR CODE HERE
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# YOUR CODE HERE
Data loading and analysis¶
Question¶
Use pandas to import the dataset as CSV data from URL https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv.
Print dataset shape.
# YOUR CODE HERE
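One possible way to approach the loading step is sketched below. The helper name `load_titanic` and the tiny in-memory CSV are illustrative assumptions, not part of the activity; the in-memory sample only stands in for the real file so that the snippet runs without network access.

```python
import io

import pandas as pd

# URL given in the activity text
DATA_URL = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"


def load_titanic(source=DATA_URL):
    """Read Titanic-style CSV data from a URL, a path, or a file-like object."""
    # Hypothetical helper: pd.read_csv accepts all three kinds of source
    return pd.read_csv(source)


# Offline demonstration on a made-up, three-column sample
sample_csv = io.StringIO("PassengerId,Survived,Pclass\n1,0,3\n2,1,1\n")
df_sample = load_titanic(sample_csv)
print(df_sample.shape)  # (2, 3)
```

Calling `load_titanic()` with no argument would fetch the real dataset from the URL above; the `shape` attribute then gives the (rows, columns) pair the question asks for.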
Data preprocessing¶
Question¶
Remove the columns that seem uninformative for Machine Learning from the dataset.
Hint: there are 4 of them.
# YOUR CODE HERE
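As a rough sketch of how columns can be removed, the snippet below uses `DataFrame.drop` on a toy DataFrame with made-up values. The choice of `PassengerId`, `Name`, `Ticket` and `Cabin` as the four columns is an assumption (identifiers and mostly-unique or mostly-missing text fields); the activity does not name them.

```python
import pandas as pd

# Toy data with made-up values, standing in for the real dataset
df_demo = pd.DataFrame(
    {
        "PassengerId": [1, 2],
        "Survived": [0, 1],
        "Name": ["Passenger A", "Passenger B"],
        "Ticket": ["T-001", "T-002"],
        "Cabin": [None, "X1"],
        "Fare": [10.0, 50.0],
    }
)

# Assumed list of uninformative columns (not confirmed by the activity)
dropped_columns = ["PassengerId", "Name", "Ticket", "Cabin"]
df_demo = df_demo.drop(columns=dropped_columns)

print(list(df_demo.columns))  # ['Survived', 'Fare']
```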
Question¶
The Age feature should be very interesting for predicting survival. However, several values are missing.
Use the pandas fillna() function to replace all NaN values with -1 for the Age feature.
# YOUR CODE HERE
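The behaviour of `fillna()` can be checked on a small example first. This is a minimal sketch on a made-up `Age` column, not the activity's dataset.

```python
import numpy as np
import pandas as pd

# Made-up ages, two of them missing
ages = pd.DataFrame({"Age": [22.0, np.nan, 4.0, np.nan]})

# Replace every NaN in the Age column with the sentinel value -1
ages["Age"] = ages["Age"].fillna(-1)

print(ages["Age"].tolist())  # [22.0, -1.0, 4.0, -1.0]
```

Using -1 as a sentinel keeps missing ages distinguishable from real ones, which are always positive.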
Question¶
Use the pandas cut() function to segment the Age feature into categories, according to the provided labels and intervals.
age_labels = ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
age_intervals = [-2, 0, 12, 18, 35, 60, 100]
# YOUR CODE HERE
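To see how `cut()` maps values to the provided labels, here is a small sketch on a made-up Series with one age per interval. By default the bins are closed on the right, so the intervals are (-2, 0], (0, 12], (12, 18], and so on; the sentinel -1 therefore falls into 'Missing'.

```python
import pandas as pd

age_labels = ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
age_intervals = [-2, 0, 12, 18, 35, 60, 100]

# Made-up ages, one per interval (-1 is the sentinel for a missing age)
ages = pd.Series([-1, 5, 15, 30, 50, 70], name="Age")

# Seven bin edges define six intervals, matched one-to-one with the six labels
age_categories = pd.cut(ages, bins=age_intervals, labels=age_labels)

print(age_categories.tolist())
# ['Missing', 'Child', 'Teenager', 'Young adult', 'Adult', 'Senior']
```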
Question¶
Apply the following function to one-hot encode categorical features “Age”, “Sex”, “Embarked”, “SibSp” and “Pclass”.
def apply_dummies(df, column_name):
    # One-hot encode the column into a new DataFrame
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    # Concatenate the original DataFrame with the new columns
    df = pd.concat([df, dummies_features], axis=1)
    # Drop the original column
    df = df.drop(columns=[column_name])
    return df
# YOUR CODE HERE
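The function can be applied in a loop, one column at a time. The sketch below repeats the function so that it runs on its own, and uses a toy DataFrame with made-up values and only two of the five categorical columns.

```python
import pandas as pd


def apply_dummies(df, column_name):
    # Same logic as the function provided in the activity
    dummies_features = pd.get_dummies(df[column_name], prefix=column_name)
    df = pd.concat([df, dummies_features], axis=1)
    return df.drop(columns=[column_name])


# Toy data with made-up values
df_demo = pd.DataFrame(
    {
        "Sex": ["male", "female", "female"],
        "Pclass": [3, 1, 3],
        "Fare": [10.0, 50.0, 12.5],
    }
)

# Each call replaces one categorical column with its one-hot columns
for column in ["Sex", "Pclass"]:
    df_demo = apply_dummies(df_demo, column)

print(list(df_demo.columns))
# ['Fare', 'Sex_female', 'Sex_male', 'Pclass_1', 'Pclass_3']
```

Only categories that actually occur get a column, which is why the toy example has no `Pclass_2`.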
Model training¶
Question¶
Split the dataset into training and test sets, holding out 20% of the samples for testing. Print the shapes of all sets.
# YOUR CODE HERE
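A minimal sketch of a 20% hold-out split with scikit-learn's `train_test_split` follows. The arrays are synthetic placeholders, and the `random_state` value is an arbitrary choice made here for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholders: 10 samples with 2 features each, and binary targets
x = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 holds out 20% of the samples for testing
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42
)

print(x_train.shape, x_test.shape, y_train.shape, y_test.shape)
# (8, 2) (2, 2) (8,) (2,)
```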
Question¶
Train several Machine Learning models:
a Logistic Regression classifier;
a Decision Tree;
a MultiLayer Perceptron.
# YOUR CODE HERE
# YOUR CODE HERE
# YOUR CODE HERE
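The three model families listed above all share scikit-learn's `fit`/`score` interface, so they can be trained in the same way. The sketch below uses synthetic data from `make_classification` rather than the Titanic features, and every hyperparameter shown (`max_iter`, `hidden_layer_sizes`, `random_state`) is an illustrative assumption, not a recommended setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data, standing in for the preprocessed dataset
x_train, y_train = make_classification(n_samples=200, n_features=8, random_state=0)

# One instance per model family; hyperparameters are illustrative only
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "MultiLayer Perceptron": MLPClassifier(
        hidden_layer_sizes=(16,), max_iter=1000, random_state=0
    ),
}

for name, model in models.items():
    model.fit(x_train, y_train)
    # score() returns the mean accuracy on the given data
    print(f"{name}: training accuracy = {model.score(x_train, y_train):.2f}")
```

Training accuracy alone says little about generalization, which is why the activity evaluates on a separate test set.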
Models evaluation¶
Question¶
Plot the confusion matrix for each of your models.
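# Plot the confusion matrix for a model and a dataset
# plot_confusion_matrix was removed in scikit-learn 1.2;
# ConfusionMatrixDisplay.from_estimator is its replacement
from sklearn.metrics import ConfusionMatrixDisplay

def plot_conf_mat(model, x, y):
    with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
        display = ConfusionMatrixDisplay.from_estimator(
            model, x, y, values_format="d", cmap=plt.cm.Blues
        )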
# YOUR CODE HERE
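To make the layout of a confusion matrix concrete, here is a small sketch that computes one numerically with `confusion_matrix` on made-up labels and predictions. Rows correspond to true classes and columns to predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up ground truth and predictions for six samples
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1]
#  [1 2]]

# For binary labels, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```

In the Titanic setting, a false negative would be a survivor predicted as not having survived.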