Heart disease
This is a self-correcting activity generated by nbgrader. Fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`. Run subsequent cells to check your code.
Heart disease¶
In this activity, you’ll use a small dataset provided by the Cleveland Clinic Foundation for Heart Disease.
Each row describes a patient, and each column describes an attribute. You will use this information to predict whether a patient has heart disease.
Below is a description of each column.
Column | Description | Feature Type | Data Type |
---|---|---|---|
Age | Age in years | Numerical | integer |
Sex | (1 = male; 0 = female) | Categorical | integer |
CP | Chest pain type (0, 1, 2, 3, 4) | Categorical | integer |
Trestbpd | Resting blood pressure (in mm Hg on admission to the hospital) | Numerical | integer |
Chol | Serum cholesterol in mg/dl | Numerical | integer |
FBS | (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) | Categorical | integer |
RestECG | Resting electrocardiographic results (0, 1, 2) | Categorical | integer |
Thalach | Maximum heart rate achieved | Numerical | integer |
Exang | Exercise induced angina (1 = yes; 0 = no) | Categorical | integer |
Oldpeak | ST depression induced by exercise relative to rest | Numerical | float |
Slope | The slope of the peak exercise ST segment | Numerical | integer |
CA | Number of major vessels (0-3) colored by fluoroscopy | Numerical | integer |
Thal | 3 = normal; 6 = fixed defect; 7 = reversible defect | Categorical | string |
Target | Diagnosis of heart disease (1 = true; 0 = false) | Classification | integer |
Environment setup¶
import platform
print(f"Python version: {platform.python_version()}")
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams["figure.figsize"] = 10, 8
%config InlineBackend.figure_format = "retina"
sns.set()
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
# You may add other imports here as needed
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    classification_report,
    RocCurveDisplay,
)
Step 1: loading the data¶
Question¶
Load the dataset into a pandas DataFrame named `df_heart`.
csv_url = "https://raw.githubusercontent.com/bpesquet/mlkatas/master/_datasets/heart.csv"
# YOUR CODE HERE
print(f"df_heart: {df_heart.shape}")
assert df_heart.shape == (301, 14)
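As a hedged illustration of the loading step: `pd.read_csv` accepts a URL, a local path, or a file-like object. The sketch below reads a tiny CSV from memory so that it runs offline. The rows in `csv_text` and the name `df_demo` are invented for this example and are not taken from `heart.csv`.

```python
import io

import pandas as pd

# Invented sample rows (assumption): a header line followed by three records
csv_text = """age,sex,thal,target
63,1,fixed,0
67,1,normal,1
41,0,reversible,0
"""

# read_csv takes a file-like object here; passing a URL string works the same way
df_demo = pd.read_csv(io.StringIO(csv_text))
print(f"df_demo: {df_demo.shape}")
```

The same call with `csv_url` as its argument should return the full dataset, provided the URL is reachable.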
Step 2: prepare the data¶
Question¶
Use the following cells to explore the data.
# Print info about the dataset
# YOUR CODE HERE
# Print the first 10 data samples
# YOUR CODE HERE
# Print descriptive statistics for all numerical attributes
# YOUR CODE HERE
# Print distribution of target values
# YOUR CODE HERE
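The four exploration steps above map onto standard pandas calls. The following is a minimal sketch on an invented toy DataFrame (`df_demo` and its values are assumptions for illustration), not the activity's reference solution.

```python
import pandas as pd

# Invented toy data (assumption), standing in for a real dataset
df_demo = pd.DataFrame(
    {
        "age": [63, 67, 41, 56, 52],
        "thal": ["fixed", "normal", "reversible", "normal", "normal"],
        "target": [0, 1, 0, 0, 1],
    }
)

df_demo.info()  # column names, dtypes and non-null counts
print(df_demo.head(3))  # first 3 rows
stats_demo = df_demo.describe()  # summary statistics, numerical columns only
print(stats_demo)
target_counts = df_demo["target"].value_counts()  # number of samples per class
print(target_counts)
```

By default `describe()` skips the string column, which is why only `age` and `target` appear in its output.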
Question¶
Use the following cells to prepare data for training:
Split data between training and test sets, keeping 20% of the samples for the test set.
Store inputs and labels in the `x_train` and `y_train` variables.
Preprocess training input data as needed.
# Split dataset between training and test
# YOUR CODE HERE
print(f"Training dataset: {df_train.shape}")
print(f"Test dataset: {df_test.shape}")
assert df_train.shape == (240, 14)
assert df_test.shape == (61, 14)
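One way to obtain an 80/20 split is `train_test_split` with `test_size=0.2`. The sketch below uses a synthetic 301-row frame (random values, only the row count matters); the names `df_demo`, `df_demo_train` and `df_demo_test` and the `random_state` value are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with 301 rows (assumption: the values themselves are random)
rng = np.random.default_rng(0)
df_demo = pd.DataFrame(
    {"age": rng.integers(30, 80, size=301), "target": rng.integers(0, 2, size=301)}
)

# test_size=0.2 keeps 20% of the rows for testing;
# random_state makes the shuffled split repeatable
df_demo_train, df_demo_test = train_test_split(
    df_demo, test_size=0.2, random_state=42
)
print(df_demo_train.shape, df_demo_test.shape)
```

scikit-learn rounds the test count up, so 301 rows yield 61 test rows and 240 training rows.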
# Split training dataset between inputs and target
# YOUR CODE HERE
print(f"Training data: {df_x_train.shape}")
print(f"Training labels: {y_train.shape}")
assert df_x_train.shape == (240, 13)
assert y_train.shape == (240,)
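Separating inputs from labels usually comes down to dropping the label column for the inputs and selecting it for the target. A minimal sketch on invented data follows; the names `df_x_demo` and `y_demo` are hypothetical.

```python
import pandas as pd

# Invented training frame (assumption) with one label column named "target"
df_demo_train = pd.DataFrame(
    {
        "age": [63, 67, 41],
        "thal": ["fixed", "normal", "reversible"],
        "target": [0, 1, 0],
    }
)

# drop() returns a new frame without the label column; the original is untouched
df_x_demo = df_demo_train.drop(columns=["target"])
# Selecting a single column gives a 1-D Series of labels
y_demo = df_demo_train["target"]
print(df_x_demo.shape, y_demo.shape)
```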
# Print numerical and categorical features
num_features = df_x_train.select_dtypes(include=[np.number]).columns
print(num_features)
cat_features = df_x_train.select_dtypes(include=[object]).columns
print(cat_features)
# Print distribution for the "thal" feature
# YOUR CODE HERE
# Preprocess data to have similar scales and only numerical values
# YOUR CODE HERE
# Print preprocessed data shape and first sample
print(f"x_train: {x_train.shape}")
print(x_train[0])
assert x_train.shape == (240, 15)
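A common way to get similar scales and purely numerical values is a `ColumnTransformer` that standardizes numerical columns and one-hot encodes categorical ones. The sketch below is one possible approach on invented data, not necessarily the preprocessing the activity expects; `df_x_demo`, `num_cols`, `cat_cols` and `preprocessor` are hypothetical names.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented inputs (assumption): two numerical columns and one string column
df_x_demo = pd.DataFrame(
    {
        "age": [63, 67, 41, 56, 52, 48],
        "chol": [233, 286, 204, 236, 199, 275],
        "thal": ["fixed", "normal", "reversible", "normal", "normal", "fixed"],
    }
)

num_cols = list(df_x_demo.select_dtypes(include=[np.number]).columns)
cat_cols = list(df_x_demo.select_dtypes(exclude=[np.number]).columns)

preprocessor = ColumnTransformer(
    [
        ("num", StandardScaler(), num_cols),  # zero mean, unit variance
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one column per category
    ],
    sparse_threshold=0,  # always return a dense NumPy array
)
x_demo = preprocessor.fit_transform(df_x_demo)
print(f"x_demo: {x_demo.shape}")
print(x_demo[0])
```

Here the output has 5 columns: 2 scaled numerical features plus 3 one-hot columns for the three distinct `thal` values. The fitted `preprocessor` can later be applied to test data with `transform`, so the test set is scaled with statistics learned from the training set only.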
Step 3: train and evaluate a model¶
Question¶
Use the following cells to:
Train an SGD classifier on the training data.
Evaluate its accuracy using K-fold cross-validation.
Compute the precision, recall and f1-score metrics.
Plot its confusion matrix and ROC curve.
# Fit an SGD classifier to the training set
# YOUR CODE HERE
# Use cross-validation to evaluate accuracy, using 3 folds
# Store the result in the cv_acc variable
# YOUR CODE HERE
print(f"CV accuracy: {cv_acc}")
assert np.mean(cv_acc) > 0.70
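To illustrate training and K-fold evaluation, here is a minimal sketch on synthetic data generated with `make_classification`. The data, the `_demo` names and the `random_state` values are assumptions, so the scores say nothing about the heart disease dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification data (assumption), already numerical
x_demo, y_demo = make_classification(n_samples=240, n_features=15, random_state=42)

sgd_demo = SGDClassifier(random_state=42)
sgd_demo.fit(x_demo, y_demo)

# cv=3 splits the data into 3 folds; each fold is scored once by a model
# trained on the other two. cross_val_score clones the estimator, so the
# fit above is not reused.
cv_acc_demo = cross_val_score(sgd_demo, x_demo, y_demo, cv=3, scoring="accuracy")
print(f"CV accuracy: {cv_acc_demo}, mean: {cv_acc_demo.mean():.2f}")
```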
# Plot the confusion matrix for a model and a dataset
def plot_conf_mat(model, x, y):
    with sns.axes_style("white"):  # Temporarily hide Seaborn grid lines
        display = ConfusionMatrixDisplay.from_estimator(
            model, x, y, values_format="d", cmap=plt.cm.Blues
        )
# Plot confusion matrix for the SGD classifier
# YOUR CODE HERE
# Compute precision, recall and f1-score for the SGD classifier
# YOUR CODE HERE
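Precision and recall can be read directly off the confusion matrix, which may help when interpreting the plot. The sketch below works on synthetic data (an assumption, as are all `_demo` names) and prints the raw counts next to the report.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Synthetic data and model (assumption), evaluated on the training samples
x_demo, y_demo = make_classification(n_samples=240, n_features=15, random_state=42)
sgd_demo = SGDClassifier(random_state=42).fit(x_demo, y_demo)
y_pred_demo = sgd_demo.predict(x_demo)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_demo, y_pred_demo)
tn, fp, fn, tp = cm_demo.ravel()
print(cm_demo)

# For the positive class: precision = tp / (tp + fp), recall = tp / (tp + fn)
precision_demo = precision_score(y_demo, y_pred_demo)
recall_demo = recall_score(y_demo, y_pred_demo)
print(classification_report(y_demo, y_pred_demo))
```

Scores computed on the data used for training tend to be optimistic; predictions from cross-validation or a held-out set give a fairer picture.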
# Plot ROC curve for the SGD classifier
# YOUR CODE HERE
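A ROC curve needs a continuous score per sample rather than a hard prediction. The sketch below, on synthetic data (an assumption, as are the `_demo` names), computes the curve's points and the area under it numerically; `RocCurveDisplay.from_estimator` draws a curve of this kind.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic data and model (assumption)
x_demo, y_demo = make_classification(n_samples=240, n_features=15, random_state=42)
sgd_demo = SGDClassifier(random_state=42).fit(x_demo, y_demo)

# With its default hinge loss, SGDClassifier has no predict_proba;
# decision_function returns signed distances that serve as scores
scores_demo = sgd_demo.decision_function(x_demo)

# Each threshold on the score yields one (false positive rate, true positive rate) point
fpr_demo, tpr_demo, thresholds_demo = roc_curve(y_demo, scores_demo)
auc_demo = roc_auc_score(y_demo, scores_demo)
print(f"AUC: {auc_demo:.3f}")
```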
Bonus¶
Train another classifier and plot confusion matrices and ROC curves for both.
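The activity does not name the second classifier. As one possible starting point, the sketch below compares an SGD classifier with a random forest (an arbitrary choice made for this example) on synthetic data, using the same 3-fold cross-validation for both; all `_demo` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data (assumption)
x_demo, y_demo = make_classification(n_samples=240, n_features=15, random_state=42)

# Candidate models; the random forest is only an example of "another classifier"
models_demo = {
    "SGD": SGDClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

# Evaluate every model with the same folds and metric for a fair comparison
mean_acc_demo = {}
for name, model in models_demo.items():
    mean_acc_demo[name] = cross_val_score(model, x_demo, y_demo, cv=3).mean()
    print(f"{name}: {mean_acc_demo[name]:.2f}")
```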