Breast cancer
This is a self-correcting activity generated by nbgrader. Fill in any place that says YOUR CODE HERE or YOUR ANSWER HERE. Run subsequent cells to check your code.
Breast cancer¶
In this activity, you’ll use a K-Nearest Neighbors classifier to help diagnose breast tumors.
The Breast Cancer dataset is used for multivariate binary classification between benign and malignant tumors. It contains 569 samples with 30 features each. The features were computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image.
Environment setup¶
# Import base packages
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
# Setup plots
%matplotlib inline
plt.rcParams['figure.figsize'] = 10, 8
%config InlineBackend.figure_format = 'retina'
sns.set()
# Import ML packages
import sklearn
print(f'scikit-learn version: {sklearn.__version__}')
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
Step 1: Loading the data¶
dataset = load_breast_cancer()
# Put data in a pandas DataFrame
df_breast_cancer = pd.DataFrame(dataset.data, columns=dataset.feature_names)
# Add target and class to DataFrame
df_breast_cancer['target'] = dataset.target
df_breast_cancer['class'] = dataset.target_names[dataset.target]
# Show 10 random samples
df_breast_cancer.sample(n=10)
Step 2: Preparing the data¶
Question¶
Compute the number of features in the dataset and store it in the num_features variable.
# YOUR CODE HERE
print(f'Number of features: {num_features}')
assert num_features == 30
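One possible way to approach this question, shown as a minimal sketch: the feature matrix returned by load_breast_cancer has one row per sample and one column per feature, so the second entry of its shape should be the feature count. The sketch reloads the dataset so that it can run on its own.

```python
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this sketch is self-contained
dataset = load_breast_cancer()

# dataset.data has shape (n_samples, n_features); the second axis holds the features
num_samples, num_features = dataset.data.shape

print(f'Number of samples: {num_samples}')
print(f'Number of features: {num_features}')
```

Counting the columns of df_breast_cancer instead would also include the added target and class columns, so the raw feature matrix (or dataset.feature_names) is probably the safer source.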
Question¶
To evaluate the class distribution, compute the number of benign and malignant tumors and store them in the num_benign and num_malignant variables respectively.
# YOUR CODE HERE
print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')
assert num_benign == 357
assert num_malignant == 212
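A sketch of one way to count the classes, assuming the label encoding is read from dataset.target_names rather than hard-coded (in scikit-learn's version of this dataset, 0 is malignant and 1 is benign). The class_counts dictionary is a helper introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Reload the dataset so this sketch is self-contained
dataset = load_breast_cancer()

# Count how many samples carry each integer label (index = label value)
counts = np.bincount(dataset.target)

# Map class names to counts instead of assuming which label means what
class_counts = {str(name): int(count)
                for name, count in zip(dataset.target_names, counts)}

num_benign = class_counts['benign']
num_malignant = class_counts['malignant']

print(f'Benign count: {num_benign}. Malignant count: {num_malignant}')
```

The same counts could also be obtained from the DataFrame with df_breast_cancer['class'].value_counts().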
# Store input and labels
x = dataset.data
y = dataset.target
print(f'x: {x.shape}. y: {y.shape}')
Question¶
Split the dataset into training and test sets, keeping 25% of the samples for the test set. Use variables x_train, y_train, x_test and y_test.
# YOUR CODE HERE
print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')
assert x_train.shape == (426, 30)
assert y_train.shape == (426, )
assert x_test.shape == (143, 30)
assert y_test.shape == (143,)
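The split can be sketched with train_test_split, as below. The random_state value is an arbitrary assumption added for reproducibility and is not part of the original activity; any seed (or none) gives the same shapes.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Reload the inputs and labels so this sketch is self-contained
dataset = load_breast_cancer()
x = dataset.data
y = dataset.target

# Hold out 25% of the samples for testing
# random_state=42 is an arbitrary choice, used only to make the split repeatable
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

print(f'x_train: {x_train.shape}. y_train: {y_train.shape}')
print(f'x_test: {x_test.shape}. y_test: {y_test.shape}')
```

Because the classes are imbalanced, passing stratify=y may be worth considering so that both sets keep roughly the same benign/malignant ratio.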
Question¶
Scale features by standardization while preventing information leakage from the test set.
# YOUR CODE HERE
mean_train = x_train.mean()
std_train = x_train.std()
print(f'mean_train: {mean_train}. std_train: {std_train}')
assert np.abs(np.max(mean_train)) < 10**-6
assert np.abs(np.max(std_train - 1)) < 10**-6
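A minimal sketch of leakage-free standardization: the scaler learns its mean and standard deviation from the training set only, and those same statistics are then applied to the test set. The split parameters (random_state=42) are assumptions carried over for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Rebuild the train/test split so this sketch is self-contained
dataset = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=42
)

scaler = StandardScaler()

# Learn per-feature mean and standard deviation from the training set only
x_train = scaler.fit_transform(x_train)

# Reuse the training statistics on the test set; never call fit on test data
x_test = scaler.transform(x_test)

print(f'Train mean: {x_train.mean():.6f}. Train std: {x_train.std():.6f}')
print(f'Test mean: {x_test.mean():.6f}. Test std: {x_test.std():.6f}')
```

The test set statistics will generally not be exactly 0 and 1, which is expected: they reflect how the unseen data sits relative to the training distribution.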
Step 3: Creating a classifier¶
Question¶
Create a KNeighborsClassifier instance using only one nearest neighbor, store it in the model variable, and fit it to the training data.
# YOUR CODE HERE
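One way this step might look, as a self-contained sketch. The data preparation repeats the earlier steps under the same assumptions (a 25% test split with an arbitrary random_state=42, and standardization fitted on the training set).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the prepared data so this sketch is self-contained
dataset = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

# Each prediction copies the label of the single closest training sample
model = KNeighborsClassifier(n_neighbors=1)

# For k-NN, fitting mostly amounts to storing (and indexing) the training data
model.fit(x_train, y_train)

print(f'Predictions for the first 5 test samples: {model.predict(x_test[:5])}')
```

With a single neighbor, each training sample is its own nearest neighbor, so a very high training accuracy says little about how well the model generalizes.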
Step 4: Evaluating the classifier¶
# Compute accuracy on training and test sets
train_acc = model.score(x_train, y_train)
test_acc = model.score(x_test, y_test)
print(f'Training accuracy: {train_acc * 100:.2f}%')
print(f'Test accuracy: {test_acc * 100:.2f}%')
Question¶
Display precision, recall and f1-score for the classifier on test data. Interpret the results.
# YOUR CODE HERE
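The metrics can be displayed with classification_report, sketched below under the same assumptions as the previous steps (25% test split, arbitrary random_state=42, one nearest neighbor). In scikit-learn's encoding of this dataset, label 0 is malignant, so the recall on the malignant row is the fraction of malignant tumors the classifier actually detects, which is arguably the most important number in a diagnostic setting.

```python
from sklearn import metrics
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Rebuild the prepared data and the model so this sketch is self-contained
dataset = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=42
)
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
model = KNeighborsClassifier(n_neighbors=1).fit(x_train, y_train)

# Predict on the held-out test set
y_pred = model.predict(x_test)

# Per-class precision, recall and f1-score, labelled with the class names
report = metrics.classification_report(
    y_test, y_pred, target_names=dataset.target_names
)
print(report)

# Recall for the malignant class (label 0): share of malignant tumors detected
malignant_recall = metrics.recall_score(y_test, y_pred, pos_label=0)
print(f'Malignant recall: {malignant_recall * 100:.2f}%')
```

A confusion matrix (metrics.confusion_matrix) can complement the report by showing the raw counts of missed malignant tumors versus false alarms.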
Question¶
Go back to step 3 and try to find the best value of k, the number of nearest neighbors.
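Rather than tuning k by hand against the test set, one option is to compare candidate values with cross-validation on the training set. The sketch below is only one possible approach: the candidate range (odd values from 1 to 21), the 5-fold setting, and random_state=42 are all assumptions, not part of the original activity. Scaling is wrapped in a pipeline so that it is refitted inside each fold.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Split the raw data; scaling happens inside the pipeline
dataset = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    dataset.data, dataset.target, test_size=0.25, random_state=42
)

# Hypothetical search range: odd values of k avoid ties in binary voting
candidate_ks = range(1, 22, 2)

cv_scores = {}
for k in candidate_ks:
    pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    # Mean accuracy over 5 folds of the training set only
    cv_scores[k] = cross_val_score(pipeline, x_train, y_train, cv=5).mean()

# Keep the k with the highest mean cross-validated accuracy
best_k = max(cv_scores, key=cv_scores.get)
print(f'Best k: {best_k} (CV accuracy: {cv_scores[best_k] * 100:.2f}%)')

# Refit on the full training set and evaluate once on the test set
best_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
best_model.fit(x_train, y_train)
test_acc = best_model.score(x_test, y_test)
print(f'Test accuracy: {test_acc * 100:.2f}%')
```

Selecting k on the test set directly would make the reported test accuracy optimistic, which is why the test set is used only once at the end here.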