# Compare Supervised Learning Models

In this notebook we use 6 toy datasets (3 for regression and 3 for classification) to compare the different algorithms and their hyperparameter settings.

Run the cells below until the datasets are plotted. Then, after each chapter that introduces a type of model, come back to this notebook to test that model on the datasets and experiment with its hyperparameter settings.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_moons
# suppress unnecessary warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [None]:
# You do not need to understand what happens in these functions,
# just execute the cell so you can use the functions below

n_train_reg = 100
n_train_clf = 300

def plot_regression(X, y, model=None):
    # plot a regression dataset (and model predictions)
    plt.figure()
    plt.scatter(X[:, 0], y, s=10, c='#3090C7', alpha=0.7, label='data samples')
    if model is not None:
        X_plot = np.linspace(np.min(X), np.max(X), 1000)
        plt.plot(X_plot, model.predict(X_plot[:, np.newaxis]), '#15317E', linewidth=1., alpha=0.9, label='prediction')
        plt.legend()
    plt.xlabel('x (feature)')
    plt.ylabel('y (target)')
    plt.title('Regression Problem')
    
def plot_classification(X, Y, model=None):
    # plot a classification dataset (and model predictions)
    plt.figure()
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 250),
                         np.linspace(y_min, y_max, 250))
    cm = plt.cm.RdBu
    cm_bright = ListedColormap(['#FF0000', '#0000FF'])
    if model is not None:
        try:
            Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])
            alpha = 0.8
        except AttributeError:
            # models without a decision_function (e.g., decision trees, random forests, kNN)
            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
            alpha = 0.4
        # Put the result into a color plot
        Z = Z.reshape(xx.shape)
        plt.contourf(xx, yy, Z, cmap=cm, alpha=alpha)
    # Plot the training points
    plt.scatter(X[:, 0], X[:, 1], s=20, c=Y, cmap=cm_bright, label="data samples")
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.xlabel("feature 1")
    plt.ylabel("feature 2")
    plt.title("Classification Problem")
    plt.colorbar()

def get_linear_regression():
    # generate noisy linear regression dataset
    np.random.seed(15)
    X = np.random.rand(n_train_reg, 1)
    y = -2.5 + 5*X
    y += np.random.randn(n_train_reg, 1) * 0.4
    return X, y.flatten()

def get_linear_outlier():
    # generate linear regression dataset with outliers
    np.random.seed(15)
    X = np.random.rand(n_train_reg, 1)
    y = -2.5 + 5*X
    y += np.random.randn(n_train_reg, 1) * 0.05
    y[(X>0.7) & (X<0.73)] = 10
    return X, y.flatten()

def get_nonlinear_regression():
    # generate noisy non-linear regression dataset
    np.random.seed(15)
    X = np.random.rand(n_train_reg, 1) * np.pi * 2.
    y = np.sin(X)
    y += np.random.randn(n_train_reg, 1) * 0.2
    return X, y.flatten()

def get_linear_classification_1f():
    # generate classification dataset with 1 informative feature
    np.random.seed(15)
    mean = [0, 0]
    cov = [[1, 0], [0, 10]]
    X = np.zeros((n_train_clf, 2))
    X[:n_train_clf//2] = np.random.multivariate_normal(mean, cov, n_train_clf//2)
    mean = [5, 0]
    X[n_train_clf//2:] = np.random.multivariate_normal(mean, cov, n_train_clf//2)
    y = np.zeros(n_train_clf, dtype=int)
    y[n_train_clf//2:] = 1
    rndidx = np.random.permutation(len(y))
    return X[rndidx], y[rndidx]

def get_linear_classification_2f():
    # generate classification dataset with 2 informative features
    np.random.seed(15)
    mean = [0, 4]
    cov = np.array([[1, 8], [8, 10]])
    cov = np.dot(cov, cov.T)/10
    X = np.zeros((n_train_clf, 2))
    X[:n_train_clf//2] = np.random.multivariate_normal(mean, cov, n_train_clf//2)
    mean = [4, 0]
    X[n_train_clf//2:] = np.random.multivariate_normal(mean, cov, n_train_clf//2)
    y = np.zeros(n_train_clf, dtype=int)
    y[n_train_clf//2:] = 1
    rndidx = np.random.permutation(len(y))
    return X[rndidx], y[rndidx]

def get_nonlinear_classification():
    # generate non-linear classification dataset
    return make_moons(n_samples=n_train_clf, noise=0.3, random_state=1)

## Datasets

Here you can have a look at the 3 regression and 3 classification datasets on which we'll compare the different models. The regression datasets each have only one input feature, while the classification datasets each have two, and the target (i.e., class label) is indicated by the color of the dots.

**Questions:**
- Why are the first two datasets of each type (regression and classification) linear problems, while the last one of each type is non-linear?

In [None]:
# generate & plot regression datasets
X_reg_1, y_reg_1 = get_linear_regression()
X_reg_2, y_reg_2 = get_linear_outlier()
X_reg_3, y_reg_3 = get_nonlinear_regression()
plot_regression(X_reg_1, y_reg_1)
plot_regression(X_reg_2, y_reg_2)
plot_regression(X_reg_3, y_reg_3)

In [None]:
# generate & plot classification datasets
X_clf_1, y_clf_1 = get_linear_classification_1f()
X_clf_2, y_clf_2 = get_linear_classification_2f()
X_clf_3, y_clf_3 = get_nonlinear_classification()
plot_classification(X_clf_1, y_clf_1)
plot_classification(X_clf_2, y_clf_2)
plot_classification(X_clf_3, y_clf_3)

## Linear Models

After reading the chapter on linear models, test them here on different datasets (by changing the number at the end of the dataset variable, e.g., `X_reg_1` -> `X_reg_2`) and experiment with their hyperparameter settings (in the comments you'll find a description of the different hyperparameters and which values you can test for them).

**Questions:**
- Compare the linear regression and ridge regression models on the regression dataset with outliers (i.e., `X_reg_2, y_reg_2`): what do you observe?
- What happens when you increase the value for `alpha` for the ridge regression model? (first think about it, then confirm your guess by actually changing the parameter)
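
One way to confirm your guess about `alpha` numerically is to print the fitted slope for a few values. The snippet below is only a minimal sketch: it rebuilds a dataset like the first regression dataset inline so that it runs on its own, and the `alphas` list and the `*_demo` variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# rebuild a noisy linear dataset similar to get_linear_regression() (assumed setup)
rng = np.random.RandomState(15)
X_demo = rng.rand(100, 1)
y_demo = (-2.5 + 5 * X_demo + rng.randn(100, 1) * 0.4).flatten()

alphas = [0.01, 1.0, 100.0]  # illustrative values, not prescribed by the notebook
slopes = []
for alpha in alphas:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    slopes.append(ridge.coef_[0])
    print(f"alpha={alpha:g}: f(x) = {ridge.intercept_:.3f} + {ridge.coef_[0]:.3f} * x")
```

Only the slope is penalized here; the intercept is fitted without regularization, so compare how both numbers change as `alpha` grows.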

In [None]:
from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression

In [None]:
# Linear Regression
X, y = X_reg_1, y_reg_1  # change the numbers here to test the model on a different dataset
model = LinearRegression()
model.fit(X, y)
plot_regression(X, y, model)
print(f"f(x) = {model.intercept_:.3f} + {model.coef_[0]:.3f} * x")

In [None]:
# Ridge Regression:
# alpha (> 0): regularization (higher values = more regularization)
X, y = X_reg_1, y_reg_1
model = Ridge(alpha=1.)
model.fit(X, y)
plot_regression(X, y, model)
print(f"f(x) = {model.intercept_:.3f} + {model.coef_[0]:.3f} * x")

In [None]:
# Logistic Regression (for classification problems!):
# C (> 0): regularization (smaller values = more regularization)
# penalty: change to "l1" to get sparse weights (only useful if you have many features; needs a solver that supports it, e.g., solver="liblinear" or "saga")
X, y = X_clf_2, y_clf_2
model = LogisticRegression(penalty="l2", C=100.)
model.fit(X, y)
plot_classification(X, y, model)  # the shading shows the decision function (log-odds), which increases with the predicted probability of class 1
print(f"f(x) = sigmoid({model.intercept_[0]:.3f} + {model.coef_[0, 0]:.3f} * x_1 + {model.coef_[0, 1]:.3f} * x_2)")
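
The background shading in `plot_classification` comes from `decision_function` whenever a model provides one. For a binary logistic regression, that score is the linear term inside the sigmoid, so passing it through the sigmoid should reproduce the predicted probability of the positive class. The following is a minimal sketch that checks this on made-up data; the two Gaussian blobs and all variable names are assumptions, not part of the notebook's datasets.

```python
import numpy as np
from scipy.special import expit  # the logistic sigmoid
from sklearn.linear_model import LogisticRegression

# two Gaussian blobs as a stand-in for the classification datasets (assumed data)
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [3, 3]])
y_demo = np.array([0] * 50 + [1] * 50)

clf = LogisticRegression(C=100.).fit(X_demo, y_demo)
scores = clf.decision_function(X_demo)   # intercept + weighted sum of the features
proba = clf.predict_proba(X_demo)[:, 1]  # predicted probability of class 1

# check whether sigmoid(score) matches the predicted probability
print(np.allclose(expit(scores), proba))
```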

## Decision Trees

After reading the chapter on decision trees, test them here on different datasets and experiment with their hyperparameter settings.

**Questions:**
- On the 3rd regression dataset with `max_depth=2`, why do you get exactly 4 plateaus in the prediction?
- On the 3rd regression dataset, what happens if you leave `min_samples_leaf` at 10 and then increase `max_depth` step by step from 2 to 10 or even higher values? How do you explain this behavior, and what would you need to change to get a tree that fits the data in a more fine-grained way?
- Compare the prediction of the decision tree classifier on the 2nd dataset (which is basically a rotation of the 1st dataset, i.e., still a simple linear classification problem!) to the prediction made by the logistic regression model on this dataset: What do you observe and why?
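
To explore how `max_depth` and `min_samples_leaf` together limit the number of plateaus, you can count the leaves of the fitted tree instead of reading them off the plot. This is a sketch under assumptions: it rebuilds a dataset like the 3rd regression dataset inline, and the list of depths is chosen only for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# rebuild a noisy sine dataset similar to get_nonlinear_regression() (assumed setup)
rng = np.random.RandomState(15)
X_demo = rng.rand(100, 1) * 2 * np.pi
y_demo = (np.sin(X_demo) + rng.randn(100, 1) * 0.2).flatten()

leaf_counts = {}
for depth in [1, 2, 4, 8, 16]:  # illustrative depths
    tree = DecisionTreeRegressor(max_depth=depth, min_samples_leaf=10).fit(X_demo, y_demo)
    leaf_counts[depth] = tree.get_n_leaves()
    # a binary tree of this depth can have at most 2**depth leaves
    print(f"max_depth={depth:2d}: {leaf_counts[depth]} leaves (at most {2**depth} from depth alone)")
```

Each leaf corresponds to one plateau in the prediction, so the printed counts show at which depth the other hyperparameter becomes the limiting factor.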

In [None]:
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier

In [None]:
# Decision Tree for regression:
# max_depth (>= 1): maximum depth of the tree (i.e. at most how many decisions are made before the final prediction)
# min_samples_leaf (>= 1): minimum number of training points in each leaf (i.e. prediction bucket)
X, y = X_reg_3, y_reg_3
model = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10)
model.fit(X, y)
plot_regression(X, y, model)

In [None]:
# Decision Tree for classification:
# max_depth (>= 1): maximum depth of the tree (i.e. at most how many decisions are made before the final prediction)
# min_samples_leaf (>= 1): minimum number of training points in each leaf (i.e. prediction bucket)
X, y = X_clf_1, y_clf_1
model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=10)
model.fit(X, y)
plot_classification(X, y, model)

## Ensemble Methods (Random Forest)

After reading the chapter on ensemble methods, test the random forest here on different datasets and experiment with the hyperparameter settings (same hyperparameters as the decision tree and the additional parameter `n_estimators` for the number of trees in the forest).

**Questions:**
- What do you observe when you compare a random forest with multiple estimators to a single decision tree with the same hyperparameter settings (especially for more specific trees, i.e., large `max_depth` and small `min_samples_leaf`)?
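
Besides comparing the plots, you can count how many distinct prediction values each model produces over a fine grid. The sketch below does this for a single tree and a forest with the same settings; the inline dataset, the grid, and the fixed `random_state=0` are assumptions added so the example runs on its own and is repeatable.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# rebuild a noisy sine dataset similar to get_nonlinear_regression() (assumed setup)
rng = np.random.RandomState(15)
X_demo = rng.rand(100, 1) * 2 * np.pi
y_demo = (np.sin(X_demo) + rng.randn(100, 1) * 0.2).flatten()
X_grid = np.linspace(0, 2 * np.pi, 1000)[:, np.newaxis]

params = dict(max_depth=2, min_samples_leaf=10)
tree = DecisionTreeRegressor(**params).fit(X_demo, y_demo)
forest = RandomForestRegressor(n_estimators=100, random_state=0, **params).fit(X_demo, y_demo)

# a single tree predicts one constant per leaf; the forest averages over all its trees
n_levels_tree = len(np.unique(tree.predict(X_grid)))
n_levels_forest = len(np.unique(forest.predict(X_grid)))
print(f"distinct prediction levels: tree={n_levels_tree}, forest={n_levels_forest}")
```

Try the same comparison with a large `max_depth` and a small `min_samples_leaf` to see how the picture changes for more specific trees.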

In [None]:
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

In [None]:
# Random Forest for regression:
# n_estimators (>= 1): how many decision trees to train (don't set this too high, gets computationally expensive)
# max_depth (>= 1): maximum depth of each tree (i.e. at most how many decisions are made before the final prediction)
# min_samples_leaf (>= 1): minimum number of training points in each leaf (i.e. prediction bucket)
X, y = X_reg_3, y_reg_3
model = RandomForestRegressor(n_estimators=100, max_depth=2, min_samples_leaf=10)
model.fit(X, y)
plot_regression(X, y, model)

In [None]:
# Random Forest for classification:
# n_estimators (>= 1): how many decision trees to train
# max_depth (>= 1): maximum depth of each tree (i.e. at most how many decisions are made before the final prediction)
# min_samples_leaf (>= 1): minimum number of training points in each leaf (i.e. prediction bucket)
X, y = X_clf_2, y_clf_2
model = RandomForestClassifier(n_estimators=100, max_depth=2, min_samples_leaf=10)
model.fit(X, y)
plot_classification(X, y, model)

## Similarity-based Models (kNN)

After reading the chapter on k-nearest neighbors, test the method here on different datasets and experiment with the hyperparameter settings.

**Questions:**
- On the 3rd regression dataset for a larger number of nearest neighbors (e.g., 20), what do you observe for the prediction at the edges of the input domain and why?
- Especially for binary classification problems, why does it make sense to always use an odd number of nearest neighbors?
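
For the first question, it can help to look at which training points a query near the edge actually uses. The following sketch (with an inline dataset like the 3rd regression dataset and two hypothetical query points left of all training data) compares the kNN prediction with the plain average over the 20 leftmost training targets.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# rebuild a noisy sine dataset similar to get_nonlinear_regression() (assumed setup)
rng = np.random.RandomState(15)
X_demo = rng.rand(100, 1) * 2 * np.pi
y_demo = (np.sin(X_demo) + rng.randn(100, 1) * 0.2).flatten()

knn = KNeighborsRegressor(n_neighbors=20).fit(X_demo, y_demo)

# two query points to the left of every training point (hypothetical queries)
x_left = np.array([[X_demo.min() - 1.0], [X_demo.min() - 0.1]])
pred_left = knn.predict(x_left)

# average target of the 20 training points with the smallest x values
mean_leftmost = y_demo[np.argsort(X_demo[:, 0])[:20]].mean()
print("kNN predictions:", pred_left)
print("mean of the 20 leftmost targets:", mean_leftmost)
```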

In [None]:
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier

In [None]:
# k-Nearest Neighbors for regression:
# n_neighbors (>= 1): how many nearest neighbors are used for the prediction
X, y = X_reg_3, y_reg_3
model = KNeighborsRegressor(n_neighbors=10)
model.fit(X, y)
plot_regression(X, y, model)

In [None]:
# k-Nearest Neighbors for classification:
# n_neighbors (>= 1): how many nearest neighbors are used for the prediction
X, y = X_clf_3, y_clf_3
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X, y)
plot_classification(X, y, model)

## Kernel Methods

After reading the chapter on kernel methods, test an SVM here on different datasets and experiment with the hyperparameter settings.

**Questions:**
- How do the values of the hyperparameters `gamma` and `C` interact? 
- What do you observe when you leave `gamma` at its default value `'scale'`?
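
To relate `'scale'` to the numeric values you try by hand, note that scikit-learn documents `gamma='scale'` as `1 / (n_features * X.var())`. The sketch below computes that value for a dataset like the 3rd regression dataset (rebuilt inline as an assumption) and compares the predictions of both variants on a hypothetical grid.

```python
import numpy as np
from sklearn.svm import SVR

# rebuild a noisy sine dataset similar to get_nonlinear_regression() (assumed setup)
rng = np.random.RandomState(15)
X_demo = rng.rand(100, 1) * 2 * np.pi
y_demo = (np.sin(X_demo) + rng.randn(100, 1) * 0.2).flatten()

# documented meaning of gamma='scale': 1 / (n_features * X.var())
gamma_scale = 1.0 / (X_demo.shape[1] * X_demo.var())
print(f"gamma='scale' corresponds to gamma = {gamma_scale:.3f} on this data")

X_grid = np.linspace(0, 2 * np.pi, 200)[:, np.newaxis]
pred_scale = SVR(kernel="rbf", gamma="scale", C=1.).fit(X_demo, y_demo).predict(X_grid)
pred_manual = SVR(kernel="rbf", gamma=gamma_scale, C=1.).fit(X_demo, y_demo).predict(X_grid)

# check whether both settings lead to the same predictions
print(np.allclose(pred_scale, pred_manual))
```

Comparing the printed value with the `gamma` settings in the cells below gives a sense of how far those are from the data-dependent default.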

In [None]:
from sklearn.svm import SVR, SVC

In [None]:
# Support Vector Regression:
# kernel: kernel function to compute similarities (default: "rbf")
# gamma (> 0): inverse width of the rbf kernel (larger values --> narrower kernel, more focused on individual points)
# C (> 0): regularization (smaller values = more regularization)
X, y = X_reg_3, y_reg_3
model = SVR(kernel='rbf', gamma=100., C=1.)
model.fit(X, y)
plot_regression(X, y, model)

In [None]:
# Support Vector Classification:
# kernel: kernel function to compute similarities (default: "rbf")
# gamma (> 0): inverse width of the rbf kernel (larger values --> narrower kernel, more focused on individual points)
# C (> 0): regularization (smaller values = more regularization)
X, y = X_clf_3, y_clf_3
model = SVC(kernel='rbf', gamma=.005, C=1.)
model.fit(X, y)
plot_classification(X, y, model)