exercise notebooks + data

2026-01-14 12:14:38 +01:00 · 2021-08-18 19:07:05 +02:00
parent c424a8e970
commit bfce988d21
31 changed files with 405502 additions and 3 deletions
--- a/README.md
+++ b/README.md
@@ -4,7 +4,8 @@ This repository contains the Python exercises accompanying the theory from my [m
 Also have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf), which includes a summary of the most important steps when developing a machine learning solution, incl. code snippets.
-If you're unfamiliar with Python, please have a look at [this tutorial](https://github.com/cod3licious/python_tutorial) first, which also includes some notes on what you need to install to work on the exercises on your computer. If you have a Google account, you can also run the code in the cloud using **Google Colab**:
+If you're unfamiliar with Python, please have a look at [this tutorial](https://github.com/cod3licious/python_tutorial) first, which also includes some notes on how to install Python and Jupyter Notebook on your own computer. Please make sure you're using Python 3 and that all libraries listed in the `requirements.txt` file are installed and up to date (you can also verify this with the `test_installation.ipynb` notebook).
 If you have a Google account, you can also run the code in the cloud using **Google Colab**:
 [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/cod3licious/ml_exercises)
 If you have any questions, please drop me a line at `hey[at]franziskahorn.de`.
@@ -12,6 +13,86 @@ If you have any questions, please drop me a line at `hey[at]franziskahorn.de`.
 Have fun!
-### Overview
+### Course Overview
 (You can also find the course syllabus on the last page of the course description.)
 #### Part 1:
 ##### Block 1.1:
 - [ ] Read the whole chapter: ["Introduction: Solving Problems with ML"](https://franziskahorn.de/mlbook/_introduction_solving_problems_with_ml.html)
 - [ ] Answer [Quiz 1](https://forms.gle/uzdzytpsYf9sFG946)
 ##### Block 1.2:
 - [ ] Read the whole chapter: ["ML with Python"](https://franziskahorn.de/mlbook/_ml_with_python.html)
 - [ ] Install Python on your computer and complete the [Python tutorial](https://github.com/cod3licious/python_tutorial)
 ##### Block 1.3:
 - [ ] Read the whole chapter: ["Data & Preprocessing"](https://franziskahorn.de/mlbook/_data_preprocessing.html)
 - [ ] Answer [Quiz 2](https://forms.gle/uzdzytpsYf9sFG946)
 - [ ] Read the first part of the chapter ["ML Algorithms: Unsupervised & Supervised Learning"](https://franziskahorn.de/mlbook/_ml_algorithms_unsupervised_supervised_learning.html)
 #### Part 2:
 ##### Block 2.1:
 - [ ] Read the section: ["UL: Dimensionality Reduction"](https://franziskahorn.de/mlbook/_ul_dimensionality_reduction.html)
 - [ ] Work through [Notebook 1: visualize text](/exercises/1_visualize_text.ipynb)
 ##### Block 2.2:
 - [ ] Read the section: ["UL: Outlier / Anomaly Detection"](https://franziskahorn.de/mlbook/_ul_outlier_anomaly_detection.html)
 - [ ] Read the section: ["UL: Clustering"](https://franziskahorn.de/mlbook/_ul_clustering.html)
 - [ ] Work through [Notebook 2: image quantization](/exercises/2_image_quantization.ipynb)
 ##### Block 2.3:
 - [ ] Read the section: ["Supervised Learning: Overview"](https://franziskahorn.de/mlbook/_supervised_learning_overview.html)
 - [ ] Answer [Quiz 3](https://forms.gle/uzdzytpsYf9sFG946)
 #### Part 3:
 ##### Block 3.1:
 - [ ] Read the sections: ["SL: Linear Models"](https://franziskahorn.de/mlbook/_sl_linear_models.html) - ["SL: Kernel Methods"](https://franziskahorn.de/mlbook/_sl_kernel_methods.html)
 - [ ] In parallel, work through the respective sections of [Notebook 3: supervised comparison](/exercises/3_supervised_comparison.ipynb)
 ##### Block 3.2:
 - [ ] Read the section: ["Information Retrieval (Similarity Search)"](https://franziskahorn.de/mlbook/_information_retrieval_similarity_search.html) and review the sections on [TF-IDF feature vectors](https://franziskahorn.de/mlbook/_feature_extraction.html) and [cosine similarity](https://franziskahorn.de/mlbook/_computing_similarities.html)
 - [ ] Work through [Notebook 4: information retrieval](/exercises/4_information_retrieval.ipynb)
 ##### Block 3.3:
 - [ ] Read the section: ["SL: Neural Networks"](https://franziskahorn.de/mlbook/_sl_neural_networks.html)
 - [ ] Work through [Notebook 5: MNIST with torch](/exercises/5_mnist_torch.ipynb) (recommended) or [MNIST with keras](/exercises/5_mnist_keras.ipynb) (in case others in your organization are already working with TensorFlow)
 - [ ] Read the sections: ["Time Series Forecasting"](https://franziskahorn.de/mlbook/_time_series_forecasting.html) and ["Recommender Systems (Pairwise Data)"](https://franziskahorn.de/mlbook/_recommender_systems_pairwise_data.html)
 #### Part 4:
 ##### Block 4.1:
 - [ ] Read the whole chapter: ["Avoiding Common Pitfalls"](https://franziskahorn.de/mlbook/_avoiding_common_pitfalls.html)
 - [ ] Answer [Quiz 4](/exercises/1_visualize_text.ipynb)
 ##### Block 4.2:
 - [ ] Work through [Notebook 6: analyze toy dataset](/exercises/6_analyze_toydata.ipynb)
 ##### Block 4.3:
 - [ ] _Case Study!_ [Notebook 7: predicting hard drive failures](/exercises/7_hard_drive_failures.ipynb) (plan at least 5 hours for this!)
 #### Part 5:
 ##### Block 5.1:
 - [ ] Read the whole chapter: ["ML Algorithms: Reinforcement Learning"](https://franziskahorn.de/mlbook/_ml_algorithms_reinforcement_learning.html)
 ##### Block 5.2:
 - [ ] Answer [Quiz 5](https://forms.gle/uzdzytpsYf9sFG946)
 - [ ] Read the whole chapter: ["Conclusion: Using ML in Practice"](https://franziskahorn.de/mlbook/_conclusion_using_ml_in_practice.html)
 - [ ] [Exercise: plan your next ML project](https://forms.gle/uzdzytpsYf9sFG946)
 Stay tuned - exercises will come in the next few days!
--- a/data/articles.json
+++ b/data/articles.json
--- a/data/articles_short.json
+++ b/data/articles_short.json
--- a/data/hdf_data/df_2016_0.csv
+++ b/data/hdf_data/df_2016_0.csv
--- a/data/hdf_data/df_2016_1.csv
+++ b/data/hdf_data/df_2016_1.csv
--- a/data/hdf_data/df_2016_10.csv
+++ b/data/hdf_data/df_2016_10.csv
--- a/data/hdf_data/df_2016_14.csv
+++ b/data/hdf_data/df_2016_14.csv
--- a/data/hdf_data/df_2016_2.csv
+++ b/data/hdf_data/df_2016_2.csv
--- a/data/hdf_data/df_2016_3.csv
+++ b/data/hdf_data/df_2016_3.csv
--- a/data/hdf_data/df_2016_5.csv
+++ b/data/hdf_data/df_2016_5.csv
--- a/data/hdf_data/df_2016_7.csv
+++ b/data/hdf_data/df_2016_7.csv
--- a/data/hdf_data/df_2017_0.csv
+++ b/data/hdf_data/df_2017_0.csv
--- a/data/hdf_data/df_2017_1.csv
+++ b/data/hdf_data/df_2017_1.csv
--- a/data/hdf_data/df_2017_10.csv
+++ b/data/hdf_data/df_2017_10.csv
--- a/data/hdf_data/df_2017_14.csv
+++ b/data/hdf_data/df_2017_14.csv
--- a/data/hdf_data/df_2017_2.csv
+++ b/data/hdf_data/df_2017_2.csv
--- a/data/hdf_data/df_2017_3.csv
+++ b/data/hdf_data/df_2017_3.csv
--- a/data/hdf_data/df_2017_5.csv
+++ b/data/hdf_data/df_2017_5.csv
--- a/data/hdf_data/df_2017_7.csv
+++ b/data/hdf_data/df_2017_7.csv
--- a/data/toydata1.csv
+++ b/data/toydata1.csv
--- a/data/toydata2.csv
+++ b/data/toydata2.csv
--- a/exercises/1_visualize_text.ipynb
+++ b/exercises/1_visualize_text.ipynb
@@ -0,0 +1,373 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exploring Text Datasets\n",
    "\n",
    "Large text datasets are often difficult to grasp, because it is hard to see the big picture when reading many individual texts. In this notebook, we create an interactive visualization of paragraphs from Wikipedia articles to show that exploring text datasets this way is much more fun and gives us a better overview of the data.\n",
    "The original data is from [here](https://rajpurkar.github.io/SQuAD-explorer/) and was modified for our purposes.\n",
    "\n",
    "Have a look at the file `articles_short.json` in the data folder. The file extension `.json` stands for _JavaScript Object Notation_ and this is a common format for exchanging data online, e.g., when using third-party API services. Conveniently, this data format can be mapped 1:1 to Python data structures (i.e., nested lists and dictionaries). In our case, the file `articles_short.json` contains the texts of 100 Wikipedia articles, which are organized in a dictionary, where the key of the dict is the title of an article and the corresponding value is a list with the individual paragraphs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# import some libraries that you'll need later\n",
    "import json\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import LabelEncoder\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.decomposition import KernelPCA\n",
    "from sklearn.manifold import TSNE\n",
    "import plotly.express as px\n",
    "# suppress unnecessary warnings\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load data -> json is parsed as Python data structure\n",
    "with open(\"../data/articles_short.json\") as f:\n",
    "    articles = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the first element in the dictionary\n",
    "articles[list(articles.keys())[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract all paragraphs with a list comprehension (have a look at the Python tutorial if this is new to you)\n",
    "paragraphs_corpus = [p for a in articles for p in articles[a]]\n",
    "print(f\"Our dataset contains {len(paragraphs_corpus)} paragraphs\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# raw data: Wikipedia article paragraphs, i.e. 1 data point = 1 paragraph\n",
    "paragraphs_corpus[:3]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Transform into Tf-Idf Features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the raw texts into features\n",
    "vectorizer = TfidfVectorizer(strip_accents='unicode')  # strip_accents to use ascii\n",
    "# fit: check vocab of corpus (i.e. dimensionality of bag-of-words vector) & compute IDF weights\n",
    "# transform: compute vector for each document (i.e., count TF and multiply by IDF)\n",
    "X = vectorizer.fit_transform(paragraphs_corpus)\n",
    "print(\"Dimensionality of our data:\", X.shape)  # number of paragraphs x number of words"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# notice how this is not a normal numpy array, but a scipy sparse matrix\n",
    "# this is a more efficient way of storing data that contains a lot of 0 values\n",
    "# by only remembering the indices where the matrix contains non-zero values and what these values are\n",
    "# (since each individual paragraph contains only very few unique words, this makes a lot of sense here)\n",
    "# (BUT: not all of the algorithms in sklearn can directly work with this type of data, e.g. t-SNE!)\n",
    "X"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize dataset in 2D \n",
    "\n",
    "### ... with Kernel PCA\n",
    "\n",
    "As you've seen above, our feature matrix $X$ contains many more features (i.e., number of unique words in our corpus, 34258) than data points (i.e., paragraphs, 4492). If we were to compute regular PCA, this would mean we need to compute the eigendecomposition of a $34258 \\times 34258$ covariance matrix - you don't want to do this on your laptop! Instead, we can use Kernel PCA, which gives us the same result (if we specify `kernel='linear'`), but computes the eigendecomposition of the similarity matrix, which is only $4492 \\times 4492$."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reduce dimensionality with linear kPCA\n",
    "# since tf-idf vectors are l2 normalized, the linear kernel = cosine similarity\n",
    "# --> we use 100 components since we feed the reduced data to t-SNE later!\n",
    "kpca = KernelPCA(n_components=100, kernel='linear')\n",
    "X_kpca = kpca.fit_transform(X)\n",
    "print(\"Dimensionality of our data:\", X_kpca.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot 2D PCA visualization\n",
    "# the components are ordered by their eigenvalue (largest first), i.e.,\n",
    "# taking the first 2 is the same as if we had computed PCA with n_components=2\n",
    "plt.figure()\n",
    "plt.scatter(X_kpca[:, 0], X_kpca[:, 1], s=2)  # s: size of the dots\n",
    "plt.title(\"PCA embedding of paragraphs\");\n",
    "# each dot is one paragraph, represented in 2D"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate color codes for the plot based on the article titles (-> key in the dict)\n",
    "paragraphs_label = [a for a in articles for p in articles[a]]  # article title for each paragraph\n",
    "print(len(paragraphs_label))  # same as len(paragraphs_corpus)\n",
    "print(paragraphs_label[:3])\n",
    "# map the list of strings to numbers (which we can then use in plt.scatter())\n",
    "p_labels_num = LabelEncoder().fit_transform(paragraphs_label)\n",
    "print(len(p_labels_num))\n",
    "print(p_labels_num[:3])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# same plot as above but with colors\n",
    "plt.figure()\n",
    "plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=p_labels_num, s=2)  # c: list/array of same length as x/y\n",
    "plt.title(\"PCA embedding of paragraphs\");\n",
    "# -> paragraph-dots with the same color belong to the same article"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# interactive plot with plotly (make sure you're not using Internet Explorer but a modern browser!)\n",
    "# generate the tooltip text: split texts into lines\n",
    "hover_texts = [\"<b>\" + paragraphs_label[i] + \"</b><br>\" + \"<br>\".join([\" \".join(p.split()[j:j+7]) for j in range(0, len(p.split()), 7)]) for i, p in enumerate(paragraphs_corpus)]\n",
    "# create interactive plot and display\n",
    "fig = px.scatter(x=X_kpca[:, 0], y=X_kpca[:, 1], color=p_labels_num, hover_name=hover_texts)\n",
    "fig.update_traces(hovertemplate='%{hovertext}')  # only show our text, no additional info\n",
    "# move your mouse over the dots to see what paragraphs are behind them (first line in bold is the article title)\n",
    "fig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the two dots on the top and bottom right are very different from the rest of the paragraphs, i.e., they could be considered outliers and strongly influence the first two components (these are actually not two individual dots, but many on top of each other as all articles start and end with these lines).\n",
    "\n",
    "This means the first principal component here captured whether the paragraph consisted of only `DOCUMENT` plus one additional word, while the second component captured whether this additional word was `BEGIN` or `END`. (The other dimensions then contain additional variance introduced by the fact that the dataset includes paragraphs about different topics.) \n",
    "\n",
    "This plot therefore tells us that we should probably clean up our dataset a bit by removing these beginning and end phrases before doing any other analysis on this dataset. But before we do that, let's look at the eigenvalue spectrum of the dataset:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
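    "# eigenvalues of the centered kernel matrix\n",
    "# (the attribute is called `eigenvalues_` in scikit-learn >= 1.0 and `lambdas_` in older versions)\n",
    "eigenvalues = kpca.eigenvalues_ if hasattr(kpca, \"eigenvalues_\") else kpca.lambdas_\n",
    "print(eigenvalues[:10])\n",
    "# plot eigenvalue spectrum\n",
    "plt.figure()\n",
    "plt.plot(range(1, len(eigenvalues)+1), eigenvalues)\n",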
    "plt.xlabel(\"PCs\")\n",
    "plt.ylabel(\"Eigenvalue\");\n",
    "# observe how the first value is extremely large"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 1: remove outliers and compute Kernel PCA again\n",
    "\n",
    "1. Remove the `BEGIN DOCUMENT` and `END DOCUMENT` \"paragraphs\" from the dataset, i.e., the first and last elements of the list of paragraphs for each article \n",
    "2. Transform this new list of paragraphs into TF-IDF vectors again\n",
    "3. Compute KernelPCA like before and plot the scatter plot (with colors) again\n",
    "4. Look at the eigenvalue spectrum again - what do you observe?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove outliers (i.e. first and last \"paragraph\" for each article)\n",
    "paragraphs_corpus = ..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform list of paragraphs into TF-IDF vectors again\n",
    "...\n",
    "print(\"Dimensionality of our data:\", X.shape)  # (4292, 34258) -> compare to the original size of X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# compute kPCA again (with same parameter settings as before)\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# scatter plot without outliers (with color! but remember, \n",
    "# the dimensionality of the color vector needs to match the x/y coordinates)\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check eigenvalue spectrum of kPCA again\n",
    "..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Task 2: Visualize dataset with t-SNE\n",
    "\n",
    "1. After you've computed your new kPCA embedding (without outliers), use the code below to compute a t-SNE embedding\n",
    "2. Then create a regular (matplotlib) and an interactive (plotly) scatter plot of the results again and explore\n",
    "\n",
    "Notice how the paragraphs form localized clusters (but remember that t-SNE is not a clustering algorithm: it gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task were to classify the paragraphs (e.g., identify the correct article title for each paragraph), you could see for which articles this would be easy and where the content overlaps with that of other articles. You can also see that these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of them as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use 100D kPCA embedding, since t-SNE can't handle sparse matrices\n",
    "tsne = TSNE(metric='cosine', verbose=2, random_state=42)\n",
    "X_tsne = tsne.fit_transform(X_kpca)\n",
    "print(\"Dimensionality of our data:\", X_tsne.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot 2D t-SNE visualization with matplotlib (with colors!)\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# new hover_texts since you have fewer paragraphs\n",
    "hover_texts = [\"<b>\" + paragraphs_label[i] + \"</b><br>\" + \"<br>\".join([\" \".join(p.split()[j:j+7]) for j in range(0, len(p.split()), 7)]) for i, p in enumerate(paragraphs_corpus)]\n",
    "# create interactive plot and display\n",
    "..."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
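The notebook above states that Kernel PCA with a linear kernel gives the same result as regular PCA while only decomposing the (much smaller) similarity matrix. The following is a minimal NumPy sketch of that claim, not part of the notebook itself: the random matrix `X_toy` is a hypothetical stand-in for a TF-IDF matrix with more features than data points, and all variable names are chosen for illustration only.

```python
import numpy as np

# hypothetical stand-in for a TF-IDF matrix: few data points, many features
rng = np.random.RandomState(0)
X_toy = rng.rand(30, 200)
X_c = X_toy - X_toy.mean(axis=0)  # center each feature

# regular PCA: eigendecomposition of the 200 x 200 feature scatter matrix
evals_f, evecs_f = np.linalg.eigh(X_c.T @ X_c)  # eigenvalues in ascending order
top_directions = evecs_f[:, ::-1][:, :2]        # 2 directions with largest eigenvalues
Z_pca = X_c @ top_directions                    # project data onto these directions

# linear kernel PCA: eigendecomposition of the 30 x 30 similarity (Gram) matrix
K = X_c @ X_c.T
evals_k, evecs_k = np.linalg.eigh(K)
evals_k = evals_k[::-1][:2]
evecs_k = evecs_k[:, ::-1][:, :2]
Z_kpca = evecs_k * np.sqrt(evals_k)             # embedding = eigenvectors scaled by sqrt(eigenvalue)

# the sign of each component is arbitrary, so compare absolute values
same_embedding = np.allclose(np.abs(Z_pca), np.abs(Z_kpca), atol=1e-8)
print("same embedding up to sign:", same_embedding)
```

Both routes yield the same 2D coordinates up to the sign of each component; the kernel route is cheaper whenever there are fewer data points than features, as in the text dataset above.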
--- a/exercises/2_image_quantization.ipynb
+++ b/exercises/2_image_quantization.ipynb
@@ -0,0 +1,206 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Color Quantization using K-Means\n",
    "In this notebook, we want to transform a regular RGB image (where each pixel is represented as a Red-Green-Blue triplet) into a [compressed representation](https://en.wikipedia.org/wiki/Color_quantization), where each pixel is represented as a single number (color index) together with a limited color palette (RGB triplets corresponding to the color indices). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from PIL import Image  # library for loading image files\n",
    "from sklearn.cluster import KMeans\n",
    "from sklearn.utils import shuffle\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the original image -> change the path to an image of your choice\n",
    "img_org = Image.open(\"../data/cat.jpg\")\n",
    "img_org"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the image into a numpy array\n",
    "img_array = np.asarray(img_org)\n",
    "print(img_array.shape)  # height x width x 3 (RGB channels)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reshape image into a matrix with RGB values for each pixel\n",
    "h, w, d = img_array.shape\n",
    "X = ...  # TODO: reshape img_array such that X is a matrix of shape n_pixels x 3 RGB channels\n",
    "print(X.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to speed things up a little, we only take a random subsample of the original pixels\n",
    "X_sample = shuffle(X, random_state=0)[:1000]\n",
    "# initialize k-means and set n_clusters to the number of colors you want in your image (e.g. 10)\n",
    "kmeans = ...\n",
    "# fit the model on the data (i.e. find the cluster centers)\n",
    "kmeans.fit(X_sample)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the cluster centers now contain the RGB triplets for each cluster, i.e., our new color palette\n",
    "kmeans.cluster_centers_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# use the predict function of kmeans to compute the cluster index for each data point (i.e. pixel) in X\n",
    "# (cluster indices together with the color palette would be the compressed representation of the image)\n",
    "cluster_idx = ...\n",
    "print(cluster_idx.shape)  # same first dimension as X"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to visualize what the compressed image looks like, map each pixel to the corresponding new color\n",
    "new_X = kmeans.cluster_centers_[cluster_idx]\n",
    "print(new_X.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# cast as integers to get proper RGB values\n",
    "new_X = np.array(new_X, dtype=np.uint8)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# reshape back into image format\n",
    "img_new = new_X.reshape(h, w, d)\n",
    "print(img_new.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform into PIL image (and possibly save)\n",
    "img_new = Image.fromarray(img_new)\n",
    "# img_new.save(\"cat_new.png\")  # -> save & share your image with the other participants\n",
    "img_new"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Heuristic to determine the number of clusters _k_\n",
    "\n",
    "The objective that k-means internally optimizes is the sum of squared distances of the samples to their assigned cluster centers, i.e., it tries to find clusters such that all the points in a cluster are very close to the respective cluster center.\n",
    "\n",
    "After fitting k-means, the final value of this objective function can be computed with the `score` function on the dataset (this actually gives you the negative value, since this is more convenient for some optimization algorithms).\n",
    "\n",
    "We can now simply fit k-means with different settings for _k_ and observe how the value of the score function changes as we increase the number of clusters.\n",
    "\n",
    "#### Questions: \n",
    "* What would happen (i.e. what would the score be) if you set _k_ to a very large value, e.g., the number of data points? \n",
    "* Based on the plot that we compute below, what do you think might be a good value for _k_? (Of course, this will be different for every dataset, i.e., in this example, a different image might need more or fewer colors to look ok.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# how many clusters (i.e. distinct colors) are needed?\n",
    "scores = []\n",
    "for n in range(1, 16):\n",
    "    # compute the value of the k-means objective function for the current k\n",
    "    kmeans = KMeans(n_clusters=n, random_state=0).fit(X_sample)\n",
    "    scores.append(kmeans.score(X_sample))\n",
    "# check out how much the score improves as we use more clusters\n",
    "plt.figure()\n",
    "plt.plot(range(1, 16), scores)\n",
    "plt.xlabel(\"number of clusters\")\n",
    "plt.ylabel(\"score\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
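To make the idea of the compressed representation in the notebook above more concrete, here is a small NumPy-only sketch. It is an illustration under stated assumptions, not the notebook's solution: the tiny image `img` and the fixed `palette` are made up (in the notebook, the palette would be the cluster centers learned by k-means), and the error is computed as the sum of squared distances, which is the quantity scikit-learn's k-means minimizes.

```python
import numpy as np

# hypothetical 2 x 3 "image" with RGB values (stand-in for a real photo)
img = np.array([[[255, 0, 0], [250, 10, 5], [0, 0, 255]],
                [[5, 5, 250], [0, 255, 0], [10, 250, 10]]], dtype=np.uint8)
h, w, d = img.shape
pixels = img.reshape(h * w, d).astype(float)  # one row of RGB values per pixel

# hypothetical palette with 3 colors (k-means would learn these as cluster centers)
palette = np.array([[252.5, 5.0, 2.5],
                    [2.5, 2.5, 252.5],
                    [5.0, 252.5, 5.0]])

# squared distance of every pixel to every palette color -> shape (n_pixels, n_colors)
sq_dists = ((pixels[:, np.newaxis, :] - palette[np.newaxis, :, :]) ** 2).sum(axis=2)
color_idx = sq_dists.argmin(axis=1)     # compressed image: one color index per pixel
objective = sq_dists.min(axis=1).sum()  # sum of squared distances to the assigned colors

# reconstruct an image from the color indices and the palette
img_quantized = palette[color_idx].astype(np.uint8).reshape(h, w, d)
print(color_idx, objective, img_quantized.shape)
```

Storing `color_idx` and `palette` instead of the full RGB array is what makes the representation compact; `objective` measures how much color information is lost by doing so.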
--- a/exercises/3_supervised_comparison.ipynb
+++ b/exercises/3_supervised_comparison.ipynb
@@ -0,0 +1,489 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Compare Supervised Learning Models\n",
    "\n",
    "In this notebook we use 6 toy datasets (3 for regression and 3 for classification) to compare the different algorithms and their hyperparameter settings.\n",
    "\n",
    "Execute the following cells until you see the different datasets and then, after each chapter describing a type of model, come back to this notebook to test the respective model on the datasets and experiment with the model's hyperparameter settings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from matplotlib.colors import ListedColormap\n",
    "from sklearn.datasets import make_moons\n",
    "# suppress unnecessary warnings\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# You do not need to understand what happens in these functions,\n",
    "# just execute the cell so you can use the functions below\n",
    "\n",
    "n_train_reg = 100\n",
    "n_train_clf = 300\n",
    "\n",
    "def plot_regression(X, y, model=None):\n",
    "    # plot a regression dataset (and model predictions)\n",
    "    plt.figure()\n",
    "    plt.scatter(X[:, 0], y, s=10, c='#3090C7', alpha=0.7, label='data samples')\n",
    "    if model is not None:\n",
    "        X_plot = np.linspace(np.min(X), np.max(X), 1000)\n",
    "        plt.plot(X_plot, model.predict(X_plot[:, np.newaxis]), '#15317E', linewidth=1., alpha=0.9, label='prediction')\n",
    "        plt.legend()\n",
    "    plt.xlabel('x (feature)')\n",
    "    plt.ylabel('y (target)')\n",
    "    plt.title('Regression Problem')\n",
    "    \n",
    "def plot_classification(X, Y, model=None):\n",
    "    # plot a classification dataset (and model predictions)\n",
    "    plt.figure()\n",
    "    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5\n",
    "    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5\n",
    "    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 250),\n",
    "                         np.linspace(y_min, y_max, 250))\n",
    "    cm = plt.cm.RdBu\n",
    "    cm_bright = ListedColormap(['#FF0000', '#0000FF'])\n",
    "    if model is not None:\n",
    "        try:\n",
    "            Z = model.decision_function(np.c_[xx.ravel(), yy.ravel()])\n",
    "            alpha = 0.8\n",
    "        except AttributeError:\n",
    "            # models without a decision_function (e.g. decision trees)\n",
    "            Z = model.predict(np.c_[xx.ravel(), yy.ravel()])\n",
    "            alpha = 0.4\n",
    "        # Put the result into a color plot\n",
    "        Z = Z.reshape(xx.shape)\n",
    "        plt.contourf(xx, yy, Z, cmap=cm, alpha=alpha)\n",
    "    # Plot the training points\n",
    "    plt.scatter(X[:, 0], X[:, 1], s=20, c=Y, cmap=cm_bright, label=\"data samples\")\n",
    "    plt.xlim(xx.min(), xx.max())\n",
    "    plt.ylim(yy.min(), yy.max())\n",
    "    plt.xlabel(\"feature 1\")\n",
    "    plt.ylabel(\"feature 2\")\n",
    "    plt.title(\"Classification Problem\")\n",
    "    plt.colorbar()\n",
    "\n",
    "def get_linear_regression():\n",
    "    # generate noisy linear regression dataset\n",
    "    np.random.seed(15)\n",
    "    X = np.random.rand(n_train_reg, 1)\n",
    "    y = -2.5 + 5*X\n",
    "    y += np.random.randn(n_train_reg, 1) * 0.4\n",
    "    return X, y.flatten()\n",
    "\n",
    "def get_linear_outlier():\n",
    "    # generate linear regression dataset with outliers\n",
    "    np.random.seed(15)\n",
    "    X = np.random.rand(n_train_reg, 1)\n",
    "    y = -2.5 + 5*X\n",
    "    y += np.random.randn(n_train_reg, 1) * 0.05\n",
    "    y[(X>0.7) & (X<0.73)] = 10\n",
    "    return X, y.flatten()\n",
    "\n",
    "def get_nonlinear_regression():\n",
    "    # generate noisy non-linear regression dataset\n",
    "    np.random.seed(15)\n",
    "    X = np.random.rand(n_train_reg, 1) * np.pi * 2.\n",
    "    y = np.sin(X)\n",
    "    y += np.random.randn(n_train_reg, 1) * 0.2\n",
    "    return X, y.flatten()\n",
    "\n",
    "def get_linear_classification_1f():\n",
    "    # generate classification dataset with 1 informative feature\n",
    "    np.random.seed(15)\n",
    "    mean = [0, 0]\n",
    "    cov = [[1, 0], [0, 10]]\n",
    "    X = np.zeros((n_train_clf, 2))\n",
    "    X[:n_train_clf//2] = np.random.multivariate_normal(mean, cov, n_train_clf//2)\n",
    "    mean = [5, 0]\n",
    "    X[n_train_clf//2:] = np.random.multivariate_normal(mean, cov, n_train_clf//2)\n",
    "    y = np.zeros(n_train_clf, dtype=int)\n",
    "    y[n_train_clf//2:] = 1\n",
    "    rndidx = np.random.permutation(len(y))\n",
    "    return X[rndidx], y[rndidx]\n",
    "\n",
    "def get_linear_classification_2f():\n",
    "    # generate classification dataset with 2 informative features\n",
    "    np.random.seed(15)\n",
    "    mean = [0, 4]\n",
    "    cov = np.array([[1, 8], [8, 10]])\n",
    "    cov = np.dot(cov, cov.T)/10\n",
    "    X = np.zeros((n_train_clf, 2))\n",
    "    X[:n_train_clf//2] = np.random.multivariate_normal(mean, cov, n_train_clf//2)\n",
    "    mean = [4, 0]\n",
    "    X[n_train_clf//2:] = np.random.multivariate_normal(mean, cov, n_train_clf//2)\n",
    "    y = np.zeros(n_train_clf, dtype=int)\n",
    "    y[n_train_clf//2:] = 1\n",
    "    rndidx = np.random.permutation(len(y))\n",
    "    return X[rndidx], y[rndidx]\n",
    "\n",
    "def get_nonlinear_classification():\n",
    "    # generate non-linear classification dataset\n",
    "    return make_moons(n_samples=n_train_clf, noise=0.3, random_state=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Datasets\n",
    "\n",
    "Here you can have a look at the 3 regression and 3 classification datasets on which we'll compare the different models. The regression dataset only has one input feature, while the classification dataset has two and the target (i.e. class label) is indicated by the color of the dots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate & plot regression datasets\n",
    "X_reg_1, y_reg_1 = get_linear_regression()\n",
    "X_reg_2, y_reg_2 = get_linear_outlier()\n",
    "X_reg_3, y_reg_3 = get_nonlinear_regression()\n",
    "plot_regression(X_reg_1, y_reg_1)\n",
    "plot_regression(X_reg_2, y_reg_2)\n",
    "plot_regression(X_reg_3, y_reg_3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate & plot classification datasets\n",
    "X_clf_1, y_clf_1 = get_linear_classification_1f()\n",
    "X_clf_2, y_clf_2 = get_linear_classification_2f()\n",
    "X_clf_3, y_clf_3 = get_nonlinear_classification()\n",
    "plot_classification(X_clf_1, y_clf_1)\n",
    "plot_classification(X_clf_2, y_clf_2)\n",
    "plot_classification(X_clf_3, y_clf_3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Models\n",
    "\n",
    "After reading the chapter on linear models, test them here on different datasets (by changing the number at the end of the dataset variable, e.g., `X_reg_2` -> `X_reg_3`) and experiment with their hyperparameter settings (in the comments you'll find a description of the different hyperparameters and which values you can test for them).\n",
    "\n",
    "**Questions:**\n",
    "- Compare the linear regression and ridge regression models on the regression dataset with outliers: what do you observe?\n",
    "- What happens when you increase the value for `alpha` for the ridge regression model? (first think about it, then confirm your guess by actually changing the parameter)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression, Ridge, LogisticRegression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Linear Regression\n",
    "X, y = X_reg_2, y_reg_2  # change the numbers here to test the model on a different dataset\n",
    "model = LinearRegression()\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)\n",
    "print(f\"f(x) = {model.intercept_:.3f} + {model.coef_[0]:.3f} * x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Ridge Regression:\n",
    "# alpha (> 0): regularization (higher values = more regularization)\n",
    "X, y = X_reg_2, y_reg_2\n",
    "model = Ridge(alpha=1.)\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)\n",
    "print(f\"f(x) = {model.intercept_:.3f} + {model.coef_[0]:.3f} * x\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Logistic Regression (for classification problems!):\n",
    "# C (> 0): regularization (smaller values = more regularization)\n",
    "# penalty: change to \"l1\" to get sparse weights (only if you have many features)\n",
    "X, y = X_clf_2, y_clf_2\n",
    "model = LogisticRegression(penalty=\"l2\", C=100.)\n",
    "model.fit(X, y)\n",
    "plot_classification(X, y, model)  # the shaded area indicates the predicted probability for each class\n",
    "print(f\"f(x) = sigmoid({model.intercept_[0]:.3f} + {model.coef_[0, 0]:.3f} * x_1 + {model.coef_[0, 1]:.3f} * x_2)\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Decision Trees\n",
    "\n",
    "After reading the chapter on decision trees, test them here on different datasets and experiment with their hyperparameter settings.\n",
    "\n",
    "**Questions:**\n",
    "- On the 3rd regression dataset with `max_depth=2`, why do you get exactly 4 plateaus in the prediction?\n",
    "- On the 3rd regression dataset, what happens if you leave `min_samples_leaf` at 10 and then increase `max_depth` step by step from 2 to 10 or even higher values? How do you explain this behavior and what would you need to do to get a tree that fits the data in a more fine granular way?\n",
    "- Compare the prediction of the decision tree classifier on the 2nd dataset (which is basically a rotation of the 1st dataset, i.e., still a simple linear classification problem!) to the prediction made by the logistic regression model on this dataset: What do you observe and why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decision Tree for regression:\n",
    "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n",
    "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n",
    "X, y = X_reg_3, y_reg_3\n",
    "model = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10)\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Decision Tree for classification:\n",
    "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n",
    "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n",
    "X, y = X_clf_1, y_clf_1\n",
    "model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=10)\n",
    "model.fit(X, y)\n",
    "plot_classification(X, y, model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Ensemble Methods (Random Forest)\n",
    "\n",
    "After reading the chapter on ensemble methods, test the random forest here on different datasets and experiment with the hyperparameter settings (same hyperparameters as the decision tree and the additional parameter `n_estimators` for the number of trees in the forest).\n",
    "\n",
    "**Questions:**\n",
    "- What do you observe when you compare a random forest with multiple estimators to a single decision tree with the same hyperparameter settings (especially for more specific trees, i.e., large `max_depth` and small `min_samples_leaf`)?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random Forest for regression:\n",
    "# n_estimators (> 1): how many decision trees to train (don't set this too high, gets computationally expensive)\n",
    "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n",
    "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n",
    "X, y = X_reg_3, y_reg_3\n",
    "model = RandomForestRegressor(n_estimators=100, max_depth=2, min_samples_leaf=10)\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Random Forest for classification:\n",
    "# n_estimators (> 1): how many decision trees to train\n",
    "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n",
    "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n",
    "X, y = X_clf_2, y_clf_2\n",
    "model = RandomForestClassifier(n_estimators=100, max_depth=2, min_samples_leaf=10)\n",
    "model.fit(X, y)\n",
    "plot_classification(X, y, model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Similarity-based Models (kNN)\n",
    "\n",
    "After reading the chapter on k-nearest neighbors, test the method here on different datasets and experiment with the hyperparameter settings.\n",
    "\n",
    "**Questions:**\n",
    "- On the 3rd regression dataset for a larger number of nearest neighbors (e.g. 20), what do you observe for the prediction at the edges of the input domain and why?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# k-Nearest Neighbors for regression:\n",
    "# n_neighbors (> 1): how many nearest neighbors are used for the prediction\n",
    "X, y = X_reg_3, y_reg_3\n",
    "model = KNeighborsRegressor(n_neighbors=10)\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# k-Nearest Neighbors for classification:\n",
    "# n_neighbors (> 1): how many nearest neighbors are used for the prediction\n",
    "X, y = X_clf_3, y_clf_3\n",
    "model = KNeighborsClassifier(n_neighbors=12)\n",
    "model.fit(X, y)\n",
    "plot_classification(X, y, model)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Kernel Methods\n",
    "\n",
    "After reading the chapter on kernel methods, test a SVM here on different datasets and experiment with the hyperparameter settings.\n",
    "\n",
    "**Questions:**\n",
    "- How do the values of the hyperparameters `gamma` and `C` interact? \n",
    "- What do you observe when you leave `gamma` at its default value `'scale'`?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.svm import SVR, SVC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Support Vector Regression:\n",
    "# kernel: kernel function to compute similarities (default: \"rbf\")\n",
    "# gamma (> 0): width of rbf kernel (larger values --> more focused on individual points)\n",
    "# C (> 0): regularization (smaller values = more regularization)\n",
    "X, y = X_reg_3, y_reg_3\n",
    "model = SVR(kernel='rbf', gamma=100., C=1.)\n",
    "model.fit(X, y)\n",
    "plot_regression(X, y, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Support Vector Classification:\n",
    "# kernel: kernel function to compute similarities (default: \"rbf\")\n",
    "# gamma (> 0): width of rbf kernel (larger values --> more focused on individual points)\n",
    "# C (> 0): regularization (smaller values = more regularization)\n",
    "X, y = X_clf_3, y_clf_3\n",
    "model = SVC(kernel='rbf', gamma=.005, C=1.)\n",
    "model.fit(X, y)\n",
    "plot_classification(X, y, model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/exercises/4_information_retrieval.ipynb
+++ b/exercises/4_information_retrieval.ipynb
@@ -0,0 +1,416 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Information Retrieval / NLP example\n",
    "\n",
    "**Idea:** Respond more quickly to customer service requests by using Natural Language Processing (NLP) and Information Retrieval to automatically suggest one or several FAQ articles or response templates given an incoming customer email to speed up the process of drafting a response.\n",
    "\n",
    "Since personal emails are private, unfortunately there are no public datasets available with customer requests, so we instead test the methodology on a [question answering dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset contains 477 wikipedia articles, each split into paragraphs, with several questions associated with each paragraph (i.e. where the answer to the question can be found in the paragraph).\n",
    "\n",
    "This means your task is, given a question, to identify the correct paragraph that contains the answer to this question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.neighbors import NearestNeighbors\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "\n",
    "The original data is again in a JSON format, just a bit more nested than in the 1st notebook (visualize_text), since here we have a list of questions associated with each paragraph. The outer structure is again a dictionary, where the article titles are the keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data & parse it as a Python data structure\n",
    "with open(\"../data/articles.json\") as f:\n",
    "    articles = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out what an article looks like by looking at the entry associated with the first article title\n",
    "# (don't look at all articles at once, this will probably crash your notebook!)\n",
    "# we again have a list with all paragraphs of the article, only that here each paragraph is also a dict\n",
    "# with entries for the paragraph text and a list of associated questions\n",
    "articles[list(articles.keys())[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the total number of paragraphs\n",
    "n_paragraphs = [len(articles[a]) for a in articles]\n",
    "print(f\"{len(articles)} articles with alltogether {sum(n_paragraphs)} paragraphs; on average {np.mean(n_paragraphs):.1f} paragraphs per article\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# take only a subset of the articles to speed up the computation\n",
    "subset = sorted(articles.keys())\n",
    "np.random.seed(25)\n",
    "subset = np.random.permutation(subset)\n",
    "articles = {a: articles[a] for a in subset[:100]}  # same dict structure as before, just fewer articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assign Questions to Paragraphs\n",
    "\n",
    "We first transform both the paragraphs as well as all questions associated with the paragraphs into TF-IDF features and then identify the most similar paragraph for a given question by computing the cosine similarity of the TF-IDF vector for the question to the TF-IDF vectors of all paragraphs to identify the most similar paragraphs, which we then return as the search results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a list of texts and \"labels\" for the paragraphs\n",
    "paragraphs_corpus = [p[\"paragraph\"] for a in articles for p in articles[a]]\n",
    "# for each paragraph and question we note the title of the corresponding article and the number of the paragraph,\n",
    "# which will later in the evaluation help us to see if the returned paragraph is correct for the given question\n",
    "paragraphs_label = [f\"{a} {i}\" for a in articles for i, p in enumerate(articles[a])]\n",
    "# list of questions - note the additional level in the list comprehension to go through all questions of a paragraph\n",
    "questions_corpus = [q for a in articles for p in articles[a] for q in p[\"questions\"]]\n",
    "questions_label = [f\"{a} {i}\" for a in articles for i, p in enumerate(articles[a]) for q in p[\"questions\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform both paragraphs and questions into tf-idf features\n",
    "vectorizer = TfidfVectorizer(strip_accents='unicode')\n",
    "# learn internal parameters of vectorizer (vocabulary, idf weights) from known data\n",
    "vectorizer.fit(paragraphs_corpus)\n",
    "# transform both datasets with the same vectorizer so they have the same feature dimensions\n",
    "X_pars = vectorizer.transform(paragraphs_corpus)\n",
    "# important to only call transform here, so you get the same vector space for both paragraphs and questions\n",
    "X_ques = vectorizer.transform(questions_corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question:** What would the TF-IDF vector look like, if a question contained only words that did not occur in any of the paragraphs?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the dimensions of the feature matrices\n",
    "print(X_pars.shape)  # number of paragraphs x bag-of-words vocabulary\n",
    "print(X_ques.shape)  # number of questions x bag-of-words vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize the nearest neighbors search tree:\n",
    "# return 10 search results for every query based on the cosine similarity\n",
    "nn = NearestNeighbors(n_neighbors=10, metric='cosine')\n",
    "# when calling fit, it builds the search tree to efficiently find the closest paragraphs\n",
    "nn.fit(X_pars)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick a question\n",
    "questions_corpus[888]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# query the nearest neighbors search with the corresponding vector to get the indices of the answer\n",
    "nn.kneighbors(X_ques[888])\n",
    "# the result is a tuple with distances and indices (since the nearest neighbors search internally\n",
    "# works with distances instead of similarities, the first result has the smallest distance, which\n",
    "# is 1-cosine similarity, i.e., it has the highest similarity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check what the answer text is for the first result\n",
    "paragraphs_corpus[65]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# what was really the correct paragraph?\n",
    "questions_label[888]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# corresponding paragraph of the answer\n",
    "paragraphs_label[65]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# but the 3rd result would have been correct\n",
    "paragraphs_label[80]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ask your own question!\n",
    "q = \"Your question here\"\n",
    "# transform the text into a vector (need to pass it as a list to the vectorizer)\n",
    "x = vectorizer.transform([q])\n",
    "# query the nearest neighbors search with this new vector\n",
    "nn.kneighbors(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check what paragraph is behind the index for the best result\n",
    "paragraphs_corpus[...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Systematic performance analysis with the hits@k metric\n",
    "\n",
    "While above we just collected some anecdotal evidence for the performance of our method, of course before deploying this into production we should conduct a more systematic evaluation. For this, we use the _hits@k_ metric, which, for different _k_, checks whether the correct answer was within the first _k_ search results. E.g. in a Google search, is the website you're looking for the 1st result, then this would be a hit@1, or is it on the first page, then it would still count towards the hits@10.\n",
    "\n",
    "In our example, we check both the hits@k for the paragraphs as well as the articles, i.e., we check for every paragraph that was returned as a search result, whether that was actually the correct paragraph, or whether it at least came from the right article."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you do not need to understand in detail what happens here,\n",
    "# just execute and look at the final results\n",
    "print(\"computing nearest neighbors...\", end='\\r')\n",
    "nn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(X_pars)\n",
    "nn_results = nn.kneighbors(X_ques, return_distance=False)\n",
    "print(\"computing nearest neighbors... done!\")\n",
    "# score as: right article in top 10; right paragraph in top 10\n",
    "article_hits = [[] for i in range(10)]\n",
    "paragraph_hits = [[] for i in range(10)]\n",
    "for i in range(X_ques.shape[0]):\n",
    "    t_label = questions_label[i]\n",
    "    labels = [paragraphs_label[j] for j in nn_results[i]]\n",
    "    for k in range(10):\n",
    "        if t_label in labels[:k+1]:\n",
    "            paragraph_hits[k].append(1)\n",
    "        else:\n",
    "            paragraph_hits[k].append(0)\n",
    "    labels = [l.split()[0] for l in labels]\n",
    "    t_label = t_label.split()[0]\n",
    "    for k in range(10):\n",
    "        if t_label in labels[:k+1]:\n",
    "            article_hits[k].append(1)\n",
    "        else:\n",
    "            article_hits[k].append(0)\n",
    "for i in [0, 1, 2, 3, 4, 9]:\n",
    "    print(f\"Article Hits @ {i+1:2}: {100*np.mean(article_hits[i]):.1f}\")\n",
    "for i in [0, 1, 2, 3, 4, 9]:\n",
    "    print(f\"Paragraph Hits @ {i+1:2}: {100*np.mean(paragraph_hits[i]):.1f}\")\n",
    "# --> if we show 5 paragraphs, in almost 80% of the cases the correct paragraph is among them;\n",
    "#     if we return only 3 results, in over 90% at least the correct article is identified"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "For these exercises, please work on the complete set of articles, not the subset we used for now, i.e., load the data again. To construct a single text from all paragraphs of an article, we join them together with `\"\\n\".join(list_of_paragraphs)`.\n",
    "\n",
    "### 1. Find the article that is most similar to the article about \"Beer\"\n",
    "To do this, compute the cosine similaritiy between the target article with the title \"Beer\" and all other articles and choose the most similar one (not counting the target article itself ;-)). Similar to how we selected a matching paragraph for each question, this can be done with the `NearestNeighbors` class from `sklearn`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data again\n",
    "with open(\"../data/articles.json\") as f:\n",
    "    articles = json.load(f)\n",
    "\n",
    "# get a list of all article titles\n",
    "article_ids = sorted(articles.keys())\n",
    "\n",
    "# get the corresponding texts of these articles by concatenating all the paragraphs of each article.\n",
    "# the texts in this list are in the same order as the article titles in article_ids.\n",
    "article_corpus = [\"\\n\".join([p[\"paragraph\"] for p in articles[a]]) for a in article_ids]\n",
    "print(len(article_corpus))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now the article texts need to be transformed into a tf-idf feature matrix X\n",
    "X = ...\n",
    "\n",
    "# when you're done, check how many articles and feature dimensions were generated\n",
    "print(X.shape)  # should be (477, 80732)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find out the index of our target article \"Beer\" (remember the titles are in article_ids), \n",
    "# so you know which row in the feature matrix X contains the corresponding tf-idf vector\n",
    "# (so that you can use this vector to get the nearest neighbors)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the most similar article with NearestNeighbors\n",
    "\n",
    "# based on the index of the most similar article, get the title of this article\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Exercises\n",
    "\n",
    "### 2. Find the most similar article to \"Beer\" - without using `NearestNeighbors`!\n",
    "Solve the above task without using the `NearestNeighbors` class, i.e., by computing the similarities and selecting the most similar article yourself. Since the TF-IDF feature vectors are by default length-normalized, the cosine similarity between the articles can be computed with a simple dot-product (but beware: X is by default a sparse matrix, i.e., `np.dot()` wont work, but you can also call `.dot()` at an array or sparse matrix directly). The result of the dot-product will still be sparse, but you can call `.toarray()` on a sparse matrix to convert it into a regular dense numpy array with which you can work as usual. Once you have a vector with similarities, the functions `np.argmax` or `np.argsort` might be helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Find the two most similar articles in the set\n",
    "From all articles, which two have the most similar text? What is their cosine similarity score?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/exercises/5_mnist_keras.ipynb
+++ b/exercises/5_mnist_keras.ipynb
--- a/exercises/5_mnist_torch.ipynb
+++ b/exercises/5_mnist_torch.ipynb
@@ -0,0 +1,657 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Analyze (F)MNIST with `torch`\n",
    "\n",
    "Careful: do **not** hit 'Kernel' > 'Restart & Run All', since some of the cells below take a long time to execute if you are not running the code on a GPU, so we already executed them for you. Only run the first few cells that are not yet executed.\n",
    "\n",
    "In this notebook we compare different types of neural network architectures on the MNIST and Fashion MNIST datasets, to see how the performance improves when using a more complicated architecture. Additionally, we compare the networks to a simple logistic regression classifier from `sklearn`, which should have approximately the same accuracy as a linear FFNN (= a FFNN with only one layer mapping from the input directly to the output and no hidden layers, i.e., that has the same number of trainable parameters as the logistic regression model)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:06:30.088245Z",
     "start_time": "2020-11-22T19:06:29.139733Z"
    }
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.metrics import accuracy_score\n",
    "# torch neural network stuff\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "import torch.nn.functional as F\n",
    "import torch.optim as optim\n",
    "# torchvision includes the (F)MNIST datasets\n",
    "from torchvision import datasets, transforms\n",
    "# skorch provides a wrapper for torch networks so we can use them like sklearn models\n",
    "from skorch import NeuralNetClassifier\n",
    "from skorch.callbacks import EpochScoring\n",
    "# set random seeds to get (at least more or less) reproducable results\n",
    "np.random.seed(28)\n",
    "torch.manual_seed(28)\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load and look at the data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:06:30.114557Z",
     "start_time": "2020-11-22T19:06:30.090761Z"
    }
   },
   "outputs": [],
   "source": [
    "# you do not need to understand what these functions do in detail\n",
    "\n",
    "def torch_to_X_y(dataset):\n",
    "    # transform input tensor to numpy array\n",
    "    X = dataset.data.numpy()\n",
    "    # reshape (28 x 28) pixel images to vector\n",
    "    X = X.reshape(X.shape[0], -1).astype('float32')\n",
    "    # the ToTensor transform was not applied to the raw data, so we need to scale ourselves\n",
    "    X /= X.max()\n",
    "    # extract numpy array with targets\n",
    "    y = dataset.targets.numpy()\n",
    "    return X, y\n",
    "\n",
    "def load_data(use_fashion=False):\n",
    "    if use_fashion:\n",
    "        data_train = datasets.FashionMNIST(\"../data\", train=True, download=True, transform=transforms.ToTensor())\n",
    "        data_test = datasets.FashionMNIST(\"../data\", train=False, transform=transforms.ToTensor())\n",
    "    else:\n",
    "        data_train = datasets.MNIST(\"../data\", train=True, download=True, transform=transforms.ToTensor())\n",
    "        data_test = datasets.MNIST(\"../data\", train=False, transform=transforms.ToTensor())\n",
    "    # extract (n_samples x n_features) and (n_samples,) X and y numpy arrays from torch dataset\n",
    "    X_train, y_train = torch_to_X_y(data_train)\n",
    "    X_test, y_test = torch_to_X_y(data_train)\n",
    "    return X_train, X_test, y_train, y_test\n",
    "    \n",
    "def plot_images(x):\n",
    "    n = 10\n",
    "    plt.figure(figsize=(20, 4))\n",
    "    for i in range(1, n+1):\n",
    "        # display original\n",
    "        ax = plt.subplot(2, n, i)\n",
    "        plt.imshow(x[i].reshape(28, 28))\n",
    "        plt.gray()\n",
    "        ax.get_xaxis().set_visible(False)\n",
    "        ax.get_yaxis().set_visible(False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:06:31.076712Z",
     "start_time": "2020-11-22T19:06:30.115973Z"
    }
   },
   "outputs": [],
   "source": [
    "# load and display the data -> see how the images have the same format\n",
    "# MNIST\n",
    "X_train, X_test, y_train, y_test = load_data()\n",
    "plot_images(X_train)\n",
    "# Fashion MNIST\n",
    "X_train, X_test, y_train, y_test = load_data(use_fashion=True)\n",
    "plot_images(X_train)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## See how a `torch` network works"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:06:31.158130Z",
     "start_time": "2020-11-22T19:06:31.078469Z"
    }
   },
   "outputs": [],
   "source": [
    "# FFNN with hidden layers (like the one you saw in the book)\n",
    "class MyNeuralNet(nn.Module):\n",
    "    \n",
    "    def __init__(self, n_in=784, n_hl1=512, n_hl2=256, n_out=10, verbose=False):\n",
    "        # input size is 28x28 pixel images flattened into a 784-dimensional vector\n",
    "        # output size is 10 classes\n",
    "        # hidden layer sizes can be set however you want\n",
    "        super(MyNeuralNet, self).__init__()\n",
    "        self.verbose = verbose\n",
    "        # initialize layers\n",
    "        self.l1 = nn.Linear(n_in, n_hl1)\n",
    "        self.l2 = nn.Linear(n_hl1, n_hl2)\n",
    "        self.lout = nn.Linear(n_hl2, n_out)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        # apply layers in correct order\n",
    "        if self.verbose: print(\"[MyNeuralNet]  input:\", x.shape)\n",
    "        h = F.relu(self.l1(x))              # 784 -> 512 [relu]\n",
    "        if self.verbose: print(\"[MyNeuralNet] 1st hl:\", h.shape)\n",
    "        h = F.relu(self.l2(h))              # 512 -> 256 [relu]\n",
    "        if self.verbose: print(\"[MyNeuralNet] 2nd hl:\", h.shape)\n",
    "        y = F.softmax(self.lout(h), dim=1)  # 256 -> 10  [softmax]\n",
    "        if self.verbose: print(\"[MyNeuralNet] output:\", y.shape)\n",
    "        return y"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize the network\n",
    "ffnn = MyNeuralNet(verbose=True)\n",
    "# get an input data batch and convert the numpy array \n",
    "# to a torch tensor to use it with the network directly\n",
    "# (skorch later works with the numpy arrays)\n",
    "x = torch.Tensor(X_train[:16])\n",
    "print(x.shape)  # batch size x features"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# apply network to input, i.e., call forward() to generate the prediction\n",
    "y = ffnn(x)\n",
    "print(y.shape)  # batch size x classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look at the network's output for the first data point\n",
    "# -> since the network wasn't trained yet, the predicted probabilities for all 10 classes are ~0.1\n",
    "# (notice the grad parameter, which indicates that the network kept track of the gradients,\n",
    "# which are needed for later tuning the weights during training)\n",
    "y[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# wrap torch NN in skorch Classifier and initialize\n",
    "net = NeuralNetClassifier(\n",
    "    MyNeuralNet,  # usually the class itself, not an instantiated object\n",
    "    batch_size=32,  # how many samples are used in each training iteration\n",
    "    optimizer=torch.optim.Adadelta,  # the optimizer (i.e. \"what type\" of gradient descent)\n",
    "    lr=1.,  # learning rate of the optimizer\n",
    "    device=\"cuda\" if torch.cuda.is_available() else \"cpu\",  # train the network on a GPU if available\n",
    "    max_epochs=1,  # for how many epochs to train the network\n",
    "    callbacks=[  # additional stuff that should happen after each epoch, e.g., learning rate scheduler\n",
    "        ('tr_acc', EpochScoring(  # or in this case print the accuracy after every epoch\n",
    "            'accuracy',\n",
    "            lower_is_better=False,\n",
    "            on_train=True,\n",
    "            name='train_acc',\n",
    "        )),\n",
    "    ],\n",
    ")\n",
    "\n",
    "# use simple sklearn-like interface to train the network (for 1 epoch)\n",
    "net.fit(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate predictions for the same samples as above\n",
    "# -> this gives class labels directly like sklearn\n",
    "y = net.predict(X_train[:16])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check if the prediction (after training) is correct\n",
    "print(\"true class:\", y_train[0])\n",
    "y[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we can also get the original probabilities\n",
    "y = net.predict_proba(X_train[:16])\n",
    "y[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Define NNs for the classification task\n",
    "\n",
    "In the code below we define 3 different neural network architectures: a linear FFNN, a FFNN with multiple hidden layers, and a CNN, which is an architecture particularly well suited for image classification tasks.\n",
    "\n",
    "You will see that the more complex architectures use an additional operation between layers called `Dropout`. This is a regularization technique used for training neural networks, where a certain percentage of the values in the hidden layer representation of a data point are randomly set to zero. You can think of this as the network suffering from a temporary stroke, which forces the neurons learn redundant representations (i.e., such that one neuron can take over for another neuron that was knocked out), which improves generalization."
   ]
  },
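  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a small illustration (a sketch added for clarity, not part of the original exercise), the cell below applies `nn.Dropout` to a toy tensor. The tensor `h` and the dropout rate `p=0.5` are made-up values. In training mode, roughly half of the values are set to zero and the remaining ones are scaled by `1 / (1 - p)`, while in evaluation mode the input is passed through unchanged. When using `skorch`, switching between the two modes should be handled for you by `fit` and `predict`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# illustration only: the toy tensor and p=0.5 are assumptions, not values used in the networks below\n",
    "import torch\n",
    "import torch.nn as nn\n",
    "\n",
    "drop = nn.Dropout(p=0.5)\n",
    "h = torch.ones(2, 8)  # pretend this is the hidden layer representation of 2 data points\n",
    "\n",
    "drop.train()    # training mode: each value is set to zero with probability p ...\n",
    "print(drop(h))  # ... and the remaining values are scaled by 1 / (1 - p) = 2\n",
    "\n",
    "drop.eval()     # evaluation mode: dropout is switched off\n",
    "print(drop(h))  # the input is passed through unchanged"
   ]
  },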
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:06:31.188855Z",
     "start_time": "2020-11-22T19:06:31.159480Z"
    }
   },
   "outputs": [],
   "source": [
    "# linear FFNN (--> same number of parameters as LogReg model)\n",
    "class LinNN(nn.Module):\n",
    "    \n",
    "    def __init__(self, n_in=784, n_out=10):\n",
    "        super(LinNN, self).__init__()\n",
    "        self.l = nn.Linear(n_in, n_out)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        y = F.softmax(self.l(x), dim=1)  # 784 -> 10 [softmax]\n",
    "        return y\n",
    "    \n",
    "# FFNN with hidden layers  \n",
    "class FFNN(nn.Module):\n",
    "    \n",
    "    def __init__(self, n_in=784, n_hl1=512, n_hl2=256, n_out=10, dropout=0.2):\n",
    "        super(FFNN, self).__init__()\n",
    "        # initialize layers\n",
    "        self.dropout = nn.Dropout(dropout)\n",
    "        self.l1 = nn.Linear(n_in, n_hl1)\n",
    "        self.l2 = nn.Linear(n_hl1, n_hl2)\n",
    "        self.lout = nn.Linear(n_hl2, n_out)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        # apply layers in correct order\n",
    "        h = F.relu(self.l1(x))              # 784 -> 512 [relu]\n",
    "        h = self.dropout(h)\n",
    "        h = F.relu(self.l2(h))              # 512 -> 256 [relu]\n",
    "        h = self.dropout(h)\n",
    "        y = F.softmax(self.lout(h), dim=1)  # 256 -> 10  [softmax]\n",
    "        return y\n",
    "    \n",
    "# Convolutional Neural Net    \n",
    "# based on https://github.com/pytorch/examples/blob/master/mnist/main.py\n",
    "class CNN(nn.Module):\n",
    "    \n",
    "    def __init__(self):\n",
    "        super(CNN, self).__init__()\n",
    "        self.conv1 = nn.Conv2d(1, 32, 3, 1)\n",
    "        self.conv2 = nn.Conv2d(32, 64, 3, 1)\n",
    "        self.dropout1 = nn.Dropout(0.25)\n",
    "        self.dropout2 = nn.Dropout(0.5)\n",
    "        self.fc1 = nn.Linear(9216, 128)\n",
    "        self.fc2 = nn.Linear(128, 10)\n",
    "        \n",
    "    def forward(self, x):\n",
    "        # convolutional and pooling layers\n",
    "        h = self.conv1(x)\n",
    "        h = F.relu(h)\n",
    "        h = self.conv2(h)\n",
    "        h = F.relu(h)\n",
    "        h = F.max_pool2d(h, 2)\n",
    "        h = self.dropout1(h)\n",
    "        # flatten the representation and apply FFNN part for the classification\n",
    "        h = torch.flatten(h, 1)\n",
    "        h = self.fc1(h)\n",
    "        h = F.relu(h)\n",
    "        h = self.dropout2(h)\n",
    "        h = self.fc2(h)\n",
    "        y = F.softmax(h, dim=1)\n",
    "        return y\n",
    "\n",
    "# skorch wrapper with fit/predict methods\n",
    "def eval_net(net_module, X_train, y_train, X_test, y_test, max_epochs=1):\n",
    "    print(\"###\", net_module.__name__)\n",
    "    net = NeuralNetClassifier(\n",
    "        net_module,\n",
    "        batch_size=32,\n",
    "        optimizer=torch.optim.Adadelta,\n",
    "        lr=1.,\n",
    "        device=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
    "        max_epochs=max_epochs,\n",
    "        callbacks=[\n",
    "            ('tr_acc', EpochScoring(\n",
    "                'accuracy',\n",
    "                lower_is_better=False,\n",
    "                on_train=True,\n",
    "                name='train_acc',\n",
    "            )),\n",
    "        ],\n",
    "    )\n",
    "    net.fit(X_train, y_train)\n",
    "    # evaluate on test set\n",
    "    y_pred = net.predict(X_test)\n",
    "    print('Test accuracy:', accuracy_score(y_test, y_pred))\n",
    "    return net"
   ]
  },
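  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a quick sanity check (a sketch added for clarity, not part of the original exercise), the cell below counts the trainable parameters of the linear FFNN and traces the shape of a dummy image through layers configured like the convolutional part of the CNN, which shows where the `9216` inputs of `fc1` come from. It relies on the imports and the `LinNN` class from the cells above; the all-zeros input `x` is just a placeholder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the linear FFNN has 784 * 10 weights + 10 biases = 7850 trainable parameters,\n",
    "# i.e., one weight per pixel and class plus one bias per class\n",
    "print(sum(p.numel() for p in LinNN().parameters()))\n",
    "\n",
    "# trace a dummy 28x28 image through freshly initialized layers like those in the CNN\n",
    "x = torch.zeros(1, 1, 28, 28)      # batch size x channels x height x width\n",
    "h = nn.Conv2d(1, 32, 3, 1)(x)      # 3x3 kernel without padding: 28 -> 26\n",
    "h = nn.Conv2d(32, 64, 3, 1)(h)     # 26 -> 24\n",
    "h = F.max_pool2d(h, 2)             # 2x2 max pooling: 24 -> 12\n",
    "print(h.shape)                     # 1 x 64 x 12 x 12\n",
    "print(torch.flatten(h, 1).shape)   # 64 * 12 * 12 = 9216 values per image"
   ]
  },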
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test on MNIST dataset\n",
    "\n",
    "As you see below, the simple logistic regression classifier is already very good on this easy task, with a test accuracy of over 93.5%.\n",
    "\n",
    "The linear FFNN has almost the same accuracy (90.5%) as the LogReg model (please note the NNs were only trained for a single epoch!) and the multi-layer FFNN is already better than the LogReg model (96.4%), while the CNN beats them all (98.2%), which is expected since this architecture is designed for the image classification task."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:07:16.662634Z",
     "start_time": "2020-11-22T19:06:31.190048Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### LogReg\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/franzi/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test accuracy: 0.93535\n",
      "### LinNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.8914\u001b[0m        \u001b[32m0.4044\u001b[0m       \u001b[35m0.9052\u001b[0m        \u001b[31m0.3264\u001b[0m  2.9920\n",
      "Test accuracy: 0.9051833333333333\n",
      "### FFNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.9205\u001b[0m        \u001b[32m0.2589\u001b[0m       \u001b[35m0.9600\u001b[0m        \u001b[31m0.1438\u001b[0m  5.8516\n",
      "Test accuracy: 0.9642166666666667\n",
      "### CNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.9311\u001b[0m        \u001b[32m0.2313\u001b[0m       \u001b[35m0.9797\u001b[0m        \u001b[31m0.0744\u001b[0m  8.4549\n",
      "Test accuracy: 0.9821833333333333\n"
     ]
    }
   ],
   "source": [
    "# get regular MNIST dataset\n",
    "X_train, X_test, y_train, y_test = load_data()\n",
    "# compare sklearn LogReg classifier\n",
    "print(\"### LogReg\")\n",
    "clf = LogisticRegression(class_weight='balanced', random_state=1, fit_intercept=True)\n",
    "clf.fit(X_train, y_train)\n",
    "print('Test accuracy:', clf.score(X_test, y_test))\n",
    "# and our different NN architectures\n",
    "for net_module in [LinNN, FFNN, CNN]:\n",
    "    if net_module == CNN:\n",
    "        # the CNN operates on the 28x28 pixel images directly\n",
    "        net = eval_net(net_module, X_train.reshape(-1, 1, 28, 28), y_train, X_test.reshape(-1, 1, 28, 28), y_test)\n",
    "    else:\n",
    "        # the FFNNs get the flattened vectors\n",
    "        net = eval_net(net_module, X_train, y_train, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Test on FashionMNIST\n",
    "\n",
    "On the more difficult FMNIST task, the LogReg model has a much lower accuracy of 86.6%. When trained for only a single epoch, both the linear and multi-layer FFNNs have a lower accuracy (82.7 and 83.7% respectively) and only the CNN does a bit better (88.6%). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:07:58.876986Z",
     "start_time": "2020-11-22T19:07:16.665145Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### LogReg\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/franzi/anaconda3/lib/python3.7/site-packages/sklearn/linear_model/_logistic.py:940: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
      "STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
      "\n",
      "Increase the number of iterations (max_iter) or scale the data as shown in:\n",
      "    https://scikit-learn.org/stable/modules/preprocessing.html\n",
      "Please also refer to the documentation for alternative solver options:\n",
      "    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
      "  extra_warning_msg=_LOGISTIC_SOLVER_CONVERGENCE_MSG)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test accuracy: 0.8659833333333333\n",
      "### LinNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.8022\u001b[0m        \u001b[32m0.5794\u001b[0m       \u001b[35m0.8257\u001b[0m        \u001b[31m0.5023\u001b[0m  2.7844\n",
      "Test accuracy: 0.8270166666666666\n",
      "### FFNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.7816\u001b[0m        \u001b[32m0.5942\u001b[0m       \u001b[35m0.8366\u001b[0m        \u001b[31m0.4465\u001b[0m  5.6541\n",
      "Test accuracy: 0.8375666666666667\n",
      "### CNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.8063\u001b[0m        \u001b[32m0.5415\u001b[0m       \u001b[35m0.8842\u001b[0m        \u001b[31m0.3228\u001b[0m  8.7526\n",
      "Test accuracy: 0.8861166666666667\n"
     ]
    }
   ],
   "source": [
    "X_train, X_test, y_train, y_test = load_data(True)\n",
    "# regular sklearn LogReg classifier\n",
    "print(\"### LogReg\")\n",
    "clf = LogisticRegression(class_weight='balanced', random_state=1, fit_intercept=True)\n",
    "clf.fit(X_train, y_train)\n",
    "print('Test accuracy:', clf.score(X_test, y_test))\n",
    "# our different NN\n",
    "for net_module in [LinNN, FFNN, CNN]:\n",
    "    if net_module == CNN:\n",
    "        net = eval_net(net_module, X_train.reshape(-1, 1, 28, 28), y_train, X_test.reshape(-1, 1, 28, 28), y_test)\n",
    "    else:\n",
    "        net = eval_net(net_module, X_train, y_train, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "However, when trained for more epochs, the performance of all models improves, with the accuracy of the linear FFNN now being very close to that of the LogReg model (85.8%), while the multi-layer FFNN is better (89.3%) and the CNN can now solve the task quite well with an accuracy of 94.6%.\n",
    "\n",
    "(See how the training and validation loss decrease over time - observing how these metrics develop can help you judge whether you've set your learning rate correctly.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-11-22T19:12:29.545463Z",
     "start_time": "2020-11-22T19:07:58.880264Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "### LinNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.8015\u001b[0m        \u001b[32m0.5794\u001b[0m       \u001b[35m0.8258\u001b[0m        \u001b[31m0.5047\u001b[0m  2.9007\n",
      "      2       \u001b[36m0.8383\u001b[0m        \u001b[32m0.4715\u001b[0m       \u001b[35m0.8363\u001b[0m        \u001b[31m0.4753\u001b[0m  3.1765\n",
      "      3       \u001b[36m0.8461\u001b[0m        \u001b[32m0.4520\u001b[0m       \u001b[35m0.8401\u001b[0m        \u001b[31m0.4635\u001b[0m  3.3053\n",
      "      4       \u001b[36m0.8497\u001b[0m        \u001b[32m0.4415\u001b[0m       \u001b[35m0.8426\u001b[0m        \u001b[31m0.4576\u001b[0m  2.8845\n",
      "      5       \u001b[36m0.8523\u001b[0m        \u001b[32m0.4347\u001b[0m       \u001b[35m0.8449\u001b[0m        \u001b[31m0.4544\u001b[0m  3.1842\n",
      "      6       \u001b[36m0.8542\u001b[0m        \u001b[32m0.4297\u001b[0m       \u001b[35m0.8460\u001b[0m        \u001b[31m0.4527\u001b[0m  2.9838\n",
      "      7       \u001b[36m0.8555\u001b[0m        \u001b[32m0.4258\u001b[0m       0.8456        \u001b[31m0.4518\u001b[0m  3.0838\n",
      "      8       \u001b[36m0.8569\u001b[0m        \u001b[32m0.4227\u001b[0m       0.8456        \u001b[31m0.4515\u001b[0m  3.0435\n",
      "      9       \u001b[36m0.8580\u001b[0m        \u001b[32m0.4201\u001b[0m       0.8459        0.4515  3.2644\n",
      "     10       \u001b[36m0.8592\u001b[0m        \u001b[32m0.4179\u001b[0m       \u001b[35m0.8462\u001b[0m        0.4517  3.0675\n",
      "     11       \u001b[36m0.8602\u001b[0m        \u001b[32m0.4159\u001b[0m       \u001b[35m0.8464\u001b[0m        0.4521  3.1939\n",
      "     12       \u001b[36m0.8610\u001b[0m        \u001b[32m0.4143\u001b[0m       \u001b[35m0.8468\u001b[0m        0.4526  3.0209\n",
      "     13       \u001b[36m0.8619\u001b[0m        \u001b[32m0.4128\u001b[0m       0.8462        0.4532  3.0769\n",
      "     14       \u001b[36m0.8627\u001b[0m        \u001b[32m0.4115\u001b[0m       0.8462        0.4538  3.1085\n",
      "     15       \u001b[36m0.8635\u001b[0m        \u001b[32m0.4103\u001b[0m       0.8465        0.4544  2.8896\n",
      "Test accuracy: 0.8585333333333334\n",
      "### FFNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.7825\u001b[0m        \u001b[32m0.5919\u001b[0m       \u001b[35m0.8488\u001b[0m        \u001b[31m0.4411\u001b[0m  5.6527\n",
      "      2       \u001b[36m0.8400\u001b[0m        \u001b[32m0.4556\u001b[0m       \u001b[35m0.8496\u001b[0m        \u001b[31m0.4000\u001b[0m  6.2455\n",
      "      3       \u001b[36m0.8499\u001b[0m        \u001b[32m0.4279\u001b[0m       \u001b[35m0.8550\u001b[0m        0.4068  5.3284\n",
      "      4       \u001b[36m0.8554\u001b[0m        \u001b[32m0.4144\u001b[0m       \u001b[35m0.8609\u001b[0m        0.4078  5.9324\n",
      "      5       \u001b[36m0.8595\u001b[0m        \u001b[32m0.4088\u001b[0m       \u001b[35m0.8679\u001b[0m        \u001b[31m0.3969\u001b[0m  5.5903\n",
      "      6       \u001b[36m0.8605\u001b[0m        \u001b[32m0.4040\u001b[0m       \u001b[35m0.8687\u001b[0m        0.4187  5.7188\n",
      "      7       \u001b[36m0.8645\u001b[0m        \u001b[32m0.4004\u001b[0m       0.8608        0.4450  5.9384\n",
      "      8       \u001b[36m0.8657\u001b[0m        \u001b[32m0.3949\u001b[0m       0.8673        \u001b[31m0.3921\u001b[0m  5.8410\n",
      "      9       0.8649        \u001b[32m0.3934\u001b[0m       \u001b[35m0.8748\u001b[0m        0.3986  6.0358\n",
      "     10       \u001b[36m0.8702\u001b[0m        \u001b[32m0.3902\u001b[0m       0.8698        0.4123  5.6180\n",
      "     11       \u001b[36m0.8709\u001b[0m        \u001b[32m0.3887\u001b[0m       \u001b[35m0.8762\u001b[0m        0.3928  5.8379\n",
      "     12       \u001b[36m0.8721\u001b[0m        \u001b[32m0.3871\u001b[0m       0.8751        0.3933  5.9377\n",
      "     13       \u001b[36m0.8734\u001b[0m        \u001b[32m0.3826\u001b[0m       \u001b[35m0.8778\u001b[0m        0.4058  5.4589\n",
      "     14       0.8734        \u001b[32m0.3775\u001b[0m       \u001b[35m0.8798\u001b[0m        0.3961  5.5648\n",
      "     15       \u001b[36m0.8745\u001b[0m        0.3825       0.8788        0.3984  5.7210\n",
      "Test accuracy: 0.8931333333333333\n",
      "### CNN\n",
      "  epoch    train_acc    train_loss    valid_acc    valid_loss     dur\n",
      "-------  -----------  ------------  -----------  ------------  ------\n",
      "      1       \u001b[36m0.8104\u001b[0m        \u001b[32m0.5317\u001b[0m       \u001b[35m0.8889\u001b[0m        \u001b[31m0.3057\u001b[0m  7.9180\n",
      "      2       \u001b[36m0.8770\u001b[0m        \u001b[32m0.3548\u001b[0m       \u001b[35m0.8979\u001b[0m        \u001b[31m0.2899\u001b[0m  8.4600\n",
      "      3       \u001b[36m0.8930\u001b[0m        \u001b[32m0.3103\u001b[0m       \u001b[35m0.9120\u001b[0m        \u001b[31m0.2476\u001b[0m  8.7328\n",
      "      4       \u001b[36m0.9010\u001b[0m        \u001b[32m0.2884\u001b[0m       \u001b[35m0.9121\u001b[0m        0.2549  8.7084\n",
      "      5       \u001b[36m0.9061\u001b[0m        \u001b[32m0.2731\u001b[0m       \u001b[35m0.9153\u001b[0m        \u001b[31m0.2409\u001b[0m  8.2132\n",
      "      6       \u001b[36m0.9121\u001b[0m        \u001b[32m0.2610\u001b[0m       0.9117        0.2664  8.4597\n",
      "      7       \u001b[36m0.9154\u001b[0m        \u001b[32m0.2504\u001b[0m       \u001b[35m0.9179\u001b[0m        \u001b[31m0.2391\u001b[0m  8.5581\n",
      "      8       \u001b[36m0.9185\u001b[0m        \u001b[32m0.2428\u001b[0m       0.9143        0.2494  8.1528\n",
      "      9       \u001b[36m0.9208\u001b[0m        \u001b[32m0.2348\u001b[0m       \u001b[35m0.9184\u001b[0m        0.2415  8.8257\n",
      "     10       \u001b[36m0.9234\u001b[0m        \u001b[32m0.2317\u001b[0m       0.9182        0.2498  8.1160\n",
      "     11       \u001b[36m0.9243\u001b[0m        \u001b[32m0.2288\u001b[0m       0.9172        0.2472  7.9041\n",
      "     12       0.9232        0.2306       0.9183        0.2617  8.7895\n",
      "     13       \u001b[36m0.9260\u001b[0m        \u001b[32m0.2252\u001b[0m       0.9136        0.2523  8.5135\n",
      "     14       \u001b[36m0.9290\u001b[0m        \u001b[32m0.2194\u001b[0m       0.9146        0.2508  8.8612\n",
      "     15       0.9282        \u001b[32m0.2182\u001b[0m       0.9184        0.2503  7.9617\n",
      "Test accuracy: 0.9464833333333333\n"
     ]
    }
   ],
   "source": [
    "# train with more epochs\n",
    "for net_module in [LinNN, FFNN, CNN]:\n",
    "    if net_module == CNN:\n",
    "        net = eval_net(net_module, X_train.reshape(-1, 1, 28, 28), y_train, X_test.reshape(-1, 1, 28, 28), y_test, max_epochs=15)\n",
    "    else:\n",
    "        net = eval_net(net_module, X_train, y_train, X_test, y_test, max_epochs=15)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/exercises/6_analyze_toydata.ipynb
+++ b/exercises/6_analyze_toydata.ipynb
@@ -0,0 +1,877 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Putting it all together: Analyzing a Toy Dataset\n",
    "\n",
    "In this example, we're working with an artificial dataset from a production process, where a small fraction of the produced products are faulty. The task is to predict from the conditions under which a product is to be produced, whether the product will be ok or scrap."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# first load some libraries that are needed later\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from scipy.stats import pearsonr\n",
    "# machine learning stuff\n",
    "from sklearn.metrics import accuracy_score, balanced_accuracy_score\n",
    "from sklearn.preprocessing import OneHotEncoder, StandardScaler\n",
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import GridSearchCV, train_test_split\n",
    "from sklearn import tree\n",
    "# interactive plotting (parallel coordinate plot)\n",
    "import plotly.express as px\n",
    "# suppress unnecessary warnings\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "\n",
    "# these are some 'magic' commands for the notebook to automatically load updated libraries\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading the data\n",
    "\n",
    "The data is available as a `.csv` file, which stands for \"comma-separated values\", which is just a text file with one data point per row. You can export this kind of format from Excel (thereby making the data easier to share) and then read it in with the `pandas` library.\n",
    "\n",
    "The toy dataset consists of production data for 3 different types of products. The variables in the dataset are:\n",
    "- `height`, `width`, `depth`: dimensions of the product\n",
    "- `product`: categorical variable with values `1`, `5`, or `17` depending on the type of product that was produced\n",
    "- `faulty`: binary variable that indicates if the produced product is faulty (`1`) or ok (`0`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we are given the dataset toydata1.csv\n",
    "# load the csv file into a dataframe with pandas\n",
    "df = pd.read_csv(\"../data/toydata1.csv\")\n",
    "# look at the raw data (first 5 rows)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# more concise overview (e.g. how many values per column, mean of the values in each column, etc)\n",
    "df.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploratory Analysis\n",
    "\n",
    "To get a better feeling for what we're dealing with here, we examine the different variables in more detail.\n",
    "\n",
    "- Do we have an equal amount of samples for each of the three product types or is one of the subgroups underrepresented?\n",
    "- In what ranges are the features and are there differences amongst the three products?\n",
    "- Are there correlations between the variables?\n",
    "- Can we already identify some variables that tell us that a product is faulty?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot histograms for the different variables\n",
    "df.hist(bins=50, layout=(1, 5), figsize=(15, 2));"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "These histograms show the distribution of the values for each variable, i.e., on the x-axis you see the range of values and on the y-axis how many samples have a value in the respective interval.\n",
    "\n",
    "**Take a second to examine these histograms - what do they already tell you?**\n",
    "- Do we have to worry about underrepresented subgroups due to the different product types?\n",
    "- Where might the 3 peaks in the distribution of the depth variable come from?\n",
    "- What do you notice about the height and width variables?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# verify counts for the categorical variable\n",
    "df[\"product\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# see if the variation in the depth variable is related to the different product types\n",
    "plt.figure()\n",
    "colors = [\"r\", \"b\", \"g\"]\n",
    "# plot one histogram per product type using different colors\n",
    "for i, prod in enumerate(sorted(df[\"product\"].unique())):\n",
    "    plt.hist(df[\"depth\"][df[\"product\"] == prod], bins=20, color=colors[i], alpha=0.7, label=f\"product {prod}\")\n",
    "plt.legend()\n",
    "plt.xlabel(\"depth\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look at the correlation matrix to see the correlations between all variables\n",
    "# for more info on what these numbers mean see here: https://en.wikipedia.org/wiki/Correlation_and_dependence\n",
    "corr_mat = df.corr()\n",
    "# uncomment the part below to see the table in color\n",
    "corr_mat #.style.background_gradient(cmap='coolwarm', axis=None).set_precision(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We've already seen that the depth variable and product variable are connected, which explains their high correlation. The height and width variables also show a fairly high correlation of 0.72 and we had already seen that they also have very similar looking histograms, so lets investigate this further."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# examine the correlation between height and width in more detail with a scatter plot\n",
    "plt.figure(figsize=(5.5, 5))\n",
    "plt.scatter(df[\"height\"], df[\"width\"], alpha=0.3)\n",
    "plt.xlabel(\"height\")\n",
    "plt.ylabel(\"width\")\n",
    "plt.title(f\"Correlation: {pearsonr(df['height'], df['width'])[0]:.3f}\");  # just compute the same correlation again"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Questions:**\n",
    "- If all that someone had told you was that two variables have a linear correlation of 0.7, is this the scatter plot that you would have imagined for the two variables? (You might also want to look at the Wikipedia article again for some other example plots)\n",
    "- Why is the correlation coefficient for these two variables so large?\n",
    "- What would you expect the correlation coefficient to be if you only consider the large blob in the middle?\n",
    "\n",
    "In reality, it often happens that two variables seem to be perfectly correlated (i.e., they have a correlation coefficient of (almost) 1), but when you look closer, then this is just due to the fact that, for example, two sensors are off at the same time, but for the part where they're on, they actually aren't giving redundant values. Therefore be careful before throwing away \"rendundant\" variables and always verify the correlation with a scatter plot!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now check if these variables already give a hint on how to identify the faulty products\n",
    "# (they both also had a fairly high negative correlation with the faulty variable)\n",
    "plt.figure()\n",
    "plt.scatter(df[\"height\"], df[\"width\"], c=df[\"faulty\"], alpha=0.3)  # color the points based on the faulty variable\n",
    "plt.xlabel(\"height\")\n",
    "plt.ylabel(\"width\")\n",
    "plt.colorbar()\n",
    "# and check what the correlation coefficient is without the (0, 0) points\n",
    "plt.title(f\"Correlation: {pearsonr(df['height'][df['height'] > 0], df['width'][df['width'] > 0])[0]:.3f}\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Clearly, not all faulty products are equal: some are within the \"regular\" data (i.e., the purple points), while some are outliers at (0, 0). \n",
    "\n",
    "The department that gave us the data tells us that the points where height=width=0 are products where something went wrong during production and the process was aborted. However, instead of marking the respective values as `NaN`, this was recorded by setting some of the variables to \"impossible\" values. Real data is just messy like that."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make an interactive parallel coordinate plot \n",
    "# (make sure you're using a modern browser for this, i.e., not the Internet Explorer!)\n",
    "# (works with pandas as well, but doesn't look that great: pd.plotting.parallel_coordinates(df, \"faulty\"))\n",
    "fig = px.parallel_coordinates(df, color=\"faulty\")\n",
    "fig\n",
    "# each line corresponds to one sample, where the indivdual values for each variable are marked\n",
    "# at the respective axis and then these values are connected by a line\n",
    "# -> you can select parts of the samples by clicking and draging the mouse over one of the axis (when you see a cross)\n",
    "# e.g., try to select only those samples that do not have a height and width of 0\n",
    "# (a click on the selection removes it again, you can also drag the axis to change their order)\n",
    "# do you notice any patterns?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Supervised Learning\n",
    "\n",
    "Now that we've become more familiar with the dataset, it's time to tackle the real task, i.e., to try to predict whether a product will be faulty. This is a classification problem (each product either belongs to the class \"faulty\" or the class \"ok\", there is no in between)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# \"product\" is a categorical variable; for it to be handled correctly,\n",
    "# we have to transform it into a one-hot encoded vector\n",
    "e = OneHotEncoder(sparse=False, categories='auto')\n",
    "ohe = e.fit_transform(df[\"product\"].to_numpy()[:, None])\n",
    "df = df.join(pd.DataFrame(ohe, columns=[f\"product_{i}\" for i in e.categories_[0]], index=df.index))\n",
    "df.head()  # notice the additional columns with zeros and a one"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# from the dataframe we now extract our features ...\n",
    "feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\"]\n",
    "X = df[feature_cols].to_numpy()  # convert df into a numpy array\n",
    "# ... and the vector with labels\n",
    "y = df[[\"faulty\"]].to_numpy()\n",
    "# to evaluate our prediction model, we need to split off a test dataset\n",
    "# later we will use the train_test_split function from sklearn to do this, \n",
    "# but this just goes to show that there is no magic behind it\n",
    "np.random.seed(10)\n",
    "idx = np.random.permutation(len(df))  # shuffled range of values from 0 to len(df)\n",
    "train_idx = idx[:2000]  # 2/3 of the samples are in the training set\n",
    "test_idx = idx[2000:]\n",
    "X_train = X[train_idx]  # pick out the rows from X corresponding to these indices\n",
    "X_test = X[test_idx]\n",
    "y_train = y[train_idx]\n",
    "y_test = y[test_idx]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# see how imbalanced the label distribution in the training and test sets is\n",
    "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
    "print(f\"Fraction of ok items in test set: {1-np.mean(y_test):.3f}\")\n",
    "# and check the (balanced) accuracy for a stupid baseline model that always predicts zeros\n",
    "# (notice how the value for the accuray is the same as the fraction of ok items above)\n",
    "print(\"----- Stupid baseline (always predict 'ok'): -----\")\n",
    "print(f\"Accuracy on training data: {accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Accuracy on test data: {accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")\n",
    "# since we have a very unbalanced class distribution in this dataset, the balanced accuracy\n",
    "# is the evaluation metric that we actually care about"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# let's try a (shallow) decision tree!\n",
    "clf = tree.DecisionTreeClassifier(max_depth=2, random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "# same evaluation as for the stupid baseline above\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Questions:** \\\n",
    "Have a look at the values for (balanced) accuracy and compare them to the scores obtained with the stupid baseline: Do you think we're on the right track, i.e., does this seem like a useful model?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now plot the tree\n",
    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The decision tree has its root at the top (where you start) and the leaves (i.e., those nodes that don't branch off anymore) at the bottom (where you stop and make the final prediction). Each node in the tree shows in the first line the variable based on which the next split is made incl. the threshold value (except for leaf nodes), then the current Gini impurity (i.e., how homogeneous the labels of all the samples that ended up in this node are; this is what the decision tree internally optimizes, i.e., notice how the value gets smaller on at least one side after a split), then the fraction of samples that ended up in this node, and the distribution of samples into the different classes, as well as the class that would be predicted for a sample at this point.\n",
    "\n",
    "**Questions:** \\\n",
    "Have a look at the tree and the decisions that are made in it: What has the decision tree actually learned, i.e., which samples does it classify as faulty and which as ok? Does this model help us on our quest to identify production conditions that result in faulty products?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# maybe we just need to give the tree the freedom to make more splits? (i.e., increase its depth)\n",
    "clf = tree.DecisionTreeClassifier(max_depth=100, random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Questions:** \\\n",
    "Is this a better model? If anything, is the model over- or underfitting?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# when the tree is too large (or you're using a random forest),\n",
    "# check the feature importances instead of plotting the tree\n",
    "dict(zip(feature_cols, clf.feature_importances_))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# now let's do what we probably should have done in the beginning and \n",
    "# remove the outliers (i.e., keep only samples with a height > 0)\n",
    "df_new = df[df[\"height\"] > 0.]\n",
    "# create a train/test split again, this time using the sklearn function\n",
    "X_train, X_test, y_train, y_test = train_test_split(df_new[feature_cols].to_numpy(), \n",
    "                                                    df_new[[\"faulty\"]].to_numpy(), \n",
    "                                                    test_size=0.33, random_state=15)\n",
    "# see how imbalanced the label distribution in the training and test sets is\n",
    "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
    "print(f\"Fraction of ok items in test set: {1-np.mean(y_test):.3f}\")\n",
    "# and what the stupid baselien is now (since we've removed only 'faulty' points, \n",
    "# the class distributions are even more unbalanced and the accuracy even higher)\n",
    "print(\"----- Stupid baseline (always predict 'ok'): -----\")\n",
    "print(f\"Accuracy on training data: {accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Accuracy on test data: {accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# decision tree on data without outliers\n",
    "clf = tree.DecisionTreeClassifier(max_depth=3, random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Questions:** \\\n",
    "What do you think of the model now?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the tree\n",
    "plt.figure(figsize=(15, 10))\n",
    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);\n",
    "# notice how in the leaf nodes where the tree predicts \"faulty\", there are only very few data points"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Garbage in, garbage out...\n",
    "\n",
    "Clearly, we're missing some important information, as we are unable to identify the non-outlier faulty products. I.e., we need more data (not necessarily more samples, but certainly more features).\n",
    "\n",
    "So we go back to the person that gave us the data and ask if they have an idea what else might be causing the products to break and if there are additional sensor measurements available that we could look at. They give us a new dataset `toydata2.csv`, which additionally contains the variable `temp`, which indicates the temperature at which a product was produced."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load this new data\n",
    "df = pd.read_csv(\"../data/toydata2.csv\")\n",
    "df.head()  # same as before just an additional column"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# look at the variables again -> just like depth, temp has 3 peaks in the distribution\n",
    "df.hist(bins=50, layout=(1,6), figsize=(15,2));"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# see if the variation in temp are indeed related to the different products\n",
    "plt.figure()\n",
    "colors = [\"r\", \"b\", \"g\"]\n",
    "for i, prod in enumerate(sorted(df[\"product\"].unique())):\n",
    "    plt.hist(df[\"temp\"][df[\"product\"] == prod], bins=20, color=colors[i], alpha=0.7, label=f\"product {prod}\")\n",
    "plt.legend()\n",
    "plt.xlabel(\"temp\");"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# make another interactive parallel coordinates plot\n",
    "columns = [\"height\", \"width\", \"depth\", \"product\", \"temp\", \"faulty\"]\n",
    "fig = px.parallel_coordinates(df[columns], color=\"temp\")\n",
    "fig"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "By clicking and dragging on the different axis, select the data such that you remove the outliers (i.e., keep only samples with height/width > 0) and then select the faulty products (i.e., with faulty = 1).\n",
    "\n",
    "**Questions:** \\\n",
    "Do you notice any patterns? How would you explain to the stakeholders why some of their products are faulty?\n",
    "\n",
    "(In this case, we can derive the relevant insights already from the plot. However, in real problems, the solution is usually not this obvious, so lets try to see how we could also solve this with ML.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the categorical column again - this time using pandas directly\n",
    "df = pd.concat([df, pd.get_dummies(df[\"product\"], prefix=\"product\")], axis=1)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove outliers again\n",
    "df_new = df[df[\"height\"] > 0.]\n",
    "# let's try with temp as an additional feature\n",
    "feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\", \"temp\"]\n",
    "X = df_new[feature_cols].to_numpy()\n",
    "y = df_new[[\"faulty\"]].to_numpy()\n",
    "# split into train/test sets again\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)\n",
    "# see how imbalanced the label distribution in the training and test sets is\n",
    "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
    "print(f\"Fraction of ok items in test set: {1-np.mean(y_test):.3f}\")\n",
    "# and check the stupid baseline again (this is the same as before since the data contains the same samples)\n",
    "print(\"----- Stupid baseline (always predict 'ok'): -----\")\n",
    "print(f\"Accuracy on training data: {accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Accuracy on test data: {accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, np.zeros_like(y_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, np.zeros_like(y_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train a decision tree again (the parameters here were set as an initial guess\n",
    "# based on our understanding of the problem as well as the decision tree model)\n",
    "clf = tree.DecisionTreeClassifier(max_depth=6, min_samples_leaf=50, class_weight=\"balanced\", random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Questions:** \\\n",
    "What do you think of the model now? If anything, is the model over- or underfitting?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the tree\n",
    "plt.figure(figsize=(20, 15))\n",
    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As you can see, the tree is quite big and therefore also more tedious to interpret. Additionally, we see that many of the splits right before the leaf nodes are made without any change in the predicted class (i.e., all the nodes remain orange). This happens, because the tree itself only cares about the Gini impurity, which indeed still decreases after these splits. However, since this is not helpful for us, lets prune on the tree by cutting off these unnecessary splits, which can be done by setting the parameter `ccp_alpha`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# prune the tree by setting ccp_alpha\n",
    "clf = tree.DecisionTreeClassifier(max_depth=6, min_samples_leaf=50, class_weight=\"balanced\", ccp_alpha=0.01, random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")\n",
    "# plot the graph\n",
    "plt.figure(figsize=(15, 10))\n",
    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Notice how the (balanced) accuracy stayed the same after the pruning.\n",
    "\n",
    "=> Look at this pruned tree and understand which decisions are made (e.g., manually make the same splits on the parallel coordinates plot), i.e., verify that the tree is reaching the same conclusion as we did before."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Hyperparameter Tuning\n",
    "\n",
    "We started out with some initial hyperparameter settings for the decision tree, which already gave us quite good results. However, lets see if we can do even better by systematically testing different hyperparameter combinations, i.e., use a grid search with cross-validation to find an optimal value for `max_depth` and `min_samples_leaf`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# to use a grid search, we first need to instantiate our model (including the settings we know we want to use)\n",
    "clf = tree.DecisionTreeClassifier(class_weight=\"balanced\", ccp_alpha=0.01, random_state=1)\n",
    "# additionally, we need to define the values we want to try for each parameter \n",
    "# (keys in the dict must match the name of the model parameter!)\n",
    "params = {\n",
    "    \"max_depth\": [2, 3, 4, 5, 6, 7, 8],\n",
    "    \"min_samples_leaf\": [1, 5, 10, 25, 50, 75, 100, 125]\n",
    "}\n",
    "# then pass both the model and the parameter values into the grid search\n",
    "# normally, the grid search would use the internal .score() function of the model to select the best parameters,\n",
    "# however, since for a classifier this is the accuracy, we here need to tell the grid search that\n",
    "# it should select the best model based on the balanced accuracy instead\n",
    "gs = GridSearchCV(clf, params, scoring='balanced_accuracy')\n",
    "# the grid search object then can be used like all the other sklearn models\n",
    "gs.fit(X_train, y_train)\n",
    "# after it is done, we can check which were the best parameter values\n",
    "# -> max_depth=5 is what the tree before after pruning had as well\n",
    "# -> min_samples_leaf=1 does not seem like a good choice \n",
    "# (=> always look at the results for all parameter combinations (as we do below), don't just trust the best settings)\n",
    "print(gs.best_params_)\n",
    "# and evalute this best model on test set (the grid search already trained the best model\n",
    "# on the whole dataset for us and we can call .predict() on the grid search object directly)\n",
    "print(f\"Accuracy on training data: {gs.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {gs.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, gs.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, gs.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# overall cross-validation results (lots of stuff...)\n",
    "gs.cv_results_"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# we're really just interested in the mean test scores for each parameter combination\n",
    "for i, p in enumerate(gs.cv_results_[\"params\"]):\n",
    "    print(p, gs.cv_results_[\"mean_test_score\"][i])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# plot the results as a heatmap to make it easier to see the performance differences\n",
    "plt.figure()\n",
    "plt.imshow(gs.cv_results_[\"mean_test_score\"].reshape(len(params[\"max_depth\"]), len(params[\"min_samples_leaf\"])))\n",
    "plt.colorbar()\n",
    "plt.xlabel(\"min_samples_leaf\")\n",
    "plt.ylabel(\"max_depth\")\n",
    "plt.xticks(range(len(params[\"min_samples_leaf\"])), params[\"min_samples_leaf\"])\n",
    "plt.yticks(range(len(params[\"max_depth\"])), params[\"max_depth\"])\n",
    "plt.title(\"Grid Search Results: Balanced Accuracy\");"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** This plot helps us do two things:\n",
    "1. Verify that the parameter search was exhaustive, i.e., that we've covered a good range of values for each parameter such that it is unlikely that we've missed the best settings in our search.\n",
    "2. Select the actual parameter values that we want to use for the final model (instead of blindly trusting the values that the grid search had selected for us): notice how with a depth of 5 or greater, all trees with a `min_samples_leaf` setting of 50 or less have the same performance and the grid search simply picked the first model with the best performance. However, as we know a decision tree with a `min_samples_leaf` setting of 1 could in theory memorize individual points, which is not what we want (although this is unlikely with a depth of only 5 and pruning). Therefore, to ensure that we really get robust results, we should instead choose those parameter settings that result in the most regularized model that still produces good results, i.e., in this case a low value for `max_depth` (5) and a high value for `min_samples_leaf` (50).\n",
    "\n",
    "\n",
    "### Using a Logistic Regression Model\n",
    "\n",
    "Now that we've obtained very good results with a decision tree, lets see if we can do equally well on this dataset with a linear model (i.e., a logistic regression model, since we have a classification problem)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try a different classifier: logistic regression\n",
    "y_train, y_test = y_train.flatten(), y_test.flatten()  # otherwise the model will complain about the shapes\n",
    "# first, try the model with the default parameter settings\n",
    "clf = LogisticRegression()\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# unbalanced class distributions => set parameter class_weight!! \n",
    "# (most sklearn classifiers have this parameter - use it!)\n",
    "clf = LogisticRegression(class_weight=\"balanced\", random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The performance is still a lot lower than what we got with a decision tree... Furthermore, you saw in both cases that the model threw a `ConvergenceWarning`. While this isn't too tragic (usually the results are still quite good), in many cases this warning occurs when the data isn't normally distributed (i.e., violates the model's assumptions) and the results often get better when you transform the data accordingly. Therefore, we now use the `StandardScaler` to ensure each feature has a mean of 0 and a standard deviation of 1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# scale the data\n",
    "scaler = StandardScaler()\n",
    "# training data: fit & transform \n",
    "# (fit: compute mean and std of each feature; transform: subtract mean from each feature and divide by std)\n",
    "X_train = scaler.fit_transform(X_train)\n",
    "# test data: only transform, so the data is comparable!\n",
    "X_test = scaler.transform(X_test)\n",
    "# try logreg again -> much better!\n",
    "clf = LogisticRegression(class_weight=\"balanced\", random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try L1 regularization for feature selection\n",
    "# (the parameter C determines the strength of the regularization -> smaller values = more regularization)\n",
    "clf = LogisticRegression(class_weight=\"balanced\", penalty='l1', C=0.1, solver='liblinear', random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the coefficients tell us why an item was classified as faulty:\n",
    "# higher temperatures lead to faulty items, but we have different offsets for the different products, \n",
    "# i.e., product 3 can handle higher temperatures than product 1\n",
    "# -> features with small coefficients can be removed\n",
    "dict(zip(feature_cols, clf.coef_[0]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# do a manual feature selection based on the coefficients of the L1 regularized model\n",
    "feature_cols = [\"product_1\", \"product_17\", \"temp\"]\n",
    "# construct a new feature matrix and create the train/test split with this new matrix again\n",
    "X = df_new[feature_cols].to_numpy()\n",
    "y = df_new[\"faulty\"].to_numpy()\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)\n",
    "# and don't forget to scale the data again!\n",
    "scaler = StandardScaler()\n",
    "X_train = scaler.fit_transform(X_train)\n",
    "X_test = scaler.transform(X_test)\n",
    "# train the model again with most of the default parameter setting\n",
    "clf = LogisticRegression(class_weight=\"balanced\", random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")\n",
    "# the performance gets even a tiny bit better, i.e., sometimes less data can be more,\n",
    "# because additional features can also introduce noise patterns on which a model might overfit"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# by default the logreg model uses L2 regularization with C=1.\n",
    "# since now we've manually selected the features and know that all of these are important for the task\n",
    "# we can set C to a higher value to use less regularization\n",
    "clf = LogisticRegression(class_weight=\"balanced\", penalty='l2', C=1000., random_state=1)\n",
    "clf = clf.fit(X_train, y_train)\n",
    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "While it was a bit more work to set up the logistic regression model appropriately, incl. extra data preprocessing steps, we now even got a balanced accuracy on the test set that is slightly higher than that of the decision tree (0.938 instead of 0.935)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/exercises/7_hard_drive_failures.ipynb
+++ b/exercises/7_hard_drive_failures.ipynb
@@ -0,0 +1,176 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Predicting hard drive failures\n",
    "\n",
    "**Scenario:** In a data center with many hard drives, occasionally, one of these drives will fail. To prevent possible data loss, it's a data scientist's (i.e. your) task to predict as soon as possible in advance when a drive might fail.\n",
    "\n",
    "The original data can be downloaded from [backblaze](https://www.backblaze.com/b2/hard-drive-test-data.html).\n",
    "It was already cleaned and restructured for your convenience (see `data/hdf_data`). This preprocessing process included:\n",
    "\n",
    "- removing NaNs\n",
    "- keeping only data from the most frequent drive model (to avoid artifacts due to differences in SMART recordings)\n",
    "- creating a dataframe where each drive is one data point with the information whether it failed or not (= class label)\n",
    "\n",
    "The original data consisted of daily SMART statistics measurements for all drives at that time installed in the data center (i.e. for each drive until it failed). Your task is to build a binary classification model, which receives the measurements from all drives every day and should predict which of these drives are likely to fail in the next hours or days. To train such a model, you are given a simplified dataset, which includes only a single measurement per drive, either from some random time point during the year if the drive did not fail (class 0), or the SMART statistics on the day the drive failed (csv files ending in `_0`) or from a few days before the drive failed (e.g. `_1` for 1 day before it failed, `_7` for 7 days, etc). This means by using e.g. the data from `df_2016_0.csv` you can build a model that can predict whether a drive will fail today, while a model trained on the data in `df_2016_7.csv` can predict whether a drive will fail one week from now. (Normally, you would make use of the measurements over time and e.g. track maximum values up to now or do some other feature engineering to improve the performance, but for the sake of simplicity we only use these individual snapshots here.) \n",
    "\n",
    "Use the data from 2016 for training the model and tuning hyperparameters and the data from 2017 for the final evaluation to get a realistic performance estimate of how well the model can handle some slight data drifts etc.\n",
    "\n",
    "More about the SMART attributes used as features in this problem can be found on [Wikipedia](https://en.wikipedia.org/wiki/S.M.A.R.T.)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.dummy import DummyClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import balanced_accuracy_score\n",
    "# don't get unneccessary warnings\n",
    "import warnings\n",
    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
    "\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data with the SMART statistics of the drives.\n",
    "# with the data ending in _0, we can learn to predict if a drive has failed or is working properly;\n",
    "# try e.g. df_2016_7.csv to predict failures a week in advance\n",
    "df = pd.read_csv(\"../data/hdf_data/df_2016_0.csv\")\n",
    "# have a look at what we've loaded\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# construct training and test data from this dataframe - use only the smart statistics as features\n",
    "feat_cols = [c for c in df.columns if c.startswith(\"smart\")]\n",
    "X = df[feat_cols].to_numpy()\n",
    "y = df[\"failure\"].to_numpy()\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)\n",
    "# see how imbalanced the label distribution in the training and test sets is\n",
    "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
    "print(f\"Fraction of ok items in test set: {1-np.mean(y_test):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def eval_clf(clf, X_train, y_train, X_test, y_test):\n",
    "    \"\"\"\n",
    "    Function to evaluate a trained classifier: prints accuracy and balanced accuracy scores.\n",
    "    \n",
    "    Inputs:\n",
    "        - clf: the trained classifier\n",
    "        - X_train, y_train: the training data\n",
    "        - X_test, y_test: the test data\n",
    "    \"\"\"\n",
    "    print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
    "    print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
    "    print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
    "    print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train a dummy model\n",
    "clf = DummyClassifier(strategy=\"most_frequent\")\n",
    "clf = clf.fit(X_train, y_train)\n",
    "# evaluate the model\n",
    "# later, make sure to pass the correct training and test data, e.g., in case you scaled your data etc.\n",
    "eval_clf(clf, X_train, y_train, X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "-------------------------------------------------------------------------------------\n",
    "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project.\n",
    "\n",
    "### (Suggested) Steps\n",
    "\n",
    "#### a) Get a better understanding of the problem\n",
    "- Create a t-SNE plot of the data (from the features; color the dots in the scatter plot with the target variable): Do you think a classification model will do well on this data?\n",
    "- Look at the variables in more detail: Are they normally/uniformly distributed?\n",
    "- Try different kinds of models in place of the `DummyClassifier` (e.g. decision tree, linear model, SVM) and play around with the hyperparameters a little bit to get a better feeling for the problem.\n",
    "- Would outlier detection make sense here? Why (not)?\n",
    "\n",
    "#### b) Improve the prediction performance\n",
    "- Try different normalizations of the data (e.g. using the `StandardScaler`): How do the t-SNE plot and performance of the different models change? Why does a decision tree not improve? Can you apply some other transformations to make the features more normally distributed?\n",
    "- Are any variables highly correlated? How does the performance change when you remove some features? Do you have any other feature engineering ideas? Again observe how your previous results change as you modify the input features!\n",
    "- Systematically find optimal hyperparameters for your models using a `GridSearchCV` and decide what you want to use as your final model.\n",
    "\n",
    "#### c) Final evaluation & model interpretation\n",
    "- Try to better understand what your model is doing: Which variables are the most predictive of failures?\n",
    "- Predict failures multiple days in advance by training and evaluating your models on the other csv files from 2016 (e.g. `df_2016_7.csv` for 7 days before the drive fails). How many days in advance is a reliable prediction possible (e.g. plot \"days before failure\" vs \"balanced accuracy\")?\n",
    "- Evaluate your final model (trained on a complete dataframe from 2016) on the respective data from 2017.\n",
    "\n",
    "#### d) Presentation of results\n",
    "Clean up your code & think about which results you want to present + the story they tell:\n",
    "- What is the best model that you found & its performance?\n",
    "- Which preprocessing steps had the most impact on the performance?\n",
    "- What worked and what didn't for the different models?\n",
    "- Which of the SMART statistics indicate that a drive will fail?\n",
    "- How many days in advance can you predict a hard drive failure?\n",
    "- How well does your model perform on the new data from 2017?\n",
    "- What have you learned in this case study? Did any of the results surprise you?\n",
    "-------------------------------------------------------------------------------------"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }
--- a/requirements.txt
+++ b/requirements.txt
@@ -0,0 +1,10 @@
 numpy>=1.18.5
 pandas>=1.2.1
 scipy>=1.4.1
 scikit-learn>=0.24.1
 matplotlib>=3.3.1
 pillow>=8.1.0
 plotly>=4.9.0
 torch>=1.6.0
 torchvision>=0.8.1
 skorch>=0.9.0
--- a/test_installation.ipynb
+++ b/test_installation.ipynb
@@ -0,0 +1,78 @@
 {
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Test if installation was successful"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check versions of the libraries\n",
    "# they should not be too much behind the ones in the comments...\n",
    "import numpy\n",
    "print(\"numpy\", numpy.__version__)        # >= 1.18.5\n",
    "import pandas\n",
    "print(\"pandas\", pandas.__version__)      # >= 1.2.1\n",
    "import scipy\n",
    "print(\"scipy\", scipy.__version__)        # >= 1.4.1\n",
    "import sklearn\n",
    "print(\"sklearn\", sklearn.__version__)    # >= 0.24.1\n",
    "import matplotlib\n",
    "print(\"matplotlib\", matplotlib.__version__)  # >= 3.3.1\n",
    "import PIL\n",
    "print(\"pillow\", PIL.__version__)         # >= 8.1.0\n",
    "import plotly\n",
    "print(\"plotly\", plotly.__version__)      # >= 4.9.0\n",
    "print(\"Congratulations! Your installation of the basic libraries was successful!\")\n",
    "# the following libraries are needed for the neural network example \n",
    "# (if you're working with the recommended pytorch, not keras/tensorflow)\n",
    "# if you have a computer with a (CUDA-enabled Nvidia) GPU, checkout this site:\n",
    "# https://pytorch.org/get-started/locally/\n",
    "# to install the correct version that can utilize the capabilities of your GPU\n",
    "# (if you're working on a normal laptop and you don't know what GPU means,\n",
    "# don't worry about it and just execute `$ pip install torch torchvision skorch`)\n",
    "import torch\n",
    "print(\"torch\", torch.__version__)        # >= 1.6.0\n",
    "import torchvision\n",
    "print(\"torchvision\", torchvision.__version__)  # >= 0.8.1\n",
    "import skorch\n",
    "print(\"skorch\", skorch.__version__)      # >= 0.9.0\n",
    "print(\"Congratulations! Your installation of the neural network libraries was successful!\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
 }