diff --git a/README.md b/README.md index 462118f..b2dc5f7 100644 --- a/README.md +++ b/README.md @@ -16,9 +16,11 @@ Have fun! ## Course Overview -For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. +For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. Additionally, you should take notes in the worksheet while working through the materials. (You can also find the course syllabus on the last page of the [course description](/course_description.pdf), which explicitly lists all the sections of the book for each block.) +**Important:** Please take note of all questions that arise while working on the materials (e.g., both in the worksheet, as well as the different notebooks, you'll be prompted to answer several questions; if the answer to any of them is still unclear after reading the respective sections of the book, please include them in this list). At the beginning of each group session we'll collect all questions that you still have and discuss them. + --- ### Part 1: Getting started: What is ML? @@ -79,7 +81,6 @@ For an optimal learning experience, the chapters from the [machine learning book - [ ] Read the whole chapter: ["Avoiding Common Pitfalls"](https://franziskahorn.de/mlbook/_avoiding_common_pitfalls.html) ##### Block 4.2: -- [ ] Answer [Quiz 5](https://forms.gle/uZGj54YQHKwckmL46) - [ ] Work through [Notebook 6: analyze toy dataset](/exercises/6_analyze_toydata.ipynb) ##### Block 4.3: @@ -94,6 +95,7 @@ For an optimal learning experience, the chapters from the [machine learning book - [ ] Work through [Notebook 8: RL gridmove](/exercises/8_rl_gridmove.ipynb) ##### Block 5.2: +- [ ] Answer [Quiz 5](https://forms.gle/uZGj54YQHKwckmL46) - [ ] Read the whole chapter: ["Conclusion"](https://franziskahorn.de/mlbook/_conclusion.html) - [ ] Complete the exercise: ["Your next ML Project"](/exercise_your_ml_project.pdf) diff --git a/exercise_your_ml_project.pdf b/exercise_your_ml_project.pdf index 2d0aa65..8e39d9f 100644 Binary files a/exercise_your_ml_project.pdf and b/exercise_your_ml_project.pdf differ diff --git a/exercises/1_visualize_text.ipynb b/exercises/1_visualize_text.ipynb index 9abaa33..0a164c8 100644 --- a/exercises/1_visualize_text.ipynb +++ b/exercises/1_visualize_text.ipynb @@ -29,10 +29,7 @@ "import plotly.express as px\n", "# suppress unnecessary warnings\n", "import warnings\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "warnings.simplefilter(action='ignore', category=FutureWarning)" ] }, { @@ -60,6 +57,7 @@ "outputs": [], "source": [ "# check the first element in the dictionary\n", + "print(\"first key:\", list(articles.keys())[0])\n", "articles[list(articles.keys())[0]]" ] }, @@ -115,7 +113,7 @@ "# this is a more efficient way of storing data that contains a lot of 0 values\n", "# by only remembering the indices where the matrix contains non-zero values and what these values are\n", "# (since each individual paragraph contains only very few unique words, this makes a lot of sense here)\n", - "# (BUT: not all of the algorithms in sklearn can directly work with this type data, e.g. 
t-SNE!)\n", + "# (BUT: not all of the algorithms in sklearn can directly work with this type data, e.g., t-SNE!)\n", "X" ] }, @@ -137,8 +135,8 @@ "outputs": [], "source": [ "# reduce dimensionality with linear kPCA\n", - "# since tf-idf vectors are l2 normalized, the linear kernel = cosine similaritiy\n", - "# --> we use 100 components since we feed the reduced data to t-SNE later!\n", + "# since TF-IDF vectors are length (L2) normalized, the linear kernel = cosine similaritiy\n", + "# --> we use 100 components since we feed the reduced data to t-SNE later (-> not sparse)!\n", "kpca = KernelPCA(n_components=100, kernel='linear')\n", "X_kpca = kpca.fit_transform(X)\n", "print(\"Dimensionality of our data:\", X_kpca.shape)" @@ -152,7 +150,7 @@ "source": [ "# plot 2D PCA visualization\n", "# the components are ordered by their eigenvalue (largest first), i.e.,\n", - "# by taking the first 2 this is the same as if we had compute PCA with n_components=2\n", + "# by taking the first 2 this is the same as if we had computed PCA with n_components=2\n", "plt.figure()\n", "plt.scatter(X_kpca[:, 0], X_kpca[:, 1], s=2) # s: size of the dots\n", "plt.title(\"PCA embedding of paragraphs\");\n", @@ -304,7 +302,7 @@ "1. After you've computed your new kPCA embedding (without outliers), use the code below to compute a t-SNE embedding\n", "2. Then create a regular (matplotlib) and an interactive (plotly) scatter plot of the results again and explore\n", "\n", - "Notice how the paragraphs form localized clusters (while remembering that this is not a clustering algorithm, but gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task was now to classify the paragraphs (e.g. identify the correct article title for each paragraph), you could see for which articles this would be easy, and where there is overlap between the content of other articles (and you can see how these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of these mistakes as well)." + "Notice how the paragraphs form localized clusters (while remembering that this is not a clustering algorithm, but gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task was now to classify the paragraphs (e.g., identify the correct article title for each paragraph), you could see for which articles this would be easy, and where there is overlap between the content of other articles (and you can see how these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of these mistakes as well)." 
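To see how the individual steps of this notebook fit together, here is a minimal, self-contained sketch of the overall pipeline (TF-IDF → linear kernel PCA → t-SNE). The tiny `texts` list below is only a stand-in for the paragraph corpus used in the notebook, and the parameter values are just examples, not the notebook's settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import KernelPCA
from sklearn.manifold import TSNE

# stand-in for the list of paragraph texts used in the notebook
texts = [
    "the cat sat on the mat", "dogs chase cats through the garden",
    "beer is brewed from malted barley", "hops give beer its bitter taste",
    "the stock market fell sharply today", "investors worry about inflation",
    "the guitar solo was played live", "the band released a new album",
]

# TF-IDF vectors are L2-normalized by default -> a linear kernel equals the cosine similarity
X = TfidfVectorizer().fit_transform(texts)                              # sparse (n_texts, n_words)
X_kpca = KernelPCA(n_components=5, kernel="linear").fit_transform(X)    # dense (n_texts, 5)
# t-SNE can't handle sparse matrices, so it gets the dense kPCA embedding instead
X_tsne = TSNE(metric="cosine", perplexity=5, random_state=42).fit_transform(X_kpca)
print(X_tsne.shape)  # (n_texts, 2) -> coordinates for the scatter plots
```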
] }, { @@ -314,7 +312,9 @@ "outputs": [], "source": [ "# use 100D kPCA embedding, since t-SNE can't handle sparse matrices\n", - "tsne = TSNE(metric='cosine', verbose=2, random_state=42)\n", + "# (we use the \"cosine\" metric here since this works well for text,\n", + "# for other data you can leave this argument at its default value)\n", + "tsne = TSNE(metric=\"cosine\", verbose=2, random_state=42)\n", "X_tsne = tsne.fit_transform(X_kpca)\n", "print(\"Dimensionality of our data:\", X_tsne.shape)" ] diff --git a/exercises/2_image_quantization.ipynb b/exercises/2_image_quantization.ipynb index 38663c9..5a46b5d 100644 --- a/exercises/2_image_quantization.ipynb +++ b/exercises/2_image_quantization.ipynb @@ -5,7 +5,10 @@ "metadata": {}, "source": [ "# Color Quantization using K-Means\n", - "In this notebook, we want to transform a regular RGB image (where each pixel is represented as a Red-Green-Blue triplet) into a [compressed representation](https://en.wikipedia.org/wiki/Color_quantization), where each pixel is represented as a single number (color index) together with a limited color palette (RGB triplets corresponding to the color indices). " + "In this notebook, we want to transform a regular RGB image (where each pixel is represented as a Red-Green-Blue triplet) into a [compressed representation](https://en.wikipedia.org/wiki/Color_quantization), where each pixel is represented as a single number (color index) together with a limited color palette (RGB triplets corresponding to the color indices). \n", + "\n", + "Example from Wikipedia (original image and after quantization):\n", + "\"\" \"\"" ] }, { @@ -18,10 +21,7 @@ "import matplotlib.pyplot as plt\n", "from PIL import Image # library for loading image files\n", "from sklearn.cluster import KMeans\n", - "from sklearn.utils import shuffle\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "from sklearn.utils import shuffle" ] }, { @@ -68,7 +68,7 @@ "X_sample = shuffle(X, random_state=0)[:1000]\n", "# initialize k-means and set n_clusters to the number of colors you want in your image (e.g. 10)\n", "kmeans = ...\n", - "# fit the model on the data (i.e. find the cluster indices)\n", + "# fit the model on the data (i.e., find the cluster indices)\n", "kmeans.fit(X_sample)" ] }, @@ -88,7 +88,7 @@ "metadata": {}, "outputs": [], "source": [ - "# use the predict function of kmeans to compute the cluster index for each data point (i.e. 
pixel) in X\n", + "# use the predict function of kmeans to compute the cluster index for each data point (i.e., pixel) in X\n", "# (cluster indices together with the color palette would be the compressed representation of the image)\n", "cluster_idx = ...\n", "print(cluster_idx.shape) # same first dimension as X" @@ -142,17 +142,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### Heuristic to determine the number of clusters _k_\n", + "### Heuristic to determine the number of clusters *k*\n", "\n", "The objective that k-means internally optimizes is the average distance of the samples to their assigned cluster centers, i.e., it tries to find clusters such that all the points in the cluster are very close to the respective cluster center.\n", "\n", "After fitting k-means, the final value of this objective function can be computed with the `score` function on the dataset (this actually gives you the negative value, since this is more convenient for the some optimization algorithms).\n", "\n", - "We can now simply fit k-means with different settings for _k_ and observe how the value of the score function changes as we increase the number of clusters.\n", + "We can now simply fit k-means with different settings for *k* and observe how the value of the score function changes as we increase the number of clusters.\n", "\n", "#### Questions: \n", - "* What would happen (i.e. what would the score be) if you set _k_ to a very large value, e.g., the number of data points? \n", - "* Based on the plot that we compute below, what do you think might be a good value for _k_? (Of course, this will be different for every dataset, i.e., in this example, a different image might need more or less colors to look ok.)" + "* What would happen (i.e., what would the score be) if you set *k* to a very large value, e.g., the number of data points? \n", + "* Based on the plot that we compute below, what do you think might be a good value for *k*? (Of course, this will be different for every dataset, i.e., in this example, a different image might need more or less colors to look ok.)" ] }, { diff --git a/exercises/3_supervised_comparison.ipynb b/exercises/3_supervised_comparison.ipynb index 15fe063..ea8a9e9 100644 --- a/exercises/3_supervised_comparison.ipynb +++ b/exercises/3_supervised_comparison.ipynb @@ -24,10 +24,7 @@ "from sklearn.datasets import make_moons\n", "# don't get unneccessary warnings\n", "import warnings\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "warnings.simplefilter(action='ignore', category=FutureWarning)" ] }, { @@ -148,7 +145,10 @@ "source": [ "## Datasets\n", "\n", - "Here you can have a look at the 3 regression and 3 classification datasets on which we'll compare the different models. The regression dataset only has one input feature, while the classification dataset has two and the target (i.e. class label) is indicated by the color of the dots." + "Here you can have a look at the 3 regression and 3 classification datasets on which we'll compare the different models. The regression dataset only has one input feature, while the classification dataset has two and the target (i.e., class label) is indicated by the color of the dots.\n", + "\n", + "**Questions:**\n", + "- Why are the first two regression and classification datasets linear and the last ones non-linear?" 
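As a quick illustration of the number-of-clusters heuristic described in the color quantization notebook above: the loop below fits k-means for several values of *k* and records the `score` (the negative of the k-means objective). The random `X_sample` here is only a placeholder for the subsampled `(n_pixels, 3)` array of RGB values used in that notebook.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# placeholder for the subsampled (n_pixels, 3) array of RGB values from the quantization notebook
rng = np.random.default_rng(0)
X_sample = rng.random((1000, 3))

ks = [2, 5, 10, 20, 50]
scores = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_sample)
    # score() returns the negative value of the k-means objective, i.e., values closer to 0 are better;
    # with k equal to the number of data points it would reach 0
    scores.append(km.score(X_sample))

plt.plot(ks, scores, "o-")
plt.xlabel("number of clusters k")
plt.ylabel("score");
```

With real image data you would typically look for the "elbow" in this curve, i.e., the point after which adding more clusters only yields diminishing improvements.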
] }, { @@ -240,7 +240,7 @@ "source": [ "# Logistic Regression (for classification problems!):\n", "# C (> 0): regularization (smaller values = more regularization)\n", - "# penalty: change to \"l1\" to get sparse weights (only if you have many features)\n", + "# penalty: change to \"l1\" to get sparse weights (only if you have many features; needs a different solver)\n", "X, y = X_clf_2, y_clf_2\n", "model = LogisticRegression(penalty=\"l2\", C=100.)\n", "model.fit(X, y)\n", @@ -278,8 +278,8 @@ "outputs": [], "source": [ "# Decision Tree for regression:\n", - "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", - "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n", + "# max_depth (>= 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", + "# min_samples_leaf (>= 1): how many training points are in one prediction bucket\n", "X, y = X_reg_3, y_reg_3\n", "model = DecisionTreeRegressor(max_depth=2, min_samples_leaf=10)\n", "model.fit(X, y)\n", @@ -293,8 +293,8 @@ "outputs": [], "source": [ "# Decision Tree for classification:\n", - "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", - "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n", + "# max_depth (>= 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", + "# min_samples_leaf (>= 1): how many training points are in one prediction bucket\n", "X, y = X_clf_1, y_clf_1\n", "model = DecisionTreeClassifier(max_depth=2, min_samples_leaf=10)\n", "model.fit(X, y)\n", @@ -329,9 +329,9 @@ "outputs": [], "source": [ "# Random Forest for regression:\n", - "# n_estimators (> 1): how many decision trees to train (don't set this too high, gets computationally expensive)\n", - "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", - "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n", + "# n_estimators (>= 1): how many decision trees to train (don't set this too high, gets computationally expensive)\n", + "# max_depth (>= 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", + "# min_samples_leaf (>= 1): how many training points are in one prediction bucket\n", "X, y = X_reg_3, y_reg_3\n", "model = RandomForestRegressor(n_estimators=100, max_depth=2, min_samples_leaf=10)\n", "model.fit(X, y)\n", @@ -345,9 +345,9 @@ "outputs": [], "source": [ "# Random Forest for classification:\n", - "# n_estimators (> 1): how many decision trees to train\n", - "# max_depth (> 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", - "# min_samples_leaf (> 1): how many training points are in one prediction bucket\n", + "# n_estimators (>= 1): how many decision trees to train\n", + "# max_depth (>= 1): depth of the tree (i.e. how many decisions are made before the final prediction)\n", + "# min_samples_leaf (>= 1): how many training points are in one prediction bucket\n", "X, y = X_clf_2, y_clf_2\n", "model = RandomForestClassifier(n_estimators=100, max_depth=2, min_samples_leaf=10)\n", "model.fit(X, y)\n", @@ -363,7 +363,8 @@ "After reading the chapter on k-nearest neighbors, test the method here on different datasets and experiment with the hyperparameter settings.\n", "\n", "**Questions:**\n", - "- On the 3rd regression dataset for a larger number of nearest neighbors (e.g. 
20), what do you observe for the prediction at the edges of the input domain and why?" + "- On the 3rd regression dataset for a larger number of nearest neighbors (e.g., 20), what do you observe for the prediction at the edges of the input domain and why?\n", + "- Especially for binary classification problems, why does it make sense to always use an odd number of nearest neighbors?" ] }, { @@ -382,7 +383,7 @@ "outputs": [], "source": [ "# k-Nearest Neighbors for regression:\n", - "# n_neighbors (> 1): how many nearest neighbors are used for the prediction\n", + "# n_neighbors (>= 1): how many nearest neighbors are used for the prediction\n", "X, y = X_reg_3, y_reg_3\n", "model = KNeighborsRegressor(n_neighbors=10)\n", "model.fit(X, y)\n", @@ -396,9 +397,9 @@ "outputs": [], "source": [ "# k-Nearest Neighbors for classification:\n", - "# n_neighbors (> 1): how many nearest neighbors are used for the prediction\n", + "# n_neighbors (>= 1): how many nearest neighbors are used for the prediction\n", "X, y = X_clf_3, y_clf_3\n", - "model = KNeighborsClassifier(n_neighbors=12)\n", + "model = KNeighborsClassifier(n_neighbors=11)\n", "model.fit(X, y)\n", "plot_classification(X, y, model)" ] diff --git a/exercises/4_information_retrieval.ipynb b/exercises/4_information_retrieval.ipynb index 1bcfa4d..dd0faae 100644 --- a/exercises/4_information_retrieval.ipynb +++ b/exercises/4_information_retrieval.ipynb @@ -8,7 +8,7 @@ "\n", "**Idea:** Respond more quickly to customer service requests by using Natural Language Processing (NLP) and Information Retrieval to automatically suggest one or several FAQ articles or response templates given an incoming customer email to speed up the process of drafting a response.\n", "\n", - "Since personal emails are private, unfortunately there are no public datasets available with customer requests, so we instead test the methodology on a [question answering dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset contains 477 wikipedia articles, each split into paragraphs, with several questions associated with each paragraph (i.e. where the answer to the question can be found in the paragraph).\n", + "Since personal emails are private, unfortunately there are no public datasets available with customer requests, so we instead test the methodology on a [question answering dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset contains 477 wikipedia articles, each split into paragraphs, with several questions associated with each paragraph (i.e., where the answer to the question can be found in the paragraph).\n", "\n", "This means your task is, given a question, to identify the correct paragraph that contains the answer to this question." ] @@ -22,10 +22,7 @@ "import json\n", "import numpy as np\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", - "from sklearn.neighbors import NearestNeighbors\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "from sklearn.neighbors import NearestNeighbors" ] }, { @@ -91,7 +88,7 @@ "source": [ "## Assign Questions to Paragraphs\n", "\n", - "We first transform both the paragraphs as well as all questions associated with the paragraphs into TF-IDF features and then identify the most similar paragraph for a given question by computing the cosine similarity of the TF-IDF vector for the question to the TF-IDF vectors of all paragraphs to identify the most similar paragraphs, which we then return as the search results. 
" + "We first transform both the paragraphs, as well as all questions associated with the paragraphs, into TF-IDF features and then identify the most similar paragraph for a given question by computing the cosine similarity of the TF-IDF vector for the question to the TF-IDF vectors of all paragraphs to identify the most similar paragraphs, which we then return as the search results. " ] }, { @@ -116,9 +113,9 @@ "metadata": {}, "outputs": [], "source": [ - "# transform both paragraphs and questions into tf-idf features\n", + "# transform both paragraphs and questions into TF-IDF features\n", "vectorizer = TfidfVectorizer(strip_accents='unicode')\n", - "# learn internal parameters of vectorizer (vocabulary, idf weights) from known data\n", + "# learn internal parameters of vectorizer (vocabulary, IDF weights) from known data\n", "vectorizer.fit(paragraphs_corpus)\n", "# transform both datasets with the same vectorizer so they have the same feature dimensions\n", "X_pars = vectorizer.transform(paragraphs_corpus)\n", @@ -250,9 +247,9 @@ "source": [ "#### Systematic performance analysis with the hits@k metric\n", "\n", - "While above we just collected some anecdotal evidence for the performance of our method, of course before deploying this into production we should conduct a more systematic evaluation. For this, we use the _hits@k_ metric, which, for different _k_, checks whether the correct answer was within the first _k_ search results. E.g. in a Google search, is the website you're looking for the 1st result, then this would be a hit@1, or is it on the first page, then it would still count towards the hits@10.\n", + "While above we just collected some anecdotal evidence for the performance of our method, of course before deploying this into production we should conduct a more systematic evaluation. For this, we use the *hits@k* metric, which, for different *k*, checks whether the correct answer was within the first *k* search results. E.g. in a Google search, is the website you're looking for the 1st result, then this would be a hit@1, or is it on the first page, then it would still count towards the hits@10.\n", "\n", - "In our example, we check both the hits@k for the paragraphs as well as the articles, i.e., we check for every paragraph that was returned as a search result, whether that was actually the correct paragraph, or whether it at least came from the right article." + "In our example, we check both the hits@k for the paragraphs, as well as the articles, i.e., we check for every paragraph that was returned as a search result, whether that was actually the correct paragraph, or whether it at least came from the right article." ] }, { @@ -299,10 +296,14 @@ "source": [ "# Exercises\n", "\n", - "For these exercises, please work on the complete set of articles, not the subset we used for now, i.e., load the data again. To construct a single text from all paragraphs of an article, we join them together with `\"\\n\".join(list_of_paragraphs)`.\n", + "For these exercises, please work on the complete set of articles, not the subset we used for now, i.e., load the data again. To construct a single text from all paragraphs of an article, we join them together with `\"\\n\".join(list_of_paragraphs)` (i.e., one data point is now not one paragraph, but one artice).\n", "\n", "### 1. 
Find the article that is most similar to the article about \"Beer\"\n", - "To do this, compute the cosine similaritiy between the target article with the title \"Beer\" and all other articles and choose the most similar one (not counting the target article itself ;-)). Similar to how we selected a matching paragraph for each question, this can be done with the `NearestNeighbors` class from `sklearn`." + "Our dataset contains an article with the title \"Beer\". Your task is to identify another article in the dataset that is the most similar to this article about beer.\n", + "\n", + "To do this, compute the cosine similaritiy between the target article with the title \"Beer\" and all other articles and choose the most similar one (not counting the target article itself ;-)). Similar to how we selected a matching paragraph for each question, this can be done with the `NearestNeighbors` class from `sklearn`.\n", + "\n", + "**Important:** Don't search for the article closest to the word \"beer\", but to the *whole article with the title \"Beer\"*." ] }, { @@ -319,7 +320,7 @@ "article_ids = sorted(articles.keys())\n", "\n", "# get the corresponding texts of these articles by concatenating all the paragraphs of each article.\n", - "# the texts in this list are in the same order as the article titles in article_ids.\n", + "# the article texts in this list are in the same order as the article titles in article_ids.\n", "article_corpus = [\"\\n\".join([p[\"paragraph\"] for p in articles[a]]) for a in article_ids]\n", "print(len(article_corpus))" ] @@ -330,7 +331,7 @@ "metadata": {}, "outputs": [], "source": [ - "# now the article texts need to be transformed into a tf-idf feature matrix X\n", + "# transform the article texts into a TF-IDF feature matrix X\n", "X = ...\n", "\n", "# when you're done, check how many articles and feature dimensions were generated\n", @@ -344,7 +345,7 @@ "outputs": [], "source": [ "# find out the index of our target article \"Beer\" (remember the titles are in article_ids), \n", - "# so you know which row in the feature matrix X contains the corresponding tf-idf vector\n", + "# so you know which row in the feature matrix X contains the corresponding TF-IDF vector\n", "# (so that you can use this vector to get the nearest neighbors)\n" ] }, @@ -354,8 +355,15 @@ "metadata": {}, "outputs": [], "source": [ - "# find the most similar article with NearestNeighbors\n", - "\n", + "# find the most similar article with NearestNeighbors\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ "# based on the index of the most similar article, get the title of this article\n" ] }, @@ -366,7 +374,11 @@ "## Advanced Exercises\n", "\n", "### 2. Find the most similar article to \"Beer\" - without using `NearestNeighbors`!\n", - "Solve the above task without using the `NearestNeighbors` class, i.e., by computing the similarities and selecting the most similar article yourself. Since the TF-IDF feature vectors are by default length-normalized, the cosine similarity between the articles can be computed with a simple dot-product (but beware: X is by default a sparse matrix, i.e., `np.dot()` wont work, but you can also call `.dot()` at an array or sparse matrix directly). The result of the dot-product will still be sparse, but you can call `.toarray()` on a sparse matrix to convert it into a regular dense numpy array with which you can work as usual. 
Once you have a vector with similarities, the functions `np.argmax` or `np.argsort` might be helpful." + "Solve the above task without using the `NearestNeighbors` class, i.e., by computing the similarities and selecting the most similar article yourself. \n", + "\n", + "Since the TF-IDF feature vectors are by default length-normalized, the cosine similarity between the articles can be computed with a simple dot-product (but beware: X is by default a sparse matrix, i.e., `np.dot()` wont work, but you can also call `.dot()` on an array or sparse matrix directly). The result of the dot-product will still be sparse, but you can call `.toarray()` on a sparse matrix to convert it into a regular dense numpy array with which you can work as usual. \n", + "\n", + "Once you have a vector with similarities, the functions `np.argmax` or `np.argsort` might be helpful." ] }, { @@ -381,7 +393,9 @@ "metadata": {}, "source": [ "### 3. Find the two most similar articles in the set\n", - "From all articles, which two have the most similar text? What is their cosine similarity score?" + "From all articles, which two have the most similar text? What is their cosine similarity score?\n", + "\n", + "**Question:** Do you have an idea *why* these two articles might have been identified as similar?" ] }, { diff --git a/exercises/5_mnist_keras.ipynb b/exercises/5_mnist_keras.ipynb index 10acca9..8a7e1db 100644 --- a/exercises/5_mnist_keras.ipynb +++ b/exercises/5_mnist_keras.ipynb @@ -39,10 +39,7 @@ "from tensorflow.keras.datasets import mnist, fashion_mnist\n", "from tensorflow.keras import Sequential\n", "from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D\n", - "from tensorflow.keras import backend as K\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "from tensorflow.keras import backend as K" ] }, { @@ -103,6 +100,7 @@ " if use_fashion:\n", " (x_train, y_train), (x_test, y_test) = fashion_mnist_load_local_data() # fashion_mnist.load_data()\n", " else:\n", + " # might need to use mnist.load_data(path=\"mnist.npz\") when executing on Google Colab\n", " (x_train, y_train), (x_test, y_test) = mnist.load_data(path=os.path.join(os.path.abspath(\"../data/\"), \"mnist.npz\"))\n", "\n", " if K.image_data_format() == 'channels_first':\n", @@ -329,7 +327,7 @@ "y_pred = model.predict(x_test)\n", "# convert predictions to classes\n", "y_pred_classes = np.argmax(y_pred, axis=1)\n", - "print(accuracy_score(y_test, y_pred_classes))\n", + "print(accuracy_score(y_test, y_pred_classes), \"\\n\")\n", "# train multi-layer FFNN\n", "model = train_ffnn(x_train, y_train_cat)\n", "score = model.evaluate(x_test, y_test_cat, verbose=0)\n", @@ -404,7 +402,7 @@ "source": [ "## Test all models on the Fashion MNIST dataset\n", "\n", - "On the more difficult FMNIST task, the LogReg model has a much lower accuracy of 86.6%. When trained for only a single epoch, both the linear and multi-layer FFNNs have a lower accuracy (82.7 and 83.7% respectively) and only the CNN does a bit better (88.6%). " + "On the more difficult FMNIST task, the LogReg model has a much lower accuracy of 84.4% compared to the 92.6% achieved on the original MNIST dataset. When trained for only a single epoch, both the linear and multi-layer FFNNs have a lower accuracy than the LogReg model (80.5 and 81.9% respectively) and only the CNN does a bit better (86.5%). 
" ] }, { @@ -452,15 +450,15 @@ "# train LogReg classifier\n", "clf = LogisticRegression(class_weight='balanced', random_state=1, fit_intercept=True)\n", "clf.fit(x_train, y_train)\n", - "print('Test accuracy LogReg:', clf.score(x_test, y_test))\n", + "print('Test accuracy LogReg:', clf.score(x_test, y_test), \"\\n\")\n", "# train simple linear model\n", "model = train_linnn(x_train, y_train_cat)\n", "score = model.evaluate(x_test, y_test_cat, verbose=0)\n", - "print('Test accuracy Linear NN:', score[1])\n", + "print('Test accuracy Linear NN:', score[1], \"\\n\")\n", "# train multi-layer FFNN\n", "model = train_ffnn(x_train, y_train_cat)\n", "score = model.evaluate(x_test, y_test_cat, verbose=0)\n", - "print('Test accuracy FFNN:', score[1])\n", + "print('Test accuracy FFNN:', score[1], \"\\n\")\n", "# load data again (not reshaped)\n", "x_train, x_test, y_train, y_test = load_data(True)\n", "y_train_cat, y_test_cat = convert_cat(y_train, y_test)\n", @@ -595,11 +593,11 @@ "# train simple linear model\n", "model = train_linnn(x_train, y_train_cat, epochs=15)\n", "score = model.evaluate(x_test, y_test_cat, verbose=0)\n", - "print('Test accuracy Linear NN:', score[1])\n", + "print('Test accuracy Linear NN:', score[1], \"\\n\")\n", "# train multi-layer FFNN\n", "model = train_ffnn(x_train, y_train_cat, epochs=15)\n", "score = model.evaluate(x_test, y_test_cat, verbose=0)\n", - "print('Test accuracy FFNN:', score[1])\n", + "print('Test accuracy FFNN:', score[1], \"\\n\")\n", "# load data again (not reshaped)\n", "x_train, x_test, y_train, y_test = load_data(True)\n", "y_train_cat, y_test_cat = convert_cat(y_train, y_test)\n", diff --git a/exercises/5_mnist_torch.ipynb b/exercises/5_mnist_torch.ipynb index 77d6d94..b9b6218 100644 --- a/exercises/5_mnist_torch.ipynb +++ b/exercises/5_mnist_torch.ipynb @@ -38,10 +38,7 @@ "from skorch.callbacks import EpochScoring\n", "# set random seeds to get (at least more or less) reproducable results\n", "np.random.seed(28)\n", - "torch.manual_seed(28)\n", - "\n", - "%load_ext autoreload\n", - "%autoreload 2" + "torch.manual_seed(28);" ] }, { @@ -115,8 +112,8 @@ "X_train, X_test, y_train, y_test = load_data()\n", "plot_images(X_train)\n", "# Fashion MNIST\n", - "X_train, X_test, y_train, y_test = load_data(use_fashion=True)\n", - "plot_images(X_train)" + "X_train_F, X_test_F, y_train_F, y_test_F = load_data(use_fashion=True)\n", + "plot_images(X_train_F)" ] }, { @@ -195,7 +192,17 @@ "metadata": {}, "outputs": [], "source": [ - "# look at the network's output for the first data point\n", + "# check the image of the first training sample\n", + "plt.imshow(X_train[0].reshape(28, 28));" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# look at the network's output for this first data point\n", "# -> since the network wasn't trained yet, the predicted probabilities for all 10 classes are ~0.1\n", "# (notice the grad parameter, which indicates that the network kept track of the gradients,\n", "# which are needed for later tuning the weights during training)\n", @@ -258,7 +265,7 @@ "metadata": {}, "outputs": [], "source": [ - "# we can also get the original probabilities\n", + "# we can also get the original probabilities (notice the higher value at the index of the true class)\n", "y = net.predict_proba(X_train[:16])\n", "y[0]" ] @@ -269,7 +276,7 @@ "source": [ "## Define NNs for the classification task\n", "\n", - "In the code below we define 3 different neural network architectures: a linear FFNN, a 
FFNN with multiple hidden layers, and a CNN, which is an architecture particularly well suited for image classification tasks.\n", + "In the code below, we define 3 different neural network architectures: a linear FFNN, a FFNN with multiple hidden layers, and a CNN, which is an architecture particularly well suited for image classification tasks.\n", "\n", "You will see that the more complex architectures use an additional operation between layers called `Dropout`. This is a regularization technique used for training neural networks, where a certain percentage of the values in the hidden layer representation of a data point are randomly set to zero. You can think of this as the network suffering from a temporary stroke, which forces the neurons learn redundant representations (i.e., such that one neuron can take over for another neuron that was knocked out), which improves generalization." ] @@ -368,7 +375,7 @@ " net.fit(X_train, y_train)\n", " # evaluate on test set\n", " y_pred = net.predict(X_test)\n", - " print('Test accuracy:', accuracy_score(y_test, y_pred))\n", + " print('Test accuracy:', accuracy_score(y_test, y_pred), \"\\n\")\n", " return net" ] }, @@ -380,7 +387,7 @@ "\n", "As you see below, the simple logistic regression classifier is already very good on this easy task, with a test accuracy of over 93.5%.\n", "\n", - "The linear FFNN has almost the same accuracy (90.5%) as the LogReg model (please note the NNs were only trained for a single epoch!) and the multi-layer FFNN is already better than the LogReg model (96.4%), while the CNN beats them all (98.2%), which is expected since this architecture is designed for the image classification task." + "The linear FFNN has almost the same accuracy (90.5%) as the LogReg model (please note: the NNs were only trained for a single epoch!) and the multi-layer FFNN is already better than the LogReg model (96.4%), while the CNN beats them all (98.2%), which is expected since this architecture is designed for the image classification task." ] }, { @@ -444,7 +451,7 @@ "print(\"### LogReg\")\n", "clf = LogisticRegression(class_weight='balanced', random_state=1, fit_intercept=True)\n", "clf.fit(X_train, y_train)\n", - "print('Test accuracy:', clf.score(X_test, y_test))\n", + "print('Test accuracy:', clf.score(X_test, y_test), \"\\n\")\n", "# and our different NN architectures\n", "for net_module in [LinNN, FFNN, CNN]:\n", " if net_module == CNN:\n", @@ -461,7 +468,7 @@ "source": [ "## Test on FashionMNIST\n", "\n", - "On the more difficult FMNIST task, the LogReg model has a much lower accuracy of 86.6%. When trained for only a single epoch, both the linear and multi-layer FFNNs have a lower accuracy (82.7 and 83.7% respectively) and only the CNN does a bit better (88.6%). " + "On the more difficult FMNIST task, the LogReg model has a much lower accuracy of 86.6% compared to the 93.5% achieved on the original MNIST dataset. When trained for only a single epoch, both the linear and multi-layer FFNNs have a lower accuracy than the LogReg model (82.7 and 83.7% respectively) and only the CNN does a bit better (88.6%). 
" ] }, { @@ -524,7 +531,7 @@ "print(\"### LogReg\")\n", "clf = LogisticRegression(class_weight='balanced', random_state=1, fit_intercept=True)\n", "clf.fit(X_train, y_train)\n", - "print('Test accuracy:', clf.score(X_test, y_test))\n", + "print('Test accuracy:', clf.score(X_test, y_test), \"\\n\")\n", "# our different NN\n", "for net_module in [LinNN, FFNN, CNN]:\n", " if net_module == CNN:\n", @@ -539,7 +546,7 @@ "source": [ "However, when trained for more epochs, the performance of all models improves, with the accuracy of the linear FFNN now being very close to that of the LogReg model (85.8%), while the multi-layer FFNN is better (89.3%) and the CNN can now solve the task quite well with an accuracy of 94.6%.\n", "\n", - "(See how the training and validation loss decrease over time - observing how these metrics develop can help you judge whether you've set your learning rate correctly.)" + "(See how the training and validation loss decrease over time - observing how these metrics develop can help you judge whether you've set your learning rate correctly and for how many epochs you should train the network.)" ] }, { diff --git a/exercises/6_analyze_toydata.ipynb b/exercises/6_analyze_toydata.ipynb index 80d0c01..6f59117 100644 --- a/exercises/6_analyze_toydata.ipynb +++ b/exercises/6_analyze_toydata.ipynb @@ -30,11 +30,7 @@ "import plotly.express as px\n", "# suppress unnecessary warnings\n", "import warnings\n", - "warnings.simplefilter(action='ignore', category=FutureWarning)\n", - "\n", - "# these are some 'magic' commands for the notebook to automatically load updated libraries\n", - "%load_ext autoreload\n", - "%autoreload 2" + "warnings.simplefilter(action='ignore', category=FutureWarning)" ] }, { @@ -177,7 +173,7 @@ "**Questions:**\n", "- If all that someone had told you was that two variables have a linear correlation of 0.7, is this the scatter plot that you would have imagined for the two variables? (You might also want to look at the Wikipedia article again for some other example plots)\n", "- Why is the correlation coefficient for these two variables so large?\n", - "- What would you expect the correlation coefficient to be if you only consider the large blob in the middle?\n", + "- What would you expect the correlation coefficient to be if you only consider the large blob in the middle (i.e., ignore the points at (0, 0))?\n", "\n", "In reality, it often happens that two variables seem to be perfectly correlated (i.e., they have a correlation coefficient of (almost) 1), but when you look closer, then this is just due to the fact that, for example, two sensors are off at the same time, but for the part where they're on, they actually aren't giving redundant values. Therefore be careful before throwing away \"rendundant\" variables and always verify the correlation with a scatter plot!" ] @@ -578,7 +574,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "As you can see, the tree is quite big and therefore also more tedious to interpret. Additionally, we see that many of the splits right before the leaf nodes are made without any change in the predicted class (i.e., all the nodes remain orange). This happens, because the tree itself only cares about the Gini impurity, which indeed still decreases after these splits. However, since this is not helpful for us, lets prune on the tree by cutting off these unnecessary splits, which can be done by setting the parameter `ccp_alpha`." + "As you can see, the tree is quite big and therefore also more tedious to interpret. 
Additionally, we see that many of the splits right before the leaf nodes are made without any change in the predicted class (e.g., all the nodes remain orange). This happens, because the tree itself only cares about the Gini impurity, which indeed still decreases after these splits. However, since this is not helpful for us, lets prune on the tree by cutting off these unnecessary splits, which can be done by setting the parameter `ccp_alpha`." ] }, { @@ -740,7 +736,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "The performance is still a lot lower than what we got with a decision tree... Furthermore, you saw in both cases that the model threw a `ConvergenceWarning`. While this isn't too tragic (usually the results are still quite good), in many cases this warning occurs when the data isn't normally distributed (i.e., violates the model's assumptions) and the results often get better when you transform the data accordingly. Therefore, we now use the `StandardScaler` to ensure each feature has a mean of 0 and a standard deviation of 1." + "The performance is still a lot lower than what we got with a decision tree... Furthermore, you saw in both cases that the model threw a `ConvergenceWarning`. While this usually isn't too tragic in practice (in most cases the results are still quite good), in many cases this warning occurs when the data isn't normally distributed (i.e., violates the model's assumptions) and the results often get better when you transform the data accordingly. Therefore, we now use the `StandardScaler` to ensure each feature has a mean of 0 and a standard deviation of 1." ] }, { @@ -790,7 +786,7 @@ "# the coefficients tell us why an item was classified as faulty:\n", "# higher temperatures lead to faulty items, but we have different offsets for the different products, \n", "# i.e., product 3 can handle higher temperatures than product 1\n", - "# -> features with small coefficients can be removed\n", + "# -> features with very small coefficients can be removed\n", "dict(zip(feature_cols, clf.coef_[0]))" ] }, @@ -845,6 +841,27 @@ "While it was a bit more work to set up the logistic regression model appropriately, incl. extra data preprocessing steps, we now even got a balanced accuracy on the test set that is slightly higher than that of the decision tree (0.938 instead of 0.935)." ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Advanced Exercise (optional)\n", + "\n", + "Use a neural network (either using the `torch`/`skorch` (recommended) or `tensorflow`/`keras` libraries) to solve this task.\n", + "\n", + "Start with a linear network (i.e., a FFNN without hidden layers, i.e., the same number of trainable parameters as the logistic regression model used above) and try to get approximately the same performance as the LogReg model.\n", + "\n", + "Then use a deeper network (e.g., one additional hidden layer) and see if this improves the performance.\n", + "\n", + "**Tips:**\n", + "- Make sure to use scaled data!\n", + "- Since the faulty products are underrepresented, samples from this class should get a higher weight during training (similar to what we're doing with `class_weight=\"balanced\"` in sklearn models). \n", + "\n", + "**Using `torch` & `skorch`:**\n", + "- Use a skorch [`NeuralNetBinaryClassifier`](https://skorch.readthedocs.io/en/latest/classifier.html). 
Here the torch network shoud predict the output without any non-linear activation function at the end (i.e., *don't* use a sigmoid function to convert the output into probabilities) as the skorch model takes care of this conversion for you!\n", + "- The sample weights can be set by passing `criterion__pos_weight=torch.Tensor([np.sum(y_train==0)/np.sum(y_train==1)])` as an argument to the `NeuralNetBinaryClassifier` (see also the documentation for the torch [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html) loss function, which is used internally by the skorch model)" + ] + }, { "cell_type": "code", "execution_count": null, diff --git a/exercises/7_hard_drive_failures.ipynb b/exercises/7_hard_drive_failures.ipynb index 49e1e6d..ca8bb7c 100644 --- a/exercises/7_hard_drive_failures.ipynb +++ b/exercises/7_hard_drive_failures.ipynb @@ -6,16 +6,17 @@ "source": [ "# Predicting hard drive failures\n", "\n", - "**Scenario:** In a data center with many hard drives, occasionally, one of these drives will fail. To prevent possible data loss, it's a data scientist's (i.e. your) task to predict as soon as possible in advance when a drive might fail.\n", + "**Scenario:** In a data center with many hard drives, occasionally, one of these drives will fail. To prevent possible data loss, it's a data scientist's (i.e., your) task to predict as soon as possible in advance when a drive might fail.\n", "\n", "The original data can be downloaded from [backblaze](https://www.backblaze.com/b2/hard-drive-test-data.html).\n", "It was already cleaned and restructured for your convenience (see `data/hdf_data`). This preprocessing process included:\n", "\n", "- removing NaNs\n", + "- removing SMART variables with zero variance\n", "- keeping only data from the most frequent drive model (to avoid artifacts due to differences in SMART recordings)\n", "- creating a dataframe where each drive is one data point with the information whether it failed or not (= class label)\n", "\n", - "The original data consisted of daily SMART statistics measurements for all drives at that time installed in the data center (i.e. for each drive until it failed). Your task is to build a binary classification model, which receives the measurements from all drives every day and should predict which of these drives are likely to fail in the next hours or days. To train such a model, you are given a simplified dataset, which includes only a single measurement per drive, either from some random time point during the year if the drive did not fail (class 0), or the SMART statistics on the day the drive failed (csv files ending in `_0`) or from a few days before the drive failed (e.g. `_1` for 1 day before it failed, `_7` for 7 days, etc). This means by using e.g. the data from `df_2016_0.csv` you can build a model that can predict whether a drive will fail today, while a model trained on the data in `df_2016_7.csv` can predict whether a drive will fail one week from now. (Normally, you would make use of the measurements over time and e.g. track maximum values up to now or do some other feature engineering to improve the performance, but for the sake of simplicity we only use these individual snapshots here.) \n", + "The original data consisted of daily SMART statistics measurements for all drives that were installed in the data center at this time (i.e., measurements for each drive until it failed). 
Your task is to build a binary classification model, which receives the measurements from all drives every day and should predict which of these drives are likely to fail in the next hours or days. To train such a model, you are given a simplified dataset, which includes only a single measurement per drive, either from some random time point during the year if the drive did not fail (class 0), or the SMART statistics on the day the drive failed (csv files ending in `_0`) or from a few days before the drive failed (e.g., `_1` for 1 day before it failed, `_7` for 7 days, etc). This means by using, e.g., the data from `df_2016_0.csv` you can build a model that can predict whether a drive will fail today, while a model trained on the data in `df_2016_7.csv` can predict whether a drive will fail one week from now. (Normally, you would make use of the measurements over time and, e.g., track maximum values up to now or do some other feature engineering to improve the performance, but for the sake of simplicity we only use these individual snapshots here.) \n", "\n", "Use the data from 2016 for training the model and tuning hyperparameters and the data from 2017 for the final evaluation to get a realistic performance estimate of how well the model can handle some slight data drifts etc.\n", "\n", @@ -38,6 +39,9 @@ "import warnings\n", "warnings.simplefilter(action='ignore', category=FutureWarning)\n", "\n", + "# these \"magic commands\" are helpful if you plan to import functions from another script\n", + "# where you keep changing things, i.e., if you change a function in the script\n", + "# it will automagically be reloaded in the notebook so you work with the latest version\n", "%load_ext autoreload\n", "%autoreload 2" ] @@ -49,8 +53,8 @@ "outputs": [], "source": [ "# load the data with the SMART statistics of the drives.\n", - "# with the data ending in _0, we can learn to predict if a drive has failed or is working properly;\n", - "# try e.g. df_2016_7.csv to predict failures a week in advance\n", + "# with the data ending in _0, we can learn to predict if a drive has failed or is working properly right now;\n", + "# try, e.g., df_2016_7.csv to predict failures a week in advance\n", "df = pd.read_csv(\"../data/hdf_data/df_2016_0.csv\")\n", "# have a look at what we've loaded\n", "df.head()" @@ -62,7 +66,8 @@ "metadata": {}, "outputs": [], "source": [ - "# construct training and test data from this dataframe - use only the smart statistics as features\n", + "# construct training and test data from this dataframe\n", + "# -> use the smart statistics as features & \"failure\" as the target\n", "feat_cols = [c for c in df.columns if c.startswith(\"smart\")]\n", "X = df[feat_cols].to_numpy()\n", "y = df[\"failure\"].to_numpy()\n", @@ -112,24 +117,26 @@ "metadata": {}, "source": [ "-------------------------------------------------------------------------------------\n", - "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project.\n", + "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! 
Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n", + "\n", + "The previous notebook, \"analyze toydata\", deals with a very similar problem and can serve as a guideline for this exercise. For an example of how to use the t-SNE algorithm, have a look at the first notebook, \"visualize text\" (but please note that since you don't have sparse data here, there is no need to transform the data with a kernel PCA before using t-SNE).\n", "\n", "### (Suggested) Steps\n", "\n", "#### a) Get a better understanding of the problem\n", "- Create a t-SNE plot of the data (from the features; color the dots in the scatter plot with the target variable): Do you think a classification model will do well on this data?\n", "- Look at the variables in more detail: Are they normally/uniformly distributed?\n", - "- Try different kinds of models in place of the `DummyClassifier` (e.g. decision tree, linear model, SVM) and play around with the hyperparameters a little bit to get a better feeling for the problem.\n", + "- Try different kinds of models in place of the `DummyClassifier` (e.g., decision tree, linear model, SVM) and play around with the hyperparameters a little bit to get a better feeling for the problem.\n", "- Would outlier detection make sense here? Why (not)?\n", "\n", "#### b) Improve the prediction performance\n", - "- Try different normalizations of the data (e.g. using the `StandardScaler`): How do the t-SNE plot and performance of the different models change? Why does a decision tree not improve? Can you apply some other transformations to make the features more normally distributed?\n", + "- Try different normalizations of the data (e.g., using the `StandardScaler`): How do the t-SNE plot and performance of the different models change? Why does a decision tree not improve? Can you apply some other transformations to make the features more normally distributed?\n", "- Are any variables highly correlated? How does the performance change when you remove some features? Do you have any other feature engineering ideas? Again observe how your previous results change as you modify the input features!\n", "- Systematically find optimal hyperparameters for your models using a `GridSearchCV` and decide what you want to use as your final model.\n", "\n", "#### c) Final evaluation & model interpretation\n", "- Try to better understand what your model is doing: Which variables are the most predictive of failures?\n", - "- Predict failures multiple days in advance by training and evaluating your models on the other csv files from 2016 (e.g. `df_2016_7.csv` for 7 days before the drive fails). How many days in advance is a reliable prediction possible (e.g. plot \"days before failure\" vs \"balanced accuracy\")?\n", + "- Predict failures multiple days in advance by training and evaluating your models on the other csv files from 2016 (e.g., `df_2016_7.csv` for 7 days before the drive fails). 
How many days in advance is a reliable prediction possible (e.g., plot \"days before failure\" vs \"balanced accuracy\")?\n", "- Evaluate your final model (trained on a complete dataframe from 2016) on the respective data from 2017.\n", "\n", "#### d) Presentation of results\n", diff --git a/exercises/8_rl_gridmove.ipynb b/exercises/8_rl_gridmove.ipynb index 97595ec..5219b6f 100644 --- a/exercises/8_rl_gridmove.ipynb +++ b/exercises/8_rl_gridmove.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "# Reinforcement Learning with discrete states and actions\n", + "# Tabular Reinforcement Learning (with discrete states and actions)\n", "\n", "In this notebook we demonstrate how a RL agent can learn to navigate the grid world environment shown in the book using Q-learning." ] @@ -196,7 +196,7 @@ " plt.xlabel(\"episode\")\n", " plt.ylabel(\"cumulative reward\")\n", " plt.ylim(-100, 100)\n", - " return Q, cum_rewards\n", + " return Q\n", "\n", "def vis_Q(Q, env):\n", " # see which state-action values we have learned\n", @@ -243,6 +243,88 @@ "Q = learn_Q(max_steps=250000, decay_rate=0.00001)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Predict Q-values with a simple `torch` NN\n", + "\n", + "While training a Q-network goes beyond this course, here is a simple example of how the prediction of Q-values could look like with a neural network. In reality, the state vectors wouldn't be one-hot encoded vectors, but instead some meaningful representation of the states such that the RL agent could also generalize to unseen states." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# torch neural network stuff\n", + "import torch\n", + "import torch.nn as nn\n", + "\n", + "class LinNN(nn.Module):\n", + " \n", + " def __init__(self, Q):\n", + " super(LinNN, self).__init__()\n", + " self.l = nn.Linear(Q.shape[0], Q.shape[1], bias=False)\n", + " # we're not training the network, but directly initialize it with the optimal weights\n", + " self.l.weight.data = torch.Tensor(Q.T)\n", + " \n", + " def forward(self, x):\n", + " y = self.l(x)\n", + " return y" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# initialize the network with the learned Q matrix\n", + "qnn = LinNN(Q)\n", + "# check that the weights are set appropriately\n", + "# -> same picture as above for the Q-matrix, just transposed\n", + "plt.imshow(qnn.l.weight.data)\n", + "plt.clim(-100, 100);" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# generate an input feature vector for some test state\n", + "test_state = (2, 0)\n", + "# get the index for this state\n", + "env = Environment()\n", + "test_state_idx = env.possible_states.index(test_state)\n", + "# transform into a one-hot encoded torch vector\n", + "input_tensor = torch.zeros((1, len(env.possible_states)))\n", + "input_tensor[0, test_state_idx] = 1." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# check the Q-network \"predictions\"\n", + "qnn(input_tensor)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# verify that the results are the same as the corresponding row from the Q-matrix\n", + "Q[test_state_idx]" + ] + }, { "cell_type": "code", "execution_count": null,
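As a reference for what a function like `learn_Q` computes internally, here is a schematic sketch of tabular Q-learning with an epsilon-greedy policy. This is not the notebook's exact `Environment`/`learn_Q` code; the dummy `step` function below only stands in for the grid world dynamics.

```python
import numpy as np

n_states, n_actions = 12, 4
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount factor, exploration rate

def step(state, action):
    # dummy environment dynamics: random next state, -1 reward per step, 5% chance the episode ends
    return np.random.randint(n_states), -1.0, np.random.rand() < 0.05

state = 0
for _ in range(10000):
    # epsilon-greedy action selection: mostly exploit the current Q estimates, sometimes explore
    if np.random.rand() < epsilon:
        action = np.random.randint(n_actions)
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward, done = step(state, action)
    # Q-learning update: move Q(s, a) towards the observed reward plus the discounted best future value
    Q[state, action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state, action])
    state = 0 if done else next_state
```

The learned Q matrix has one row per state and one column per action, which is exactly the shape used to initialize the weights of the `LinNN` Q-network above.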