quizzes as pdfs and workbook

franzi
2021-10-23 22:31:51 +02:00
parent 30f203c34e
commit 6032017679
10 changed files with 50 additions and 50 deletions

View File

@@ -22,11 +22,11 @@ While Google Colab already includes most packages that we need, should you requi
## Course Overview
-For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. Additionally, you should take notes in the workbook while working through the materials.
+For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. Additionally, you should take notes in the [workbook](/other/ml_course_workbook.pdf) while working through the materials.
-**Important:** Please make note of all questions that arise while working through the materials. At the beginning of each group session, we'll collect everyone's questions and discuss them.
+**Important:** Please make a note of all questions that arise while working through the materials. At the beginning of each group session, we'll collect everyone's questions and discuss them.
-You can also find the course syllabus on the last page of the [course description](/ml_course_description.pdf), which explicitly lists all the sections of the book for each block.
+You can also find the course syllabus on the last page of the [course description](/ml_course_description.pdf), which explicitly lists the sections of the book for each block.
---
@@ -34,7 +34,7 @@ You can also find the course syllabus on the last page of the [course descriptio
##### Block 1.1:
- [ ] Read the whole chapter: ["Introduction"](https://franziskahorn.de/mlbook/_introduction.html)
-- [ ] Answer [Quiz 1](https://forms.gle/uzdzytpsYf9sFG946)
+- [ ] Answer [Quiz 1](https://forms.gle/uzdzytpsYf9sFG946) (quizzes are also available in PDF form in the folder "other" in case you can't access Google Forms)
##### Block 1.2:
- [ ] Read the whole chapter: ["ML with Python"](https://franziskahorn.de/mlbook/_ml_with_python.html)
@@ -59,7 +59,7 @@ You can also find the course syllabus on the last page of the [course descriptio
- [ ] Work through [Notebook 2: image quantization](/notebooks/2_image_quantization.ipynb) (after the section on clustering)
##### Block 2.2:
-- [ ] Read the first sections of the chapter ["Supervised Learning"](https://franziskahorn.de/mlbook/_supervised_learning.html) up to and including ["Model Evaluation"](https://franziskahorn.de/mlbook/_model_evaluation.html)
+- [ ] Start reading the first sections of the chapter ["Supervised Learning"](https://franziskahorn.de/mlbook/_supervised_learning.html) up to and including ["Model Evaluation"](https://franziskahorn.de/mlbook/_model_evaluation.html)
- [ ] Answer [Quiz 4](https://forms.gle/M2dDevwzicjcHLtc9)
---
@@ -71,7 +71,7 @@ You can also find the course syllabus on the last page of the [course descriptio
- [ ] **In parallel**, work through the respective sections of [Notebook 3: supervised comparison](/notebooks/3_supervised_comparison.ipynb)
##### Block 3.2:
-- [ ] Start with the chapter ["Deep Learning & more"](https://franziskahorn.de/mlbook/_deep_learning_more.html) up to and including the section: ["Information Retrieval (Similarity Search)"](https://franziskahorn.de/mlbook/_information_retrieval_similarity_search.html) and refresh your memory on the sections on [TF-IDF feature vectors](https://franziskahorn.de/mlbook/_feature_extraction.html) and [cosine similarity](https://franziskahorn.de/mlbook/_computing_similarities.html)
+- [ ] Start reading the chapter ["Deep Learning & more"](https://franziskahorn.de/mlbook/_deep_learning_more.html) up to and including the section: ["Information Retrieval (Similarity Search)"](https://franziskahorn.de/mlbook/_information_retrieval_similarity_search.html) and refresh your memory about [TF-IDF feature vectors](https://franziskahorn.de/mlbook/_feature_extraction.html) and [cosine similarity](https://franziskahorn.de/mlbook/_computing_similarities.html)
- [ ] Work through [Notebook 4: information retrieval](/notebooks/4_information_retrieval.ipynb)
##### Block 3.3:
@@ -107,5 +107,6 @@ You can also find the course syllabus on the last page of the [course descriptio
- [ ] Answer [Quiz 5](https://forms.gle/uZGj54YQHKwckmL46)
- [ ] Read the whole chapter: ["Conclusion"](https://franziskahorn.de/mlbook/_conclusion.html)
- [ ] Complete the exercise: ["Your next ML Project"](/other/exercise_your_ml_project.pdf) (in case you need some inspiration for a project idea, have a look at [how ML could be used to fight climate change](https://www.climatechange.ai/summaries)). Feel free to prepare a few slides or use the [Word template](/other/exercise_your_ml_project_template.docx) and aim for a 5-minute presentation.
+- [ ] Please fill out the [Feedback Survey](https://forms.gle/Ccv5h5zQxwPjWtCS7) to help me further improve this course! :-)
---

View File

@@ -256,7 +256,7 @@
"feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\"]\n",
"X = df[feature_cols].to_numpy() # convert df into a numpy array\n",
"# ... and the vector with labels\n",
"y = df[[\"faulty\"]].to_numpy()\n",
"y = df[\"faulty\"].to_numpy()\n",
"# to evaluate our prediction model, we need to split off a test dataset\n",
"# later we will use the train_test_split function from sklearn to do this, \n",
"# but this just goes to show that there is no magic behind it\n",
@@ -340,46 +340,12 @@
"metadata": {},
"outputs": [],
"source": [
"# maybe we just need to give the tree the freedom to make more splits? (i.e., increase its depth)\n",
"clf = tree.DecisionTreeClassifier(max_depth=100, random_state=1)\n",
"clf = clf.fit(X_train, y_train)\n",
"print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
"print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
"print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
"print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Questions:** \\\n",
"Is this a better model? If anything, is the model over- or underfitting?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# when the tree is too large (or you're using a random forest),\n",
"# check the feature importances instead of plotting the tree\n",
"dict(zip(feature_cols, clf.feature_importances_))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# now let's do what we probably should have done in the beginning and \n",
"# let's do what we probably should have done in the beginning and \n",
"# remove the outliers (i.e., keep only samples with a height > 0)\n",
"df_new = df[df[\"height\"] > 0.]\n",
"# create a train/test split again, this time using the sklearn function\n",
"X_train, X_test, y_train, y_test = train_test_split(df_new[feature_cols].to_numpy(), \n",
" df_new[[\"faulty\"]].to_numpy(), \n",
" df_new[\"faulty\"].to_numpy(), \n",
" test_size=0.33, random_state=15)\n",
"# see how imbalanced the label distribution in the training and test sets is\n",
"print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
@@ -408,6 +374,18 @@
"print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
]
},
+{
+"cell_type": "code",
+"execution_count": null,
+"metadata": {},
+"outputs": [],
+"source": [
+"# plot the tree again\n",
+"plt.figure(figsize=(15, 10))\n",
+"tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);\n",
+"# notice how in the leaf nodes where the tree predicts \"faulty\", there are only very few data points"
+]
+},
{
"cell_type": "markdown",
"metadata": {},
@@ -422,10 +400,32 @@
"metadata": {},
"outputs": [],
"source": [
"# plot the tree\n",
"plt.figure(figsize=(15, 10))\n",
"tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);\n",
"# notice how in the leaf nodes where the tree predicts \"faulty\", there are only very few data points"
"# maybe we just need to give the tree the freedom to make more splits? (i.e., increase its depth)\n",
"clf = tree.DecisionTreeClassifier(max_depth=100, random_state=1)\n",
"clf = clf.fit(X_train, y_train)\n",
"print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
"print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
"print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
"print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Questions:** \\\n",
"Is this a better model? If anything, is the model over- or underfitting?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# when the tree is too large (or you're using a random forest),\n",
"# check the feature importances instead of plotting the tree\n",
"dict(zip(feature_cols, clf.feature_importances_))"
]
},
{
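
The reordered cells above increase `max_depth` and then ask whether the model is over- or underfitting. A self-contained sketch of the typical pattern, using synthetic data from `make_classification` (not the course's toy dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the toy dataset, with some label noise
X, y = make_classification(n_samples=500, n_features=6, flip_y=0.1, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)

# compare train vs. test accuracy as the tree gets deeper
for depth in (2, 5, 100):
    clf = DecisionTreeClassifier(max_depth=depth, random_state=1).fit(X_train, y_train)
    print(f"depth={depth}: train={clf.score(X_train, y_train):.3f}, test={clf.score(X_test, y_test):.3f}")
```

With enough depth, the training accuracy approaches 1.0 while the test accuracy stagnates or drops; that growing gap is the overfitting signal the question is asking about.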
@@ -521,7 +521,7 @@
"# let's try with temp as an additional feature\n",
"feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\", \"temp\"]\n",
"X = df_new[feature_cols].to_numpy()\n",
"y = df_new[[\"faulty\"]].to_numpy()\n",
"y = df_new[\"faulty\"].to_numpy()\n",
"# split into train/test sets again\n",
"X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)\n",
"# see how imbalanced the label distribution in the training and test sets is\n",
@@ -706,7 +706,6 @@
"outputs": [],
"source": [
"# try a different classifier: logistic regression\n",
"y_train, y_test = y_train.flatten(), y_test.flatten() # otherwise the model will complain about the shapes\n",
"# first, try the model with the default parameter settings\n",
"clf = LogisticRegression()\n",
"clf = clf.fit(X_train, y_train)\n",

View File

@@ -117,7 +117,7 @@
"metadata": {},
"source": [
"-------------------------------------------------------------------------------------\n",
"You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
"You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/other/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
"\n",
"The previous notebook, \"analyze toydata\", deals with a very similar problem and can serve as a guideline for this exercise. For an example of how to use the t-SNE algorithm, have a look at the first notebook, \"visualize text\" (but please note that since you don't have sparse data here, there is no need to transform the data with a kernel PCA before using t-SNE).\n",
"\n",

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.

Binary file not shown.