diff --git a/README.md b/README.md
index c75c616..3ef7c51 100644
--- a/README.md
+++ b/README.md
@@ -22,11 +22,11 @@ While Google Colab already includes most packages that we need, should you requi
 
 ## Course Overview
 
-For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. Additionally, you should take notes in the workbook while working through the materials.
+For an optimal learning experience, the chapters from the [machine learning book](https://franziskahorn.de/mlbook/) should be interleaved with quizzes and programming exercises as shown below. Additionally, you should take notes in the [workbook](/other/ml_course_workbook.pdf) while working through the materials.
 
-**Important:** Please make note of all questions that arise while working through the materials. At the beginning of each group session, we'll collect everyone's questions and discuss them.
+**Important:** Please make a note of all questions that arise while working through the materials. At the beginning of each group session, we'll collect everyone's questions and discuss them.
 
-You can also find the course syllabus on the last page of the [course description](/ml_course_description.pdf), which explicitly lists all the sections of the book for each block.
+You can also find the course syllabus on the last page of the [course description](/ml_course_description.pdf), which explicitly lists the sections of the book for each block.
 
 ---
 
@@ -34,7 +34,7 @@ You can also find the course syllabus on the last page of the [course descriptio
 ##### Block 1.1:
 - [ ] Read the whole chapter: ["Introduction"](https://franziskahorn.de/mlbook/_introduction.html)
-- [ ] Answer [Quiz 1](https://forms.gle/uzdzytpsYf9sFG946)
+- [ ] Answer [Quiz 1](https://forms.gle/uzdzytpsYf9sFG946) (quizzes are also available in PDF form in the folder "other/quizzes" in case you can't access Google Forms)
 
 ##### Block 1.2:
 - [ ] Read the whole chapter: ["ML with Python"](https://franziskahorn.de/mlbook/_ml_with_python.html)
@@ -59,7 +59,7 @@ You can also find the course syllabus on the last page of the [course descriptio
 - [ ] Work through [Notebook 2: image quantization](/notebooks/2_image_quantization.ipynb) (after the section on clustering)
 
 ##### Block 2.2:
-- [ ] Read the first sections of the chapter ["Supervised Learning"](https://franziskahorn.de/mlbook/_supervised_learning.html) up to and including ["Model Evaluation"](https://franziskahorn.de/mlbook/_model_evaluation.html)
+- [ ] Start reading the first sections of the chapter ["Supervised Learning"](https://franziskahorn.de/mlbook/_supervised_learning.html) up to and including ["Model Evaluation"](https://franziskahorn.de/mlbook/_model_evaluation.html)
 - [ ] Answer [Quiz 4](https://forms.gle/M2dDevwzicjcHLtc9)
 
 ---
 
@@ -71,7 +71,7 @@ You can also find the course syllabus on the last page of the [course descriptio
 - [ ] **In parallel**, work through the respective sections of [Notebook 3: supervised comparison](/notebooks/3_supervised_comparison.ipynb)
 
 ##### Block 3.2:
-- [ ] Start with the chapter ["Deep Learning & more"](https://franziskahorn.de/mlbook/_deep_learning_more.html) up to and including the section: ["Information Retrieval (Similarity Search)"](https://franziskahorn.de/mlbook/_information_retrieval_similarity_search.html) and refresh your memory on the sections on [TF-IDF feature vectors](https://franziskahorn.de/mlbook/_feature_extraction.html) and [cosine similarity](https://franziskahorn.de/mlbook/_computing_similarities.html)
+- [ ] Start reading the chapter ["Deep Learning & more"](https://franziskahorn.de/mlbook/_deep_learning_more.html) up to and including the section: ["Information Retrieval (Similarity Search)"](https://franziskahorn.de/mlbook/_information_retrieval_similarity_search.html) and refresh your memory about [TF-IDF feature vectors](https://franziskahorn.de/mlbook/_feature_extraction.html) and [cosine similarity](https://franziskahorn.de/mlbook/_computing_similarities.html)
 - [ ] Work through [Notebook 4: information retrieval](/notebooks/4_information_retrieval.ipynb)
 
 ##### Block 3.3:
@@ -107,5 +107,6 @@ You can also find the course syllabus on the last page of the [course descriptio
 - [ ] Answer [Quiz 5](https://forms.gle/uZGj54YQHKwckmL46)
 - [ ] Read the whole chapter: ["Conclusion"](https://franziskahorn.de/mlbook/_conclusion.html)
 - [ ] Complete the exercise: ["Your next ML Project"](/other/exercise_your_ml_project.pdf) (in case you need some inspiration for a project idea, have a look at [how ML could be used to fight climate change](https://www.climatechange.ai/summaries)). Feel free to prepare a few slides or use the [Word template](/other/exercise_your_ml_project_template.docx) and aim for a 5 minute presentation.
+- [ ] Please fill out the [Feedback Survey](https://forms.gle/Ccv5h5zQxwPjWtCS7) to help me further improve this course! :-)
 
 ---
diff --git a/notebooks/6_analyze_toydata.ipynb b/notebooks/6_analyze_toydata.ipynb
index 6f59117..a19c929 100644
--- a/notebooks/6_analyze_toydata.ipynb
+++ b/notebooks/6_analyze_toydata.ipynb
@@ -256,7 +256,7 @@
     "feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\"]\n",
     "X = df[feature_cols].to_numpy() # convert df into a numpy array\n",
     "# ... and the vector with labels\n",
-    "y = df[[\"faulty\"]].to_numpy()\n",
+    "y = df[\"faulty\"].to_numpy()\n",
     "# to evaluate our prediction model, we need to split off a test dataset\n",
     "# later we will use the train_test_split function from sklearn to do this, \n",
     "# but this just goes to show that there is no magic behind it\n",
@@ -340,46 +340,12 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# maybe we just need to give the tree the freedom to make more splits? (i.e., increase its depth)\n",
-    "clf = tree.DecisionTreeClassifier(max_depth=100, random_state=1)\n",
-    "clf = clf.fit(X_train, y_train)\n",
-    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
-    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
-    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
-    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "**Questions:** \\\n",
-    "Is this a better model? If anything, is the model over- or underfitting?"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# when the tree is too large (or you're using a random forest),\n",
-    "# check the feature importances instead of plotting the tree\n",
-    "dict(zip(feature_cols, clf.feature_importances_))"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# now let's do what we probably should have done in the beginning and \n",
+    "# let's do what we probably should have done in the beginning and \n",
     "# remove the outliers (i.e., keep only samples with a height > 0)\n",
     "df_new = df[df[\"height\"] > 0.]\n",
     "# create a train/test split again, this time using the sklearn function\n",
     "X_train, X_test, y_train, y_test = train_test_split(df_new[feature_cols].to_numpy(), \n",
-    "                                                    df_new[[\"faulty\"]].to_numpy(), \n",
+    "                                                    df_new[\"faulty\"].to_numpy(), \n",
     "                                                    test_size=0.33, random_state=15)\n",
     "# see how imbalanced the label distribution in the training and test sets is\n",
     "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
@@ -408,6 +374,18 @@
     "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
    ]
   },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# plot the tree again\n",
+    "plt.figure(figsize=(15, 10))\n",
+    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);\n",
+    "# notice how in the leaf nodes where the tree predicts \"faulty\", there are only very few data points"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -422,10 +400,32 @@
    "metadata": {},
    "outputs": [],
    "source": [
-    "# plot the tree\n",
-    "plt.figure(figsize=(15, 10))\n",
-    "tree.plot_tree(clf, feature_names=feature_cols, filled=True, class_names=np.array(clf.classes_, dtype=str), proportion=True);\n",
-    "# notice how in the leaf nodes where the tree predicts \"faulty\", there are only very few data points"
+    "# maybe we just need to give the tree the freedom to make more splits? (i.e., increase its depth)\n",
+    "clf = tree.DecisionTreeClassifier(max_depth=100, random_state=1)\n",
+    "clf = clf.fit(X_train, y_train)\n",
+    "print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
+    "print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
+    "print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
+    "print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Questions:** \\\n",
+    "Is this a better model? If anything, is the model over- or underfitting?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# when the tree is too large (or you're using a random forest),\n",
+    "# check the feature importances instead of plotting the tree\n",
+    "dict(zip(feature_cols, clf.feature_importances_))"
    ]
   },
   {
@@ -521,7 +521,7 @@
     "# let's try with temp as an additional feature\n",
     "feature_cols = [\"product_1\", \"product_5\", \"product_17\", \"height\", \"width\", \"depth\", \"temp\"]\n",
     "X = df_new[feature_cols].to_numpy()\n",
-    "y = df_new[[\"faulty\"]].to_numpy()\n",
+    "y = df_new[\"faulty\"].to_numpy()\n",
     "# split into train/test sets again\n",
     "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=15)\n",
     "# see how imbalanced the label distribution in the training and test sets is\n",
@@ -706,7 +706,6 @@
    "outputs": [],
    "source": [
     "# try a different classifier: logistic regression\n",
-    "y_train, y_test = y_train.flatten(), y_test.flatten() # otherwise the model will complain about the shapes\n",
     "# first, try the model with the default parameter settings\n",
     "clf = LogisticRegression()\n",
     "clf = clf.fit(X_train, y_train)\n",
diff --git a/notebooks/7_hard_drive_failures.ipynb b/notebooks/7_hard_drive_failures.ipynb
index ca8bb7c..e95632d 100644
--- a/notebooks/7_hard_drive_failures.ipynb
+++ b/notebooks/7_hard_drive_failures.ipynb
@@ -117,7 +117,7 @@
    "metadata": {},
    "source": [
     "-------------------------------------------------------------------------------------\n",
-    "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
+    "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/other/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
     "\n",
     "The previous notebook, \"analyze toydata\", deals with a very similar problem and can serve as a guideline for this exercise. For an example of how to use the t-SNE algorithm, have a look at the first notebook, \"visualize text\" (but please note that since you don't have sparse data here, there is no need to transform the data with a kernel PCA before using t-SNE).\n",
     "\n",
diff --git a/other/ml_course_workbook.docx b/other/ml_course_workbook.docx
new file mode 100644
index 0000000..603952c
Binary files /dev/null and b/other/ml_course_workbook.docx differ
diff --git a/other/ml_course_workbook.pdf b/other/ml_course_workbook.pdf
new file mode 100644
index 0000000..19f6b24
Binary files /dev/null and b/other/ml_course_workbook.pdf differ
diff --git a/other/quizzes/quiz1_introduction.pdf b/other/quizzes/quiz1_introduction.pdf
new file mode 100644
index 0000000..8d36c76
Binary files /dev/null and b/other/quizzes/quiz1_introduction.pdf differ
diff --git a/other/quizzes/quiz2_data.pdf b/other/quizzes/quiz2_data.pdf
new file mode 100644
index 0000000..600c44e
Binary files /dev/null and b/other/quizzes/quiz2_data.pdf differ
diff --git a/other/quizzes/quiz3_ml_solutions.pdf b/other/quizzes/quiz3_ml_solutions.pdf
new file mode 100644
index 0000000..aa2e52c
Binary files /dev/null and b/other/quizzes/quiz3_ml_solutions.pdf differ
diff --git a/other/quizzes/quiz4_model_selection.pdf b/other/quizzes/quiz4_model_selection.pdf
new file mode 100644
index 0000000..7db9356
Binary files /dev/null and b/other/quizzes/quiz4_model_selection.pdf differ
diff --git a/other/quizzes/quiz5_big_recap.pdf b/other/quizzes/quiz5_big_recap.pdf
new file mode 100644
index 0000000..9789324
Binary files /dev/null and b/other/quizzes/quiz5_big_recap.pdf differ
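
---

Notes on the changes above:

**Why `df[["faulty"]]` became `df["faulty"]`:** double brackets select a one-column DataFrame, so `.to_numpy()` yields a 2-D column vector of shape `(n, 1)`, while single brackets select a Series and yield the 1-D label array of shape `(n,)` that scikit-learn expects. This is also why the `y_train.flatten()` workaround could be dropped in the logistic regression cell. A minimal sketch of the difference, using a hypothetical toy DataFrame (not the course data):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# hypothetical stand-in for the notebook's df with its binary "faulty" labels
df = pd.DataFrame({"height": [1.0, 2.0, 3.0, 4.0], "faulty": [0, 1, 0, 1]})

y_2d = df[["faulty"]].to_numpy()  # double brackets -> DataFrame -> shape (4, 1)
y_1d = df["faulty"].to_numpy()    # single brackets -> Series    -> shape (4,)
print(y_2d.shape, y_1d.shape)

# fitting with the (n, 1) column vector triggers a DataConversionWarning
# (the "complaint about the shapes" that .flatten() used to silence);
# the 1-D array is what scikit-learn expects:
clf = LogisticRegression().fit(df[["height"]].to_numpy(), y_1d)
```

**Why the notebook cells print balanced accuracy next to plain accuracy:** on imbalanced labels like the "faulty" column, a model that always predicts the majority class already gets a high plain accuracy, while balanced accuracy (the unweighted mean of per-class recall) exposes it. A quick illustration with made-up labels:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = [0] * 9 + [1]  # 90% "ok" items, 10% "faulty"
y_pred = [0] * 10       # always predict the majority class
print(accuracy_score(y_true, y_pred))           # 0.9
print(balanced_accuracy_score(y_true, y_pred))  # 0.5
```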