new case study

2026-01-14 12:14:38 +01:00 · 2022-09-14 13:36:50 -07:00
parent be2b83489e
commit d3c6ea1ea5
4 changed files with 519 additions and 183 deletions
--- a/notebooks/5_hard_drive_failures.ipynb
+++ b/notebooks/5_hard_drive_failures.ipynb
@@ -1,183 +0,0 @@
-{
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Predicting hard drive failures\n",
-    "\n",
-    "**Scenario:** In a data center with many hard drives, occasionally, one of these drives will fail. To prevent possible data loss, it's a data scientist's (i.e., your) task to predict as soon as possible in advance when a drive might fail.\n",
-    "\n",
-    "The original data can be downloaded from [backblaze](https://www.backblaze.com/b2/hard-drive-test-data.html).\n",
-    "It was already cleaned and restructured for your convenience (see `data/hdf_data`). This preprocessing process included:\n",
-    "\n",
-    "- removing NaNs\n",
-    "- removing SMART variables with zero variance\n",
-    "- keeping only data from the most frequent drive model (to avoid artifacts due to differences in SMART recordings)\n",
-    "- creating a dataframe where each drive is one data point with the information whether it failed or not (= class label)\n",
-    "\n",
-    "The original data consisted of daily SMART statistics measurements for all drives that were installed in the data center at this time (i.e., measurements for each drive until it failed). Your task is to build a binary classification model, which receives the measurements from all drives every day and should predict which of these drives are likely to fail in the next hours or days. To train such a model, you are given a simplified dataset, which includes only a single measurement per drive, either from some random time point during the year if the drive did not fail (class 0), or the SMART statistics on the day the drive failed (csv files ending in `_0`) or from a few days before the drive failed (e.g., `_1` for 1 day before it failed, `_7` for 7 days, etc). This means by using, e.g., the data from `df_2016_0.csv` you can build a model that can predict whether a drive will fail today, while a model trained on the data in `df_2016_7.csv` can predict whether a drive will fail one week from now. (Normally, you would make use of the measurements over time and, e.g., track maximum values up to now or do some other feature engineering to improve the performance, but for the sake of simplicity we only use these individual snapshots here.) \n",
-    "\n",
-    "Use the data from 2016 for training the model and tuning hyperparameters and the data from 2017 for the final evaluation to get a realistic performance estimate of how well the model can handle some slight data drifts etc.\n",
-    "\n",
-    "More about the SMART attributes used as features in this problem can be found on [Wikipedia](https://en.wikipedia.org/wiki/S.M.A.R.T.)."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import numpy as np\n",
-    "import pandas as pd\n",
-    "import matplotlib.pyplot as plt\n",
-    "from sklearn.dummy import DummyClassifier\n",
-    "from sklearn.model_selection import train_test_split\n",
-    "from sklearn.metrics import balanced_accuracy_score\n",
-    "# don't get unneccessary warnings\n",
-    "import warnings\n",
-    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
-    "\n",
-    "# these \"magic commands\" are helpful if you plan to import functions from another script\n",
-    "# where you keep changing things, i.e., if you change a function in the script\n",
-    "# it will automagically be reloaded in the notebook so you work with the latest version\n",
-    "%load_ext autoreload\n",
-    "%autoreload 2"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# load the data with the SMART statistics of the drives.\n",
-    "# with the data ending in _0, we can learn to predict if a drive has failed or is working properly right now;\n",
-    "# try, e.g., df_2016_7.csv to predict failures a week in advance\n",
-    "df = pd.read_csv(\"../data/hdf_data/df_2016_0.csv\")\n",
-    "# have a look at what we've loaded\n",
-    "df.head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# construct training and test data from this dataframe\n",
-    "# -> use the smart statistics as features & \"failure\" as the target\n",
-    "feat_cols = [c for c in df.columns if c.startswith(\"smart\")]\n",
-    "X = df[feat_cols].to_numpy()\n",
-    "y = df[\"failure\"].to_numpy()\n",
-    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)\n",
-    "# see how imbalanced the label distribution in the training and test sets is\n",
-    "print(f\"Fraction of ok items in training set: {1-np.mean(y_train):.3f}\")\n",
-    "print(f\"Fraction of ok items in test set: {1-np.mean(y_test):.3f}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "def eval_clf(clf, X_train, y_train, X_test, y_test):\n",
-    "    \"\"\"\n",
-    "    Function to evaluate a trained classifier: prints accuracy and balanced accuracy scores.\n",
-    "    \n",
-    "    Inputs:\n",
-    "        - clf: the trained classifier\n",
-    "        - X_train, y_train: the training data\n",
-    "        - X_test, y_test: the test data\n",
-    "    \"\"\"\n",
-    "    print(f\"Accuracy on training data: {clf.score(X_train, y_train):.3f}\")\n",
-    "    print(f\"Accuracy on test data: {clf.score(X_test, y_test):.3f}\")\n",
-    "    print(f\"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}\")\n",
-    "    print(f\"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "# train a dummy model\n",
-    "clf = DummyClassifier(strategy=\"most_frequent\")\n",
-    "clf = clf.fit(X_train, y_train)\n",
-    "# evaluate the model\n",
-    "# later, make sure to pass the correct training and test data, e.g., in case you scaled your data etc.\n",
-    "eval_clf(clf, X_train, y_train, X_test, y_test)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "-------------------------------------------------------------------------------------\n",
-    "You're already given this rudimentary prediction pipeline, now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://franziskahorn.de/mlws_resources/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
-    "\n",
-    "The previous notebook, \"analyze toydata\", deals with a very similar problem and can serve as a guideline for this exercise. For an example of how to use the t-SNE algorithm, have a look at the first notebook, \"visualize text\" (but please note that since you don't have sparse data here, there is no need to transform the data with a kernel PCA before using t-SNE).\n",
-    "\n",
-    "### (Suggested) Steps\n",
-    "\n",
-    "#### a) Get a better understanding of the problem\n",
-    "- Create a t-SNE plot of the data (from the features; color the dots in the scatter plot with the target variable): Do you think a classification model will do well on this data?\n",
-    "- Look at the variables in more detail: Are they normally/uniformly distributed?\n",
-    "- Try different kinds of models in place of the `DummyClassifier` (e.g., decision tree, linear model, SVM) and play around with the hyperparameters a little bit to get a better feeling for the problem.\n",
-    "- Would outlier detection make sense here? Why (not)?\n",
-    "\n",
-    "#### b) Improve the prediction performance\n",
-    "- Try different normalizations of the data (e.g., using the `StandardScaler`): How do the t-SNE plot and performance of the different models change? Why does a decision tree not improve? Can you apply some other transformations to make the features more normally distributed?\n",
-    "- Are any variables highly correlated? How does the performance change when you remove some features? Do you have any other feature engineering ideas? Again observe how your previous results change as you modify the input features!\n",
-    "- Systematically find optimal hyperparameters for your models using a `GridSearchCV` and decide what you want to use as your final model.\n",
-    "\n",
-    "#### c) Final evaluation & model interpretation\n",
-    "- Try to better understand what your model is doing: Which variables are the most predictive of failures?\n",
-    "- Predict failures multiple days in advance by training and evaluating your models on the other csv files from 2016 (e.g., `df_2016_7.csv` for 7 days before the drive fails). How many days in advance is a reliable prediction possible (e.g., plot \"days before failure\" vs \"balanced accuracy\")?\n",
-    "- Evaluate your final model (trained on a complete dataframe from 2016) on the respective data from 2017.\n",
-    "\n",
-    "#### d) Presentation of results\n",
-    "Clean up your code & think about which results you want to present + the story they tell:\n",
-    "- What is the best model that you found & its performance?\n",
-    "- Which preprocessing steps had the most impact on the performance?\n",
-    "- What worked and what didn't for the different models?\n",
-    "- Which of the SMART statistics indicate that a drive will fail?\n",
-    "- How many days in advance can you predict a hard drive failure?\n",
-    "- How well does your model perform on the new data from 2017?\n",
-    "- What have you learned in this case study? Did any of the results surprise you?\n",
-    "-------------------------------------------------------------------------------------"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": []
-  }
- ],
- "metadata": {
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.10.2"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 2
-}
--- a/notebooks/5_quality_prediction.ipynb
+++ b/notebooks/5_quality_prediction.ipynb
@@ -0,0 +1,516 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Case Study: Concrete Compressive Strength Prediction\n",
+    "\n",
+    "In this case study, you'll solve a machine learning problem from start to finish.\n",
+    "\n",
+    "The previous notebook, \"analyze toydata\", deals with a similar problem and can serve as a guideline for this exercise. You may also want to have a look at the [cheat sheet](https://franziskahorn.de/mlws_resources/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
+    "\n",
+    "Feel free to get creative! "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# first load some libraries that are needed later\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "import matplotlib.pyplot as plt\n",
+    "from scipy.stats import pearsonr\n",
+    "from scipy.optimize import minimize\n",
+    "# machine learning stuff\n",
+    "from sklearn.metrics import mean_absolute_error\n",
+    "from sklearn.dummy import DummyRegressor\n",
+    "from sklearn.preprocessing import MaxAbsScaler\n",
+    "from sklearn.pipeline import make_pipeline\n",
+    "from sklearn.linear_model import ElasticNetCV\n",
+    "from sklearn.svm import SVR\n",
+    "from sklearn.ensemble import RandomForestRegressor, StackingRegressor\n",
+    "from sklearn.model_selection import GridSearchCV, train_test_split\n",
+    "from sklearn import tree\n",
+    "from sklearn.inspection import PartialDependenceDisplay, permutation_importance\n",
+    "from sklearn.manifold import TSNE\n",
+    "# interactive plotting (parallel coordinate plot)\n",
+    "import plotly.express as px\n",
+    "# suppress unnecessary warnings\n",
+    "import warnings\n",
+    "warnings.simplefilter(action='ignore', category=FutureWarning)\n",
+    "\n",
+    "# these \"magic commands\" are helpful if you plan to import functions from another script\n",
+    "# where you keep changing things, i.e., if you change a function in the script\n",
+    "# it will automagically be reloaded in the notebook so you work with the latest version\n",
+    "%load_ext autoreload\n",
+    "%autoreload 2"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 0: Load the data\n",
+    "\n",
+    "The original data can be obtained from the [UCI ML data repository](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength). (If you're having trouble loading the xls file, you can also open it in Excel and then save it as a CSV file and use `pd.read_csv` instead.)\n",
+    "\n",
+    "One data point in the dataset corresponds to one concrete mixture with the following variables:\n",
+    "- Mixture components (unit: kg in a m^3 mixture):\n",
+    "    - Cement\n",
+    "    - Blast Furnace Slag\n",
+    "    - Fly Ash\n",
+    "    - Water\n",
+    "    - Superplasticizer\n",
+    "    - Coarse Aggregate\n",
+    "    - Fine Aggregate\n",
+    "- Age (number of days the concrete mixture hardened before the compressive strength was measured)\n",
+    "- Concrete compressive strength (unit: MPa; the main quality indicator for concrete, i.e., how strong it is)\n",
+    "\n",
+    "Our goal is to **predict the compressive strength** of a concrete mixture based on the amounts of the different components. We will later keep only the 28-day measurements, since the 28-day strength is the most important measurement for determining whether the concrete complies with the relevant standard and may be used for construction."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# load the file into a dataframe with pandas\n",
+    "df = pd.read_excel(\"../data/Concrete_Data.xls\")\n",
+    "# look at the raw data (first 5 rows)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# the column names are not super convenient for an analysis, so let's rename them\n",
+    "RENAME_DICT = {\n",
+    "    'Cement (component 1)(kg in a m^3 mixture)': \"cement\",\n",
+    "    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': \"slag\",\n",
+    "    'Fly Ash (component 3)(kg in a m^3 mixture)': \"fly_ash\",\n",
+    "    'Water  (component 4)(kg in a m^3 mixture)': \"water\",\n",
+    "    'Superplasticizer (component 5)(kg in a m^3 mixture)': \"plasticizer\",\n",
+    "    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': \"coarse_aggregate\",\n",
+    "    'Fine Aggregate (component 7)(kg in a m^3 mixture)': \"fine_aggregate\", \n",
+    "    'Age (day)': \"age\",\n",
+    "    'Concrete compressive strength(MPa, megapascals) ': \"strength\",\n",
+    "}\n",
+    "df = df.rename(columns=RENAME_DICT)\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 1: Exploratory Analysis\n",
+    "\n",
+    "Get a better understanding of the data and the problem:\n",
+    "- How are the individual variables distributed?\n",
+    "- Are any variables correlated? \n",
+    "- Do you observe any patterns between the input and target variables? Do these make sense or is anything surprising?\n",
+    "- Anything else you should take into account when preprocessing the data later for the supervised learning part?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# create some plots to better understand the data\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 2: Predict the 28-day Compressive Strength\n",
+    "\n",
+    "Now that you've become more familiar with the dataset, it's time to tackle the real task, i.e., predict the 28-day compressive strength of a concrete mixture.\n",
+    "\n",
+    "An evaluation pipeline is already set up below using a \"stupid baseline\" (= predicting the mean). Your task is to improve upon the performance by trying... \n",
+    "- different models\n",
+    "- different preprocessing steps (e.g., transformations or feature engineering)\n",
+    "- hyperparameter tuning\n",
+    "\n",
+    "Get creative :-)\n",
+    "\n",
+    "**Tip:** Have a look at the [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function from sklearn to combine multiple steps (e.g., preprocessing and prediction model) into a single estimator object that can be applied to the original data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# the most important measurement to check whether the concrete complies with\n",
+    "# the standard is the one taken after 28 days (-> almost half of our samples)\n",
+    "df[\"age\"].value_counts()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# for the prediction task we use only the 28-day samples \n",
+    "df = df.loc[df[\"age\"] == 28].reset_index(drop=True)\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# define features (all input variables except age) and target\n",
+    "features = ['cement', 'slag', 'fly_ash', 'water', 'plasticizer', 'coarse_aggregate', 'fine_aggregate']\n",
+    "target = 'strength'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# you might have noticed that there are a few samples that have the exact\n",
+    "# same feature values, but different strengths...\n",
+    "# we simply assume that these were repeat measurements and take the average\n",
+    "df = df.groupby(features).mean().reset_index()\n",
+    "df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# extract training and validation/test sets from this dataframe\n",
+    "# (normally, we would also set aside a final test set,\n",
+    "# but we only have very few samples here)\n",
+    "X = df[features]\n",
+    "y = df[target]\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def eval_model(model, X_train, y_train, X_test, y_test):\n",
+    "    \"\"\"\n",
+    "    Function to evaluate a trained regressor: prints R^2 and mean absolute error.\n",
+    "    \n",
+    "    Inputs:\n",
+    "        - model: the trained model\n",
+    "        - X_train, y_train: the training data\n",
+    "        - X_test, y_test: the test data\n",
+    "    \"\"\"\n",
+    "    print(f\"R^2 on training data: {model.score(X_train, y_train):.3f}\")\n",
+    "    print(f\"R^2 on test data: {model.score(X_test, y_test):.3f}\")\n",
+    "    print(f\"MAE on training data: {mean_absolute_error(y_train, model.predict(X_train)):.3f}\")\n",
+    "    print(f\"MAE on test data: {mean_absolute_error(y_test, model.predict(X_test)):.3f}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# train a dummy model\n",
+    "model = DummyRegressor()\n",
+    "model = model.fit(X_train, y_train)\n",
+    "# evaluate the model\n",
+    "eval_model(model, X_train, y_train, X_test, y_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# now it's up to you: try an actual model and get better predictions!\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 3: Interpret the model\n",
+    "\n",
+    "After you've found a model that achieves a decent performance, try to understand what it's doing.\n",
+    "- Calculate the permutation feature importance to see which features are most influential overall\n",
+    "- For the most important features, look at the partial dependence plot to see _how_ these features influence the outcome\n",
+    "\n",
+    "Do these results make sense in terms of the actual physical process?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# permutation feature importance\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# partial dependence plots\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 4: Optimization & What-If Analysis\n",
+    "\n",
+    "When mixing concrete, water can usually be dosed rather flexibly.\n",
+    "\n",
+    "Let's say our goal is to achieve a compressive strength after 28 days of 42.5 MPa. \n",
+    "\n",
+    "Use your prediction model on the test set to see whether the concrete is getting too strong or too weak and then change the water levels accordingly to make sure the production is on target.\n",
+    "\n",
+    "You can run the code below as is, just make sure that `model` is an estimator object that also includes all necessary preprocessing steps (e.g., by using the `make_pipeline` function mentioned above).\n",
+    "\n",
+    "Does your model help to get the production more on target?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def optimize_water(model, x, value_min, value_max, target_strength=42.5):\n",
+    "    \"\"\"\n",
+    "    Optimize the water content for a concrete mixture.\n",
+    "    \n",
+    "    Inputs:\n",
+    "        - model: the trained model\n",
+    "        - x: pandas dataframe row with one data point\n",
+    "        - value_min: minimum bound for water content\n",
+    "        - value_max: maximum bound for water content\n",
+    "        - target_strength: what we would like the output to be (default: 42.5)\n",
+    "    Returns:\n",
+    "        - water_org: original water content\n",
+    "        - water_new: optimized water content\n",
+    "        - pred_org: original strength prediction of the model\n",
+    "        - pred_new: strength prediction with optimized water content\n",
+    "    \"\"\"\n",
+    "    # original situation\n",
+    "    water_org = x[\"water\"].values[0]\n",
+    "    pred_org = model.predict(x)[0]\n",
+    "    print(f\"original prediction with water content {water_org:.1f}: {pred_org:.2f} MPa\")\n",
+    "    \n",
+    "    def _loss_fun(water_value):\n",
+    "        \"\"\"\n",
+    "        Nested function (i.e., has access to all variables from the enclosing function)\n",
+    "        to compute the squared error between the model's strength prediction with the given \n",
+    "        water value and our target strength value.\n",
+    "        \n",
+    "        Inputs:\n",
+    "            - water_value: np.array with a single value, the proposed water content\n",
+    "        Returns:\n",
+    "            - loss: the squared error between the predicted and target strength\n",
+    "        \"\"\"\n",
+    "        # insert the new value into our original data point\n",
+    "        new_x = x.copy()\n",
+    "        new_x[\"water\"] = water_value[0]\n",
+    "        # predict strength with new water content\n",
+    "        pred_strength = model.predict(new_x)[0]\n",
+    "        # optimization loss = squared difference to target value\n",
+    "        loss = (target_strength - pred_strength)**2\n",
+    "        return loss\n",
+    "    \n",
+    "    # use scipy's minimize function to find a value for 'water'\n",
+    "    # where the model predicts something close to our target value.\n",
+    "    # the start value for the optimization is the original water content.\n",
+    "    # to get realistic values, we additionally specify bounds\n",
+    "    # based on the actual min/max values for the water content\n",
+    "    res = minimize(_loss_fun, np.array([water_org]), bounds=[(value_min, value_max)])\n",
+    "    # the optimized water content is stored in res.x (again a np.array)\n",
+    "    water_new = res.x[0]\n",
+    "    # check the final strength prediction\n",
+    "    new_x = x.copy()\n",
+    "    new_x[\"water\"] = water_new\n",
+    "    pred_new = model.predict(new_x)[0]\n",
+    "    print(f\"new prediction with water content {water_new:.1f}: {pred_new:.2f} MPa\")\n",
+    "    return water_org, water_new, pred_org, pred_new\n",
+    "    \n",
+    "\n",
+    "def optimize_water_all(model, X, target_strength=42.5):\n",
+    "    \"\"\"\n",
+    "    Compute the optimized water content for all data points.\n",
+    "    \n",
+    "    Inputs:\n",
+    "        - model: the trained model\n",
+    "        - X: pandas dataframe with input features for all data points\n",
+    "        - target_strength: what we would like the output to be (default: 42.5)\n",
+    "    Returns:\n",
+    "        - water_org_s: original water content for all data points\n",
+    "        - water_new_s: optimized water content for all data points\n",
+    "        - pred_org_s: original strength prediction of the model for all data points\n",
+    "        - pred_new_s: strength prediction with optimized water content for all data points\n",
+    "    \"\"\"\n",
+    "    # bounds for optimization: known, realistic values for water content\n",
+    "    value_min, value_max = X[\"water\"].min(), X[\"water\"].max()\n",
+    "    # run the optimization for all data points\n",
+    "    water_org_s, water_new_s, pred_org_s, pred_new_s = [], [], [], []\n",
+    "    for i in range(len(X)):\n",
+    "        water_org, water_new, pred_org, pred_new = optimize_water(model, X.iloc[[i]], value_min, value_max, \n",
+    "                                                                  target_strength)\n",
+    "        water_org_s.append(water_org)\n",
+    "        water_new_s.append(water_new)\n",
+    "        pred_org_s.append(pred_org)\n",
+    "        pred_new_s.append(pred_new)\n",
+    "    # convert lists to numpy arrays for easier plotting\n",
+    "    water_org_s, water_new_s = np.array(water_org_s), np.array(water_new_s)\n",
+    "    pred_org_s, pred_new_s = np.array(pred_org_s), np.array(pred_new_s)\n",
+    "    return water_org_s, water_new_s, pred_org_s, pred_new_s\n",
+    "\n",
+    "\n",
+    "def plot_optimization(water_org_s, water_new_s, pred_org_s, pred_new_s, y, target_strength=42.5):\n",
+    "    \"\"\"\n",
+    "    Create two plots based on the results from optimize_water_all:\n",
+    "        1. What-if results after optimization: by changing the water content, the strength predictions\n",
+    "           should be closer to the target strength, even after correcting for prediction errors.\n",
+    "           The legend includes the MATD = mean absolute target deviation, i.e., how far away the\n",
+    "           respective points are from the target strength on average.\n",
+    "        2. Original and optimized water content and resulting strength increase/decrease.\n",
+    "    \n",
+    "    Inputs:\n",
+    "        - water_org_s: original water content for all data points\n",
+    "        - water_new_s: optimized water content for all data points\n",
+    "        - pred_org_s: original strength prediction of the model for all data points\n",
+    "        - pred_new_s: strength prediction with optimized water content for all data points\n",
+    "        - y: pandas series with true compressive strength values for all data points\n",
+    "        - target_strength: what we would like the output to be (default: 42.5)\n",
+    "    \"\"\"\n",
+    "    # convert y to a numpy array to make sure the indices match up with the other arrays\n",
+    "    target_org_s = y.to_numpy()\n",
+    "    \n",
+    "    # plot the optimization results\n",
+    "    plt.figure(figsize=(10, 5))\n",
+    "    # dashed line to indicate the target value we wanted\n",
+    "    plt.hlines(target_strength, 0, len(water_org_s), \"k\", \"dashed\", linewidth=1)\n",
+    "    # original target values as light blue dots\n",
+    "    plt.plot(target_org_s, \"o\", c=\"#53DDFE\", alpha=0.8, \n",
+    "             label=f\"original y (MATD: {np.abs(target_org_s - target_strength).mean():.1f})\")\n",
+    "    # predicted target values with original water content as blue x\n",
+    "    plt.plot(pred_org_s, \"x\", c=\"#007693\", \n",
+    "             label=f\"predicted y (MATD: {np.abs(pred_org_s - target_strength).mean():.1f})\")\n",
+    "    # predicted target values with optimized water content as orange x\n",
+    "    plt.plot(pred_new_s, \"x\", c=\"#A84801\", \n",
+    "             label=f\"optimized predicted y (MATD: {np.abs(pred_new_s - target_strength).mean():.1f})\")\n",
+    "    # since our original predictions are not perfect, we shift our optimized predictions\n",
+    "    # by the error we made on the original predictions -> plot as orange dots\n",
+    "    pred_new_corrected_s = pred_new_s + (target_org_s - pred_org_s)\n",
+    "    plt.plot(pred_new_corrected_s, \"o\", alpha=0.8, c=\"#FE9C53\", \n",
+    "             label=f\"realistic optimized y (MATD: {np.abs(pred_new_corrected_s - target_strength).mean():.1f})\")\n",
+    "    plt.xticks([], [])\n",
+    "    plt.xlabel(\"samples\")\n",
+    "    plt.ylabel(\"compressive strength [MPa]\")\n",
+    "    plt.title(\"What-If Analysis\")\n",
+    "    plt.legend(loc=2, bbox_to_anchor=(1.02, 1), numpoints=1)\n",
+    "    \n",
+    "    # plot original and optimized water content\n",
+    "    plt.figure()\n",
+    "    # diagonal line -> original and optimized water content are the same\n",
+    "    plt.plot([water_new_s.min(), water_new_s.max()], [water_new_s.min(), water_new_s.max()], \"k\", alpha=0.5)\n",
+    "    # points above the line: more water than before\n",
+    "    # points below the line: less water than before\n",
+    "    # color of dot shows whether the optimization resulted in a reduction or increase in strength\n",
+    "    plt.scatter(water_org_s, water_new_s, c=pred_new_s-pred_org_s)\n",
+    "    plt.colorbar()\n",
+    "    plt.xlabel(\"original water content\")\n",
+    "    plt.ylabel(\"optimized water content\")\n",
+    "    plt.title(\"changes in water & strength\");\n",
+    "    "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# run the optimization on the test set\n",
+    "water_org_s, water_new_s, pred_org_s, pred_new_s = optimize_water_all(model, X_test)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# plot the results\n",
+    "plot_optimization(water_org_s, water_new_s, pred_org_s, pred_new_s, y_test)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Step 5: Presentation of results\n",
+    "Clean up your code & think about which results you want to present and the story they tell:\n",
+    "- What have you learned about concrete production and how is this reflected in the data?\n",
+    "- What is the best model that you found & its performance?\n",
+    "- Which preprocessing steps had the most impact on the performance (for different models)?\n",
+    "- Which features were the most influential and how did they impact the model's prediction?\n",
+    "- What have you learned in this case study? Did any of the results surprise you?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.10.2"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
--- a/requirements.txt
+++ b/requirements.txt
@@ -5,6 +5,7 @@ scikit-learn>=1.1.2
 matplotlib>=3.5.1
 pillow>=9.1.0
 plotly>=5.7.0
+xlrd>=2.0.1
 torch>=1.12.1
 torchvision>=0.13.1
 skorch>=0.11.0
--- a/test_installation.ipynb
+++ b/test_installation.ipynb
@@ -29,6 +29,8 @@
    "print(\"pillow\", PIL.__version__)         # >= 9.1.0\n",
    "import plotly\n",
    "print(\"plotly\", plotly.__version__)      # >= 5.7.0\n",
+    "import xlrd\n",
+    "print(\"xlrd\", xlrd.__version__)      # >= 2.0.1\n",
    "print(\"Congratulations! Your installation of the basic libraries was successful!\")\n",
    "# the following libraries are needed for the neural network example \n",
    "# (if you're working with the recommended pytorch, not keras/tensorflow)\n",