ml_exercises/notebooks/5a_quality_prediction.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Case Study: Concrete Compressive Strength Prediction\n",
    "\n",
    "In this case study, you'll solve a machine learning problem from start to finish.\n",
    "\n",
    "The previous notebook, \"analyze toydata\", deals with a similar problem and can serve as a guideline for this exercise. You may also want to have a look at the [cheat sheet](https://franziskahorn.de/mlws_resources/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project. \n",
    "\n",
    "Feel free to get creative! \n",
    "\n",
    "### The Task\n",
    "\n",
    "You are a data scientist at a startup that wants to build an app to help construction workers mix concrete of a certain target compressive strength (= the main quality parameter of concrete). The construction worker can enter the amounts of the different ingredients:\n",
    "```\n",
    "Cement:            _____ kg/m^3\n",
    "Slag:              _____ kg/m^3\n",
    "Fly Ash:           _____ kg/m^3\n",
    "Fine Aggregate:    _____ kg/m^3\n",
    "Coarse Aggregate:  _____ kg/m^3\n",
    "Plasticizer:       _____ kg/m^3\n",
    "Water:             _____ kg/m^3\n",
    "```\n",
    "and the app then tells them what the estimated compressive strength of this mixture will be:\n",
    "```\n",
    "Predicted Compressive Strength after 28 days: XX.X MPa\n",
    "```\n",
    "**Your job is it to build an ML model that predicts this compressive strength.**\n",
    "\n",
    "Your colleagues have already prepared the backend for the app (see notebook `b`), which will load your model to make the predictions. Additionally, the backend also contains code to optimize the water content in the concrete recipe so that the mixture will get closer to the specified target strength:\n",
    "```\n",
    "Target strength:   _____ MPa\n",
    "\n",
    "Recommended water amount: XX.X kg/m^3\n",
    "New predicted strength with optimized water amount: XX.X MPa\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import requests\n",
    "import joblib\n",
    "from sklearn.dummy import DummyRegressor\n",
    "from sklearn.metrics import mean_absolute_error\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "# these \"magic commands\" are helpful if you plan to import functions from another script\n",
    "# where you keep changing things, i.e., if you change a function in the script\n",
    "# it will automagically be reloaded in the notebook so you work with the latest version\n",
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 0: Load the data\n",
    "\n",
    "The original data can be obtained from the [UCI ML data repository](https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength). (If you're having trouble loading the xls file, you can also open it in Excel and then save it as a CSV file and use `pd.read_csv` instead.)\n",
    "\n",
    "One data point in the dataset corresponds to one concrete mixture with the following variables:\n",
    "- Mixture components (unit: kg in a m^3 mixture):\n",
    "    - Cement\n",
    "    - Blast Furnace Slag\n",
    "    - Fly Ash\n",
    "    - Water\n",
    "    - Superplasticizer\n",
    "    - Coarse Aggregate\n",
    "    - Fine Aggregate\n",
    "- Age (number of days the concrete mixture hardened before the compressive strength was measured)\n",
    "- Concrete compressive strength (unit: MPa; the main quality indicator for concrete (i.e., how strong it is))\n",
    "\n",
    "Our goal is to **predict the compressive strength** of a concrete mixture based on the amounts of the different components. We will later filter the data for only the 28-day measurements, since this is the most important measurement to determine whether the concrete is within the norm constraints and okay to be used for construction."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the file into a dataframe with pandas\n",
    "df = pd.read_excel(\"../data/Concrete_Data.xls\")\n",
    "# look at the raw data (first 5 rows)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the column names are not super convenient for an analysis, so let's rename them\n",
    "RENAME_DICT = {\n",
    "    'Cement (component 1)(kg in a m^3 mixture)': \"cement\",\n",
    "    'Blast Furnace Slag (component 2)(kg in a m^3 mixture)': \"slag\",\n",
    "    'Fly Ash (component 3)(kg in a m^3 mixture)': \"fly_ash\",\n",
    "    'Water  (component 4)(kg in a m^3 mixture)': \"water\",\n",
    "    'Superplasticizer (component 5)(kg in a m^3 mixture)': \"plasticizer\",\n",
    "    'Coarse Aggregate  (component 6)(kg in a m^3 mixture)': \"coarse_aggregate\",\n",
    "    'Fine Aggregate (component 7)(kg in a m^3 mixture)': \"fine_aggregate\", \n",
    "    'Age (day)': \"age\",\n",
    "    'Concrete compressive strength(MPa, megapascals) ': \"strength\",\n",
    "}\n",
    "df = df.rename(columns=RENAME_DICT)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 1: Exploratory Analysis\n",
    "\n",
    "Get a better understanding of the data and the problem:\n",
    "- How are the individual variables distributed?\n",
    "- Are any variables correlated? \n",
    "- Do you observe any patterns between the input and target variables? Do these make sense or is anything surprising?\n",
    "- Anything else you should take into account when preprocessing the data later for the supervised learning task?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: create some plots to better understand the data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 2: Predict the 28-day Compressive Strength\n",
    "\n",
    "Now that you've become more familiar with the dataset, it's time to tackle the real task, i.e., predict the 28-day compressive strength of a concrete mixture.\n",
    "\n",
    "An evaluation pipeline is already set up below using a \"stupid baseline\" (= predicting the mean). Your task is to improve upon the performance by trying, for exampe... \n",
    "- different [models](https://scikit-learn.org/stable/supervised_learning.html)\n",
    "- different [preprocessing steps](https://scikit-learn.org/stable/modules/preprocessing.html) (e.g., transformations or feature engineering)\n",
    "- [hyperparameter tuning](https://scikit-learn.org/stable/modules/grid_search.html)\n",
    "- [ensemble models](https://scikit-learn.org/stable/modules/ensemble.html) (i.e., combining different models)\n",
    "\n",
    "Get creative :-)\n",
    "\n",
    "**Tip:** To use your model within the app's backend later (notebook `b`), it's important that your final model incl. all necessary preprocessing steps are combined in a single estimator object. This can be accomplished with sklearn's [`make_pipeline`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html) function. If necessary, you could even write a [custom transformer](https://towardsdatascience.com/pipelines-custom-transformers-in-scikit-learn-the-step-by-step-guide-with-python-code-4a7d9b068156) to perform more fancy feature engineering steps than what is provided by sklearn.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# the most important measurement to check whether the concrete adheres to\n",
    "# the norm is after 28 days (-> almost half of our samples)\n",
    "df[\"age\"].value_counts()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# for the prediction task we use only the 28-day samples \n",
    "df = df.loc[df[\"age\"] == 28].reset_index(drop=True)\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define features (all input variables except age) and target\n",
    "# CAUTION: do not change this - the backend assumes these will be the inputs for your model\n",
    "features = [\"cement\", \"slag\", \"fly_ash\", \"water\", \"plasticizer\", \"coarse_aggregate\", \"fine_aggregate\"]\n",
    "target = \"strength\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you might have noticed that there are a few sampes that have the exact \n",
    "# same feature values, but different strengths...\n",
    "# we just assume that these were repeat measurements and take the average\n",
    "df = df.groupby(features).mean().reset_index()\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# extract training and validation/test sets from this dataframe\n",
    "# (normally, we would also set aside a final test set,\n",
    "# but we only have very few samples here)\n",
    "X = df[features]\n",
    "y = df[target]\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def eval_model(model, X_train, y_train, X_test, y_test):\n",
    "    \"\"\"\n",
    "    Function to evaluate a trained regressor: prints R^2 and mean absolute error.\n",
    "    \n",
    "    Inputs:\n",
    "        - model: the trained model\n",
    "        - X_train, y_train: the training data\n",
    "        - X_test, y_test: the test data\n",
    "    \"\"\"\n",
    "    print(f\"R^2 on training data: {model.score(X_train, y_train):.3f}\")\n",
    "    print(f\"R^2 on test data: {model.score(X_test, y_test):.3f}\")\n",
    "    print(f\"MAE on training data: {mean_absolute_error(y_train, model.predict(X_train)):.3f}\")\n",
    "    print(f\"MAE on test data: {mean_absolute_error(y_test, model.predict(X_test)):.3f}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# train a dummy model\n",
    "model = DummyRegressor()\n",
    "model = model.fit(X_train, y_train)\n",
    "# evaluate the model\n",
    "eval_model(model, X_train, y_train, X_test, y_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: now it's up to you: try an actual model and get better predictions!\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save your final model so it can be used by the backend\n",
    "# CAUTION: your model must be a single estimator object that also includes all necessary preprocessing steps \n",
    "# (e.g., by using the `make_pipeline` function mentioned above) which works on the originally defined features\n",
    "joblib.dump(model, \"strength_model.pkl\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 3: Interpret the model\n",
    "\n",
    "After you've found a model that achieves a decent performance, try to understand what it's doing.\n",
    "- Calculate the permutation feature importance to see which features are most influential overall\n",
    "- For the most important features, look at the partial dependence plot to see _how_ these features influence the outcome\n",
    "\n",
    "Do these results make sense in terms of the actual physical process?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: permutation feature importance\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# TODO: partial dependence plots\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 4: Use the backend\n",
    "\n",
    "Head over to notebook `b` and run it from top to bottom and leave it open. It runs the server with our backend using your trained and saved model to make predictions.\n",
    "\n",
    "In the code below we query the backend with a single data point to see what it returns."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# url where the backend is running\n",
    "port = 8000\n",
    "backend_url = f\"http://localhost:{port}/predict\""
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get one row from our feature matrix and convert it to a dictionary\n",
    "x = X_test.iloc[0].to_dict()\n",
    "# add a target strength (optional)\n",
    "x[\"target_strength\"] = 42.5\n",
    "# send this as a json to the backend via a POST request\n",
    "response = requests.post(backend_url, json=x)\n",
    "# the result will also be dictionary with\n",
    "# - the original water content\n",
    "# - the predicted strength with the original water content\n",
    "# - the optimized water content\n",
    "# - the new predicted strength when using the optimized water content\n",
    "result = response.json()\n",
    "result"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 5: Optimization & What-If Analysis\n",
    "\n",
    "When mixing concrete, water can usually be dosed rather flexibly.\n",
    "\n",
    "Let's say our goal is to achieve a compressive strength after 28 days of 42.5 MPa.\n",
    "\n",
    "Your prediction model is used in the backend to check whether the proposed concrete mixture would get too strong or too weak and then the water amount is adapted accordingly to get closer to the target.\n",
    "\n",
    "Does your model help to get the production more on target?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def plot_what_if(X, y, target_strength=42.5):\n",
    "    \"\"\"\n",
    "    Compute and plot the optimized the water content for all data points.\n",
    "    This code creates two plots:\n",
    "        1. What-if results after optimization: by changing the water content, the strength predictions\n",
    "           should be closer to the target strength, even after correcting for prediction errors.\n",
    "           The legend includes the MATD = mean absolute target deviation, i.e., how far away the\n",
    "           respective points are from the target strength on average.\n",
    "        2. Original and optimized water content and resulting strength increase/decrease.\n",
    "    \n",
    "    Inputs:\n",
    "        - X: pandas dataframe with original input features for all data points\n",
    "        - y: pandas dataframe with true compressive strength values for all data points\n",
    "        - target_strength: what we would like the output to be (default: 42.5)\n",
    "    \"\"\"\n",
    "    # run the optimization for all data points\n",
    "    water_org_s, water_new_s, pred_org_s, pred_new_s = [], [], [], []\n",
    "    for i in range(len(X)):\n",
    "        if not i % 10:\n",
    "            print(f\"backend queried for {i:2} / {len(X)} data points\", end=\"\\r\")\n",
    "        x = X.iloc[i].to_dict()\n",
    "        x[\"target_strength\"] = target_strength\n",
    "        result = requests.post(backend_url, json=x).json()\n",
    "        water_org_s.append(result[\"water_org\"])\n",
    "        water_new_s.append(result[\"water_new\"])\n",
    "        pred_org_s.append(result[\"pred_org\"])\n",
    "        pred_new_s.append(result[\"pred_new\"])\n",
    "    print(f\"backend queried for {len(X)} / {len(X)} data points\")\n",
    "    # convert lists to numpy arrays for easier plotting\n",
    "    water_org_s, water_new_s = np.array(water_org_s), np.array(water_new_s)\n",
    "    pred_org_s, pred_new_s = np.array(pred_org_s), np.array(pred_new_s)\n",
    "    # convert y to a numpy array to make sure the indices match up with the other arrays\n",
    "    target_org_s = y.to_numpy()\n",
    "    \n",
    "    # plot the optimization results\n",
    "    plt.figure(figsize=(10, 5))\n",
    "    # dashed line to indicated the target value we wanted\n",
    "    plt.hlines(target_strength, 0, len(water_org_s), \"k\", \"dashed\", linewidth=1)\n",
    "    # original target values as light blue dots\n",
    "    plt.plot(target_org_s, \"o\", c=\"#53DDFE\", alpha=0.8, \n",
    "             label=f\"original y (MATD: {np.abs(target_org_s - target_strength).mean():.1f})\")\n",
    "    # predicted target values with original water content as blue x\n",
    "    plt.plot(pred_org_s, \"x\", c=\"#007693\", \n",
    "             label=f\"predicted y (MATD: {np.abs(pred_org_s - target_strength).mean():.1f})\")\n",
    "    # predicted target values with optimized water content as orange x\n",
    "    plt.plot(pred_new_s, \"x\", c=\"#A84801\", \n",
    "             label=f\"optimized predicted y (MATD: {np.abs(pred_new_s - target_strength).mean():.1f})\")\n",
    "    # since our original predictions are not perfect, we shift our optimized predictions\n",
    "    # by the error we made on the original predictions -> plot as orange dots\n",
    "    pred_new_corrected_s = pred_new_s + (target_org_s - pred_org_s)\n",
    "    plt.plot(pred_new_corrected_s, \"o\", alpha=0.8, c=\"#FE9C53\", \n",
    "             label=f\"realistic optimized y (MATD: {np.abs(pred_new_corrected_s - target_strength).mean():.1f})\")\n",
    "    plt.xticks([], [])\n",
    "    plt.xlabel(\"samples\")\n",
    "    plt.ylabel(\"compressive strength [MPa]\")\n",
    "    plt.title(\"What-If Analysis\")\n",
    "    plt.legend(loc=2, bbox_to_anchor=(1.02, 1), numpoints=1)\n",
    "    \n",
    "    # plot original and optimized water content\n",
    "    plt.figure()\n",
    "    # diagonal line -> original and optimized water content are the same\n",
    "    plt.plot([water_new_s.min(), water_new_s.max()], [water_new_s.min(), water_new_s.max()], \"k\", alpha=0.5)\n",
    "    # points above the line: more water than before\n",
    "    # points below the line: less water than before\n",
    "    # color of dot shows whether the optimization resulted in a reduction or increase in predicted strength\n",
    "    plt.scatter(water_org_s, water_new_s, c=pred_new_s-pred_org_s)\n",
    "    plt.colorbar()\n",
    "    plt.xlabel(\"original water content\")\n",
    "    plt.ylabel(\"optimized water content\")\n",
    "    plt.title(\"changes in water & strength\");\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# run the optimization on the test set and plot the results\n",
    "plot_what_if(X_test, y_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Step 6: Presentation of results\n",
    "Clean up your code & think about which results you want to present and the story they tell:\n",
    "- What have you learned about concrete production and how is this reflected in the data?\n",
    "- What is the best model that you found & its performance?\n",
    "- Which preprocessing steps had the most impact on the performance (for different models)?\n",
    "- Which features were the most influential and how did they impact the model's prediction?\n",
    "- What have you learned in this case study? Did any of the results surprise you?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}