# Predicting hard drive failures

**Scenario:** In a data center with many hard drives, a drive will occasionally fail. To prevent data loss, it is the data scientist's (i.e., your) task to predict as early as possible when a drive might fail.

The original data can be downloaded from [Backblaze](https://www.backblaze.com/b2/hard-drive-test-data.html).
It has already been cleaned and restructured for your convenience (see `data/hdf_data`). The preprocessing included:

- removing NaNs
- keeping only data from the most frequent drive model (to avoid artifacts due to differences in SMART recordings)
- creating a dataframe where each drive is one data point with the information whether it failed or not (= class label)

The original data consists of daily SMART statistics for all drives installed in the data center at the time (i.e., for each drive until it failed). Your task is to build a binary classification model that receives the measurements from all drives every day and predicts which of these drives are likely to fail in the next hours or days.

To train such a model, you are given a simplified dataset that contains only a single measurement per drive. For drives that did not fail (class 0), this is a snapshot from a random time point during the year. For drives that failed (class 1), it is the SMART statistics either from the day the drive failed (csv files ending in `_0`) or from a few days before (e.g., `_1` for 1 day before the failure, `_7` for 7 days, etc.). This means that with the data from `df_2016_0.csv` you can build a model that predicts whether a drive will fail today, while a model trained on `df_2016_7.csv` can predict whether a drive will fail one week from now.

(Normally, you would make use of the measurements over time, e.g., track the maximum values seen so far or do some other feature engineering to improve the performance, but for the sake of simplicity we only use these individual snapshots here.)

Use the data from 2016 for training the model and tuning hyperparameters, and the data from 2017 for the final evaluation. This gives a realistic estimate of how well the model handles slight data drift.
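
The file layout described above could be wrapped in a small loading helper. This is only a sketch: `load_snapshot` is a hypothetical name, the 2017 files are assumed to follow the same `df_<year>_<days>.csv` pattern as the 2016 ones, and the toy column names (`smart_5_raw`, `smart_9_raw`) are made up so that the example runs without the real dataset.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_snapshot(data_dir, year, days_before):
    """Load one snapshot file and split it into features and labels.

    Assumes files are named df_<year>_<days_before>.csv and contain a
    "failure" column plus feature columns whose names start with "smart".
    """
    df = pd.read_csv(Path(data_dir) / f"df_{year}_{days_before}.csv")
    feat_cols = [c for c in df.columns if c.startswith("smart")]
    return df[feat_cols], df["failure"]


# Tiny made-up files so the sketch runs without the real dataset.
demo_dir = Path(tempfile.mkdtemp())
toy = pd.DataFrame({
    "smart_5_raw": [0, 8, 0, 120],
    "smart_9_raw": [100, 2500, 900, 4000],
    "failure": [0, 0, 0, 1],
})
for year in (2016, 2017):
    toy.to_csv(demo_dir / f"df_{year}_7.csv", index=False)

X_train, y_train = load_snapshot(demo_dir, 2016, 7)  # training and tuning
X_final, y_final = load_snapshot(demo_dir, 2017, 7)  # final evaluation only
print(X_train.shape, int(y_final.sum()))
```

With the real data, `data_dir` would be `../data/hdf_data`. Keeping the 2017 features and labels in separate variables makes it harder to accidentally tune hyperparameters on them.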

More about the SMART attributes used as features in this problem can be found on [Wikipedia](https://en.wikipedia.org/wiki/S.M.A.R.T.).

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
# suppress unnecessary warnings
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

%load_ext autoreload
%autoreload 2

In [None]:
# load the data with the SMART statistics of the drives.
# with the data ending in _0, we can learn to predict if a drive has failed or is working properly;
# try e.g. df_2016_7.csv to predict failures a week in advance
df = pd.read_csv("../data/hdf_data/df_2016_0.csv")
# have a look at what we've loaded
df.head()

In [None]:
# construct training and test data from this dataframe - use only the smart statistics as features
feat_cols = [c for c in df.columns if c.startswith("smart")]
X = df[feat_cols].to_numpy()
y = df["failure"].to_numpy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=23)
# see how imbalanced the label distribution in the training and test sets is
print(f"Fraction of ok items in training set: {1-np.mean(y_train):.3f}")
print(f"Fraction of ok items in test set: {1-np.mean(y_test):.3f}")

In [None]:
def eval_clf(clf, X_train, y_train, X_test, y_test):
    """
    Function to evaluate a trained classifier: prints accuracy and balanced accuracy scores.
    
    Inputs:
        - clf: the trained classifier
        - X_train, y_train: the training data
        - X_test, y_test: the test data
    """
    print(f"Accuracy on training data: {clf.score(X_train, y_train):.3f}")
    print(f"Accuracy on test data: {clf.score(X_test, y_test):.3f}")
    print(f"Balanced accuracy on training data: {balanced_accuracy_score(y_train, clf.predict(X_train)):.3f}")
    print(f"Balanced accuracy on test data: {balanced_accuracy_score(y_test, clf.predict(X_test)):.3f}")

In [None]:
# train a dummy model
clf = DummyClassifier(strategy="most_frequent")
clf = clf.fit(X_train, y_train)
# evaluate the model
# later, make sure to pass the correct training and test data, e.g., in case you scaled your data etc.
eval_clf(clf, X_train, y_train, X_test, y_test)

-------------------------------------------------------------------------------------
You've been given this rudimentary prediction pipeline; now your job is to improve it. Below are some things you might want to try, but feel free to get creative! Have a look at the [cheat sheet](https://github.com/cod3licious/ml_exercises/blob/main/cheatsheet.pdf) for more ideas and a concise overview of the relevant steps when developing a machine learning solution in any data science project.

### (Suggested) Steps

#### a) Get a better understanding of the problem
- Create a t-SNE plot of the data (from the features; color the dots in the scatter plot with the target variable): Do you think a classification model will do well on this data?
- Look at the variables in more detail: Are they normally/uniformly distributed?
- Try different kinds of models in place of the `DummyClassifier` (e.g. decision tree, linear model, SVM) and play around with the hyperparameters a little bit to get a better feeling for the problem.
- Would outlier detection make sense here? Why (not)?
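
One possible way to start on these steps is sketched below. It is not a reference solution: the data is a synthetic, imbalanced stand-in generated with `make_classification` (an assumption made so the example runs on its own), and the hyperparameters are arbitrary starting values. In the notebook you would use the `X_train`, `y_train`, `X_test`, `y_test` arrays created above instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the drive data (roughly 5% positives).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, weights=[0.95], random_state=23
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=23
)

# 2-D t-SNE embedding of a subsample (t-SNE gets slow for many points).
embedding = TSNE(n_components=2, random_state=23).fit_transform(X[:300])

# A few model families; class_weight="balanced" counteracts the label imbalance.
models = {
    "decision tree": DecisionTreeClassifier(max_depth=5, class_weight="balanced", random_state=23),
    "logistic regression": LogisticRegression(class_weight="balanced", max_iter=1000),
    "SVM (RBF kernel)": SVC(class_weight="balanced"),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = balanced_accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: balanced accuracy on test data = {scores[name]:.3f}")
```

The embedding can be shown with `plt.scatter(embedding[:, 0], embedding[:, 1], c=y[:300])`. If the two classes overlap heavily in that plot, this hints that the raw features alone may not separate them well.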

#### b) Improve the prediction performance
- Try different normalizations of the data (e.g. using the `StandardScaler`): How do the t-SNE plot and performance of the different models change? Why does a decision tree not improve? Can you apply some other transformations to make the features more normally distributed?
- Are any variables highly correlated? How does the performance change when you remove some features? Do you have any other feature engineering ideas? Again observe how your previous results change as you modify the input features!
- Systematically find optimal hyperparameters for your models using `GridSearchCV` and decide what you want to use as your final model.
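
Scaling and hyperparameter search can be combined in a single scikit-learn `Pipeline`, as in the following minimal sketch. The synthetic data, the choice of logistic regression, and the grid of `C` values are assumptions for illustration only; any of the models you tried earlier could take its place.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the drive data (roughly 10% positives).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5, weights=[0.9], random_state=23
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=23
)

# The scaler is part of the pipeline, so it is re-fit on each training fold
# and never sees the validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
param_grid = {"clf__C": [0.01, 0.1, 1.0, 10.0]}

# Balanced accuracy as the selection criterion, since the classes are imbalanced.
search = GridSearchCV(pipe, param_grid, scoring="balanced_accuracy", cv=5)
search.fit(X_train, y_train)

print(f"Best parameters: {search.best_params_}")
print(f"Cross-validated balanced accuracy: {search.best_score_:.3f}")
print(f"Balanced accuracy on test data: {search.score(X_test, y_test):.3f}")
```

Because the search was configured with `scoring="balanced_accuracy"`, `search.score` reports balanced accuracy as well, which keeps the numbers comparable to those printed by `eval_clf`.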

#### c) Final evaluation & model interpretation
- Try to better understand what your model is doing: Which variables are the most predictive of failures?
- Predict failures multiple days in advance by training and evaluating your models on the other csv files from 2016 (e.g. `df_2016_7.csv` for 7 days before the drive fails). How many days in advance is a reliable prediction possible (e.g. plot "days before failure" vs "balanced accuracy")?
- Evaluate your final model (trained on a complete dataframe from 2016) on the respective data from 2017.
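
The "days before failure" comparison could be organized as a loop over snapshots, roughly as sketched here. Everything in this block is illustrative: `evaluate_horizon` and `make_toy_snapshot` are hypothetical helpers, the random forest is just one possible model, and the toy data is constructed so that the failure signal fades with the prediction horizon. With the real data, each dataframe would instead come from `pd.read_csv(f"../data/hdf_data/df_2016_{d}.csv")`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_horizon(df, random_state=23):
    """Train on one snapshot dataframe; return balanced accuracy on a held-out split."""
    feat_cols = [c for c in df.columns if c.startswith("smart")]
    X_train, X_test, y_train, y_test = train_test_split(
        df[feat_cols], df["failure"],
        test_size=0.2, stratify=df["failure"], random_state=random_state,
    )
    clf = RandomForestClassifier(
        n_estimators=50, class_weight="balanced", random_state=random_state
    )
    clf.fit(X_train, y_train)
    return balanced_accuracy_score(y_test, clf.predict(X_test))


def make_toy_snapshot(days_before, n=400, seed=0):
    """Made-up data: the failure signal gets weaker the further ahead we look."""
    rng = np.random.default_rng(seed)
    failure = (rng.random(n) < 0.2).astype(int)
    signal = 3.0 / (1 + days_before)
    return pd.DataFrame({
        "smart_a": rng.normal(size=n) + signal * failure,  # informative feature
        "smart_b": rng.normal(size=n),                     # pure noise
        "failure": failure,
    })


horizons = [0, 1, 7]
results = {d: evaluate_horizon(make_toy_snapshot(d)) for d in horizons}
for d, score in results.items():
    print(f"{d} days before failure: balanced accuracy = {score:.3f}")
```

Plotting `list(results)` against `list(results.values())` then gives the "days before failure" vs. "balanced accuracy" curve. For the final evaluation, the same idea applies, except that the model is fit on the complete 2016 dataframe and scored on the corresponding 2017 dataframe rather than on a random split.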

#### d) Presentation of results
Clean up your code & think about which results you want to present + the story they tell:
- What is the best model that you found & its performance?
- Which preprocessing steps had the most impact on the performance?
- What worked and what didn't for the different models?
- Which of the SMART statistics indicate that a drive will fail?
- How many days in advance can you predict a hard drive failure?
- How well does your model perform on the new data from 2017?
- What have you learned in this case study? Did any of the results surprise you?
-------------------------------------------------------------------------------------