Update notebooks 1 to 8 to latest library versions (in particular Scikit-Learn 0.20)

Aurélien Geron
2018-12-21 10:18:31 +08:00
parent dc16446c5f
commit b54ee1b608
8 changed files with 694 additions and 586 deletions


@@ -661,15 +661,25 @@
"sample_incomplete_rows"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: Since Scikit-Learn 0.20, the `sklearn.preprocessing.Imputer` class was replaced by the `sklearn.impute.SimpleImputer` class."
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.preprocessing import Imputer\n",
"try:\n",
" from sklearn.impute import SimpleImputer # Scikit-Learn 0.20+\n",
"except ImportError:\n",
" from sklearn.preprocessing import Imputer as SimpleImputer\n",
"\n",
"imputer = Imputer(strategy=\"median\")"
"imputer = SimpleImputer(strategy=\"median\")"
]
},
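For readers following along outside the notebook, here is a minimal, self-contained sketch of the version-tolerant import together with a fit on a toy DataFrame (the data below is made up for illustration; the notebook itself applies the imputer to `housing_num`):

```python
import numpy as np
import pandas as pd

try:
    from sklearn.impute import SimpleImputer  # Scikit-Learn 0.20+
except ImportError:
    from sklearn.preprocessing import Imputer as SimpleImputer  # Scikit-Learn < 0.20

# Toy numeric data with a missing value (stand-in for housing_num).
df = pd.DataFrame({"rooms": [3.0, np.nan, 5.0], "age": [10.0, 20.0, 30.0]})

imputer = SimpleImputer(strategy="median")
imputer.fit(df)
print(imputer.statistics_)    # per-column medians learned during fit
print(imputer.transform(df))  # missing values replaced by those medians
```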
{
@@ -798,7 +808,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. However, the `OrdinalEncoder` class that is planned to be introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines (introduced later in this notebook). For now, we will import it from `future_encoders.py`, but once it is available you can import it directly from `sklearn.preprocessing`."
"**Warning**: earlier versions of the book used the `LabelEncoder` class or Pandas' `Series.factorize()` method to encode string categorical attributes as integers. However, the `OrdinalEncoder` class that was introduced in Scikit-Learn 0.20 (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)) is preferable since it is designed for input features (`X` instead of labels `y`) and it plays well with pipelines (introduced later in this notebook). If you are using an older version of Scikit-Learn (<0.20), then you can import it from `future_encoders.py` instead."
]
},
{
@@ -807,7 +817,10 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import OrdinalEncoder"
"try:\n",
" from sklearn.preprocessing import OrdinalEncoder\n",
"except ImportError:\n",
" from future_encoders import OrdinalEncoder # Scikit-Learn < 0.20"
]
},
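As a quick, hedged illustration of what `OrdinalEncoder` does with string features (toy data rather than the notebook's `housing_cat`):

```python
import numpy as np

try:
    from sklearn.preprocessing import OrdinalEncoder  # Scikit-Learn 0.20+
except ImportError:
    from future_encoders import OrdinalEncoder  # Scikit-Learn < 0.20

# A single string feature; the encoder expects a 2D array (one column per feature).
X = np.array([["INLAND"], ["NEAR BAY"], ["INLAND"], ["<1H OCEAN"]])

ordinal_encoder = OrdinalEncoder()
X_encoded = ordinal_encoder.fit_transform(X)
print(ordinal_encoder.categories_)  # categories learned for each feature
print(X_encoded)                    # each category replaced by its integer index
```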
{
@@ -834,7 +847,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book used the `LabelBinarizer` or `CategoricalEncoder` classes to convert each categorical value to a one-hot vector. It is now preferable to use the `OneHotEncoder` class. Right now it can only handle integer categorical inputs, but in Scikit-Learn 0.20 it will also handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)). So for now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.preprocessing` instead:"
"**Warning**: earlier versions of the book used the `LabelBinarizer` or `CategoricalEncoder` classes to convert each categorical value to a one-hot vector. It is now preferable to use the `OneHotEncoder` class. Since Scikit-Learn 0.20 it can handle string categorical inputs (see [PR #10521](https://github.com/scikit-learn/scikit-learn/issues/10521)), not just integer categorical inputs. If you are using an older version of Scikit-Learn, you can import the new version from `future_encoders.py`:"
]
},
{
@@ -843,7 +856,11 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import OneHotEncoder\n",
"try:\n",
" from sklearn.preprocessing import OrdinalEncoder # just to raise an ImportError if Scikit-Learn < 0.20\n",
" from sklearn.preprocessing import OneHotEncoder\n",
"except ImportError:\n",
" from future_encoders import OneHotEncoder # Scikit-Learn < 0.20\n",
"\n",
"cat_encoder = OneHotEncoder()\n",
"housing_cat_1hot = cat_encoder.fit_transform(housing_cat)\n",
@@ -959,7 +976,7 @@
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"num_pipeline = Pipeline([\n",
" ('imputer', Imputer(strategy=\"median\")),\n",
" ('imputer', SimpleImputer(strategy=\"median\")),\n",
" ('attribs_adder', CombinedAttributesAdder()),\n",
" ('std_scaler', StandardScaler()),\n",
" ])\n",
@@ -980,7 +997,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"**Warning**: earlier versions of the book applied different transformations to different columns using a solution based on a `DataFrameSelector` transformer and a `FeatureUnion` (see below). It is now preferable to use the `ColumnTransformer` class that will be introduced in Scikit-Learn 0.20. For now we import it from `future_encoders.py`, but when Scikit-Learn 0.20 is released, you can import it from `sklearn.compose` instead:"
"**Warning**: earlier versions of the book applied different transformations to different columns using a solution based on a `DataFrameSelector` transformer and a `FeatureUnion` (see below). It is now preferable to use the `ColumnTransformer` class that was introduced in Scikit-Learn 0.20. If you are using an older version of Scikit-Learn, you can import it from `future_encoders.py`:"
]
},
{
@@ -989,7 +1006,10 @@
"metadata": {},
"outputs": [],
"source": [
"from future_encoders import ColumnTransformer"
"try:\n",
" from sklearn.compose import ColumnTransformer\n",
"except ImportError:\n",
" from future_encoders import ColumnTransformer # Scikit-Learn < 0.20"
]
},
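Assuming Scikit-Learn 0.20+ (older versions would fall back to `future_encoders.py` as above), here is a minimal sketch of `ColumnTransformer` applying different transformers to numeric and categorical columns; the tiny DataFrame and column names are illustrative, not the notebook's `housing` data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer          # Scikit-Learn 0.20+
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative stand-in for the housing DataFrame.
df = pd.DataFrame({
    "median_income": [3.2, 5.1, 2.7],
    "housing_median_age": [25.0, 40.0, 12.0],
    "ocean_proximity": ["INLAND", "NEAR BAY", "INLAND"],
})

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["median_income", "housing_median_age"]),
    ("cat", OneHotEncoder(), ["ocean_proximity"]),
])

prepared = preprocess.fit_transform(df)  # scaled numbers + one-hot columns, side by side
print(prepared)
```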
{
@@ -1070,7 +1090,7 @@
"\n",
"old_num_pipeline = Pipeline([\n",
" ('selector', OldDataFrameSelector(num_attribs)),\n",
" ('imputer', Imputer(strategy=\"median\")),\n",
" ('imputer', SimpleImputer(strategy=\"median\")),\n",
" ('attribs_adder', CombinedAttributesAdder()),\n",
" ('std_scaler', StandardScaler()),\n",
" ])\n",
@@ -1275,6 +1295,13 @@
"display_scores(lin_rmse_scores)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Note**: we specify `n_estimators=10` to avoid a warning about the fact that the default value is going to change to 100 in Scikit-Learn 0.22."
]
},
{
"cell_type": "code",
"execution_count": 91,
@@ -1283,7 +1310,7 @@
"source": [
"from sklearn.ensemble import RandomForestRegressor\n",
"\n",
"forest_reg = RandomForestRegressor(random_state=42)\n",
"forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)\n",
"forest_reg.fit(housing_prepared, housing_labels)"
]
},
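The note above is only about silencing a deprecation warning; a hedged, standalone sketch on synthetic data (not `housing_prepared`) shows the pattern:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, standing in for housing_prepared / housing_labels.
rng = np.random.RandomState(42)
X = rng.rand(100, 3)
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.randn(100)

# Passing n_estimators explicitly keeps the same behavior across versions and
# silences the Scikit-Learn 0.20 warning about the default moving to 100 in 0.22.
forest_reg = RandomForestRegressor(n_estimators=10, random_state=42)
forest_reg.fit(X, y)
print(forest_reg.predict(X[:3]))
```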
@@ -2114,10 +2141,10 @@
"metadata": {},
"outputs": [],
"source": [
"param_grid = [\n",
" {'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n",
" 'feature_selection__k': list(range(1, len(feature_importances) + 1))}\n",
"]\n",
"param_grid = [{\n",
" 'preparation__num__imputer__strategy': ['mean', 'median', 'most_frequent'],\n",
" 'feature_selection__k': list(range(1, len(feature_importances) + 1))\n",
"}]\n",
"\n",
"grid_search_prep = GridSearchCV(prepare_select_and_predict_pipeline, param_grid, cv=5,\n",
" scoring='neg_mean_squared_error', verbose=2, n_jobs=4)\n",
@@ -2164,7 +2191,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.6"
},
"nav_menu": {
"height": "279px",