Improve alignment between notebook and book section headers

This commit is contained in:
Aurélien Geron
2021-10-03 23:05:49 +13:00
parent 6b821335c0
commit 3f89676892
6 changed files with 560 additions and 151 deletions

@@ -89,7 +89,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Voting classifiers"
"# Voting Classifiers"
]
},
{
@@ -103,6 +103,13 @@
"cumulative_heads_ratio = np.cumsum(coin_tosses, axis=0) / np.arange(1, 10001).reshape(-1, 1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 73. The law of large numbers:**"
]
},
{
"cell_type": "code",
"execution_count": 3,
@@ -121,6 +128,13 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use the moons dataset:"
]
},
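For context, a minimal sketch of the moons split that the later cells rely on (the constructor is elided by this hunk; `n_samples`, `noise`, and the split parameters are assumptions based on the book's usual setup):

```python
# Sketch (assumed parameters): build the moons dataset and split it.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
```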
{
"cell_type": "code",
"execution_count": 4,
@@ -141,6 +155,13 @@
"**Note**: to be future-proof, we set `solver=\"lbfgs\"`, `n_estimators=100`, and `gamma=\"scale\"` since these will be the default values in upcoming Scikit-Learn versions."
]
},
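To make the note concrete, here is a minimal sketch of a hard-voting ensemble using exactly those future-proof settings (the classifier choices are assumed to match the chapter's example):

```python
# Sketch: hard voting with the future-proof defaults named above.
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

log_clf = LogisticRegression(solver="lbfgs", random_state=42)        # future default
rnd_clf = RandomForestClassifier(n_estimators=100, random_state=42)  # future default
svm_clf = SVC(gamma="scale", random_state=42)                        # future default

voting_clf = VotingClassifier(
    estimators=[("lr", log_clf), ("rf", rnd_clf), ("svc", svm_clf)],
    voting="hard")
voting_clf.fit(X_train, y_train)
```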
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code examples:**"
]
},
{
"cell_type": "code",
"execution_count": 5,
@@ -232,7 +253,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Bagging ensembles"
"# Bagging and Pasting\n",
"## Bagging and Pasting in Scikit-Learn"
]
},
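The constructor of the bagging ensemble evaluated below is elided by this hunk; a sketch of what it plausibly looks like (`max_samples=100` is an assumption taken from the book's example):

```python
# Sketch (assumed parameters): bag 500 trees, each trained on
# 100 bootstrap-sampled instances.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, random_state=42)
bag_clf.fit(X_train, y_train)
y_pred = bag_clf.predict(X_test)
```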
{
@@ -273,6 +295,13 @@
"print(accuracy_score(y_test, y_pred_tree))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 75. A single Decision Tree (left) versus a bagging ensemble of 500 trees (right):**"
]
},
{
"cell_type": "code",
"execution_count": 13,
@@ -302,7 +331,9 @@
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"fix, axes = plt.subplots(ncols=2, figsize=(10,4), sharey=True)\n",
@@ -321,7 +352,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forests"
"## Out-of-Bag evaluation"
]
},
{
@@ -331,8 +362,10 @@
"outputs": [],
"source": [
"bag_clf = BaggingClassifier(\n",
" DecisionTreeClassifier(max_features=\"sqrt\", max_leaf_nodes=16),\n",
" n_estimators=500, random_state=42)"
" DecisionTreeClassifier(), n_estimators=500,\n",
" bootstrap=True, oob_score=True, random_state=40)\n",
"bag_clf.fit(X_train, y_train)\n",
"bag_clf.oob_score_"
]
},
{
@@ -341,13 +374,32 @@
"metadata": {},
"outputs": [],
"source": [
"bag_clf.fit(X_train, y_train)\n",
"y_pred = bag_clf.predict(X_test)"
"bag_clf.oob_decision_function_"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"y_pred = bag_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Forests"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
@@ -359,18 +411,53 @@
"y_pred_rf = rnd_clf.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A Random Forest is equivalent to a bag of decision trees:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"bag_clf = BaggingClassifier(\n",
" DecisionTreeClassifier(max_features=\"sqrt\", max_leaf_nodes=16),\n",
" n_estimators=500, random_state=42)"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"bag_clf.fit(X_train, y_train)\n",
"y_pred = bag_clf.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"np.sum(y_pred == y_pred_rf) / len(y_pred) # very similar predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature Importance"
]
},
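The training cell under this header is truncated by the hunk; a sketch of the feature-importance demo it plausibly runs (the iris dataset and `n_estimators=500` are assumptions from the book):

```python
# Sketch: a Random Forest scores each feature by how much its splits
# reduce impurity on average; the scores sum to 1.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
rnd_clf = RandomForestClassifier(n_estimators=500, random_state=42)
rnd_clf.fit(iris["data"], iris["target"])
for name, score in zip(iris["feature_names"], rnd_clf.feature_importances_):
    print(name, score)
```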
{
"cell_type": "code",
"execution_count": 19,
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
@@ -384,16 +471,23 @@
},
{
"cell_type": "code",
"execution_count": 20,
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"rnd_clf.feature_importances_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following figure overlays the decision boundaries of 15 decision trees. As you can see, even though each decision tree is imperfect, the ensemble defines a pretty good decision boundary:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
@@ -412,47 +506,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Out-of-Bag evaluation"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"bag_clf = BaggingClassifier(\n",
" DecisionTreeClassifier(), n_estimators=500,\n",
" bootstrap=True, oob_score=True, random_state=40)\n",
"bag_clf.fit(X_train, y_train)\n",
"bag_clf.oob_score_"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"bag_clf.oob_decision_function_"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import accuracy_score\n",
"y_pred = bag_clf.predict(X_test)\n",
"accuracy_score(y_test, y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Feature importance"
"**Code to generate Figure 76. MNIST pixel importance (according to a Random Forest classifier):**"
]
},
{
@@ -516,7 +570,8 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# AdaBoost"
"# Boosting\n",
"## AdaBoost"
]
},
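The `ada_clf` constructor is elided by this hunk; a sketch of the AdaBoost setup that the `plot_decision_boundary` call below assumes (hyperparameters taken from the book's example):

```python
# Sketch: AdaBoost over decision stumps (max_depth=1); each new stump
# concentrates on the instances its predecessors misclassified.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1), n_estimators=200,
    algorithm="SAMME.R", learning_rate=0.5, random_state=42)
ada_clf.fit(X_train, y_train)
```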
{
@@ -542,6 +597,13 @@
"plot_decision_boundary(ada_clf, X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 78. Decision boundaries of consecutive predictors:**"
]
},
{
"cell_type": "code",
"execution_count": 31,
@@ -583,7 +645,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Gradient Boosting"
"## Gradient Boosting"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let create a simple quadratic dataset:"
]
},
{
@@ -597,6 +666,13 @@
"y = 3*X[:, 0]**2 + 0.05 * np.random.randn(100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's train a decision tree regressor on this dataset:"
]
},
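The residual-fitting cells that follow are truncated by the hunks; a sketch of the manual gradient-boosting sequence they implement (the `X_new` query point is illustrative):

```python
# Sketch: gradient boosting by hand -- each new tree fits the residual
# errors left by the ensemble so far; predictions are summed.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

tree_reg1 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg1.fit(X, y)

y2 = y - tree_reg1.predict(X)              # residuals of tree 1
tree_reg2 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg2.fit(X, y2)

y3 = y2 - tree_reg2.predict(X)             # residuals of trees 1+2
tree_reg3 = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg3.fit(X, y3)

X_new = np.array([[0.8]])                  # illustrative query point
y_pred = sum(tree.predict(X_new)
             for tree in (tree_reg1, tree_reg2, tree_reg3))
```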
{
"cell_type": "code",
"execution_count": 33,
@@ -658,6 +734,13 @@
"y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 79. In this depiction of Gradient Boosting, the first predictor (top left) is trained normally, then each consecutive predictor (middle left and lower left) is trained on the previous predictors residuals; the right column shows the resulting ensembles predictions:**"
]
},
{
"cell_type": "code",
"execution_count": 39,
@@ -714,6 +797,13 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's try a gradient boosting regressor:"
]
},
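The constructor is elided before the `gbrt.fit(X, y)` line below; a sketch of the equivalent Scikit-Learn estimator (hyperparameters assumed to mirror the manual version above):

```python
# Sketch: GradientBoostingRegressor runs the residual-fitting loop
# internally; learning_rate scales each tree's contribution.
from sklearn.ensemble import GradientBoostingRegressor

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=3,
                                 learning_rate=1.0, random_state=42)
gbrt.fit(X, y)
```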
{
"cell_type": "code",
"execution_count": 41,
@@ -726,6 +816,13 @@
"gbrt.fit(X, y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 710. GBRT ensembles with not enough predictors (left) and too many (right):**"
]
},
{
"cell_type": "code",
"execution_count": 42,
@@ -763,7 +860,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Gradient Boosting with Early stopping"
"**Gradient Boosting with Early stopping:**"
]
},
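The cell producing `gbrt_best` is mostly elided by this hunk; a sketch of the staged-prediction early stopping it performs (the split parameters are assumptions):

```python
# Sketch: train once, measure validation MSE after every boosting stage
# with staged_predict(), then refit with the best number of trees.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=49)

gbrt = GradientBoostingRegressor(max_depth=2, n_estimators=120,
                                 random_state=42)
gbrt.fit(X_train, y_train)

errors = [mean_squared_error(y_val, y_pred)
          for y_pred in gbrt.staged_predict(X_val)]
bst_n_estimators = np.argmin(errors) + 1

gbrt_best = GradientBoostingRegressor(max_depth=2,
                                      n_estimators=bst_n_estimators,
                                      random_state=42)
gbrt_best.fit(X_train, y_train)
```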
{
@@ -789,6 +886,13 @@
"gbrt_best.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Code to generate Figure 711. Tuning the number of trees using early stopping:**"
]
},
{
"cell_type": "code",
"execution_count": 45,
@@ -827,6 +931,13 @@
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Early stopping with some patience (interrupts training only after there's no improvement for 5 epochs):"
]
},
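The implementation cell is truncated; a sketch of the warm-start loop the description implies (reuses the `X_val`/`y_val` split assumed above):

```python
# Sketch: warm_start=True keeps the existing trees when fit() is called
# again, so trees are added one at a time; stop after 5 rounds with no
# validation improvement.
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error

gbrt = GradientBoostingRegressor(max_depth=2, warm_start=True,
                                 random_state=42)

min_val_error = float("inf")
error_going_up = 0
for n_estimators in range(1, 120):
    gbrt.n_estimators = n_estimators
    gbrt.fit(X_train, y_train)
    val_error = mean_squared_error(y_val, gbrt.predict(X_val))
    if val_error < min_val_error:
        min_val_error = val_error
        error_going_up = 0
    else:
        error_going_up += 1
        if error_going_up == 5:
            break                  # early stopping with patience 5
```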
{
"cell_type": "code",
"execution_count": 47,
@@ -873,7 +984,7 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using XGBoost"
"**Using XGBoost:**"
]
},
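The XGBoost cells fall outside this hunk; a minimal sketch of the usage the header refers to (assumes the `xgboost` package is installed):

```python
# Sketch: XGBoost via its scikit-learn-style wrapper.
import xgboost

xgb_reg = xgboost.XGBRegressor(random_state=42)
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_val)
```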
{