Update to latest library versions

2026-01-14 12:14:36 +01:00 · 2020-11-21 12:22:42 +13:00
parent 1e81324573
commit f225f59780
3 changed files with 224 additions and 176 deletions
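
A note on the metric cells changed in this commit: the hard-coded counts (`4096`, `1522`, `1325`) are replaced by lookups into the confusion matrix `cm`, so the results track whatever the installed library versions produce. The following is a minimal sketch of the same arithmetic, using plain Python lists rather than the notebook's NumPy array. The helper name `binary_metrics` is hypothetical, and the true-negative count is a placeholder, since none of the three metrics uses it.

```python
def binary_metrics(cm):
    """Return (precision, recall, f1) for a 2x2 confusion matrix.

    Rows are actual classes and columns are predicted classes,
    the layout used by sklearn.metrics.confusion_matrix.
    """
    fp = cm[0][1]  # actual negative, predicted positive
    fn = cm[1][0]  # actual positive, predicted negative
    tp = cm[1][1]  # actual positive, predicted positive
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = tp / (tp + (fn + fp) / 2)
    return precision, recall, f1


# Counts taken from the removed hard-coded cells; cm[0][0] (true negatives)
# is a placeholder (assumption) because it does not enter any metric below.
cm = [[0, 1522], [1325, 4096]]
precision, recall, f1 = binary_metrics(cm)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The `f1` expression is the same one the notebook uses; it is algebraically equal to the harmonic mean of precision and recall.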
--- a/03_classification.ipynb
+++ b/03_classification.ipynb
@@ -291,7 +291,7 @@
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.base import clone\n",
    "\n",
-    "skfolds = StratifiedKFold(n_splits=3, random_state=42)\n",
+    "skfolds = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)\n",
    "\n",
    "for train_index, test_index in skfolds.split(X_train, y_train_5):\n",
    "    clone_clf = clone(sgd_clf)\n",
@@ -306,6 +306,13 @@
    "    print(n_correct / len(y_pred))"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
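+    "**Note**: `shuffle=True` was omitted by mistake in previous releases of the book. Without it, `random_state` has no effect, and recent Scikit-Learn versions raise an error when `random_state` is set while `shuffle=False`."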
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 19,
@@ -330,6 +337,17 @@
    "cross_val_score(never_5_clf, X_train, y_train_5, cv=3, scoring=\"accuracy\")"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning**: this output (and many others in this notebook and other notebooks) may differ slightly from the outputs in the book. Don't worry, that's okay! There are several reasons for this:\n",
+    "* first, Scikit-Learn and other libraries evolve, and algorithms get tweaked a bit, which may change the exact result you get. If you use the latest Scikit-Learn version (and in general, you really should), you probably won't be using the exact same version I used when I wrote the book or this notebook, hence the difference. I try to keep this notebook reasonably up to date, but I can't change the numbers on the pages in your copy of the book.\n",
+    "* second, many training algorithms are stochastic, meaning they rely on randomness. In principle, it's possible to get consistent outputs from a random number generator by setting the seed from which it generates the pseudo-random numbers (which is why you will see `random_state=42` or `np.random.seed(42)` pretty often). However, sometimes this does not suffice due to the other factors listed here.\n",
+    "* third, if the training algorithm runs across multiple threads (as do some algorithms implemented in C) or across multiple processes (e.g., when using the `n_jobs` argument), then the precise order in which operations will run is not always guaranteed, and thus the exact result may vary slightly.\n",
+    "* lastly, other things may prevent perfect reproducibility, such as Python sets (and, before Python 3.7, dicts), whose iteration order is not guaranteed to be stable across sessions, or the order of files in a directory, which is also not guaranteed."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 21,
@@ -375,11 +393,12 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 25,
+   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
-    "4096 / (4096 + 1522)"
+    "cm = confusion_matrix(y_train_5, y_train_pred)\n",
+    "cm[1, 1] / (cm[0, 1] + cm[1, 1])"
   ]
  },
  {
@@ -393,11 +412,11 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
-    "4096 / (4096 + 1325)"
+    "cm[1, 1] / (cm[1, 0] + cm[1, 1])"
   ]
  },
  {
@@ -417,7 +436,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "4096 / (4096 + (1522 + 1325) / 2)"
+    "cm[1, 1] / (cm[1, 1] + (cm[1, 0] + cm[0, 1]) / 2)"
   ]
  },
  {
@@ -462,7 +481,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 34,
+   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -472,7 +491,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 35,
+   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -483,7 +502,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 36,
+   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -514,7 +533,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 37,
+   "execution_count": 33,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -523,7 +542,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 38,
+   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -536,47 +555,20 @@
    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "plot_precision_vs_recall(precisions, recalls)\n",
-    "plt.plot([0.4368, 0.4368], [0., 0.9], \"r:\")\n",
-    "plt.plot([0.0, 0.4368], [0.9, 0.9], \"r:\")\n",
-    "plt.plot([0.4368], [0.9], \"ro\")\n",
+    "plt.plot([recall_90_precision, recall_90_precision], [0., 0.9], \"r:\")\n",
+    "plt.plot([0.0, recall_90_precision], [0.9, 0.9], \"r:\")\n",
+    "plt.plot([recall_90_precision], [0.9], \"ro\")\n",
    "save_fig(\"precision_vs_recall_plot\")\n",
    "plt.show()"
   ]
  },
-  {
-   "cell_type": "code",
-   "execution_count": 39,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 40,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "threshold_90_precision"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": 41,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "y_train_pred_90 = (y_scores >= threshold_90_precision)"
-   ]
-  },
  {
   "cell_type": "code",
   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
-    "precision_score(y_train_5, y_train_pred_90)"
+    "threshold_90_precision = thresholds[np.argmax(precisions >= 0.90)]"
   ]
  },
  {
@@ -584,6 +576,33 @@
   "execution_count": 43,
   "metadata": {},
   "outputs": [],
+   "source": [
+    "threshold_90_precision"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "y_train_pred_90 = (y_scores >= threshold_90_precision)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "precision_score(y_train_5, y_train_pred_90)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "metadata": {},
+   "outputs": [],
   "source": [
    "recall_score(y_train_5, y_train_pred_90)"
   ]
@@ -597,7 +616,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 44,
+   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -608,7 +627,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 45,
+   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -620,18 +639,19 @@
    "    plt.ylabel('True Positive Rate (Recall)', fontsize=16)    # Not shown\n",
    "    plt.grid(True)                                            # Not shown\n",
    "\n",
-    "plt.figure(figsize=(8, 6))                         # Not shown\n",
+    "plt.figure(figsize=(8, 6))                                    # Not shown\n",
    "plot_roc_curve(fpr, tpr)\n",
-    "plt.plot([4.837e-3, 4.837e-3], [0., 0.4368], \"r:\") # Not shown\n",
-    "plt.plot([0.0, 4.837e-3], [0.4368, 0.4368], \"r:\")  # Not shown\n",
-    "plt.plot([4.837e-3], [0.4368], \"ro\")               # Not shown\n",
-    "save_fig(\"roc_curve_plot\")                         # Not shown\n",
+    "fpr_90 = fpr[np.argmax(tpr >= recall_90_precision)]           # Not shown\n",
+    "plt.plot([fpr_90, fpr_90], [0., recall_90_precision], \"r:\")   # Not shown\n",
+    "plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], \"r:\")  # Not shown\n",
+    "plt.plot([fpr_90], [recall_90_precision], \"ro\")               # Not shown\n",
+    "save_fig(\"roc_curve_plot\")                                    # Not shown\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 46,
+   "execution_count": 53,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -649,7 +669,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 54,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -661,7 +681,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 48,
+   "execution_count": 55,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -671,18 +691,20 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 49,
+   "execution_count": 57,
   "metadata": {},
   "outputs": [],
   "source": [
+    "recall_for_forest = tpr_forest[np.argmax(fpr_forest >= fpr_90)]\n",
+    "\n",
    "plt.figure(figsize=(8, 6))\n",
    "plt.plot(fpr, tpr, \"b:\", linewidth=2, label=\"SGD\")\n",
    "plot_roc_curve(fpr_forest, tpr_forest, \"Random Forest\")\n",
-    "plt.plot([4.837e-3, 4.837e-3], [0., 0.4368], \"r:\")\n",
-    "plt.plot([0.0, 4.837e-3], [0.4368, 0.4368], \"r:\")\n",
-    "plt.plot([4.837e-3], [0.4368], \"ro\")\n",
-    "plt.plot([4.837e-3, 4.837e-3], [0., 0.9487], \"r:\")\n",
-    "plt.plot([4.837e-3], [0.9487], \"ro\")\n",
+    "plt.plot([fpr_90, fpr_90], [0., recall_90_precision], \"r:\")\n",
+    "plt.plot([0.0, fpr_90], [recall_90_precision, recall_90_precision], \"r:\")\n",
+    "plt.plot([fpr_90], [recall_90_precision], \"ro\")\n",
+    "plt.plot([fpr_90, fpr_90], [0., recall_for_forest], \"r:\")\n",
+    "plt.plot([fpr_90], [recall_for_forest], \"ro\")\n",
    "plt.grid(True)\n",
    "plt.legend(loc=\"lower right\", fontsize=16)\n",
    "save_fig(\"roc_curve_comparison_plot\")\n",
@@ -691,7 +713,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 50,
+   "execution_count": 58,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -700,7 +722,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 51,
+   "execution_count": 59,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -710,7 +732,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 52,
+   "execution_count": 60,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -1031,7 +1053,7 @@
   "outputs": [],
   "source": [
    "from sklearn.dummy import DummyClassifier\n",
-    "dmy_clf = DummyClassifier()\n",
+    "dmy_clf = DummyClassifier(strategy=\"prior\")\n",
    "y_probas_dmy = cross_val_predict(dmy_clf, X_train, y_train_5, cv=3, method=\"predict_proba\")\n",
    "y_scores_dmy = y_probas_dmy[:, 1]"
   ]
@@ -2127,14 +2149,14 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 142,
+   "execution_count": 185,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
-    "X = np.array(ham_emails + spam_emails)\n",
+    "X = np.array(ham_emails + spam_emails, dtype=object)\n",
    "y = np.array([0] * len(ham_emails) + [1] * len(spam_emails))\n",
    "\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)"
@@ -2488,14 +2510,14 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 158,
+   "execution_count": 183,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LogisticRegression\n",
    "from sklearn.model_selection import cross_val_score\n",
    "\n",
-    "log_clf = LogisticRegression(solver=\"lbfgs\", random_state=42)\n",
+    "log_clf = LogisticRegression(solver=\"lbfgs\", max_iter=1000, random_state=42)\n",
    "score = cross_val_score(log_clf, X_train_transformed, y_train, cv=3, verbose=3)\n",
    "score.mean()"
   ]
@@ -2504,14 +2526,14 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "Over 98.7%, not bad for a first try! :) However, remember that we are using the \"easy\" dataset. You can try with the harder datasets, the results won't be so amazing. You would have to try multiple models, select the best ones and fine-tune them using cross-validation, and so on.\n",
+    "Over 98.5%, not bad for a first try! :) However, remember that we are using the \"easy\" dataset. You can try with the harder datasets, the results won't be so amazing. You would have to try multiple models, select the best ones and fine-tune them using cross-validation, and so on.\n",
    "\n",
    "But you get the picture, so let's stop now, and just print out the precision/recall we get on the test set:"
   ]
  },
  {
   "cell_type": "code",
-   "execution_count": 159,
+   "execution_count": 184,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2519,7 +2541,7 @@
    "\n",
    "X_test_transformed = preprocess_pipeline.transform(X_test)\n",
    "\n",
-    "log_clf = LogisticRegression(solver=\"lbfgs\", random_state=42)\n",
+    "log_clf = LogisticRegression(solver=\"lbfgs\", max_iter=1000, random_state=42)\n",
    "log_clf.fit(X_train_transformed, y_train)\n",
    "\n",
    "y_pred = log_clf.predict(X_test_transformed)\n",
@@ -2552,7 +2574,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.7.6"
+   "version": "3.7.8"
  },
  "nav_menu": {},
  "toc": {