Merge branch 'master' into fix-chapter3-header

2026-02-02 21:17:49 +01:00 · 2021-03-02 10:33:10 +13:00
parent a86b2f657f e90c974205
commit 1238c1f698
32 changed files with 1590 additions and 1177 deletions
--- a/03_classification.ipynb
+++ b/03_classification.ipynb
@@ -84,6 +84,13 @@
    "# MNIST"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning**: since Scikit-Learn 0.24, `fetch_openml()` returns a Pandas `DataFrame` by default. To avoid this and keep the same code as in the book, we use `as_frame=False`."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 2,
@@ -91,7 +98,7 @@
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_openml\n",
-    "mnist = fetch_openml('mnist_784', version=1)\n",
+    "mnist = fetch_openml('mnist_784', version=1, as_frame=False)\n",
    "mnist.keys()"
   ]
  },
@@ -345,7 +352,7 @@
    "* first, Scikit-Learn and other libraries evolve, and algorithms get tweaked a bit, which may change the exact result you get. If you use the latest Scikit-Learn version (and in general, you really should), you probably won't be using the exact same version I used when I wrote the book or this notebook, hence the difference. I try to keep this notebook reasonably up to date, but I can't change the numbers on the pages in your copy of the book.\n",
    "* second, many training algorithms are stochastic, meaning they rely on randomness. In principle, it's possible to get consistent outputs from a random number generator by setting the seed from which it generates the pseudo-random numbers (which is why you will see `random_state=42` or `np.random.seed(42)` pretty often). However, sometimes this does not suffice due to the other factors listed here.\n",
    "* third, if the training algorithm runs across multiple threads (as do some algorithms implemented in C) or across multiple processes (e.g., when using the `n_jobs` argument), then the precise order in which operations will run is not always guaranteed, and thus the exact result may vary slightly.\n",
-    "* lastly, other things may prevent perfect reproducibility, such as Python maps and sets whose order is not guaranteed to be stable across sessions, or the order of files in a directory which is also not guaranteed."
+    "* lastly, other things may prevent perfect reproducibility, such as Python dicts and sets whose order is not guaranteed to be stable across sessions, or the order of files in a directory which is also not guaranteed."
   ]
  },
  {
@@ -393,7 +400,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 27,
+   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -412,7 +419,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 28,
+   "execution_count": 27,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -481,7 +488,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 30,
+   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -491,7 +498,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 31,
+   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -502,7 +509,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 32,
+   "execution_count": 36,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -533,7 +540,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 33,
+   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -542,7 +549,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 35,
+   "execution_count": 38,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -564,7 +571,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 42,
+   "execution_count": 39,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -573,7 +580,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 43,
+   "execution_count": 40,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -582,7 +589,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 44,
+   "execution_count": 41,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -591,7 +598,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 45,
+   "execution_count": 42,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -600,7 +607,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 46,
+   "execution_count": 43,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -616,7 +623,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 47,
+   "execution_count": 44,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -627,7 +634,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 50,
+   "execution_count": 45,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -651,7 +658,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 53,
+   "execution_count": 46,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -669,7 +676,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 54,
+   "execution_count": 47,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -681,7 +688,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 55,
+   "execution_count": 48,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -691,7 +698,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 57,
+   "execution_count": 49,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -713,7 +720,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 58,
+   "execution_count": 50,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -722,7 +729,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 59,
+   "execution_count": 51,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -732,7 +739,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 60,
+   "execution_count": 52,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -836,6 +843,13 @@
    "sgd_clf.decision_function([some_digit])"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning**: the following two cells may take close to 30 minutes to run, or more, depending on your hardware."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 62,
@@ -1209,7 +1223,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "**Warning**: the next cell may take hours to run, depending on your hardware."
+    "**Warning**: the next cell may take close to 16 hours to run, or more, depending on your hardware."
   ]
  },
  {
@@ -1355,6 +1369,13 @@
    "knn_clf.fit(X_train_augmented, y_train_augmented)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Warning**: the following cell may take close to an hour to run, depending on your hardware."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": 99,
@@ -1925,7 +1946,7 @@
   "source": [
    "import os\n",
    "import tarfile\n",
-    "import urllib\n",
+    "import urllib.request\n",
    "\n",
    "DOWNLOAD_ROOT = \"http://spamassassin.apache.org/old/publiccorpus/\"\n",
    "HAM_URL = DOWNLOAD_ROOT + \"20030228_easy_ham.tar.bz2\"\n",
@@ -2156,7 +2177,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 185,
+   "execution_count": 142,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2517,7 +2538,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 183,
+   "execution_count": 158,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2540,7 +2561,7 @@
  },
  {
   "cell_type": "code",
-   "execution_count": 184,
+   "execution_count": 159,
   "metadata": {},
   "outputs": [],
   "source": [
@@ -2581,7 +2602,7 @@
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
-   "version": "3.8.2"
+   "version": "3.7.9"
  },
  "nav_menu": {},
  "toc": {