further explanations in the notebooks

Author: franzi
Date: 2021-09-25 15:28:19 +02:00
parent f6017bc1ed
commit 44e88622a1
11 changed files with 236 additions and 108 deletions


@@ -29,10 +29,7 @@
"import plotly.express as px\n",
"# suppress unnecessary warnings\n",
"import warnings\n",
"warnings.simplefilter(action='ignore', category=FutureWarning)\n",
"\n",
"%load_ext autoreload\n",
"%autoreload 2"
"warnings.simplefilter(action='ignore', category=FutureWarning)"
]
},
{
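For reference, a standalone sketch of how the remaining warnings filter behaves (this commit drops the `%load_ext autoreload` / `%autoreload 2` magics and keeps only the FutureWarning suppression):

    import warnings

    # silence FutureWarnings (e.g. deprecation notices from pandas/sklearn),
    # but leave all other warning categories untouched
    warnings.simplefilter(action='ignore', category=FutureWarning)

    warnings.warn("hidden", FutureWarning)       # suppressed
    warnings.warn("still visible", UserWarning)  # printed as usual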
@@ -60,6 +57,7 @@
"outputs": [],
"source": [
"# check the first element in the dictionary\n",
"print(\"first key:\", list(articles.keys())[0])\n",
"articles[list(articles.keys())[0]]"
]
},
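A minimal sketch of the dictionary access used in the cell above; the contents of `articles` here are made-up placeholders (dicts preserve insertion order since Python 3.7, so "first" is well-defined):

    articles = {"Machine Learning": "first paragraph ...",
                "Statistics": "another paragraph ..."}

    first_key = list(articles.keys())[0]  # as in the cell above
    # equivalent, but without building the full key list:
    first_key = next(iter(articles))
    print("first key:", first_key)
    print(articles[first_key])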
@@ -115,7 +113,7 @@
"# this is a more efficient way of storing data that contains a lot of 0 values\n",
"# by only remembering the indices where the matrix contains non-zero values and what these values are\n",
"# (since each individual paragraph contains only very few unique words, this makes a lot of sense here)\n",
"# (BUT: not all of the algorithms in sklearn can directly work with this type data, e.g. t-SNE!)\n",
"# (BUT: not all of the algorithms in sklearn can directly work with this type data, e.g., t-SNE!)\n",
"X"
]
},
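A small scipy sketch of the sparse-storage idea described in the comment above (compressed sparse row format, which is what sklearn's text vectorizers return); the toy matrix is a placeholder:

    import numpy as np
    from scipy.sparse import csr_matrix

    dense = np.array([[0., 2., 0., 0.],
                      [0., 0., 0., 1.],
                      [0., 0., 0., 0.]])
    sparse = csr_matrix(dense)
    print(sparse.data)       # [2. 1.] -> only the non-zero values are stored ...
    print(sparse.indices)    # [1 3]   -> ... together with where they sit
    print(sparse.toarray())  # densify for algorithms that need it (e.g. t-SNE)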
@@ -137,8 +135,8 @@
"outputs": [],
"source": [
"# reduce dimensionality with linear kPCA\n",
"# since tf-idf vectors are l2 normalized, the linear kernel = cosine similaritiy\n",
"# --> we use 100 components since we feed the reduced data to t-SNE later!\n",
"# since TF-IDF vectors are length (L2) normalized, the linear kernel = cosine similaritiy\n",
"# --> we use 100 components since we feed the reduced data to t-SNE later (-> not sparse)!\n",
"kpca = KernelPCA(n_components=100, kernel='linear')\n",
"X_kpca = kpca.fit_transform(X)\n",
"print(\"Dimensionality of our data:\", X_kpca.shape)"
@@ -152,7 +150,7 @@
"source": [
"# plot 2D PCA visualization\n",
"# the components are ordered by their eigenvalue (largest first), i.e.,\n",
"# by taking the first 2 this is the same as if we had compute PCA with n_components=2\n",
"# by taking the first 2 this is the same as if we had computed PCA with n_components=2\n",
"plt.figure()\n",
"plt.scatter(X_kpca[:, 0], X_kpca[:, 1], s=2) # s: size of the dots\n",
"plt.title(\"PCA embedding of paragraphs\");\n",
@@ -304,7 +302,7 @@
"1. After you've computed your new kPCA embedding (without outliers), use the code below to compute a t-SNE embedding\n",
"2. Then create a regular (matplotlib) and an interactive (plotly) scatter plot of the results again and explore\n",
"\n",
"Notice how the paragraphs form localized clusters (while remembering that this is not a clustering algorithm, but gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task was now to classify the paragraphs (e.g. identify the correct article title for each paragraph), you could see for which articles this would be easy, and where there is overlap between the content of other articles (and you can see how these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of these mistakes as well)."
"Notice how the paragraphs form localized clusters (while remembering that this is not a clustering algorithm, but gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task was now to classify the paragraphs (e.g., identify the correct article title for each paragraph), you could see for which articles this would be easy, and where there is overlap between the content of other articles (and you can see how these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of these mistakes as well)."
]
},
{
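A hedged sketch of the two plots the exercise asks for; `X_tsne` (the 2D t-SNE coordinates) and `labels` (the article title of each paragraph) are placeholder names for whatever variables you ended up with:

    import matplotlib.pyplot as plt
    import plotly.express as px

    # static matplotlib version
    plt.figure()
    plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=2)
    plt.title("t-SNE embedding of paragraphs")

    # interactive plotly version; hover over a point to see its article
    fig = px.scatter(x=X_tsne[:, 0], y=X_tsne[:, 1], color=labels,
                     hover_name=labels, title="t-SNE embedding of paragraphs")
    fig.show()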
@@ -314,7 +312,9 @@
"outputs": [],
"source": [
"# use 100D kPCA embedding, since t-SNE can't handle sparse matrices\n",
"tsne = TSNE(metric='cosine', verbose=2, random_state=42)\n",
"# (we use the \"cosine\" metric here since this works well for text,\n",
"# for other data you can leave this argument at its default value)\n",
"tsne = TSNE(metric=\"cosine\", verbose=2, random_state=42)\n",
"X_tsne = tsne.fit_transform(X_kpca)\n",
"print(\"Dimensionality of our data:\", X_tsne.shape)"
]
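Putting the pieces of this notebook together, an end-to-end sketch on a placeholder corpus (TF-IDF -> sparse matrix -> dense 100D linear kPCA -> 2D t-SNE):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import KernelPCA
    from sklearn.manifold import TSNE

    # placeholder corpus instead of the article paragraphs
    texts = [f"placeholder paragraph {i} about some topic" for i in range(300)]
    X = TfidfVectorizer().fit_transform(texts)  # sparse tf-idf matrix
    X_kpca = KernelPCA(n_components=100, kernel='linear').fit_transform(X)
    X_tsne = TSNE(metric='cosine', random_state=42).fit_transform(X_kpca)
    print("Dimensionality of our data:", X_tsne.shape)  # (300, 2)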