{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring Text Datasets\n",
"\n",
"Large text datasets are often difficult to grasp, because it is hard to see the big picture when reading many individual texts. In this notebook, we create an interactive visualization of paragraphs from Wikipedia articles to show that exploring text datasets this way is much more fun and gives us a better overview of the data.\n",
"The original data is from [here](https://rajpurkar.github.io/SQuAD-explorer/) and was modified for our purposes.\n",
"\n",
"Have a look at the file `articles_short.json` in the data folder. The file extension `.json` stands for _JavaScript Object Notation_ and this is a common format for exchanging data online, e.g., when using third-party API services. Conveniently, this data format can be mapped 1:1 to Python data structures (i.e., nested lists and dictionaries). In our case, the file `articles_short.json` contains the texts of 100 Wikipedia articles, which are organized in a dictionary, where the key of the dict is the title of an article and the corresponding value is a list with the individual paragraphs."
]
},
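{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "To get a feel for this 1:1 mapping between JSON and Python data structures, here is a minimal sketch that parses a small, made-up JSON string (hypothetical example data, not the contents of `articles_short.json`):"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# minimal illustration: parse a JSON string into nested Python dicts/lists\n",
  "# (hypothetical example data, not the actual dataset)\n",
  "import json\n",
  "json_text = '{\"Some Article\": [\"first paragraph\", \"second paragraph\"], \"Other Article\": [\"only paragraph\"]}'\n",
  "parsed = json.loads(json_text)\n",
  "print(type(parsed))            # a regular Python dict\n",
  "print(parsed[\"Some Article\"]) # the list of paragraphs of this (made-up) article"
 ]
},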
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import some libraries that you'll need later\n",
"import json\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from sklearn.feature_extraction.text import TfidfVectorizer\n",
"from sklearn.decomposition import KernelPCA\n",
"from sklearn.manifold import TSNE\n",
"import plotly.express as px"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Load the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# load data -> json is parsed as Python data structure\n",
"with open(\"../data/articles_short.json\") as f:\n",
" articles = json.load(f)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check the first element in the dictionary\n",
"print(\"first key:\", list(articles.keys())[0])\n",
"articles[list(articles.keys())[0]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# extract all paragraphs with a list comprehension\n",
"# for every article key \"a\" in the dictionary,\n",
"# you get the corresponding list of paragraphs with articles[a]\n",
"# then take all of these paragraphs from all articles to form a long list\n",
"# (have a look at the Python tutorial if this is new to you)\n",
"paragraphs_corpus = [p for a in articles for p in articles[a]]\n",
"print(f\"Our dataset contains {len(paragraphs_corpus)} paragraphs\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# raw data: Wikipedia article paragraphs, i.e. 1 data point = 1 paragraph\n",
"paragraphs_corpus[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Transform into Tf-Idf Features"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# transform the raw texts into features\n",
"vectorizer = TfidfVectorizer(strip_accents='unicode') # strip_accents to use ascii\n",
"# fit: check vocab of corpus (i.e. dimensionality of bag-of-words vector) & compute IDF weights\n",
"# transform: compute vector for each document (i.e., count TF and multiply by IDF)\n",
"X = vectorizer.fit_transform(paragraphs_corpus)\n",
"print(\"Dimensionality of our data:\", X.shape) # number of paragraphs x number of words"
]
},
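{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "To see more concretely what `fit_transform` does, here is a minimal sketch on a tiny, made-up toy corpus (not part of the dataset): `fit` determines the vocabulary and the IDF weights, `transform` turns each text into a TF-IDF vector."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# minimal TF-IDF illustration on a hypothetical toy corpus (not our Wikipedia data)\n",
  "toy_corpus = [\"the cat sat on the mat\", \"the dog sat on the log\", \"cats and dogs\"]\n",
  "toy_vectorizer = TfidfVectorizer()\n",
  "toy_X = toy_vectorizer.fit_transform(toy_corpus)  # fit: build vocab & IDF weights; transform: TF-IDF vectors\n",
  "print(toy_vectorizer.get_feature_names_out())     # the learned vocabulary (one feature per word)\n",
  "print(toy_X.toarray().round(2))                   # dense view is fine for such a tiny matrix"
 ]
},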
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# notice how this is not a normal numpy array, but a scipy sparse matrix\n",
"# this is a more efficient way of storing data that contains a lot of 0 values\n",
"# by only remembering the indices where the matrix contains non-zero values and what these values are\n",
"# (since each individual paragraph contains only very few unique words, this makes a lot of sense here)\n",
"# (BUT: not all of the algorithms in sklearn can directly work with this type data, e.g., t-SNE!)\n",
"X"
]
},
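{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "If you want to peek at the actual values, you can check how sparse the matrix is and convert a small slice to a regular dense array (only do this for a few rows - converting the full matrix would require a lot of memory):"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# how many entries are actually stored (non-zero)?\n",
  "print(\"stored non-zero entries:\", X.nnz)\n",
  "print(\"fraction of non-zero entries:\", X.nnz / (X.shape[0] * X.shape[1]))\n",
  "# a small slice can safely be converted to a dense numpy array for inspection\n",
  "X[:2].toarray()"
 ]
},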
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visualize dataset in 2D \n",
"\n",
"### ... with Kernel PCA\n",
"\n",
"As you've seen above, our feature matrix $X$ contains many more features (i.e., number of unique words in our corpus, 34258) than data points (i.e., paragraphs, 4492). If we were to compute regular PCA, this would mean we need to compute the eigendecomposition of a $34258 \\times 34258$ covariance matrix - you don't want to do this on your laptop! Instead, we can use Kernel PCA, which gives us the same result (if we specify `kernel='linear'`), but computes the eigendecomposition of the similarity matrix, which is only $4492 \\times 4492$."
]
},
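{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "As a quick sanity check of this claim (on small synthetic data, not our corpus): regular PCA and `KernelPCA(kernel='linear')` yield the same projections, up to the sign of each component."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# compare PCA and linear Kernel PCA on a small random matrix (synthetic data, purely illustrative)\n",
  "from sklearn.decomposition import PCA\n",
  "rng = np.random.default_rng(0)\n",
  "X_toy = rng.normal(size=(50, 10))\n",
  "proj_pca = PCA(n_components=3).fit_transform(X_toy)\n",
  "proj_kpca = KernelPCA(n_components=3, kernel='linear').fit_transform(X_toy)\n",
  "# the projections match up to a sign flip per component\n",
  "print(np.allclose(np.abs(proj_pca), np.abs(proj_kpca)))"
 ]
},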
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# reduce dimensionality with linear kPCA\n",
"# since TF-IDF vectors are length (L2) normalized, the linear kernel = cosine similaritiy\n",
"# --> we use 100 components since we feed the reduced data to t-SNE later (-> not sparse)!\n",
"kpca = KernelPCA(n_components=100, kernel='linear')\n",
"X_kpca = kpca.fit_transform(X)\n",
"print(\"Dimensionality of our data:\", X_kpca.shape)"
]
},
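{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "A quick check of the comment above: by default, `TfidfVectorizer` normalizes each row vector to unit L2 length, so the dot product (linear kernel) between two paragraph vectors equals their cosine similarity."
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# verify that each TF-IDF row vector has (approximately) unit L2 norm\n",
  "row_norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1)).ravel())\n",
  "print(np.allclose(row_norms, 1.0))  # should print True (unless some paragraph produced an all-zero vector)"
 ]
},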
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot 2D PCA visualization\n",
"# the components are ordered by their eigenvalue (largest first), i.e.,\n",
"# by taking the first 2 this is the same as if we had computed PCA with n_components=2\n",
"plt.figure()\n",
"plt.scatter(X_kpca[:, 0], X_kpca[:, 1], s=2) # s: size of the dots\n",
"plt.title(\"PCA embedding of paragraphs\");\n",
"# each dot is one paragraph, represented in 2D"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate color codes for the plot based on the article titles (-> key in the dict)\n",
"paragraphs_label = [a for a in articles for p in articles[a]] # article title for each paragraph\n",
"print(len(paragraphs_label)) # same as len(paragraphs_corpus)\n",
"print(paragraphs_label[:3])\n",
"# map the list of strings to numbers (which we can then use in plt.scatter())\n",
"p_labels_num = LabelEncoder().fit_transform(paragraphs_label)\n",
"print(len(p_labels_num))\n",
"print(p_labels_num[:3])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# same plot as above but with colors\n",
"plt.figure()\n",
"plt.scatter(X_kpca[:, 0], X_kpca[:, 1], c=p_labels_num, s=2) # c: list/array of same length as x/y\n",
"plt.title(\"PCA embedding of paragraphs\");\n",
"# -> paragraph-dots with the same color belong to the same article"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# interactive plot with plotly (make sure you're not using Internet Explorer but a modern browser!)\n",
"# generate the tooltip text: split texts into lines\n",
"hover_texts = [\"\" + paragraphs_label[i] + \"
\" + \"
\".join([\" \".join(p.split()[i:min(i+7, len(p.split()))]) for i in range(0, len(p.split()), 7)]) for i, p in enumerate(paragraphs_corpus)]\n",
"# create interactive plot and display\n",
"fig = px.scatter(x=X_kpca[:, 0], y=X_kpca[:, 1], color=p_labels_num, hover_name=hover_texts)\n",
"fig.update_traces(hovertemplate='%{hovertext}') # only show our text, no additional info\n",
"# move your mouse over the dots to see what paragraphs are behind them (first line in bold is the article title)\n",
"fig"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, the two dots on the top and bottom right are very different from the rest of the paragraphs, i.e., they could be considered outliers and strongly influence the first two components (these are actually not two individual dots, but many on top of each other as all articles start and end with these lines).\n",
"\n",
"This means the first principle component here captured whether the paragraph consisted of only `DOCUMENT` plus one additional word, while the second component captured whether this additional word was `BEGIN` or `END`. (The other dimensions then contain additional variance introduced by the fact that the dataset includes paragraphs about different topics.) \n",
"\n",
"This plot therefore tells us that we should probably clean up our dataset a bit by removing these beginning and end phrases before doing any other analysis on this dataset. But before we do that, let's look at the eigenvalue spectrum of the dataset:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(kpca.eigenvalues_[:10])\n",
"# check how much \"information\" we would keep if we were to reduce the dimensionality to 20\n",
"# (this is not 100% accurate, since we only computed the first 100 kPCA components, i.e.,\n",
"# normally kpca.eigenvalues_ should contain all eigenvalues - but this should be close enough)\n",
"print(\"Percentage of variance retained with 20 components:\", 100*(sum(kpca.eigenvalues_[:20])/sum(kpca.eigenvalues_)))\n",
"# plot eigenvalue spectrum\n",
"plt.figure()\n",
"plt.plot(range(1, len(kpca.eigenvalues_)+1), kpca.eigenvalues_)\n",
"plt.xlabel(\"PCs\")\n",
"plt.ylabel(\"Eigenvalue\");\n",
"# observe how the first value is extremely large"
]
},
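{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Before removing them in Task 1, we can take a quick look at the suspected outlier \"paragraphs\", i.e., the first and last entry of an article's paragraph list:"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# look at the first and last \"paragraph\" of one article (the suspected outliers)\n",
  "some_title = list(articles.keys())[0]\n",
  "print(some_title)\n",
  "print(articles[some_title][0])\n",
  "print(articles[some_title][-1])"
 ]
},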
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 1: remove outliers and compute Kernel PCA again\n",
"\n",
"1. Remove the `BEGIN DOCUMENT` and `END DOCUMENT` \"paragraphs\" from the dataset, i.e., the first and last elements of the list of paragraphs for each article. You can accomplish this by indexing the list of paragraphs of an article with `[1:-1]` to take only the second until the second-to-last elements.\n",
"2. Transform this new list of paragraphs into TF-IDF vectors again\n",
"3. Compute KernelPCA like before and plot the scatter plot (with colors) again\n",
"4. Look at the eigenvalue spectrum again - what do you observe?"
]
},
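{
 "cell_type": "markdown",
 "metadata": {},
 "source": [
  "Here is a tiny illustration of the `[1:-1]` slicing from step 1, on a hypothetical toy list (not the actual data):"
 ]
},
{
 "cell_type": "code",
 "execution_count": null,
 "metadata": {},
 "outputs": [],
 "source": [
  "# [1:-1] drops the first and the last element of a list (hypothetical toy example)\n",
  "toy_list = [\"BEGIN DOCUMENT\", \"first real paragraph\", \"second real paragraph\", \"END DOCUMENT\"]\n",
  "print(toy_list[1:-1])"
 ]
},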
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# remove outliers (i.e., first and last \"paragraph\" for each article)\n",
"paragraphs_corpus = ..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# transform list of paragraphs into TF-IDF vectors again\n",
"...\n",
"print(\"Dimensionality of our data:\", X.shape) # (4292, 34258) -> compare to the original size of X"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# compute kPCA again (with same parameter settings as before)\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# scatter plot without outliers (with color! but remember, \n",
"# the dimensionality of the color vector needs to match the x/y coordinates)\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# check eigenvalue spectrum of kPCA again\n",
"..."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Task 2: Visualize dataset with t-SNE\n",
"\n",
"1. After you've computed your new kPCA embedding (without outliers), use the code below to compute a t-SNE embedding\n",
"2. Then create a regular (matplotlib) and an interactive (plotly) scatter plot of the results again and explore\n",
"\n",
"Notice how the paragraphs form localized clusters (while remembering that this is not a clustering algorithm, but gives us 2D coordinates, not a cluster index, for each data point ;-)). If the task was now to classify the paragraphs (e.g., identify the correct article title for each paragraph), you could see for which articles this would be easy, and where there is overlap between the content of other articles (and you can see how these \"mistakes\", i.e., where a paragraph is located near the paragraphs of another article, are quite understandable, i.e., a human might have made some of these mistakes as well)."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# use 100D kPCA embedding, since t-SNE can't handle sparse matrices\n",
"# (we use the \"cosine\" metric here since this works well for text,\n",
"# for other data you can leave this argument at its default value)\n",
"tsne = TSNE(metric=\"cosine\", verbose=2, random_state=42)\n",
"X_tsne = tsne.fit_transform(X_kpca)\n",
"print(\"Dimensionality of our data:\", X_tsne.shape)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# plot 2D t-SNE visualization with matplotlib (with colors!)\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# new hover_texts since you have less paragraphs\n",
"hover_texts = [\"\" + paragraphs_label[i] + \"
\" + \"
\".join([\" \".join(p.split()[i:min(i+7, len(p.split()))]) for i in range(0, len(p.split()), 7)]) for i, p in enumerate(paragraphs_corpus)]\n",
"# create interactive plot and display\n",
"..."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}