ml_exercises/notebooks/7_information_retrieval.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Information Retrieval / NLP example\n",
    "\n",
    "**Idea:** Respond more quickly to customer service requests by using Natural Language Processing (NLP) and Information Retrieval to automatically suggest one or several FAQ articles or response templates given an incoming customer email to speed up the process of drafting a response.\n",
    "\n",
    "Since personal emails are private, unfortunately there are no public datasets available with customer requests, so we instead test the methodology on a [question answering dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset contains 477 wikipedia articles, each split into paragraphs, with several questions associated with each paragraph (i.e., where the answer to the question can be found in the paragraph).\n",
    "\n",
    "This means your task is, given a question, to identify the correct paragraph that contains the answer to this question."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import json\n",
    "import numpy as np\n",
    "from sklearn.feature_extraction.text import TfidfVectorizer\n",
    "from sklearn.neighbors import NearestNeighbors"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the dataset\n",
    "\n",
    "The original data is again in a JSON format, just a bit more nested than in the 1st notebook (visualize_text), since here we have a list of questions associated with each paragraph. The outer structure is again a dictionary, where the article titles are the keys."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data & parse it as a Python data structure\n",
    "with open(\"../data/articles.json\") as f:\n",
    "    articles = json.load(f)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check out what an article looks like by looking at the entry associated with the first article title\n",
    "# (don't look at all articles at once, this will probably crash your notebook!)\n",
    "# we again have a list with all paragraphs of the article, only that here each paragraph is also a dict\n",
    "# with entries for the paragraph text and a list of associated questions\n",
    "articles[list(articles.keys())[0]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the total number of paragraphs\n",
    "n_paragraphs = [len(articles[a]) for a in articles]\n",
    "print(f\"{len(articles)} articles with alltogether {sum(n_paragraphs)} paragraphs; on average {np.mean(n_paragraphs):.1f} paragraphs per article\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# take only a subset of the articles to speed up the computation\n",
    "subset = sorted(articles.keys())\n",
    "np.random.seed(25)\n",
    "subset = np.random.permutation(subset)\n",
    "articles = {a: articles[a] for a in subset[:100]}  # same dict structure as before, just fewer articles"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Assign Questions to Paragraphs\n",
    "\n",
    "We first transform both the paragraphs, as well as all questions associated with the paragraphs, into TF-IDF features and then identify the most similar paragraph for a given question by computing the cosine similarity of the TF-IDF vector for the question to the TF-IDF vectors of all paragraphs to identify the most similar paragraphs, which we then return as the search results. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# get a list of texts and \"labels\" for the paragraphs\n",
    "paragraphs_corpus = [p[\"paragraph\"] for a in articles for p in articles[a]]\n",
    "# for each paragraph and question we note the title of the corresponding article and the number of the paragraph,\n",
    "# which will later in the evaluation help us to see if the returned paragraph is correct for the given question\n",
    "paragraphs_label = [f\"{a} {i}\" for a in articles for i, p in enumerate(articles[a])]\n",
    "# list of questions - note the additional level in the list comprehension to go through all questions of a paragraph\n",
    "questions_corpus = [q for a in articles for p in articles[a] for q in p[\"questions\"]]\n",
    "questions_label = [f\"{a} {i}\" for a in articles for i, p in enumerate(articles[a]) for q in p[\"questions\"]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform both paragraphs and questions into TF-IDF features\n",
    "vectorizer = TfidfVectorizer(strip_accents='unicode')\n",
    "# learn internal parameters of vectorizer (vocabulary, IDF weights) from known data\n",
    "vectorizer.fit(paragraphs_corpus)\n",
    "# transform both datasets with the same vectorizer so they have the same feature dimensions\n",
    "X_pars = vectorizer.transform(paragraphs_corpus)\n",
    "# important to only call transform here, so you get the same vector space for both paragraphs and questions\n",
    "X_ques = vectorizer.transform(questions_corpus)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Question:** What would the TF-IDF vector look like, if a question contained only words that did not occur in any of the paragraphs?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check the dimensions of the feature matrices\n",
    "print(X_pars.shape)  # number of paragraphs x bag-of-words vocabulary\n",
    "print(X_ques.shape)  # number of questions x bag-of-words vocabulary"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# initialize the nearest neighbors search tree:\n",
    "# return 10 search results for every query based on the cosine similarity\n",
    "nn = NearestNeighbors(n_neighbors=10, metric='cosine')\n",
    "# when calling fit, it builds the search tree to efficiently find the closest paragraphs\n",
    "nn.fit(X_pars)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# pick a question\n",
    "questions_corpus[888]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# query the nearest neighbors search with the corresponding vector to get the indices of the answer\n",
    "nn.kneighbors(X_ques[888])\n",
    "# the result is a tuple with distances and indices (since the nearest neighbors search internally\n",
    "# works with distances instead of similarities, the first result has the smallest distance, which\n",
    "# is 1-cosine similarity, i.e., it has the highest similarity)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check what the answer text is for the first result\n",
    "paragraphs_corpus[65]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# what was really the correct paragraph?\n",
    "questions_label[888]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# corresponding paragraph of the answer\n",
    "paragraphs_label[65]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# but the 3rd result would have been correct\n",
    "paragraphs_label[80]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# ask your own question!\n",
    "q = \"Your question here\"\n",
    "# transform the text into a vector (need to pass it as a list to the vectorizer)\n",
    "x = vectorizer.transform([q])\n",
    "# query the nearest neighbors search with this new vector\n",
    "nn.kneighbors(x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# check what paragraph is behind the index for the best result\n",
    "paragraphs_corpus[...]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Systematic performance analysis with the hits@k metric\n",
    "\n",
    "While above we just collected some anecdotal evidence for the performance of our method, of course before deploying this into production we should conduct a more systematic evaluation. For this, we use the *hits@k* metric, which, for different *k*, checks whether the correct answer was within the first *k* search results. E.g. in a Google search, is the website you're looking for the 1st result, then this would be a hit@1, or is it on the first page, then it would still count towards the hits@10.\n",
    "\n",
    "In our example, we check both the hits@k for the paragraphs, as well as the articles, i.e., we check for every paragraph that was returned as a search result, whether that was actually the correct paragraph, or whether it at least came from the right article."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# you do not need to understand in detail what happens here,\n",
    "# just execute and look at the final results\n",
    "print(\"computing nearest neighbors...\", end='\\r')\n",
    "nn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(X_pars)\n",
    "nn_results = nn.kneighbors(X_ques, return_distance=False)\n",
    "print(\"computing nearest neighbors... done!\")\n",
    "# score as: right article in top 10; right paragraph in top 10\n",
    "article_hits = [[] for i in range(10)]\n",
    "paragraph_hits = [[] for i in range(10)]\n",
    "for i in range(X_ques.shape[0]):\n",
    "    t_label = questions_label[i]\n",
    "    labels = [paragraphs_label[j] for j in nn_results[i]]\n",
    "    for k in range(10):\n",
    "        if t_label in labels[:k+1]:\n",
    "            paragraph_hits[k].append(1)\n",
    "        else:\n",
    "            paragraph_hits[k].append(0)\n",
    "    labels = [l.split()[0] for l in labels]\n",
    "    t_label = t_label.split()[0]\n",
    "    for k in range(10):\n",
    "        if t_label in labels[:k+1]:\n",
    "            article_hits[k].append(1)\n",
    "        else:\n",
    "            article_hits[k].append(0)\n",
    "for i in [0, 1, 2, 3, 4, 9]:\n",
    "    print(f\"Article Hits @ {i+1:2}: {100*np.mean(article_hits[i]):.1f}\")\n",
    "for i in [0, 1, 2, 3, 4, 9]:\n",
    "    print(f\"Paragraph Hits @ {i+1:2}: {100*np.mean(paragraph_hits[i]):.1f}\")\n",
    "# --> if we show 5 paragraphs, in almost 80% of the cases the correct paragraph is among them;\n",
    "#     if we return only 3 results, in over 90% at least the correct article is identified"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Exercises\n",
    "\n",
    "For these exercises, please work on the complete set of articles, not the subset we used for now, i.e., load the data again. To construct a single text from all paragraphs of an article, we join them together with `\"\\n\".join(list_of_paragraphs)` (i.e., one data point is now not one paragraph, but one artice).\n",
    "\n",
    "### 1. Find the article that is most similar to the article about \"Beer\"\n",
    "Our dataset contains an article with the title \"Beer\". Your task is to identify another article in the dataset that is the most similar to this article about beer.\n",
    "\n",
    "To do this, compute the cosine similaritiy between the target article with the title \"Beer\" and all other articles and choose the most similar one (not counting the target article itself ;-)). Similar to how we selected a matching paragraph for each question, this can be done with the `NearestNeighbors` class from `sklearn`.\n",
    "\n",
    "**Important:** Don't search for the article closest to the word \"beer\", but to the *whole article with the title \"Beer\"*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# load the data again\n",
    "with open(\"../data/articles.json\") as f:\n",
    "    articles = json.load(f)\n",
    "\n",
    "# get a list of all article titles\n",
    "article_ids = sorted(articles.keys())\n",
    "\n",
    "# get the corresponding texts of these articles by concatenating all the paragraphs of each article.\n",
    "# the article texts in this list are in the same order as the article titles in article_ids.\n",
    "article_corpus = [\"\\n\".join([p[\"paragraph\"] for p in articles[a]]) for a in article_ids]\n",
    "print(len(article_corpus))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# transform the article texts into a TF-IDF feature matrix X\n",
    "X = ...\n",
    "\n",
    "# when you're done, check how many articles and feature dimensions were generated\n",
    "print(X.shape)  # should be (477, 80732)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find out the index of our target article \"Beer\" (remember the titles are in article_ids), \n",
    "# so you know which row in the feature matrix X contains the corresponding TF-IDF vector\n",
    "# (so that you can use this vector to get the nearest neighbors)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# find the most similar article with NearestNeighbors\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# based on the index of the most similar article, get the title of this article\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Advanced Exercises\n",
    "\n",
    "### 2. Find the most similar article to \"Beer\" - without using `NearestNeighbors`!\n",
    "Solve the above task without using the `NearestNeighbors` class, i.e., by computing the similarities and selecting the most similar article yourself. \n",
    "\n",
    "Since the TF-IDF feature vectors are by default length-normalized, the cosine similarity between the articles can be computed with a simple dot-product (but beware: X is by default a sparse matrix, i.e., `np.dot()` wont work, but you can also call `.dot()` on an array or sparse matrix directly). The result of the dot-product will still be sparse, but you can call `.toarray()` on a sparse matrix to convert it into a regular dense numpy array with which you can work as usual. \n",
    "\n",
    "Once you have a vector with similarities, the functions `np.argmax` or `np.argsort` might be helpful."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3. Find the two most similar articles in the set\n",
    "From all articles, which two have the most similar text? What is their cosine similarity score?\n",
    "\n",
    "**Question:** Do you have an idea *why* these two articles might have been identified as similar?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.2"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}