# Information Retrieval / NLP example

**Idea:** Respond more quickly to customer service requests by using Natural Language Processing (NLP) and Information Retrieval to automatically suggest one or several FAQ articles or response templates given an incoming customer email to speed up the process of drafting a response.

Since personal emails are private, there are unfortunately no public datasets with customer requests, so we instead test the methodology on a [question answering dataset](https://rajpurkar.github.io/SQuAD-explorer/). This dataset contains 477 Wikipedia articles, each split into paragraphs, with several questions associated with each paragraph (i.e., the answer to each question can be found in that paragraph).

This means your task is, given a question, to identify the correct paragraph that contains the answer to this question.

In [None]:
import json
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

## Load the dataset

The original data is again in a JSON format, just a bit more nested than in the 1st notebook (visualize_text), since here we have a list of questions associated with each paragraph. The outer structure is again a dictionary, where the article titles are the keys.
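
To make the nesting concrete, here is a minimal sketch with a made-up miniature dataset. The titles and texts are hypothetical; only the dictionary keys `"paragraph"` and `"questions"` are taken from the cells below.

```python
# hypothetical miniature dataset with the same nesting as described above
toy_articles = {
    "Beer": [
        {"paragraph": "Beer is brewed from cereal grains.",
         "questions": ["What is beer brewed from?"]},
        {"paragraph": "Hops add bitterness to beer.",
         "questions": ["What do hops add?", "Which ingredient adds bitterness?"]},
    ],
    "Piano": [
        {"paragraph": "The piano was invented in Italy.",
         "questions": ["Where was the piano invented?"]},
    ],
}

# walk through the three levels: article title -> paragraph dict -> questions
for title, paragraphs in toy_articles.items():
    for i, p in enumerate(paragraphs):
        print(f"{title} {i}: {len(p['questions'])} question(s)")

# total number of questions across all paragraphs of all articles
toy_n_questions = sum(len(p["questions"]) for ps in toy_articles.values() for p in ps)
print(toy_n_questions)
```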

In [None]:
# load the data & parse it as a Python data structure
with open("../data/articles.json") as f:
    articles = json.load(f)

In [None]:
# check out what an article looks like by looking at the entry associated with the first article title
# (don't look at all articles at once, this will probably crash your notebook!)
# we again have a list with all paragraphs of the article, only that here each paragraph is also a dict
# with entries for the paragraph text and a list of associated questions
articles[list(articles.keys())[0]]

In [None]:
# check the total number of paragraphs
n_paragraphs = [len(articles[a]) for a in articles]
print(f"{len(articles)} articles with altogether {sum(n_paragraphs)} paragraphs; on average {np.mean(n_paragraphs):.1f} paragraphs per article")

In [None]:
# take only a subset of the articles to speed up the computation
subset = sorted(articles.keys())
np.random.seed(25)
subset = np.random.permutation(subset)
articles = {a: articles[a] for a in subset[:100]}  # same dict structure as before, just fewer articles

## Assign Questions to Paragraphs

We first transform the paragraphs and all questions associated with them into TF-IDF features. For a given question, we then compute the cosine similarity between its TF-IDF vector and the TF-IDF vectors of all paragraphs and return the most similar paragraphs as the search results.
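
As a rough illustration of this idea, the following sketch ranks three made-up paragraphs for one made-up question. All `toy_*` names and texts are hypothetical, and `cosine_similarity` is used here only to show the principle; it is not necessarily how the search is implemented in the rest of this notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical toy data, not part of the dataset
toy_paragraphs = [
    "Beer is brewed from malted barley, hops, yeast and water.",
    "The piano is a keyboard instrument with hammers that strike strings.",
    "Photosynthesis converts light energy into chemical energy in plants.",
]
toy_question = "Which instrument has hammers that strike strings?"

toy_vectorizer = TfidfVectorizer()
# vocabulary and IDF weights are learned from the paragraphs only
toy_P = toy_vectorizer.fit_transform(toy_paragraphs)
# the question is mapped into the same vector space
toy_q = toy_vectorizer.transform([toy_question])

# one similarity score per paragraph; the highest score is the best match
toy_sims = cosine_similarity(toy_q, toy_P).ravel()
toy_best = int(np.argmax(toy_sims))
print(toy_sims.round(2), "->", toy_paragraphs[toy_best])
```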

In [None]:
# get a list of texts and "labels" for the paragraphs
paragraphs_corpus = [p["paragraph"] for a in articles for p in articles[a]]
# for each paragraph and question we note the title of the corresponding article and the number of the paragraph,
# which will later in the evaluation help us to see if the returned paragraph is correct for the given question
paragraphs_label = [f"{a} {i}" for a in articles for i, p in enumerate(articles[a])]
# list of questions - note the additional level in the list comprehension to go through all questions of a paragraph
questions_corpus = [q for a in articles for p in articles[a] for q in p["questions"]]
questions_label = [f"{a} {i}" for a in articles for i, p in enumerate(articles[a]) for q in p["questions"]]

In [None]:
# transform both paragraphs and questions into TF-IDF features
vectorizer = TfidfVectorizer(strip_accents='unicode')
# learn internal parameters of vectorizer (vocabulary, IDF weights) from known data
vectorizer.fit(paragraphs_corpus)
# transform both datasets with the same vectorizer so they have the same feature dimensions
X_pars = vectorizer.transform(paragraphs_corpus)
# important to only call transform here, so you get the same vector space for both paragraphs and questions
X_ques = vectorizer.transform(questions_corpus)

**Question:** What would the TF-IDF vector look like, if a question contained only words that did not occur in any of the paragraphs?

In [None]:
# check the dimensions of the feature matrices
print(X_pars.shape)  # number of paragraphs x bag-of-words vocabulary
print(X_ques.shape)  # number of questions x bag-of-words vocabulary

In [None]:
# initialize the nearest neighbors search tree:
# return 10 search results for every query based on the cosine similarity
nn = NearestNeighbors(n_neighbors=10, metric='cosine')
# when calling fit, it builds the search tree to efficiently find the closest paragraphs
nn.fit(X_pars)

In [None]:
# pick a question
questions_corpus[888]

In [None]:
# query the nearest neighbors search with the corresponding vector to get the indices of the answer
nn.kneighbors(X_ques[888])
# the result is a tuple with distances and indices (since the nearest neighbors search internally
# works with distances instead of similarities, the first result has the smallest distance, which
# is 1-cosine similarity, i.e., it has the highest similarity)
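
The relationship between the returned distances and the cosine similarity can be checked on a small example. This is a minimal sketch; the `toy_*` vectors are hypothetical stand-ins for TF-IDF rows.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.neighbors import NearestNeighbors

# hypothetical toy vectors standing in for TF-IDF rows
toy_X = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
toy_query = csr_matrix(np.array([[1.0, 0.2]]))

toy_nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(toy_X)
# unpack the tuple: both arrays have shape (n_queries, n_neighbors)
# and are sorted by increasing distance
toy_distances, toy_indices = toy_nn.kneighbors(toy_query)

# convert the cosine distances back into cosine similarities
toy_similarities = 1 - toy_distances[0]
for idx, sim in zip(toy_indices[0], toy_similarities):
    print(f"row {idx}: cosine similarity {sim:.3f}")
```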

In [None]:
# check what the answer text is for the first result
paragraphs_corpus[65]

In [None]:
# what was really the correct paragraph?
questions_label[888]

In [None]:
# corresponding paragraph of the answer
paragraphs_label[65]

In [None]:
# but the 3rd result would have been correct
paragraphs_label[80]

In [None]:
# ask your own question!
q = "Your question here"
# transform the text into a vector (need to pass it as a list to the vectorizer)
x = vectorizer.transform([q])
# query the nearest neighbors search with this new vector
nn.kneighbors(x)

In [None]:
# check what paragraph is behind the index for the best result
paragraphs_corpus[...]

#### Systematic performance analysis with the hits@k metric

Above we only collected anecdotal evidence for the performance of our method; before deploying it into production we should conduct a more systematic evaluation. For this, we use the *hits@k* metric, which checks, for different values of *k*, whether the correct answer is among the first *k* search results. For example, in a Google search, if the website you're looking for is the 1st result, this counts as a hit@1; if it appears anywhere on the first page, it still counts towards the hits@10.

In our example, we check both the hits@k for the paragraphs, as well as the articles, i.e., we check for every paragraph that was returned as a search result, whether that was actually the correct paragraph, or whether it at least came from the right article.
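
The metric itself fits in a few lines. The following is a minimal sketch on made-up search results; `hits_at_k` and the `toy_*` lists are hypothetical names that are not used in the evaluation cell below.

```python
import numpy as np

def hits_at_k(true_labels, ranked_results, k):
    """Fraction of queries whose true label is among the first k results."""
    hits = [t in ranked[:k] for t, ranked in zip(true_labels, ranked_results)]
    return float(np.mean(hits))

# hypothetical toy data: correct label and ranked search results for three queries
toy_true = ["A 0", "B 2", "C 1"]
toy_ranked = [
    ["A 0", "A 1", "B 0"],  # correct result at rank 1
    ["B 0", "B 2", "A 1"],  # correct result at rank 2
    ["A 0", "B 1", "B 2"],  # correct result missing
]

for k in (1, 2, 3):
    print(f"hits@{k}: {hits_at_k(toy_true, toy_ranked, k):.2f}")
```

By construction, hits@k can only stay the same or increase as *k* grows, since a hit within the first *k* results is also a hit within the first *k+1*.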

In [None]:
# you do not need to understand in detail what happens here,
# just execute and look at the final results
print("computing nearest neighbors...", end='\r')
nn = NearestNeighbors(n_neighbors=10, metric='cosine').fit(X_pars)
nn_results = nn.kneighbors(X_ques, return_distance=False)
print("computing nearest neighbors... done!")
# score as: right article in top 10; right paragraph in top 10
article_hits = [[] for i in range(10)]
paragraph_hits = [[] for i in range(10)]
for i in range(X_ques.shape[0]):
    t_label = questions_label[i]
    labels = [paragraphs_label[j] for j in nn_results[i]]
    for k in range(10):
        if t_label in labels[:k+1]:
            paragraph_hits[k].append(1)
        else:
            paragraph_hits[k].append(0)
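    # drop the trailing paragraph number to keep only the article title
    # (rsplit also works if a title itself contains spaces)
    labels = [l.rsplit(" ", 1)[0] for l in labels]
    t_label = t_label.rsplit(" ", 1)[0]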
    for k in range(10):
        if t_label in labels[:k+1]:
            article_hits[k].append(1)
        else:
            article_hits[k].append(0)
for i in [0, 1, 2, 3, 4, 9]:
    print(f"Article Hits @ {i+1:2}: {100*np.mean(article_hits[i]):.1f}")
for i in [0, 1, 2, 3, 4, 9]:
    print(f"Paragraph Hits @ {i+1:2}: {100*np.mean(paragraph_hits[i]):.1f}")
# --> if we show 5 paragraphs, in almost 80% of the cases the correct paragraph is among them;
#     if we return only 3 results, in over 90% at least the correct article is identified

# Exercises

For these exercises, please work on the complete set of articles, not the subset we used so far, i.e., load the data again. To construct a single text from all paragraphs of an article, we join them together with `"\n".join(list_of_paragraphs)` (i.e., one data point is now not one paragraph, but one article).

### 1. Find the article that is most similar to the article about "Beer"
Our dataset contains an article with the title "Beer". Your task is to identify another article in the dataset that is the most similar to this article about beer.

To do this, compute the cosine similarity between the target article with the title "Beer" and all other articles and choose the most similar one (not counting the target article itself ;-)). Similar to how we selected a matching paragraph for each question, this can be done with the `NearestNeighbors` class from `sklearn`.

**Important:** Don't search for the article closest to the word "beer", but to the *whole article with the title "Beer"*.

In [None]:
# load the data again
with open("../data/articles.json") as f:
    articles = json.load(f)

# get a list of all article titles
article_ids = sorted(articles.keys())

# get the corresponding texts of these articles by concatenating all the paragraphs of each article.
# the article texts in this list are in the same order as the article titles in article_ids.
article_corpus = ["\n".join([p["paragraph"] for p in articles[a]]) for a in article_ids]
print(len(article_corpus))

In [None]:
# transform the article texts into a TF-IDF feature matrix X
X = ...

# when you're done, check how many articles and feature dimensions were generated
print(X.shape)  # should be (477, 80732)

In [None]:
# find out the index of our target article "Beer" (remember the titles are in article_ids), 
# so you know which row in the feature matrix X contains the corresponding TF-IDF vector
# (so that you can use this vector to get the nearest neighbors)


In [None]:
# find the most similar article with NearestNeighbors


In [None]:
# based on the index of the most similar article, get the title of this article


## Advanced Exercises

### 2. Find the most similar article to "Beer" - without using `NearestNeighbors`!
Solve the above task without using the `NearestNeighbors` class, i.e., by computing the similarities and selecting the most similar article yourself. 

Since the TF-IDF feature vectors are by default length-normalized, the cosine similarity between the articles can be computed with a simple dot-product (but beware: X is by default a sparse matrix, i.e., `np.dot()` won't work, but you can also call `.dot()` on an array or sparse matrix directly). The result of the dot-product will still be sparse, but you can call `.toarray()` on a sparse matrix to convert it into a regular dense numpy array with which you can work as usual.
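
The claim about length normalization can be verified on a small example. This sketch uses three made-up documents (all `toy_*` names are hypothetical) and only checks that the dot-product matches the cosine similarity; selecting the most similar article is still up to you.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# hypothetical toy documents, not part of the dataset
toy_docs = [
    "beer is brewed with hops",
    "wine is made from grapes",
    "hops give beer its bitterness",
]
toy_X = TfidfVectorizer().fit_transform(toy_docs)  # sparse matrix, rows L2-normalized by default

# Euclidean length of every row
toy_row_norms = np.asarray(np.sqrt(toy_X.multiply(toy_X).sum(axis=1))).ravel()
# pairwise dot-products, converted to a dense array
toy_S = toy_X.dot(toy_X.T).toarray()

print(np.allclose(toy_row_norms, 1.0))
print(np.allclose(toy_S, cosine_similarity(toy_X)))
```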

Once you have a vector with similarities, the functions `np.argmax` or `np.argsort` might be helpful.

### 3. Find the two most similar articles in the set
From all articles, which two have the most similar text? What is their cosine similarity score?

**Question:** Do you have an idea *why* these two articles might have been identified as similar?