Search Engine with Python
A good search engine need not be sophisticated; it is better to start simple to understand how one works. Of course, the first thing we need is a good dataset of raw text for our mining purposes. I have used a Kaggle dataset consisting of patient reviews of medications. It comes in CSV format, so we load it into Python as a pandas DataFrame.
Data Preprocessing:
To keep things simple, we match searches against the reviews only. The first challenge we face is data preprocessing. Our raw dataset (test and train data combined) contains about 215,000 records. Upon inspection, some rows are unusable because their “condition” and “drug name” columns are incomplete; after dropping those we are left with 213,892 rows. The reviews contain HTML codes that were left as is and need to be converted to UTF-8 encoded strings with the html library’s unescape function. Once this is done, it is time for text mining: we convert each review string into a list of words with the string split function, ensuring that all words are lower-case and at least 3 characters long.
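The steps above can be sketched as follows. In the real pipeline the data comes from the Kaggle drug-review CSVs loaded with pd.read_csv; a tiny inline DataFrame stands in for it here, and the column names ("drugName", "condition", "review") are assumptions that should be checked against your local copy.

```python
import html
import pandas as pd

# Stand-in for the Kaggle CSVs (column names are assumptions).
df = pd.DataFrame({
    "drugName": ["DrugA", "DrugB", None],
    "condition": ["Pain", None, "Acne"],
    "review": ["It helped my pain &amp; sleep!", "no effect", "ok"],
})

# Drop rows with an incomplete "condition" or "drugName" column.
df = df.dropna(subset=["condition", "drugName"]).reset_index(drop=True)

def tokenize(review):
    """Unescape HTML entities, lower-case, split into words,
    and keep only words of length >= 3."""
    text = html.unescape(review).lower()
    return [w for w in text.split() if len(w) >= 3]

df["tokens"] = df["review"].map(tokenize)
print(df["tokens"].tolist())
```

Note that splitting on whitespace alone leaves punctuation attached to words; the stop-word removal and lemmatization described next operate on these raw tokens.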
The nltk library is one of the best and easiest options for natural language processing in Python. Its corpus module includes English stop words, available (after a one-time download) via stopwords.words("english"); we use this list to remove stop words from our reviews. The library also provides WordNetLemmatizer: we create a lemmatizer object and apply its lemmatize() function to each word to find its root form. This ends the preprocessing step.
Creating Vocabulary:
{ word1 : [ docfreq,
            { docid1: [pos1, pos2, ...], docid2: [pos1, pos2, ...], ... },
            { doc1: w1, doc2: w2, ... } ],
  word2 : ...,
  ...
}
Notice we use a position list to store every index of every occurrence of a word in a document.
We get 32742 unique words as features for our dataset.
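The vocabulary structure above can be built in one pass over the preprocessed reviews. This is a sketch, not the exact implementation: each entry holds the document frequency, a positional list per document, and a weight dictionary that is filled in later during the tf-idf step.

```python
def build_index(docs):
    """docs: {doc_id: list of preprocessed tokens}.
    Returns {word: [docfreq, {doc_id: [positions]}, {doc_id: weight}]}."""
    index = {}
    for doc_id, tokens in docs.items():
        for pos, word in enumerate(tokens):
            entry = index.setdefault(word, [0, {}, {}])
            if doc_id not in entry[1]:
                entry[0] += 1             # document frequency
                entry[1][doc_id] = []
            entry[1][doc_id].append(pos)  # positional list
    return index

docs = {1: ["stomach", "hurt", "stomach"], 2: ["hurt", "ache"]}
index = build_index(docs)
print(index["stomach"])  # -> [1, {1: [0, 2]}, {}]
```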
Calculating the tf-idf weights for each word
1. After constructing the first stage of the vocabulary, we calculate the weight of each word.
2. The term frequency of a word is given with respect to each review it appears in:
   word1-tf-review1 = number of times word1 appears in review1 / number of words in review1
3. The inverse document frequency of a word is defined by:
   word1-idf = total number of reviews in the dataset / number of reviews word1 occurs in
4. The weight of a word in a review is given by:
   word1-weight-review1 = (1 + log(word1-tf-review1)) / word1-idf
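The weighting pass can be sketched as below, following the formulas in the text literally (note that the idf here is a plain ratio with no log, exactly as defined). The index layout is the vocabulary structure described earlier.

```python
import math

def compute_weights(index, docs):
    """index: {word: [docfreq, {doc_id: positions}, {doc_id: weight}]};
    docs: {doc_id: list of tokens}. Fills each entry's weight dict."""
    n_reviews = len(docs)
    for word, (docfreq, positions, weights) in index.items():
        idf = n_reviews / docfreq                  # N / df, no log
        for doc_id, pos_list in positions.items():
            tf = len(pos_list) / len(docs[doc_id])  # term frequency
            weights[doc_id] = (1 + math.log(tf)) / idf

docs = {1: ["stomach", "hurt", "stomach"], 2: ["hurt", "ache"]}
index = {
    "stomach": [1, {1: [0, 2]}, {}],
    "hurt":    [2, {1: [1], 2: [0]}, {}],
    "ache":    [1, {2: [1]}, {}],
}
compute_weights(index, docs)
print(index["stomach"][2])  # weight of "stomach" in review 1
```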
Query processing and calculating similarity
Finally, we use the input box we created on our application’s search page to retrieve the user’s search query, and put it through the same preprocessing as the reviews: lower-casing, stop-word removal and lemmatization. We calculate the weights of the words in the query using term frequency only. We then retrieve the top-k reviews from the posting list of each word in the search query. If a review appears in the top-k elements of every query word, we calculate its cosine similarity score:
$$ sim(q,r) = \vec{q} \cdot \vec{r} = \sum_{t\ \text{in both } q \text{ and } r} w_{t,q} \times w_{t,r}.$$
Here ‘q’ is the search query and ‘r’ is a review vector. If a review does not appear in the top-k elements of some query words, we use the weight of the k-th element of each such word’s posting list as an upper bound on that word’s weight in the review. We can then calculate an upper-bound score using the query words’ actual and upper-bound weights, as follows.
$$ \overline{sim(q,r)} = \sum_{t\in T_1} w_{t,q} \times w_{t,r} + \sum_{t\in T_2} w_{t,q} \times \overline{w_{t,r}}.$$
In the equation above, the first sum runs over the query words whose top-k elements contain the review ($T_1$), and the second over the query words whose top-k elements do not ($T_2$); for the latter, the weight of the k-th element of the word’s posting list is used. If the review appears in no query word’s top-k elements, every term uses the upper bound:
$$ \overline{sim(q,r)} = \sum_{t\in q} w_{t,q} \times \overline{w_{t,r}}.$$
We choose k = 20; trial and error showed this to be a good value. k acts as a hyperparameter of the search.
Let the query be “My stomach hurts”. After preprocessing, its vector representation will be [‘stomach’, ‘hurt’].
Below we show how the algorithm calculates the similarity of a query:
word-weight-query = number of times word appears in the query / number of words in query
- retrieve the top 20 entries of the posting list for each word in the query
- find the list of all candidate reviews
- for every review:
    - for every word in the query:
        - if the word’s top-20 list contains the review:
            - score += word-weight-review * word-weight-query
        - else:
            - score += word-weight-review20 * word-weight-query
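The scoring loop above can be sketched as follows. Posting lists are assumed to be pre-sorted by weight in descending order; each is truncated to its top-k entries, and the k-th weight serves as the upper bound for reviews outside the list. The names and data layout here are illustrative, not the article’s actual code.

```python
def score_reviews(query_weights, postings, k=20):
    """query_weights: {word: weight in query};
    postings: {word: [(doc_id, weight), ...] sorted by weight, descending}."""
    top = {w: dict(postings[w][:k]) for w in query_weights if w in postings}
    bound = {w: postings[w][min(k, len(postings[w])) - 1][1]
             for w in query_weights if w in postings}
    candidates = {d for docs in top.values() for d in docs}
    scores = {}
    for doc in candidates:
        score = 0.0
        for word, q_w in query_weights.items():
            if word not in top:
                continue                       # word absent from the index
            if doc in top[word]:
                score += top[word][doc] * q_w  # actual review weight
            else:
                score += bound[word] * q_w     # upper bound: k-th weight
        scores[doc] = score
    return scores

q = {"stomach": 0.5, "hurt": 0.5}
postings = {"stomach": [(1, 0.9), (3, 0.4)], "hurt": [(2, 0.8), (1, 0.6)]}
print(score_reviews(q, postings, k=2))
```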
Challenges faced
- Counting the occurrences of each word in each review. Solved using the concept of positional indexing.
- Retrieving the posting list for each word efficiently. It was computationally expensive to sort every posting list and keep it stored prior to the query similarity calculation, so we sort only the posting lists actually retrieved during the similarity calculation.
Inverted Index vs Term Document Matrix
The term-document matrix was generated with the gensim library; its retrieval time for a search result was about 10 seconds. The custom inverted index yields results drastically faster, with a query retrieval time of about 0.05 seconds.
Contributions:
- Implemented my own inverted index from scratch using numpy and pandas.
- Created a way to quickly retrieve the posting lists for the similarity calculation.
- Improved efficiency by dropping words that appear too frequently from the vocabulary.
- Used nltk’s WordNetLemmatizer for feature selection.
References
- “A first take at building an inverted index”, Introduction to Information Retrieval: https://nlp.stanford.edu/IR-book/html/htmledition/a-first-take-at-building-an-inverted-index-1.html
Continue reading to implement a Multinomial Naive Bayes classifier.