Multinomial Naive Bayes Classification

A classifier is an algorithm that differentiates between similar objects based on some features. Naive Bayes is one of the more popular classifiers for text classification, as it has been successfully applied to many domains, particularly Natural Language Processing (NLP). Our dataset contains about 836 distinct classes, so this will be a difficult task: with that many classes, we need a large number of good features to be selected in order to distinguish and characterize each class.

Naive Bayes is a family of algorithms that apply Bayes' theorem under the assumption that every feature is independent of the others in order to predict the category of a given sample. They are probabilistic classifiers: they calculate the probability of each class using Bayes' theorem, and the class with the highest probability is output.

In our dataset, the condition column, which records the medical condition a review relates to, is the class we will try to predict.

We have already preprocessed the reviews in Part 1 while building our search engine. To make tuning the algorithm an iterative process, we split the entire dataset into train, validation, and test sets: train holds 50% of the dataset, and validation and test each hold 25%.
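The split described above can be sketched as follows. This is a minimal sketch, not the original splitter: the `rows` structure, the `condition` key, the function name, and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(rows, key, seed=42):
    """Split rows 50/25/25 into train/validation/test sets,
    shuffling and slicing within each class separately so every
    class is represented in all three splits."""
    random.seed(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)
    train, val, test = [], [], []
    for members in by_class.values():
        random.shuffle(members)
        n = len(members)
        train.extend(members[: n // 2])           # 50% to train
        val.extend(members[n // 2 : 3 * n // 4])  # next 25% to validation
        test.extend(members[3 * n // 4 :])        # final 25% to test
    return train, val, test

# toy usage: 8 reviews across 2 conditions
rows = ([{"condition": "UTI", "review": f"r{i}"} for i in range(4)]
        + [{"condition": "Acne", "review": f"r{i}"} for i in range(4)])
train, val, test = stratified_split(rows, key="condition")
print(len(train), len(val), len(test))  # 4 2 2
```

Splitting within each class rather than over the whole list is what keeps rare conditions from disappearing entirely from the validation or test set.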

Use the existing code from Part 1 to generate an inverted index vocabulary on the training set.

$$ P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)} $$

Steps for calculating the prior probabilities:

1. Create a temporary Python dictionary that stores, for each class, the list of all documents in the class, the unique vocabulary words occurring in the class, and the total number of words in the class.

   temp_class_dict = { class-1 : [[review-1, review-2, ..., review-n],
                                  [word-1, word-2, ..., word-n],
                                  count_total_words],
                       ... }
2. Create a dictionary to store the prior probability of each class, where

   P(class) = number of reviews in the class / total number of reviews in the training set.

  1. Create an inner dictionary for each class with the words occurring in the class as keys, storing the number of times each word occurs across all reviews in the class and the number of reviews in the class in which it occurs.

     naive_class_dict = { class : [P(class), count_total_words, count_unique_words,
                                   {word-1 : [count_total_occurrence, count_review_occurrence],
                                    ... }],
                          ... }
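Both structures above can be filled in a single pass over the training set. The sketch below builds `naive_class_dict` directly; the `(condition, tokens)` review format and the function name are assumptions, not the original code.

```python
def build_class_stats(train):
    """train: list of (condition, tokens) pairs. Returns
    {class: [P(class), count_total_words, count_unique_words,
             {word: [count_total_occurrence, count_review_occurrence]}]}."""
    stats = {}
    for condition, tokens in train:
        cls = stats.setdefault(condition, [0, 0, 0, {}])
        cls[0] += 1            # review count (converted to a prior below)
        cls[1] += len(tokens)  # count_total_words
        for w in tokens:
            cls[3].setdefault(w, [0, 0])[0] += 1  # count_total_occurrence
        for w in set(tokens):
            cls[3][w][1] += 1                     # count_review_occurrence
    n_reviews = sum(c[0] for c in stats.values())
    for cls in stats.values():
        cls[0] /= n_reviews    # P(class) = reviews in class / total reviews
        cls[2] = len(cls[3])   # count_unique_words
    return stats

# toy usage
train = [("UTI", ["bad", "reaction"]), ("UTI", ["bad"]), ("Acne", ["mild"])]
stats = build_class_stats(train)
print(stats["UTI"][3]["bad"])  # [2, 2]
```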
    

Calculating the probability of a given review belonging to a class:

Consider the example below; we will use the prior probabilities to calculate the probability of this review belonging to the class Urinary Tract Infection.

drugName:    Cipro
condition:   Urinary Tract Infection
review:      "I also had a very bad reaction to this medication!"
rating:      1
usefulCount: 44
revvec:      ['bad', 'reaction', 'medication']
revID:       109180

naive_class_dict['Urinary Tract Infection'][3]['bad'] = [315, 249]
naive_class_dict['Urinary Tract Infection'][3]['reaction'] = [74, 66]
naive_class_dict['Urinary Tract Infection'][3]['medication'] = [265, 198]

  • So, the probability of a given query belonging to a class is

log P(class | w1, w2, w3) = log P(w1 | class) + log P(w2 | class) + log P(w3 | class) + log P(class)

  • probability of a class is

P(class) = number of reviews in the class / total number of reviews

  • The probability of a word w1 occurring in a class is

    P(w1 | class) = (number of times w1 occurs in the class + alpha) / (total number of words in the class + alpha * total number of words in the vocabulary)

  • Here alpha is the smoothing factor, which prevents zero probabilities when a word in the query does not occur within a class. This hyperparameter comes in handy when tuning the classifier to increase accuracy.

P(Urinary Tract Infection) = 0.008207509918583007
log P(bad | Urinary Tract Infection) = log((315 + 1) / (28831 + 32622)) = -4.376021118875437
log P(reaction | Urinary Tract Infection) = log((74 + 1) / (28831 + 32622)) = -7.295332487596864
log P(medication | Urinary Tract Infection) = log((265 + 1) / (28831 + 32622)) = -9.66062970253546
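Putting the pieces together, scoring one review against one class can be sketched like this. The function and dictionary names are illustrative, and the toy statistics are loosely modelled on the counts in the worked example above, so the printed score will not match the document's reported values exactly.

```python
from math import log

def log_score(tokens, class_stats, prior, vocab_size, alpha=1.0):
    """Unnormalised log P(class | tokens): the log prior plus a
    smoothed log-likelihood term for each word in the review."""
    total_words = class_stats["total_words"]
    score = log(prior)
    for w in tokens:
        count = class_stats["word_counts"].get(w, 0)
        # Lidstone smoothing: unseen words still get a small nonzero mass
        score += log((count + alpha) / (total_words + alpha * vocab_size))
    return score

# toy class statistics modelled on the worked example above
uti = {"total_words": 28831,
       "word_counts": {"bad": 315, "reaction": 74, "medication": 265}}
s = log_score(["bad", "reaction", "medication"], uti,
              prior=0.0082, vocab_size=32622)
print(round(s, 2))
```

Working in log space avoids the numeric underflow that multiplying many small probabilities would otherwise cause.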

Challenges Faced

  • Optimizing the algorithm was somewhat challenging. I created a custom list of dataset-specific stop words, since the texts we analyse are reviews and the features are not very distinctive between classes.

  • Choosing the alpha value was a decision I made to improve the accuracy of the classifier. Generally, alpha is assumed to be 1.

Tested on a random sample of 30,000 reviews from the validation set:

  alpha         Accuracy %    Error %
  1             53.6          46.4
  0.1           54.2          45.8
  0.0001        53.304        46.69
  0.00001       55.4          44.59
  0.0000001     59.69         42.3
  0.00000001    54.2          45.8
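A sweep like the one tabulated above can be scripted along these lines. This is a toy-data sketch, not the real 30,000-review experiment: the class statistics, the 5-word vocabulary, and the tiny validation list are invented for illustration.

```python
from math import log

def predict(tokens, stats, vocab_size, alpha):
    """Return the class with the highest smoothed log-probability."""
    best, best_score = None, float("-inf")
    for cls, (prior, total_words, word_counts) in stats.items():
        score = log(prior)
        for w in tokens:
            score += log((word_counts.get(w, 0) + alpha)
                         / (total_words + alpha * vocab_size))
        if score > best_score:
            best, best_score = cls, score
    return best

# toy class statistics: {class: (P(class), count_total_words, word counts)}
stats = {"UTI":  (0.5, 10, {"bad": 4, "burning": 3}),
         "Acne": (0.5, 10, {"mild": 5, "cream": 2})}
val = [(["bad", "burning"], "UTI"),
       (["cream"], "Acne"),
       (["mild", "bad"], "Acne")]

# measure accuracy on the held-out reviews for each candidate alpha
for alpha in (1, 0.1, 1e-7):
    correct = sum(predict(toks, stats, vocab_size=5, alpha=alpha) == label
                  for toks, label in val)
    print(alpha, round(100 * correct / len(val), 1))
```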

Contributions
  1. Developed the classifier algorithm in numpy.

  2. Smoothing hyperparameter optimized to 0.0000001 instead of default value 1.

  3. Created a custom train/validation/test splitter that ensures an even distribution of reviews based on the medical condition each is associated with.

References:

Part 1

Part 3