Naïve Bayes Classifier

Thomaskutty Reji
4 min read · Feb 8, 2021

Naïve Bayes is an algorithm primarily used for classification. Its major use cases include spam filtering, medical diagnosis, recommendation engines, and text classification. The Naïve Bayes classifier is a probabilistic classifier: given an input, it predicts the probability of that input belonging to each of the classes. This is a conditional probability.

Bayes’ theorem

The foundation of this algorithm is the classical Bayes' theorem. It describes the probability of an event based on prior knowledge of conditions that might be related to the event. The theorem can be formulated as follows:
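With mutually exclusive hypotheses E1, E2, …, En (one of which must occur) and an observed event A, a standard statement of the theorem is:

P(E_i \mid A) = \frac{P(A \mid E_i)\, P(E_i)}{\sum_{k=1}^{n} P(A \mid E_k)\, P(E_k)}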

The events E1, E2, …, En are called hypotheses. The probability p(Ei) is called the prior probability, and p(Ei|A) is called the posterior probability. Bayes' theorem can be derived for events and for random variables separately, using conditional probability and the law of total probability. In machine learning modelling, we apply Bayes' theorem as follows:
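Here C denotes a class and D the observed data, the notation used in the rest of this section:

P(C \mid D) = \frac{P(D \mid C)\, P(C)}{P(D)}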

In the above equation, the left-hand side is the posterior probability: the probability that the given data D belongs to class C. On the right-hand side, p(D) is the probability of observing the data D. We cannot easily compute p(D), but since we are only interested in classifying the data, this does not matter: p(D) is the same for every class. If we have two classes, say C1 and C2, we just compare the posterior probabilities with respect to each class, and the decision rule picks the class with the higher probability. Now p(D|C) is the likelihood of D given the class C; for categorical data it can be calculated easily from the frequency counts of each category. p(C) is the prior probability of class C in the given training data. So, to predict the class of the test data, we need two steps:

  • Calculating the posterior probabilities
  • Applying the decision rule (argmax), written out below
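Since p(D) is the same for every class, the decision rule amounts to picking the class with the largest numerator:

\hat{C} = \arg\max_{C} \; P(D \mid C)\, P(C)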

In the modelling phase we have a feature vector, say (x1, x2, …, xn). Applying the above formulation, we have the following:
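With the conditional-independence assumption stated in the next section, the likelihood factorises over the features, so the posterior for a class C is proportional to:

P(C \mid x_1, x_2, \dots, x_n) \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)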

Assumptions made for Naïve Bayes classifier

  • The attributes or features are independent of each other, given the class. This independence assumption rarely holds exactly, but it often works well in practice. Hence the name naïve.
  • All features are given equal importance.

Types of naïve Bayes classification

  • Bernoulli Naïve Bayes: This is used for discrete data that follows a Bernoulli distribution. When we are dealing with features that are in binary form, we use Bernoulli naïve Bayes. sklearn provides BernoulliNB to implement it. Following is the code snippet:
from sklearn.naive_bayes import BernoulliNB

# X is a binary feature matrix, y holds the class label of each row
clf = BernoulliNB()
clf.fit(X, y)
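A minimal runnable sketch of the above, with made-up binary features (the toy arrays and the spam/not-spam labels are purely illustrative):

import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Toy binary features: e.g. whether each of three words appears in a message
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
y = np.array([1, 1, 0, 0])  # illustrative labels: 1 = spam, 0 = not spam

clf = BernoulliNB()
clf.fit(X, y)
print(clf.predict([[1, 0, 0]]))        # predicted class
print(clf.predict_proba([[1, 0, 0]]))  # posterior probability of each class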
  • Multinomial Naïve Bayes: We use multinomial naive Bayes to model feature vectors in which each value represents a count or frequency. This performs well in text classification problems.
from sklearn.naive_bayes import MultinomialNB

# X holds count features (e.g. word frequencies), y holds the class labels
clf = MultinomialNB()
clf.fit(X, y)
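A small text-classification sketch, assuming a bag-of-words representation built with CountVectorizer (the documents and labels below are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["free money now", "meeting at noon", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]  # illustrative: 1 = spam, 0 = not spam

vec = CountVectorizer()
X_counts = vec.fit_transform(docs)  # word-count feature matrix

clf = MultinomialNB()
clf.fit(X_counts, labels)
print(clf.predict(vec.transform(["free prize money"])))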
  • Gaussian Naïve Bayes: If the predictors are not discrete but take continuous values, we use Gaussian naive Bayes. Here we assume that, within each class, the predictors are a sample from a Gaussian distribution, so we compute the likelihoods using the normal density as follows:
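Writing \mu_{C,i} and \sigma_{C,i} for the mean and standard deviation of feature x_i within class C (the quantities estimated from the training data), the likelihood is:

P(x_i \mid C) = \frac{1}{\sqrt{2\pi \sigma_{C,i}^2}} \exp\!\left( -\frac{(x_i - \mu_{C,i})^2}{2\sigma_{C,i}^2} \right)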

The model can be fit by simply finding the mean and standard deviation of the points for each feature within each class.

from sklearn.naive_bayes import GaussianNB

# X holds continuous features, y holds the class labels
clf = GaussianNB()
clf.fit(X, y)
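An end-to-end sketch on the Iris dataset, chosen here only as a convenient continuous-feature example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = GaussianNB()
clf.fit(X_train, y_train)              # estimates per-class means and variances
y_pred = clf.predict(X_test)
print(accuracy_score(y_test, y_pred))  # held-out accuracy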

Note: Naive Bayes is relatively immune to overfitting. It is a simple hypothesis, so it cannot accurately represent many complex situations; since the bias is high, the model exhibits low variance.

You can find a naive Bayes implementation from scratch (for binary categorical data) in my GitHub:

https://github.com/thomasreji155/Machine-Learning-with-Python/blob/main/naive_bayes.py

Advantage of Naive Bayes

  • Since there are no gradients or iterative parameter updates to compute, training and prediction are fast.
  • Naive Bayes requires only a small amount of training data to estimate the parameters needed for classification.
  • Naive Bayes handles missing values more easily than many other algorithms.
  • The model handles both continuous and discrete data.

Disadvantages of Naive Bayes

  • Model cannot incorporate feature interactions.
  • Model performance is affected if we have skewed training data.
  • The zero-frequency problem (addressed below).

Improving the model performance

  • Zero-frequency problem: There can be cases where a categorical attribute takes a value that was never observed in training. The model would then assign it zero probability and be unable to make predictions. The usual fix is to add 1 to every count in the frequency table; this is known as additive (Laplace) smoothing. See the snippet after this list.
  • Naive Bayes performance improves when the features are uncorrelated. So we can remove features that are highly correlated, using pairwise correlation.
  • Naive Bayes handles missing data: if a data instance has a missing value, that value is simply ignored while computing the probabilities.
  • It is good to use log probabilities to avoid floating-point underflow when multiplying many small probabilities.
  • Instead of the usual normal or binomial distributions, we can try other distributions for computing the likelihoods if they fit the data better.
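A short sketch of the smoothing and log-probability points above in sklearn: the alpha parameter of MultinomialNB controls additive smoothing, and predict_log_proba returns log probabilities directly (the toy count matrix below is only illustrative):

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 0, 1],
              [0, 3, 0],
              [1, 1, 0],
              [0, 0, 4]])  # toy word-count features
y = np.array([1, 0, 1, 0])

# alpha=1.0 applies Laplace (add-one) smoothing, so feature values unseen
# in a class never get exactly zero probability
clf = MultinomialNB(alpha=1.0)
clf.fit(X, y)

print(clf.predict_log_proba([[1, 0, 2]]))  # log posteriors, no underflow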

Thank you :)
