Logistic Regression

Thomaskutty Reji
Dec 24, 2020 · 7 min read

Introduction

Logistic regression is primarily used for binary classification. Some examples are predicting whether a patient has heart disease, predicting whether a transaction is fraudulent, and detecting email spam. In linear regression we model the data as follows:
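
y = ɵᵀX = ɵ0 + ɵ1x1 + ɵ2x2 + … + ɵnxn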

Here ɵ = [ ɵ0 ɵ1 ɵ2 … ɵn] and X represents the feature matrix. Logistic regression is a special type of generalized linear model (GLM). Logistic regression can work with both continuous and discrete data. One basic difference between linear regression and logistic regression is how the line fits the data. In logistic regression, we fit the logistic curve to the data, and the model predicts the probability that the given event (a linear combination of the independent features) belongs to a particular class. In linear regression, we fit a line to the data using the principle of least squares, and we use the sum of squared residuals to calculate R². We don't have the concept of residuals in logistic regression; instead, we rely on the maximum likelihood method.

The logistic regression model takes the input data and predicts the probability that the input belongs to the default class. If the probability is greater than 0.5, the input is predicted as the default class (spam); otherwise, it is classified as non-spam. The target variable in logistic regression follows the Bernoulli distribution with the unknown probability p. Somehow, we have to connect the input data (a linear combination, β0 + β1x) to the target probabilities.

Logistic regression is a discriminative algorithm where the model tries to predict p(yi | x), i.e., Pr(default = spam | data), which we can abbreviate as p(data) and which ranges between 0 and 1. Generative algorithms, in contrast, estimate the joint probability p(xi, y) (often by modeling p(xi | y) and p(y) separately).

Logistic model

As we are estimating probabilities, we have to make sure that our predictions lie between 0 and 1. In logistic regression, we use the logistic function to satisfy this criterion. So, the hypothesis is as follows:
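
p(x) = e^(β0 + β1x) / (1 + e^(β0 + β1x))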

Here, β1 and β0 are the parameters which we have to estimate using a method called maximum likelihood, which will be discussed later. From the hypothesis, we observe that if β1 is positive then an increase in x increases p(x), and if β1 is negative then an increase in x decreases p(x). Consider the following equation:
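
p(x) / (1 - p(x)) = e^(β0 + β1x)   … (3)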

The LHS represents the odds of an event happening. There is a small difference between odds and probabilities: odds are the ratio of something happening to something not happening, whereas probability is the ratio of something happening to everything that could happen. We use probabilities to find the odds. Equation (3) can be solved for p(x), the probability that the event belongs to the default class.

We take the log of the odds (the logit) to make things symmetrical, so that it is easier to interpret the results. Taking the logarithm on both sides of equation (3), we get a linear function of the independent variables:
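
log( p(x) / (1 - p(x)) ) = β0 + β1x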

In the graph of the logit, the probabilities are on the x-axis, but we want the probabilities on the y-axis. We are already familiar with β0 + β1x, the linear regression model. In linear regression, β0 represents the bias and β1 represents the change in y for a unit change in x. But in logistic regression, β1 represents the change in the log-odds for a unit change in x. Note that β1 does not correspond to the change in p(x); there is no straight-line relationship between x and p(x).

So, we have derived the regression equation for logistic regression (the estimated probability). The inverse of the logit gives us the probability on the y-axis. This inverse of the logit is called the sigmoid function:
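
p(x) = 1 / (1 + e^(-(β0 + β1x)))

A tiny NumPy sketch (illustrative only; the function names and example values are just placeholders) shows that applying the logit to the sigmoid output recovers the original log-odds:

```python
import numpy as np

def logit(p):
    # log-odds: log(p / (1 - p))
    return np.log(p / (1.0 - p))

def sigmoid(z):
    # inverse of the logit: maps a log-odds value back to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, 0.0, 2.5])
print(sigmoid(z))          # ≈ [0.047, 0.5, 0.924]
print(logit(sigmoid(z)))   # recovers z, confirming the sigmoid is the inverse of the logit
```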

Estimating the regression coefficients using maximum likelihood estimation (MLE)

We have to estimate the parameters (β0 and β1) from the training data. For linear regression, we used the principle of least squares to estimate the coefficients. In logistic regression, we use maximum likelihood to estimate the parameters. The principle underlying this method of estimation is that the best estimates of the parameters based on a sample are those values which make the probability of getting that sample the maximum. MLE provides the parameter value(s) that make the given sample the most likely sample among all possible samples. The likelihood of ɵ is defined as the joint pdf of the sample points given ɵ:
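
L(ɵ) = f(x1, x2, …, xm; ɵ) = ∏ f(xi; ɵ)   (product over i = 1, …, m)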

And the log-likelihood is as follows:
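
log L(ɵ) = Σ log f(xi; ɵ)   (sum over i = 1, …, m)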

So, now we have to find the parameters which maximize the likelihood function.

We already have our model hypothesis as follows:
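
hɵ(x) = 1 / (1 + e^(-ɵᵀx))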

In a binary setting, we can write the probability for each class as follows (1 and 0 representing the two classes):
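
P(y = 1 | x; ɵ) = hɵ(x)
P(y = 0 | x; ɵ) = 1 - hɵ(x)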

Combining both equations, we get,
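
p(y | x; ɵ) = hɵ(x)^(y) · (1 - hɵ(x))^(1 - y)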

Now we have the probability function, which tells us the probability that the dependent variable belongs to a certain class given the input, with the model parameterized by ɵ. Next, let's write the likelihood function:
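
L(ɵ) = ∏ hɵ(xi)^(yi) · (1 - hɵ(xi))^(1 - yi)   (product over i = 1, …, m)

Taking the logarithm gives the log-likelihood:

log L(ɵ) = Σ [ yi · log hɵ(xi) + (1 - yi) · log(1 - hɵ(xi)) ]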

We have to maximize this likelihood: since the likelihood represents the plausibility of the model, maximizing it gives the most plausible model for the given data. Taking the partial derivative of the log-likelihood with respect to ɵ, we get:
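
∂ log L(ɵ) / ∂ɵj = Σ ( yi - hɵ(xi) ) · xij   (sum over i = 1, …, m)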

Plugging equation (10) into equation (9) and dividing by the number of training samples (m), the division by m makes the math a bit easier:
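
J(ɵ) = -(1/m) Σ [ yi · log hɵ(xi) + (1 - yi) · log(1 - hɵ(xi)) ],  with gradient

∂J(ɵ)/∂ɵj = (1/m) Σ ( hɵ(xi) - yi ) · xij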

Note that if we have n features, say x1, x2, x3, …, xn, then ɵ = [ ɵ0 ɵ1 ɵ2 … ɵn]. To minimize the negative log-likelihood, we use the gradient descent optimization algorithm to update the weights using the gradient.

Optimization using gradient descent algorithm

If the gradient is negative, we increase the weights, and if the gradient is positive, we decrease the weights. We start with some ɵ and keep updating it (simultaneously across all components) until we hopefully end up at a minimum (within a tolerance level). The weight update rule can be formulated as follows:
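
ɵj := ɵj - α · ∂J(ɵ)/∂ɵj = ɵj - (α/m) Σ ( hɵ(xi) - yi ) · xij   (updated for all j simultaneously)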

α represents the learning rate. We repeat this update until convergence. If α is too small, gradient descent can be slow; if it is too large, the algorithm may overshoot and fail to converge. Note that as we approach a local minimum, gradient descent automatically takes smaller steps, because the gradient itself becomes smaller.
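
To make the whole procedure concrete, here is a minimal NumPy sketch of batch gradient descent for logistic regression, following the update rule above. It is an illustrative sketch, not a reference implementation; the function names, learning rate, and synthetic data are just assumed choices.

```python
import numpy as np

def sigmoid(z):
    # logistic (sigmoid) function: maps any real score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.1, n_iters=5000, tol=1e-7):
    # Estimate theta by minimizing J(theta), the average negative log-likelihood
    m, n = X.shape
    Xb = np.hstack([np.ones((m, 1)), X])   # prepend a column of 1s so theta[0] acts as the bias
    theta = np.zeros(n + 1)
    prev_cost = np.inf
    for _ in range(n_iters):
        p = sigmoid(Xb @ theta)            # h_theta(x_i) for every training sample
        grad = Xb.T @ (p - y) / m          # (1/m) * sum_i (h_theta(x_i) - y_i) * x_ij
        theta -= alpha * grad              # simultaneous update of all weights
        cost = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
        if prev_cost - cost < tol:         # stop once the improvement falls below the tolerance level
            break
        prev_cost = cost
    return theta

def predict(X, theta, threshold=0.5):
    # classify as the default class (1) when the predicted probability exceeds the threshold
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return (sigmoid(Xb @ theta) >= threshold).astype(int)

# Tiny synthetic example: class 1 tends to have larger x values
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)
theta = fit_logistic(X, y)
print(theta)                                      # learned [theta_0, theta_1]
print(predict(np.array([[-2.0], [2.0]]), theta))  # expected: [0 1]
```

The column of ones lets ɵ0 play the role of the bias term, and training stops once the decrease in the cost falls below the tolerance level, matching the stopping criterion described above.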

Summary

To sum up, logistic regression is a statistical model that allows us to predict the probability that an input belongs to a certain class. We started by explaining the odds and then the logit. Through the logit, we built a model that takes the linear combination of the inputs and gives us the probability that the input belongs to the default class (through the sigmoid function). For parameter estimation, we used the maximum likelihood principle, and the optimization is done through the gradient descent algorithm. This theory for binary logistic regression can be easily generalized to multiclass logistic regression.

Thank you:)
