Optimizing Performance: The Journey of Gradient Descent in Logistic Regression

INTRODUCTION

Welcome to our blog on optimizing performance in logistic regression through the power of gradient descent! In this article, we dive into the inner workings of the cost function and its relationship with the popular optimization algorithm. Join us as we unravel the secrets behind achieving accurate predictions and efficient convergence in logistic regression models.

COST FUNCTION

The cost function for logistic regression is commonly known as the "logistic loss" or "log loss."

It is a mathematical representation of the error between the predicted probabilities and the actual class labels in a binary classification problem.

The cost function uses the sigmoid function to map the linear regression output to a probability between 0 and 1.
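
For reference, the sigmoid function is

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

which maps any real-valued input z to a value between 0 and 1 that can be interpreted as a probability.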

By minimizing the cost function using techniques like gradient descent, we can effectively train the logistic regression model to make accurate predictions.

LOG LOSS FORMULA

 L = (-1/m) * Σ [ y(i) * log(y_hat(i)) + (1 - y(i)) * log(1 - y_hat(i)) ]

Where:

  • L represents the cost function.

  • m is the total number of training examples.

  • Σ denotes the sum over all training examples.

  • y(i) represents the actual class label (0 or 1) for the i-th training example.

  • y_hat(i) is the predicted probability (the sigmoid output) that the i-th training example belongs to class 1.

  • log denotes the natural logarithm. (A short NumPy sketch of the log loss is given below.)
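
To make the formula concrete, here is a minimal NumPy sketch of the log loss; the function name, the clipping constant eps, and the sample values are illustrative choices, not part of the original formula.

import numpy as np

def log_loss(y, y_hat, eps=1e-12):
    # Clip predictions away from exactly 0 and 1 to avoid log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    # L = -(1/m) * Σ [ y*log(y_hat) + (1-y)*log(1-y_hat) ]
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Example with two training examples
y = np.array([1, 0])
y_hat = np.array([0.9, 0.2])
print(log_loss(y, y_hat))   # ≈ 0.164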

GRADIENT DESCENT FOR COST FUNCTION

$$W = W - (lr) \cdot \frac{dL}{dW}$$

Where,

  • W represents the weight parameters of the logistic regression model.

  • lr denotes the learning rate, which is a hyperparameter that controls the step size in each iteration of gradient descent.

  • dL/dW represents the derivative of the cost function with respect to the weights, known as the gradient. It indicates the direction and magnitude of steepest ascent, so stepping against it moves the weights toward lower cost. (A small standalone sketch of this update rule is given below.)
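
As a standalone illustration of this update rule (independent of logistic regression), here is a minimal sketch that minimizes a made-up quadratic cost L(W) = (W - 3)^2; the cost, its derivative, and all values are purely illustrative.

# Toy cost L(W) = (W - 3)^2 with known minimum at W = 3 (illustrative only)
def dL_dW(W):
    return 2 * (W - 3)

W = 0.0        # initial weight
lr = 0.1       # learning rate
for _ in range(100):
    W = W - lr * dL_dW(W)   # W = W - (lr) * dL/dW
print(W)       # converges to approximately 3.0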

DERIVING GRADIENT FOR LOGISTIC REGRESSION

Suppose we have an m x n matrix of data points (INPUTS), where m is the number of training examples and n is the number of features.

Let X

X = [  [x11  x12  x13  ...  x1n] 
       [x21  x22  x23  ...  x2n]  
       [x31  x32  x33  ...  x3n]  
       ...   ...   ...  ...  ...  
       [xm1  xm2  xm3  ...  xmn]
]

We have m labeled outputs (one per training example)

Let y

y = [  [y1]
       [y2]
       [y3]
        ...
       [ym]
   ]

After prediction, we have m predicted outputs

Let y_hat

y_hat = [  [y_hat1]
           [y_hat2]
           [y_hat3]
             ...
           [y_hatm]
]

Step 1: We know that y_hat1 = σ(w0 + w1x11 + w2x12 + ... + wnx1n)

Similarly, for all m examples:

y_hat = [ [σ(w0 + w1x11 + w2x12 + ... + wnx1n)]
          [σ(w0 + w1x21 + w2x22 + ... + wnx2n)]
          [σ(w0 + w1x31 + w2x32 + ... + wnx3n)]
           ...
          [σ(w0 + w1xm1 + w2xm2 + ... + wnxmn)]
]

Step 2: Prepend a column of ones to X so that w0 is absorbed into the weight vector W. By the rules of matrix multiplication, the m equations above then collapse into a single expression

y_hat = σ(XW)
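
To make Step 2 concrete, here is a minimal NumPy sketch of the vectorized prediction; the data and weight values are made up purely for illustration.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy data: m = 3 examples, n = 2 features (values are illustrative)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
W = np.array([0.5, -0.2, 0.1])      # [w0, w1, w2]

# Prepend a column of ones so that w0 acts as the intercept
X_aug = np.insert(X, 0, 1, axis=1)

# y_hat = σ(XW): one predicted probability per example
y_hat = sigmoid(np.dot(X_aug, W))
print(y_hat)                        # shape (3,), values between 0 and 1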

Step 3: Substitute y_hat into the log loss

L = (-1/m) * Σ [ y * log(σ(XW)) + (1 - y) * log(1 - σ(XW)) ]

Step 4: Find dL/dW

First, find the derivative of the first half of the loss:

$$\frac{d\left(y \log(\hat{y})\right)}{dW}$$

Apply the chain rule, taking the derivative of L with respect to y_hat and of y_hat with respect to W:

$$\frac{dL}{dW} = \frac{dL}{d\hat{y}} \cdot \frac{d\hat{y}}{dW}$$

Now the first half becomes

Equation (1):

$$\frac{d\left(y \log(\hat{y})\right)}{dW} = y(1 - \hat{y})X$$
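
To see where Equation (1) comes from, here are the intermediate steps; they use the derivative of the sigmoid (stated below, in Step 5) together with the fact that d(XW)/dW = X:

$$\frac{d\left(y \log(\hat{y})\right)}{dW} = y \cdot \frac{1}{\hat{y}} \cdot \frac{d\hat{y}}{dW} = y \cdot \frac{1}{\hat{y}} \cdot \hat{y}(1 - \hat{y})X = y(1 - \hat{y})X$$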

Step 5: Similarly, apply the chain rule (with y_hat = σ(XW)) to the second half:

$$\frac{d\left((1 - y) \log(1 - \hat{y})\right)}{dW}$$

In Steps 4 and 5 we used the derivative of the sigmoid function:

$$\frac{d\sigma(z)}{dz} = \sigma(z)\left(1 - \sigma(z)\right)$$

We get

Equation (2):

$$\frac{d\left((1 - y) \log(1 - \hat{y})\right)}{dW} = -\hat{y}(1 - y)X$$

Step 6: Substituting Equations (1) and (2) into the Step 3 expression gives

$$\frac{dL}{dW} = -\frac{1}{m}\left(y(1 - \hat{y}) - \hat{y}(1 - y)\right)X$$

Step 7: Expanding and canceling terms (y - y*y_hat - y_hat + y*y_hat = y - y_hat), we obtain the GRADIENT

$$\frac{dL}{dW} = -\frac{1}{m}(y - \hat{y})X$$
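
As a sanity check, the closed-form gradient can be compared against a numerical (finite-difference) gradient on a tiny made-up dataset; all values below are illustrative.

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def log_loss(W, X, y):
    y_hat = sigmoid(np.dot(X, W))
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Tiny illustrative dataset (X already includes a leading column of ones)
X = np.array([[1.0, 0.5, 1.2],
              [1.0, -0.3, 0.8],
              [1.0, 1.5, -0.4]])
y = np.array([1.0, 0.0, 1.0])
W = np.array([0.1, -0.2, 0.3])

# Closed-form gradient: dL/dW = -(1/m) * (y - y_hat) X
y_hat = sigmoid(np.dot(X, W))
analytic = -np.dot((y - y_hat), X) / X.shape[0]

# Central-difference approximation of the gradient
eps = 1e-6
numeric = np.array([
    (log_loss(W + eps * np.eye(3)[j], X, y) - log_loss(W - eps * np.eye(3)[j], X, y)) / (2 * eps)
    for j in range(3)
])

print(np.allclose(analytic, numeric, atol=1e-6))   # True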

Step 8: Substituting this gradient into the update rule W = W - (lr) * dL/dW gives the weight update

$$W = W + \frac{lr}{m}(y - \hat{y})X$$

IMPLEMENTING GRADIENT DESCENT FOR LOGISTIC REGRESSION

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient_descent_logistic_regression(X, y, learning_rate, epochs):
    # Add a column of 1s at the 0th position so that w0 acts as the
    # intercept, as shown in Step 2 of the derivation
    X = np.insert(X, 0, 1, axis=1)
    # Initialize the weights (one per column, including the intercept)
    weights = np.ones(X.shape[1])
    # Learning rate
    lr = learning_rate
    for i in range(epochs):
        # Step 1: y_hat = σ(XW), one predicted probability per example
        y_hat = sigmoid(np.dot(X, weights))
        # Step 8 update: W = W + (lr) * (y - y_hat) X / m
        weights = weights + lr * (np.dot((y - y_hat), X) / X.shape[0])
    # Return the coefficients and the intercept
    return weights[1:], weights[0]
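
A quick usage sketch, reusing the gradient_descent_logistic_regression and sigmoid functions defined above; the synthetic data, learning rate, and epoch count are illustrative choices.

# Synthetic binary-classification data (values are illustrative)
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 2 * X[:, 1] > 0).astype(float)   # labels from a made-up linear rule

coef, intercept = gradient_descent_logistic_regression(X, y, learning_rate=0.1, epochs=5000)
print("coefficients:", coef)
print("intercept:", intercept)

# Training accuracy of the learned model
X_aug = np.insert(X, 0, 1, axis=1)
preds = (sigmoid(np.dot(X_aug, np.concatenate(([intercept], coef)))) >= 0.5).astype(float)
print("training accuracy:", (preds == y).mean())   # should be high on this separable data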

CONCLUSION

In conclusion, we explored the power of gradient descent in optimizing performance for logistic regression. By leveraging the gradient descent algorithm, we can iteratively update the weights and minimize the cost function to improve the accuracy of our logistic regression model. Understanding the cost function and its connection to the sigmoid activation function is crucial in this process. Through this blog, we've gained insights into the inner workings of logistic regression and how gradient descent enhances its predictive capabilities, ultimately enabling us to make more accurate and reliable predictions in binary classification tasks.

PRE-REQUISITES

Logistic regression

Cost function for linear regression

Hope you enjoyed learning !!!

Stay tuned !!!

Thank you !!!