Logistic Regression

Understand logistic regression from the sigmoid function to maximum likelihood estimation and cross-entropy loss, with an interactive demo.

Regression · Classification · GLM

From Linear to Logistic

In Simple Linear Regression, we modeled a continuous outcome y as a linear function of x. But what if y is binary, taking only the values 0 or 1?

For example: Will a patient develop a disease? Will a customer click an ad?

Linear regression can predict values outside [0, 1], which makes no sense for probabilities. We need a model that always outputs a value between 0 and 1.

This is exactly what logistic regression does — and it fits naturally into the GLM framework as a special case with a Bernoulli distribution and a logit link function.

The Setup

We have:

  • Features x = (x_1, x_2, \ldots, x_n)
  • A binary outcome y \in \{0, 1\}
  • A linear predictor z = w^\top x + b

The key question: how do we map z \in (-\infty, \infty) to a probability p \in (0, 1)?

The Sigmoid Function

The answer is the sigmoid (logistic) function:

P(y = 1 \mid x) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}

This function has elegant properties:

  • For any real z, the output is always between 0 and 1: 0 < \frac{e^z}{1+e^z} < 1
  • At z = 0, the probability is exactly 0.5
  • As z \to +\infty, the probability approaches 1
  • As z \to -\infty, the probability approaches 0

The sigmoid smoothly “squashes” the entire real line into the interval (0, 1), giving us a valid probability.
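To make this concrete, here is a minimal NumPy sketch of the sigmoid (an illustration, not code from the demo). The case split avoids overflow in exp for large |z|:

```python
import numpy as np

def sigmoid(z):
    """Numerically stable sigmoid: e^z / (1 + e^z) = 1 / (1 + e^(-z))."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))  # exp argument <= 0: no overflow
    ez = np.exp(z[~pos])                      # here z < 0, so e^z <= 1: safe
    out[~pos] = ez / (1.0 + ez)
    return out

print(sigmoid([-10, 0, 10]))  # approx [0.0000454, 0.5, 0.9999546]
```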

Maximum Likelihood Estimation

How do we find the best parameters ww and bb? We use maximum likelihood estimation (MLE) — find the parameters that make the observed data most probable.

For a single data point (x_i, y_i), the likelihood is:

P(y_i \mid x_i) = p_i^{\,y_i} (1 - p_i)^{1 - y_i}

where p_i = \frac{e^{z_i}}{1 + e^{z_i}} and z_i = w^\top x_i + b.

For the full dataset, the likelihood is the product over all observations:

L = \prod_{i=1}^{n} p_i^{\,y_i} (1 - p_i)^{1 - y_i}

Taking the log gives us the log-likelihood:

\log L = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right]
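As a check on the algebra, a short sketch (the function name and clipping threshold are illustrative) that evaluates this log-likelihood for a candidate (w, b):

```python
import numpy as np

def log_likelihood(w, b, X, y, eps=1e-12):
    """Bernoulli log-likelihood: sum_i [y_i log p_i + (1 - y_i) log(1 - p_i)]."""
    z = X @ w + b                    # linear predictor z_i = w^T x_i + b
    p = 1.0 / (1.0 + np.exp(-z))     # sigmoid gives p_i = P(y_i = 1 | x_i)
    p = np.clip(p, eps, 1.0 - eps)   # keep log() away from 0
    return np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
```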

Cross-Entropy Loss

Unlike OLS in linear regression, setting the derivatives of the log-likelihood to zero does not yield a closed-form solution: the resulting equations are nonlinear in w and b.

Instead, we minimize the cross-entropy loss, which is the negative log-likelihood:

\mathcal{L} = -\sum_{i=1}^{n} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]

This loss function penalizes confident wrong predictions heavily:

  • If y_i = 1 but the model predicts p_i \approx 0, the loss term -\log(p_i) becomes very large
  • If y_i = 0 but the model predicts p_i \approx 1, the loss term -\log(1 - p_i) becomes very large
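For instance, predicting p_i = 0.01 when y_i = 1 contributes -\log(0.01) \approx 4.6 to the loss, while a prediction of p_i = 0.9 contributes only -\log(0.9) \approx 0.11.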

We optimize this using iterative methods like gradient descent, where we repeatedly update with learning rate \alpha:

w \leftarrow w - \alpha \frac{\partial \mathcal{L}}{\partial w}, \qquad b \leftarrow b - \alpha \frac{\partial \mathcal{L}}{\partial b}
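For this loss, the chain rule gives the standard gradients \frac{\partial \mathcal{L}}{\partial w} = \sum_i (p_i - y_i)\, x_i and \frac{\partial \mathcal{L}}{\partial b} = \sum_i (p_i - y_i). A minimal sketch of the update loop (the learning rate and iteration count are arbitrary choices, not values from the demo):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=2000):
    """Fit w, b by gradient descent on the cross-entropy loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # current predictions p_i
        grad_w = X.T @ (p - y) / n              # dL/dw, averaged over the data
        grad_b = np.mean(p - y)                 # dL/db, averaged over the data
        w -= lr * grad_w                        # averaging keeps the step size
        b -= lr * grad_b                        # independent of the sample size
    return w, b

# Toy usage: points below x = 0 are class 0, points above are class 1.
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
print(w, b)  # w > 0: the probability of y = 1 increases with x
```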

Interactive Demo

Explore logistic regression hands-on. Adjust the weight and bias sliders to see the sigmoid curve change, or click Start Gradient Descent to watch the model learn optimal parameters automatically.

[Interactive chart: sigmoid curve P(y = 1 | x) over z = wx + b, with the decision boundary marked]

Click on the chart to add data points (top half → y=1, bottom half → y=0)

[Demo readouts: Iteration, Log Loss, Accuracy · Legend: y = 1, y = 0, Sigmoid, Decision Boundary]

Things to Try

  • Click on the chart to add new data points (top half = y=1, bottom half = y=0)
  • Toggle the “Cross-Entropy Loss” tab to see the loss landscape as a heatmap
    • Dark blue regions indicate low loss — the model fits the data well there
    • Bright green/yellow regions indicate high loss — poor fit
    • The red dot marks the current (w, b); gradient descent moves it toward the blue region
  • Start gradient descent and watch the red dot navigate toward the minimum on the loss surface
  • Add overlapping points (e.g., y=1 near x=−2) to see how the model handles noise

Connection to GLMs

Logistic regression is a Generalized Linear Model with:

Component          Choice
Distribution       Bernoulli
Link function      Logit: g(\mu) = \log\frac{\mu}{1-\mu}
Linear predictor   \eta = w^\top x + b
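Inverting the link shows why the sigmoid appears: for a Bernoulli outcome the mean is \mu = P(y = 1 \mid x), and setting \log\frac{\mu}{1-\mu} = \eta and solving for \mu gives \mu = \frac{e^\eta}{1 + e^\eta}, which is exactly the sigmoid from above.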

Once you classify observations, you can evaluate your model using Sensitivity, Specificity, and ROC curves.

Summary

Logistic regression transforms the unbounded linear predictor into a probability via the sigmoid function, then finds optimal parameters by minimizing cross-entropy loss through iterative optimization. It is one of the most fundamental classification models in statistics and machine learning — simple enough to interpret, yet powerful enough for real-world applications.
