Multiclass Logistic Regression

Extend logistic regression from binary to multiclass classification using the softmax function, cross-entropy loss, and gradient descent — with full derivation and interactive demo.

Regression · Classification · GLM

From Binary to Multiclass

In Logistic Regression, we modeled a binary outcome $y \in \{0, 1\}$ using the sigmoid function. But many real-world problems have more than two classes — for example, classifying handwritten digits (0–9) or categorizing species.

How do we extend logistic regression to handle $K$ classes?

$$y \in \{1, 2, \ldots, K\} \quad \longleftrightarrow \quad \text{multiclass logistic regression}$$

The Setup

For each class $k = 1, 2, \ldots, K$, we define a linear predictor:

$$z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is the feature vector, and each class $k$ has its own weight vector $\mathbf{w}_k$ and bias $b_k$.
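As a concrete illustration, the logits for one input can be computed in a few lines of NumPy; the particular weights, biases, and feature values below are assumptions chosen for the example, not values prescribed by the text:

```python
import numpy as np

# Illustrative parameters: K = 3 classes, n = 2 features.
W = np.array([[ 0.3,  0.3],
              [ 0.3, -0.3],
              [-0.3,  0.0]])   # shape (K, n): row k is the weight vector w_k
b = np.zeros(3)                # one bias b_k per class

x = np.array([0.5, -1.2])      # a single feature vector

z = W @ x + b                  # z_k = w_k^T x + b_k, one logit per class
print(z)                       # K logits, one per class
```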

The Softmax Function

Instead of the sigmoid, we use the softmax function to convert the $K$ logits into probabilities:

$$P(Y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

This guarantees two essential properties:

  1. Each probability is positive: $P(Y = k \mid \mathbf{x}) > 0$ for all $k$
  2. Probabilities sum to one: $\displaystyle\sum_{k=1}^{K} P(Y = k \mid \mathbf{x}) = \frac{\sum_j e^{z_j}}{\sum_j e^{z_j}} = 1$

The predicted class is the one with the highest probability:

$$\hat{y} = \underset{k}{\operatorname{argmax}} \; P(Y = k \mid \mathbf{x})$$

Notice that when $K = 2$, softmax reduces to the sigmoid function from binary logistic regression.
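A straightforward NumPy sketch of the softmax and the argmax prediction is shown below; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the resulting probabilities:

```python
import numpy as np

def softmax(z):
    """Convert a vector of K logits into K probabilities."""
    z = z - np.max(z)            # numerical stability; probabilities are unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])    # example logits for K = 3 classes
p = softmax(z)

print(p, p.sum())                # all entries positive, sum equals 1
print(np.argmax(p))              # predicted class: the index with highest probability
```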

One-Hot Encoding and Likelihood

To express the true label as a vector, we use one-hot encoding: $y_k = 1$ if class $k$ is correct, $y_k = 0$ otherwise. Let $p_k = P(Y = k \mid \mathbf{x})$.

The likelihood of observing the correct label is:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{k=1}^{K} p_k^{\,y_k}$$

Since exactly one $y_k$ equals $1$, this product picks out the probability of the true class. Taking the log:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_{k=1}^{K} y_k \log p_k$$

Cross-Entropy Loss

Negating the log-likelihood gives us the cross-entropy loss:

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k$$

This is the natural generalization of the binary cross-entropy loss from Logistic Regression. It heavily penalizes the model when the predicted probability for the true class is small.
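Because the label is one-hot, the loss in code reduces to the negative log of the predicted probability for the true class; a minimal sketch with illustrative numbers:

```python
import numpy as np

def cross_entropy(p, y_onehot):
    """L = -sum_k y_k log p_k; only the true class contributes."""
    return -np.sum(y_onehot * np.log(p))

p = np.array([0.7, 0.2, 0.1])    # predicted probabilities (sum to 1)
y = np.array([1.0, 0.0, 0.0])    # one-hot label: true class is the first one

print(cross_entropy(p, y))       # equals -log(0.7), about 0.357
```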

Deriving the Gradient

To optimize with gradient descent, we need $\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k}$ and $\frac{\partial \mathcal{L}}{\partial b_k}$. We use the chain rule:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_{i=1}^{K} \frac{\partial \mathcal{L}}{\partial p_i} \cdot \frac{\partial p_i}{\partial z_k}$$

Step 1 — Loss w.r.t. probabilities:

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\frac{y_i}{p_i} \qquad (i = 1, 2, \ldots, K)$$

Step 2 — Softmax derivative (the tricky part!):

The softmax derivative requires care because both the numerator and denominator depend on $z_k$. Using the Kronecker delta $\delta_{ik}$ (where $\delta_{ik} = 1$ if $i = k$, else $0$):

$$\frac{\partial p_i}{\partial z_k} = \frac{\delta_{ik} \, e^{z_i} \sum_j e^{z_j} - e^{z_i} \, e^{z_k}}{\left(\sum_j e^{z_j}\right)^2} = p_i(\delta_{ik} - p_k)$$
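If you want to double-check this identity without redoing the algebra, a finite-difference comparison against $p_i(\delta_{ik} - p_k)$ works well; the logits below are arbitrary and the check is only a numerical sanity test, not part of the derivation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])   # arbitrary logits
p = softmax(z)

# Analytic Jacobian: J[i, k] = p_i * (delta_ik - p_k)
J_analytic = np.diag(p) - np.outer(p, p)

# Central finite differences in each z_k
eps = 1e-6
J_numeric = np.zeros((3, 3))
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    J_numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # tiny, on the order of 1e-10
```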

Step 3 — Combining via the chain rule:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_{i=1}^{K} \left(-\frac{y_i}{p_i}\right) \cdot p_i(\delta_{ik} - p_k) = -\sum_{i=1}^{K} y_i(\delta_{ik} - p_k)$$

Since $\sum_i y_i = 1$ and $\delta_{ik}$ picks out the $k$-th term:

$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k$$

This is remarkably simple and elegant — the gradient is just the difference between the predicted probability and the true label!

Step 4 — Finally, since $\frac{\partial z_k}{\partial \mathbf{w}_k} = \mathbf{x}$ and $\frac{\partial z_k}{\partial b_k} = 1$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \mathbf{x}\,(p_k - y_k), \qquad \frac{\partial \mathcal{L}}{\partial b_k} = p_k - y_k$$
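In vectorized form, these per-example gradients are one line each; the sketch assumes the $K$ weight vectors are stacked as rows of a $K \times n$ matrix and uses illustrative values for $\mathbf{p}$, $\mathbf{y}$, and $\mathbf{x}$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # softmax output, length K
y = np.array([1.0, 0.0, 0.0])      # one-hot label, length K
x = np.array([0.5, -1.2])          # feature vector, length n

grad_z = p - y                     # dL/dz_k = p_k - y_k
grad_W = np.outer(grad_z, x)       # dL/dw_k = x (p_k - y_k), row k of a (K, n) matrix
grad_b = grad_z                    # dL/db_k = p_k - y_k

print(grad_W)
print(grad_b)
```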

Gradient Descent Update

The parameters are updated iteratively:

$$\mathbf{w}_k \leftarrow \mathbf{w}_k - \eta \, \mathbf{x}(p_k - y_k)$$
$$b_k \leftarrow b_k - \eta \,(p_k - y_k)$$

where $\eta$ is the learning rate. Compare this to the gradient in binary logistic regression — the structure is identical, just extended to $K$ classes.
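A minimal batch gradient-descent loop built from these updates might look like the sketch below; the synthetic clusters, learning rate, and iteration count are assumptions for illustration and do not correspond to the interactive demo's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points in 2D, K = 3 classes clustered around different centers.
K, n, N = 3, 2, 150
centers = np.array([[2.0, 0.0], [-1.0, 2.0], [-1.0, -2.0]])
labels = rng.integers(0, K, size=N)
X = centers[labels] + rng.normal(scale=0.8, size=(N, n))
Y = np.eye(K)[labels]                      # one-hot labels, shape (N, K)

W = np.zeros((K, n))
b = np.zeros(K)
eta = 0.1

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

for it in range(500):
    P = softmax_rows(X @ W.T + b)          # predicted probabilities, shape (N, K)
    grad_Z = (P - Y) / N                   # average of p - y over the batch
    W -= eta * grad_Z.T @ X                # dL/dW, shape (K, n)
    b -= eta * grad_Z.sum(axis=0)          # dL/db, shape (K,)

loss = -np.mean(np.log(P[np.arange(N), labels]))
acc = np.mean(P.argmax(axis=1) == labels)
print(f"loss={loss:.4f}  accuracy={acc:.1%}")
```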

Interactive Demo

Explore multiclass logistic regression with 3 classes in 2D. The Decision Regions tab shows how the model partitions the input space, and the Softmax Probabilities tab visualizes $P(Y = k \mid \mathbf{x})$ for each class.

[Interactive demo: a plot over $x_1, x_2 \in [-3, 3]$ with live readouts of the iteration count, cross-entropy loss, and accuracy, plus a table of the learned parameters ($w_1$, $w_2$, $b$) for each of the three classes.]

How It Works

1. Logit for each class $k$: $z_k = w_{k,1} x_1 + w_{k,2} x_2 + b_k$
2. Softmax converts logits to probabilities: $P(Y = k \mid \mathbf{x}) = e^{z_k} \big/ \sum_j e^{z_j}$
3. Cross-entropy loss (minimized by gradient descent): $\mathcal{L} = -\frac{1}{N} \sum_i \log P\big(Y = y^{(i)} \mid \mathbf{x}^{(i)}\big)$

Things to Try

  • Start Gradient Descent and watch the decision boundaries form in real time
  • Hover over data points to see the softmax probability breakdown for each class
  • Switch to the Softmax Probabilities tab to see individual class probability heatmaps — notice how they always sum to 1
  • Add overlapping clusters to see how the model handles ambiguous regions

Connection to Binary Logistic Regression

When $K = 2$, the softmax function simplifies:

$$P(Y = 1 \mid \mathbf{x}) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}} = \sigma(z_1 - z_0)$$

This is exactly the sigmoid function from Logistic Regression with an effective weight $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_0$.
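A quick numerical check of this equivalence, with two arbitrarily chosen logits:

```python
import numpy as np

z0, z1 = 0.4, 1.7                                    # two arbitrary logits

p1_softmax = np.exp(z1) / (np.exp(z0) + np.exp(z1))  # softmax probability of class 1
p1_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))        # sigmoid of the logit difference

print(p1_softmax, p1_sigmoid)                        # identical values
```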

Connection to GLMs

Like binary logistic regression, multiclass logistic regression fits into the GLM framework:

  • Distribution: Categorical (Multinoulli)
  • Link function: Softmax (generalized logit)
  • Linear predictor: $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$ for each class

After classification, evaluate your model’s per-class performance using metrics like Sensitivity, Specificity, and ROC curves (applied in a one-vs-rest fashion).

Summary

Multiclass logistic regression extends binary classification to $K$ classes by replacing the sigmoid with the softmax function. The cross-entropy loss generalizes naturally, and the gradient $p_k - y_k$ has the same elegant form as the binary case. The softmax derivative is the trickiest part of the derivation, requiring careful application of the quotient rule and Kronecker delta — but the final result is clean and computationally efficient.
