Multiclass Logistic Regression

Extend logistic regression from binary to multiclass classification using the softmax function, cross-entropy loss, and gradient descent — with full derivation and interactive demo.

Regression · Classification · GLM

From Binary to Multiclass

In Logistic Regression, we modeled a binary outcome $y \in \{0, 1\}$ using the sigmoid function. But many real-world problems have more than two classes — for example, classifying handwritten digits (0–9) or categorizing species.

How do we extend logistic regression to handle $K$ classes?

$$y \in \{1, 2, \ldots, K\} \quad \longleftrightarrow \quad \text{multiclass logistic regression}$$

The Setup

For each class $k = 1, 2, \ldots, K$, we define a linear predictor:

$$z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$ is the feature vector, and each class $k$ has its own weight vector $\mathbf{w}_k$ and bias $b_k$.
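As a concrete illustration, the logits for one input can be computed in a few lines of NumPy; the particular weights, biases, and feature values below are assumptions chosen for the example, not values prescribed by the text:

```python
import numpy as np

# Illustrative parameters: K = 3 classes, n = 2 features.
W = np.array([[ 0.3,  0.3],
              [ 0.3, -0.3],
              [-0.3,  0.0]])   # shape (K, n): row k is the weight vector w_k
b = np.zeros(3)                # one bias b_k per class

x = np.array([0.5, -1.2])      # a single feature vector

z = W @ x + b                  # z_k = w_k^T x + b_k, one logit per class
print(z)                       # K logits, one per class
```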

The Softmax Function

Instead of the sigmoid, we use the softmax function to convert the $K$ logits into probabilities:

$$P(Y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$

This guarantees two essential properties:

  1. Each probability is positive: $P(Y = k \mid \mathbf{x}) > 0$ for all $k$
  2. Probabilities sum to one: $\displaystyle\sum_{k=1}^{K} P(Y = k \mid \mathbf{x}) = \frac{\sum_j e^{z_j}}{\sum_j e^{z_j}} = 1$

The predicted class is the one with the highest probability:

$$\hat{y} = \underset{k}{\operatorname{argmax}} \; P(Y = k \mid \mathbf{x})$$

Notice that when $K = 2$, softmax reduces to the sigmoid function from binary logistic regression.
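A straightforward NumPy sketch of the softmax and the argmax prediction is shown below; subtracting the maximum logit before exponentiating is a common numerical-stability trick and does not change the resulting probabilities:

```python
import numpy as np

def softmax(z):
    """Convert a vector of K logits into K probabilities."""
    z = z - np.max(z)            # numerical stability; probabilities are unchanged
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

z = np.array([2.0, 1.0, 0.1])    # example logits for K = 3 classes
p = softmax(z)

print(p, p.sum())                # all entries positive, sum equals 1
print(np.argmax(p))              # predicted class: the index with highest probability
```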

One-Hot Encoding and Likelihood

To express the true label as a vector, we use one-hot encoding: $y_k = 1$ if class $k$ is correct, $y_k = 0$ otherwise. Let $p_k = P(Y = k \mid \mathbf{x})$.

The likelihood of observing the correct label is:

$$P(\mathbf{y} \mid \mathbf{x}) = \prod_{k=1}^{K} p_k^{\,y_k}$$

Since exactly one $y_k$ equals $1$, this product picks out the probability of the true class. Taking the log:

$$\log P(\mathbf{y} \mid \mathbf{x}) = \sum_{k=1}^{K} y_k \log p_k$$

Cross-Entropy Loss

Negating the log-likelihood gives us the cross-entropy loss:

$$\mathcal{L} = -\sum_{k=1}^{K} y_k \log p_k$$

This is the natural generalization of the binary cross-entropy loss from Logistic Regression. It heavily penalizes the model when the predicted probability for the true class is small.
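Because the label is one-hot, the loss in code reduces to the negative log of the predicted probability for the true class; a minimal sketch with illustrative numbers:

```python
import numpy as np

def cross_entropy(p, y_onehot):
    """L = -sum_k y_k log p_k; only the true class contributes."""
    return -np.sum(y_onehot * np.log(p))

p = np.array([0.7, 0.2, 0.1])    # predicted probabilities (sum to 1)
y = np.array([1.0, 0.0, 0.0])    # one-hot label: true class is the first one

print(cross_entropy(p, y))       # equals -log(0.7), about 0.357
```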

Deriving the Gradient

To optimize with gradient descent, we need $\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k}$ and $\frac{\partial \mathcal{L}}{\partial b_k}$. We use the chain rule:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_{i=1}^{K} \frac{\partial \mathcal{L}}{\partial p_i} \cdot \frac{\partial p_i}{\partial z_k}$$

Step 1 — Loss w.r.t. probabilities:

$$\frac{\partial \mathcal{L}}{\partial p_i} = -\frac{y_i}{p_i} \qquad (i = 1, 2, \ldots, K)$$

Step 2 — Softmax derivative (the tricky part!):

The softmax derivative requires care because both the numerator and denominator depend on $z_k$. Using the Kronecker delta $\delta_{ik}$ (where $\delta_{ik} = 1$ if $i = k$, else $0$):

$$\frac{\partial p_i}{\partial z_k} = \frac{\delta_{ik} \, e^{z_i} \sum_j e^{z_j} - e^{z_i} \, e^{z_k}}{\left(\sum_j e^{z_j}\right)^2} = p_i(\delta_{ik} - p_k)$$
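If you want to double-check this identity without redoing the algebra, a finite-difference comparison against $p_i(\delta_{ik} - p_k)$ works well; the logits below are arbitrary and the check is only a numerical sanity test, not part of the derivation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])   # arbitrary logits
p = softmax(z)

# Analytic Jacobian: J[i, k] = p_i * (delta_ik - p_k)
J_analytic = np.diag(p) - np.outer(p, p)

# Central finite differences in each z_k
eps = 1e-6
J_numeric = np.zeros((3, 3))
for k in range(3):
    dz = np.zeros(3)
    dz[k] = eps
    J_numeric[:, k] = (softmax(z + dz) - softmax(z - dz)) / (2 * eps)

print(np.max(np.abs(J_analytic - J_numeric)))   # tiny, on the order of 1e-10
```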

Step 3 — Combining via the chain rule:

$$\frac{\partial \mathcal{L}}{\partial z_k} = \sum_{i=1}^{K} \left(-\frac{y_i}{p_i}\right) \cdot p_i(\delta_{ik} - p_k) = -\sum_{i=1}^{K} y_i(\delta_{ik} - p_k)$$

Since $\sum_i y_i = 1$ and $\delta_{ik}$ picks out the $k$-th term:

$$\frac{\partial \mathcal{L}}{\partial z_k} = p_k - y_k$$

This is remarkably simple and elegant — the gradient is just the difference between the predicted probability and the true label!

Step 4 — Finally, since $\frac{\partial z_k}{\partial \mathbf{w}_k} = \mathbf{x}$ and $\frac{\partial z_k}{\partial b_k} = 1$:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{w}_k} = \mathbf{x}\,(p_k - y_k), \qquad \frac{\partial \mathcal{L}}{\partial b_k} = p_k - y_k$$
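In vectorized form, these per-example gradients are one line each; the sketch assumes the $K$ weight vectors are stacked as rows of a $K \times n$ matrix and uses illustrative values for $\mathbf{p}$, $\mathbf{y}$, and $\mathbf{x}$:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])      # softmax output, length K
y = np.array([1.0, 0.0, 0.0])      # one-hot label, length K
x = np.array([0.5, -1.2])          # feature vector, length n

grad_z = p - y                     # dL/dz_k = p_k - y_k
grad_W = np.outer(grad_z, x)       # dL/dw_k = x (p_k - y_k), row k of a (K, n) matrix
grad_b = grad_z                    # dL/db_k = p_k - y_k

print(grad_W)
print(grad_b)
```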

Gradient Descent Update

The parameters are updated iteratively:

$$\mathbf{w}_k \leftarrow \mathbf{w}_k - \eta \, \mathbf{x}(p_k - y_k)$$
$$b_k \leftarrow b_k - \eta \,(p_k - y_k)$$

where $\eta$ is the learning rate. Compare this to the gradient in binary logistic regression — the structure is identical, just extended to $K$ classes.
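A minimal batch gradient-descent loop built from these updates might look like the sketch below; the synthetic clusters, learning rate, and iteration count are assumptions for illustration and do not correspond to the interactive demo's data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N points in 2D, K = 3 classes clustered around different centers.
K, n, N = 3, 2, 150
centers = np.array([[2.0, 0.0], [-1.0, 2.0], [-1.0, -2.0]])
labels = rng.integers(0, K, size=N)
X = centers[labels] + rng.normal(scale=0.8, size=(N, n))
Y = np.eye(K)[labels]                      # one-hot labels, shape (N, K)

W = np.zeros((K, n))
b = np.zeros(K)
eta = 0.1

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

for it in range(500):
    P = softmax_rows(X @ W.T + b)          # predicted probabilities, shape (N, K)
    grad_Z = (P - Y) / N                   # average of p - y over the batch
    W -= eta * grad_Z.T @ X                # dL/dW, shape (K, n)
    b -= eta * grad_Z.sum(axis=0)          # dL/db, shape (K,)

loss = -np.mean(np.log(P[np.arange(N), labels]))
acc = np.mean(P.argmax(axis=1) == labels)
print(f"loss={loss:.4f}  accuracy={acc:.1%}")
```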

Interactive Demo

Explore multiclass logistic regression with 3 classes in 2D. The Decision Regions tab shows how the model partitions the input space, and the Softmax Probabilities tab visualizes $P(Y = k \mid \mathbf{x})$ for each class.

[Interactive demo: a plot over $x_1, x_2 \in [-3, 3]$ with live readouts of the iteration count, cross-entropy loss, and accuracy, plus a table of the learned parameters ($w_1$, $w_2$, $b$) for each of the three classes.]

How It Works

1. Logit for each class $k$: $z_k = w_{k,1} x_1 + w_{k,2} x_2 + b_k$
2. Softmax converts logits to probabilities: $P(Y = k \mid \mathbf{x}) = e^{z_k} \big/ \sum_j e^{z_j}$
3. Cross-entropy loss (minimized by gradient descent): $\mathcal{L} = -\frac{1}{N} \sum_i \log P\big(Y = y^{(i)} \mid \mathbf{x}^{(i)}\big)$

Things to Try

  • Start Gradient Descent and watch the decision boundaries form in real time
  • Hover over data points to see the softmax probability breakdown for each class
  • Switch to the Softmax Probabilities tab to see individual class probability heatmaps — notice how they always sum to 1
  • Add overlapping clusters to see how the model handles ambiguous regions

Connection to Binary Logistic Regression

When $K = 2$, the softmax function simplifies:

$$P(Y = 1 \mid \mathbf{x}) = \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \frac{1}{1 + e^{-(z_1 - z_0)}} = \sigma(z_1 - z_0)$$

This is exactly the sigmoid function from Logistic Regression with an effective weight $\mathbf{w} = \mathbf{w}_1 - \mathbf{w}_0$.
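A quick numerical check of this equivalence, with two arbitrarily chosen logits:

```python
import numpy as np

z0, z1 = 0.4, 1.7                                    # two arbitrary logits

p1_softmax = np.exp(z1) / (np.exp(z0) + np.exp(z1))  # softmax probability of class 1
p1_sigmoid = 1.0 / (1.0 + np.exp(-(z1 - z0)))        # sigmoid of the logit difference

print(p1_softmax, p1_sigmoid)                        # identical values
```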

Connection to GLMs

Like binary logistic regression, multiclass logistic regression fits into the GLM framework:

  • Distribution: Categorical (Multinoulli)
  • Link function: Softmax (generalized logit)
  • Linear predictor: $z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$ for each class

After classification, evaluate your model’s per-class performance using metrics like Sensitivity, Specificity, and ROC curves (applied in a one-vs-rest fashion).

Summary

Multiclass logistic regression extends binary classification to $K$ classes by replacing the sigmoid with the softmax function. The cross-entropy loss generalizes naturally, and the gradient $p_k - y_k$ has the same elegant form as the binary case. The softmax derivative is the trickiest part of the derivation, requiring careful application of the quotient rule and Kronecker delta — but the final result is clean and computationally efficient.
