Principal Component Analysis (2D)

From centering the data to deriving Var(z) = w^T S w and solving the eigenvalue problem, this article explains PCA step-by-step using the maximum variance approach.

Tags: PCA, Linear Algebra, Eigenvalues, Eigenvectors

1. Introduction

Principal Component Analysis (PCA) can be understood as finding the direction in which the data has the largest variance.
Here we explain PCA in 2D, starting from raw data and ending at the eigenvalue problem — with a step-by-step derivation of the key formula:

\mathrm{Var}(z) = \mathbf{w}^\top \mathbf{S} \mathbf{w}.

2. Step 0: Data Centering

Given n observations of two variables X and Y, we first center each variable so that its mean is zero:

  • Column means:
    \bar{x} = \frac{1}{n} \sum_{i=1}^n X_i,\quad \bar{y} = \frac{1}{n} \sum_{i=1}^n Y_i
  • Centering:
    X_i' = X_i - \bar{x},\quad Y_i' = Y_i - \bar{y}
  • Let \mathbf{x} = (X', Y')^\top be the centered random vector.
    Then E[\mathbf{x}] = \mathbf{0}.
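A minimal numpy sketch of this centering step, using a hypothetical synthetic dataset (the numbers below are illustrative, not from the article):

```python
import numpy as np

# Hypothetical correlated 2D dataset: n observations, columns X and Y.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([3.0, -1.0], [[2.0, 1.2], [1.2, 1.0]], size=200)

means = data.mean(axis=0)        # column means (xbar, ybar)
centered = data - means          # X' = X - xbar, Y' = Y - ybar
print(centered.mean(axis=0))     # approximately [0, 0] after centering
```

The later snippets in this article continue from this `centered` array.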

3. Step 1: Covariance Matrix

From the centered data, we compute the covariance matrix:

\mathbf{S} = E[\mathbf{x}\mathbf{x}^\top] = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{pmatrix}

where:

  • \sigma_{xx} = \mathrm{Var}(X')
  • \sigma_{yy} = \mathrm{Var}(Y')
  • \sigma_{xy} = \mathrm{Cov}(X', Y')
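Continuing the running example from Step 0, the covariance matrix can be formed directly from the centered columns. The 1/n normalization below matches the expectation-style definition above; `np.cov` with `bias=True` uses the same convention:

```python
# `centered` is the mean-centered (n, 2) array from the Step 0 sketch.
n = centered.shape[0]
S = centered.T @ centered / n    # 2x2 covariance matrix, 1/n normalization

# Cross-check against numpy's estimator (bias=True also divides by n).
print(np.allclose(S, np.cov(centered, rowvar=False, bias=True)))   # True
print(S)                         # [[σxx, σxy], [σxy, σyy]]
```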

4. Step 2: Defining the Maximum Variance Direction

Let \mathbf{w} \in \mathbb{R}^2 be a unit vector (\|\mathbf{w}\| = 1) representing a direction.
The projection of \mathbf{x} onto this direction is:

z = \mathbf{w}^\top \mathbf{x}

The first principal component is the direction \mathbf{w} that maximizes the variance of z:

\max_{\|\mathbf{w}\|=1} \ \mathrm{Var}(z).
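To make the objective concrete, the sketch below (still using `centered` from the running example) projects the data onto a few arbitrary unit directions and prints the variance of each projection; the direction with the largest value is the best candidate so far:

```python
# Variance of the projection z = w^T x for several candidate unit directions.
for angle in np.deg2rad([0.0, 30.0, 60.0, 90.0]):
    w = np.array([np.cos(angle), np.sin(angle)])   # unit vector, ||w|| = 1
    z = centered @ w                               # projected scalars z_i = w^T x_i
    print(f"{np.rad2deg(angle):5.1f} deg  Var(z) = {z.var():.3f}")
```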

5. Detailed Derivation: Why \mathrm{Var}(z) = \mathbf{w}^\top \mathbf{S} \mathbf{w}

5.1 Setup

  • Centered vector:
    \mathbf{x} = \begin{pmatrix} X' \\ Y' \end{pmatrix}, \quad E[\mathbf{x}] = \mathbf{0}
  • Covariance matrix:
    \mathbf{S} = E[\mathbf{x}\mathbf{x}^\top] = \begin{pmatrix} \sigma_{xx} & \sigma_{xy} \\ \sigma_{xy} & \sigma_{yy} \end{pmatrix}
  • Projection direction:
    \mathbf{w} = (w_1, w_2)^\top
  • Projected scalar:
    z = \mathbf{w}^\top \mathbf{x} = w_1 X' + w_2 Y'

5.2 Variance Definition

By definition:

\mathrm{Var}(z) = E[z^2] - (E[z])^2

Since the data is centered:

E[z] = \mathbf{w}^\top E[\mathbf{x}] = 0

Thus:

\mathrm{Var}(z) = E[z^2]

5.3 Expanding E[z^2]

First:

z^2 = (\mathbf{w}^\top \mathbf{x})^2 = (\mathbf{w}^\top \mathbf{x})(\mathbf{w}^\top \mathbf{x})

Because \mathbf{w}^\top \mathbf{x} is a scalar, it equals its own transpose \mathbf{x}^\top \mathbf{w}, so we can regroup:

(\mathbf{w}^\top \mathbf{x})(\mathbf{w}^\top \mathbf{x}) = (\mathbf{w}^\top \mathbf{x})(\mathbf{x}^\top \mathbf{w}) = \mathbf{w}^\top (\mathbf{x}\mathbf{x}^\top) \mathbf{w}

5.4 Bringing constants outside the expectation

Since w\mathbf{w} is constant with respect to the expectation:

\mathrm{Var}(z) = E[\mathbf{w}^\top \mathbf{x} \mathbf{x}^\top \mathbf{w}] = \mathbf{w}^\top E[\mathbf{x} \mathbf{x}^\top] \mathbf{w}

5.5 Recognizing the covariance matrix

By definition of S\mathbf{S}:

E[\mathbf{x} \mathbf{x}^\top] = \mathbf{S}

So we obtain:

\boxed{\mathrm{Var}(z) = \mathbf{w}^\top \mathbf{S} \mathbf{w}}

5.6 Component form for intuition

If \mathbf{w} = (w_1, w_2)^\top, then:

\mathbf{w}^\top \mathbf{S} \mathbf{w} = w_1^2\,\sigma_{xx} + 2 w_1 w_2\,\sigma_{xy} + w_2^2\,\sigma_{yy}

This shows the variance is a quadratic form combining variances and covariance, weighted by direction coefficients.
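As a quick numerical check of this identity (reusing `centered` and `S` from the running example), the empirical variance of the projection should agree with both the matrix form and the component form:

```python
# Check Var(z) = w^T S w and its component form for one arbitrary unit direction.
w = np.array([0.6, 0.8])                  # ||w||^2 = 0.36 + 0.64 = 1
z = centered @ w

quad = w @ S @ w
comp = w[0]**2 * S[0, 0] + 2 * w[0] * w[1] * S[0, 1] + w[1]**2 * S[1, 1]

print(np.allclose(z.var(), quad))         # True (both use 1/n normalization)
print(np.allclose(quad, comp))            # True
```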

6. Step 3: Solving via Lagrange Multipliers

We now solve:

\max_{\mathbf{w}} \ \mathbf{w}^\top \mathbf{S} \mathbf{w} \quad\text{s.t.}\quad \mathbf{w}^\top \mathbf{w} = 1

Lagrangian:

\mathcal{L}(\mathbf{w}, \lambda) = \mathbf{w}^\top \mathbf{S} \mathbf{w} - \lambda (\mathbf{w}^\top \mathbf{w} - 1)

Differentiating with respect to \mathbf{w} and setting the gradient to zero:

\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = 2\mathbf{S}\mathbf{w} - 2\lambda\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad \mathbf{S}\mathbf{w} = \lambda \mathbf{w}

We have reduced PCA to an eigenvalue problem.
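In code, this eigenvalue problem is solved directly; a short sketch with numpy's `eigh` (appropriate because the covariance matrix \mathbf{S} is symmetric), continuing the running example:

```python
# Solve S w = λ w for the symmetric covariance matrix S.
eigvals, eigvecs = np.linalg.eigh(S)       # ascending eigenvalues, orthonormal columns

order = np.argsort(eigvals)[::-1]          # reorder so that λ1 >= λ2
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

w1 = eigvecs[:, 0]                         # first principal component direction
print(np.allclose(S @ w1, eigvals[0] * w1))           # True: S w1 = λ1 w1
print(np.isclose(eigvals[0], (centered @ w1).var()))  # True: λ1 is the variance along w1
```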

7. Step 4: Eigenvalues and Principal Components — Meaning and Interpretation

From the Lagrange multiplier method, we obtained the eigenvalue equation:

\mathbf{S}\mathbf{w} = \lambda \mathbf{w}.

This tells us two things:

7.1 Eigenvectors = Principal Component Directions

  • Each eigenvector \mathbf{w}_k of the covariance matrix \mathbf{S} points in a direction in the data space.
  • Geometrically, if you draw an arrow in the direction of \mathbf{w}_k, it shows how you would “look” at the data to see a certain pattern of variation.
  • In PCA, these directions are orthogonal (perpendicular) to each other — they define a new coordinate system aligned with the data’s natural spread.

7.2 Eigenvalues = Variance Along Those Directions

  • The corresponding eigenvalue \lambda_k tells you how much variance the data has when projected onto \mathbf{w}_k.
  • If \lambda_k is large, it means the data is very spread out in that direction.
  • If \lambda_k is small, the data is tightly clustered along that direction.

7.3 Ordering by Variance

  • Sort the eigenvalues in descending order: \lambda_1 \ge \lambda_2 \ge \dots
  • The eigenvector \mathbf{w}_1 associated with the largest eigenvalue \lambda_1 is the first principal component: the direction of maximum variance in the data.
  • \mathbf{w}_2 (second principal component) is orthogonal to \mathbf{w}_1 and corresponds to the second-largest variance \lambda_2.
  • This continues for higher dimensions, ensuring each new axis is perpendicular to all previous ones.

7.4 Why This Matters in PCA

  • By keeping only the first few principal components (largest eigenvalues), we retain most of the variance while reducing dimensionality.
  • In 2D, the first principal component often captures the “main trend” of the data, while the second captures the orthogonal “secondary trend.”
  • This interpretation is the bridge between the geometry (rotation of coordinate axes) and the statistics (variance explained).
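A small continuation of the sketch shows the variance-explained bookkeeping and a 2D → 1D reduction that keeps only PC1 (using `eigvals`, `eigvecs`, and `centered` from above):

```python
# Share of total variance carried by each principal component.
explained = eigvals / eigvals.sum()
print(explained)                          # PC1 carries most of the variance here

# Dimensionality reduction: keep only the PC1 scores.
scores_pc1 = centered @ eigvecs[:, 0]     # shape (n,)
print(np.isclose(scores_pc1.var(), eigvals[0]))   # True
```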

8. Step-by-Step Interactive Demo

The interactive demo, “Step-by-Step PCA in 2D”, starts from a raw 2D dataset with correlated variables and animates the following stages:
  1. Raw Data — Show original scatter plot.
  2. Centering — Animate subtraction of means so the centroid is at the origin.
  3. Covariance Matrix — Display \mathbf{S} and explain its entries.
  4. Search — Find the eigenvectors that maximize the variance of the projected data.
  5. Principal Components — Display both PC1 and PC2 on the scatter, with their variances.
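For reference, the whole pipeline that the demo animates fits in a few lines; a self-contained sketch with a hypothetical synthetic dataset:

```python
import numpy as np

def pca_2d(data):
    """Minimal 2D PCA following the demo: center, covariance, eigen-decompose, sort."""
    centered = data - data.mean(axis=0)                 # centering
    S = np.cov(centered, rowvar=False, bias=True)       # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)                # eigenvalue problem S w = λ w
    order = np.argsort(eigvals)[::-1]                   # sort by variance, descending
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(1)
demo_data = rng.multivariate_normal([0, 0], [[2.0, 1.2], [1.2, 1.0]], size=300)
variances, directions = pca_2d(demo_data)
print(variances)    # variance along PC1 and PC2
print(directions)   # columns are the PC1 and PC2 directions
```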

9. Key Takeaways

  • PCA can be seen as variance maximization.
  • The variance of a projection is the quadratic form \mathbf{w}^\top \mathbf{S} \mathbf{w}.
  • Solving the maximization with a unit-length constraint leads to the eigenvalue problem.
  • Eigenvectors = PC directions, eigenvalues = variances along them.