Generalized Linear Models (GLM)
A step-by-step introduction to Generalized Linear Models (GLMs) starting from basic linear regression, explaining distributions, link functions, and model construction with an interactive tool.
Introduction to Generalized Linear Models (GLMs) Starting from Simple Regression
Introduction
Linear regression is a fundamental statistical model used to predict continuous outcomes. But what if your target variable is binary (like yes/no), or a count (like number of events)? Standard linear regression may fail: it can predict values outside the feasible range—like negative counts or probabilities over 1.
This is where Generalized Linear Models (GLMs) come in. GLMs extend linear regression by incorporating different distributions and transformations, making them applicable to a wide range of data types. In this article, we’ll build a GLM step by step, starting from the simple linear regression model you may already know.
1. A Quick Refresher: Simple Linear Regression
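A typical simple linear regression model is written as:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
This model assumes:
- The outcome $y$ is a continuous variable,
- The error term $\varepsilon$ follows a normal distribution, $\varepsilon \sim N(0, \sigma^2)$,
- The mean of $y$ is described by a linear function of $x$: $E[y] = \beta_0 + \beta_1 x$.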
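As a point of reference, a one-predictor linear model of this kind can be fit in closed form by ordinary least squares. The sketch below is a minimal illustration, not a production routine: the function name `fit_simple_ols` and the toy data are assumptions made for this example.

```python
def fit_simple_ols(xs, ys):
    """Return (beta0, beta1) minimizing squared error for y = beta0 + beta1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: sum of cross-deviations of (x, y) divided by sum of squared x-deviations.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    beta1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    beta0 = y_mean - beta1 * x_mean
    return beta0, beta1


# Hypothetical data lying exactly on y = 1 + 2x (no noise), so the fit
# should recover an intercept of 1 and a slope of 2.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
beta0, beta1 = fit_simple_ols(xs, ys)
print(beta0, beta1)
```

With noisy data the same formulas give the least-squares line. Under the normal-error assumption this is also the maximum likelihood estimate.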
This setup works well when $y$ behaves like a normal variable. But what if it doesn’t?
2. When Linear Regression Falls Short
Real-world data often violates the assumptions of linear regression. For example:
- Binary outcome: Will a customer click an ad? ($y \in \{0, 1\}$)
- Count data: How many accidents happen per day? ($y \in \{0, 1, 2, \dots\}$)
Linear regression can produce invalid predictions in these cases:
- Probabilities outside [0, 1],
- Negative counts,
- Non-constant variance and non-normal residuals.
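The first of these failures is easy to reproduce. The following sketch fits an ordinary least-squares line to a small binary data set and evaluates it at the ends of the observed range. The data and the helper name `fit_line` are hypothetical, chosen only to make the effect visible.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x with a single predictor."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_mean - b1 * x_mean, b1


# Hypothetical click data: x is some exposure score, y is 1 for a click.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

b0, b1 = fit_line(xs, ys)
pred_low = b0 + b1 * 0    # fitted "probability" at the smallest x
pred_high = b0 + b1 * 7   # fitted "probability" at the largest x
print(pred_low, pred_high)
```

On this toy data the fitted value at the smallest $x$ is negative and the fitted value at the largest $x$ exceeds 1, so neither can be read as a probability.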
To handle such cases, we need a more flexible framework: the Generalized Linear Model.
3. The Three Key Components of a GLM
A Generalized Linear Model is built from three components:
(1) Distribution of the Response Variable
GLMs assume the response variable follows a distribution from the exponential family, such as:
- Normal (for continuous data),
- Bernoulli (for binary outcomes),
- Poisson (for counts).
(2) Linear Predictor
Just like in linear regression, we define a linear combination of the predictors:
$$\eta = \beta_0 + \beta_1 x$$
In this article, we use this single-variable predictor for simplicity and clarity. In practice, the linear predictor can involve multiple variables and interaction terms, but we’ll stick with one variable ($x$) to make the core ideas easier to follow.
(3) Link Function
The link function $g$ connects the mean of the response variable, $\mu = E[y]$, to the linear predictor $\eta$:
$$g(\mu) = \eta = \beta_0 + \beta_1 x$$
This function transforms the mean into a scale suitable for a linear model.
Common choices include:
- Identity: $g(\mu) = \mu$ (used in linear regression),
- Logit: $g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right)$ (used for binary data),
- Log: $g(\mu) = \log(\mu)$ (used for counts).
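The three links above, together with their inverses, are short enough to write out directly. This is an illustrative sketch using only the standard library. The function names are assumptions for this example, and the code is not hardened against extreme inputs (for instance, `math.exp` overflows for very large arguments).

```python
import math


def identity_link(mu):
    return mu


def logit_link(mu):
    # Defined for 0 < mu < 1; maps probabilities onto the whole real line.
    return math.log(mu / (1.0 - mu))


def log_link(mu):
    # Defined for mu > 0; maps positive means onto the whole real line.
    return math.log(mu)


def inverse_identity(eta):
    return eta


def inverse_logit(eta):
    # Maps any real eta back into the open interval (0, 1).
    return 1.0 / (1.0 + math.exp(-eta))


def inverse_log(eta):
    # Maps any real eta back to a positive mean.
    return math.exp(eta)


# Round trip: mean -> linear-predictor scale -> mean.
mu = 0.3
eta = logit_link(mu)
print(eta, inverse_logit(eta))
```

The inverse link is what turns a fitted linear predictor back into a prediction on the original scale, which is why its range should match the range of the response mean.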
4. Example: Logistic Regression for Binary Data
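Suppose we want to model whether someone clicks an ad ($y = 1$ for a click, $y = 0$ otherwise). We can use:
- Distribution: Bernoulli
- Link function: Logit
- Linear predictor: $\eta = \beta_0 + \beta_1 x$
Then the model becomes:
$$\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x$$
where $p = P(y = 1 \mid x)$ is the click probability. Solving for $p$ gives:
$$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$$
This formula ensures that $p$ always lies strictly between 0 and 1, something linear regression can’t guarantee.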
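For readers who want the intermediate algebra, here is a sketch of how the logit equation is inverted. The notation is an assumption made for this derivation: $p = P(y = 1 \mid x)$ is the click probability and $\eta = \beta_0 + \beta_1 x$ is the linear predictor.

```latex
\begin{aligned}
\log\frac{p}{1 - p} &= \eta \\
\frac{p}{1 - p} &= e^{\eta} \\
p &= e^{\eta}\,(1 - p) \\
p\,\bigl(1 + e^{\eta}\bigr) &= e^{\eta} \\
p &= \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}
\end{aligned}
```

Because $e^{\eta} > 0$ for every real $\eta$, the numerator is positive and strictly smaller than the denominator, so $p$ stays strictly inside $(0, 1)$ whatever values the coefficients take.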
5. How to Build a GLM: Step-by-Step
Here’s how to construct a GLM for any type of data:
- Check your response variable: Is it continuous, binary, or a count?
- Choose a distribution: Pick one from the exponential family (Normal, Bernoulli, Poisson, etc.).
- Select a link function: This should match the range of your outcome and the nature of your predictor.
- Define a linear predictor: Build an expression like $\eta = \beta_0 + \beta_1 x$.
- Estimate the parameters: Typically done using Maximum Likelihood Estimation (MLE).
- Evaluate the model: Use metrics like AIC, deviance, or residuals to check the fit.
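The steps above can be sketched end to end for a binary response. The example below assumes a Bernoulli distribution with a logit link and one predictor, and it estimates the coefficients with Newton-Raphson iterations on the log-likelihood, which is one common way to compute the MLE rather than the only one. The data set and the function names are hypothetical.

```python
import math


def fit_logistic(xs, ys, n_iter=25):
    """Fit logit(p) = b0 + b1 * x by Newton-Raphson on the Bernoulli log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)            # Bernoulli variance at the current fit
            g0 += y - p                  # score with respect to b0
            g1 += (y - p) * x            # score with respect to b1
            h00 += w                     # entries of the information matrix
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information matrix) * delta = score.
        det = h00 * h11 - h01 * h01
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    return b0, b1


def log_likelihood(xs, ys, b0, b1):
    """Bernoulli log-likelihood of the fitted logistic model."""
    ll = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll


# Hypothetical data: the classes overlap, so a finite MLE exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0, b1 = fit_logistic(xs, ys)
ll = log_likelihood(xs, ys, b0, b1)
deviance = -2.0 * ll       # for 0/1 responses the saturated log-likelihood is 0
aic = 2 * 2 - 2.0 * ll     # two estimated parameters: b0 and b1
print(b0, b1, deviance, aic)
```

If the two classes were perfectly separated by $x$, the likelihood would have no finite maximum and these iterations would not settle, so this sketch should only be read as an illustration for well-behaved data. In practice a statistics library would normally handle the fitting and diagnostics.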
Interactive GLM Builder
Explore how GLMs work by choosing a distribution, link function, and predictor below. Adjust parameters to see how the model behaves:
Summary
GLMs extend the familiar linear regression model by allowing for non-normal distributions and transforming the relationship between predictors and outcomes using link functions. This makes them powerful tools for modeling a wide range of data, from binary outcomes to counts and beyond.
Once you understand the GLM framework, you’ll be ready to dive into logistic regression, Poisson regression, and more specialized models.