Factor analysis through the lens of prediction

Factor analysis is a classical statistical method underlying some of the most important results in psychology. In this blog post, I study it through the lens of prediction.


Factor analysis was created by psychologists to study differences between people (Spearman, 1904; Thurstone, 1931). The core idea of factor analysis is that a person’s idiosyncratic behaviors – e.g. their specific responses to a set of survey questions – can be predicted using a small number of latent variables called factors.

For example, Spearman proposed that a single latent variable underlies a person’s cognitive ability, which he dubbed the general factor, or “g factor”. In personality research, factor analysis forms the basis of the standard “Big 5” model, which posits five factors underlie differences in personality.

A bit more abstractly, factor analysis seeks to model data-generating processes in which vectors $X \in \mathbb{R}^d$ are drawn from an unknown probability distribution $P$. To do this, factor analysis uses vectors sampled from $P$ to fit an approximate distribution $Q \approx P$, where $Q$ takes a particularly simple form that I’ll describe shortly.

In many introductions to the topic, factor analysis is presented as an inferential tool that discovers or tests for a particular latent structure. The conventional goal of performing factor analysis is to draw inferences about $P$, by way of inspecting the parameters of a fitted $Q$.

In this blog post, I’m going to provide a complementary perspective. I’ll present factor analysis as a model-fitting procedure, where the goal is to build a model that can predict out-of-sample data. Three simple questions I’ll address in this blog post are: what is the model, how is it fit, and how does it make predictions?

In a sense, this is more of an ML-flavored introduction to factor analysis – one that, at least for me, felt like the more intuitive way into the topic.

Model

In classical factor analysis, the model $Q$ is a probability distribution over $\mathbb{R}^d$. In particular, $Q$ is a multivariate normal distribution with a certain, restricted type of covariance matrix:

$$Q = \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$$

The parameters of $Q$ are:

- the mean vector $\mu \in \mathbb{R}^d$,
- the loading matrix $\Lambda \in \mathbb{R}^{d \times k}$, where $k$ is the number of factors,
- the diagonal matrix $\Psi \in \mathbb{R}^{d \times d}$ of per-dimension noise variances.

The covariance matrix $\Sigma = \Lambda\Lambda^\top + \Psi$ is the heart of factor analysis. It expresses the core inductive bias of the model: that data $X$ are generated by sampling from a Gaussian lying in a $k$-dimensional subspace (as summarized by $\Lambda\Lambda^\top$), then corrupting it with per-dimension noise (as summarized by the diagonal matrix $\Psi$).

Equivalently, the model $Q$ describes the following generative process:

$$X = \mu + \Lambda F + \epsilon$$

where $F \sim \mathcal{N}(\mathbf{0}, I)$ is a sample from the standard multivariate normal over $\mathbb{R}^k$, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \Psi)$ is independent of $F$.
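As a quick sanity check on this generative story, here’s a minimal simulation sketch. All sizes and parameter values below are arbitrary illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000                 # illustrative sizes

mu = rng.normal(size=d)              # mean vector
Lambda = rng.normal(size=(d, k))     # loading matrix
psi = rng.uniform(0.2, 1.0, size=d)  # diagonal of Psi

F = rng.normal(size=(n, k))                    # F ~ N(0, I)
eps = rng.normal(size=(n, d)) * np.sqrt(psi)   # eps ~ N(0, Psi)
X = mu + F @ Lambda.T + eps                    # X = mu + Lambda F + eps
```

As $n$ grows, the sample covariance of `X` approaches $\Lambda\Lambda^\top + \Psi$.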

Note that $k$ determines the number of free parameters. In general, $d \times d$ covariance matrices have $\frac{d(d+1)}{2}$ parameters. In factor analysis, we have $dk + d$ free parameters, which is less than the general case when $k < (d-1)/2$.

Regardless of the choice of $k$, factor analysis has a non-unique parameterization, as $\Sigma = \Lambda\Lambda^\top + \Psi$ is invariant to rotations of $\Lambda$: for any orthogonal matrix $R$, replacing $\Lambda$ with $\Lambda R$ leaves $\Sigma$ unchanged. In most applications of factor analysis, selecting one such rotation for $\Lambda$ is a key step in obtaining an interpretable model. However, from the prediction-oriented perspective of this blog post, the choice of basis is irrelevant, because it does not change the probabilities assigned by $Q$.
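This invariance is easy to verify numerically. The sketch below rotates an arbitrary $\Lambda$ by an arbitrary orthogonal matrix and checks that $\Lambda\Lambda^\top$ is unchanged (all values here are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
Lambda = rng.normal(size=(5, 2))   # an arbitrary 5 x 2 loading matrix

# An arbitrary 2 x 2 rotation; any orthogonal R works, since R R^T = I
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Lambda_rot = Lambda @ R
assert np.allclose(Lambda @ Lambda.T, Lambda_rot @ Lambda_rot.T)
```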

Fitting

Our goal is to approximate some unknown probability distribution $P$ from which we have drawn $n$ samples $X_1, \dots, X_n \sim P$.

In ML terms, one could say factor analysis addresses an unsupervised learning problem: learning an unknown distribution $P(X)$ from empirical samples, using an intentionally restricted hypothesis class (here, low-rank Gaussian distributions).

Estimating the parameters of $Q$ from $X_1, \dots, X_n$ is done using maximum likelihood estimation, typically via the EM algorithm (Jöreskog, 1969; Rubin & Thayer, 1982). The log-likelihood of the data is the usual formula for multivariate Gaussian distributions:

$$\log p(X_1,\dots,X_n \mid \mu,\Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (X_i-\mu)^\top \Sigma^{-1}(X_i-\mu).$$

We noted earlier that $k$ (the number of factors) is a hyperparameter. In conventional applications of factor analysis, selecting $k$ is done using tools like scree plots, likelihood-ratio tests, and AIC/BIC statistics. As an alternative that would be more familiar to an ML practitioner, one could split $\mathbf{X}$ into training and validation sets (row-wise), and perform cross-validated parameter selection to identify the value of $k$ that maximizes the likelihood on the validation set.
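Here’s a sketch of that validation-based selection of $k$, using scikit-learn’s `FactorAnalysis` (whose `score` method returns the average held-out log-likelihood). The simulated data and all sizes are my own illustrative choices:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate from a true 3-factor model: X = Lambda F + eps
d, k_true, n = 20, 3, 2000
Lambda = rng.normal(size=(d, k_true))
psi = rng.uniform(0.5, 1.5, size=d)
F = rng.normal(size=(n, k_true))
X = F @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(psi)

# Row-wise train/validation split
X_train, X_val = X[:1500], X[1500:]

# Fit one model per candidate k; keep the k with the best validation likelihood
scores = {}
for k in range(1, 8):
    fa = FactorAnalysis(n_components=k).fit(X_train)
    scores[k] = fa.score(X_val)   # mean log-likelihood per validation sample

best_k = max(scores, key=scores.get)
```

With data this clean, the validation likelihood rises sharply up to the true $k$ and then flattens out.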

Making predictions

I tend to think of models as input-output machines. In goes $x \in \mathcal{X}$; out comes some $y \in \mathcal{Y}$. In such cases, it’s straightforward to understand what an “out-of-sample prediction” is – just feed in some $x$ that you didn’t use to build the machine, then get its output $y$. Hopefully, that prediction is correct.

In the case of factor analysis, we have a probabilistic model $Q$ that maps vectors $x \in \mathbb{R}^d$ to probability densities. So one immediate notion of making an “out-of-sample prediction” might be assigning a probability density to an unseen observation $X'$.

But factor analysis also supports a more useful notion of prediction. Because $Q$ is a distribution over $\mathbb{R}^d$, we can condition on some observed dimensions then predict the others. Concretely, suppose we partition the $d$ dimensions (items) into two sets, which we’ll call “support” and “test”. Then for any sample $X'$, we can write:

$$X' = (X'_{\mathrm{support}}, X'_{\mathrm{test}})$$

Then, we can derive the distribution of $X'_{\mathrm{test}}$ conditioned on $X'_{\mathrm{support}}$. First note that the rows of $\Lambda$ and the diagonal of $\Psi$ can be partitioned the same way:

$$\Lambda = \begin{pmatrix} \Lambda_s \\ \Lambda_t \end{pmatrix}, \qquad \Psi = \begin{pmatrix} \Psi_s & 0 \\ 0 & \Psi_t \end{pmatrix}.$$

Since $X' \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$, the joint distribution of the two blocks is

$$\begin{pmatrix} X'_{\mathrm{support}} \\ X'_{\mathrm{test}} \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_s \\ \mu_t \end{pmatrix},\ \begin{pmatrix} \Lambda_s \Lambda_s^\top + \Psi_s & \Lambda_s \Lambda_t^\top \\ \Lambda_t \Lambda_s^\top & \Lambda_t \Lambda_t^\top + \Psi_t \end{pmatrix} \right).$$

Applying the standard Gaussian conditioning formula:

$$X'_{\mathrm{test}} \mid X'_{\mathrm{support}} \sim \mathcal{N}(m, S),$$

where

$$m = \mu_t + \Lambda_t \Lambda_s^\top \left(\Lambda_s \Lambda_s^\top + \Psi_s\right)^{-1} (X'_{\mathrm{support}} - \mu_s),$$

$$S = \Lambda_t \Lambda_t^\top + \Psi_t - \Lambda_t \Lambda_s^\top \left(\Lambda_s \Lambda_s^\top + \Psi_s\right)^{-1} \Lambda_s \Lambda_t^\top.$$

This conditional distribution is a sort of “personalized model”, tuned for the individual (or whatever) that $X'$ represents. Namely, it tells us what responses from that individual are likely on the test items $X'_{\mathrm{test}}$ given our observations of the individual’s responses on the support items $X'_{\mathrm{support}}$.

Nicely, the support set can be any subset of the $d$ total items. We are not restricted to predicting one particular variable from one particular set of inputs chosen in advance.

This leads to at least one natural notion of train/test evaluation: after training $Q$, we can split the $d$ items into support and test sets, gather fresh samples $X' \sim P$, then see how well $Q$ can predict variation in the test items after conditioning on the support items.
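Putting the conditioning formulas above into code, here’s a sketch of that personalized prediction step. The function name and interface are my own; it takes fitted parameters $\mu$, $\Lambda$, $\Psi$ (as the vector of diagonal entries) plus index arrays for the support and test items:

```python
import numpy as np

def predict_test_given_support(mu, Lambda, psi, s_idx, t_idx, x_support):
    """Mean and covariance of X'_test | X'_support under N(mu, Lambda Lambda^T + diag(psi))."""
    L_s, L_t = Lambda[s_idx], Lambda[t_idx]
    Sigma_ss = L_s @ L_s.T + np.diag(psi[s_idx])

    # A = Lambda_t Lambda_s^T (Lambda_s Lambda_s^T + Psi_s)^{-1}
    A = np.linalg.solve(Sigma_ss, L_s @ L_t.T).T

    m = mu[t_idx] + A @ (x_support - mu[s_idx])                 # conditional mean
    S = L_t @ L_t.T + np.diag(psi[t_idx]) - A @ (L_s @ L_t.T)   # conditional covariance
    return m, S
```

Here `m` is the point prediction for the test items and `S` quantifies the remaining uncertainty.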

tl;dr

In slightly more informal terms, here’s everything I wrote above:

Suppose you perform a measurement procedure in which you take $d$ real-valued measurements from a person. Plot their measurements as a point in this $d$-dimensional “measurement space”. Repeat this across many people, and soon enough you have a cloud of points in $\mathbb{R}^d$, where each point is a person.

In a nutshell, factor analysis consists of fitting a low-rank Gaussian over this cloud of points.

Once you fit the Gaussian, you can use it to make out-of-sample predictions in at least the following sense: when a new person walks into the lab, you can take $m < d$ measurements from that person, then use the Gaussian to predict their responses on the remaining $d - m$ through conditioning.

Appendix

Probabilistic PCA vs. Factor Analysis

Factor analysis and probabilistic principal component analysis (PPCA) have deep similarities, but they are not identical. Like factor analysis, PPCA aims to learn a multivariate Gaussian model of the data. The difference is in the noise model: factor analysis fits a separate noise variance for each dimension (the diagonal of $\Psi$), while PPCA constrains the noise to be isotropic, $\Psi = \sigma^2 I$.

The interpretational distinctions between PPCA and factor analysis have been discussed in detail (Fabrigar et al., 1999), but from the “prediction perspective” of this blog post, PPCA and factor analysis amount to slightly different model families.

PPCA with $k$ dimensions can be understood to be slightly less expressive than factor analysis with $k$ factors, as it does not have the flexibility of fitting per-dimension variance using $\Psi$ (instead, it fits a single $\sigma^2$). This has the usual bias-variance tradeoff implication (i.e. PPCA has a less expressive model space, but is easier to learn).
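One way to see this tradeoff empirically: scikit-learn’s `PCA.score` evaluates the probabilistic-PCA likelihood, so we can compare it head-to-head with `FactorAnalysis.score` on held-out data whose noise is strongly heteroscedastic. The simulation and all sizes below are illustrative choices of mine:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)

# Simulate a 3-factor model whose noise variances differ wildly per dimension
d, k, n = 20, 3, 4000
Lambda = rng.normal(size=(d, k))
psi = rng.uniform(0.05, 5.0, size=d)   # strongly heteroscedastic noise
F = rng.normal(size=(n, k))
X = F @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(psi)
X_train, X_val = X[:3000], X[3000:]

# Held-out average log-likelihood under each model
fa_score = FactorAnalysis(n_components=k).fit(X_train).score(X_val)
ppca_score = PCA(n_components=k).fit(X_train).score(X_val)
# Factor analysis can match the per-dimension noise; PPCA cannot
```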

Assigning factors to an individual

Though the focus of this post is on predicting observable variables, it would be an omission not to mention how one estimates factor values $F \in \mathbb{R}^k$ for an individual, given their measurements $X$. First, recall how the random variable $F$ relates to $X$:

$$X = \mu + \Lambda F + \epsilon$$

where $F \sim \mathcal{N}(\mathbf{0}, I)$ and $\epsilon \sim \mathcal{N}(\mathbf{0}, \Psi)$. So the joint distribution of $X$ and $F$ can be written as a Gaussian:

$$\begin{pmatrix} F \\ X \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mathbf{0} \\ \mu \end{pmatrix},\ \begin{pmatrix} I & \Lambda^\top \\ \Lambda & \Lambda\Lambda^\top + \Psi \end{pmatrix} \right).$$

The standard conditioning formula for multivariate Gaussians gives the distribution of $F$ conditional on $X$ as:

$$F \mid X \sim \mathcal{N}\!\left( \Lambda^\top \left(\Lambda\Lambda^\top + \Psi\right)^{-1} (X - \mu),\ I - \Lambda^\top \left(\Lambda\Lambda^\top + \Psi\right)^{-1} \Lambda \right).$$
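As a sketch, the posterior mean above can be computed directly; a handy alternative, via the Woodbury identity $\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1} = (I + \Lambda^\top\Psi^{-1}\Lambda)^{-1}\Lambda^\top\Psi^{-1}$, inverts only a $k \times k$ matrix. Both function names below are my own:

```python
import numpy as np

def factor_scores(x, mu, Lambda, psi):
    """Posterior mean E[F | X = x], direct form (inverts a d x d matrix)."""
    Sigma = Lambda @ Lambda.T + np.diag(psi)
    return Lambda.T @ np.linalg.solve(Sigma, x - mu)

def factor_scores_woodbury(x, mu, Lambda, psi):
    """Same posterior mean via (I + Lambda^T Psi^{-1} Lambda)^{-1} Lambda^T Psi^{-1} (x - mu)."""
    Lp = Lambda.T / psi                        # Lambda^T Psi^{-1}
    M = np.eye(Lambda.shape[1]) + Lp @ Lambda  # k x k
    return np.linalg.solve(M, Lp @ (x - mu))
```

The Woodbury form is cheaper when $k \ll d$, since it avoids forming and solving the full $d \times d$ covariance.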


  1. Spearman, C. (1904). "General Intelligence" Objectively Determined and Measured. The American Journal of Psychology.
  2. Thurstone, L. L. (1931). Multiple factor analysis. Psychological Review, 38(5), 406.
  3. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34(2), 183–202.
  4. Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1), 69–76.
  5. Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272.