Factor analysis through the lens of prediction

Factor analysis is a classical statistical method underlying some of the most important results in psychology. In this blog post, I study it through the lens of prediction.


Factor analysis was created by psychologists to study differences between people (Spearman, 1904; Thurstone, 1931). The core idea of factor analysis is that a person’s idiosyncratic behaviors – e.g. their specific responses to a set of survey questions – can be predicted using a small number of latent variables called factors.

For example, Spearman proposed that a single latent variable underlies a person’s cognitive ability, which he dubbed the general factor, or “g factor”. In personality research, factor analysis forms the basis of the standard “Big 5” model, which posits five factors underlie differences in personality.

A bit more abstractly, factor analysis seeks to model data-generating processes in which vectors $X \in \mathbb{R}^d$ are drawn from an unknown probability distribution $P$. To do this, factor analysis uses vectors sampled from $P$ to fit an approximate distribution $Q \approx P$, where $Q$ takes a particularly simple form that I’ll describe shortly.

In many introductions to the topic, factor analysis is presented as an inferential tool that discovers or tests for a particular latent structure. The conventional goal of performing factor analysis is to draw inferences about $P$, by way of inspecting the parameters of a fitted $Q$.

In this blog post, I’m going to provide a complementary perspective. I’ll present factor analysis as a model-fitting procedure, where the goal is to build a model that can predict out-of-sample data. Three simple questions I’ll address in this blog post are: what is the model, how is it fit, and how does it make predictions?

In a sense, this is more of an ML-flavored introduction to factor analysis – one that, at least for me, felt like the more intuitive way into the topic.

Model

In classical factor analysis, the model $Q$ is a probability distribution over $\mathbb{R}^d$. In particular, $Q$ is a multivariate normal distribution with a certain, restricted type of covariance matrix:

$$Q = \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$$

The parameters of $Q$ are:

- the mean vector $\mu \in \mathbb{R}^d$,
- the loading matrix $\Lambda \in \mathbb{R}^{d \times k}$, where $k$ is the number of factors,
- the diagonal matrix $\Psi \in \mathbb{R}^{d \times d}$ of per-dimension noise variances.

The covariance matrix $\Sigma = \Lambda\Lambda^\top + \Psi$ is the heart of factor analysis. It expresses the core inductive bias of the model: that data $X$ are generated by sampling from a Gaussian lying in a $k$-dimensional subspace (as summarized by $\Lambda\Lambda^\top$), then corrupting it with per-dimension noise (as summarized by the diagonal matrix $\Psi$).

Equivalently, the model $Q$ describes the following generative process:

$$X = \mu + \Lambda F + \epsilon$$

where $F \sim \mathcal{N}(\mathbf{0}, I)$ is a sample from the standard multivariate normal over $\mathbb{R}^k$, and $\epsilon \sim \mathcal{N}(\mathbf{0}, \Psi)$ is independent of $F$.
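As a quick sanity check on this generative story, here’s a minimal simulation sketch. All sizes and parameter values below are arbitrary illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000                 # illustrative sizes

mu = rng.normal(size=d)              # mean vector
Lambda = rng.normal(size=(d, k))     # loading matrix
psi = rng.uniform(0.2, 1.0, size=d)  # diagonal of Psi

F = rng.normal(size=(n, k))                    # F ~ N(0, I)
eps = rng.normal(size=(n, d)) * np.sqrt(psi)   # eps ~ N(0, Psi)
X = mu + F @ Lambda.T + eps                    # X = mu + Lambda F + eps
```

As $n$ grows, the sample covariance of `X` approaches $\Lambda\Lambda^\top + \Psi$.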

Note that $k$ determines the number of free parameters. In general, $d \times d$ covariance matrices have $\frac{d(d+1)}{2}$ parameters. In factor analysis, we have $dk + d$ free parameters, which is less than the general case when $k < (d-1)/2$.

Regardless of the choice of $k$, factor analysis has a non-unique parameterization, as $\Sigma = \Lambda\Lambda^\top + \Psi$ is invariant to rotations of $\Lambda$: for any orthogonal matrix $R$, replacing $\Lambda$ with $\Lambda R$ leaves $\Sigma$ unchanged. In most applications of factor analysis, selecting one such rotation for $\Lambda$ is a key step in obtaining an interpretable model. However, from the prediction-oriented perspective of this blog post, the choice of basis is irrelevant, because it does not change the probabilities assigned by $Q$.
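This invariance is easy to verify numerically. The sketch below rotates an arbitrary $\Lambda$ by an arbitrary orthogonal matrix and checks that $\Lambda\Lambda^\top$ is unchanged (all values here are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
Lambda = rng.normal(size=(5, 2))   # an arbitrary 5 x 2 loading matrix

# An arbitrary 2 x 2 rotation; any orthogonal R works, since R R^T = I
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

Lambda_rot = Lambda @ R
assert np.allclose(Lambda @ Lambda.T, Lambda_rot @ Lambda_rot.T)
```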

Fitting

Our goal is to approximate some unknown probability distribution $P$ from which we have drawn $n$ samples $X_1, \dots, X_n \sim P$.

In ML terms, one could say factor analysis addresses an unsupervised learning problem: learning an unknown distribution $P(X)$ from empirical samples, using an intentionally restricted hypothesis class (here, low-rank Gaussian distributions).

Estimating the parameters of $Q$ from $X_1, \dots, X_n$ is done using maximum likelihood estimation, typically via the EM algorithm (Jöreskog, 1969; Rubin & Thayer, 1982). The log-likelihood of the data is the usual formula for multivariate Gaussian distributions:

$$\log p(X_1,\dots,X_n \mid \mu,\Sigma) = -\frac{nd}{2}\log(2\pi) - \frac{n}{2}\log|\Sigma| - \frac{1}{2}\sum_{i=1}^n (X_i-\mu)^\top \Sigma^{-1}(X_i-\mu).$$

We noted earlier that $k$ (the number of factors) is a hyperparameter. In conventional applications of factor analysis, selecting $k$ is done using tools like scree plots, likelihood-ratio tests, and AIC/BIC statistics. As an alternative that would be more familiar to an ML practitioner, one could split $\mathbf{X}$ into training and validation sets (row-wise), and perform cross-validated parameter selection to identify the value of $k$ that maximizes the likelihood on the validation set.
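Here’s a sketch of that validation-based selection of $k$, using scikit-learn’s `FactorAnalysis` (whose `score` method returns the average held-out log-likelihood). The simulated data and all sizes are my own illustrative choices:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate from a true 3-factor model: X = Lambda F + eps
d, k_true, n = 20, 3, 2000
Lambda = rng.normal(size=(d, k_true))
psi = rng.uniform(0.5, 1.5, size=d)
F = rng.normal(size=(n, k_true))
X = F @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(psi)

# Row-wise train/validation split
X_train, X_val = X[:1500], X[1500:]

# Fit one model per candidate k; keep the k with the best validation likelihood
scores = {}
for k in range(1, 8):
    fa = FactorAnalysis(n_components=k).fit(X_train)
    scores[k] = fa.score(X_val)   # mean log-likelihood per validation sample

best_k = max(scores, key=scores.get)
```

With data this clean, the validation likelihood rises sharply up to the true $k$ and then flattens out.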

Making predictions

I tend to think of models as input-output machines. In goes $x \in \mathcal{X}$; out comes some $y \in \mathcal{Y}$. In such cases, it’s straightforward to understand what an “out-of-sample prediction” is – just feed in some $x$ that you didn’t use to build the machine, then get its output $y$. Hopefully, that prediction is correct.

In the case of factor analysis, we have a probabilistic model $Q$ that maps vectors $x \in \mathbb{R}^d$ to probability densities. So one immediate notion of making an “out-of-sample prediction” might be assigning a probability density to an unseen observation $X'$.

But factor analysis also supports a more useful notion of prediction. Because $Q$ is a distribution over $\mathbb{R}^d$, we can condition on some observed dimensions then predict the others. Concretely, suppose we partition the $d$ dimensions (items) into two sets, which we’ll call “support” and “test”. Then for any sample $X'$, we can write:

$$X' = (X'_{\mathrm{support}}, X'_{\mathrm{test}})$$

Then, we can derive the distribution of $X'_{\mathrm{test}}$ conditioned on $X'_{\mathrm{support}}$. First note that the rows of $\Lambda$ and the diagonal of $\Psi$ can be partitioned the same way:

$$\Lambda = \begin{pmatrix} \Lambda_s \\ \Lambda_t \end{pmatrix}, \qquad \Psi = \begin{pmatrix} \Psi_s & 0 \\ 0 & \Psi_t \end{pmatrix}.$$

Since $X' \sim \mathcal{N}(\mu, \Lambda\Lambda^\top + \Psi)$, the joint distribution of the two blocks is

$$\begin{pmatrix} X'_{\mathrm{support}} \\ X'_{\mathrm{test}} \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mu_s \\ \mu_t \end{pmatrix},\ \begin{pmatrix} \Lambda_s \Lambda_s^\top + \Psi_s & \Lambda_s \Lambda_t^\top \\ \Lambda_t \Lambda_s^\top & \Lambda_t \Lambda_t^\top + \Psi_t \end{pmatrix} \right).$$

Applying the standard Gaussian conditioning formula:

$$X'_{\mathrm{test}} \mid X'_{\mathrm{support}} \sim \mathcal{N}(m, S),$$

where

$$m = \mu_t + \Lambda_t \Lambda_s^\top \left(\Lambda_s \Lambda_s^\top + \Psi_s\right)^{-1} (X'_{\mathrm{support}} - \mu_s),$$

$$S = \Lambda_t \Lambda_t^\top + \Psi_t - \Lambda_t \Lambda_s^\top \left(\Lambda_s \Lambda_s^\top + \Psi_s\right)^{-1} \Lambda_s \Lambda_t^\top.$$

This conditional distribution is a sort of “personalized model”, tuned for the individual (or whatever) that $X'$ represents. Namely, it tells us what responses from that individual are likely on the test items $X'_{\mathrm{test}}$ given our observations of the individual’s responses on the support items $X'_{\mathrm{support}}$.

Nicely, the support set can be any subset of the $d$ total items. We are not restricted to predicting one particular variable from one particular set of inputs chosen in advance.

This leads to at least one natural notion of train/test evaluation: after training $Q$, we can split the $d$ items into support and test sets, gather fresh samples $X' \sim P$, then see how well $Q$ can predict variation in the test items after conditioning on the support items.
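Putting the conditioning formulas above into code, here’s a sketch of that personalized prediction step. The function name and interface are my own; it takes fitted parameters $\mu$, $\Lambda$, $\Psi$ (as the vector of diagonal entries) plus index arrays for the support and test items:

```python
import numpy as np

def predict_test_given_support(mu, Lambda, psi, s_idx, t_idx, x_support):
    """Mean and covariance of X'_test | X'_support under N(mu, Lambda Lambda^T + diag(psi))."""
    L_s, L_t = Lambda[s_idx], Lambda[t_idx]
    Sigma_ss = L_s @ L_s.T + np.diag(psi[s_idx])

    # A = Lambda_t Lambda_s^T (Lambda_s Lambda_s^T + Psi_s)^{-1}
    A = np.linalg.solve(Sigma_ss, L_s @ L_t.T).T

    m = mu[t_idx] + A @ (x_support - mu[s_idx])                 # conditional mean
    S = L_t @ L_t.T + np.diag(psi[t_idx]) - A @ (L_s @ L_t.T)   # conditional covariance
    return m, S
```

Here `m` is the point prediction for the test items and `S` quantifies the remaining uncertainty.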

tl;dr

In slightly more informal terms, here’s everything I wrote above:

Suppose you perform a measurement procedure in which you take $d$ real-valued measurements from a person. Plot their measurements as a point in this $d$-dimensional “measurement space”. Repeat this across many people, and soon enough you have a cloud of points in $\mathbb{R}^d$, where each point is a person.

In a nutshell, factor analysis consists of fitting a low-rank Gaussian over this cloud of points.

Once you fit the Gaussian, you can use it to make out-of-sample predictions in at least the following sense: when a new person walks into the lab, you can take $m < d$ measurements from that person, then use the Gaussian to predict their responses on the remaining $d - m$ through conditioning.

Appendix

Probabilistic PCA vs. Factor Analysis

Factor analysis and probabilistic principal component analysis (PPCA) have deep similarities, but they are not identical. Like factor analysis, PPCA aims to learn a multivariate Gaussian model of the data. The difference is in the noise model: factor analysis fits a separate noise variance for each dimension (the diagonal of $\Psi$), while PPCA constrains the noise to be isotropic, $\Psi = \sigma^2 I$.

The interpretational distinctions between PPCA and factor analysis have been discussed in detail (Fabrigar et al., 1999), but from the “prediction perspective” of this blog post, PPCA and factor analysis amount to slightly different model families.

PPCA with $k$ dimensions can be understood to be slightly less expressive than factor analysis with $k$ factors, as it does not have the flexibility of fitting per-dimension variance using $\Psi$ (instead, it fits a single $\sigma^2$). This has the usual bias-variance tradeoff implication (i.e. PPCA has a less expressive model space, but is easier to learn).
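One way to see this tradeoff empirically: scikit-learn’s `PCA.score` evaluates the probabilistic-PCA likelihood, so we can compare it head-to-head with `FactorAnalysis.score` on held-out data whose noise is strongly heteroscedastic. The simulation and all sizes below are illustrative choices of mine:

```python
import numpy as np
from sklearn.decomposition import PCA, FactorAnalysis

rng = np.random.default_rng(0)

# Simulate a 3-factor model whose noise variances differ wildly per dimension
d, k, n = 20, 3, 4000
Lambda = rng.normal(size=(d, k))
psi = rng.uniform(0.05, 5.0, size=d)   # strongly heteroscedastic noise
F = rng.normal(size=(n, k))
X = F @ Lambda.T + rng.normal(size=(n, d)) * np.sqrt(psi)
X_train, X_val = X[:3000], X[3000:]

# Held-out average log-likelihood under each model
fa_score = FactorAnalysis(n_components=k).fit(X_train).score(X_val)
ppca_score = PCA(n_components=k).fit(X_train).score(X_val)
# Factor analysis can match the per-dimension noise; PPCA cannot
```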

Assigning factors to an individual

Though the focus of this post is on predicting observable variables, it would be an omission not to mention how one estimates factor values $F \in \mathbb{R}^k$ for an individual, given their measurements $X$. First, recall how the random variable $F$ relates to $X$:

$$X = \mu + \Lambda F + \epsilon$$

where $F \sim \mathcal{N}(\mathbf{0}, I)$ and $\epsilon \sim \mathcal{N}(\mathbf{0}, \Psi)$. So the joint distribution of $X$ and $F$ can be written as a Gaussian:

$$\begin{pmatrix} F \\ X \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \mathbf{0} \\ \mu \end{pmatrix},\ \begin{pmatrix} I & \Lambda^\top \\ \Lambda & \Lambda\Lambda^\top + \Psi \end{pmatrix} \right).$$

The standard conditioning formula for multivariate Gaussians gives the distribution of $F$ conditional on $X$ as:

$$F \mid X \sim \mathcal{N}\!\left( \Lambda^\top \left(\Lambda\Lambda^\top + \Psi\right)^{-1} (X - \mu),\ I - \Lambda^\top \left(\Lambda\Lambda^\top + \Psi\right)^{-1} \Lambda \right).$$
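As a sketch, the posterior mean above can be computed directly; a handy alternative, via the Woodbury identity $\Lambda^\top(\Lambda\Lambda^\top + \Psi)^{-1} = (I + \Lambda^\top\Psi^{-1}\Lambda)^{-1}\Lambda^\top\Psi^{-1}$, inverts only a $k \times k$ matrix. Both function names below are my own:

```python
import numpy as np

def factor_scores(x, mu, Lambda, psi):
    """Posterior mean E[F | X = x], direct form (inverts a d x d matrix)."""
    Sigma = Lambda @ Lambda.T + np.diag(psi)
    return Lambda.T @ np.linalg.solve(Sigma, x - mu)

def factor_scores_woodbury(x, mu, Lambda, psi):
    """Same posterior mean via (I + Lambda^T Psi^{-1} Lambda)^{-1} Lambda^T Psi^{-1} (x - mu)."""
    Lp = Lambda.T / psi                        # Lambda^T Psi^{-1}
    M = np.eye(Lambda.shape[1]) + Lp @ Lambda  # k x k
    return np.linalg.solve(M, Lp @ (x - mu))
```

The Woodbury form is cheaper when $k \ll d$, since it avoids forming and solving the full $d \times d$ covariance.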


  1. Spearman, C. (1904). "General Intelligence" Objectively Determined and Measured. The American Journal of Psychology.
  2. Thurstone, L. L. (1931). Multiple factor analysis. Psychological Review, 38(5), 406.
  3. Jöreskog, K. G. (1969). A general approach to confirmatory maximum likelihood factor analysis. Psychometrika, 34(2), 183–202.
  4. Rubin, D. B., & Thayer, D. T. (1982). EM algorithms for ML factor analysis. Psychometrika, 47(1), 69–76.
  5. Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4(3), 272.