6.2 Principal Components Analysis

Principal components analysis can be defined as follows.

Consider a data matrix:

X = [xij]

in which the columns represent the p variables and rows represent measurements of n objects or individuals on those variables. The data can be represented by a cloud of n points in a p-dimensional space, each axis corresponding to a measured variable. We can then look for a line OY1 in this space such that the dispersion of n points when projected onto this line is a maximum. This operation defines a derived variable of the form

Y1 = a11X1 + a12X2 + … + a1pXp

with coefficients satisfying the condition

a11² + a12² + … + a1p² = 1

After obtaining OY1, consider the (p-1)-dimensional subspace orthogonal to OY1 and look for the line OY2 in this subspace such that the dispersion of the points when projected onto this line is a maximum. This is equivalent to seeking a line OY2 perpendicular to OY1 such that the dispersion of the points projected onto it is a maximum. Having obtained OY2, consider a line in the (p-2)-dimensional subspace orthogonal to both OY1 and OY2 such that the dispersion of the points when projected onto this line is as large as possible. The process can be continued until p mutually orthogonal lines are determined. Each of these lines defines a derived variable:

Yi = ai1X1 + ai2X2 + … + aipXp

where the constants are determined by the requirement that the variance of Yi is a maximum, subject to the constraint of orthogonality as well as

ai1² + ai2² + … + aip² = 1 for each i.

The Yi thus obtained are called Principal Components of the system and the process of obtaining them is called Principal Components Analysis.

The p-dimensional geometric model defined above can be considered the true picture of the data. If we wish to obtain the best q-dimensional representation of the p-dimensional true picture, then we simply have to project the points onto the q-dimensional subspace defined by the first q principal components Y1, Y2, …, Yq.
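As a small numerical illustration of this geometric picture, the sketch below (NumPy, with simulated data; the array names, sizes and seed are our own choices, not from the text) finds the directions of maximal dispersion as eigenvectors of the covariance matrix, a result developed in the remainder of this section, and projects the points onto the first q of them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated cloud of n = 200 points in p = 3 dimensions (illustrative only).
n, p = 200, 3
X = rng.normal(size=(n, p)) @ np.array([[3.0, 0.0, 0.0],
                                        [1.0, 1.5, 0.0],
                                        [0.5, 0.3, 0.4]])

S = np.cov(X, rowvar=False)               # p x p covariance matrix
vals, vecs = np.linalg.eigh(S)            # ascending order for a symmetric matrix
order = np.argsort(vals)[::-1]            # re-order from largest to smallest
vals, vecs = vals[order], vecs[:, order]

# OY1: projections onto the leading eigenvector have maximal dispersion.
a1 = vecs[:, 0]
print(np.var(X @ a1, ddof=1), vals[0])    # the two numbers agree

# Any other unit-length direction gives a smaller projected variance.
b = rng.normal(size=p)
b /= np.linalg.norm(b)
print(np.var(X @ b, ddof=1) <= vals[0])

# Best q-dimensional representation: project onto the first q components.
q = 2
Y = (X - X.mean(axis=0)) @ vecs[:, :q]    # n x q matrix of projected points
```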

The variance of a linear composite

Y = a1X1 + a2X2 + … + apXp

is given by

Var(Y) = Σi Σj ai aj sij

where sij is the covariance between variables i and j. The variance of a linear composite can also be expressed in the notation of matrix algebra as:

aT S a

where a is the vector of the variable weights and S is the covariance matrix. aT is the transpose of a.
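As a minimal check of this identity (our own sketch with simulated data and an arbitrarily chosen weight vector), the variance of the composite computed directly from the scores agrees with aT S a:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))           # 100 observations on p = 4 variables
a = np.array([0.5, -0.2, 0.7, 0.1])     # arbitrary weight vector (illustrative)

S = np.cov(X, rowvar=False)             # covariance matrix with elements s_ij
var_direct = np.var(X @ a, ddof=1)      # variance of the composite Y = Xa
var_quadratic = a @ S @ a               # a'Sa

print(np.isclose(var_direct, var_quadratic))   # True
```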

Principal components analysis finds the weight vector a that maximizes

aT S a

subject to the constraint that

aTa = 1

It is essential to constrain the size of a, otherwise the variance of the linear composite can become arbitrarily large by selecting large weights.

It is important to note that principal components decomposition is not scale invariant. We would get different decompositions depending on whether the principal components are calculated from the unscaled cross-products (SSCP) matrix or the covariance matrix. The magnitudes of the diagonal elements of a cross-products matrix or a covariance matrix influence the nature of the principal components. Hence standardized variables are commonly used. The XTX matrix based on standardized variables is proportional to a correlation matrix. The covariance matrix can be viewed as a partial step between the SSCP matrix and the correlation matrix: since it is based on the deviations of the variables from their respective means, it corrects the elements of the SSCP matrix for differences in overall level, but it does not correct for the differences in the variances among the variables.
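The scale dependence is easy to demonstrate numerically. The sketch below (simulated data; the rescaling factor is our own choice) compares the leading component obtained from the covariance matrix with the one obtained from the correlation matrix after one variable has been put on a much larger scale:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
X[:, 0] *= 100.0                        # put variable 1 on a much larger scale

def leading_vector(M):
    """Unit-length eigenvector belonging to the largest eigenvalue of M."""
    vals, vecs = np.linalg.eigh(M)
    return vecs[:, np.argmax(vals)]

a_cov = leading_vector(np.cov(X, rowvar=False))        # covariance-based
a_cor = leading_vector(np.corrcoef(X, rowvar=False))   # correlation-based

# The covariance-based weights are dominated by the large-variance variable;
# the correlation-based weights treat the standardized variables evenly.
print(np.round(a_cov, 3))
print(np.round(a_cor, 3))
```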

If we have a set of n observations (objects/cases) on p variables, then we can find the largest principal component (of a cross-products matrix, covariance matrix or correlation matrix) as the weight vector

a1 = (a11, a12, …, a1p)T

which maximizes the variance of

Y1 = a11X1 + a12X2 + … + a1pXp

subject to the constraint

a1Ta1 = 1

We can then define the second largest principal component as the weight vector

a2 = (a21, a22, …, a2p)T

which maximizes the variance of

Y2 = a21X1 + a22X2 + … + a2pXp

subject to the constraints:

a2Ta2 = 1

Principal component 2 is linearly independent of principal component 1, i.e.

a2Ta1 = 0

We can define the third largest principal component as the weight vector

a3 = (a31, a32, …, a3p)T

which maximizes the variance of

Y3 = a31X1 + a32X2 + … + a3pXp

subject to the constraints:

a3Ta3 = 1

The third principal component is orthogonal to the first two principal components. These two orthogonality conditions are

a3Ta1 = 0 and a3Ta2 = 0

This process can be continued till the last (i.e., the pth ) principal component is derived.
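One way to mimic this one-component-at-a-time construction numerically is deflation: after extracting a component, subtract its contribution from the matrix and extract the next one from what remains. The sketch below (our own illustration; in practice a single eigendecomposition of R delivers all p components at once, as shown later in this section) uses this idea and checks that the resulting weight vectors are orthonormal:

```python
import numpy as np

def sequential_components(R, k):
    """Extract the first k principal components of a symmetric matrix R
    one at a time, enforcing orthogonality by deflating R after each step."""
    R_work = R.copy()
    weights, variances = [], []
    for _ in range(k):
        vals, vecs = np.linalg.eigh(R_work)
        a = vecs[:, np.argmax(vals)]            # maximizes a'R_work a with a'a = 1
        lam = vals.max()
        weights.append(a)
        variances.append(lam)
        R_work = R_work - lam * np.outer(a, a)  # remove this component's contribution
    return np.column_stack(weights), np.array(variances)

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
R = np.corrcoef(X, rowvar=False)

A, lams = sequential_components(R, 3)
print(np.round(A.T @ A, 6))     # identity matrix: the weight vectors are orthonormal
```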

The sum of the variances of the principal components is equal to the sum of the variances of the original variables:

Var(Y1) + Var(Y2) + … + Var(Yp) = s11 + s22 + … + spp

where Var(Yi) is the variance of the i-th principal component. If the variables are standardized then

Var(Y1) + Var(Y2) + … + Var(Yp) = p
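A quick numerical check of these identities (a sketch with simulated data; the scales are our own choices):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(80, 5)) * np.array([1.0, 2.0, 0.5, 3.0, 1.5])

S = np.cov(X, rowvar=False)             # covariance matrix
R = np.corrcoef(X, rowvar=False)        # correlation matrix

# Sum of the eigenvalues equals the sum of the diagonal (the total variance).
print(np.isclose(np.linalg.eigvalsh(S).sum(), np.trace(S)))
# For standardized variables the total is simply p.
print(np.isclose(np.linalg.eigvalsh(R).sum(), X.shape[1]))
```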

In matrix notation, the above definition of principal components leads to the following equation:

RA = AΛ

where A is a matrix of eigenvectors as column vectors and Λ is a diagonal matrix of the corresponding latent roots (or eigenvalues) of the correlation matrix R, rank-ordered from the largest to the smallest. The elements of Λ have to be in the same order as their associated latent vectors (or eigenvectors). The largest latent root (l1) of R is the variance of the first or largest principal component of R, and its associated vector

a1 = (a11, a12, …, a1p)T

is the set of weights for the first principal component, which maximizes the variance of

Y1 = a11X1 + a12X2 + … + a1pXp

Similarly for the second principal component, and so on.

The last latent root (lp) is the variance of the last or the smallest principal component.

The i-th latent root and its associated weight vector satisfy the matrix equation:

Rai = liai

Pre-multiplying the above equation by aiT leads to

aiTRai = aiTliai = li

since

aiTai = 1

The variance of the first principal component = l1. Similarly for the second principal component, and so on. The last latent root (lp) is the variance of the last or the smallest principal component. Thus:

Ra1 = l1a1

Ra2 = l2a2

…

Rap = lpap

In matrix notation,

RA = AΛ

where A is the matrix of eigenvectors, as column vectors, and Λ is the diagonal matrix of the corresponding latent roots, ordered from the largest to the smallest.

Since RA = AΛ, pre-multiplying by AT leads to

ATRA = ATAΛ = Λ

because ATA = I.

This means that we can decompose R into a product of three matrices, R = AΛAT, involving the eigenvectors and the eigenvalues. In other words, the variation in R is expressed in terms of the weighting vectors (eigenvectors) of the principal components and the variances (eigenvalues) of the principal components. This is the spectral (eigenvalue) decomposition of the correlation matrix R; because R is symmetric and positive semi-definite, it coincides with the singular value decomposition of R. This is the key concept underlying Principal Components Analysis.
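This chain of identities can be verified directly. The sketch below (simulated data; note that numpy.linalg.eigh returns the eigenvalues in ascending order, so they are re-ordered from largest to smallest) checks RA = AΛ, ATRA = Λ and R = AΛAT:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(120, 4))
R = np.corrcoef(X, rowvar=False)        # correlation matrix

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]          # largest to smallest
L = np.diag(vals[order])                # Lambda: diagonal matrix of latent roots
A = vecs[:, order]                      # columns are the associated eigenvectors

print(np.allclose(R @ A, A @ L))        # R A = A Lambda
print(np.allclose(A.T @ R @ A, L))      # A'RA = Lambda, since A'A = I
print(np.allclose(A @ L @ A.T, R))      # R = A Lambda A'
```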

Interpretation of principal components

It becomes easier to interpret the principal components when the elements of the latent vectors are transformed into correlations of the variables with the particular principal components. This can be done by multiplying each element of a particular latent vector ai by the square root of the associated latent root, √li. The correlations of the variables with the principal components are called loadings.
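As a sketch (our own code and simulated data), the loadings are obtained by scaling each latent vector by the square root of its latent root, and they do match the correlations between the standardized variables and the component scores:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(150, 4))
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized variables
R = np.corrcoef(X, rowvar=False)

vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

loadings = vecs * np.sqrt(vals)         # each column a_i scaled by sqrt(l_i)

# Cross-check: correlations between the variables and the component scores.
scores = Z @ vecs
corr = np.corrcoef(np.hstack([Z, scores]), rowvar=False)[:4, 4:]
print(np.allclose(loadings, corr))      # True
```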

The purpose of principal components analysis is to reduce the complexity of the multivariate data by transforming it into the principal component space and then choosing the first q principal components (q < p) that explain most of the variation in the original variables. The following criteria for selecting the number of principal components are suggested in the literature:

  1. Plot the eigenvalues lj against j (j = 1, 2, …, p). The resulting plot, called a scree plot, provides a convenient visual method for separating the important components from the less important ones; it is so named because it resembles a mountainside with a jumble of boulders at the base (scree is a geological term for the debris that collects on the lower part of a rocky slope).
  2. Exclude those principal components with eigenvalues below the average. For the principal components calculated from the correlation matrix, the average eigenvalue is 1. This criterion excludes principal components with eigenvalues less than 1.
  3. Include just enough components to explain some chosen proportion (typically 80%) of the total variance.

Usually, the first approach includes too many components, whereas the second approach includes too few components. The 80% criterion can be a good compromise.
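The second and third criteria, and the values needed for a scree plot, are easy to compute. The sketch below (simulated data; the thresholds are the ones quoted above) applies them to a correlation matrix:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 6))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)      # induce some correlation
R = np.corrcoef(X, rowvar=False)

eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]      # l_1 >= ... >= l_p

# Criterion 1: the values one would plot against j in a scree plot.
print(np.round(eigvals, 3))

# Criterion 2: keep components whose eigenvalue exceeds the average (1 for R).
print("eigenvalue > 1:", int(np.sum(eigvals > 1.0)))

# Criterion 3: smallest q explaining at least 80% of the total variance.
cumulative = np.cumsum(eigvals) / eigvals.sum()
print("80% rule: q =", int(np.searchsorted(cumulative, 0.80) + 1))
```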