Correspondence analysis is an exploratory data analytic technique designed to analyze simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. As opposed to traditional hypothesis testing designed to verify a priori hypotheses about relations between variables, exploratory data analysis is used to identify systematic relations between variables when there are no (or rather incomplete) a priori expectations as to the nature of those relations.
Correspondence analysis is also a (multivariate) descriptive data analytic technique. Even the most commonly used summary statistics may be inadequate for describing or understanding the data. Simplification of data provides useful information, but it should not come at the expense of valuable detail. Correspondence analysis simplifies complex data remarkably while providing a detailed description of practically every bit of information in the data, yielding a simple yet exhaustive analysis.
Correspondence analysis has several features that distinguish it from other techniques of data analysis. An important feature is the multivariate treatment of the data through simultaneous consideration of multiple categorical variables. This multivariate nature can reveal relationships that would not be detected in a series of pairwise comparisons of variables. Another important feature is the graphical display of row and column points in biplots, which can help in detecting structural relationships among the variable categories and objects (i.e., cases). Finally, correspondence analysis has highly flexible data requirements: the only strict requirement is a rectangular data matrix with non-negative entries. Correspondence analysis is most effective if the following conditions are satisfied:
A distinct advantage of correspondence analysis over other methods yielding joint graphical displays is that it produces two dual displays whose row and column geometries have similar interpretations, facilitating analysis and detection of relationships. In other multivariate approaches to graphical data representation, this duality is not present.
In a nutshell, correspondence analysis (CA) may be defined as a special case of principal components analysis (PCA) applied to the rows and columns of a table, especially a cross-tabulation. However, CA and PCA are used under different circumstances: principal components analysis is used for tables of continuous measurements, whereas correspondence analysis is applied to contingency tables (i.e., cross-tabulations). Its primary goal is to transform a table of numerical information into a graphical display in which each row and each column is depicted as a point.
The usual procedure for analyzing a cross-tabulation is to test for global association between rows and columns. The significance of association is tested by the Chi-square test, but this test provides no information as to which individual row-column associations in the data matrix are significant. Correspondence analysis shows how the variables are related, not merely that a relationship exists.
Basic Concepts and Definitions
There are certain fundamental concepts in correspondence analysis, which are described below.
The original data matrix N(I, J), or contingency table, is called the primitive matrix or primitive table. The elements of this matrix are nij.
While interpreting a cross-tabulation, it makes little sense to compare the actual frequencies in each cell. Each row and each column has a different number of respondents, called the base of respondents. For comparison it is essential to reduce either the rows or columns to the same base.
Consider a contingency table N(I, J) with I rows (i = 1, 2, …, I) and J columns (j = 1, 2, …, J) having frequencies nij. Marginal frequencies are denoted by ni+ and n+j.
Total frequency (the grand total) is given by n = Σi Σj nij.
The profile of each row i is a vector of conditional densities: (ni1/ni+, ni2/ni+, …, niJ/ni+).
The complete set of row profiles may be denoted by the I × J matrix R.
Matrix of Row Profiles
The profile of each column j is a vector of conditional densities: (n1j/n+j, n2j/n+j, …, nIj/n+j). The complete set of column profiles may be denoted by the I × J matrix C.
Matrix of Column Profiles
Average row profile: the jth element is n+j/n (j = 1, 2, …, J)
Average column profile: the ith element is ni+/n (i = 1, 2, …, I)
Another fundamental concept in correspondence analysis is the concept of mass:
Mass of the ith row = marginal frequency of the ith row / grand total = ni+/n
Mass of the jth column = marginal frequency of the jth column / grand total = n+j/n
The correspondence matrix P is defined as the original table N divided by the grand total n, P = (1/n) N. Thus, each cell of the correspondence matrix is given by the cell frequency divided by the grand total.
The correspondence matrix shows how one unit of mass is distributed across the cells. The row and column totals of the correspondence matrix are the row mass and column mass, respectively.
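The quantities defined so far can be sketched numerically. The following is a minimal numpy illustration with a hypothetical 3×4 contingency table (all counts are made up for the example):

```python
import numpy as np

# Hypothetical 3x4 contingency table; all values are illustrative.
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])

n = N.sum()                  # grand total
P = N / n                    # correspondence matrix: its cells sum to 1
r = P.sum(axis=1)            # row masses  n_i+/n
c = P.sum(axis=0)            # column masses n_+j/n

R = N / N.sum(axis=1, keepdims=True)   # row profiles: each row sums to 1
C = N / N.sum(axis=0, keepdims=True)   # column profiles: each column sums to 1
```

The row and column sums of P recover the masses, and each profile is a conditional distribution summing to one.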
Clouds of Points N(I) and N(J)
The cloud of points N(I) is the set of points i ∈ I, whose coordinates are the components of the row profile and whose mass is ni+/n.
The cloud of points N(J) is the set of points j ∈ J, whose coordinates are the components of the column profile and whose mass is n+j/n.
A variant of Euclidean distance, called the weighted Euclidean distance, is used to measure and thereby depict the distances between profile points. Here, the weighting refers to differential weighting of the dimensions of the space and not to the weighting of the profiles.
The distance between two rows i and i′ is given by
d(i, i′) = [ Σj (nij/ni+ − ni′j/ni′+)² / (n+j/n) ]½
In a symmetric fashion, the distance between two columns j and j′ is given by
d(j, j′) = [ Σi (nij/n+j − nij′/n+j′)² / (ni+/n) ]½
The distance thus obtained is called the Chi-square distance. It differs from the usual Euclidean distance in that each squared difference is weighted by the inverse of the corresponding element of the average profile.
The division of each squared term by the expected frequency is "variance – standardizing" and compensates for the larger variance in high frequencies and the smaller variance in low frequencies. If no such standardization were performed, the differences between larger proportions would tend to be large and thus dominate the distance calculation, while the differences between the smaller proportions would tend to be swamped. The weighting factors are used to equalize these differences.
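The weighting can be sketched directly. This minimal numpy example (same hypothetical table as above, illustrative counts) computes the Chi-square distance between two row profiles:

```python
import numpy as np

# Hypothetical table (illustrative values).
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])
c = N.sum(axis=0) / N.sum()              # column masses (average row profile)
R = N / N.sum(axis=1, keepdims=True)     # row profiles

def chi2_distance(p, q, weights):
    # Weighted Euclidean distance: each squared difference is divided
    # by the corresponding element of the average profile.
    return np.sqrt(np.sum((p - q) ** 2 / weights))

d_01 = chi2_distance(R[0], R[1], c)      # distance between rows 0 and 1
```

Columns with small masses get large weights 1/c_j, which is exactly the variance-standardization described above.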
Essentially, the reason for choosing the Chi-square distance is that it satisfies the principle of distributional equivalence, expressed as follows: if two rows (or columns) with identical profiles are merged, the distances between the columns (or rows) remain unchanged.
Inertia is a term borrowed from the "moment of inertia" in mechanics. A physical object has a center of gravity (or centroid). Every particle of the object has a certain mass m and a certain distance d from the centroid. The moment of inertia of the object is the quantity md² summed over all the particles that constitute the object.
Moment of inertia = Σ m d²
This concept has an analogy in correspondence analysis. There is a cloud of profile points with masses adding up to 1. These points have a centroid ( i.e., the average profile) and a distance (Chi-square distance) between profile points. Each profile point contributes to the inertia of the whole cloud. The inertia of a profile point can be computed by the following formula.
For the ith row profile,
inertia(i) = ri Σj (rij − cj)² / cj
where ri = ni+/n is the mass of the ith row, rij is the ratio nij/ni+, and cj is n+j/n.
The inertia of the jth column profile is computed similarly.
The total inertia of the contingency table is given by:
Total inertia = (1/n) Σi Σj (nij − ni+n+j/n)² / (ni+n+j/n)
which is the Chi-square statistic divided by n.
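The identity between the Pearson Chi-square statistic and the mass-weighted geometry can be checked numerically. A minimal sketch with a hypothetical table (illustrative counts):

```python
import numpy as np

# Hypothetical table (illustrative values).
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])
n = N.sum()
E = np.outer(N.sum(axis=1), N.sum(axis=0)) / n   # expected counts n_i+ n_+j / n
chi2 = ((N - E) ** 2 / E).sum()                  # Pearson Chi-square statistic
total_inertia = chi2 / n

# The same quantity from the geometry: mass-weighted squared Chi-square
# distances of the row profiles to their centroid (the column masses).
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)
R = P / r[:, None]
geometric = (r * (((R - c) ** 2) / c).sum(axis=1)).sum()
```

Both computations yield the same total inertia, which is the sense in which inertia measures the table's departure from independence.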
Reduction of Dimensionality
Another way of looking at correspondence analysis is to consider it as a method for decomposing the overall inertia by identifying a small number of dimensions in which the deviations from the expected values can be represented. This is similar to the goal of factor analysis, where the total variance is decomposed, so as to arrive at a lower - dimensional representation of variables that allows one to reconstruct most of the variance/covariance matrix of variables.
Criterion for Dimensionality Reduction
In correspondence analysis, we are essentially looking for a low-dimensional subspace that is as close as possible to the set of profile points in the high-dimensional true space. Let S denote any candidate subspace. For the ith profile point, we can compute the Chi-square distance between the profile point and S, denoted by di(S). The weighted measure of the distance of the profile point from the subspace is given by:
ri [di(S)]²
The distance of all the profiles to the subspace S is given by:
Σi ri [di(S)]²
The objective of correspondence analysis is to discover which subspace S minimizes the above criterion.
The criterion used for dimensionality reduction implies that the inertia of a cloud in the optimal subspace is maximum, but that would still be less than that in the true space. What is lost in this process is the knowledge of how far and in which direction the profiles lie off this subspace. What is gained is a view of the profiles, which otherwise would not be possible. The ratio of inertia inside the subspace to the total inertia gives a measure of the accuracy of representation of a cloud in the subspace.
Correspondence analysis determines the principal axes of inertia and for each axis the corresponding eigenvalue, which is the same as the inertia of the cloud in the direction of the axis. The first factorial axis is the line in the direction of which the inertia of the cloud is a maximum. The second factorial axis is, among all the lines that are perpendicular to the first factorial axis, the one in whose direction the inertia of the cloud is a maximum. The third factorial axis is, among all the lines that are perpendicular to both the first and second factorial axes, the line in whose direction the inertia of the cloud is a maximum, and so on. The optimal subspace is a subspace spanned by the principal axes. The inertia of a profile along a principal axis is called the Principal Inertia.
Geometrically, the principal inertia is the weighted average of the Chi-squared distances from the centroid to the projections of the row profiles on the respective principal axis. It is an absolute measure of the dispersion of the row profiles in the direction of that axis. Each principal inertia can be decomposed into components due to each row profile (or column profile). Rows, which contribute highly to a principal axis, largely determine the orientation and the identity of the corresponding principal axis.
The cosines of the row profiles’ deviation vectors from the centroid and the principal axis describe how closely each profile vector lies or correlates with a principal axis. Thus, they measure how well the display approximates the profile’s true position.
The eigenvalues (λk) corresponding to the sequence of principal axes are in decreasing order of magnitude:
λ1 ≥ λ2 ≥ λ3 ≥ … ≥ λk
Row and Column Analyses
The row analysis of a matrix consists in situating the row profiles in a multidimensional space and finding the low-dimensional subspace that comes closest to the profile points. The row profiles are projected onto such a subspace for interpretation of the inter-profile positions. Similarly, the analysis of column profiles involves situating the column profiles in a multidimensional space and finding the low-dimensional subspace that comes closest to the profile points.
The row and column analyses are intimately connected. If a row analysis is performed, the column analysis is also ipso facto performed, and vice versa. The two analyses are equivalent in the sense that each has the same total inertia, the same dimensionality and the same decomposition of inertia into principal inertias along principal axes.
Row and Column Contributions to Inertia
These contributions can be expressed in relative terms:
Maximum number of dimensions
Since the frequencies in each row must sum to the row total, and the frequencies in each column to the column total, there are in a sense only J − 1 independent entries in each row and I − 1 independent entries in each column of the contingency table. Thus, the maximum number of eigenvalues that can be extracted from a two-way table is equal to min(I − 1, J − 1), the smaller of the number of rows minus 1 and the number of columns minus 1. If we choose to extract (i.e., interpret) the maximum number of dimensions, then we can reproduce exactly all the information contained in the table.
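The dimension bound can be verified as a matrix rank. A small numpy sketch (hypothetical 3×4 table, so at most min(2, 3) = 2 dimensions), using the standardized-residuals matrix that appears later in the mathematics section:

```python
import numpy as np

# Hypothetical 3x4 table: at most min(3-1, 4-1) = 2 dimensions.
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
# Standardized residuals: (p_ij - r_i c_j) / sqrt(r_i c_j).
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
k = np.linalg.matrix_rank(A)     # number of extractable dimensions
```

Centering by the independence model removes one dimension from each side, which is why the rank of A cannot exceed min(I − 1, J − 1).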
Interpretation of correspondence analysis
The interpretation of the results of correspondence analysis comprises the interpretation of the numerical results and of the factor graphics yielded by CA. The former implies selection of significant axes and significant points.
Selection of Significant Axes
How many axes are significant and should be retained for further analysis or interpretation? Here significant means 'necessary to study in detail', not significant in the sense of statistical significance tests. Two types of factor axes are considered: first order factor axes and second order factor axes. First order factor axes are selected on the basis of contributions to the total variance (or inertia), whereas second order factor axes are selected on the basis of contributions to the eccentricity, that is, cos²φ.
Correspondence analysis issues eigenvalues for the min(I − 1, J − 1) factor axes; the eigenvalues are ranked in decreasing order of magnitude.
First order factor axes:
The number of (significant) axes, M, can be determined by any of the following rules:
Second order factor axes:
After having selected the first order factor axes, the second order factor axes are selected as follows:
Let M′ be the rank of a factor axis for which a point i of N(I) and/or a point j of N(J) exists such that
cos²φ(i) ≥ k or
cos²φ(j) ≥ k
where k is typically 0.25.
Thus, the number of axes chosen for interpretation = M + M′.
Rules for interpreting factorial axes by individual points
An explicative point is one whose absolute contribution CTRα(i) (for i ∈ I) or CTRα(j) (for j ∈ J) is distinctly higher than the contributions of the other points. The points i ∈ I whose contributions are higher than the average contribution are considered explicative. The explicative points can be selected according to any of the following criteria:
The points explained by an α-axis are the variable points i of N(I) [or j of N(J)] whose contributions to the eccentricity are greater than a certain threshold. The contributions to the eccentricity are similar to a squared coefficient of correlation (cos²φ). Usually a threshold of 0.25 is used.
A point can be an explained point (by an α-axis) without being an explicative point, and conversely. Suppose that point i has an absolute contribution of 40% but a squared correlation of only 0.15 with an axis. This means that it contributes strongly to the creation of the axis, but it probably participates in the building of many other axes as well.
Thus, two sets of coefficients are calculated for each axis. These coefficients apply equally to the rows and columns of the data matrix.
Absolute contributions, which indicate the proportion of variance (i.e., inertia) explained by each variable in relation to each principal axis. This proportion is calculated with respect to the entire set of variables.
The squared correlations, which indicate the part of the variance of a variable explained by a principal axis.
The interpretation of absolute contributions is opposite to that of the relative contributions (cos²φ). The latter indicate the extent to which each row category and each column category is described by the axis. The contribution to inertia, on the other hand, indicates the extent to which the geometric orientation of the axis is determined by the individual variable categories.
Quality of representation
The quality of representation of a point in the coordinate system defined by the chosen number of dimensions is the ratio of the squared distance of the point from the origin in the chosen number of dimensions to the squared distance from the origin in the space defined by the maximum number of dimensions. It is also equal to the sum of the squared cosines over the retained axes:
Quality = Σ cos²φ
A low quality means that the current number of dimensions does not represent well the respective column or row point.
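Squared cosines and quality can be sketched numerically. A minimal numpy example with a hypothetical 3×4 table (illustrative counts); it anticipates the SVD-based solution derived in the mathematics section below, from which the row principal coordinates F are obtained:

```python
import numpy as np

# Hypothetical table (illustrative values).
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])
P = N / N.sum()
r, c = P.sum(axis=1), P.sum(axis=0)
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardized residuals
U, gamma, Vt = np.linalg.svd(A, full_matrices=False)
F = (U / np.sqrt(r)[:, None]) * gamma    # row principal coordinates

d2 = (F ** 2).sum(axis=1, keepdims=True) # squared distance of each row
                                         # profile to the centroid
cos2 = F ** 2 / d2                       # squared cosines, one per axis
quality_2d = cos2[:, :2].sum(axis=1)     # quality of a 2-D display
```

For this 3×4 table the full solution has only two dimensions, so the two-dimensional quality of every row point is 1; with larger tables the quality values reveal which points a low-dimensional map represents poorly.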
The most distinguishing feature of correspondence analysis is the possibility of introducing supplementary elements (variables or objects) into factor graphics. The supplementary elements do not contribute to the orientation of the factorial axis, but their relative contributions to the factorial axes and their coordinates are computed by the correspondence analysis. A simple way to think of such points is that they have a position in the full space, but no mass.
Supplementary points are additional rows and columns of a contingency table, which have meaningful profiles and which exist in the full space of row and column profiles. They can be projected onto the low-dimensional subspace and their positions relative to the active elements can be determined.
The relative contribution of a supplementary point to the eccentricity of an axis (i.e., cos²φ) can be used to judge whether the supplementary point lies to a larger or lesser extent in the plot rather than out of it.
Outlier points can plague correspondence analysis. Occasionally, a row or column profile is so rare that its point lies far from the rest of its set; despite its low mass, it can then play a dominant role in the determination of the higher order axes. This situation can be discerned easily by considering the point's contributions to the axes: when a point has a large contribution (CTR) together with a large principal coordinate on a major principal axis, it is called an outlier. Outlier points should be treated as supplementary variables.
As in principal components analysis, the results of correspondence analysis are presented on graphs that represent the configurations of points in projection planes formed by the first principal axes taken two at a time. It is customary to summarize the row and column coordinates in a single plot. However, it is important to remember that in such plots one can only interpret the distances between row points and the distances between column points, not the distances between row points and column points. It is, however, legitimate to interpret the relative position of one point of one set with respect to all the points of the other set.
The joint display of row and column points shows the relation between a point from one set and all points of the other set, not between individual points across sets. Except in special cases, it is extremely dangerous to interpret the proximity of two points belonging to different sets.
Some keys for interpreting the factorial maps are:
Mathematics of Correspondence Analysis
Contingency table N (I × J)
Row mass = row sums/grand total = ni+/n
Column mass = column sums/grand total = n+j/n
Correspondence matrix is defined as the original table (or matrix) N divided by the grand total n.
The matrix of row profiles can also be defined as the rows of the correspondence matrix P divided by their respective row sums (i.e. row masses), which can be written as:
Matrix of row profiles = Dr⁻¹P
where Dr is the diagonal matrix of row masses.
The matrix of column profiles consists of the columns of the correspondence matrix P divided by their respective column sums.
Matrix of column profiles = P Dc⁻¹
where Dc is the diagonal matrix of the column masses.
The correspondence analysis problem is to find a low-dimensional approximation to the original data matrix that represents both the row profiles
R = Dr⁻¹P
and the column profiles
C = P Dc⁻¹
in a low, k-dimensional subspace, where k is smaller than I and J. These two k-dimensional subspaces (one for the row profiles and one for the column profiles) have a geometric correspondence that enables us to represent both the rows and the columns in the same display.
Since we wish to graphically represent the distances between row (or column) profiles, we center the configuration of points at the center of gravity of each set. The centroid of the set of row points in its space is c, the vector of column masses, which is the average row profile. The centroid of the set of column points in its space is r, the vector of row masses, which is the average column profile.
To perform the analysis with respect to the center of gravity, P is centered "symmetrically" by rows and columns, i.e., P − rcᵀ, so that it corresponds to the average profiles of both sets of points. The solution to finding a representation of both sets of points is the singular value decomposition of the matrix of standardized residuals, i.e., the I × J matrix with elements:
(pij − ricj) / √(ricj)
The singular value decomposition (SVD) is defined as the decomposition of an I × J matrix A into the product of three matrices:
A = U Γ Vᵀ (1)
where the matrix Γ is a diagonal matrix of positive numbers in decreasing order:
γ1 ≥ γ2 ≥ … ≥ γk ≥ 0 (2)
where k is the rank of A, and the columns of the matrices U and V are orthonormal, i.e.,
UTU=I V T V=I (3)
where UT is the transpose of U, and VT is the transpose of V.
γ1, γ2, …, γk are called singular values.
The columns of U (u1, u2, …, uk) are called left singular vectors.
The columns of V (v1, v2, …, vk) are called right singular vectors.
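The defining properties (1)-(3) can be checked with numpy, whose `svd` routine returns the singular values in decreasing order. A small sketch with an arbitrary illustrative matrix:

```python
import numpy as np

# Any real matrix factors as A = U Gamma V^T.
A = np.array([[ 1.0, 2.0, 0.0],
              [-1.0, 0.5, 3.0]])
U, gamma, Vt = np.linalg.svd(A, full_matrices=False)  # Vt is V transposed
A_rebuilt = U @ np.diag(gamma) @ Vt                   # reproduces A
```

With `full_matrices=False` only the singular vectors corresponding to the singular values are returned, which is the form used in correspondence analysis.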
Consider a set of I points in J-dimensional space, whose coordinates are in the rows of the matrix Y, with masses m1, m2, …, mI assigned to the respective points, where the space is structured by the weighted Euclidean metric with dimension weights q1, q2, …, qJ associated with the respective dimensions. In other words, the distance between any two points, say x and y, is equal to
[(x − y)ᵀ Dq (x − y)]½ (4)
Let Dm and Dq be the diagonal matrices of point masses and dimension weights respectively
Let m be the vector of point masses (we have already assumed that):
1ᵀm = 1
where 1 is the vector of ones.
Any low-dimensional configuration of the points can be derived directly from the singular value decomposition of the matrix:
Dm½ (Y − 1ȳᵀ) Dq½ (5)
where ȳᵀ = mᵀY is the centroid of the rows of Y.
Applying singular value decomposition to the above equation, we find that principal coordinates of row points (i.e. projections of row profiles onto principal axes) are contained in the following matrix:
F = Dm⁻½ U Γ (6)
The coordinates of the points in an optimal α-dimensional subspace are contained in the first α columns of F. The principal axes of this space are contained in the matrix
Dq⁻½ V
Here, we have two special cases of the above general result, viz. Row problem and Column problem. These problems involve the reduction of dimensionality of the row profiles and the column profiles, where each set of points has its associated masses and Chi-square distances. Both these problems reduce to singular value decomposition of the same matrix of standardized residuals.
The row problem consists of a set of I profiles in the rows of R = Dr⁻¹P, with masses r in the diagonal matrix Dr, in a space with distances defined by the diagonal matrix Dc⁻¹. The centroid of the row profiles can be derived as follows:
rᵀ Dr⁻¹ P = 1ᵀ P = cᵀ
where cᵀ is the row vector of column masses.
The matrix A in Equation (5) can be written as
A = Dr½ (Dr⁻¹P − 1cᵀ) Dc⁻½ (7)
which can be rewritten as
A = Dr⁻½ (P − rcᵀ) Dc⁻½ (8)
The column problem consists of a set of J profiles in the columns of P Dc⁻¹, with masses c in the diagonal of Dc, in a space with distances defined by the diagonal matrix Dr⁻¹.
By transposing the matrix P Dc⁻¹ of column profiles, we obtain Dc⁻¹Pᵀ. The centroid of these profiles is rᵀ (i.e., the row vector of row masses).
The matrix in Equation (5) for the column problem can be written as
Dc½ (Dc⁻¹Pᵀ − 1rᵀ) Dr⁻½ = Dc⁻½ (Pᵀ − crᵀ) Dr⁻½ (9)
which is the transpose of the matrix derived for the row problem. It follows that both the row problem and the column problem are solved by computing the singular value decomposition of the same matrix of standardized residuals,
A = Dr⁻½ (P − rcᵀ) Dc⁻½ (10)
whose elements are:
aij = (pij − ricj) / √(ricj)
It follows from Equation (10) that the Chi-square statistic can be decomposed into I × J components of the form:
n (pij − ricj)² / (ricj)
The sum of squares of the elements of A is the total inertia of the contingency table:
Total inertia = Σi Σj (pij − ricj)² / (ricj)
which is the Chi-square statistic divided by n.
Thus, there are k = min(I − 1, J − 1) dimensions in the solution. The squares of the singular values of A, i.e., the eigenvalues of AᵀA or AAᵀ, also decompose the total inertia. These are denoted by λk = γk² and are called the principal inertias.
The principal coordinates of the rows are obtained from:
F = Dr⁻½ U Γ
or, in scalar notation: fik = γk uik / √ri
The principal coordinates of the columns are obtained from:
G = Dc⁻½ V Γ
or, in scalar notation: gjk = γk vjk / √cj
The standard coordinates of the rows are the principal coordinates divided by their respective singular values, i.e.,
X = F Γ⁻¹ = Dr⁻½ U (17)
or, in scalar notation: xik = uik / √ri
The standard coordinates of the columns are the principal coordinates divided by their respective singular values:
Y = G Γ⁻¹ = Dc⁻½ V (18)
or, in scalar notation: yjk = vjk / √cj
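The whole computation can be sketched end to end in numpy. A minimal implementation under the SVD formulation above, with a hypothetical 3×4 table (illustrative counts):

```python
import numpy as np

# Hypothetical table (illustrative values).
N = np.array([[20., 10.,  5., 15.],
              [10., 25., 10.,  5.],
              [ 5.,  5., 30., 10.]])
n = N.sum()
P = N / n
r, c = P.sum(axis=1), P.sum(axis=0)

# SVD of the standardized residuals A = Dr^-1/2 (P - r c^T) Dc^-1/2.
A = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, gamma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

F = (U / np.sqrt(r)[:, None]) * gamma    # principal coordinates of rows
G = (V / np.sqrt(c)[:, None]) * gamma    # principal coordinates of columns
X = U / np.sqrt(r)[:, None]              # standard coordinates of rows
Y = V / np.sqrt(c)[:, None]              # standard coordinates of columns

principal_inertias = gamma ** 2          # lambda_k; they sum to chi2/n
```

The row and column points are plotted using the first two columns of F and G; both clouds are centered at the origin of the display.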
Each principal inertia λk is decomposed into components for each row i:
λk = Σi ri fik²
or, in matrix notation: Γ² = Fᵀ Dr F
The contribution of the ith row to the principal inertia λk is equal to:
ri fik² / λk
For the ith row, the inertia components over all k axes sum to the inertia of the ith row:
Σk ri fik² = inertia of the ith row
The left-hand side of the above equation is identical to the sum of squared elements in the ith row of A, i.e., Σj aij².