Pearson correlation is unduly influenced by outliers, unequal variances,
non-normality, and nonlinearity. An important competitor of the Pearson
correlation coefficient is *Spearman’s rank correlation coefficient*. This
latter correlation is calculated by applying the Pearson correlation formula
to the ranks of the data rather than to the actual data values themselves.
This substitution considerably reduces many of the distortions that plague
the Pearson correlation.

Pearson correlation measures the strength of the linear relationship between *X*
and *Y*. In the case of nonlinear, but monotonic, relationships, a useful
measure is *Spearman’s* rank correlation coefficient, *Rho*, which is a
*Pearson*-type correlation coefficient computed on the ranks of the *X* and *Y*
values. It is computed by the following formula:

*r*_{s} = 1 - [6 Σ *d*_{i}^{2}] / [*n*(*n*^{2} - 1)]

where

*d*_{i} is the difference between the
ranks of *X*_{i} and *Y*_{i}, and *n* is the number of pairs.

*r*_{s} = +1, if there is a perfect
agreement between the two sets of ranks.

*r*_{s} = -1, if there is a complete
disagreement between the two sets of ranks.
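As a sketch of the computation above, the following pure-Python example ranks two illustrative samples (no ties here, so the simple formula applies directly; the `ranks` helper nevertheless assigns average ranks to ties) and applies *r*_{s} = 1 - 6 Σ *d*_{i}^{2} / [*n*(*n*^{2} - 1)]. The data and function names are illustrative, not from the text.

```python
def ranks(values):
    """Rank from 1..n; tied values get the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1  # average rank, 1-based
        i = j + 1
    return r

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

x = [10, 20, 30, 40, 50]
y = [15, 12, 25, 21, 30]      # roughly monotone, with two rank swaps
print(spearman_rho(x, y))     # 0.8
```

Here the rank differences are (-1, 1, -1, 1, 0), so Σ *d*_{i}^{2} = 4 and *r*_{s} = 1 - 24/120 = 0.8.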

*Kendall’s Tau*

This is a measure of correlation between two ordinal-level variables. It is
most appropriate for square tables. For any sample of *n* observations,
there are [*n*(*n*-1)/2] possible comparisons of points (*X*_{i},
*Y*_{i}) and (*X*_{j}, *Y*_{j}).

Let *C* = Number of pairs that are concordant.

Let *D* = Number of pairs that are discordant.

*Kendall’s Tau* = (*C* - *D*) / [*n*(*n*-1)/2]

Obviously, *Tau* has the range: -1 ≤ *Tau* ≤ +1.
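The pair-counting definition above can be sketched in a few lines of pure Python. Each of the *n*(*n*-1)/2 pairs is concordant if *X* and *Y* move in the same direction and discordant if they move in opposite directions; the data below are illustrative and contain no ties.

```python
def kendall_tau(x, y):
    """Kendall's Tau = (C - D) / [n(n-1)/2], assuming no ties."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[j] - x[i]) * (y[j] - y[i])
            if s > 0:
                c += 1       # concordant pair
            elif s < 0:
                d += 1       # discordant pair
    return (c - d) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(kendall_tau(x, y))     # (8 - 2) / 10 = 0.6
```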

If *X*_{i} = *X*_{j}, or *Y*_{i} = *Y*_{j},
or both, the comparison is called a ‘tie’. Ties are not counted as
concordant or discordant.

If there are a large number of ties, then the denominator has to be replaced by

SQRT [{*n*(*n*-1)/2 - *n*_{X}} {*n*(*n*-1)/2 - *n*_{Y}}]

where *n*_{X} is the number of tied pairs
involving *X*, and *n*_{Y} is the number of tied pairs involving *Y*.

In large samples, the statistic:

3 *Tau* [*n*(*n*-1)]^{1/2} / [2(2*n*+5)]^{1/2}

has an approximately standard normal distribution, and therefore can be used as a test statistic for testing the null hypothesis of zero correlation.
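The test statistic above is straightforward to evaluate. The sketch below uses *Tau* = 0.6 and *n* = 5 purely for illustration (a sample this small would not normally justify the large-sample approximation); the function name is hypothetical.

```python
import math

def tau_z(tau, n):
    """Large-sample z statistic: 3*Tau*sqrt(n(n-1)) / sqrt(2(2n+5))."""
    return 3 * tau * math.sqrt(n * (n - 1)) / math.sqrt(2 * (2 * n + 5))

print(tau_z(0.6, 5))    # ≈ 1.47, compared against the standard normal
```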

Kendall’s *Tau* is equivalent to Spearman’s *Rho*,
with regard to the underlying assumptions, but Spearman’s *Rho* and
Kendall’s *Tau* are not identical in magnitude, since their
underlying logic and computational formulae are quite different. The
relationship between the two measures is given by

-1 ≤ (3 Kendall’s *Tau*) - (2 Spearman’s *Rho*) ≤ +1

In most cases, these values are very similar, and when discrepancies occur,
it is probably safer to interpret the lower value. More importantly,
Kendall’s *Tau* and Spearman’s *Rho* imply different
interpretations. Spearman’s *Rho* is interpreted in the same way as the
regular Pearson’s correlation coefficient, in terms of the proportion of
variability accounted for, whereas Kendall’s *Tau* represents a
probability, *i.e*., the difference between the probability that the
observed data are in the same order and the probability that the
observed data are __not__ in the same order.

There are two different variants of *Tau*, viz. *Tau-b* and *Tau-c*.
These measures differ only in how tied ranks are handled.

*Kendall's Tau-b* is a measure of association often used with, but not
limited to, 2-by-2 tables. It is computed as the excess of concordant over discordant
pairs (*C* - *D*), divided by a term representing the geometric mean between the
number of pairs not tied on *X* and the number of pairs not tied on *Y*:

*Tau-b* = (*C* - *D*) / SQRT [(*C* + *D* + *X*_{0})(*C* + *D* + *Y*_{0})]

where *X*_{0} is the number of pairs tied only on *X*, and *Y*_{0} is the
number of pairs tied only on *Y* (so that *C* + *D* + *X*_{0} is the number
of pairs not tied on *Y*, and *C* + *D* + *Y*_{0} is the number not tied on *X*).

There is no well-defined intuitive meaning for *Tau-b*, which
is the surplus of concordant over discordant pairs as a percentage of
concordant, discordant, and approximately one-half of tied pairs. The rationale
for this is that if the direction of causation is unknown, then the surplus of
concordant over discordant pairs should be compared with the total of all
relevant pairs, where those relevant are the concordant pairs, the discordant
pairs, plus either the X-ties or Y-ties but not both, and since direction is
not known, the geometric mean is used as an estimate of relevant tied pairs.

*Tau*-*b* requires binary or ordinal data. It reaches 1.0 (or -1.0
for negative relationships) only for square tables when all entries are on one
diagonal. *Tau-b* equals 0 under statistical independence for both square
and non-square tables. *Tau-c* is used for non-square tables.
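A minimal sketch of *Tau-b* from paired observations with ties follows; the data are illustrative. Pairs tied on both variables drop out of both correction terms, pairs tied only on *X* go into *X*_{0}, and pairs tied only on *Y* into *Y*_{0}.

```python
import math

def kendall_tau_b(x, y):
    """Tau-b = (C - D) / sqrt((C + D + X0)(C + D + Y0))."""
    n = len(x)
    c = d = x0 = y0 = 0
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[j] - x[i], y[j] - y[i]
            if dx == 0 and dy == 0:
                continue          # tied on both: enters neither term
            if dx == 0:
                x0 += 1           # tied only on X
            elif dy == 0:
                y0 += 1           # tied only on Y
            elif dx * dy > 0:
                c += 1            # concordant
            else:
                d += 1            # discordant
    return (c - d) / math.sqrt((c + d + x0) * (c + d + y0))

x = [1, 1, 2, 2, 3]
y = [1, 2, 2, 3, 3]
print(kendall_tau_b(x, y))   # 6 / sqrt(8 * 8) = 0.75
```

In this example *C* = 6, *D* = 0, *X*_{0} = *Y*_{0} = 2, so *Tau-b* = 6/8 = 0.75.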

Kendall's *Tau-*c, also called *Kendall-Stuart Tau-c*, is a
variant of *Tau*-*b* for larger tables. It equals the excess of
concordant over discordant pairs, multiplied by a term representing an adjustment
for the size of the table.

*Tau-c* = (*C* - *D*) × [2*m* / (*n*^{2}(*m*-1))]

where

*m* = the number of rows or columns, whichever
is smaller

*n* = the sample size.
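The sketch below computes *Tau-c* directly from an illustrative 2 × 3 contingency table. A pair of cells is concordant when the second cell lies below and to the right of the first, and discordant when it lies below and to the left; each cell pair contributes the product of its two frequencies.

```python
def kendall_tau_c(table):
    """Tau-c = (C - D) * 2m / (n^2 (m - 1)) for a contingency table."""
    rows, cols = len(table), len(table[0])
    n = sum(sum(r) for r in table)
    m = min(rows, cols)
    c = d = 0
    for i in range(rows):
        for j in range(cols):
            for i2 in range(i + 1, rows):      # cells strictly below
                for j2 in range(cols):
                    if j2 > j:
                        c += table[i][j] * table[i2][j2]   # concordant
                    elif j2 < j:
                        d += table[i][j] * table[i2][j2]   # discordant
    return (c - d) * 2 * m / (n * n * (m - 1))

table = [[10, 5, 2],
         [2, 5, 10]]
print(kendall_tau_c(table))   # (200 - 24)*4 / (34^2 * 1) ≈ 0.609
```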

Another non-parametric measure of correlation is the Goodman-Kruskal *Gamma*
(*G*), which is based on the difference
between concordant pairs (*C*) and discordant pairs (*D*). *Gamma* is
computed as follows:

*Gamma* = (*C* - *D*) / (*C* + *D*)

Thus, *Gamma* is the surplus of concordant pairs over discordant pairs,
as a percentage of all pairs, ignoring ties. *Gamma* defines perfect
association as weak monotonicity. Under statistical independence, *Gamma*
will be 0, but it can be 0 at other times as well (whenever concordant minus
discordant pairs equal 0).
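Since *Gamma* uses the same pair counts as *Tau-c* but ignores all ties in the denominator, the sketch below reuses the same cell-pair counting on the same illustrative table.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D), ignoring all tied pairs."""
    rows, cols = len(table), len(table[0])
    c = d = 0
    for i in range(rows):
        for j in range(cols):
            for i2 in range(i + 1, rows):
                for j2 in range(cols):
                    if j2 > j:
                        c += table[i][j] * table[i2][j2]   # concordant
                    elif j2 < j:
                        d += table[i][j] * table[i2][j2]   # discordant
    return (c - d) / (c + d)

table = [[10, 5, 2],
         [2, 5, 10]]
print(goodman_kruskal_gamma(table))   # 176 / 224 ≈ 0.786
```

Note that *Gamma* (≈ 0.786) exceeds *Tau-c* (≈ 0.609) on the same table, since dropping ties from the denominator can only raise the magnitude.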

Another useful way of looking at the relationship between two nominal (or
categorical) variables is to cross-classify the data and get a *count* of
the number of cases sharing a given combination of levels (*i.e*.,
categories), and then create a contingency table (*cross-tabulation*)
showing the levels and the counts.

A *contingency table *lists the frequency of the joint occurrence of
two levels (or possible outcomes), one level for each of the two categorical
variables. The levels for one of the categorical variables correspond to the
columns of the table, and the levels for the other categorical variable
correspond to the rows of the table. The primary interest in constructing
contingency tables is usually to determine whether there is any association (in
terms of statistical dependence) between the two categorical variables, whose
counts are displayed in the table. A measure of the global association between
the two categorical variables is the *Chi-square* statistic, which is
computed as follows:

Consider a contingency table with *k* rows and *h* columns. Let *n*_{ij}
denote the observed frequency of cell (*i*, *j*), and let *e*_{ij} denote the
expected frequency of the cell. The deviation between the observed and expected
frequencies (*n*_{ij} - *e*_{ij})
characterizes the disagreement between the observation and the hypothesis of
independence. The expected frequency for any cell can be calculated by the
following formula:

*e*_{ij} = (*RT* × *CT*) / *N*

where

*e*_{ij} = expected frequency in a given cell (*i*, *j*)

*RT* = row total for the
row containing that cell.

*CT* = column total for the
column containing that cell.

*N* = total number of observations.

All the deviations can be studied by computing the quantity

χ^{2} = Σ_{i} Σ_{j} (*n*_{ij} - *e*_{ij})^{2} / *e*_{ij}

This statistic is distributed according to *Pearson’s Chi-square*
law with (*k*-1) × (*h*-1)
degrees of freedom. Thus, the statistical significance of the relationship
between two categorical variables is tested by using the χ^{2}-test,
which essentially finds out whether the observed frequencies in a distribution
differ significantly from the frequencies that might be expected according to
a certain hypothesis (say, the hypothesis of independence between the two
variables).
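The two formulas above can be sketched as follows: expected counts are built from the row and column totals, then the squared deviations are summed. The 2 × 2 table is illustrative.

```python
def chi_square(table):
    """Pearson chi-square: sum of (n_ij - e_ij)^2 / e_ij."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n   # e_ij = RT*CT/N
            chi2 += (obs - exp) ** 2 / exp
    return chi2

table = [[10, 20],
         [30, 40]]
print(chi_square(table))   # ≈ 0.794, with (2-1)(2-1) = 1 degree of freedom
```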

*Assumptions*

The χ^{2}-test
requires that the expected frequencies are not very small. The reason for this
assumption is that the *Chi-square* test inherently tests the underlying
probabilities in each cell; and when the expected cell frequencies are small, these
probabilities cannot be estimated with sufficient precision. Hence, it is
essential that the sample size be large enough to guarantee the
similarity between the theoretical and the sampling distribution of the χ^{2} statistic. In the
formula for computation of χ^{2},
the expected cell frequency appears in the denominator. If this value
is too small, the χ^{2} value
would be overestimated, which could result in an unwarranted rejection of the null
hypothesis.

To avoid making incorrect inferences from the χ^{2}-test,
the general rule is that an expected frequency of less than 5 in a cell is too
small to use. When the contingency table contains more than one cell with an
expected frequency < 5, one can
combine cells to obtain an expected frequency ≥ 5.
However, in doing so, the number of categories is reduced and one
gets __less__ information.

It should be noted that the χ^{2}-test
is quite sensitive to the sample size. If the sample size is too small, the χ^{2} value is overestimated; if it
is too large, the χ^{2} value
is underestimated. To overcome this problem, the following measures of
association are suggested in the literature: *Phi-square* (φ^{2}), *Cramer’s V* and the
*Contingency Coefficient*.

*Phi-square* is computed as follows:

φ^{2} = χ^{2} / *N*

where *N* is the
total number of observations.

For all contingency tables which are 2 × 2, 2 × *k*, or 2 × *h*, *Phi-square* has the
very nice property that its value ranges from 0 (no relationship) to 1
(perfect relationship). However, *Phi-square* loses this nice property
when both dimensions of the table are greater than 2. By a simple manipulation
of *Phi-square*, we get a measure (*Cramer’s V*) which
ranges from 0 to 1 for any size of contingency table. *Cramer’s V*
is computed as follows:

*V* = SQRT [χ^{2} / (*N*(*L* - 1))]

where *L* = min(*h*, *k*)

*Contingency coefficient*

The coefficient of contingency is a *Chi-square*-based measure of the
relation between two categorical variables (proposed by Pearson, the originator
of the *Chi-square* test). It is computed by the following formula:

*C* = SQRT [χ^{2} / (χ^{2} + *N*)]

Its advantage over the ordinary *Chi-square* is that it is more easily
interpreted, since its range is always limited to 0 through 1 (where 0 means
complete independence). The disadvantage of this statistic is that its specific
upper limit is ‘limited’ by the size of the table; the *Contingency
Coefficient* can reach the limit of 1 only if the number of categories
is unlimited.
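The contingency coefficient can be sketched in the same style, again on an illustrative 2 × 2 table; note how close it is to *Cramer’s V* for weak associations.

```python
import math

def chi_square(table):
    """Pearson chi-square from observed and expected cell counts."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    n = sum(row_totals)
    return sum((obs - row_totals[i] * col_totals[j] / n) ** 2
               / (row_totals[i] * col_totals[j] / n)
               for i, row in enumerate(table) for j, obs in enumerate(row))

def contingency_coefficient(table):
    """C = sqrt(chi2 / (chi2 + N))."""
    chi2 = chi_square(table)
    n = sum(sum(r) for r in table)
    return math.sqrt(chi2 / (chi2 + n))

table = [[10, 20],
         [30, 40]]
print(contingency_coefficient(table))   # ≈ 0.089
```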

*Lambda*

This is a measure of association for cross-tabulations of nominal-level variables. It measures the percentage improvement in predictability of the dependent variable (row variable or column variable), given the value of the other variable (column variable or row variable). The formulae are:

*Lambda A (rows dependent)*

*Lambda A* = (Σ_{j} max_{i} *n*_{ij} - max_{i} *RT*_{i}) / (*N* - max_{i} *RT*_{i})

*Lambda B (columns dependent)*

*Lambda B* = (Σ_{i} max_{j} *n*_{ij} - max_{j} *CT*_{j}) / (*N* - max_{j} *CT*_{j})

*Symmetric Lambda*

This is a weighted average of *Lambda A* and *Lambda B*.
The formula is:

*Symmetric Lambda* = (Σ_{j} max_{i} *n*_{ij} + Σ_{i} max_{j} *n*_{ij} - max_{i} *RT*_{i} - max_{j} *CT*_{j}) / (2*N* - max_{i} *RT*_{i} - max_{j} *CT*_{j})

where *RT*_{i} and *CT*_{j} denote the row and column totals.
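The three *Lambda* variants can be sketched together; the 2 × 3 table is illustrative, and the function name is hypothetical.

```python
def lambdas(table):
    """Return (Lambda A, Lambda B, Symmetric Lambda) for a table."""
    n = sum(sum(r) for r in table)
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    col_maxes = sum(max(c) for c in zip(*table))   # sum_j max_i n_ij
    row_maxes = sum(max(r) for r in table)         # sum_i max_j n_ij
    lam_a = (col_maxes - max(row_totals)) / (n - max(row_totals))
    lam_b = (row_maxes - max(col_totals)) / (n - max(col_totals))
    lam_sym = ((col_maxes + row_maxes - max(row_totals) - max(col_totals))
               / (2 * n - max(row_totals) - max(col_totals)))
    return lam_a, lam_b, lam_sym

table = [[10, 5, 2],
         [2, 5, 10]]
print(lambdas(table))   # ≈ (0.471, 0.364, 0.410)
```

For this table, knowing the column reduces errors in predicting the row by about 47%, while knowing the row reduces errors in predicting the column by about 36%.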

Fisher's exact test is a test for independence in a 2 × 2 table. This
test is designed to test the hypothesis that the two column percentages are
equal. It is particularly useful when sample sizes are small (even zero in some
cells) and the *Chi-square* test is not appropriate. The test determines
whether the two groups differ in the proportion with which they fall into two
classifications. The test is based on the probability of the observed outcome,
which is given by the following formula:

*P* = [(*a*+*b*)! (*c*+*d*)! (*a*+*c*)! (*b*+*d*)!] / [*N*! *a*! *b*! *c*! *d*!]

where *a*, *b*, *c*, *d* represent
the frequencies in the four cells, and *N* = total number of cases.
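The single-table probability above is a direct factorial computation; the cell counts below are illustrative. (The full test sums these probabilities over all tables at least as extreme as the one observed, which this sketch does not do.)

```python
from math import factorial

def fisher_p(a, b, c, d):
    """P = (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!) for one table."""
    n = a + b + c + d
    num = (factorial(a + b) * factorial(c + d)
           * factorial(a + c) * factorial(b + d))
    den = (factorial(n) * factorial(a) * factorial(b)
           * factorial(c) * factorial(d))
    return num / den

print(fisher_p(1, 3, 3, 1))   # 16/70 ≈ 0.2286
```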

*Mann-Whitney U Test / Wilcoxon Rank-Sum Test*

This test is the nonparametric substitute for the equal-variance *t*-test
when the assumption of normality is not valid. When in doubt about normality,
it is safer to use this test. Two fundamental assumptions of this test are:

- The distributions are at least ordinal in
nature.
- The distributions are identical, except for
location. This means that ties are not acceptable.

This particular test is based on ranks and has good properties (asymptotic relative efficiency) for symmetric distributions.

The Mann-Whitney test statistic, *U*, is defined as the total number of
times a *Y* precedes an *X* in the configuration of combined samples.
It is directly related to the sum of ranks. This is why this test is sometimes
called the *Mann-Whitney U test* and at other times called the *Wilcoxon
Rank-Sum test*. The *Mann-Whitney U test* calculates *U*_{X} and *U*_{Y}:

*U*_{X} = *n*_{X} *n*_{Y} + *n*_{X}(*n*_{X}+1)/2 - *W*_{X}

where *n*_{X} and *n*_{Y} are the two sample sizes and *W*_{X} is the sum of
the ranks of the *X* sample in the combined ranking. The formula for *U*_{Y}
is obtained by replacing *X* with *Y*. When ties occur, average ranks are
assigned; the resulting adjustment to the standard
deviation makes little difference unless there are a lot of ties.
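The rank-sum computation above can be sketched as follows for two small illustrative samples with no ties; note that *U*_{X} + *U*_{Y} = *n*_{X} *n*_{Y} always holds, which the sketch uses to get *U*_{Y}.

```python
def mann_whitney_u(x, y):
    """Return (U_X, U_Y) from ranks in the combined sample (no ties)."""
    combined = sorted(x + y)
    rank = {v: i + 1 for i, v in enumerate(combined)}  # 1-based ranks
    wx = sum(rank[v] for v in x)                       # rank sum of X
    nx, ny = len(x), len(y)
    ux = nx * ny + nx * (nx + 1) / 2 - wx
    uy = nx * ny - ux                                  # U_X + U_Y = nX*nY
    return ux, uy

x = [1, 3, 5, 7]
y = [2, 4, 6, 8]
print(mann_whitney_u(x, y))   # (10.0, 6.0)
```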

*Wilcoxon Signed-Rank Test*

This nonparametric test makes use of the sign and the magnitude of the rank of the differences between two related samples, and utilizes information about both the direction and the relative magnitude of the differences within pairs of variables.

*Sum Ranks (W)*

The basic statistic for this test is the minimum of the sum of the positive
ranks (*SR*+) and the sum of the negative ranks (*SR*-). This
statistic is called *W.*

*W* = Minimum [Σ *R*_{+}, Σ *R*_{-}]

*Mean of W*

This is the mean of the sampling distribution of
the sum of ranks for a sample of *n *items.

μ_{W} = *n*(*n*+1) / 4

*Standard deviation of W*

This is the standard deviation of the sampling distribution of the sum of ranks. The formula is:

σ_{W} = SQRT [*n*(*n*+1)(2*n*+1)/24 - Σ (*t*_{i}^{3} - *t*_{i})/48]

where *t*_{i}
represents the number of times the *i*^{th} value occurs.

*Number of Zeros: *If there are zero differences, they are thrown out
and the number of pairs is reduced by the number of zeros.

*Number of Ties:* The treatment of ties is to assign an average rank
to each member of a particular set of ties. This is the number of sets of ties
that occur in the data.

*Approximations with (and without) Continuity
Correction: Z-Value*

If the sample size is ≥ 15, a normal approximation method may be used to approximate the distribution of the sum of ranks. Although this method does correct for ties, it does not have the continuity correction factor. The *z* value is as follows:

*z* = (*W* - μ_{W}) / σ_{W}

If the correction factor for continuity is used, the formula becomes:

*z* = (|*W* - μ_{W}| - 0.5) / σ_{W}
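The whole procedure — zero differences dropped, absolute differences ranked with average ranks for ties, *W* as the smaller rank sum, tie-corrected σ_{W}, and the continuity-corrected *z* — can be sketched end to end. The paired data are illustrative (and far smaller than the *n* ≥ 15 the approximation calls for).

```python
import math

def ranks(values):
    """Average ranks for tied values, 1-based."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def wilcoxon_z(before, after):
    """Continuity-corrected z for the Wilcoxon signed-rank test."""
    diffs = [b - a for a, b in zip(before, after)]
    diffs = [d for d in diffs if d != 0]           # drop zero differences
    n = len(diffs)
    r = ranks([abs(d) for d in diffs])
    sr_plus = sum(ri for ri, d in zip(r, diffs) if d > 0)
    sr_minus = sum(ri for ri, d in zip(r, diffs) if d < 0)
    w = min(sr_plus, sr_minus)
    mu = n * (n + 1) / 4
    counts = {}                                    # tie-group sizes t_i
    for d in diffs:
        counts[abs(d)] = counts.get(abs(d), 0) + 1
    tie_term = sum((t ** 3 - t) / 48 for t in counts.values())
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24 - tie_term)
    return (abs(w - mu) - 0.5) / sigma

before = [10, 12, 14, 16, 18, 20]
after = [12, 11, 17, 15, 22, 19]
print(wilcoxon_z(before, after))   # ≈ 0.848
```

Here the differences are (2, -1, 3, -1, 4, -1), giving *W* = 6, μ_{W} = 10.5, and a tie correction of (3^{3} - 3)/48 = 0.5 for the three tied differences of magnitude 1.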