4.1 Correlation

Correlation is a measure of the strength of the linear relationship between two random variables. The (population) correlation between two variables X and Y is defined as:

$$\rho(X, Y) = \frac{\mathrm{Covariance}(X, Y)}{\{\mathrm{Variance}(X)\,\mathrm{Variance}(Y)\}^{1/2}}$$

where

$$\mathrm{Covariance}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

and $\mu_X$ and $\mu_Y$ are the expected values of X and Y respectively.

$\rho$ is called the Product Moment Correlation Coefficient or simply the Correlation Coefficient. If X and Y tend to increase together, $\rho$ is positive. If, on the other hand, one tends to increase as the other tends to decrease, $\rho$ is negative. The value of the correlation coefficient lies between -1 and +1, inclusive.

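The sample correlation of a set of N bivariate observations $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ is given by

$$r = \frac{1}{N} \sum_{i=1}^{N} \frac{(X_i - \bar{X})(Y_i - \bar{Y})}{S_X S_Y}$$

where

$\bar{X}$ is the mean value of X,

$\bar{Y}$ is the mean value of Y,

$S_X$ is the standard deviation of X, and

$S_Y$ is the standard deviation of Y.

Here $S_X$ and $S_Y$ are taken with divisor N. Using N - 1 in both the sum and the standard deviations gives the same value of r.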
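The sample correlation can be computed directly from the two means and standard deviations. The following is a minimal Python sketch, not a reference implementation: the function name `sample_correlation` and the data are made up for illustration, and the divisor N is assumed for both the cross-product sum and the standard deviations.

```python
import math

def sample_correlation(xs, ys):
    """Sample product moment correlation of paired observations."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Standard deviations with divisor N (assumed; must match the divisor below)
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / n)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / n)
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    # Average cross-product of deviations from the means
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / n
    return cov / (s_x * s_y)

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
r = sample_correlation(x, y)
print(round(r, 4))  # 0.7746
```

For these data the sum of cross-products is 6 and the sums of squares are 10 and 6, so r = 6 / sqrt(60), which is about 0.77.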

The coefficient r satisfies the inequality $-1 \le r \le +1$. Equality is achieved only if all the points in the scatter plot of X and Y lie exactly on a straight line. Because r measures only linear association, it should be used as a summary only if the relationship between X and Y is linear.

$|r| \approx 1$: Strong linear correlation between X and Y.

$r \approx 0$: It must not be concluded that there is no relationship between X and Y. The scatter plot should be examined. For example, if the points follow a parabola that is symmetric about the centre of the X values, r is approximately equal to zero even though Y depends strongly on X.
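A small numerical check makes the parabolic case concrete. The sketch below uses made-up data in which Y is exactly the square of X and the X values are symmetric about zero; the variable names are illustrative only.

```python
import math

# Made-up data: Y is an exact (non-linear) function of X
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x * x for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Numerator of r: sum of cross-products of deviations from the means
cross = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
# Sums of squared deviations for the denominator
ss_x = sum((x - x_bar) ** 2 for x in xs)
ss_y = sum((y - y_bar) ** 2 for y in ys)

r = cross / math.sqrt(ss_x * ss_y)
print(r)
```

The positive and negative cross-products cancel exactly, so r is zero although Y is completely determined by X.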

Numerically, r can be interpreted as the average product of the X and Y coordinates of the scatter plot of the standardized data. If points whose X and Y coordinates have the same sign (both positive or both negative) predominate in the scatter plot, r is positive; if points whose X and Y coordinates have opposite signs predominate, r is negative.
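The "average product" interpretation can be sketched in a few lines of Python. The helper `standardize` and the data are hypothetical, and the standard deviation is assumed to use divisor N so that the plain average of the products equals r.

```python
import math

def standardize(values):
    """Subtract the mean and divide by the standard deviation (divisor N)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

# Made-up illustrative data
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [4.0, 1.0, 3.0, 2.0, 5.0]

zx, zy = standardize(x), standardize(y)
products = [a * b for a, b in zip(zx, zy)]

# r is the average of the standardized cross-products
r = sum(products) / len(products)

# Contributions from same-sign points and from opposite-sign points
positive_part = sum(p for p in products if p > 0)
negative_part = sum(p for p in products if p < 0)
print(r, positive_part, negative_part)
```

Here the same-sign points contribute more in total than the opposite-sign points, so r comes out positive (about 0.3).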

If it is assumed that $(X_1, Y_1), (X_2, Y_2), \ldots, (X_N, Y_N)$ are N independent observations with the same bivariate distribution, r can be used to estimate the population correlation, $\rho$.

To make inferences about the population correlation $\rho$ using the sample correlation r, we require the sampling distribution of r, which is quite complex. When $\rho = 0$ and (X, Y) is bivariate normal, the statistic:

$$t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}$$

has a t-distribution with N-2 degrees of freedom, and can be used to test the null hypothesis [$\rho = 0$].
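As a rough illustration, the test statistic for the hypothesis of zero population correlation can be computed as below. This is a sketch using the standard form r sqrt(N - 2) / sqrt(1 - r^2); the function name and the input values are invented for the example.

```python
import math

def correlation_t_statistic(r, n):
    """t statistic for testing zero population correlation.

    r is the sample correlation and n the number of bivariate observations.
    """
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1.0 < r < 1.0:
        raise ValueError("statistic is undefined when |r| = 1")
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

# Made-up example: r = 0.6 observed from N = 18 pairs
t = correlation_t_statistic(0.6, 18)
degrees_of_freedom = 18 - 2
print(t, degrees_of_freedom)
```

With these values t is about 3.0 on 16 degrees of freedom. The two-sided 5% critical value of the t-distribution with 16 degrees of freedom is about 2.12, so the null hypothesis would be rejected at that level.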

If X and Y are not bivariate normal but $\rho = 0$, the same statistic has approximately a standard normal distribution in large samples. Thus, in large samples, the statistic can be used to test the null hypothesis [$\rho = 0$] even if the joint distribution is not bivariate normal.
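For large samples, the statistic can be referred to the standard normal distribution instead of the t-distribution. The sketch below assumes the statistic has the usual form r sqrt(N - 2) / sqrt(1 - r^2) and uses `NormalDist` from Python's standard `statistics` module; the function name and the numbers are illustrative.

```python
import math
from statistics import NormalDist

def large_sample_p_value(r, n):
    """Two-sided p-value for zero correlation via the normal approximation."""
    stat = r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)
    # Probability of a standard normal value at least as extreme as stat
    p = 2.0 * (1.0 - NormalDist().cdf(abs(stat)))
    return stat, p

# Made-up example: a weak correlation observed in a large sample
stat, p = large_sample_p_value(0.1, 402)
print(stat, p)
```

In this example a sample correlation of only 0.1 gives a statistic of about 2.01, which illustrates that in large samples even a weak correlation can differ significantly from zero.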

If the assumption of bivariate normality is not satisfied by the data, it may be possible to apply a preliminary transformation that brings the data closer to bivariate normality. However, the effect of such a transformation on subsequent procedures involving the correlation coefficient is difficult to assess. An alternative procedure is to compute a non-parametric correlation coefficient.