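Consider a random sample of n observations (xi1, xi2, ..., xip, yi), i = 1, 2, ..., n.
The p + 1 random variables are assumed to satisfy the linear model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + u_i, \qquad i = 1, 2, \ldots, n$$
where ui are values of an unobserved error term, u, and the unknown parameters β0, β1, ..., βp are constants. The errors are assumed to satisfy
$$E[u_i] = 0, \qquad V[u_i] = \sigma^2$$
Equations relating the n observations can be written in matrix form as:
$$y = X\beta + u$$
where y is the n × 1 vector of observations on the dependent variable, X is the n × (p+1) matrix whose first column consists of ones and whose remaining columns hold the observations on x1, x2, ..., xp, β is the (p+1) × 1 vector of parameters, and u is the n × 1 vector of errors.
The parameters β0, β1, ..., βp can be estimated using the least squares procedure, which minimizes the sum of squares of errors.
Minimizing the sum of squares leads to the following equations (the normal equations), from which the vector b of estimates of β can be computed:
$$X'X\,b = X'y$$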
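As an illustration of the least squares step, the following is a minimal sketch in Python with NumPy. The function name fit_ols and the demonstration data are hypothetical and are not part of IDAMS. It assumes the x variables are not perfectly collinear, so that the system of equations has a unique solution.

```python
import numpy as np

def fit_ols(x, y):
    """Least squares fit of y on the columns of x, with an intercept.

    x is an (n, p) array of predictors, y an array of n responses.
    Returns the p + 1 estimates, intercept first.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend the column of ones that carries the intercept.
    X = np.column_stack([np.ones(len(y)), x])
    # Solve the normal equations (X'X) b = X'y.
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical data generated exactly from y = 1 + 2*x1 + 3*x2.
x_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = 1.0 + 2.0 * x_demo[:, 0] + 3.0 * x_demo[:, 1]
b_demo = fit_ols(x_demo, y_demo)
print(b_demo)
```

Solving the normal equations directly mirrors the derivation; for ill-conditioned data a routine such as numpy.linalg.lstsq is usually the safer choice.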
The problem of multiple regression can be represented geometrically as follows. The n observations (xi1, xi2, ..., xip, yi), i = 1, 2, ..., n, can be visualized as points in a (p+1)-dimensional space. The regression problem is to determine, among the possible hyperplanes in this space, the one that fits the points best. We use the least squares criterion and locate the hyperplane that minimizes the sum of squares of the errors, i.e., the distances, measured along the y axis, between the observed points and the corresponding points on the plane (the estimates ŷ).
ŷ = a+b1x1+b2x2+…+bpxp
Standard error of the estimate
$$s_e = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - p - 1}}$$
where yi = the sample value of the dependent variable
ŷi = corresponding value estimated from the regression equation
n = number of observations
p = number of predictors or independent variables
The denominator of the equation indicates that in multiple regression with p independent variables, the standard error has n-p-1 degrees of freedom. This happens because the degrees of freedom are reduced from n by the p+1 numerical constants a, b1, b2, ..., bp that have been estimated from the sample.
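The calculation can be sketched in a few lines of Python. The function name and the numbers below are hypothetical; the fitted values are made up for the arithmetic and do not come from an actual regression.

```python
import math

def standard_error_of_estimate(y, y_hat, p):
    """Square root of SSE / (n - p - 1), where p is the number of predictors."""
    n = len(y)
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    return math.sqrt(sse / (n - p - 1))

# Hypothetical observed and fitted values for a model with p = 2 predictors.
y_obs = [3.0, 5.0, 7.0, 10.0, 11.0]
y_fit = [3.5, 4.5, 7.5, 9.5, 11.0]
s_e = standard_error_of_estimate(y_obs, y_fit, p=2)
print(s_e)  # SSE = 1.0 on 5 - 2 - 1 = 2 degrees of freedom
```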
The fit of the multiple regression model can be assessed by the coefficient of multiple determination, which is a fraction that represents the proportion of total variation of y that is explained by the regression plane.
Sum of squares due to error
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Sum of squares due to regression
$$SSR = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$
Total sum of squares
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$
SST = SSR + SSE
The ratio SSR/SST represents the proportion of the total variation in y explained by the regression model. This ratio, denoted by R2, is called the coefficient of multiple determination. R2 is sensitive to the magnitudes of n and p in small samples. If p is large relative to n, the model tends to fit the data very well. In the extreme case, if n = p+1, the model would exactly fit the data.
A better goodness of fit measure is the adjusted R2, which is computed as follows:
$$\text{Adjusted } R^2 = 1 - \left(\frac{n - 1}{n - p - 1}\right)(1 - R^2) = 1 - \frac{SSE/(n - p - 1)}{SST/(n - 1)}$$
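A short Python sketch of R2 and the adjusted R2 follows. The function names and the sums of squares are hypothetical values chosen only to make the arithmetic easy to follow.

```python
def r_squared(sse, sst):
    """Coefficient of multiple determination: SSR/SST, written as 1 - SSE/SST."""
    return 1.0 - sse / sst

def adjusted_r_squared(r2, n, p):
    """Adjusted R2 for n observations and p predictors."""
    return 1.0 - (n - 1) / (n - p - 1) * (1.0 - r2)

# Hypothetical sums of squares: SST = 100, SSE = 20, with n = 25 and p = 4.
r2_demo = r_squared(sse=20.0, sst=100.0)
adj_demo = adjusted_r_squared(r2_demo, n=25, p=4)
print(r2_demo, adj_demo)  # about 0.80 and 0.76
```

The adjustment always lowers the value when p is at least 1 and R2 is below 1, and the penalty grows as p approaches n.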
The overall goodness of fit of the regression model (i.e. whether the regression model is at all helpful in predicting the values of y) can be evaluated using an F-test in the format of analysis of variance.
Under the null hypothesis Ho: β1 = β2 = ... = βp = 0, the statistic
$$F = \frac{SSR/p}{SSE/(n - p - 1)}$$
has an F-distribution with p and n-p-1 degrees of freedom.
ANOVA Table for Multiple Regression
Source of Variation    Sum of Squares    Degrees of freedom
Regression             SSR               p
Error                  SSE               n-p-1
Total                  SST               n-1
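The F ratio for the overall test can be computed from the sums of squares as in the sketch below. The function name and the input values are hypothetical. The resulting F still has to be compared with the tabulated F value for p and n-p-1 degrees of freedom; that lookup is not shown here.

```python
def overall_f(ssr, sse, n, p):
    """F ratio for Ho: all p slope coefficients are zero."""
    mean_square_regression = ssr / p
    mean_square_error = sse / (n - p - 1)
    return mean_square_regression / mean_square_error

# Hypothetical values: SSR = 80, SSE = 20, n = 25 observations, p = 4 predictors.
f_demo = overall_f(ssr=80.0, sse=20.0, n=25, p=4)
print(f_demo)  # (80/4) / (20/20)
```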
Whether a particular variable contributes significantly to the regression equation can be tested as follows: For any specific variable xj, we can test the null hypothesis Ho: βj = 0 by computing the statistic
$$t = \frac{b_j}{SE(b_j)}$$
where SE(bj) is the standard error of the estimated coefficient bj, and performing a one or two tailed t-test with n-p-1 degrees of freedom.
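The sketch below computes the coefficients, their standard errors and the t ratios with NumPy. It uses the usual least squares formula for the standard error of a coefficient, namely the standard error of the estimate times the square root of the corresponding diagonal element of the inverse of X'X; this formula is not stated in the text above and is brought in as an assumption. The function name and the data are hypothetical.

```python
import numpy as np

def coefficient_t_stats(x, y):
    """Return (b, se, t) for a least squares fit with an intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = x.shape
    X = np.column_stack([np.ones(n), x])
    xtx_inv = np.linalg.inv(X.T @ X)
    b = xtx_inv @ X.T @ y
    resid = y - X @ b
    # Error variance estimate on n - p - 1 degrees of freedom.
    s2 = resid @ resid / (n - p - 1)
    se = np.sqrt(s2 * np.diag(xtx_inv))
    return b, se, b / se

# Hypothetical balanced design: x1 drives y, x2 does not, and x3 plays the
# role of an unobserved disturbance that is orthogonal to both.
x1 = np.array([-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
x2 = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
x3 = np.array([-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0])
y_demo = 10.0 + 2.0 * x1 + 0.5 * x3
b_demo, se_demo, t_demo = coefficient_t_stats(np.column_stack([x1, x2]), y_demo)
print(b_demo, se_demo, t_demo)
```

In this constructed example the t ratio for x1 is large while the t ratio for x2 is essentially zero, which matches how the data were generated.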
The magnitude of the regression coefficients depends upon the scales of measurement used for the dependent variable y and the explanatory variables included in the regression equation. Unstandardized regression coefficients cannot be compared directly because of differing units of measurements and different variances of the x variables. It is therefore necessary to standardize the variables for meaningful comparisons.
The estimated model
ŷi = b0 + b1xi1 + b2xi2 + ... + bpxip
can be written as:
$$\frac{\hat{y}_i - \bar{y}}{s_y} = \frac{b_1 s_1}{s_y}\left(\frac{x_{i1} - \bar{x}_1}{s_1}\right) + \frac{b_2 s_2}{s_y}\left(\frac{x_{i2} - \bar{x}_2}{s_2}\right) + \cdots + \frac{b_p s_p}{s_y}\left(\frac{x_{ip} - \bar{x}_p}{s_p}\right)$$
The expressions in the parentheses are standardized variables; the b's are unstandardized regression coefficients; s1, s2, ..., sp are the standard deviations of the variables x1, x2, ..., xp; and sy is the standard deviation of the variable y. The coefficients (bjsj)/sy, j = 1, 2, ..., p, are called standardized regression coefficients. The standardized regression coefficient measures the impact of a unit change in the standardized value of xj on the standardized value of y. The larger the magnitude of the standardized bj, the more xj contributes to the prediction of y. However, the regression equation itself should be reported in terms of the unstandardized regression coefficients so that prediction of y can be made directly from the x variables.
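A minimal NumPy sketch of the conversion from unstandardized to standardized coefficients is given below. The function name and the data are hypothetical; the second predictor is deliberately recorded on a much larger scale than the first.

```python
import numpy as np

def standardized_coefficients(x, y):
    """Rescale each unstandardized slope b_j by s_j / s_y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    s_x = x.std(axis=0, ddof=1)   # sample standard deviations of the x variables
    s_y = y.std(ddof=1)           # sample standard deviation of y
    return b[1:] * s_x / s_y

# Hypothetical data with predictors measured in very different units.
x_demo = np.array([[1.0, 2000.0], [2.0, 1000.0], [3.0, 5000.0],
                   [4.0, 3000.0], [5.0, 8000.0]])
y_demo = np.array([3.1, 4.9, 9.2, 8.8, 15.1])
beta_demo = standardized_coefficients(x_demo, y_demo)
print(beta_demo)
```

The same values are obtained by first converting y and every x to standard scores and then fitting the regression, which is a convenient check on the calculation.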
The multiple correlation coefficient, R, is a measure of the strength of the linear relationship between y and the set of variables x1, x2, ..., xp. It is the highest possible simple correlation between y and any linear combination of x1, x2, ..., xp. This property explains why the computed value of R is never negative. In this sense, the least squares regression plane maximizes the correlation between the x variables and the dependent variable y. Hence, R represents a measure of how well the regression equation fits the data. When the value of the multiple correlation R is close to zero, the regression equation barely predicts y better than sheer chance. A value of R close to 1 indicates a very good fit.
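Since the fitted value ŷ is itself the best linear combination of the x variables, R can be obtained as the simple correlation between y and ŷ. The sketch below does this with NumPy; the function name and the data are hypothetical.

```python
import numpy as np

def multiple_r(x, y):
    """Multiple correlation: simple correlation between y and the fitted values."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    y_hat = X @ b
    return np.corrcoef(y, y_hat)[0, 1]

# Hypothetical data.
x_demo = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_demo = np.array([3.1, 4.9, 9.2, 8.8, 15.1])
r_demo = multiple_r(x_demo, y_demo)
print(r_demo, r_demo ** 2)  # R and the coefficient of multiple determination
```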
A useful approach to study the relationship between two variables x and y in the presence of a third variable z is to determine the correlation between x and y after controlling for the effect of z. This correlation is called partial correlation. More generally, partial correlation is the correlation of two variables while controlling for one or more other variables. For example, r12.34 is the correlation of variables 1 and 2, controlling for variables 3 and 4. If the partial correlation r12.34 is equal to the uncontrolled correlation r12, it implies that the control variables have no effect on the relationship between variables 1 and 2. If the partial correlation is nearly equal to zero, it implies that the correlation between the original variables is spurious.
Partial correlation coefficient is a measure of the linear association between two variables after adjusting for the linear effect of a group of other variables. If the number of other variables is equal to 1, the partial correlation coefficient is called the first order coefficient. If the number of other variables is equal to 2, the partial correlation coefficient is called the second order coefficient, and so on.
First order Partial Correlation
The first order partial correlation between xi and xj holding constant xl is computed by the following formula
$$r_{ij.l} = \frac{r_{ij} - r_{il}\, r_{jl}}{\sqrt{(1 - r_{il}^2)(1 - r_{jl}^2)}}$$
where rij, ril and rjl are zero order (Pearson's r) correlation coefficients.
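Second order Partial Correlation
The second order partial correlation between xi and xj holding constant xl and xm is computed by the following formula:
$$r_{ij.lm} = \frac{r_{ij.l} - r_{im.l}\, r_{jm.l}}{\sqrt{(1 - r_{im.l}^2)(1 - r_{jm.l}^2)}}$$
where rij.l, rim.l and rjm.l are first order partial correlation coefficients.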
Statistical significance of partial correlation coefficients can be tested by using a test statistic similar to the one for the simple correlation coefficient:
$$t = \frac{r\,\sqrt{n - q - 2}}{\sqrt{1 - r^2}}$$
where r is the partial correlation coefficient and q is the number of variables held constant. The value of t is compared with the tabulated t for n-q-2 degrees of freedom.
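The recursion from zero order to first order to second order coefficients, and the t statistic, can be sketched in plain Python as follows. The function names and the correlation values are hypothetical; all six pairwise correlations among four variables are set to 0.5 so that the arithmetic is easy to follow.

```python
import math

def partial_r(r_ij, r_il, r_jl):
    """Remove one control variable l from the correlation between i and j.

    Works at any order: pass zero order coefficients to obtain a first order
    partial, or first order partials (all given l) to obtain a second order one.
    """
    return (r_ij - r_il * r_jl) / math.sqrt((1 - r_il ** 2) * (1 - r_jl ** 2))

def partial_r_t(r, n, q):
    """t statistic for a partial correlation with q variables held constant."""
    return r * math.sqrt(n - q - 2) / math.sqrt(1 - r ** 2)

# Hypothetical case: every pairwise correlation among variables 1..4 is 0.5.
r = 0.5
r12_3 = partial_r(r, r, r)               # first order: r12.3
r14_3 = partial_r(r, r, r)               # r14.3
r24_3 = partial_r(r, r, r)               # r24.3
r12_34 = partial_r(r12_3, r14_3, r24_3)  # second order: r12.34
t_demo = partial_r_t(r12_34, n=30, q=2)
print(r12_3, r12_34, t_demo)
```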
In practice, the problem of multicollinearity occurs when some of the x variables are highly correlated. Multicollinearity can have a significant impact on the quality and stability of the fitted regression model. A common approach to the multicollinearity problem is to omit explanatory variables. For example, if x1 and x2 are highly correlated (say the correlation is greater than 0.9), then the simplest approach would be to use only one of them, since one variable conveys essentially all the information in the other variable.
The simplest method for detecting multicollinearity is the correlation matrix, which can be used to detect if there are large correlations between pairs of explanatory variables.
When more subtle patterns of correlation coefficients exist, the determinant of the correlation matrix computed by IDAMS can be used to detect multicollinearity. The determinant of the correlation matrix represents as a single number the generalized variance in the set of predictor variables, and varies from 0 to 1. The value of the determinant near zero indicates that some or all explanatory variables are highly correlated. The value of the determinant equal to zero indicates a singular matrix, which indicates that at least one of the predictors is a linear function of one or more other predictors.
Another approach is to compute the ‘tolerance’ associated with a predictor. The tolerance of xi is defined as 1 minus the squared multiple correlation between that xi and the remaining x variables. When tolerance is small, say less than 0.01, then it would be expedient to discard the variable with the smallest tolerance. The inverse of the tolerance is called the variance inflation factor (VIF).
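The three diagnostics (determinant of the correlation matrix, tolerance and VIF) can be sketched with NumPy as below. The sketch relies on a standard identity that is not stated in the text: the VIF of each predictor equals the corresponding diagonal element of the inverse of the predictor correlation matrix. The function name and the data are hypothetical, and the code assumes the correlation matrix is not exactly singular.

```python
import numpy as np

def collinearity_diagnostics(x):
    """Return (determinant, tolerances, VIFs) for the columns of x."""
    x = np.asarray(x, dtype=float)
    corr = np.corrcoef(x, rowvar=False)   # correlation matrix of the predictors
    det = np.linalg.det(corr)
    vif = np.diag(np.linalg.inv(corr))    # VIF_j = j-th diagonal of corr^-1
    tolerance = 1.0 / vif                 # 1 - R2 of x_j on the other x's
    return det, tolerance, vif

# Hypothetical predictors: the third column is almost x1 + x2.
x_demo = np.array([
    [1.0, 2.0, 3.2],
    [2.0, 1.0, 2.9],
    [3.0, 4.0, 7.0],
    [4.0, 3.0, 7.1],
    [5.0, 6.0, 10.8],
    [6.0, 5.0, 11.1],
])
det_demo, tol_demo, vif_demo = collinearity_diagnostics(x_demo)
print(det_demo, tol_demo, vif_demo)
```

A determinant close to zero together with very small tolerances points to the near-linear dependence that was built into the third column.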
Stepwise regression is a sequential process for fitting the least squares model, where at each step a single explanatory variable is either added to or removed from the model in the next fit.
The most commonly used criterion for the addition or deletion of variables in stepwise regression is based on the partial F-statistic:
$$F = \frac{(SSR_{Full} - SSR_{Reduced})/q}{SSE_{Full}/(n - p - 1)}$$
The suffix 'Full' refers to the larger model with p explanatory variables, whereas the suffix 'Reduced' refers to the reduced model with (p-q) explanatory variables, so that q is the number of variables by which the two models differ.
The forward selection procedure begins with no explanatory variable in the model and sequentially adds a variable according to the criterion of the partial F-statistic. At each step, the variable added is the one whose partial F-statistic yields the smallest p-value. Variables are entered as long as the partial F-statistic p-value remains below a specific maximum value (PIN). The procedure stops when the addition of any of the remaining variables yields a partial p-value > PIN. This procedure has two limitations. Some of the variables never get into the model and hence their importance is never determined. Another limitation is that a variable once included in the model remains there throughout the process, even if it loses its stated significance after the inclusion of other variable(s).
The backward elimination procedure begins with all the variables in the model and proceeds by eliminating the least useful variable, one at a time. At each step, the variable with the largest partial F p-value is the least useful one; it is removed from the regression model if that p-value is greater than a prescribed value, POUT. The process continues until no variable can be removed according to the elimination criterion.
The stepwise procedure is a modified forward selection method which later in the process permits the elimination of variables that become statistically non-significant. At each step of the process, the p-values are computed for all variables in the model. If the largest of these p-values > POUT, then that variable is eliminated. After the included variables have been examined for exclusion, the excluded variables are re-examined for inclusion. At each step of the process, there can be at most one exclusion, followed by one inclusion. It is necessary that PIN ≤ POUT to avoid infinite cycling of the process.
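The forward selection logic can be sketched as follows. This is a simplified illustration, not the IDAMS implementation: to avoid needing the F distribution, it swaps the p-value threshold PIN for a fixed "F-to-enter" threshold on the partial F statistic itself, and it omits the removal step of the full stepwise procedure. The gain in SSR from adding a variable is computed as the drop in SSE, which is the same quantity because SST is fixed. All names and data are hypothetical.

```python
import numpy as np

def sse_of_fit(columns, y):
    """Residual sum of squares of y on an intercept plus the given columns."""
    X = np.column_stack([np.ones(len(y))] + columns)
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return float(resid @ resid)

def forward_select(x, y, f_to_enter=4.0):
    """Add one variable at a time while the best partial F reaches f_to_enter."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, k = x.shape
    selected = []
    sse_reduced = sse_of_fit([], y)          # intercept-only model
    while len(selected) < k:
        best = None
        for j in range(k):
            if j in selected:
                continue
            cols = [x[:, m] for m in selected + [j]]
            sse_full = sse_of_fit(cols, y)
            df_error = n - len(cols) - 1
            # Partial F with q = 1 (assumes the full model is not a perfect fit).
            f = (sse_reduced - sse_full) / (sse_full / df_error)
            if best is None or f > best[0]:
                best = (f, j, sse_full)
        if best[0] < f_to_enter:
            break
        selected.append(best[1])
        sse_reduced = best[2]
    return selected

# Hypothetical balanced design: y depends on columns 0 and 2 but not on column 1.
x0 = np.array([-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0])
x1 = np.array([-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0])
x2 = np.array([-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0])
noise = 0.1 * x0 * x1 * x2                   # small disturbance
y_demo = 5.0 + 2.0 * x0 - 3.0 * x2 + noise
selected_demo = forward_select(np.column_stack([x0, x1, x2]), y_demo)
print(selected_demo)  # column indices in order of entry
```

In this constructed example column 0 does not pass the threshold on its own at the first step but enters at the second step, once column 2 has absorbed most of the variation in y.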
Sometimes, explanatory variables for inclusion in a regression model are not interval scale; they may be nominal or ordinal variables. Such variables can be used in the regression model by creating ‘dummy’ (or indicator) variables.
Dichotomous variables do not cause the regression estimates to lose any of their properties. Since they have only two categories, they manage to 'trick' least squares, entering the regression equation as if they were interval scale variables with just two values.
Consider, for example, the relationship between income and gender:
y = a + bx
where
y = income of an individual, and
x = a dichotomous variable, coded as
0 if female
1 otherwise
The estimated value of y is
ŷ =a if x = 0
ŷ=a+b if x = 1
Since our best estimate for a given sample is the sample mean, a is estimated as the average income for females and a+b is estimated as the average income for males. The regression coefficient b is therefore
$$b = \bar{y}_{male} - \bar{y}_{female}$$
In effect, females are considered as the reference group and males’ income is measured by how much it differs from females’ income.
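This result can be checked numerically with a simple regression on a 0/1 variable. The sketch below is plain Python; the function name and the income figures are hypothetical.

```python
def simple_regression(x, y):
    """Least squares intercept a and slope b for y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Hypothetical incomes: three females (x = 0) and four males (x = 1).
gender = [0, 0, 0, 1, 1, 1, 1]
income = [30.0, 34.0, 32.0, 40.0, 44.0, 38.0, 42.0]
a_demo, b_demo = simple_regression(gender, income)
print(a_demo, b_demo)  # female mean, and male mean minus female mean
```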
Consider, for example, the relationship between the time spent by an academic scientist on teaching and his rank.
y = a+bx
y is the percentage of work time spent on teaching
x is a polytomous variable ‘rank’ with three modalities:
1 = Professor
2 = Reader
3 = Lecturer
We create two dummy variables:
X1 = 1 if rank = Professor
0 if otherwise
X2 = 1 if rank = Reader
0 if otherwise
Note that we have created two dummy variables to represent a trichotomous variable. If we create a third dummy variable X3 (score 1 if rank = Lecturer, and 0 otherwise), the parameters of the regression equation cannot be estimated uniquely. This is because if the score of any respondent on X1 and X2 is known, it would always be possible to predict his score on X3. For example, if a respondent has score 0 on X1 (not Professor) and 0 on X2 (not Reader), then the respondent is certainly a Lecturer (i.e., score 1 on X3). This represents a situation of perfect multicollinearity. Hence the general rule for creating dummy variables is: Number of dummy variables = Number of modalities minus 1.
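The effect of adding the third dummy variable can be seen from the rank of the design matrix, as in the NumPy sketch below. The helper name and the list of respondents are hypothetical.

```python
import numpy as np

def dummy_columns(values, categories):
    """One 0/1 column per listed category."""
    return np.array([[1.0 if v == c else 0.0 for c in categories] for v in values])

# Hypothetical respondents.
ranks = ["Professor", "Reader", "Lecturer", "Reader", "Lecturer", "Professor"]
ones = np.ones((len(ranks), 1))

# Two dummies (Lecturer is the reference group): columns are independent.
X_ok = np.hstack([ones, dummy_columns(ranks, ["Professor", "Reader"])])
# Three dummies: X1 + X2 + X3 equals the constant column.
X_bad = np.hstack([ones, dummy_columns(ranks, ["Professor", "Reader", "Lecturer"])])

rank_ok = np.linalg.matrix_rank(X_ok)
rank_bad = np.linalg.matrix_rank(X_bad)
print(X_ok.shape[1], rank_ok)    # 3 columns, rank 3
print(X_bad.shape[1], rank_bad)  # 4 columns, rank 3
```

With four columns but rank three, X'X cannot be inverted, which is the numerical counterpart of the statement that the parameters cannot be estimated uniquely.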
Statistical significance of regression coefficients and Multiple R2 is determined in the same way as for interval scale explanatory variables.