Consider a random sample of n observations (x_{i1}, x_{i2}, …, x_{ip}, y_{i}), i = 1, 2, …, n.
The p + 1 random variables are assumed to satisfy the linear model
y_{i} = b_{0} + b_{1}x_{i1} + b_{2}x_{i2} + … + b_{p}x_{ip} + u_{i},  i = 1, 2, …, n
where u_{i} are values of an unobserved error term, u, and the unknown parameters b_{0}, b_{1}, …, b_{p} are constants.
E[u_{i}] = 0,  V[u_{i}] = σ^{2}
Equations relating the n observations can be written in matrix form as:

y = Xb + u

where y is the n × 1 vector of observations on the dependent variable, X is the n × (p + 1) matrix containing a column of 1s and the observations on the p explanatory variables, b is the (p + 1) × 1 vector of parameters, and u is the n × 1 vector of errors.
The parameters b_{0}, b_{1}, …, b_{p} can be estimated using the least squares procedure, which minimizes the sum of squares of the errors.
Minimizing the sum of squares leads to the following normal equations, from which the values of b can be computed:

X'Xb = X'y
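As a minimal numerical sketch, the normal equations can be solved directly with NumPy; the data below are entirely made up for illustration:

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 explanatory variables.
# The first column of 1s corresponds to the intercept b0.
X = np.array([[1.0,  2.0, 1.0],
              [1.0,  3.0, 2.0],
              [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0],
              [1.0,  8.0, 4.0],
              [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])

# Normal equations: (X'X) b = X'y
b = np.linalg.solve(X.T @ X, X.T @ y)

# The same answer from the library's least squares routine
b_check, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)
```

Solving the normal equations directly is fine for illustration; numerically, `lstsq` (which uses an orthogonal decomposition) is the preferred route in practice.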
The problem of multiple regression can be represented geometrically as follows. The n observations (x_{i1}, x_{i2}, …, x_{ip}, y_{i}), i = 1, 2, …, n, can be visualized as points in a (p + 1)-dimensional space. The regression problem is to determine the hyperplane in this space that best fits the points. Using the least squares criterion, we locate the hyperplane that minimizes the sum of squared distances between the points (the observations) and the corresponding points on the plane (i.e. the estimates ŷ):
ŷ = a+b_{1}x_{1}+b_{2}x_{2}+…+b_{p}x_{p}
Standard error of the estimate
S_{e} = √[ Σ (y_{i} - ŷ_{i})^{2} / (n - p - 1) ]
where y_{i} = the sample value of the dependent variable
ŷ_{i }= corresponding value estimated from the regression equation
n = number of observations
p = number of predictors or independent variables
The denominator of the equation indicates that in multiple regression with p independent variables, the standard error has n - p - 1 degrees of freedom. This is because the degrees of freedom are reduced from n by the p + 1 numerical constants a, b_{1}, b_{2}, …, b_{p} that have been estimated from the sample.
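The standard error of the estimate is easy to compute from the residuals of a fitted model; a sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data: n = 6 observations, p = 2 predictors (plus intercept column)
X = np.array([[1.0,  2.0, 1.0], [1.0,  3.0, 2.0], [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0], [1.0,  8.0, 4.0], [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])

n, k = X.shape
p = k - 1                                  # number of predictors
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sse = np.sum((y - y_hat) ** 2)             # sum of squared errors
s_e = np.sqrt(sse / (n - p - 1))           # standard error, n - p - 1 df
print(s_e)
```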
The fit of the multiple regression model can be assessed by the coefficient of multiple determination, the fraction of the total variation of y that is explained by the regression plane.
Sum of squares due to error

SSE = Σ (y_{i} - ŷ_{i})^{2}

Sum of squares due to regression

SSR = Σ (ŷ_{i} - ȳ)^{2}

Total sum of squares

SST = Σ (y_{i} - ȳ)^{2}
Obviously,
SST = SSR + SSE
The ratio SSR/SST represents the proportion of the total variation in y explained by the regression model. This ratio, denoted by R^{2}, is called the coefficient of multiple determination. R^{2} is sensitive to the magnitudes of n and p in small samples. If p is large relative to n, the model tends to fit the data very well. In the extreme case, if n = p+1, the model would exactly fit the data.
A better goodness of fit measure is the adjusted R^{2}, which is computed as follows:
Adjusted R^{2} = 1 - [(n - 1)/(n - p - 1)] (1 - R^{2})

             = 1 - [SSE/(n - p - 1)] / [SST/(n - 1)]
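Both R^{2} and the adjusted R^{2} follow directly from the sums of squares; a sketch with hypothetical data:

```python
import numpy as np

X = np.array([[1.0,  2.0, 1.0], [1.0,  3.0, 2.0], [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0], [1.0,  8.0, 4.0], [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])
n, k = X.shape
p = k - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sse = np.sum((y - y_hat) ** 2)       # sum of squares due to error
sst = np.sum((y - y.mean()) ** 2)    # total sum of squares
ssr = sst - sse                      # sum of squares due to regression

r2 = ssr / sst
adj_r2 = 1 - (n - 1) / (n - p - 1) * (1 - r2)
print(r2, adj_r2)
```

Note that the adjusted R^{2} is never larger than R^{2}, since the factor (n - 1)/(n - p - 1) is at least 1.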
The overall goodness of fit of the regression model (i.e. whether the regression model is at all helpful in predicting the values of y) can be evaluated using an F-test in the format of an analysis of variance.
Under the null hypothesis H_{0}: β_{1} = β_{2} = … = β_{p} = 0, the statistic

F = MSR/MSE = [SSR/p] / [SSE/(n - p - 1)]

has an F-distribution with p and (n - p - 1) degrees of freedom.
ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares   Degrees of Freedom   Mean Squares            F ratio
Regression            SSR              p                    MSR = SSR/p             MSR/MSE
Error                 SSE              n - p - 1            MSE = SSE/(n - p - 1)
Total                 SST              n - 1
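The entries of the ANOVA table can be computed directly from the fitted model; continuing with the same hypothetical data:

```python
import numpy as np

X = np.array([[1.0,  2.0, 1.0], [1.0,  3.0, 2.0], [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0], [1.0,  8.0, 4.0], [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])
n, k = X.shape
p = k - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
sse = np.sum((y - X @ b) ** 2)
sst = np.sum((y - y.mean()) ** 2)
ssr = sst - sse

msr = ssr / p              # mean square due to regression
mse = sse / (n - p - 1)    # mean square due to error
F = msr / mse              # compare with F(p, n - p - 1)
print(F)
```

If SciPy is available, the corresponding p-value can be obtained from the F(p, n - p - 1) distribution, e.g. `scipy.stats.f.sf(F, p, n - p - 1)`.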
Whether a particular variable contributes significantly to the regression equation can be tested as follows. For any specific variable x_{i}, we can test the null hypothesis H_{0}: β_{i} = 0 by computing the statistic

t = b_{i} / s(b_{i})

where s(b_{i}) is the estimated standard error of b_{i}, and performing a one- or two-tailed t-test with (n - p - 1) degrees of freedom.
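The coefficient standard errors can be obtained from the diagonal of MSE · (X'X)^{-1}; a sketch with hypothetical data:

```python
import numpy as np

X = np.array([[1.0,  2.0, 1.0], [1.0,  3.0, 2.0], [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0], [1.0,  8.0, 4.0], [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])
n, k = X.shape
p = k - 1

b, *_ = np.linalg.lstsq(X, y, rcond=None)
mse = np.sum((y - X @ b) ** 2) / (n - p - 1)

# Estimated covariance of b is MSE * (X'X)^{-1};
# the standard errors are the square roots of its diagonal
cov_b = mse * np.linalg.inv(X.T @ X)
se_b = np.sqrt(np.diag(cov_b))
t_stats = b / se_b          # compare with t distribution, n - p - 1 df
print(t_stats)
```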
The magnitude of the regression coefficients depends upon the scales of measurement used for the dependent variable y and the explanatory variables included in the regression equation. Unstandardized regression coefficients cannot be compared directly because of differing units of measurement and different variances of the x variables. It is therefore necessary to standardize the variables for meaningful comparisons.
The estimated model
ŷ_{i} = b_{0} + b_{1}x_{i1} + b_{2}x_{i2} + … + b_{p}x_{ip}
can be written as:

(ŷ_{i} - ȳ)/s_{y} = (b_{1}s_{1}/s_{y}) (x_{i1} - x̄_{1})/s_{1} + (b_{2}s_{2}/s_{y}) (x_{i2} - x̄_{2})/s_{2} + … + (b_{p}s_{p}/s_{y}) (x_{ip} - x̄_{p})/s_{p}
The expressions in the parentheses are standardized variables; the b's are unstandardized regression coefficients; s_{1}, s_{2}, …, s_{p} are the standard deviations of the variables x_{1}, x_{2}, …, x_{p}; and s_{y} is the standard deviation of the variable y. The coefficients (b_{i}s_{i})/s_{y}, i = 1, 2, …, p, are called standardized regression coefficients. The standardized regression coefficient measures the impact of a unit change in the standardized value of x_{i} on the standardized value of y. The larger the magnitude of the standardized coefficient, the more x_{i} contributes to the prediction of y. However, the regression equation itself should be reported in terms of the unstandardized regression coefficients, so that predictions of y can be made directly from the x variables.
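Standardized coefficients can be obtained either by rescaling the unstandardized b's or by refitting on z-scored variables; the two routes agree. A sketch with hypothetical data:

```python
import numpy as np

X = np.array([[1.0,  2.0, 1.0], [1.0,  3.0, 2.0], [1.0,  5.0, 2.0],
              [1.0,  7.0, 3.0], [1.0,  8.0, 4.0], [1.0, 10.0, 5.0]])
y = np.array([4.0, 6.0, 7.0, 10.0, 12.0, 15.0])

b, *_ = np.linalg.lstsq(X, y, rcond=None)

s_x = X[:, 1:].std(axis=0, ddof=1)   # standard deviations of x1 ... xp
s_y = y.std(ddof=1)
beta = b[1:] * s_x / s_y             # standardized coefficients (b_i * s_i) / s_y

# Cross-check: regress z-scored y on z-scored x's
Zx = (X[:, 1:] - X[:, 1:].mean(axis=0)) / s_x
zy = (y - y.mean()) / s_y
A = np.column_stack([np.ones(len(zy)), Zx])
beta_check = np.linalg.lstsq(A, zy, rcond=None)[0][1:]
print(beta)
```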
The multiple correlation coefficient, R, is a measure of the strength of the linear relationship between y and the set of variables x_{1}, x_{2}, …, x_{p}. It is the highest possible simple correlation between y and any linear combination of x_{1}, x_{2}, …, x_{p}. This property implies that the computed value of R is never negative. In this sense, the least squares regression plane maximizes the correlation between the x variables and the dependent variable y. Hence, it represents a measure of how well the regression equation fits the data. When the value of the multiple correlation R is close to zero, the regression equation barely predicts y better than sheer chance. A value of R close to 1 indicates a very good fit.
A useful approach to study the relationship between two variables x and y in the presence of a third variable z is to determine the correlation between x and y after controlling for the effect of z. This correlation is called partial correlation. Partial correlation is the correlation of two variables while controlling for a third variable or for several other variables. For example, r_{12.34} is the correlation of variables 1 and 2, controlling for variables 3 and 4. If the partial correlation r_{12.34} is equal to the uncontrolled correlation r_{12}, it implies that the control variables have no effect on the relationship between variables 1 and 2. If the partial correlation is nearly equal to zero, it implies that the correlation between the original variables is spurious.
Partial correlation coefficient is a measure of the linear association between two variables after adjusting for the linear effect of a group of other variables. If the number of other variables is equal to 1, the partial correlation coefficient is called the first order coefficient. If the number of other variables is equal to 2, the partial correlation coefficient is called the second order coefficient, and so on.
First order Partial Correlation
The first order partial correlation between x_{i} and x_{j}, holding x_{l} constant, is computed by the following formula:

r_{ij.l} = (r_{ij} - r_{il}r_{jl}) / √[(1 - r_{il}^{2})(1 - r_{jl}^{2})]

where r_{ij}, r_{il} and r_{jl} are zero order correlation coefficients (Pearson's r).
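The first order formula is equivalent to correlating the residuals of the two variables after each has been regressed on the control variable; a sketch with simulated data (the data-generating model is invented):

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=200)
x = z + rng.normal(size=200)    # x and y are related mainly through z
y = z + rng.normal(size=200)

def corr(a, b):
    return np.corrcoef(a, b)[0, 1]

r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)

# First order partial correlation of x and y, holding z constant
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Equivalent route: correlate the residuals after regressing each variable on z
res_x = x - np.polyval(np.polyfit(z, x, 1), z)
res_y = y - np.polyval(np.polyfit(z, y, 1), z)
print(r_xy_z, corr(res_x, res_y))
```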
Second order Partial Correlation
The second order partial correlation between x_{i} and x_{j}, holding x_{l} and x_{m} constant, is computed by the following formula:

r_{ij.lm} = (r_{ij.l} - r_{im.l}r_{jm.l}) / √[(1 - r_{im.l}^{2})(1 - r_{jm.l}^{2})]

where r_{ij.l}, r_{im.l} and r_{jm.l} are first order partial correlation coefficients.
Statistical significance of partial correlation coefficients can be tested using a test statistic similar to the one for the simple correlation coefficient:

t = r √(n - q - 2) / √(1 - r^{2})

where r is the partial correlation coefficient being tested and q is the number of variables held constant. The value of t is compared with the tabulated t for (n - q - 2) degrees of freedom.
In practice, the problem of multicollinearity occurs when some of the x variables are highly correlated. Multicollinearity can have a significant impact on the quality and stability of the fitted regression model. A common approach to the multicollinearity problem is to omit explanatory variables. For example, if x_{1} and x_{2} are highly correlated (say the correlation is greater than 0.9), then the simplest approach would be to use only one of them, since one variable conveys essentially all the information in the other.
The simplest method for detecting multicollinearity is the correlation matrix, which can be used to detect if there are large correlations between pairs of explanatory variables.
When more subtle patterns of correlation coefficients exist, the determinant of the correlation matrix computed by IDAMS can be used to detect multicollinearity. The determinant of the correlation matrix represents as a single number the generalized variance in the set of predictor variables, and varies from 0 to 1. The value of the determinant near zero indicates that some or all explanatory variables are highly correlated. The value of the determinant equal to zero indicates a singular matrix, which indicates that at least one of the predictors is a linear function of one or more other predictors.
Another approach is to compute the ‘tolerance’ associated with a predictor. The tolerance of x_{i} is defined as 1 minus the squared multiple correlation between x_{i} and the remaining x variables. When the tolerance is small, say less than 0.01, it would be expedient to discard the variable with the smallest tolerance. The inverse of the tolerance is called the variance inflation factor (VIF).
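Tolerance and VIF can be computed by regressing each predictor on the others; in the simulated sketch below, x1 and x2 are deliberately near-collinear (all data are invented):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)   # nearly collinear with x1
x3 = rng.normal(size=n)              # unrelated to the others
Xp = np.column_stack([x1, x2, x3])

def tolerance(Xp, j):
    """1 - R^2 from regressing predictor j on the remaining predictors."""
    yj = Xp[:, j]
    others = np.delete(Xp, j, axis=1)
    A = np.column_stack([np.ones(len(yj)), others])
    coef, *_ = np.linalg.lstsq(A, yj, rcond=None)
    resid = yj - A @ coef
    r2 = 1 - resid @ resid / np.sum((yj - yj.mean()) ** 2)
    return 1 - r2

tol = np.array([tolerance(Xp, j) for j in range(Xp.shape[1])])
vif = 1 / tol     # variance inflation factors
print(tol, vif)
```

The near-collinear pair gets a small tolerance (large VIF), while the unrelated predictor's tolerance stays close to 1.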
Stepwise regression is a sequential process for fitting the least squares model, where at each step a single explanatory variable is either added to or removed from the model in the next fit.
The most commonly used criterion for the addition or deletion of variables in stepwise regression is based on the partial F-statistic:

F = [(SSE_{Reduced} - SSE_{Full}) / q] / [SSE_{Full} / (n - p - 1)]

The suffix ‘Full’ refers to the larger model with p explanatory variables, whereas the suffix ‘Reduced’ refers to the reduced model with (p - q) explanatory variables.
Forward selection
The forward selection procedure begins with no explanatory variable in the model and sequentially adds variables according to the criterion of the partial F-statistic. At each step, the variable whose partial F-statistic yields the smallest p-value is added. Variables are entered as long as the partial F-statistic p-value remains below a specified maximum value (PIN). The procedure stops when the addition of any of the remaining variables would yield a partial p-value > PIN. This procedure has two limitations. Some of the variables never get into the model, so their importance is never determined. Moreover, a variable once included in the model remains there throughout the process, even if it loses its significance after the inclusion of other variables.
Backward elimination

The backward elimination procedure begins with all the variables in the model and proceeds by eliminating the least useful variable at each step. The variable whose partial F p-value is greater than a prescribed value, POUT, is the least useful and is therefore removed from the regression model. The process continues until no variable can be removed according to the elimination criterion.
Stepwise procedure

The stepwise procedure is a modified forward selection method which later in the process permits the elimination of variables that have become statistically non-significant. At each step of the process, p-values are computed for all variables in the model. If the largest of these p-values is > POUT, that variable is eliminated. After the included variables have been examined for exclusion, the excluded variables are re-examined for inclusion. At each step of the process, there can be at most one exclusion, followed by one inclusion. It is necessary that PIN ≤ POUT to avoid infinite cycling of the process.
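A simplified sketch of forward selection, using an F-to-enter threshold rather than a p-value cut-off (PIN) so the example needs only NumPy; the data, coefficients and threshold are all invented:

```python
import numpy as np

def sse(cols, Xp, y):
    """SSE of a least squares fit of y on an intercept plus the selected columns."""
    A = np.column_stack([np.ones(len(y))] + [Xp[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return resid @ resid

def forward_select(Xp, y, f_in=10.0):
    """At each step, add the variable with the largest partial F, while F > f_in."""
    n, m = Xp.shape
    selected, remaining = [], list(range(m))
    while remaining:
        sse_red = sse(selected, Xp, y)
        best_j, best_f = None, -np.inf
        for j in remaining:
            sse_full = sse(selected + [j], Xp, y)
            df = n - len(selected) - 2          # n - p_full - 1, with q = 1
            f = (sse_red - sse_full) / (sse_full / df)
            if f > best_f:
                best_j, best_f = j, f
        if best_f < f_in:
            break
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Simulated data: y depends on x1 and x2 only; x3 is pure noise
rng = np.random.default_rng(2)
x1, x2, x3 = rng.normal(size=(3, 80))
y = 2.0 * x1 - 1.5 * x2 + 0.2 * rng.normal(size=80)
selected = forward_select(np.column_stack([x1, x2, x3]), y)
print(selected)
```

The two informative variables are picked up first; a production implementation would use p-values (PIN/POUT) and add the exclusion step of the stepwise procedure.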
Sometimes, explanatory variables for inclusion in a regression model are not interval scale; they may be nominal or ordinal variables. Such variables can be used in the regression model by creating ‘dummy’ (or indicator) variables.
Dichotomous variables do not cause the regression model to lose any of its properties. Having only two categories, they can be entered into the regression equation as if they were interval scale variables with just two values.
Consider, for example, the relationship between income and gender
y = a + bx
where
y = income of an individual, and
x = a dichotomous variable, coded as
0 if female
1 otherwise
The estimated value of y is
ŷ = a      if x = 0
ŷ = a + b  if x = 1
Since our best estimate for a given sample is the sample mean, a is estimated as the average income for females and a + b as the average income for males. The regression coefficient b is therefore

b = ȳ_{male} - ȳ_{female}
In effect, females are considered as the reference group and males’ income is measured by how much it differs from females’ income.
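This identity is easy to verify numerically; the incomes below are invented:

```python
import numpy as np

# Hypothetical incomes (x = 0 for female, 1 for male)
income_female = np.array([30.0, 32.0, 35.0, 31.0])
income_male = np.array([34.0, 38.0, 36.0, 40.0])

x = np.concatenate([np.zeros(4), np.ones(4)])
y = np.concatenate([income_female, income_male])

slope, intercept = np.polyfit(x, y, 1)    # least squares fit of y = a + b x
print(intercept, slope)   # a = female mean; b = male mean - female mean
```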
Consider, for example, the relationship between the time spent by an academic scientist on teaching and his rank.
y = a+bx
where
y is the percentage of work time spent on teaching
x is a polytomous variable ‘rank’ with three modalities:
1 = Professor
2 = Reader
3 = Lecturer
We create two dummy variables:
X_{1} = 1 if rank = Professor
0 if otherwise
X_{2} = 1 if rank = Reader
0 if otherwise
Note that we have created two dummy variables to represent a trichotomous variable. If we create a third dummy variable X_{3} (score 1 if rank = Lecturer, and 0 otherwise), the parameters of the regression equation cannot be estimated uniquely. This is because if the score of any respondent on X_{1} and X_{2} is known, it is always possible to predict his score on X_{3}. For example, if a respondent has score 0 on X_{1} (not Professor) and 0 on X_{2} (not Reader), then the respondent is certainly a Lecturer (i.e., score 1 on X_{3}). This represents a situation of perfect multicollinearity. Hence the general rule for creating dummy variables is: Number of dummy variables = Number of modalities minus 1.
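The dummy coding above can be checked numerically; the data are invented, and the rank computation at the end shows why a third dummy would make the design matrix singular:

```python
import numpy as np

# Hypothetical % of work time spent on teaching, by rank
# (1 = Professor, 2 = Reader, 3 = Lecturer)
rank = np.array([1, 1, 2, 2, 3, 3, 3])
y = np.array([20.0, 24.0, 35.0, 33.0, 50.0, 46.0, 48.0])

X1 = (rank == 1).astype(float)     # dummy: Professor
X2 = (rank == 2).astype(float)     # dummy: Reader; Lecturer is the reference group
A = np.column_stack([np.ones(len(y)), X1, X2])

a, b1, b2 = np.linalg.lstsq(A, y, rcond=None)[0]
# a = Lecturer mean; a + b1 = Professor mean; a + b2 = Reader mean

# Adding a third dummy X3 for Lecturer makes the columns linearly dependent
# (ones = X1 + X2 + X3), so the matrix loses full column rank:
X3 = (rank == 3).astype(float)
A_bad = np.column_stack([np.ones(len(y)), X1, X2, X3])
print(a, b1, b2, np.linalg.matrix_rank(A_bad))
```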
Statistical significance of regression coefficients and Multiple R^{2} is determined in the same way as for interval scale explanatory variables.