5.1 Key concepts and definitions
- Multiple regression: In multiple regression analysis, we
study the relationship between one dependent variable and several
independent variables (called predictors). The regression equation takes the form
Y = b0 + b1x1 + b2x2 + … + bpxp + e,
where Y is the dependent variable,
the b's are the regression coefficients for the corresponding x
(independent) terms, b0 is a constant or intercept, and
e is the error term reflected in the residuals. The parameters of
the regression equation are estimated using the ordinary least squares
method.
- Ordinary least squares: This method derives its name from the criterion used to draw the
best-fit regression line: a line such that the sum of the squared
vertical distances of all the points from the line is minimized.
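A minimal sketch of this criterion in NumPy (the data, coefficient values, and noise scale are invented for illustration): the design matrix gets a leading column of ones for the intercept, and np.linalg.lstsq finds the coefficients that minimize the sum of squared deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)

# Design matrix with a leading column of 1s for the intercept b0.
X = np.column_stack([np.ones(n), x1, x2])

# lstsq finds b minimizing the sum of squared deviations ||y - Xb||^2.
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("intercept b0:", b[0])
print("slopes b1, b2:", b[1], b[2])
```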
- Intercept: The intercept, b0, is where the regression
plane intersects the Y-axis. It is equal to the estimated Y
value when all the independent variables have a value of 0.
- Regression coefficient:
Regression coefficients bi are the slopes of the regression plane
in the direction of xi. Each regression coefficient
represents the net effect the ith variable has on the
dependent variable, holding the remaining x's in the equation
constant.
- Beta weights are
the regression coefficients for standardized data. Beta is the average
amount by which the dependent variable increases when the independent
variable increases by one standard deviation and the other independent variables
are held constant. The ratio of the beta weights is the ratio of the
predictive importance of the independent variables.
- Standardized means that for each datum the mean is
subtracted and the result divided by the standard deviation. The result
is that all variables have a mean of 0 and a standard deviation of 1.
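As a sketch of how beta weights arise (reusing the same invented data as the sketch above): standardize y and each x, then refit; with standardized variables the intercept is 0, so no constant column is needed.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 1.5 * x1 - 0.8 * x2 + rng.normal(scale=0.5, size=n)

def standardize(v):
    # Subtract the mean, divide by the standard deviation:
    # the result has mean 0 and standard deviation 1.
    return (v - v.mean()) / v.std()

# With standardized variables the intercept is 0, so no constant column.
Z = np.column_stack([standardize(x1), standardize(x2)])
betas, *_ = np.linalg.lstsq(Z, standardize(y), rcond=None)
print("beta weights:", betas)  # SD change in y per 1-SD change in each x
```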
- Residuals are the differences between the observed
values and those predicted by the regression equation.
- Dummy variables: Regression assumes interval data,
but dichotomies may be considered a special case of intervalness. Nominal
and ordinal categories can be transformed into sets of dichotomies, called
dummy variables. To prevent perfect multicollinearity, one category must
be left out.
- Interpretation of b for dummy
variables: For dummy variables coded in the usual binary way
(1 = present, 0 = not present), the b coefficient is interpreted
relative to the reference
category (the category left out).
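A sketch of dummy coding using pandas (the region variable and its categories are invented): get_dummies with drop_first=True leaves out a reference category, so each coefficient is read relative to it.

```python
import pandas as pd

df = pd.DataFrame({"region": ["north", "south", "west", "south", "north"]})

# One dichotomy per category; drop_first=True leaves out the reference
# category ("north", first alphabetically) to avoid perfect multicollinearity.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies)
# The b for region_south or region_west is then read as the difference
# from the omitted reference category, north.
```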
- Multiple R: The correlation coefficient between the
observed and predicted values. It ranges in value from 0 to 1. A small
value indicates that there is little or no linear
relationship between the dependent variable and the independent variables.
- Multiple R²: The percent of the variance in the
dependent variable explained by the independent variables. It is also
called the coefficient of multiple determination. Mathematically,
R² = 1 − (SSE/SST), where
SSE = error sum of squares = Σ(Yi − Est Yi)², where Yi is the
actual value of Y for the ith case and Est Yi
is the regression prediction for the ith case; and
SST = total sum of squares = Σ(Yi − MeanY)².
- Adjusted R²: When there are a large
number of independent variables, it is possible that R² may
become artificially large, simply because some independent variables' chance
variations "explain" small parts of the variance of the dependent
variable. It is therefore essential to adjust the value of R² downward
as the number of independent variables increases. With a few
independent variables, R² and adjusted R²
will be close; with a large number of independent variables, adjusted
R² may be noticeably lower.
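The sketch below (invented data) computes multiple R, R² from SSE and SST, and an adjusted R². The adjustment shown is the standard degrees-of-freedom formula, which I am assuming is what the text intends.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3                                   # 50 cases, 3 predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 0.5, -0.3, 0.2]) + rng.normal(size=n)

b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

sse = np.sum((y - y_hat) ** 2)                 # error sum of squares
sst = np.sum((y - y.mean()) ** 2)              # total sum of squares
r2 = 1 - sse / sst                             # coefficient of determination
multiple_r = np.corrcoef(y, y_hat)[0, 1]       # correlation of observed/predicted

# Standard degrees-of-freedom adjustment (assumed to match the text).
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(multiple_r, r2, adj_r2)
```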
- Multicollinearity: Multicollinearity is the intercorrelation of the independent
variables. r² values near 1 violate the assumption
of no perfect collinearity, while high r² values increase the
standard error of the regression coefficients and make assessment of the unique
role of each independent variable difficult or impossible. While simple
correlations tell something about multicollinearity, the preferred method of
assessing multicollinearity is to compute the determinant of the correlation
matrix. Determinants near zero indicate that some or all independent variables
are highly correlated.
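A sketch of the determinant check (invented data in which x2 nearly duplicates x1):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.1, size=n)   # nearly a copy of x1
x3 = rng.normal(size=n)

# np.corrcoef treats each row as a variable.
R = np.corrcoef(np.vstack([x1, x2, x3]))
print("determinant:", np.linalg.det(R))   # near 0 -> severe multicollinearity
```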
- Partial correlation: The correlation of
two variables while controlling for a third or more other variables. For
example, r12.34 is the correlation of variables 1 and 2,
controlling for variables 3 and 4. A partial correlation r12.34
equal to the uncontrolled correlation r12 means the control variables have
no effect; a partial correlation near 0 means the original
correlation is spurious.
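A sketch of partial correlation computed the standard way, by correlating residuals (the variable names and the single control z are illustrative): x and y here are related only through z, so the partial correlation should be near 0, flagging the raw correlation as spurious.

```python
import numpy as np

def partial_corr(x, y, controls):
    # Regress x and y on the controls, then correlate the residuals;
    # this yields the partial correlation r_xy.controls.
    C = np.column_stack([np.ones(len(x))] + list(controls))
    rx = x - C @ np.linalg.lstsq(C, x, rcond=None)[0]
    ry = y - C @ np.linalg.lstsq(C, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

rng = np.random.default_rng(3)
z = rng.normal(size=300)
x = z + rng.normal(size=300)   # x and y are related only through z,
y = z + rng.normal(size=300)   # so controlling for z should erase the link
print(np.corrcoef(x, y)[0, 1], partial_corr(x, y, [z]))
```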
- Stepwise Regression: Stepwise
regression is a sequential process for fitting the least squares model,
where at each step a single predictor variable is either added to or
removed from the model in the next fit.
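A sketch of the forward (adding-only) half of stepwise selection, under the simplifying assumption that a predictor enters when it improves R² by at least a chosen threshold; real stepwise procedures typically use F-tests or p-values and also remove variables.

```python
import numpy as np

def r_squared(X, y):
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_stepwise(predictors, y, min_gain=0.01):
    # At each step, add the single predictor that most improves R²,
    # stopping once no candidate improves it by at least min_gain.
    chosen = []
    remaining = list(range(len(predictors)))
    current = 0.0
    while remaining:
        gain, best = max(
            (r_squared(np.column_stack(
                [np.ones(len(y))] + [predictors[k] for k in chosen + [j]]), y)
             - current, j)
            for j in remaining
        )
        if gain < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current += gain
    return chosen

rng = np.random.default_rng(4)
n = 200
xs = [rng.normal(size=n) for _ in range(4)]
y = 2 * xs[0] - xs[2] + rng.normal(size=n)
print(forward_stepwise(xs, y))   # expect predictors 0 and 2 to be chosen
```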
Multiple Classification Analysis
- Multiple classification analysis: Multiple Classification Analysis (MCA) is a
technique for examining the interrelationship between several predictor
variables and one dependent variable in the context of an additive model.
Independent variables may be measured on nominal or ordinal scales, and the
dependent variable may be interval scale or a dichotomy.
- Additive model: Such
a model assumes that the dependent variable can be predicted from an
additive combination of the independent (or predictor) variables. In other
words, it assumes that the average score on the dependent variable for a
given set of individuals (objects or cases) is predictable by adding the
effects of several predictors.
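As a concrete sketch (the notation here is illustrative, not from the source): for a case in category i of predictor A and category j of predictor B, an additive model predicts Y = MeanY + ai + bj, where MeanY is the grand mean and ai, bj are the category effects; no interaction term involving both A and B appears.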
- Eta: Eta indicates
the ability of a predictor, using the given categories, to explain
variation in the dependent variable.
- Eta squared: Eta²
is the correlation ratio and indicates the proportion of the total
sum of squares explained by the predictor.
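A minimal sketch of Eta² as between-category sum of squares over total sum of squares (the categories and scores are invented):

```python
import numpy as np

def eta_squared(categories, y):
    # Eta² = between-category sum of squares / total sum of squares:
    # the share of total variation explained by the predictor's categories.
    grand = y.mean()
    ss_total = np.sum((y - grand) ** 2)
    ss_between = sum(
        (categories == g).sum() * (y[categories == g].mean() - grand) ** 2
        for g in np.unique(categories)
    )
    return ss_between / ss_total

cats = np.array(["a", "a", "b", "b", "c", "c"])
y = np.array([1.0, 2.0, 4.0, 5.0, 8.0, 9.0])
print(eta_squared(cats, y))   # most variation lies between categories
```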
- MCA Beta: This is
directly analogous to the Eta statistic, but is based on the adjusted means
rather than the raw means. Beta is a measure of the ability of a predictor
to explain variation in the dependent variable, after adjusting for the
effects of all other predictors. Note that this is not in terms of
percentage of variance explained.
- Multiple correlation coefficient squared: This coefficient indicates the proportion
of variance in the dependent variable explained by the predictors in this run of the program.
- Adjustment for degrees of freedom: This is the factor used to correct for
capitalizing on chance in fitting the model in the particular sample.
- Multiple correlation coefficient squared (Adjusted): This coefficient estimates the proportion of
variance in the dependent variable explained by the predictor variables.