4.4 Regression Analysis

Regression analysis is one of the most commonly used statistical techniques in social and behavioral sciences as well as in physical sciences. Its main objective is to explore the relationship between a dependent variable and one or more independent variables (which are also called predictor or explanatory variables). Linear regression explores relationships that can be readily described by straight lines or their generalization to many dimensions.

A surprisingly large number of problems can be solved by linear regression, and even more by means of transformation of the original variables that result in linear relationships among the transformed variables.

Mathematically, the regression model is represented by the following equation:

Y_i = \alpha + \sum_{j=1}^{p} \beta_j X_{ij} + \varepsilon_i

where p is the number of predictors, the subscript i refers to the ith observation, the subscript j refers to the jth predictor, and ε_i is the difference between the ith observation and the value given by the model; ε_i is also called the error term.
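As a quick illustration of this notation (not part of IDAMS), the Python sketch below generates data that follow the model; the parameter values and variable names are assumptions chosen only for the example.

```python
# Illustrative sketch only: simulate data from Y_i = alpha + sum_j beta_j * X_ij + eps_i.
# alpha, beta, n and p are arbitrary example values, not taken from the text.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3                          # n observations, p predictors
alpha = 2.0
beta = np.array([1.5, -0.7, 0.3])      # one coefficient per predictor
X = rng.normal(size=(n, p))            # X[i, j] is the j-th predictor for observation i
eps = rng.normal(size=n)               # error term eps_i
Y = alpha + X @ beta + eps             # the regression model, one Y_i per observation
```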

In this chapter, however, we examine the case of simple linear regression, which involves only two variables: one dependent variable (Y) and one independent variable (X):

Y_i = \alpha + \beta X_i + \varepsilon_i

The first step in determining whether there is a relationship between two variables is to examine a graph of the observed data, with Y plotted against X. This graph is called a scatter plot; the IDAMS modules Scatter or Graphid can be used to draw it. If there is a relationship between the variables X and Y, the points of the scatter plot will be more or less concentrated around a curve, which may be called the curve of regression. In the particular case where the curve is a straight line, it is called the line of regression and the regression is said to be linear. Besides indicating linearity, the scatter plot is also useful for spotting outliers in the data and for seeing whether the points form two or more clusters.
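In IDAMS this plot would be produced with the Scatter or Graphid module; as a rough, non-IDAMS illustration of the same step, the following Python sketch (using matplotlib, with simulated data) plots Y against X so the shape of the relationship can be inspected.

```python
# Non-IDAMS illustration: inspect the relationship by plotting Y against X.
# The data here are simulated purely for the example.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=50)
Y = 3.0 + 0.8 * X + rng.normal(scale=1.5, size=50)   # roughly linear with noise

plt.scatter(X, Y)
plt.xlabel("X (independent variable)")
plt.ylabel("Y (dependent variable)")
plt.title("Scatter plot of Y against X")
plt.show()
```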

For the population, the bivariate regression model is:

Y_i = \alpha + \beta X_i + \varepsilon_i

where the subscript i refers to the ith observation, α is the intercept and β is the regression coefficient. The intercept α is so called because it is the point at which the line intercepts the Y-axis; it is the average value of Y when X = 0.

Assumptions

The regression model is based on the following assumptions:

1. The relationship between X and Y is linear.
2. The values of X are fixed, i.e. measured without error.
3. The errors ε_i are independent of one another.
4. For each value of X, the errors are normally distributed with mean zero and constant variance σ².

Estimation of Parameters

The random sample of observations can be used to estimate the parameters of the regression equation. The method of least squares is used to fit a continuous dependent variable (Y) as a linear function of a single predictor variable (X): it finds the line that minimizes the sum of squared vertical deviations from each point in the sample to the point on the line with the same X value. Given a set of n observations Y_i of the dependent variable corresponding to values X_i of the predictor, and the assumed regression model, the ith residual is defined as the difference between the ith observation Y_i and the fitted value Ŷ_i:

d_i = Y_i - \hat{Y}_i

The least-squares line is:

\hat{Y} = A + BX

where

B = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}

and

A = \bar{Y} - B\bar{X}

Here X̄ and Ȳ denote the sample means of X and Y, and Ŷ denotes the predicted value of Y for a given X.
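A minimal sketch of these formulas in Python (numpy only; the function name is illustrative):

```python
# Least-squares estimates for the simple linear regression line Y_hat = A + B*X.
import numpy as np

def least_squares_line(X, Y):
    Xbar, Ybar = X.mean(), Y.mean()
    B = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)  # slope
    A = Ybar - B * Xbar                                            # intercept
    return A, B

# Usage: A, B = least_squares_line(X, Y); fitted values are A + B * X.
```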

The estimate of σ² is called the residual mean square and is computed as:

\text{RMS} = \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n - 2}

The number n − 2, called the residual degrees of freedom, is the sample size minus the number of parameters estimated (in this case, α and β).

The square root of the residual mean square (RMS) is called the standard error of the estimate and is denoted by S. In effect, it indicates the reliability of the estimating equation. The standard errors of A and B are:

SE(A) = S \sqrt{\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}

SE(B) = \frac{S}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}
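Continuing the earlier sketch (same caveats: plain numpy, illustrative names), the residual mean square, S, and the standard errors of A and B can be computed as:

```python
# Residual mean square, standard error of the estimate, and standard errors of A and B.
import numpy as np

def regression_errors(X, Y, A, B):
    n = len(X)
    Y_hat = A + B * X
    rms = np.sum((Y - Y_hat) ** 2) / (n - 2)           # residual mean square (estimate of sigma^2)
    S = np.sqrt(rms)                                   # standard error of the estimate
    sxx = np.sum((X - X.mean()) ** 2)
    se_B = S / np.sqrt(sxx)                            # standard error of the slope B
    se_A = S * np.sqrt(1.0 / n + X.mean() ** 2 / sxx)  # standard error of the intercept A
    return S, se_A, se_B
```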

Standardized regression coefficient

The standardized regression coefficient is the slope of the regression equation when both X and Y are standardized (converted to z-scores). After standardization, the intercept A equals zero and the standardized slope equals the correlation coefficient r.
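The following short check (illustrative, numpy only) fits the least-squares slope on standardized variables; the result agrees with the correlation coefficient r:

```python
# Standardize X and Y to z-scores, then fit the least-squares slope;
# the result equals np.corrcoef(X, Y)[0, 1].
import numpy as np

def standardized_slope(X, Y):
    Zx = (X - X.mean()) / X.std(ddof=1)
    Zy = (Y - Y.mean()) / Y.std(ddof=1)
    return np.sum(Zx * Zy) / np.sum(Zx ** 2)   # intercept is zero after standardization
```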

Significance of regression

For testing the null hypothesis H0: β = 0, it is convenient to present the results of regression analysis in the form of an analysis of variance (ANOVA) table. If X were useless in predicting Y, the best estimate of Y would be Ȳ, regardless of the value of X. To measure how different the fitted line Ŷ is from Ȳ, we calculate the sum of squares for regression as Σ(Ŷ_i − Ȳ)², summed over all data points. The residual mean square is a measure of how poorly or how well the regression line fits the actual data points; a large residual mean square indicates poor fit. If the residual mean square is large, the value of F is low and the F ratio may be non-significant. If the F ratio is statistically significant, the null hypothesis H0: β = 0 is rejected.

ANOVA Table for Simple Linear Regression.

Source of Variation    Sums of Squares          df        Mean Square                F
Regression             SSreg = Σ(Ŷi − Ȳ)²       1         MSreg = SSreg / 1          MSreg / MSres
Residual               SSres = Σ(Yi − Ŷi)²      n − 2     MSres = SSres / (n − 2)
Total                  SStot = Σ(Yi − Ȳ)²       n − 1
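As a rough sketch of how the quantities in this table can be computed (Python with numpy and scipy, not IDAMS; the function name is illustrative):

```python
# Build the simple-linear-regression ANOVA quantities and the F test of H0: beta = 0.
import numpy as np
from scipy import stats

def regression_anova(X, Y):
    n = len(X)
    Xbar, Ybar = X.mean(), Y.mean()
    B = np.sum((X - Xbar) * (Y - Ybar)) / np.sum((X - Xbar) ** 2)
    A = Ybar - B * Xbar
    Y_hat = A + B * X
    ss_reg = np.sum((Y_hat - Ybar) ** 2)          # regression sum of squares, df = 1
    ss_res = np.sum((Y - Y_hat) ** 2)             # residual sum of squares, df = n - 2
    ms_reg, ms_res = ss_reg / 1, ss_res / (n - 2)
    F = ms_reg / ms_res
    p_value = stats.f.sf(F, 1, n - 2)             # small p-value => reject H0: beta = 0
    return F, p_value
```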