Regression analysis is one of the most commonly used statistical techniques in social and behavioral sciences as well as in physical sciences. Its main objective is to explore the relationship between a dependent variable and one or more independent variables (which are also called predictor or explanatory variables). Linear regression explores relationships that can be readily described by straight lines or their generalization to many dimensions.
A surprisingly large number of problems can be solved by linear regression, and even more by transforming the original variables so that the relationships among the transformed variables are linear.
Mathematically, the regression model is represented by the following equation:
Y_{i} = a + Σ_{j=1}^{p} b_{j} X_{ij} + e_{i}
where p is the number of predictors, the subscript i refers to the i^{th} observation, the subscript j refers to the j^{th} predictor, and e_{i} is the difference between the i^{th} observation and the model; e_{i} is also called the error term.
In this chapter, however, we examine the case of simple linear regression, which involves only two variables: one dependent variable (Y) and one independent variable (X):
Y_{i} = a + b X_{i} + e_{i}
The first step in determining whether there is a relationship between two variables is to examine a graph of the observed data (Y against X). This graph is called a scatter plot. The IDAMS modules Scatter or Graphid can be used to draw it. If there is a relationship between the variables X and Y, the dots of the scatter plot will be more or less concentrated around a curve, which may be called the curve of regression. In the particular case when the curve is a straight line, it is called the line of regression and the regression is said to be linear. Besides checking linearity, the scatter plot is also useful for detecting outliers in the data and for seeing whether the points form two or more clusters.
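IDAMS itself produces the scatter plot, but a crude text rendering makes the idea concrete. The sketch below uses hypothetical data and pure Python: each (X, Y) point is marked on a character grid, and because the data track a straight line, the marks cluster along a diagonal.

```python
# Crude text scatter plot: marks each (X, Y) point on a character grid.
# With linearly related data the marks fall roughly along a diagonal.
# Data are hypothetical, chosen so that roughly Y = 2X.

X = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y = [1.9, 4.2, 5.8, 8.1, 9.9, 12.2]

WIDTH, HEIGHT = 24, 12

def scale(v, lo, hi, size):
    """Map v from [lo, hi] onto a grid index in [0, size - 1]."""
    return round((v - lo) / (hi - lo) * (size - 1))

grid = [[" "] * WIDTH for _ in range(HEIGHT)]
for x, y in zip(X, Y):
    col = scale(x, min(X), max(X), WIDTH)
    row = scale(y, min(Y), max(Y), HEIGHT)
    grid[HEIGHT - 1 - row][col] = "*"   # flip so Y increases upward

for line in grid:
    print("".join(line))
```

With curved or clustered data the same sketch would show the departures from linearity that the text describes.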
For the population, the bivariate regression model is:
Y_{i} = a + b X_{i} + e_{i}
where the subscript i refers to the i^{th} observation, a is the intercept and b is the regression coefficient (slope). The intercept, a, is so called because it is the point where the line intercepts the Y-axis; it gives the average value of Y when X = 0.
Assumptions
The regression model is based on the following (standard) assumptions:
1. The relationship between X and Y is linear.
2. The errors e_{i} are independent of one another.
3. The errors e_{i} have mean zero and a common variance σ^{2} (homoscedasticity).
4. The errors e_{i} are normally distributed.
Estimation of Parameters
The random sample of observations can be used to estimate the parameters of the regression equation. The method of least squares is used to fit a continuous dependent variable (Y) as a linear function of a single predictor variable (X). The least squares method finds the line which minimizes the sum of squared deviations from each point in the sample to the point on the line corresponding to the X–value. Given a set of n observations Y_{i} of the dependent variable corresponding to a set of values X_{i} of the predictor, and the assumed regression model, the i^{th} residual is defined as the difference between the i^{th} observation Y_{i} and the fitted value Ŷ_{i}.
d_{i} = (Y_{i} − Ŷ_{i})
The least-squares line is:
Ŷ = A + BX
where
B = Σ (X_{i} − X̄)(Y_{i} − Ȳ) / Σ (X_{i} − X̄)^{2}
and
A = Ȳ − B X̄
Here X̄ and Ȳ denote the sample means of X and Y, and Ŷ denotes the predicted value of Y for a given X.
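As a concrete illustration, the least-squares formulas can be computed directly. The small (X, Y) data set below is hypothetical, used only to show the arithmetic.

```python
# Least-squares estimates for simple linear regression, computed
# directly from B = Σ(Xi - X̄)(Yi - Ȳ) / Σ(Xi - X̄)² and A = Ȳ - B·X̄.
# The data are hypothetical (roughly Y = 2X).

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.1, 8.0, 9.9]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n

Sxx = sum((x - x_bar) ** 2 for x in X)                       # Σ(Xi - X̄)²
Sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))   # Σ(Xi - X̄)(Yi - Ȳ)

B = Sxy / Sxx            # slope
A = y_bar - B * x_bar    # intercept

print("A =", A, " B =", B)   # A = 0.09, B = 1.97
```

For these data X̄ = 3, Ȳ = 6, Sxx = 10 and Sxy = 19.7, so B = 1.97 and A = 0.09.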
The estimate of σ^{2} is called the residual mean square and is computed as:
S^{2} = Σ d_{i}^{2} / (n − 2) = Σ (Y_{i} − Ŷ_{i})^{2} / (n − 2)
The number n − 2, called the residual degrees of freedom, is the sample size minus the number of parameters (in this case, a and b).
The square root of the residual mean square (RMS) is called the standard error of the estimate and is denoted by S. In effect, it indicates the reliability of the estimating equation. The standard errors of A and B are:
SE(A) = S √( 1/n + X̄^{2} / Σ (X_{i} − X̄)^{2} )
SE(B) = S / √( Σ (X_{i} − X̄)^{2} )
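Continuing with the same hypothetical data, the residual mean square, the standard error of the estimate S, and the standard errors of A and B can be computed as follows.

```python
import math

# Residual mean square (n - 2 df), standard error of the estimate S,
# and standard errors of the intercept A and slope B.
# Data are hypothetical (same small set as above).

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.1, 8.0, 9.9]

n = len(X)
x_bar = sum(X) / n
y_bar = sum(Y) / n
Sxx = sum((x - x_bar) ** 2 for x in X)
B = sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y)) / Sxx
A = y_bar - B * x_bar

# Residuals, residual sum of squares, and residual mean square
fitted = [A + B * x for x in X]
ss_res = sum((y - f) ** 2 for y, f in zip(Y, fitted))
rms = ss_res / (n - 2)
S = math.sqrt(rms)                  # standard error of the estimate

se_B = S / math.sqrt(Sxx)
se_A = S * math.sqrt(1 / n + x_bar ** 2 / Sxx)

print("S =", S, " SE(A) =", se_A, " SE(B) =", se_B)
```

A small S relative to the spread of Y indicates that the fitted line predicts the observations closely.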
The standardized regression coefficient is the slope of the regression equation when both X and Y are standardized. After standardization the intercept A equals zero, and the standardized slope equals the correlation coefficient r.
For testing the null hypothesis H_{0}: b = 0, it is expedient to present the results of the regression analysis in the form of an analysis of variance (ANOVA) table. If X were useless in predicting Y, the best estimate of Y would be Ȳ, regardless of the value of X. To measure how different the fitted line Ŷ is from Ȳ, we calculate the sum of squares for regression as Σ (Ŷ_{i} − Ȳ)^{2}, summed over each data point. The residual mean square is a measure of how poorly or how well the regression line fits the actual data points; a large residual mean square indicates poor fit. If the residual mean square is large, the value of F will be low and the F ratio may be nonsignificant. A statistically significant F ratio implies that the null hypothesis H_{0}: b = 0 is rejected.
ANOVA Table for Simple Linear Regression

Source of Variation | Sums of Squares                   | Df    | Mean Square        | F
Regression          | SS_{reg} = Σ (Ŷ_{i} − Ȳ)^{2}      | 1     | SS_{reg} / 1       | MS_{reg} / MS_{res}
Residual            | SS_{res} = Σ (Y_{i} − Ŷ_{i})^{2}  | N − 2 | SS_{res} / (N − 2) |
Total               | SS_{tot} = Σ (Y_{i} − Ȳ)^{2}      | N − 1 |                    |
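The ANOVA decomposition and the F ratio can be computed directly from the fitted line; the sketch below uses the same hypothetical data as the earlier examples and checks the partition SS_{tot} = SS_{reg} + SS_{res}.

```python
# ANOVA decomposition and F statistic for simple linear regression.
# Data are hypothetical (same small set used throughout).

X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.1, 3.9, 6.1, 8.0, 9.9]
n = len(X)

x_bar = sum(X) / n
y_bar = sum(Y) / n
B = (sum((x - x_bar) * (y - y_bar) for x, y in zip(X, Y))
     / sum((x - x_bar) ** 2 for x in X))
A = y_bar - B * x_bar
fitted = [A + B * x for x in X]

ss_reg = sum((f - y_bar) ** 2 for f in fitted)           # 1 df
ss_res = sum((y - f) ** 2 for y, f in zip(Y, fitted))    # n - 2 df
ss_tot = sum((y - y_bar) ** 2 for y in Y)                # n - 1 df

ms_reg = ss_reg / 1
ms_res = ss_res / (n - 2)
F = ms_reg / ms_res

# The sums of squares partition: SS_tot = SS_reg + SS_res
print("SS_reg =", ss_reg, " SS_res =", ss_res, " SS_tot =", ss_tot, " F =", F)
```

Here SS_{reg} is far larger than SS_{res}, so F is very large and H_{0}: b = 0 would be rejected; with noisier data the residual mean square grows and F shrinks, exactly as the text describes.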