Linear Regression (REGRESSN)

27    Linear Regression (REGRESSN)


27.1  General Description

REGRESSN provides a general multiple regression capability designed for either standard or stepwise linear regression analysis. Several regression analyses, using different parameters and variables, may be performed in one execution.

Constant term. If the input is raw data, the user may request that the equations have no constant term (see the regression parameter CONSTANT=0). In that case, a matrix based on the cross-product matrix is analyzed instead of a correlation matrix. This changes the slope of the fitted line and can substantially affect the results. In stepwise regression, variables may enter the equation in a different order than they would if a constant term were estimated. If a correlation matrix is input, the regression equation always includes a constant term.
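The effect of dropping the constant term can be seen in a one-predictor case. The following is a minimal Python sketch, not REGRESSN's own code: the function names and the sample data are hypothetical, and it only contrasts a slope computed from deviations about the means with a slope computed from raw cross-products.

```python
def slope_with_constant(x, y):
    """Least-squares slope when a constant term is estimated.

    Works on deviations from the means (the correlation-matrix view).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def slope_through_origin(x, y):
    """Least-squares slope when the constant term is forced to zero.

    Works on raw, uncentred cross-products.
    """
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / sxx


# Hypothetical data lying exactly on the line y = 2 + x.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 4.0, 5.0, 6.0]

b_const = slope_with_constant(x, y)    # slope of the line with an intercept
b_origin = slope_through_origin(x, y)  # slope of the line forced through zero
```

With these data the two slopes differ, although the data are perfectly linear, because the second fit has to absorb the intercept into the slope.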

Use of categorical variables as independent variables. An option is available to create a set of dummy (dichotomous) variables from specified categorical variables (see the parameter CATE). These can be used as independent variables in the regression analysis.
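As an illustration of what a set of dummy variables looks like, here is a small Python sketch. It is not the program's implementation; `make_dummies` is a hypothetical helper, and it assumes that a case whose code is not listed simply receives 0 on every dummy variable, so that the unlisted codes act as the reference category.

```python
def make_dummies(values, codes):
    """Return one 0/1 column per listed code.

    Assumption: cases whose value is not in `codes` get 0 in every
    column, so the unlisted codes form the reference category.
    """
    return {code: [1 if v == code else 0 for v in values] for code in codes}


# Hypothetical values of a categorical variable with codes 1-6.
v100 = [5, 6, 1, 2, 5, 3]

# Codes 5, 6 and 1 become three dummy variables; codes 2, 3 and 4 do not.
dummies = make_dummies(v100, codes=[5, 6, 1])
```

Leaving at least one code out of the list matters: if every code had its own dummy variable, the columns would sum to 1 for each case and the matrix would be singular.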

F-ratio for a variable to enter in the equation. In a stepwise regression, variables are added in turn to the regression equation until the equation is satisfactory. At each step the variable with the highest partial correlation with the dependent variable is selected. A partial F-test value is then computed for the variable and this value is compared to a critical value supplied by the user. As soon as the partial F for the next variable to be entered falls below the critical value, the analysis is terminated.

F-ratio for a variable to be removed from the equation. A variable which may have been the best single variable to enter at an early stage of a stepwise regression may, at a later stage, not be the best because of the relationship between it and other variables now in the regression. To detect this, the partial F-value for each variable in the regression at each step of the calculation is computed and compared with a critical value supplied by the user. Any variable whose partial F-value falls below the critical value is removed from the model.
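The two tests above can be sketched in Python as follows. This is an illustrative sketch using the usual textbook form of the partial F for a single variable, which may differ in detail from the program's computation; the function names are hypothetical, and the residual degrees of freedom assume that a constant term is estimated.

```python
def partial_f(r2_full, r2_reduced, n_cases, n_predictors_full):
    """Partial F for one variable.

    r2_full:    R-squared of the equation that includes the variable
    r2_reduced: R-squared of the equation without it
    Assumes a constant term, so residual df = n - k - 1.
    """
    residual_df = n_cases - n_predictors_full - 1
    return (r2_full - r2_reduced) / ((1.0 - r2_full) / residual_df)


def should_enter(f_value, f_to_enter):
    """A candidate enters unless its partial F is below the F-ratio to enter."""
    return f_value >= f_to_enter


def should_remove(f_value, f_to_remove):
    """A variable in the equation is removed if its partial F falls below
    the F-ratio to remove."""
    return f_value < f_to_remove


# Hypothetical figures: adding a second predictor raises R-squared
# from 0.40 to 0.50 with 103 cases.
f_value = partial_f(0.50, 0.40, n_cases=103, n_predictors_full=2)
enters = should_enter(f_value, f_to_enter=4.0)
removed = should_remove(f_value, f_to_remove=3.9)
```

Keeping the F-ratio to enter at or above the F-ratio to remove prevents a variable from being entered and removed again in an endless cycle.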

Stepwise regression. If stepwise regression is requested, the program determines which variables or which sets of dummy variables among the specified set of independent variables will actually be used for the regression, and in which order they will be introduced, beginning with the forced variables and continuing with the other variables and sets of dummy variables, one by one. After each step the algorithm selects from the remaining predictor variables the variable or set of dummy variables which yields the largest reduction in the residual (unexplained) variance of the dependent variable, unless its contribution to the total F-ratio for the regression remains below a specified threshold. Similarly, the algorithm evaluates after each step whether the contribution of any variable or set of dummy variables already included falls below a specified threshold, in which case it is dropped from the regression.

Descending stepwise regression. Like the stepwise regression, except that the algorithm starts with all the independent variables and then drops variables and sets of dummy variables in a stepwise manner. At each step the algorithm selects from the remaining included predictor variables the variable or set of dummy variables which yields the smallest reduction in the explained variance of the dependent variable, unless this exceeds a specified threshold. Similarly, the algorithm evaluates at each step whether the contribution of any variable or set of dummy variables previously dropped from the regression has risen above a specified threshold, in which case it is added back into the regression.

Generating a residuals dataset. With raw data input, residuals may be computed and output as a data file described by an IDAMS dictionary. See the "Output Residuals Dataset(s)" section for details on the content. Note that a separate residuals dataset is generated for each equation. Also, since REGRESSN has no facility to transfer specific variables of interest in a residuals analysis from the input raw data to the residuals dataset, it may be necessary to use the MERGE program to create the dataset containing all of the desired variables. A case ID variable from the input dataset is output to the residuals dataset to make matching possible.

Generating a correlation matrix. If raw data are input, the program computes correlation coefficients which may be output in the format of an IDAMS square matrix and used for further analysis. REGRESSN correlations include all variables across all regression equations and are based on cases which have valid data on all variables in the matrix. Thus, the correlations will usually differ from correlations obtained from the PEARSON program execution with the MDHANDLING=PAIR option. When missing data elimination in REGRESSN leaves the sample size acceptably large, REGRESSN is an alternative to PEARSON for generating a correlation matrix (see the paragraph "Treatment of missing data").


27.2  Standard IDAMS Features

Case and variable selection. If raw data are input, the standard filter is available to select a subset of cases from the input data. If a matrix of correlations is used as input to the program, case selection is not applicable. The variables for the regression equation are specified in the regression parameters DEPVAR and VARS.

Transforming data. If raw data are input, Recode statements may be used.

Weighting data. If raw data are input, a variable can be used to weight the input data; this weight variable may have integer or decimal values. The program will force the sum of the weights to equal the number of input cases. When the value of the weight variable for a case is zero, negative, missing or non-numeric, then the case is always skipped; the number of cases so treated is printed.
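The weight handling described above can be sketched in Python. This is only an illustration under stated assumptions: `normalize_weights` is a hypothetical helper, missing weights are represented by `None`, and "the number of input cases" is taken to mean the number of cases that remain after the unusable weights are skipped, which the text does not state explicitly.

```python
def normalize_weights(weights):
    """Rescale usable weights so that they sum to the number of usable cases.

    Returns (scaled, n_skipped). Cases with a zero, negative, missing or
    non-numeric weight are skipped and marked with None.
    """
    def usable(w):
        # bool is excluded explicitly because it is a subclass of int
        return isinstance(w, (int, float)) and not isinstance(w, bool) and w > 0

    kept = [w for w in weights if usable(w)]
    if not kept:
        return [None] * len(weights), len(weights)
    factor = len(kept) / sum(kept)
    scaled = [w * factor if usable(w) else None for w in weights]
    return scaled, len(weights) - len(kept)


# Hypothetical weight values, including unusable ones.
raw = [2.0, 0, None, 1.0, -3, "x", 1.0]
scaled, n_skipped = normalize_weights(raw)
```

After rescaling, the weights keep their relative sizes but no longer inflate or deflate the apparent sample size used in significance tests.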

Treatment of missing data.

  1. Input. If raw data are input, the MDVALUES parameter is available to indicate which missing data values, if any, are to be used to check for missing data. Cases in which missing data occur in any regression variable in any analysis are deleted ("case-wise" missing data deletion). An option (see the parameter MDHANDLING) allows the user to specify the maximum number of missing data cases which can be tolerated before the execution is terminated. Warning: If multiple analyses are performed in one REGRESSN execution, a single correlation matrix is computed for all variables used in the different analyses. Because of the "case-wise" method of deleting cases with missing data, the number of cases used and thus the regression statistics produced may be different if the analyses are then performed separately.

    If a matrix is input, cases with missing data should have been accommodated when the matrix was created. If a cell of the input matrix has a missing data code (i.e. 99.999) any analysis involving that cell will be skipped.

  2. Output residuals. If residuals are requested, predicted values and residuals are computed for all cases which pass the (optional) filter. If a case has missing data on any of the variables required for these computations, output missing data codes are generated.

  3. Output correlation matrix. The REGRESSN algorithm for handling missing data on raw data input cannot result in missing data entries in the correlation matrix.
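The warning about multiple analyses in item 1 can be made concrete with a short Python sketch of case-wise deletion. The data, variable names and the `complete_cases` helper are hypothetical, and missing data are represented by `None`.

```python
def complete_cases(cases, variables):
    """Keep only the cases with valid data on every listed variable."""
    return [c for c in cases if all(c[v] is not None for v in variables)]


# Hypothetical raw data; None marks a missing value.
cases = [
    {"V1": 1, "V2": 2, "V3": 3},
    {"V1": 4, "V2": None, "V3": 6},
    {"V1": 7, "V2": 8, "V3": None},
    {"V1": 2, "V2": 5, "V3": 1},
]

analysis_a = ["V1", "V2"]
analysis_b = ["V1", "V3"]

# One execution with both analyses: deletion uses the union of variables.
all_vars = sorted(set(analysis_a) | set(analysis_b))
n_together = len(complete_cases(cases, all_vars))

# Separate executions: each analysis only loses its own incomplete cases.
n_a_alone = len(complete_cases(cases, analysis_a))
n_b_alone = len(complete_cases(cases, analysis_b))
```

Here the combined execution keeps fewer cases than either analysis run on its own, which is why the regression statistics can differ between the two ways of running.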


27.3  Results

Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if any, only for variables used in the execution.

Univariate statistics. (Raw data input only). The sum, mean, standard deviation, coefficient of variation, maximum, and minimum are printed for all dependent and independent variables used.

Matrix of total sums of squares and cross-products. (Raw data input only. Optional: see the parameter PRINT).

Matrix of residual sums of squares and cross-products. (Raw data input only. Optional: see the parameter PRINT).

Total correlation matrix. (Optional: see the parameter PRINT).

Partial correlation matrix. (Optional for each regression: see the regression parameter PARTIALS). The ij-th element is the partial correlation between variable i and variable j, holding constant the variables specified in the PARTIALS variable list.

Inverse matrix. (Optional for each regression: see the regression parameter PRINT).

Analysis summary statistics. The following statistics are printed for each regression or for each step of a stepwise regression:

standard error of estimate,
F-ratio,
multiple correlation coefficient (adjusted and unadjusted),
fraction of explained variance (adjusted and unadjusted),
determinant of the correlation matrix,
residual degrees of freedom,
constant term.
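For orientation, the relation between the unadjusted and adjusted fraction of explained variance can be sketched in Python. The formulas below are the standard textbook ones and are an assumption here, since the exact adjustment used by the program is not stated in this chapter; the function names are hypothetical.

```python
def residual_df(n_cases, n_predictors, constant=True):
    """Residual degrees of freedom: one more is lost if a constant is estimated."""
    return n_cases - n_predictors - (1 if constant else 0)


def adjusted_r_squared(r2, n_cases, n_predictors):
    """Textbook adjustment of R-squared for an equation with a constant term."""
    return 1.0 - (1.0 - r2) * (n_cases - 1) / residual_df(n_cases, n_predictors)


# Hypothetical example: R-squared of 0.5 with 21 cases and 4 predictors.
df = residual_df(21, 4)
r2_adj = adjusted_r_squared(0.5, 21, 4)
```

The adjusted value is always at or below the unadjusted one, and the gap widens as predictors are added relative to the number of cases.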

Analysis statistics for predictors. The following statistics are printed for each regression or for each step of a stepwise regression:

coefficient B (unstandardized partial regression coefficient),
standard error (sigma) of B,
coefficient beta (standardized partial regression coefficient),
standard error (sigma) of beta,
partial and marginal R squared,
t-ratio,
covariance ratio,
marginal R squared values for all predictors and t-ratios for all sets of dummy variables (for stepwise regression).

Residual output dictionary. (For raw data input only. Optional: see the regression parameter WRITE).

Residual output data. (For raw data input only. Optional: see the regression parameter PRINT). If there are fewer than 1000 cases, calculated values, observed values and residuals (differences) may be listed in ascending order of residual value. Any number of cases may be listed in input case sequence order. The Durbin-Watson statistic for association of residuals will be printed for residuals listed in case sequence order.
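The Durbin-Watson statistic is computed from the residuals taken in case sequence order. A minimal Python sketch of the standard definition follows; the function name and the sample residuals are hypothetical.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squares.

    Values near 2 suggest no first-order association between successive
    residuals; values near 0 or 4 suggest positive or negative association.
    """
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Hypothetical residuals in input case order.
dw_alternating = durbin_watson([1.0, -1.0, 1.0, -1.0])  # signs alternate
dw_constant = durbin_watson([1.0, 1.0, 1.0, 1.0])       # residuals do not change
```

Because the statistic depends on the order of the cases, it is only meaningful when that order carries information, for example a time sequence.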


27.4  Output Correlation Matrix

The computed correlation matrix may be output (see the parameter WRITE). It is written in the form of an IDAMS square matrix (see the "Data in IDAMS" chapter). The format is 6F11.7 for the correlations and 4E15.7 for the means and standard deviations. In addition, labeling information is written in columns 73-80 of the records as follows:

     matrix-descriptor record        N=nnnnn
     correlation records             REG xxx
     means records                   MEAN xxx
     standard deviation records      SDEV xxx

(nnnnn is the REGRESSN sample size. The xxx is a sequence number beginning with 1 for the first correlation record and incremented by one for each successive record through the last standard deviation record).

The elements of the matrix are Pearson r's. They, as well as the means and standard deviations, are based on the cases that have valid data on all the variables specified in any of the regression variable lists. The correlations are for all pairs of variables from all the analysis variable lists taken together.


27.5  Output Residuals Dataset(s)

For each analysis, a residuals dataset can be requested (see the regression parameter WRITE). This is output in the form of a Data file described by an IDAMS dictionary. It contains either four or five variables per case, depending on whether or not the data were weighted: an ID variable, a dependent variable, a predicted (calculated) dependent variable, a residual, and a weight, if any. Cases are output in the order of the input cases. The characteristics of the dataset are as follows:

                              Variable                     Field   No. of     MD1
                              No.       Name               Width   Decimals   Code
     (ID variable)            1         same as input      *       0          same as input
     (dependent variable)     2         same as input      *       **         same as input
     (predicted variable)     3         Predicted value    7       ***        9999999
     (residual)               4         Residual           7       ***        9999999
     (weight-if weighted)     5         same as input      *       **         same as input

     *    transferred from input dictionary for V variables or 7 for R variables
     **   transferred from input dictionary for V variables or 2 for R variables
     ***  6 plus no. of decimals for dependent variable minus width of dependent variable; if this is negative, then 0.

If the calculated value or residual exceeds the allocated field width, it is replaced by MD1 code.
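The rule for the number of decimals of the predicted value and the residual can be written as a one-line Python function. This is a sketch of the rule as stated in the table note; the function name is hypothetical.

```python
def output_decimals(dep_width, dep_decimals):
    """Decimals for the predicted value and residual fields (width 7).

    6 plus the dependent variable's decimals minus its width,
    but never less than 0.
    """
    return max(0, 6 + dep_decimals - dep_width)


# Hypothetical dependent variables.
d_narrow = output_decimals(dep_width=5, dep_decimals=2)  # 6 + 2 - 5
d_wide = output_decimals(dep_width=9, dep_decimals=1)    # negative, so 0
```

A wide dependent variable therefore leaves no room for decimals in the 7-character output fields, and values too large for the field are replaced by the MD1 code.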


27.6  Input Dataset

The input raw dataset is a Data file described by an IDAMS dictionary. All variables used for analysis must be numeric; they may be integer or decimal valued. The case ID variable can be alphabetic.


27.7  Input Correlation Matrix

This is an IDAMS square matrix. A correlation matrix generated by PEARSON or by a previous REGRESSN is an appropriate input matrix for REGRESSN.

The input matrix dictionary must contain variable numbers and names. The matrix must contain correlations, means and standard deviations. Both the means and standard deviations are used.


27.8  Setup Structure




     $RUN REGRESSN

     $FILES
          File specifications

     $RECODE (optional with raw data input; unavailable with matrix input)
          Recode statements

     $SETUP
          1. Filter (optional)
          2. Label
          3. Parameters
          4. Definition of dummy variables (conditional)
          5. Regression specifications (repeated as required)

     $DICT (conditional)
          Dictionary for raw data input

     $DATA (conditional)
          Data for raw data input

     $MATRIX (conditional)
          Matrix for correlation matrix input


     Files:
     FT02       output correlation matrix
     FT09       input correlation matrix
                (if $MATRIX not used and INPUT=MATRIX)
     DICTxxxx   input dictionary (if $DICT not used and INPUT=RAWDATA)
     DATAxxxx   input data (if $DATA not used and INPUT=RAWDATA)
     DICTyyyy   output residuals dictionary )  one set for each
     DATAyyyy   output residuals data       )  residuals file requested
     PRINT      results (default  IDAMS.LST)


27.9  Program Control Statements

Refer to "The IDAMS Setup File" chapter for further descriptions of the program control statements, items 1-3 and 5 below.

  1. Filter (optional). Selects a subset of cases to be used in the execution. Available only with raw data input.
    
         Example:  INCLUDE V3=5
    
  2. Label (mandatory). One line containing up to 80 characters to label the results.
    
         Example:  REGRESSION ANALYSIS
    
  3. Parameters (mandatory). For selecting program options.
    
         Example:  IDVAR=V1  MDHANDLING=100
    

    INPUT=RAWDATA /MATRIX

    RAWD 
    The input data are in the form of a Data file described by an IDAMS dictionary.
    MATR 
    The input data are correlation coefficients in the form of an IDAMS square matrix.

    Parameters only for raw data input

    INFILE=IN /xxxx

    A 1-4 character ddname suffix for the input Dictionary and Data files.
    Default ddnames: DICTIN, DATAIN.

    BADDATA=STOP /SKIP/MD1/MD2

    Treatment of non-numeric data values. See "The IDAMS Setup File" chapter.

    MAXCASES=n

    The maximum number of cases (after filtering) to be used from the input file.
    Default: All cases will be used.

    MDVALUES=BOTH /MD1/MD2/NONE

    Which missing data values are to be used for the variables accessed in this execution. See "The IDAMS Setup File" chapter.

    MDHANDLING=0 /n

    The number of missing data cases to be allowed before termination. A case is counted missing if it has missing data in any of the variables in the regression equations.

    WEIGHT=variable number

    The weight variable number if the data are to be weighted.

    CATE

    Specify CATE if a definition of dummy variables is provided.

    IDVAR=variable number

    Variable to be output or printed as case ID if residuals dataset is requested. The ID variable should not be included in any variable list.

    WRITE=MATRIX

    Write the correlation matrix computed from the raw data input to an output file.

    PRINT=(CDICT/DICT, XMOM, XPRODUCTS, MATRIX)

    CDIC 
    Print the input dictionary for the variables accessed with C-records if any.
    DICT 
    Print the input dictionary without C-records.
    XMOM 
    Print the matrix of residual sums of squares and cross-products.
    XPRO 
    Print the matrix of total sums of squares and cross-products.
    MATR 
    Print the correlation matrix.

    Parameters for correlation matrix input

    CASES=n

    Set CASES equal to the number of cases used to create the input matrix. This number is used in calculating the F-level.
    No default; must be supplied when a correlation matrix is input.

    PRINT=MATRIX

    Print the correlation matrix.

  4. Definition of dummy variables (conditional: if CATE was specified as a parameter). The REGRESSN program can transform a categorical variable to a set of dummy variables. To have a variable treated as categorical, the user must a) include the CATE parameter in the parameter list and b) specify the variables to be considered categorical and the codes to be used. Each categorical variable to be transformed is followed by the codes to be used, enclosed in parentheses. For each variable, any codes not listed will be excluded from the construction. Note: The list of codes should not be exhaustive, i.e. at least one existing code must be left out of the list, or else a singular matrix will result.
    
         Example:  V100(5,6,1), V101 (1-6)
    
    Codes 5, 6 and 1 of variable 100 will be represented in the regression as dummy variables, along with codes 1 through 6 of variable 101.

    A variable specified in the definition of dummy variables, when used in predictor (VARS), partials (PARTIALS) or forced (FORCE) variables lists for stepwise regression, will refer to the set of dummy variables created from that variable. In stepwise regressions, the codes of such a variable will be entered or excluded together, and marginal R-squares and F-ratios will be calculated for all codes of the variable together as well as for codes individually. A variable used in a definition of dummy variables may not be used as a dependent variable.

  5. Regression specifications. The coding rules are the same as for parameters. Each set of regression parameters must begin on a new line.
    
         Example:  DEPV=V5  METH=STEP  FORCE=(V7) VARS=(V7,V16,V22,V37-V47,R14)
    
    METHOD=STANDARD /STEPWISE/DESCENDING
    STAN 
    A standard regression will be done.
    STEP 
    A stepwise regression will be done.
    DESC 
    A descending stepwise regression will be done.

    DEPVAR=variable number

    Variable number of dependent variable.
    No default.

    VARS=(variable list)

    The independent variables to be used in this analysis.
    No default.

    PARTIALS=(variable list)

    Compute and print a partial correlation matrix with the specified variables removed from the independent variable list.
    Default: No partials.

    FORCE=(variable list)

    Force the variables listed to enter into the stepwise regression (METH=STEP) or to remain in the descending stepwise regression (METH=DESC).
    Default: No forcing.

    FINRATIO=.001 /n

    The F-ratio value below which a variable will not be entered in a stepwise procedure; this is the F-ratio to enter. The decimal point must be entered.

    FOUTRATIO=0.0 /n

    The F-ratio value below which a variable already in the equation is removed in a stepwise procedure; this is the F-ratio to remove. The decimal point must be entered.

    CONSTANT=0

    For raw data input only.
    The constant term is required to equal zero and no constant term will be estimated.
    Default: A constant term will be estimated.

    WRITE=RESIDUALS

    Residuals are to be written out as an IDAMS dataset.

    OUTFILE=OUT /yyyy

    Applicable only if WRITE=RESI specified.
    A 1-4 character ddname suffix for the residuals output Dictionary and Data files. If residuals are output from more than one analysis, the default ddname, OUT, may be used only once.

    PRINT=(STEP, RESIDUALS, ERESIDUALS, INVERSE)

    STEP 
    Applies to the stepwise regression only: print marginal R-squares for all predictors in each step.
    RESI 
    Print residuals in input case sequence order and Durbin-Watson statistic.
    ERES 
    Print residuals, except for missing data, in error magnitude order, provided there are fewer than 1000 cases.
    INVE 
    Print the inverse correlation matrix.


27.10  Restrictions

  1. With raw data input, there may be as many as 99 or 100 (depending on whether a weight variable is used) distinct variables used in any single regression equation; the total number of variables across all analyses, including Recode variables, weight variable and ID variable, can be no more than 200.
  2. With matrix input, the matrix can be 200 x 200, and up to 100 variables may be used in any single regression equation.
  3. FINRATIO must be greater than or equal to FOUTRATIO.
  4. Residuals may be listed in ascending order of residual value only if there are fewer than 1000 cases.
  5. A variable specified in a definition of dummy variables may not be used as a dependent variable.
  6. A maximum of 12 dummy variables can be defined from one categorical variable.
  7. If the ID variable is alphabetic with width > 4, only the first four characters are used.


27.11  Examples

Example 1. Standard regression with five independent variables using an IDAMS correlation matrix as input.


     $RUN REGRESSN
     $FILES
     FT09 = A.MAT                            input Matrix file
     $SETUP
     STANDARD REGRESSION  -  USING MATRIX AS INPUT
     INPUT=MATR  CASES=1460
     DEPV=V116  VARS=(V18,V36,V55-V57)

Example 2. Standard regression with six independent variables and with two variables each with 3 categories transformed to 6 dummy variables; raw data are used as input; residuals are to be computed and written into a dataset (cases are identified by variable V2).

     $RUN REGRESSN
     $FILES
     PRINT   = REGR2.LST
     DICTIN  = STUDY.DIC                     input Dictionary file
     DATAIN  = STUDY.DAT                     input Data file
     DICTOUT = RESID.DIC                     Dictionary file for residuals
     DATAOUT = RESID.DAT                     Data file for residuals
     $SETUP
     STANDARD REGRESSION  -  USING RAW DATA AS INPUT AND WRITING RESIDUALS
     MDHANDLING=50  IDVAR=V2  CATE
     V5(1,5,6),V6(1-3)
     DEPV=V116  WRITE=RESI  VARS=(V5,V6,V8,V13,V75-V78)

Example 3. Two regressions: one standard and one stepwise using raw data as input.

     $RUN REGRESSN
     $FILES
     DICTIN = STUDY.DIC                      input Dictionary file
     DATAIN = STUDY.DAT                      input Data file
     $SETUP
     TWO REGRESSIONS
     PRINT=(XMOM,XPROD)
     DEPV=V10  VARS=(V101-V104,V35)  PRINT=INVERSE
     DEPV=V11  METHOD=STEP  PRINT=STEP VARS=(V1,V3,V15-V18,V23-V29)

Example 4. Two-stage regression; the first stage uses variables V2-V6 to estimate values of the dependent variable V122; in the second stage, two additional variables, V12 and V23, are used to estimate the predicted values of V122, i.e. V122 with the effects of V2-V6 removed.

In the first regression, predicted values for the dependent variable (V122) are computed and written to the residuals file (OUTB) as variable V3. MERGE is then used to merge this variable with the variables from the original file that are required in the second stage. The output dataset from MERGE (a temporary file so it need not be defined) will contain the 5 variables from the build list, numbered V1 to V5 where A12 and A23 (to be used as predictors in the second stage) become V2 and V3, A122, the original dependent variable, becomes V4, and B3, the variable giving predicted values of V122 becomes V5. This output file is then used as input to the second stage regression.


     $RUN REGRESSN
     $FILES
     PRINT    = REGR4.LST
     DICTIN   = STUDY.DIC                    input Dictionary file
     DATAIN   = STUDY.DAT                    input Data file
     DICTOUTB = RESID.DIC                    Dictionary file for residuals
     DATAOUTB = RESID.DAT                    Data file for residuals
     $SETUP
     TWO STAGE REGRESSION  -  FIRST STAGE
     MDHANDLING=100  IDVAR=V1
     DEPV=V122  WRITE=RESI  OUTF=OUTB  VARS=(V2-V6)
     $RUN MERGE
     $SETUP
     MERGING PREDICTED VALUE (V3 IN RES FILE) INTO DATA FILE
     MATCH=INTE  INAF=IN  INBF=OUTB
     A1=B1
     A1,A12,A23,A122,B3
     $RUN REGRESSN
     $SETUP
     TWO STAGE REGRESSION  -  SECOND STAGE
     MDHANDLING=100  INFI=OUT
     DEPV=V5  VARS=(V2,V3)