27.1  General Description

REGRESSN provides a general multiple regression capability designed
for either standard or stepwise linear regression analysis. Several
regression analyses, using different parameters and variables, may be
performed in one execution.

Constant term. If the input is raw data, the user may request that
the equations have no constant term (see the regression parameter
CONSTANT=0). In that case, a matrix based on the cross-product matrix
is analyzed instead of a correlation matrix. This changes the slope of
the fitted line and can substantially affect the results. In stepwise
regression, variables may enter the equation in a different order than
they would if a constant term were estimated. If a correlation matrix
is input, the regression equation always includes a constant term.

Use of categorical variables as independent variables. An option is
available to create a set of dummy (dichotomous) variables from
specified categorical variables (see the parameter CATE). These can be
used as independent variables in the regression analysis.

F-ratio for a variable to enter in the equation.
In a stepwise regression, variables are added in turn to the
regression equation until the equation is satisfactory. At each step
the variable with the highest partial correlation with the dependent
variable is selected. A partial F-test value is then computed for the
variable and compared with a critical value supplied by the user. As
soon as the partial F for the next variable to be entered falls below
the critical value, the analysis is terminated.

F-ratio for a variable to be removed from the equation. A variable
which may have been the best single variable to enter at an early
stage of a stepwise regression may, at a later stage, not be the best
because of the relationship between it and other variables now in the
regression. To detect this, the partial F-value for each variable in
the regression is computed at each step of the calculation and
compared with a critical value supplied by the user. Any variable
whose partial F-value falls below the critical value is removed from
the model.

Stepwise regression. If stepwise regression is requested, the program
determines which variables or which sets of dummy variables among the
specified set of independent variables will actually be used for the
regression, and in which order they will be introduced, beginning with
the forced variables and continuing with the other variables and sets
of dummy variables, one by one. After each step the algorithm selects
from the remaining predictor variables the variable or set of dummy
variables which yields the largest reduction in the residual
(unexplained) variance of the dependent variable, unless its
contribution to the total F-ratio for the regression remains below a
specified threshold. Similarly, the algorithm evaluates after each
step whether the contribution of any variable or set of dummy
variables already included falls below a specified threshold, in which
case it is dropped from the regression.
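
The entry and removal logic described above can be pictured with a short Python sketch. It is an illustration under stated assumptions, not REGRESSN's own code: it always fits a constant term, ignores forced variables and sets of dummy variables, and uses the textbook partial F statistic. The names `solve`, `rss`, `partial_f`, `stepwise`, `f_in` and `f_out` are hypothetical; `f_in` and `f_out` stand in for the user-supplied critical values.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * p for x, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def rss(columns, y):
    """Residual sum of squares of y regressed on a constant plus `columns`."""
    design = [[1.0] * len(y)] + [list(c) for c in columns]
    xtx = [[sum(p * q for p, q in zip(u, v)) for v in design] for u in design]
    xty = [sum(p * q for p, q in zip(u, y)) for u in design]
    beta = solve(xtx, xty)
    fitted = [sum(b * col[i] for b, col in zip(beta, design))
              for i in range(len(y))]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))


def partial_f(rss_without, rss_with, n, k_with):
    """Partial F for one variable; k_with counts predictors in the larger model.

    Assumes rss_with > 0 (no perfect fit) and n > k_with + 1.
    """
    return (rss_without - rss_with) / (rss_with / (n - k_with - 1))


def stepwise(data, y, f_in, f_out, max_steps=100):
    """data maps a variable name to its column; returns the names kept."""
    chosen, n = [], len(y)
    for _ in range(max_steps):  # guard against cycling if f_out > f_in
        changed = False
        # Entry: the candidate with the largest partial F gives the largest
        # drop in residual variance; it enters only if F reaches f_in.
        base = rss([data[v] for v in chosen], y)
        scores = {}
        for v in data:
            if v not in chosen:
                with_v = rss([data[u] for u in chosen] + [data[v]], y)
                scores[v] = partial_f(base, with_v, n, len(chosen) + 1)
        if scores:
            best = max(scores, key=scores.get)
            if scores[best] >= f_in:
                chosen.append(best)
                changed = True
        # Removal: drop the weakest included variable if its F is below f_out.
        if chosen:
            full = rss([data[v] for v in chosen], y)
            drops = {
                v: partial_f(rss([data[u] for u in chosen if u != v], y),
                             full, n, len(chosen))
                for v in chosen
            }
            worst = min(drops, key=drops.get)
            if drops[worst] < f_out:
                chosen.remove(worst)
                changed = True
        if not changed:
            break
    return chosen


# Made-up data: y follows x1 closely, while x2 is unrelated to y.
data = {
    "x1": [1, 2, 3, 4, 5, 6, 7, 8],
    "x2": [1, -1, -1, 1, 1, -1, -1, 1],
}
y = [2.1, 3.9, 6.1, 7.9, 9.9, 12.1, 13.9, 16.1]
selected = stepwise(data, y, f_in=4.0, f_out=3.9)
print(selected)
```

Keeping `f_out` at or below `f_in` means a variable that has just entered cannot be removed in the same step, which is why the sketch only needs `max_steps` as a safety net.
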
Descending stepwise regression. This works like stepwise regression,
except that the algorithm starts with all the independent variables
and then drops variables and sets of dummy variables in a stepwise
manner. At each step the algorithm selects from the remaining included
predictor variables the variable or set of dummy variables which
yields the smallest reduction in the explained variance of the
dependent variable, unless this exceeds a specified threshold.
Similarly, the algorithm evaluates at each step whether the
contribution of any variable or set of dummy variables previously
dropped from the regression has risen above a specified threshold, in
which case it is added back into the regression.

Generating a residuals dataset. With raw data input, residuals may be
computed and output as a data file described by an IDAMS dictionary.
See the "Output Residuals Datasets" section for details on the
content. Note that a separate residuals dataset is generated from each
equation. Also, since REGRESSN has no facility to transfer specific
variables of interest in a residuals analysis from the input raw data
to the residuals dataset, it may be necessary to use the MERGE program
to create the dataset containing all of the desired variables. A case
ID variable from the input dataset is output to the residuals dataset
to make matching possible.

Generating a correlation matrix. If raw data are input, the program
computes correlation coefficients which may be output in the format of
an IDAMS square matrix and used for further analysis. REGRESSN
correlations include all variables across all regression equations and
are based on cases which have valid data on all variables in the
matrix. Thus, the correlations will usually differ from those obtained
from the PEARSON program run with the MDHANDLING=PAIR option. When
missing data elimination in REGRESSN leaves the sample size acceptably
large, REGRESSN is an alternative to PEARSON for generating a
correlation matrix (see the paragraph "Treatment of missing data").
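
The difference between correlations based on cases valid on all variables and correlations computed pair by pair can be seen in a small Python sketch. The data are made up, `None` is used here as a stand-in for a missing value, and the helper names (`pearson`, `listwise`, `pairwise`, `corr`) are hypothetical; this is not how either program is implemented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson r for two equally long lists of numbers."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def listwise(cases):
    """Keep only cases with valid data on every variable."""
    return [c for c in cases if None not in c]


def pairwise(cases, i, j):
    """Keep cases with valid data on variables i and j only."""
    return [c for c in cases if c[i] is not None and c[j] is not None]


def corr(cases, i, j):
    return pearson([c[i] for c in cases], [c[j] for c in cases])


# Three variables per case; the fourth case is missing the third variable.
cases = [(1, 1, 5), (2, 2, 3), (3, 3, 4), (4, 10, None), (5, 5, 1)]

r_listwise = corr(listwise(cases), 0, 1)        # fourth case dropped
r_pairwise = corr(pairwise(cases, 0, 1), 0, 1)  # fourth case kept
print(r_listwise, r_pairwise)
```

The fourth case is complete for the first two variables, so it counts in the pairwise correlation but is dropped when validity on all variables is required, and the two coefficients differ.
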
27.2  Standard IDAMS Features

Case and variable selection. If raw data are input, the standard
filter is available to select a subset of cases from the input data.
If a matrix of correlations is used as input to the program, case
selection is not applicable. The variables for the regression equation
are specified in the regression parameters DEPVAR and VARS.

Transforming data. If raw data are input, Recode statements may be
used.

Weighting data. If raw data are input, a variable can be used to
weight the input data; this weight variable may have integer or
decimal values. The program will force the sum of the weights to equal
the number of input cases. When the value of the weight variable for a
case is zero, negative, missing or non-numeric, the case is always
skipped; the number of cases so treated is printed.

Treatment of missing data. If a matrix is input, cases with missing
data should have been accommodated when the matrix was created. If a
cell of the input matrix has a missing data code (i.e. 99.999), any
analysis involving that cell will be skipped.
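
The weight handling described under "Weighting data" can be sketched in Python as follows. This is only an illustration: the function name `normalise_weights` is hypothetical, `None` and strings stand in for missing and non-numeric values, and the sketch assumes that "the number of input cases" means the number of cases kept after skipping. The text does not settle that point.

```python
def normalise_weights(raw):
    """Return (rescaled weights of kept cases, number of skipped cases)."""
    kept, skipped = [], 0
    for w in raw:
        is_number = isinstance(w, (int, float)) and not isinstance(w, bool)
        if is_number and w > 0:
            kept.append(float(w))
        else:
            # Zero, negative, missing or non-numeric weight: skip the case.
            skipped += 1
    if not kept:
        return [], skipped
    # Assumption: rescale so the weights sum to the number of kept cases.
    factor = len(kept) / sum(kept)
    return [w * factor for w in kept], skipped


weights, skipped = normalise_weights([2, 0.5, 0, -1, None, "x", 1.5])
print(weights, skipped)
```
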
| matrix-descriptor record | N=nnnnn |
| correlation records | REG xxx |
| means records | MEAN xxx |
| standard deviation records | SDEV xxx |
(nnnnn is the REGRESSN sample size; xxx is a sequence number beginning with 1 for the first correlation record and incremented by one for each successive record through the last standard deviation record.)
The elements of the matrix are Pearson r's. They, as well as the means and standard deviations, are based on the cases that have valid data on all the variables specified in any of the regression variable lists. The correlations are for all pairs of variables from all the analysis variable lists taken together.
| | Variable No. | Name | Field Width | No. of Decimals | MD1 Code |
| (ID variable) | 1 | same as input | * | 0 | same as input |
| (dependent variable) | 2 | same as input | * | ** | same as input |
| (predicted variable) | 3 | Predicted value | 7 | *** | 9999999 |
| (residual) | 4 | Residual | 7 | *** | 9999999 |
| (weight, if weighted) | 5 | same as input | * | ** | same as input |

If the calculated value or residual exceeds the allocated field width, it is replaced by the MD1 code.
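
The overflow rule above can be pictured with a short Python sketch. The field width of 7 and the MD1 code 9999999 come from the table; everything else is an assumption made for illustration, in particular the function name `encode_field`, the choice of two decimals, and writing the value with an explicit decimal point. How REGRESSN actually lays out the digits is not described in this section.

```python
def encode_field(value, width=7, decimals=2, md1="9999999"):
    """Format a value into a fixed-width field, or return the MD1 code."""
    text = f"{value:.{decimals}f}"
    if len(text) > width:
        # The value does not fit the allocated field: write MD1 instead.
        return md1
    return text.rjust(width)


print(encode_field(12.5))
print(encode_field(-3.25))
print(encode_field(1234567.891))
```
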
27.6  Input Dataset

The input raw dataset is a Data file described by an IDAMS dictionary.
All variables used for analysis must be numeric; they may be integer
or decimal valued. The case ID variable can be alphabetic.
$RUN REGRESSN

$FILES
   File specifications

$RECODE (optional with raw data input; unavailable with matrix input)
   Recode statements

$SETUP
   1. Filter (optional)
   2. Label
   3. Parameters
   4. Definition of dummy variables (conditional)
   5. Regression specifications (repeated as required)

$DICT (conditional)
   Dictionary for raw data input

$DATA (conditional)
   Data for raw data input

$MATRIX (conditional)
   Matrix for correlation matrix input

Files:
   FT02       output correlation matrix
   FT09       input correlation matrix
              (if $MATRIX not used and INPUT=MATRIX)
   DICTxxxx   input dictionary (if $DICT not used and INPUT=RAWDATA)
   DATAxxxx   input data (if $DATA not used and INPUT=RAWDATA)
   DICTyyyy   output residuals dictionary
   DATAyyyy   output residuals data
              (one DICTyyyy/DATAyyyy set for each residuals file requested)
   PRINT      results (default IDAMS.LST)
27.9  Program Control Statements

Refer to "The IDAMS setup file" chapter for further descriptions of
the program control statements, items 1-3 and 5 below.

1. Filter (optional).
   Example: INCLUDE V3=5

2. Label.
   Example: REGRESSION ANALYSIS

3. Parameters.
   Example: IDVAR=V1 MDHANDLING=100

   INPUT=RAWDATA /MATRIX

   Parameters only for raw data input
   INFILE=IN /xxxx
   BADDATA=STOP /SKIP/MD1/MD2
   MAXCASES=n
   MDVALUES=BOTH /MD1/MD2/NONE
   MDHANDLING=0 /n
   WEIGHT=variable number
   CATE
   IDVAR=variable number
   WRITE=MATRIX
   PRINT=(CDICT/DICT, XMOM, XPRODUCTS, MATRIX)

   Parameters for correlation matrix input
   CASES=n
   PRINT=MATRIX

4. Definition of dummy variables (conditional).
   Example: V100(5,6,1), V101 (1-6)

   A variable specified in the definition of dummy variables, when
   used in predictor (VARS), partials (PARTIALS) or forced (FORCE)
   variables lists for stepwise regression, will refer to the set of
   dummy variables created from that variable. In stepwise
   regressions, the codes of such a variable will be entered or
   excluded together, and marginal R-squares and F-ratios will be
   calculated for all codes of the variable together as well as for
   codes individually. A variable used in a definition of dummy
   variables may not be used as a dependent variable.

5. Regression specifications (repeated as required).
   Example: DEPV=V5 METH=STEP FORCE=(V7) VARS=(V7,V16,V22,V37-V47,R14)

   METHOD=STANDARD /STEPWISE/DESCENDING
   DEPVAR=variable number
   VARS=(variable list)
   PARTIALS=(variable list)
   FORCE=(variable list)
   FINRATIO=.001 /n
   FOUTRATIO=0.0 /n
   CONSTANT=0
   WRITE=RESIDUALS
   OUTFILE=OUT /yyyy
   PRINT=(STEP, RESIDUALS, ERESIDUALS, INVERSE)
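
A dummy-variable definition such as the `V100(5,6,1)` example above can be pictured with a small Python sketch. It assumes that each listed code yields one 0/1 indicator and that a case with an unlisted code gets 0 on all of them; the function name `make_dummies` and the sample values are hypothetical, and REGRESSN's exact coding rules are not spelled out in this section.

```python
def make_dummies(values, codes):
    """Return one 0/1 column per listed code of a categorical variable."""
    return {code: [1 if v == code else 0 for v in values] for code in codes}


# Made-up values of a categorical variable for five cases.
v100 = [5, 6, 1, 3, 5]
dummies = make_dummies(v100, [5, 6, 1])

for code, column in dummies.items():
    print(code, column)
```

Under this assumption the fourth case, whose code 3 is not listed, is 0 on every indicator and so acts as the reference category.
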