25    Discriminant Analysis (DISCRAN)


25.1  General Description

The task of discriminant analysis is to find the best linear discriminant function(s) of a set of variables which reproduce(s), as far as it is possible, an a priori grouping of the cases considered.

A stepwise procedure is used in this program, i.e. in each step the most powerful variable is entered into the discriminant function. The criterion function for selecting the next variable depends on the number of groups specified (number of groups varies between 2 and 20). In the case of two groups the Mahalanobis distance is used. When the number of groups is greater than 2 then the variable selection criterion is the trace of a product of the covariance matrix for the variables involved and the inter-class covariance matrix at a particular step. This is a generalization of Mahalanobis distance defined for two groups.
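
As an illustration of the two-group criterion, the following Python sketch performs forward selection by maximising the squared Mahalanobis distance between the two group means at each step. It is not the DISCRAN code: the function names are hypothetical, cases are unweighted, and the use of the pooled within-group covariance matrix is an assumption, since the description above does not say which covariance estimate the program uses.

```python
def mean_vector(rows):
    """Column means of a list of equally long numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def pooled_covariance(g1, g2):
    """Pooled within-group covariance matrix (assumed estimate)."""
    p = len(g1[0])
    s = [[0.0] * p for _ in range(p)]
    for rows in (g1, g2):
        m = mean_vector(rows)
        for r in rows:
            d = [x - mu for x, mu in zip(r, m)]
            for i in range(p):
                for j in range(p):
                    s[i][j] += d[i] * d[j]
    dof = len(g1) + len(g2) - 2
    return [[v / dof for v in row] for row in s]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def mahalanobis_sq(g1, g2, cols):
    """Squared Mahalanobis distance between group means on columns `cols`."""
    a = [[r[c] for c in cols] for r in g1]
    b = [[r[c] for c in cols] for r in g2]
    diff = [x - y for x, y in zip(mean_vector(a), mean_vector(b))]
    w = solve(pooled_covariance(a, b), diff)
    return sum(d * v for d, v in zip(diff, w))


def forward_select(g1, g2, max_steps):
    """Enter, at each step, the variable giving the largest distance."""
    chosen, remaining = [], list(range(len(g1[0])))
    for _ in range(min(max_steps, len(remaining))):
        best = max(remaining,
                   key=lambda c: mahalanobis_sq(g1, g2, chosen + [c]))
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Made-up data: column 1 separates the groups, column 0 does not.
group1 = [[1.0, 5.0], [2.0, 6.0], [3.0, 5.5]]
group2 = [[1.5, 9.0], [2.5, 10.0], [2.0, 9.5]]
order = forward_select(group1, group2, max_steps=2)
```

With this data the order of entry starts with column 1, because the two group means coincide on column 0. The case of more than two groups, which uses the trace criterion, is not covered by this sketch.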

Besides executing the main discriminant analysis steps on a basic sample, there are two optional possibilities. The first is checking the power of the discriminant function(s) with the help of a test sample, in which the group assignment of the cases is known (as in the basic sample) but whose cases were not used in the analysis. The second is classifying the cases of an anonymous sample, in which the group assignment is unknown or at least not used, with the help of the discriminant function(s) provided by the analysis.


25.2  Standard IDAMS Features

Case and variable selection. The standard filter is available to select a subset of cases from the input data. A further subsetting is possible with the use of the sample and group variables. Analysis variables are selected with the VARS parameter.

Transforming data. IDAMS Recoding may be used.

Weighting data. A variable can be used to weight the input data; this weight variable may have integer or decimal values. When the value of the weight variable for a case is zero, negative, missing or non-numeric, then the case is always skipped; the number of cases so treated is printed.
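
The skipping rule for weights can be sketched in Python as follows. This is only an illustration of the rule as stated above: the function names, the dictionary representation of a case and the `missing_codes` argument are hypothetical and do not correspond to IDAMS internals.

```python
import math


def usable_weight(raw, missing_codes=()):
    """Return the weight as a float, or None if the case must be skipped."""
    try:
        w = float(raw)
    except (TypeError, ValueError):
        return None                      # non-numeric value
    if math.isnan(w) or w in missing_codes:
        return None                      # missing value
    if w <= 0:
        return None                      # zero or negative weight
    return w


def split_by_weight(cases, weight_key, missing_codes=()):
    """Return the cases that keep a usable weight and the number skipped."""
    kept, skipped = [], 0
    for case in cases:
        if usable_weight(case.get(weight_key), missing_codes) is None:
            skipped += 1
        else:
            kept.append(case)
    return kept, skipped


# Made-up cases; 99 plays the role of a missing data code here.
cases = [{"w": "1.5"}, {"w": 0}, {"w": -2}, {"w": "abc"}, {"w": 99}, {"w": 3}]
kept, skipped = split_by_weight(cases, "w", missing_codes=(99,))
```

The second return value corresponds to the count of skipped cases that the program prints.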

Treatment of missing data. The MDVALUES parameter is available to indicate which missing data values, if any, are to be used to check for missing data. Cases with missing data in the sample variable, the group variable and/or the analysis variables can be optionally excluded from the analysis.


25.3  Printed Output

Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if any, only for variables used in the execution.

Number of cases in samples. The number of cases in the basic, test and anonymous samples according to the sample definition parameters.

Revised number of cases in samples. The number of cases in the basic, test and anonymous samples revised according to the sample and group definition parameters. Note that, for the basic and test samples, the revised figures may be smaller than the non-revised ones if the defined groups do not completely cover the samples.

Basic sample. (Optional: see the parameter PRINT). The identification and the analysis variables of the cases in the basic sample are printed group by group, with the groups separated from each other by a line of asterisks.

Test sample. As for basic sample.

Anonymous sample. As for basic sample except that there are no groups.

Univariate statistics. For each variable used in the analysis the program prints the group means and standard deviations as well as the total mean.

Stepwise procedure results (for each step)

Step number. The sequence number of the step.

Variables entered. The list of variables retained in this step.

Linear discriminant function. (Conditional: only if 2 groups specified). The constant term and the coefficients of the linear discriminant function corresponding to the variables already entered.

Classification table for basic sample. Bivariate frequency table showing the re-distribution of cases between the original groups and the groups to which they are allocated on the basis of the discriminant function, followed by the percentage of the correctly classified cases.

Classification table for test sample. As for basic sample.
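
A classification table of this kind, together with the percentage of correctly classified cases, can be computed as in the Python sketch below. The function name and the data are hypothetical, and the cases are counted unweighted, which is an assumption.

```python
from collections import Counter


def classification_table(actual, allocated, groups):
    """Cross-tabulate original groups (rows) against allocated groups
    (columns) and return the table with the percentage correctly classified."""
    if not actual or len(actual) != len(allocated):
        raise ValueError("need two non-empty lists of equal length")
    counts = Counter(zip(actual, allocated))
    table = [[counts[(a, b)] for b in groups] for a in groups]
    correct = sum(table[i][i] for i in range(len(groups)))
    return table, 100.0 * correct / len(actual)


# Made-up example: eight cases in three a priori groups.
actual = [1, 1, 1, 2, 2, 2, 2, 3]
allocated = [1, 1, 2, 2, 2, 2, 1, 3]
table, pct_correct = classification_table(actual, allocated, groups=[1, 2, 3])
```

The diagonal of the table holds the cases allocated to their original group; here six of the eight cases are on the diagonal.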

Case assignment list. (Optional: see the parameter PRINT). The cases of the three samples are printed here with case identification, case allocation, and discriminant function value (for 2 groups) or distances to each group (for more than 2 groups).

Discriminant factor analysis results. (Conditional: only if more than 2 groups specified). Overall discriminant power and the discriminant power of the first three factors, followed by the values of discriminant factors for group means. In addition, a graphical representation of cases and means in the space of the first two factors is also given.


25.4  Input Dataset

The input is a data file described by an IDAMS dictionary. Three types of sample can be specified in the input file, namely:

    - basic sample,
    - test sample, and
    - anonymous sample.

The analysis is based on the basic sample. The test sample is used for testing the discriminant function(s) while the cases of the anonymous sample are simply classified using the discriminant functions.

The samples are defined by a "sample variable". The basic sample must not be empty. The groups to be separated by the discriminant function(s) should be defined by a "group variable". This variable defines an a priori classification of the basic and test sample cases.

All variables used for analysis must be numeric; they may be integer or decimal valued. The case ID variable can be alphabetic.


25.5  Setup Structure



 
     $RUN DISCRAN
   
     $FILES
          File definitions
 
     $RECODE (optional)
          Recode statements
 
     $SETUP
          1. Filter (optional)
          2. Label
          3. Parameters
 
     $DICT (conditional)
          Dictionary
 
     $DATA (conditional)
          Data
 
 
     Files:
     DICTxxxx   input dictionary (omit if $DICT used)
     DATAxxxx   input data (omit if $DATA used)
     PRINT      printed output (default  IDAMS.LST)
  


25.6  Program Control Statements

Refer to "The IDAMS Setup File" chapter for further descriptions of the program control statements, items 1-3 below.

  1. Filter (optional). Selects a subset of cases to be used in the execution.
    
         Example:  INCLUDE V3=6 OR V11=99
    
  2. Label (mandatory). One line containing up to 80 characters to label the printed output.
    
         Example:  DISCRIMINANT ANALYSIS ON AGRICULTURAL SURVEY
    
  3. Parameters (mandatory). For selecting program options.
    
         Example:  MDHA=SAMPVAR  IDVAR=V4  SAVAR=R5  BASA=(1,5) VARS=(V12-V15)
    
    INFILE=IN /xxxx
    A 1-4 character ddname suffix for the input dictionary and data files.
    Default ddnames: DICTIN, DATAIN.

    BADDATA=STOP /SKIP/MD1/MD2

    Treatment of non-numeric data values. See "The IDAMS Setup File" chapter.

    MAXCASES=n

    The maximum number of cases (after filtering) to be used from the input file.
    Default: All cases will be used.

    VARS=(variable list)

    List of V- and/or R-variables to be used in the analysis.
    No default.

    MDVALUES=BOTH /MD1/MD2/NONE

    Which missing data values are to be used for the variables accessed in this execution. See "The IDAMS Setup File" chapter.

    MDHANDLING=(SAMPVAR, GROUPVAR, ANALVARS)

    Choice of missing data treatment.
    SAMP 
    Cases with missing data in the sample variable are excluded from the analysis.
    GROU 
    Cases of basic and test samples with missing data in the group variable are excluded from the analysis.
    ANAL 
    Cases with missing data in the analysis variables are excluded from the analysis.
    Default: Cases with missing data are included.

    WEIGHT=variable number

    The weight variable number if the data are to be weighted.

    IDVAR=variable number

    Case identification variable for the data and/or case assignment listing.
    Default: "DISC" is used as identifier for all cases.

    STEPMAX=n

    Maximum number of steps to be performed. It must be less than or equal to the number of analysis variables.
    Default: Number of analysis variables.

    PRINT=(CDICT/DICT, DATA, GROUP)

    CDIC 
    Print the input dictionary for the variables accessed with C-records if any.
    DICT 
    Print the input dictionary without C-records.
    DATA 
    Print the data with original group assignments of cases.
    GROU 
    Print for each case the group assignment based on discriminant function.

    Sample specification

    These parameters are optional. If they are not specified, all cases from the input file are taken for the basic sample. Test and anonymous samples, if they exist, must always be explicitly defined. The pair-wise intersection of the samples must be empty. However, they need not cover the whole input data file. A single value or a range of values can be used for selecting the cases which belong to the corresponding sample.

    m1 = value of sample variable
    or
    m1 <= value of sample variable < m2

    where m1 and m2 may be integer or decimal values.

    SAVAR=variable number

    The variable used for sample definition. A V- or R-variable can be used.

    BASA=(m1, m2)

    Conditional: defines the basic sample. Must be provided if SAVAR is specified.

    TESA=(m1, m2)

    Conditional and optional: if SAVAR is specified. Defines the test sample.

    ANSA=(m1, m2)

    Conditional and optional: if SAVAR is specified. Defines the anonymous sample.
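
The selection rule for the samples (a single value m1, or the half-open range m1 <= value < m2) can be sketched in Python as follows. The function names are hypothetical; a range is represented by a tuple and a single value by a plain number, which is a choice made for this illustration only.

```python
def in_range(value, spec):
    """True if `value` is selected by `spec`: a single value m1,
    or a pair (m1, m2) meaning m1 <= value < m2."""
    if isinstance(spec, tuple):
        m1, m2 = spec
        return m1 <= value < m2
    return value == spec


def assign_sample(value, basa, tesa=None, ansa=None):
    """Return 'basic', 'test', 'anonymous', or None if the value of the
    sample variable falls in none of the defined samples."""
    definitions = (("basic", basa), ("test", tesa), ("anonymous", ansa))
    hits = [name for name, spec in definitions
            if spec is not None and in_range(value, spec)]
    if len(hits) > 1:
        raise ValueError("samples overlap for value %r: %s" % (value, hits))
    return hits[0] if hits else None


# Single values, as in BASA=1 TESA=2
sample_a = assign_sample(2, basa=1, tesa=2)
# Ranges: the upper bound m2 itself is not part of the range
sample_b = assign_sample(5, basa=(1, 5), tesa=(5, 9))
```

Note that with a range such as (1,5) the value 5 is not selected, because the upper bound is exclusive.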

    Basic sample classification

    These parameters define the a priori groups used in the discriminant analysis procedure. All the groups must be defined explicitly and their pair-wise intersection must be empty. However, they need not cover the whole basic sample.

    GRVAR=variable number

    The variable used for group definition. A V- or R-variable can be used.
    No default.

    GR01=(m1, m2)

    Defines the first group in the basic sample.

    GR02=(m1, m2)

    Defines the second group in the basic sample.

    GRnn=(m1, m2)

    Defines the n-th group in the basic sample (nn <= 20).

    Note. At least two groups have to be specified.
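
The constraints on the group definitions (between 2 and 20 groups, pair-wise empty intersections) can be checked before a run with a small Python sketch such as the one below. It is an illustration only: the function names are hypothetical, a range is written as a tuple (m1, m2) with m1 < m2 assumed, and a single value as a plain number.

```python
def overlaps(a, b):
    """True if two group definitions share at least one value.
    A tuple (m1, m2) means m1 <= value < m2; a number means one value."""
    a_is_range, b_is_range = isinstance(a, tuple), isinstance(b, tuple)
    if a_is_range and b_is_range:
        return a[0] < b[1] and b[0] < a[1]
    if a_is_range:
        return a[0] <= b < a[1]
    if b_is_range:
        return b[0] <= a < b[1]
    return a == b


def validate_groups(groups):
    """Raise ValueError if the definitions break the stated constraints.
    `groups` maps a keyword such as 'GR01' to its definition."""
    if not 2 <= len(groups) <= 20:
        raise ValueError("between 2 and 20 groups must be defined")
    names = list(groups)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if overlaps(groups[first], groups[second]):
                raise ValueError("%s and %s overlap" % (first, second))


# Adjacent half-open ranges do not overlap, so this passes silently.
validate_groups({"GR01": (1, 3), "GR02": (3, 5), "GR03": (5, 7)})
```

The groups need not cover the whole basic sample, so the check deliberately does not test for gaps between the definitions.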


25.7  Restrictions

  1. Maximum number of a priori groups is 20.
  2. The same variable cannot be used twice.
  3. Maximum field width of case ID variable is 4.


25.8  Examples

Example 1. Discriminant analysis on all cases together; cases are identified by the variable V1; 5 steps of analysis are requested; a priori groups are defined by the variable V111, which includes categories 1-6.


     $RUN DISCRAN
     $FILES
     PRINT  = DISC1.LST
     DICTIN = MY.DIC                   input dictionary file
     DATAIN = MY.DAT                   input data file
     $SETUP
     CANONICAL LINEAR DISCRIMINANT ANALYSIS
     PRINT=(DATA,GROUP)  IDVAR=V1  STEP=5  VARS=(V101-V105)  -
        GRVAR=V111  GR01=(1,2)  GR02=(3,4)  GR03=(5,6)

Example 2. Repeat the analysis described in Example 1, using the subset of respondents having the value 1 on V5 as the basic sample, and test the results on the respondents having the value 2 on V5.

     $RUN DISCRAN
     $FILES
          as for Example 1
     $SETUP
     CANONICAL LINEAR DISCRIMINANT ANALYSIS USING BASIC AND TEST SAMPLES
     PRINT=(DATA,GROUP)  IDVAR=V1  STEP=5  VARS=(V101-V105)  -
        SAVAR=V5  BASA=1  TESA=2  -
        GRVAR=V111  GR01=(1,2)  GR02=(3,4)  GR03=(5,6)