Multidimensional Scaling (MDSCAL)

28    Multidimensional Scaling (MDSCAL)


28.1  General Description

MDSCAL is a non-metric multidimensional scaling program for the analysis of similarities. The program, which operates on a matrix of similarity or dissimilarity measures, is designed to find, for each dimensionality specified, the best geometric representation of the data in the space.

The uses of non-metric multidimensional scaling are similar to those of factor analysis, e.g. clusters of variables can be spotted, the dimensionality of the data can be discovered, and dimensions can sometimes be interpreted. The CONFIG program can be used to perform analysis on an MDSCAL output configuration.

Input configuration. Normally an internally created arbitrary starting configuration is used to begin the computation. The user may, however, supply an initial configuration. There are several possible reasons for providing a starting configuration. The user may have theoretical reasons for beginning with a certain configuration; one may wish to perform further iteration on a configuration which is not yet close enough to the best configuration; or, to save computing time, one may wish to provide a higher dimensional configuration as a starting point for a lower dimensional configuration.

Scaling algorithm. The program starts with an initial configuration, either generated arbitrarily or supplied by the user, and iterates (using a procedure of the "steepest descent" type) over successive trial configurations, each time comparing the rank order of inter-point differences in the trial configuration with the rank order of the corresponding measure in the data. A "badness of fit" measure (stress coefficient) is computed after each iteration and the configuration is rearranged accordingly to improve the fit to the data, until, ideally, the rank order of distances in the configuration is perfectly monotonic with the rank order of dissimilarities given by the data; in that case, the "stress" will be zero. In practice, the scaling computation stops, in any given number of dimensions, because the stress reaches a sufficiently small value (STRMIN), the scale factor (magnitude) of the gradient reaches a sufficiently small value (SRGFMN), the stress has been improving too slowly (SRATIO), or the preset maximum number of iterations is reached (ITERATIONS). The program stops on whichever condition comes first. The same procedure is repeated for the next lower dimensionality using the previous results as the initial configuration, until a specified minimum number of dimensions is reached. During computation, the cosine of the angle between successive gradients plays an important role in several ways; optionally, two internal weighting parameters may be specified (see parameters COSAVW and ACSAVW).

Dimensionality and metric. Solutions may be obtained in 2 to 10 dimensions. The user controls the dimensionality of the configurations obtained by specifying the maximum and minimum number of dimensions desired, and the difference between the dimensionality of the successive solutions produced (see parameters DMAX, DMIN, and DDIF). The user also specifies, using parameter R, whether the distance metric should be Euclidean (R=2), the usual case, or some other Minkowski r-metric.

Stress. Stress is a measure of how well the configuration matches the data. The user may choose between two alternate formulas for computing the stress coefficient: either the stress is standardized by the sum of the squared distances from the mean (SQDIST) or the stress is standardized by the sum of the squared deviations from the mean (SQDEV). In many situations, the configurations reached by the two formulas will not be substantially different. Larger values of stress result from formula 2 for the same degree of fit.

Ties in input coefficients. There are two alternative methods for handling ties among the input data values; the corresponding distances can be required to be equal (TIES=EQUAL) or they can be allowed to differ (TIES=DIFFER). When there are few ties, it makes little difference which approach is used. When there are a great many ties it does make a difference, and the context must be considered in making the choice.


28.2  Standard IDAMS Features

Case and variable selection. Filtering of cases must be performed at the time the matrix is created, not in MDSCAL. The parameter VARS allows the computation to be performed on subsets of the matrix rather than on the entire matrix.

Transforming data. Use of Recode statements is not applicable in MDSCAL. Data transformations must be performed at the time the input matrix is created.

Weighting data. Weighting in the usual sense (weighting cases to correct for different sampling rates or different levels of aggregation) must be accomplished before using MDSCAL; such weighting must be incorporated in the input data matrix. There is a weight option of a quite different sort available in MDSCAL (see parameter INPUT=WEIGHTS). It may be used to assign weights to cells of the input matrix; the user supplies a matrix of values which are to be used as weights for the corresponding elements in the input matrix.

Treatment of missing data. Missing data for individual cases must be accounted for at the time the input data matrix is created, not in MDSCAL. If, after the matrix has been created, an entry in the matrix is missing, i.e. contains a missing data code, there is a possibility of processing it in MDSCAL: the MDSCAL cutoff option (see parameter CUTOFF) can be used to exclude from analysis missing data values if these are less than valid data values. MDSCAL has no option for recognizing missing data values that are large numbers (such as 99.99901, the missing data code output by PEARSON). If large missing data values do exist, these should be edited to small numbers. If one particular variable has many missing entries, possibly it should be dropped from the analysis.


28.3  Results

Input matrix. (Optional: see the parameter PRINT).

Input weights. (Optional: see the parameter PRINT).

Input configuration. If a starting configuration is supplied, it is always printed.

History of the computation. For each solution, the program prints a complete history of computations, reporting the stress value and its ancillary parameters for each iteration:

Iteration the iteration number
Stress the current value of the stress
SRAT the current value of the stress ratio
SRATAV the current stress ratio average (it is an exponentially weighted average)
CAGRGL the cosine of the angle between the current gradient and the previous gradient
COSAV the current value of the average cosine of the angle between successive gradients
(a weighted average)
ACSAV the current value of the average absolute value of the cosine of the angle
between successive gradients (a weighted average)
SFGR the length (more properly, the scale factor) of the gradient
STEP the step size.

Reason for termination. When computation is terminated, the reason is indicated by one of the remarks: "Minimum was achieved", "Maximum number of iterations were used", "Satisfactory stress was reached", or "Zero stress was reached".

Final configuration. For each solution, the Cartesian coordinates of the final configuration are printed.

Sorted configuration. (Optional: see the parameter PRINT). For each solution, the projections of points of the final configuration are sorted separately on each dimension into ascending order and printed.

Summary. For each solution, the original data values are sorted and printed together with their corresponding final distances (DIST) and the hypothetical distances required for a perfect monotonic fit (DHAT).


28.4  Output Configuration Matrix

As the final configuration for each dimensionality is calculated, it may be output as an IDAMS rectangular matrix. The configuration is centered and normalized. The rows represent variables and the columns represent dimensions. The matrix elements are written in 10F7.3 format. Dictionary records are generated. This matrix may be submitted as a configuration input for another execution of MDSCAL or it may be input to another program such as CONFIG for additional analysis.


28.5  Input Data Matrix

The usual input to MDSCAL is an IDAMS square matrix (see "Data in IDAMS" chapter). This matrix is the upper-right-half matrix with no diagonal and it is defined by the parameter INPUT=STANDARD. TABLES and PEARSON generate matrices suitable for input to MDSCAL. Means and standard deviations are not used but appropriate (dummy) records must be supplied. MDSCAL will accept matrices in other formats than the upper-right triangle with no diagonal. However, such matrices must contain the dictionary portion of an IDAMS square matrix and must have records containing pseudo means and standard deviations at the end.

The following INPUT parameters indicate the exact format of matrix being input:

STAN upper-right triangle, no diagonal
STAN, DIAG upper-right triangle, with diagonal
LOWER, DIAG lower-left triangle, with diagonal
LOWER lower-left triangle, no diagonal
SQUARE full square matrix with diagonal.

The measures contained in the data matrix may either be measures of similarity (such as correlations) or dissimilarities. Although the input to MDSCAL is usually a matrix of correlation coefficients (e.g. a matrix of gammas or a matrix of Pearson r's), the input matrix may contain any measure that makes sense as a measure of proximity. Because non-metric scaling uses only ordinal properties of the data, nothing need be assumed about the quantitative or numerical properties of the data. There should be, at the very least, twice as many variables as dimensions.


28.6  Input Weight Matrix

If a weight matrix is supplied, it must be in exactly the same format as the input data matrix. The parameter INPUT=(STAN/LOWE/SQUA, DIAG) applies to the weight matrix as well as to the data matrix. The dictionary for the weight matrix should be the same as for the input data matrix. Means and standard deviations are not used, but corresponding "dummy" lines should be supplied.

This matrix contains values, in one-to-one correspondence with elements of the data matrix, which are to be used as weights for the data. These values are used in conjunction with the value for the parameter CUTOFF when applied to the data. If a data value is greater than the cutoff value, but the corresponding weight value is less than or equal to zero, an error condition is signaled. Likewise, if the data value is less than or equal to the cutoff value, and the corresponding weight value is greater than zero, an error condition is set. If either of these inconsistencies occurs, the execution terminates.


28.7  Input Configuration Matrix

The input configuration must be in the format of an IDAMS rectangular matrix. See "Data in IDAMS" chapter.

It provides a starting configuration to be used in the computations. The rows should represent variables and the columns dimensions. It is usually produced by a previous execution of MDSCAL and is submitted in order that a previous execution may start where it left off.

The matrix must contain at least as many dimensions as the value given for the parameter DMAX.

Note: If a variable list (VARS) is specified, MDSCAL uses the first n rows of the input configuration where n is the number of variables in the list, without checking the variable numbers.


28.8  Setup Structure




     $RUN MDSCAL

     $FILES
          File specifications

     $SETUP
          1. Label
          2. Parameters

     $MATRIX (conditional)
          Data matrix
          Weight matrix
          Starting configuration matrix
     (Note:  Not all of the matrices need be included here; however, if
      more than one matrix is included, they must be in the above order).

     Files:
     FT02    output configuration matrix
     FT03    input weight matrix if INPUT=WEIGHTS specified (omit if $MATRIX used)
     FT05    input starting configuration if INPUT=CONFIG specified
             (omit if $MATRIX used)
     FT08    input data matrix (omit if $MATRIX used)
     PRINT   results (default  IDAMS.LST)


28.9  Program Control Statements

Refer to "The IDAMS Setup File" chapter for further descriptions of the program control statements, items 1-2 below.

  1. Label (mandatory). One line containing up to 80 characters to label the results.
    
         Example:  MDSCAL EXECUTION ON DATASET X4952
    
  2. Parameters (mandatory). For selecting program options.
    
         Example:  DMAX=5  ITER=75  WRITE=CONFIG
    

    INPUT=(STANDARD /LOWER/SQUARE, DIAGONAL, WEIGHTS, CONFIG)

    STAN 
    The input is an IDAMS square matrix, i.e. off-diagonal, upper-right-half matrix.
    LOWE 
    The input matrix is a lower-left-half matrix.
    SQUA 
    The input matrix is a full square matrix.
    DIAG 
    The input matrix has the diagonal elements.
    WEIG 
    A matrix of weight values is being supplied.
    CONF 
    The starting configuration matrix is being supplied.

    VARS=(variable list)

    List of variables in the matrix on which analysis is to be performed.
    Default: The entire input matrix is used.

    FILE=(DATA, WEIGHTS, CONFIG)

    DATA 
    The input data matrix is in a file.
    WEIG 
    The weight matrix is in a file.
    CONF 
    The input configuration matrix is in a file.
    Default: All matrices are assumed to follow a $MATRIX command in the order data, weight, configuration.

    COEFF=SIMILARITIES /DISSIMILARITIES

    SIMI 
    Large coefficients in the data matrix indicate that points are similar or close.
    DISS 
    Large coefficients indicate that points are dissimilar or far.

    DMAX=2 /n

    The dimension maximum: scaling starts with the space of maximum dimension.

    DMIN=2 /n

    The dimension minimum: scaling proceeds until it reaches or would pass the minimum dimension.

    DDIF=1 /n

    The dimension difference: scaling proceeds from maximum dimension to minimum dimension by steps of the dimension difference.

    R=2.0 /n

    Indicate which Minkowski r-metric is to be used. Any value >= 1.0 can be used.
    R=1.0 
    City block metric.
    R=2.0 
    Ordinary Euclidean distance.

    CUTOFF=0.0 /n

    Data values less than or equal to n are discarded. If the legitimate values of the input coefficients range from -1.0 to 1.0, CUTOFF=-1.01 should be used.

    TIES=DIFFER /EQUAL

    DIFF 
    Unequal distances corresponding to equal data values do not contribute to the stress coefficient and no attempt is made to equalize these distances.
    EQUA 
    Unequal distances corresponding to equal data values do contribute to the stress and there is an attempt to equalize these distances.

    ITERATIONS=50 /n

    The maximum number of iterations to be performed in any given number of dimensions. This maximum is a safety precaution to control execution time.

    STRMIN=.01 /n

    Stress minimum. The scaling procedure will stop if the stress reaches the minimum value.

    SFGRMN=0.0 /n

    Minimum value of the scale factor of the gradient. The scaling procedure will stop if the magnitude of the gradient reaches the minimum value.

    SRATIO=.999 /n

    The stress ratio. Scaling procedure stops if the stress ratio between successive steps reaches n.

    ACSAVW=.66 /n

    The weighting factor for the average absolute value of the cosine of the angle between successive gradients.

    COSAVW=.66 /n

    The weighting factor for the average cosine of the angle between successive gradients.

    STRESS=SQDIST /SQDEV

    SQDI 
    Compute the stress using the standardization by the sum of the squared distances.
    SQDE 
    Compute the stress using the standardization by the sum of the squared deviations from the mean.

    WRITE=CONFIG

    Output the final configuration of each solution into a file.

    PRINT=(MATRIX, SORTCONF, LONG /SHORT)

    MATR 
    Print the input data matrix and the weight matrix if one is supplied.
    SORT 
    Sort each dimension of the final configuration and print it.
    LONG 
    Print matrices on long lines.
    SHOR 
    Print matrices on short lines.


28.10  Restrictions

  1. The capacity of the program is 1800 data points (e.g. 1800 elements of the similarity or dissimilarity matrix). This is equivalent to a triangle of a 60 x 60 matrix or to a 42 x 42 square matrix.
  2. Variables may be scaled in up to 10 dimensions.
  3. The starting configuration matrix may have a maximum of 60 rows and 10 columns.


28.11  Example

Generation of an output configuration matrix; the input data matrix is in standard IDAMS form and in a file; there is neither input weight matrix nor input configuration matrix; 20 iterations are requested; analysis is to be performed on a subset of variables.


     $RUN MDSCAL
     $FILES
     FT02 = MDS.MAT                  output configuration Matrix file
     FT08 = ABC.COR                  input data Matrix file
     $SETUP
     MULTIDIMENSIONAL SCALING
     ITER=20  WRITE=CONFIG  FILE=DATA  VARS=(V18-V36)