2    Data in IDAMS


2.1  The IDAMS Dataset


2.1.1  General Description

The dataset consists of two separate files: a Data file, and a Dictionary file which describes some or all of the fields (variables) in the records of the Data file. All Dictionary/Data files output by IDAMS programs are IDAMS datasets.

2.1.2  Method of Storage and Access

Both Dictionary and Data files are read and written sequentially. Thus they may be stored on any medium. There is no special IDAMS internal "system" file as in some other packages. The files are in character/text format (ASCII) and can be processed at any time with general utilities or editors, or input directly to other statistical packages.

2.2  Data Files


2.2.1  The Data Array

Irrespective of its actual format in the data file, the data can be visualized as a rectangular array of variable values, where element xij is the value of the variable represented by the j-th column for the case represented by the i-th row. For example, the data from a survey can be displayed in the following way:


           Cases                         Variables

                       identification     education    sex    age    ...
         _________________________________________________________________

           case 1      1300                   6         2     31     ...
           case 2      1301                   2         1     25     ...
            .          1302                   3         1     55     ...
            .           .                     .         .     .      ...

In the example, each row represents a respondent in a survey and each column represents an item from the questionnaire.

2.2.2  Characteristics of the Data File

Data files normally, but not necessarily, contain fixed length records; variable length records are possible because the end of each record is recognized by carriage return/line feed characters. However, the length of the longest record must be supplied on the file definition (see $FILES command). There is no limit to the number of records in the Data file.

The maximum record length is 4096 characters.

Each "case" may consist of more than one record (up to a maximum of 50). If, in a particular program execution, variables are to be accessed from more than one type of record, then there must be exactly the same number of records for each case. The MERCHECK program can be used to create files complying with this condition. Note that any Data file output by an IDAMS program is always restructured to contain a single record per case.

If a raw data file contains different record types (and the record type is coded) and does not have exactly the same number of records per case, IDAMS programs can be executed using variables from one record type at a time by selecting only that record type at the start.

2.2.3  Hierarchical Files

IDAMS only processes "rectangular" files as described above. Hierarchical files can be handled by storing records from the different levels in different files and then using the AGGREG and MERGE programs to produce composite records containing variables from the different levels. Alternatively, the complete hierarchical data file can be processed one level at a time by "filtering" records for that level only (providing record types are coded).

2.2.4  Variables

Referencing variables. Each variable in a Data file is identified by a unique number between 1 and 9999. This number, preceded by a V (e.g. V3), is used to refer to a particular variable in control statements to programs. The variable number is used to index a variable-descriptor record in the dictionary which provides all other necessary information about the variable, such as its name and its location in the data record.

Variable types. Variables can be of numeric or alphabetic type, both stored in character mode.

Numeric variables. These can be positive or negative valued with the following characteristics:

Alphabetic variables. Alphabetic variables can be held in Data files and can be up to 255 characters long. They can be used in data management programs. Alphabetic variables of 1-4 characters can also be used in filters. In order to be used in analysis, 1-4 character alphabetic variables must be recoded to numeric values. This can be done with Recode's BRAC function.

2.2.5  Missing Data Codes

The value of a variable for a particular case may be unknown for a number of reasons, for example a question may be inapplicable to certain respondents or a respondent may refuse to answer a question. Special missing data codes can be established for each numeric variable and coded into the data when needed. Two missing data codes are allowed: MD1 and MD2. If used, any value in the data equal to MD1 is considered a missing value; any value greater than or equal to MD2 (if MD2 is positive or zero) or less than or equal to MD2 (if MD2 is negative) is also considered missing.
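The MD1/MD2 rule above can be sketched in a few lines of Python. This is only an illustration, not IDAMS code: the function name is_missing and its arguments are hypothetical, and values are assumed to have been converted to numbers already.

```python
from typing import Optional


def is_missing(value: float,
               md1: Optional[float] = None,
               md2: Optional[float] = None) -> bool:
    """Return True if value counts as missing under the MD1/MD2 rule.

    md1 or md2 set to None means that the code is not used for the variable.
    """
    # MD1 matches a single value exactly.
    if md1 is not None and value == md1:
        return True
    # MD2 defines a range: upwards if it is zero or positive,
    # downwards if it is negative.
    if md2 is not None:
        if md2 >= 0:
            return value >= md2
        return value <= md2
    return False


# Grade average with MD1 = 0.0 and MD2 = 90.0.
print(is_missing(0.0, md1=0.0, md2=90.0))   # True (equals MD1)
print(is_missing(95.0, md1=0.0, md2=90.0))  # True (>= MD2)
print(is_missing(45.5, md1=0.0, md2=90.0))  # False
```

Because a non-negative MD2 marks every larger value as missing, and a negative MD2 every smaller one, MD2 is best chosen outside the range of valid data.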

These missing data codes are stored in the dictionary record for the variable. Similar to data values, they can be integer or decimal valued, with an implicit or explicit decimal point. If MD1 or MD2 is specified with an implicit decimal point, NDEC gives the number of digits to be treated as decimal places. If an explicit decimal point is coded in MD1 or MD2, then NDEC determines the number of digits to the right of the decimal point to be retained, rounding up the value accordingly.

When a variable's MD1 and MD2 codes are blank in the dictionary, this means that there are no special numeric missing data codes. During an IDAMS program execution, blank dictionary MD1 and MD2 fields are filled in by the default missing data codes of 1.5 * 10^9 and 1.6 * 10^9 respectively.

Since the missing data codes are each limited to a maximum of 7 digits (or 6 digits and a negative sign), they can present a problem for 8 and 9 digit variables. The user should consider the use of a negative first missing data code in this case.

2.2.6  Non-numeric or Blank Values in Numeric Variables - Bad Data

In IDAMS data management programs, data values are merely copied from one place to another and conversion to a computational (binary) mode is not carried out; in this case there is no check on whether numeric variables have numeric values. However, when variables are being used for analysis or in Recode operations, then their values are converted to binary mode and values containing non-numeric characters will cause problems. Normally data should be cleaned of such characters prior to analysis. In addition, blank values in numeric variables are not automatically treated as missing values; they are also considered to be non-numeric or "bad" data.

To allow for analysis of incompletely cleaned data and for the handling of unrecoded blank fields, the BADDATA parameter may be used to treat blank and other non-numeric values as missing and thus have the possibility of eliminating them from analysis. Specification of the parameter BADDATA=MD1 or BADDATA=MD2 results in the conversion of "bad" values to the MD1 or MD2 code for the variable. If the MD1 or MD2 codes are blank, then bad data values are converted to the corresponding default missing data code (see above) and are thus treated as missing values (see the description of BADDATA parameter in "The IDAMS Setup File" chapter).
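The effect of the BADDATA parameter can be pictured with the following Python sketch. It is a simplified model under stated assumptions, not the IDAMS implementation: the function convert_field and the regular expression defining a "numeric" field are hypothetical, implicit decimal places are ignored, leading and trailing blanks are assumed to be acceptable, and the defaults are read as 1.5 * 10^9 and 1.6 * 10^9. Raising an error when BADDATA is not given is also an assumption made for illustration.

```python
import re
from typing import Optional

# Default missing data codes used when MD1/MD2 are blank in the dictionary
# (assumed here to be 1.5 * 10^9 and 1.6 * 10^9).
DEFAULT_MD1 = 1.5e9
DEFAULT_MD2 = 1.6e9

# Hypothetical definition of a clean numeric field: optional sign,
# digits, and at most one explicit decimal point.
_NUMERIC = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def convert_field(text: str,
                  baddata: Optional[str] = None,
                  md1: Optional[float] = None,
                  md2: Optional[float] = None) -> float:
    """Convert a character field to a number, mapping bad data if requested."""
    stripped = text.strip()
    if _NUMERIC.fullmatch(stripped):
        return float(stripped)
    # Blank or non-numeric value: replace it by the requested missing code.
    if baddata == "MD1":
        return DEFAULT_MD1 if md1 is None else md1
    if baddata == "MD2":
        return DEFAULT_MD2 if md2 is None else md2
    raise ValueError(f"non-numeric value in numeric field: {text!r}")


print(convert_field("  31"))                           # 31.0
print(convert_field("    ", baddata="MD1", md1=99.0))  # 99.0
print(convert_field("3A", baddata="MD2"))              # 1600000000.0
```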

2.2.7  Editing Rules for Variables Output by IDAMS Programs

IDAMS programs always create a Data file and a corresponding IDAMS dictionary, i.e. an IDAMS dataset.

The Data file contains one record for each case. The record length is the sum of the field widths of all variables output and is determined by the program.

Numeric variable values are edited to a standard form as described below:

Alphabetic variable values are not edited and are the same on input and output.

2.3  The IDAMS Dictionary


2.3.1  General Description

The dictionary is used to describe the variables in the data. For each variable it must contain at minimum the variable's number, its type and its location in the data record. In addition, a variable name, two missing data codes, the number of decimal places and a reference number or name may be given. This information is stored in variable-descriptor records sometimes known as T-records. Optional C-records for categorical variables give labels for the different possible codes. The first record in the dictionary, the dictionary-descriptor record, identifies the dictionary type, gives the first and last variable numbers used in the dictionary and specifies the number of data records making up a "case".

The original dictionary is prepared by the user to describe the raw data. IDAMS programs which output datasets always produce new dictionaries reflecting the new format of the data.

Dictionary records have a fixed format and are 80 characters long.

A detailed description of each type of dictionary record is given below.

Dictionary-descriptor record. This is always the first record in the dictionary.

Columns   Content

4         3 (indicates the type of dictionary).
5-8       First variable number (right justified).
9-12      Last variable number (right justified).
13-16     Number of records per case (right justified).
20        Form in which variable location is specified (columns 32-39) on the variable-descriptor records.
            blank   Record number and starting and ending columns. Record length must be 80 to use this format if the number of records per case is > 1.
            1       Starting location and field width.

Variable-descriptor records (T-records). The dictionary contains one such record for each variable. These records are arranged in ascending order by the variable number. The variable numbers need not be contiguous. The maximum number of variables is 1000.

Columns   Content

1         T
2-5       Variable number.
7-30      Variable name.
32-39     Location; according to column 20 of the dictionary-descriptor record.
          Either
            32-33   Record sequence number containing starting column of variable.
            34-35   Starting column number.
            36-37   Record sequence number containing ending column of variable.
            38-39   Ending column number.
          Or
            32-35   Starting location of the variable within the case.
            36-39   Field width (1-9 for numeric variables and 1-255 for alphabetic variables).
40        Number of decimal places (numeric variables only). Blank implies no decimal places.
41        Type of variable.
            blank   Numeric.
            1       Alphabetic.
45-51     First missing data code for numeric variables (or blanks if no 1st missing data code). Right justified.
52-58     Second missing data code for numeric variables (or blanks if no 2nd missing data code). Right justified.
59-62     Reference number (optional - can be used to contain some unchangeable alphanumeric reference for the variable, e.g. the original variable number or a question reference).
73-75     Study ID (optional - can be used to identify the study to which this dictionary belongs).

Note 1: When record and column numbers are used to indicate variable location, listings of the dictionary records do not show the record and column numbers as they appear on the dictionary record. Rather, the variable location is translated to and printed in the starting location/width format. For example, for a variable in columns 22-24 of the third record of a multiple record (record length 80) per case data file, the starting location will be 182 (2 * 80 + 22) and the width 3.

Note 2: If there is more than one record per case and the record length is not 80, then starting location and field width notation must be used on the T-records. The starting location is counted from the start of the first record. For example, for records of length 121, the starting location of a field at position 11 of the 2nd record for a case would be 132.
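The arithmetic in Notes 1 and 2 can be checked with a small Python sketch. The function names starting_location and to_start_and_width are hypothetical and chosen only for this illustration; records and columns are counted from 1, as in the notes.

```python
from typing import Tuple


def starting_location(record: int, column: int, record_length: int = 80) -> int:
    """Position of a column counted from the start of the first record of the case."""
    if record < 1:
        raise ValueError("record numbers start at 1")
    if not 1 <= column <= record_length:
        raise ValueError("column must lie within the record")
    return (record - 1) * record_length + column


def to_start_and_width(start_record: int, start_column: int,
                       end_record: int, end_column: int,
                       record_length: int = 80) -> Tuple[int, int]:
    """Translate record/column notation into starting location and field width."""
    start = starting_location(start_record, start_column, record_length)
    end = starting_location(end_record, end_column, record_length)
    if end < start:
        raise ValueError("field ends before it starts")
    return start, end - start + 1


# Note 1: columns 22-24 of the third record, record length 80.
print(to_start_and_width(3, 22, 3, 24))  # (182, 3)
# Note 2: position 11 of the 2nd record, record length 121.
print(starting_location(2, 11, 121))     # 132
```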

Code-label records (C-records). The dictionary may optionally contain these records for any of the variables. They follow immediately after the T-record for the variable to which they apply and provide codes and their labels for different possible values of the variable. They are used by programs such as TABLES to print row and column labels along with the corresponding codes. They can also be used as the specification of valid codes for a variable during data entry with the WinIDAMS User Interface and for data validation with the program CHECK.

Columns   Content

1         C
2-5       Variable number.
6-9       Reference number (optional - can be used to contain some unchangeable alphanumeric reference for the variable, e.g. the original variable number or a question reference).
15-19     Code value left justified.
22-72     Label for this code. (Note that only the first 8 characters will be used by analysis programs printing code labels although the complete label will appear in listings of the dictionary).
73-75     Study ID (optional).


2.3.2  Example of a Dictionary


        Columns:           1         2         3         4         5         6...
                  123456789012345678901234567890123456789012345678901234567890...

                     3   1  20   1   1
                  T   1 Identification              1   5
                  T   2 Age                         6   2          99
                  T   3 Sex                         8   1
                  C   3         1      Female
                  C   3         2      Male
                  T  11 Region                     16   1
                  C  11         1      North
                  C  11         2      South
                  C  11         3      East
                  C  11         4      West
                  T  12 Grade average              17   31        000    900
                  T  20 Name                       31  30 1

This is a dictionary describing 6 data fields in a data record as shown diagrammatically below.

        1-5    6-7    8      16       17-19    31-60
        V1     V2     V3     V11      V12      V20
        ID     Age    Sex    Region   Grade    Name

Locations of variables are expressed in terms of starting position and field width (1 in column 20 of dictionary-descriptor) and there is one record per case (1 in column 16). There is one implied decimal place in the grade average variable (V12). The age variable has a code 99 for missing data. For the grade average, 0's imply missing data as do all values greater than or equal to 90.0. The name of each respondent (V20) is recorded as a 30 character alphabetic (type 1) variable. Note that variable numbers need not be contiguous and that not all fields in the data need to be described.
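As an illustration of the T-record layout, the following Python sketch decodes the grade average record (V12) from the example. It is a minimal reading aid under stated assumptions: the names parse_t_record and md_value are hypothetical, only the starting location/field width form (1 in column 20 of the dictionary-descriptor) is handled, and the sample record is assembled with explicit field widths so that the column positions are unambiguous.

```python
from typing import Optional


def parse_t_record(line: str) -> dict:
    """Split a T-record (starting location/width form) into its fields."""
    line = line.ljust(80)  # dictionary records are 80 characters long
    if line[0] != "T":
        raise ValueError("not a T-record")
    ndec = line[39].strip()
    return {
        "number": int(line[1:5]),             # columns 2-5
        "name": line[6:30].strip(),           # columns 7-30
        "start": int(line[31:35]),            # columns 32-35
        "width": int(line[35:39]),            # columns 36-39
        "ndec": int(ndec) if ndec else 0,     # column 40, blank = none
        "alphabetic": line[40] == "1",        # column 41
        "md1": line[44:51].strip() or None,   # columns 45-51
        "md2": line[51:58].strip() or None,   # columns 52-58
    }


def md_value(code: Optional[str], ndec: int) -> Optional[float]:
    """Numeric value of a missing data code, applying implicit decimals."""
    if code is None:
        return None
    if "." in code:  # explicit decimal point
        return float(code)
    return int(code) / 10 ** ndec


# The V12 record of the example: T, number, name, start, width,
# decimals, type, three blanks, MD1, MD2.
grade = "T{:>4} {:<24} {:>4}{:>4}{:1}{:1}   {:>7}{:>7}".format(
    12, "Grade average", 17, 3, "1", "", "000", "900")

record = parse_t_record(grade)
print(record["number"], record["name"], record["start"], record["width"])
print(md_value(record["md1"], record["ndec"]))  # 0.0
print(md_value(record["md2"], record["ndec"]))  # 90.0
```

The missing data codes are kept as character strings until the number of decimal places is known, since a code such as 900 only becomes 90.0 once the implicit decimal place is applied.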

2.4  IDAMS Matrices

There are two types of IDAMS matrices: square and rectangular. Both types are self-described, but unlike the IDAMS dataset, the "dictionary" is stored in the same file as the array of values. In general, these matrices are created by one IDAMS program to be used as input to another program and the user need not be familiar with the format. If, however, it is necessary to prepare a similarity matrix, a configuration matrix, etc. by hand, then the formats described below must be observed.

Regardless of type, all records are fixed length 80-character records.

2.4.1  The IDAMS Square Matrix

The square matrix can be used only for a square and symmetric array. Only the values in the upper-right triangular, off-diagonal portion of the array are actually stored in the square matrix. An array of Pearsonian correlation coefficients is suitably stored like this.

Programs which input/output square matrices. PEARSON outputs square matrices of correlations and covariances; REGRESSN outputs a square matrix of correlations; TABLES outputs square matrices of bivariate measures of association. These matrices are appropriate input to other programs, e.g. the correlation matrix output from PEARSON can be input to REGRESSN and to CLUSFIND. Moreover, CLUSFIND and MDSCAL accept as input a square matrix of similarities or dissimilarities.

Example.


                   Columns:                 111111111122222222223...
                                   123456789012345678901234567890...

           Matrix descriptor          2   4
           Format statements    |  #F  (12F6.3)
                                |  #F  (6E12.5)
           Variable identifi-   |  #T   1 AGE
                    cations     |  #T   3 EDUCATION
                                |  #T   9 RELIGION
                                |  #T  10 SEX
           Array of values      |   -.011 -.174 -.033
                                |    .131 -.105
                                |   -.133
           Means & standard     |   0.33350E 01 0.54950E 01 0.50251E 01 0.40960E 01
                   deviations   |   0.20010E 01 0.19856E 01 0.15000E 01 0.12345E 01
Format. The square matrix contains the following:
  1. A matrix-descriptor record. This, the first record, gives the matrix type and the dimensions of the array of values.

    Columns   Content

    4         2 (indicates square matrix).
    5-8       The number of variables (right justified).

  2. A Fortran format statement describing each row of the array of values. The format statement describes the number of value fields per 80-character record and the format of each. For example, a format of (12F6.3) indicates that each row of the array is recorded with up to 12 values per record, each value occupying 6 columns, 3 of which are decimals. If a row contains more than 12 values, a new record contains the 13th value, etc. Each new row of the array always starts on a new record.

    Columns   Content

    1-2       #F
    3-80      The format statement, enclosed in parentheses.

  3. A Fortran format statement describing the vectors of the variable means and standard deviations. The format statement describes the number of values per record and the format of each.

    Columns   Content

    1-2       #F
    3-80      The format statement, enclosed in parentheses.

  4. Variable identification records. These are n records, where n is the number of variables specified on the matrix-descriptor record. The order of these records corresponds to the order of variables indexing the rows (and columns) of the array of values. When a matrix is created by an IDAMS program, the variable numbers and names are retained from the IDAMS dataset from which the bivariate statistics were generated.

    Columns   Content

    1-2       #T or #R (indicates variable identification for a row of the matrix).
    3-6       The variable number (right justified).
    8-31      The variable name.

    The above four sections of the matrix are referred to as the matrix "dictionary". Following the matrix dictionary is the array of values.

  5. The array of values. Since the array is symmetric and has diagonal cells usually containing a constant (e.g. a correlation of 1.0 for a variable correlated with itself), only the off-diagonal, upper-right corner of the array is stored. Note that for a covariance matrix the diagonal elements can be calculated using standard deviations which are included in the matrix file (see point 7 below).

    In the example of the 4-variable matrix above, the full array (before entering in the square format) would be as follows:

    
                  vars        1       3       9      10
                   1        1.000   -.011   -.174   -.033
                   3        -.011   1.000    .131   -.105
                   9        -.174    .131   1.000   -.133
                  10        -.033   -.105   -.133   1.000
    
    The portion of the array that is stored is:
    
                 vars        1       3       9       10
                  1                -.011   -.174   -.033
                  3                         .131   -.105
                  9                                -.133
                 10
    
    Each row of this reduced array begins a new record and is written according to the format specification in the matrix dictionary (see above).

  6. A vector of variable means. The n values are recorded in accordance with the format statement in the matrix dictionary.

  7. A vector of variable standard deviations. The n values are recorded in accordance with the format statement in the matrix dictionary.
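To make the storage scheme concrete, the Python sketch below rebuilds the full symmetric array from the stored upper-right triangle of the 4-variable example. The function expand_upper_triangle is a hypothetical helper written for this illustration; it assumes the rows of the reduced array have already been read into lists and that the diagonal holds a constant (1.0 for correlations).

```python
from typing import List


def expand_upper_triangle(rows: List[List[float]],
                          diagonal: float = 1.0) -> List[List[float]]:
    """Rebuild a symmetric n x n array from its off-diagonal upper triangle.

    rows[i] holds the values of row i to the right of the diagonal,
    so there are n - 1 stored rows of lengths n - 1, n - 2, ..., 1.
    """
    n = len(rows) + 1
    full = [[diagonal if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, row in enumerate(rows):
        if len(row) != n - 1 - i:
            raise ValueError(f"row {i} should hold {n - 1 - i} values")
        for k, value in enumerate(row):
            j = i + 1 + k
            full[i][j] = value  # upper-right cell as stored
            full[j][i] = value  # mirrored lower-left cell
    return full


# Stored portion of the example (variables 1, 3, 9 and 10).
stored = [[-0.011, -0.174, -0.033],
          [0.131, -0.105],
          [-0.133]]

for line in expand_upper_triangle(stored):
    print(line)
```

For a covariance matrix the diagonal is not constant; in that case the diagonal cells would have to be filled from the standard deviations stored at the end of the file, which this sketch does not do.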

2.4.2  The IDAMS Rectangular Matrix

The rectangular matrix differs from the square matrix in that the array of values may be square (and non-symmetric) or rectangular. Further, since the rows of some arrays are not indexed by variables, e.g. a frequency table, the rectangular matrix may or may not contain variable identification records; the rectangular matrix does not contain variable means and standard deviations.

Programs which input/output rectangular matrices. These matrices are created by the CONFIG, MDSCAL, TABLES and TYPOL programs. They are appropriate input for CONFIG, MDSCAL and TYPOL.

Example.


                   Columns:                      111111111122222222223...
                                        123456789012345678901234567890...

           Matrix descriptor               3   4   3
           Format statement             #F (16F5.0)
           Variable identifications   | #T   2 IQ
                                      | #T   5 EDUCATION
                                      | #T   8 MOBILITY
                                      | #T  12 SIBLING RIVALRY
           Array of values            |    59   20   10
                                      |    37   15    2
                                      |    50   40    7
                                      |     8   26   31
Format. The rectangular matrix contains the following:
  1. A matrix-descriptor record.

    Columns   Content

    4         3 (indicates rectangular matrix).
    5-8       The number of rows (right justified).
    9-12      The number of columns (right justified).
    16        Number of format (#F) statement records. (Blank implies 1).
    20        Presence of row and column labels.
                blank/0   Row labels only are present (#R or #T records).
                1         Column labels only are present (#C records).
                2         Row and column labels are present (#R or #T, and #C records).
                3         No row or column labels are present.
    21-40     Row variable name (optional).
    41-60     Column variable name (optional).
    61-80     Description of the matrix contents (optional):
                Weighted frequencies
                Unweighted freqs
                Row percentages
                Column percentages
                Total percentages
                Name of the variable for which mean values are included in the matrix.

  2. A Fortran format statement describing each row of the array of values. The format describes an 80-character record. For example, a format of (16F5.0) indicates that each row of the array is recorded with up to 16 values per record and with each value occupying 5 columns, none of which is a decimal place.

    Columns   Content

    1-2       #F
    3-80      The format statement, enclosed in parentheses.

  3. Variable identification records. The order of these records corresponds to the order of the variables/codes indexing the rows and columns of the matrix. When a rectangular matrix is created by an IDAMS program, the variable/code numbers and names are retained from the input dataset or matrix from which the array of values was derived.

    Columns   Content

    1-2       #T or #R for row labels, #C for column labels.
    3-6       The variable number or the code value (right justified). The code values longer than 4 characters are replaced by ****.
    8-58      The variable name or the code label.

    The above three sections of the matrix are referred to as the matrix "dictionary". Following the matrix dictionary is the array of values.

  4. The array of values. The full array is stored. Each row of the array begins a new record and is written according to the format specified in the matrix dictionary.
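The way a format statement governs the array records can be sketched in Python as follows. This is an illustration under stated assumptions rather than a Fortran-compatible reader: the names parse_f_format, decode_f_field and read_array are hypothetical, only a single repeated F descriptor such as (16F5.0) or (12F6.3) is handled, and blank fields are not supported. A field without an explicit decimal point is given the implied number of decimal places.

```python
import re
from typing import Iterable, List, Tuple


def parse_f_format(fmt: str) -> Tuple[int, int, int]:
    """Return (values per record, field width, decimals) for a format like (16F5.0)."""
    match = re.fullmatch(r"\(\s*(\d+)F(\d+)\.(\d+)\s*\)", fmt.strip().upper())
    if not match:
        raise ValueError(f"unsupported format: {fmt!r}")
    count, width, decimals = (int(g) for g in match.groups())
    return count, width, decimals


def decode_f_field(field: str, decimals: int) -> float:
    """Decode one field; an explicit decimal point overrides the implied one."""
    text = field.strip()
    if "." in text:
        return float(text)
    return int(text) / 10 ** decimals


def read_array(lines: Iterable[str], n_rows: int, n_cols: int,
               fmt: str) -> List[List[float]]:
    """Read n_rows rows of n_cols values; each row starts on a new record."""
    count, width, decimals = parse_f_format(fmt)
    records = iter(lines)
    array = []
    for _ in range(n_rows):
        row: List[float] = []
        while len(row) < n_cols:
            record = next(records)
            take = min(count, n_cols - len(row))
            for k in range(take):
                row.append(decode_f_field(record[k * width:(k + 1) * width],
                                          decimals))
        array.append(row)
    return array


# The 4 x 3 array of the example, written with 5-column fields.
table = [[59, 20, 10], [37, 15, 2], [50, 40, 7], [8, 26, 31]]
lines = ["".join(f"{value:5d}" for value in row) for row in table]
print(read_array(lines, 4, 3, "(16F5.0)"))
```

A row with more values than the format allows per record simply continues on the following record, which is why read_array keeps fetching records until the row is complete.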

2.5  Use of Data from Other Packages


2.5.1  Raw Data

Any data in the form of fixed format records in character (ASCII) mode can be input directly to IDAMS programs. Nearly all data base and statistical packages have an "export" or "convert" function to produce fixed format character mode data files. An IDAMS dictionary must be prepared to describe the fields required from the data.

Free format data files with Tab, comma or semicolon used as separator can be imported directly through the WinIDAMS User Interface. See the "User Interface" chapter for details.

Free format (any character being used as delimiter including blank) and DIF format text files can also be imported using the IMPEX program.

Data stored in a CDS/ISIS data base can be imported to IDAMS using the WinIDIS program.

2.5.2  Matrices

The IMPEX program can be used to import free format matrices. Furthermore, matrices produced outside IDAMS, for example a matrix provided in a publication, may also be entered according to the format given above.