The dataset consists
of two separate files: a Data file and a Dictionary file, which describes
some or all of the fields (variables) in the records of the Data file.
All Dictionary/Data files output by IDAMS programs are IDAMS datasets.
Both Dictionary
and Data files are read and written sequentially, so they may be
stored on any medium. There is no special IDAMS internal "system" file
as in some other packages. The files are in character/text format
(ASCII) and can be processed at any time with general utilities or
editors, or input directly to other statistical packages.
Irrespective of its actual
format in the data file, the data can be visualized as a rectangular
array of variable values, where element x_ij is the value
of the variable represented by the j-th column for the case represented
by the i-th row. For example, the data from a survey can be displayed
in the following way:
Data files normally, but not necessarily, contain fixed-length records,
since the end of the record is recognized by carriage return/line
feed characters. However, the length of the longest record must be
supplied on the file definition (see $FILES command). There is no
limit to the number of records in the Data file. The maximum record
length is 4096 characters. Each "case" may consist of more than
one record (up to a maximum of 50). If, in a particular program execution,
variables are to be accessed from more than one type of record, then
there must be exactly the same number of records for each case. The
MERCHECK program can be used to create files complying with this condition.
Note that any Data file output by an IDAMS program is always restructured
to contain a single record per case. If a raw data file contains
different record types (and the record type is coded) and does not
have exactly the same number of records per case, IDAMS programs can
be executed using variables from one record type at a time by selecting
only that record type at the start.
IDAMS only processes
"rectangular" files as described above. Hierarchical files can be
handled by storing records from the different levels in different
files and then using the AGGREG and MERGE programs to produce composite
records containing variables from the different levels. Alternatively,
the complete hierarchical data file can be processed one level at
a time by "filtering" records for that level only (providing record
types are coded).
Referencing variables.
The variables in a Data file are identified by a unique number between
1 and 9999. This number, preceded by a V (e.g. V3), is used to refer
to a particular variable in control statements to programs. The variable
number is used to index a variable-descriptor record in the dictionary
which provides all other necessary information about the variable
such as its name and its location in the data record.
Variable types. Variables can be of numeric or alphabetic type, both stored
in character mode.
Numeric variables. These can be positive
or negative valued with the following characteristics:
The value of a variable
for a particular case may be unknown for a number of reasons, for
example a question may be inapplicable to certain respondents or a
respondent may refuse to answer a question. Special missing data codes
can be established for each numeric variable and coded into the data
when needed. Two missing data codes are allowed: MD1 and MD2. If used,
any value in the data equal to MD1 is considered a missing value;
any value greater than or equal to MD2 (if MD2 is positive or zero)
or less than or equal to MD2 (if MD2 is negative) is also considered
missing. These missing data codes are stored in the dictionary
record for the variable. Similar to data values, they can be integer
or decimal valued, with an implicit or explicit decimal point. If
MD1 or MD2 is specified with an implicit decimal point, NDEC gives
the number of digits to be treated as decimal places. If an explicit
decimal point is coded in MD1 or MD2, then NDEC determines the number
of digits to the right of the decimal point to be retained, rounding
up the value accordingly. When a variable's MD1 and MD2 codes are
blank in the dictionary, this means that there are no special numeric
missing data codes. During an IDAMS program execution, blank dictionary
MD1 and MD2 fields are filled in by the default missing data codes
of 1.5 * 10^9 and 1.6 * 10^9 respectively. Since
the missing data codes are each limited to a maximum of 7 digits (or
6 digits and a negative sign), they can present a problem for 8 and
9 digit variables. The user should consider the use of a negative
first missing data code in this case.
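The MD1/MD2 rule described above can be illustrated with a short Python sketch. This is not IDAMS code; the function name `is_missing` and the use of `None` to stand for a blank dictionary field are assumptions made for the illustration.

```python
# Default codes substituted when the dictionary MD1/MD2 fields are blank.
DEFAULT_MD1 = 1.5e9
DEFAULT_MD2 = 1.6e9


def is_missing(value, md1=None, md2=None):
    """Return True if value counts as missing under the MD1/MD2 rule.

    md1/md2 set to None stand for blank dictionary fields (an assumption
    of this sketch), in which case the default codes are used.
    """
    md1 = DEFAULT_MD1 if md1 is None else md1
    md2 = DEFAULT_MD2 if md2 is None else md2
    if value == md1:
        return True          # equal to the first missing data code
    if md2 >= 0:
        return value >= md2  # MD2 positive or zero: at or above is missing
    return value <= md2      # MD2 negative: at or below is missing


# A grade average with MD1 = 0 and MD2 = 90.0
print(is_missing(0.0, md1=0, md2=90.0))   # True
print(is_missing(95.0, md1=0, md2=90.0))  # True
print(is_missing(45.5, md1=0, md2=90.0))  # False
```

With both codes blank, only values equal to 1.5 * 10^9 or at least 1.6 * 10^9 are flagged, so ordinary data values are never treated as missing.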
In IDAMS data management programs, data values are merely copied from
one place to another and conversion to a computational (binary) mode
is not carried out; in this case there is no check on whether numeric
variables have numeric values. However, when variables are being used
for analysis or in Recode operations, then their values are converted
to binary mode and values containing non-numeric characters will cause
problems. Normally data should be cleaned of such characters prior
to analysis. In addition, blank values in numeric variables are not
automatically treated as missing values; they are also considered
to be non-numeric or "bad" data. To allow for analysis of incompletely
cleaned data and for the handling of unrecoded blank fields, the BADDATA
parameter may be used to treat blank and other non-numeric values
as missing and thus have the possibility of eliminating them from
analysis. Specification of the parameter BADDATA=MD1 or BADDATA=MD2
results in the conversion of "bad" values to the MD1 or MD2 code for
the variable. If the MD1 or MD2 codes are blank, then bad data values
are converted to the corresponding default missing data code (see
above) and are thus treated as missing values (see the description
of the BADDATA parameter in "The IDAMS Setup File" chapter).
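The effect of BADDATA=MD1 or BADDATA=MD2 can be sketched as follows. This is an illustration only, not the IDAMS implementation: the function `to_number`, its arguments and the decision to raise an error when no BADDATA option is given are all assumptions of the sketch.

```python
# Defaults used when the variable's MD1/MD2 dictionary fields are blank.
DEFAULT_MD = {"MD1": 1.5e9, "MD2": 1.6e9}


def to_number(field, baddata=None, md_codes=None):
    """Convert a character field to a number, applying a BADDATA-style rule.

    baddata is None, "MD1" or "MD2". md_codes maps "MD1"/"MD2" to the
    variable's dictionary codes; a missing key or None means blank.
    """
    try:
        return float(field)
    except ValueError:
        # Blank or otherwise non-numeric ("bad") value.
        if baddata is None:
            raise  # assumption of this sketch: no option given, so fail
        code = (md_codes or {}).get(baddata)
        return DEFAULT_MD[baddata] if code is None else float(code)


print(to_number(" 42"))                       # 42.0
print(to_number("  ", "MD1", {"MD1": 99}))    # 99.0
print(to_number("1A", "MD2"))                 # 1600000000.0
```

Once a bad value has been replaced by a missing data code, the usual missing data handling can exclude it from the analysis.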
IDAMS programs always create a Data file and a corresponding IDAMS
dictionary, i.e. an IDAMS dataset. The Data file contains one record
for each case. The record length is the sum of the field widths of
all variables output and is determined by the program. Numeric
variable values are edited to a standard form as described below:
The dictionary is
used to describe the variables in the data. For each variable it must
contain at minimum the variable's number, its type and its location
in the data record. In addition, a variable name, two missing data
codes, the number of decimal places and a reference number or name
may be given. This information is stored in variable-descriptor records
sometimes known as T-records. Optional C-records for categorical variables
give labels for the different possible codes. The first record in
the dictionary, the dictionary-descriptor record, identifies the dictionary
type, gives the first and last variable numbers used in the dictionary
and specifies the number of data records making up a "case". The
original dictionary is prepared by the user to describe the raw data.
IDAMS programs which output datasets always produce new dictionaries
reflecting the new format of the data. Dictionary records have a
fixed format and are 80 characters long. A detailed description
of each type of dictionary record is given below.
Dictionary-descriptor record. This is always the first record in the dictionary.
Note 1: When record and column numbers are used to indicate variable
location, listings of the dictionary records do not show the record
and column numbers as they appear on the dictionary record. Rather,
the variable location is translated to and printed in the starting
location/width format. For example, for a variable in columns 22-24
of the third record of a multiple record (record length 80) per case
data file, the starting location will be 182 (2 * 80 + 22) and the
width 3.
Note 2: If there is more than one record per case and
the record length is not 80, then starting location and field width
notation must be used on the T-records. The starting location is counted
from the start of the first record. For example, for records of length
121, the starting location of a field at position 11 of the 2nd record
for a case would be 132.
Code-label records (C-records).
The dictionary may optionally contain these records for any of the
variables. They follow immediately after the T-record for the variable
to which they apply and provide codes and their labels for different
possible values of the variable. They are used by programs such as
TABLES to print row and column labels along with the corresponding
codes. They can also be used as the specification of valid codes for
a variable during data entry with the WinIDAMS User Interface and
for data validation with the program CHECK.
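The starting-location arithmetic used in Notes 1 and 2 above can be written as a one-line formula. The sketch below uses a hypothetical helper name, `starting_location`, and reproduces both worked examples from the notes.

```python
def starting_location(record_no, column, record_length):
    """Position of a field counted from the start of the first record of a case.

    record_no and column are 1-based, as in the notes above.
    """
    if record_no < 1 or not 1 <= column <= record_length:
        raise ValueError("record_no and column are 1-based; column must fit in the record")
    return (record_no - 1) * record_length + column


print(starting_location(3, 22, 80))   # 182, i.e. 2 * 80 + 22
print(starting_location(2, 11, 121))  # 132, i.e. 1 * 121 + 11
```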
Locations of variables are expressed in terms of starting
position and field width (1 in column 20 of dictionary-descriptor)
and there is one record per case (1 in column 16). There is one implied
decimal place in the grade average variable (V12). The age variable
has a code 99 for missing data. For the grade average, 0's imply missing
data as do all values greater than or equal to 90.0. The name of each
respondent (V20) is recorded as a 30 character alphabetic (type 1)
variable. Note that variable numbers need not be contiguous and that
not all fields in the data need to be described.
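To make the example concrete, the following Python sketch extracts the six described fields from one data record using the starting positions and widths listed in the example dictionary. It is only an illustration of how the location information is interpreted: the names `FIELDS` and `parse_record` are hypothetical, the sample record is invented, and missing data codes and bad data are deliberately ignored.

```python
# (variable number, start, width, decimals, alphabetic) from the example dictionary
FIELDS = [
    (1, 1, 5, 0, False),    # Identification
    (2, 6, 2, 0, False),    # Age
    (3, 8, 1, 0, False),    # Sex
    (11, 16, 1, 0, False),  # Region
    (12, 17, 3, 1, False),  # Grade average, one implied decimal place
    (20, 31, 30, 0, True),  # Name
]


def parse_record(record):
    """Return {"V<n>": value} for one data record (one record per case)."""
    values = {}
    for number, start, width, ndec, alphabetic in FIELDS:
        raw = record[start - 1:start - 1 + width]  # start is 1-based
        if alphabetic:
            values[f"V{number}"] = raw.rstrip()
            continue
        text = raw.strip()
        if "." in text:
            values[f"V{number}"] = float(text)            # explicit decimal point
        elif ndec:
            values[f"V{number}"] = int(text) / 10 ** ndec  # implied decimal point
        else:
            values[f"V{number}"] = int(text)
    return values


# Invented record: columns 9-15 and 20-30 are not described by the dictionary.
record = "01300" + "31" + "1" + " " * 7 + "3" + "125" + " " * 11 + "Jane Doe".ljust(30)
print(parse_record(record))
```

The undescribed columns are simply skipped, which reflects the statement that not all fields in the data need to be described.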
There are two types of IDAMS
matrices: square and rectangular. Both types are self-described, but
unlike the IDAMS dataset, the "dictionary" is stored in the same file
as the array of values. In general, these matrices are created by
one IDAMS program to be used as input to another program and the user
need not be familiar with the format. If, however, it is necessary
to prepare a similarity matrix, a configuration matrix, etc. by hand,
then the formats described below must be observed. Regardless of
type, all records are fixed length 80-character records.
The square matrix
can be used only for a square and symmetric array. Only the values
in the upper-right triangular, off-diagonal portion of the array are
actually stored in the square matrix. An array of Pearsonian correlation
coefficients is suitably stored in this way.
Programs which input/output square matrices. PEARSON outputs square matrices of correlations
and covariances; REGRESSN outputs a square matrix of correlations; TABLES
outputs square matrices of bivariate measures of association. These
matrices are appropriate input to other programs, e.g. the correlation
matrix output from PEARSON can be input to REGRESSN and to CLUSFIND.
Moreover, CLUSFIND and MDSCAL accept a square matrix of similarities
or dissimilarities as input.
Example.
The above four sections of the matrix are referred to as the matrix
"dictionary". Following the matrix dictionary is the array of values.
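As a rough illustration of the storage rule for square matrices (only the upper-right, off-diagonal triangle is kept), the following Python sketch reduces a full symmetric array to the rows that would be stored. The function name `stored_rows` is hypothetical, and the sketch does not write the values in the record format given by the #F statements.

```python
def stored_rows(full):
    """Return the upper-right, off-diagonal part of a square array, row by row.

    The last row contributes nothing, so n rows give n - 1 stored rows
    holding n * (n - 1) / 2 values in total.
    """
    n = len(full)
    if any(len(row) != n for row in full):
        raise ValueError("array must be square")
    return [full[i][i + 1:] for i in range(n - 1)]


# Correlations for variables 1, 3, 9 and 10 from the 4-variable example.
full = [
    [1.000, -0.011, -0.174, -0.033],
    [-0.011, 1.000, 0.131, -0.105],
    [-0.174, 0.131, 1.000, -0.133],
    [-0.033, -0.105, -0.133, 1.000],
]

for row in stored_rows(full):
    print(row)
```

For the 4-variable example this yields three rows containing 3, 2 and 1 values respectively.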
In the example
of the 4-variable matrix above, the full array (before entering in
the square format) would be as follows:
The rectangular
matrix differs from the square matrix in that the array of values
may be square (and non-symmetric) or rectangular. Further, since the
rows of some arrays are not indexed by variables, e.g. a frequency
table, the rectangular matrix may or may not contain variable identification
records; the rectangular matrix does not contain variable means and
standard deviations.
Programs which input/output rectangular
matrices. These matrices are created by the CONFIG, MDSCAL, TABLES
and TYPOL programs. They are appropriate input for CONFIG, MDSCAL
and TYPOL.
Example.
The above three sections of the matrix are referred to as the
matrix "dictionary". Following the matrix dictionary is the array
of values.
Any data in the form of fixed
format records in character (ASCII) mode can be input directly to
IDAMS programs. Nearly all data base and statistical packages have
an "export" or "convert" function to produce fixed format character
mode data files. An IDAMS dictionary must be prepared to describe
the fields required from the data. Free format data files with
Tab, comma or semicolon used as separator can be imported directly
through the WinIDAMS User Interface. See the "User Interface" chapter
for details. Free format (any character being used as delimiter
including blank) and DIF format text files can also be imported using
the IMPEX program. Data stored in a CDS/ISIS data base can be
imported to IDAMS using the WinIDIS program.
The IMPEX program can be used
to import free format matrices. Furthermore, matrices produced outside
IDAMS, for example a matrix provided in a publication, may also be
entered according to the format given above.
2.1  The IDAMS Dataset
2.1.1  General Description
2.1.2  Method of Storage and Access
2.2  Data Files
2.2.1  The Data Array
                          Variables
Cases     identification   education   sex   age   ...
_________________________________________________________________
case 1         1300            6        2     31   ...
case 2         1301            2        1     25   ...
  .            1302            3        1     55   ...
  .              .             .        .      .   ...
In the example, each row represents a respondent in a
survey and each column represents an item from the questionnaire.
2.2.2  Characteristics of the Data File
2.2.3  Hierarchical Files
2.2.4  Variables
Alphabetic variables. Alphabetic variables can be held in
Data files and can be up to 255 characters long. They can be used
in data management programs. 1-4 character alphabetic variables can
also be used in filters. In order to be used in analysis, 1-4 character
alphabetic variables must be recoded to numeric values. This can be
done with Recode's BRAC function.
2.2.5  Missing Data Codes
2.2.6  Non-numeric or Blank Values in Numeric Variables - Bad Data
2.2.7  Editing Rules for Variables Output by IDAMS Programs
Alphabetic variable values are not edited and are the same
on input and output.
2.3  The IDAMS Dictionary
2.3.1  General Description
Variable-descriptor records (T-records). The dictionary contains
one such record for each variable. These records are arranged in ascending
order by the variable number. The variable numbers need not be contiguous.
The maximum number of variables is 1000.
Blank
implies no decimal places.
Right justified.
Right justified.
2.3.2  Example of a Dictionary
Columns: 1 2 3 4 5 6...
123456789012345678901234567890123456789012345678901234567890...
3 1 20 1 1
T 1 Identification 1 5
T 2 Age 6 2 99
T 3 Sex 8 1
C 3 1 Female
C 3 2 Male
T 11 Region 16 1
C 11 1 North
C 11 2 South
C 11 3 East
C 11 4 West
T 12 Grade average 17 31 000 900
T 20 Name 31 30 1
This is a dictionary describing 6 data fields in a data
record as shown diagrammatically below.
Columns     1-5    6-7    8      16       17-19    31-60
Variable    V1     V2     V3     V11      V12      V20
Content     ID     Age    Sex    Region   Grade    Name
2.4  IDAMS Matrices
2.4.1  The IDAMS Square Matrix
Columns: 111111111122222222223...
123456789012345678901234567890...
Matrix descriptor 2 4
Format statements | #F (12F6.3)
| #F (6E12.5)
Variable identifi- | #T 1 AGE
cations | #T 3 EDUCATION
| #T 9 RELIGION
| #T 10 SEX
Array of values | -.011 -.174 -.033
| .131 -.105
| -.133
Means & standard | 0.33350E 01 0.54950E 01 0.50251E 01 0.40960E 01
deviations | 0.20010E 01 0.19856E 01 0.15000E 01 0.12345E 01
Format. The square matrix contains the following:
vars      1      3      9     10
   1  1.000  -.011  -.174  -.033
   3  -.011  1.000   .131  -.105
   9  -.174   .131  1.000  -.133
  10  -.033  -.105  -.133  1.000
The portion of the array that is stored is:
vars      1      3      9     10
   1         -.011  -.174  -.033
   3                 .131  -.105
   9                        -.133
  10
Each row of this reduced array begins a new record and
is written according to the format specification in the matrix dictionary
(see above).
2.4.2  The IDAMS Rectangular Matrix
Columns: 111111111122222222223...
123456789012345678901234567890...
Matrix descriptor 3 4 3
Format statement #F (16F5.0)
Variable identifications | #T 2 IQ
| #T 5 EDUCATION
| #T 8 MOBILITY
| #T 12 SIBLING RIVALRY
Array of values | 59 20 10
| 37 15 2
| 50 40 7
| 8 26 31
Format. The rectangular matrix contains the following:
The code values longer than 4 characters are replaced by ****.
2.5  Use of Data from Other Packages
2.5.1  Raw Data
2.5.2  Matrices