Data Management and Analysis

5    Data Management and Analysis


5.1  Data Validation with IDAMS


5.1.1  Overview

Before starting analysis of data with whatever software, data normally need to be validated. Such validation typically comprises three stages:

  1. Checking data completeness, i.e. verifying that all cases expected are present in the data file and that the correct records exist for each case if there are multiple records per case.
  2. Checking that numeric variables have only numeric values and checking that values are valid.
  3. Consistency checking between variables.
Like much other statistical software, IDAMS requires that there must be the same amount of data for each case. If the data for one case spans several records, then each case must comprise exactly the same set of records. If certain variables are not applicable to some cases, then "missing" values must none-the-less be assigned. Record merge checking capabilities in IDAMS allow for checking that each case of data has the correct set of records. This is performed by the program MERCHECK which produces a "rectangular" output file where extra/duplicate records have been deleted and cases with missing records have either been dropped or else padded with dummy records.

Checking for non-numeric values in numeric variables and the optional conversion of blank fields to user specified numeric values is performed by the BUILD program. Checking for other invalid codes is performed by the program CHECK where what are valid codes are defined on special control statements or taken from C-records in the dictionary describing the data.

If data are entered using the WinIDAMS User Interface, non-numeric characters (except empty fields) in numeric fields are not allowed. Moreover, there is a possibility of code checking during data entry and of an overall check for invalid codes in the whole data file. C-records in the dictionary are used for this purpose.

Consistency checks can be expressed in the IDAMS Recoding language and used with the CONCHECK program to list cases with inconsistencies.

Errors found in any of these steps can be corrected directly through the User Interface or by using the IDAMS program CORRECT. A typical sequence of steps for data error detection and correction with IDAMS is described in more detail below.

5.1.2  Checking Data Completeness

Step 1 
Produce summary tables showing the distribution of cases amongst sampling units, geographical areas, etc. for checking against expected totals. This is particularly useful in a sample survey. For example, suppose a survey of households is done. A sample is taken by first selecting primary sampling units (PSU) then up to 5 areas within each PSU and then interviewing households in those areas. The distribution of households by PSU and area in the data can be produced by preparing a small dictionary containing just the 2 variables: PSU and area. The table would look something like this:

                              V2 AREA
   
                       01   02   03   04   05
 
              01        3    6    2
     V1  PSU  02       10    4    2    8    5
              03
               .
               .
This table could be compared with the interviewers' log-book to check whether the data for all interviews taken exist in the file.

Steps 2, 3 and 4 are necessary only when cases are composed of more than one record.

Step 2 
The original "raw" data records are sorted into case identification/record identification order using the SORMER program.

Step 3 
The sorted raw data are checked with MERCHECK to see if they have the correct set of records for each case. The output file contains only "good" cases, i.e. ones with the correct records. Extra records and duplicate records are dropped. Cases with missing records are either dropped or padded. All cases with merge errors are listed.

Step 4 
Corrections are now made for the errors detected by MERCHECK. These can be done in a variety of ways:
  • Re-enter "bad" cases and merge them with the output file of MERCHECK using SORMER.
  • Correct the original raw data with an editor and re-do steps 2 and 3.
  • Re-enter "bad" cases, perform steps 2 and 3 on these and then merge the output from this execution of step 3 with the original output from step 3.
Whichever method is selected, MERCHECK should be re-executed on the corrected file to make sure all errors have been dealt with.

5.1.3  Checking for Non-numeric and Invalid Variable Values

Step 5 
Prepare a dictionary for all variables with appropriate instructions for dealing with blank fields. Execute BUILD. An IDAMS dataset is output (Data file and Dictionary file). All unexpected non-numeric values are converted to 9's and reported in the results.

Step 6 
Using TABLES, print frequency distributions of all qualitative variables and minimum, maximum and mean values for quantitative variables. This gives an initial idea of the content of the data and shows which variables have invalid codes (qualitative variables) or too large/small values (quantitative variables). It also can be compared later with a similar distributions and values obtained after cleaning to see how data validation has affected the data.

Step 7 
Prepare control statements specifying the valid codes or range of values for each variable. These can be prepared ahead of time for all variables or alternatively, after step 6 for only those variables which are known to have invalid codes. Use the output dataset from step 5 as input to the CHECK program to get a list of cases with invalid values. Note that the specification of valid codes for variables can also be taken from C-records in the dictionary if these were introduced in step 5.

Step 8 
Prepare corrections for errors detected at step 5 and step 7. Use the CORRECT program to update the IDAMS dataset created in step 5.
Note that corrections could also be done with the WinIDAMS User Interface if the number of cases is not too large. However using CORRECT is a less error prone method.

Perform steps 7 and 8 until no errors are reported.

5.1.4  Consistency Checking

Step 9 
Prepare logical statements of the consistency checks to be performed, e.g.
PREGNANT (V32) = inapplicable if and only if SEX (V6) = Male.

Assign a "result" number to each consistency check and translate the logic into Recode statements where the result is set to 1 for an inconsistency, e.g.


          IF V6 EQ 1 AND V32 NE 9 THEN R1001=1
          IF V6 NE 1 AND V32 EQ 9 THEN R1001=1 ELSE R1001=0
Use the set of Recode statements with CONCHECK to print cases with errors.

Step 10 
Correct cases with errors as in step 8.
Perform steps 9 and 10 until no errors are reported. The data output from the final execution of CORRECT will be ready for analysis.

5.2  Data Management/Transformation

IDAMS contains an extensive set of facilities for generating indices, derived measures, aggregations, and other transformations of the data, including alphabetic recoding. The most frequently used capabilities are provided by the Recode facility, which can perform temporary operations in all analysis programs that input an IDAMS dataset. Results of recoding can be saved as permanent variables using the TRANS program. These facilities operate on variables within one case and permit recoding of the values of one or more variables, generation of variables by combinations of variables, control of the sequence of these operations through tests of logical expressions, and a number of specialized statements and functions. The necessary new dictionary information to describe the results of the operations performed is automatically produced.

For aggregation across cases, the AGGREG program is available. AGGREG provides arithmetic sums and related measures, ranges, and counts of valid data values within groups of cases. Typical use of AGGREG involves the prior use of the SORMER program to order the Data file into the desired groups.

There are a number of circumstances in which it is necessary to combine the records from two different files, for example, data collected at different points in time. As values for variables for each new wave are received, the objective is to add them to the record containing all the previous data for the same respondent or case. The MERGE program will accomplish this, including appropriate padding with missing data where respondents are not found in the new wave. Similar examples occur when residuals or some form of scale scores are generated for each case by an analysis program and need to be included with the original data.

A somewhat different combination process occurs when data from different levels of analysis are to be combined. One illustration of this is the addition of household data to individual respondent's records. When a dataset is ordered such that all respondents in the same household are together, MERGE will provide the necessary duplicate record merge. A similar situation occurs when group summaries from AGGREG are to be added to the records for each case in each respective group.

Another dataset combination process, often also termed a merge, occurs when additional cases are to be added to a dataset. The new records must be described by the same dictionary as the original data. This type of merge may be achieved with the SORMER program.

Sub-setting functions are available as temporary operations in most IDAMS programs (by using a "filter") to select particular cases for processing. Permanent files containing subsets of IDAMS datasets (a subset of variables or a subset of cases, or both) may also be created. The SUBSET and TRANS programs are most likely to be used for such tasks, although several other programs that output datasets, such as MERGE, may also be used. Selection of cases may be done on the basis that only certain cases are logically of interest (such as only the female respondents), or it may be done on a random basis using the Recode function RAND with the TRANS program.

A display of the actual values stored in an IDAMS dataset is often of substantial help for checking the results from data modification steps and indeed at any other stages. The LIST program is available for this purpose, and allows complete listings of a selection of specific cases and variables. The selection or filtering of cases for display may be done using combinations of several variables in logical expressions; an example would be a selection of only records for unmarried women between 21 and 25 years of age. Numeric and alphabetic variables from a dataset as well as variables constructed with Recode statements can be listed. The User Interface also has an option to print the data in a table format.

5.3  Data Analysis

The paramount consideration for the user in selecting analysis programs is whether the appropriate statistical functions are provided. Guidance on such matters is well beyond the scope of this manual. A summary of the functions of each IDAMS analysis program can be found in the Introduction. More details are given in the individual program write-ups. The formulas used for computing the statistics in each program, and references are given in relevant chapters of the part "Statistical Formulas and Bibliographic References".

5.4  Example of a Small Task to be Performed with IDAMS

Suppose that an IDAMS dataset contains responses to a survey questionnaire and includes the following variables:

V11 gives the sex of the respondent according to the following code:

    1. Male 2. Female 9. Not ascertained
V12 is the respondent's income in dollars (99999 = not ascertained).

V13 through V16 are attitudinal measures on different issues. The variables are each coded to reflect the feelings of the respondent as follows:

    1. Very positive 2. Positive 3. Neutral 4. Negative 5. Very negative 8. Don't know
    9. Not ascertained 0. The question is irrelevant for this respondent
Suppose that only a grouping or recoding of income levels is needed of the following kind:
New code Meaning
1 Income in the range $0 to $9999
2 Income in the range $10,000 to $29,999
3 Income $30,000 and over
9 Refused, Not ascertained, Don't know

Cross-tabulations are desired between the recoded version of the income variable, V12, and each of the attitudinal variables, V13 to V16. Only the female respondents are to be selected for this analysis.

An IDAMS "setup" containing the necessary control statements to perform this work is shown below. The numbers in parentheses on the left identify each control statement and link it to the subsequent explanation.


     (1)  $RUN TABLES
     (2)  $FILES
     (3)  DICTIN = ECON.DIC
     (4)  DATAIN = ECON.DAT
     (5)  $RECODE
     (6)      R101=BRAC(V12,0-9999=1,10000-29999=2,30000-99998=3, -
     (7)           ELSE=9)
     (8)      NAME R101 'GROUPED INCOME'
     (9)  $SETUP
    (10)  INCLUDE V11=2
    (11)  EXAMPLE OF TABLES USING ECONOMIC DATA
    (12)  *
    (13)  TABLES
    (14)  ROWVARS=(R101,V13-V16)
    (15)  ROWVAR=R101  COLVARS=(V13-V16) CELLS=(FREQS,ROWPCT)  STATS=CHI

Briefly, this is what each statement does:
(1) 
"$RUN TABLES" is an IDAMS command specifying that the TABLES program is to be executed.
(2) 
This statement signals the start of file definitions for the execution.
(3) &(4) 
The IDAMS dataset is stored in two separate files. One contains the dictionary, the other the data.
(5) 
This statement signals that transformations of the data are required. The statements following this are the specific commands to the Recode facility.
(6)(7) 
These two lines (an original and a continuation) form a statement to the Recode facility indicating the desired grouping for the income variable, V12, following the scheme outlined earlier. The result of the BRAC function is stored as result variable R101.
(8) 
This statement assigns name to the variable R101.
(9) 
"$SETUP" is a command which indicates the end of Recode statements and that the TABLES program control statements follow.
(10) 
This is a "filter" which states that the only data cases to be used are those where variable V11 has the code value 2, for females.
(11) 
This is a label, which contains the text to be used to title the results.
(12) 
This line specifies the main parameters. Since only the asterisk is given, all the default options for the parameters are chosen for the current execution.
(13) 
The word TABLES is supplied here to separate the preceding global information for the entire execution from the specifications for individual tables that follow.
(14) 
This statement requests univariate frequency distributions for 5 variables.
(15) 
Now bivariate (2-way) tables are requested. The cells are to contain the counts (frequencies) and row percentages; a Chi-square statistic will be printed for each table. The 2 lists of variables following the keywords ROWVAR and COLVARS specify the variables that will be used for the rows and columns of the tables respectively. Four tables will be produced: R101 (grouped income) by V13, V14, V15 and V16).