Aggregating Data (AGGREG)

10    Aggregating Data (AGGREG)


10.1  General Description

AGGREG aggregates individual records (data cases) into groups defined by the user and computes summary descriptive statistics on specified variables for each group. The statistics include sums, means, variances, standard deviations, as well as minimum and maximum values and the counts of non-missing data values. An output IDAMS dataset is created, i.e. the grouped (aggregated) data file described by an IDAMS dictionary; the aggregated data file contains one record (case) per group with variables that are the summary to the group level of each of the selected input variables.

Formulas for calculating mean, variance and standard deviation can be found in Part "Statistical Formulas and Bibliographic References", chapter "Univariate and Bivariate Tables". However, they need to be adjusted since cases are not weighted and the coefficient N/(N-1) is not used in computation of sample variance and/or standard deviation. Note that the summary statistics are selected for the entire set of aggregate variables. Thus, if there were 2 aggregate variables and if 3 statistics were selected, there would be 6 computed variables.

AGGREG enables the user to change the level of aggregation of data e.g. from individual family members to household, or from district to regional level, etc. For example, suppose a data file contains records on every individual in a household and that we wish to analyze these data at the household level. AGGREG would permit us to aggregate values of variables across all the individual records for each household to create a file of household level records for further analysis. If, to be more specific, the individual level data file contained a variable giving the persons income, AGGREG could create household level records with a variable on the total household income.

Grouping the data. The user specifies up to 20 group definition (ID) variables which determine the level of aggregation for the output file. For example, if one wanted to aggregate individual level data to the household level, a variable identifying the household would be the group definition variable. Each time AGGREG reads an input record, it checks for a change in any of the ID variables. When this is encountered, a record is output containing the summary statistics on the specified aggregate variables for the group of records just processed.

Inserting constants into the group records. Constants can be inserted into each group record using the parameters PAD1, ... , PAD5, which specify so called pad variables. The value of a pad variable is a constant.

Transferring variables. Variables can be transferred to the output group records. Note that only the values of the first case in the group are transferred.


10.2  Standard IDAMS Features

Case and variable selection. The standard filter is available to select a subset of the cases from the input data. ID variables defining the groups and the variables to be aggregated are specified with the parameters. The ID variables are automatically included in the output group dataset.

Transforming data. Recode statements may be used.

Treatment of missing data. Each aggregate variable value is compared to both missing data codes and if found to be a missing data value, is automatically excluded from any calculation. A user-supplied percentage, the "cutoff point" (see the parameter CUTOFF) determines the number of missing data values allowed before the summarization value is output as a missing data code. Thus, for example, suppose the mean value of an aggregate variable within a group was to be computed, and the group contained 12 records and 6 of them had missing data values, i.e. 50%. If the CUTOFF value was 75%, the mean of the 6 non-missing values would be calculated and output for that group. If the CUTOFF value was 25%, however, the mean would not be calculated and the first missing data code would be output.

10.3  Results

Missing data summary. (Optional: see the parameter PRINT). For each variable in each group, the input variable number, the output variable number, the number of records with substantive data (i.e. non-missing data) and the percentage of records with missing data are printed.

Group summary. (Optional: see the parameter PRINT). The number of input records for each group.

Input dictionary. (Optional: see the parameter PRINT). Variable descriptor records, and C-records if any, only for variables used in the execution.

Output dictionary. (Optional: see the parameter PRINT).

Statistics. (Optional: see the parameter PRINT). All of the computed variables can be printed for each aggregate record. The variable number of the corresponding aggregate variable and the ID variables are also given.

10.4  Output Dataset

The grouped output dataset is a Data file, described by an IDAMS dictionary. Each record contains values of the ID variables, computed variables, transferred variables and pad constants; there is one record produced for each group.

Variable sequence and variable numbers. The output variables are in the same relative order as the input variables from which they were derived, regardless of whether the input variable is used as an ID, aggregate, or variable to be transferred. Thus, if the first variable in the input is used, the variable(s) derived from it will be the first output variable(s). Each input variable used as an ID or variable to be transferred corresponds to one output variable; each aggregate variable corresponds to from 1 to 7 output variables, according to the number of summary statistics requested (these variables are output in the relative order: sum, mean, variance, standard deviation, count, minimum, maximum). The output variables are always renumbered, starting with the number supplied in the parameter VSTART. Pad constants always come last.

Variable names. The output variables have the same names as input variables from which they were derived except that for the aggregate variables, the 23rd and 24th characters of the name field are coded:

S = sum
M = mean
V = variance
D = standard deviation
CT = count
MN = minimum
MX = maximum.

Pad constants are given names "Pad variable 1", "Pad variable 2", etc.

Variable type. ID variables and transferred variables are output in their input type. Computed variables are always output as numeric.

Field width and number of decimals. Field widths for output aggregated variables depend on the statistic, the input field width (FW), the input number of decimal places (ND) and the extra decimal places requested by the user with the DEC parameter. Field widths and decimal places are assigned as shown below, where FW=input field width and ND=input number of decimal places for input variables, and FW=6 and ND=0 for recoded variables.

Statistic Field Width Decimal Places
SUM FW + 3 * ND
MEAN FW + DEC ** ND + DEC ***
VARIANCE FW + DEC ** ND + DEC ***
SD FW + DEC ** ND + DEC ***
MIN FW ND
MAX FW ND
COUNT 4 0

*  
If the field width exceeds 9, then it is reduced to 9.
**  
If the field width exceeds 9, then the number of extra decimals (DEC) is reduced accordingly.
***  
If the number of decimals exceeds 9, then DEC is reduced accordingly.

Missing data codes. Missing data codes for ID variables and transferred variables are taken from the input dictionary. The second missing data code (MD2) for the computed variables is always blank. The value of the first missing data code (MD1) is allocated as follows:

Output variable Output MD1
Output FW <= 7 9's
Output FW > 7 -999999
COUNT variable 9999

Reference numbers. Computed variables are given the reference number of their base variable.

C-records. C-records in the input dictionary are transferred to the output dictionary for ID and transfer variables.

A note on computation of the statistics. Before output, computed values are rounded up to the calculated width and number of decimal places. If the computed value exceeds 999999999 or is less than -99999999, it is output as 999999999.

10.5  Input Dataset

The input is a Data file described by an IDAMS dictionary. Group-definition (ID) variables and variables to be transferred may be numeric or alphabetic, although numeric variables are treated as strings of characters, i.e. a value of '044' is different from ' 44'. They cannot be recoded variables. Variables to be aggregated must be numeric and may be recoded variables.

The file is processed serially and contiguous records with the same value on the ID variables are aggregated. Thus, the input file should be sorted on the ID variables prior to using AGGREG. Note that AGGREG does not check the input file sort order.


10.6  Setup Structure




     $RUN AGGREG

     $FILES
          File specifications

     $RECODE (optional)
          Recode statements

     $SETUP
          1. Filter (optional)
          2. Label
          3. Parameters

     $DICT (conditional)
          Dictionary

     $DATA (conditional)
          Data


     Files:
     DICTxxxx   input dictionary (omit if $DICT used)
     DATAxxxx   input data (omit if $DATA used)
     DICTyyyy   output dictionary
     DATAyyyy   output data
     PRINT      results (default IDAMS.LST)


10.7  Program Control Statements

Refer to "The IDAMS Setup File" chapter for further descriptions of the program control statements, items 1-3 below.

  1. Filter (optional). Selects a subset of cases to be used in the execution.
    
         Example:  INCLUDE V1=10,20,30,50 OR V10=90-300
    
  2. Label (mandatory). One line containing up to 80 characters to label the results.
    
         Example:  AGGREGATION TEACHER/STUDENT DATA
    
  3. Parameters (mandatory). For selecting program options.
    
         Example:  IDVARS=(V1,V2)  STATS=(SUM,VARI)  DEC=3 -
                      AGGV=(V5-V10,V50-V75)  PAD1=80
    
    INFILE=IN /xxxx
    A 1-4 character ddname suffix for the input Dictionary and Data files.
    Default ddnames: DICTIN, DATAIN.

    BADDATA=STOP /SKIP/MD1/MD2

    Treatment of non-numeric data values in aggregates variables and in variables used in Recode. See "The IDAMS Setup File" chapter.

    MAXCASES=n

    The maximum number of cases (after filtering) to be used from the input file.
    Default: All cases will be used.

    IDVARS=(variable list)

    Up to 20 variable numbers to define the groups. R-variables are not allowed.
    No default.

    AGGV=(variable list)

    V- or R-variables to be aggregated.
    No default.

    STATS=(SUM, MEAN, VARIANCE, SD, COUNT, MIN, MAX)

    Parameters for selecting required statistics (at least one of: SUM, MEAN, VARIANCE, SD must be selected). They are output for each group and for each AGGV variable.
    SUM 
    Sum.
    MEAN 
    Mean.
    VARI 
    Variance.
    SD 
    Standard deviation.
    COUN 
    Number of valid cases.
    MIN 
    Minimum value.
    MAX 
    Maximum value.

    SAMPLE /POPULATION

    SAMP 
    Compute the variance and/or standard deviation using the sample equation.
    POPU 
    Use the population equation.

    OUTFILE=OUT /yyyy

    A 1-4 character ddname suffix for the output Dictionary and Data files.
    Default ddnames: DICTOUT, DATAOUT.

    VSTART=1 /n

    Variable number for the first variable in the output dataset.

    CUTOFF=100 /n

    The percentage of cases with MD codes allowed before a MD code is output. An integer value.

    DEC=2 /n

    For computed variables involving mean, variance or standard deviation: the number of decimal places in addition to those of the corresponding input variables (see Restriction 7).

    TRANSVARS=(variable list)

    Variables whose values, as given for the first case of each group, are to be transferred to the output file. R-variables are not allowed.

    PAD1=constant
    PAD2=constant
    PAD3=constant
    PAD4=constant
    PAD5=constant

    Up to 5 constants can be added to the output dataset. The number of characters given determines the field width of the constant.

    PRINT=(MDTABLES, GROUPS, DATA, CDICT/DICT, OUTDICT /OUTCDICT/NOOUTDICT)

    MDTA 
    Print a table giving the percentage of missing data found for each aggregate variable in each group.
    GROU 
    Print the number of cases per group.
    DATA 
    Print values for each computed variable in each group record.
    CDIC 
    Print the input dictionary for the variables accessed with C-records if any.
    DICT 
    Print the input dictionary without C-records.
    OUTD 
    Print the output dictionary without C-records.
    OUTC 
    Print the output dictionary with C-records of ID and transfer variables if any.
    NOOU 
    Do not print the output dictionary.

10.8  Restrictions

  1. Maximum number of variables to be aggregated is 400.
  2. Maximum number of ID variables is 20.
  3. Maximum number of characters in ID variables is 180.
  4. Maximum number of variables to be transferred is 100.
  5. Recoded variables are not allowed as IDVARS or as TRANSVARS.
  6. Same variable cannot appear in two variable lists.

10.9  Example

Output a dataset containing one aggregate case for each unique value of V5 and V7; the variables in each case are to be the sum, mean and standard deviation of 4 input variables and 1 recoded variable, aggregated over the cases forming the group (i.e. with the same values for V5, V7); values of V10, V11 for the first case of each group are to be transferred to the output records; a listing of the values output for each case is requested; in the output file, variables are to be numbered starting from 1001.


      $RUN AGGREG
      $FILES
      PRINT   = AGGR.LST
      DICTIN  = IND.DIC                       input Dictionary file
      DATAIN  = IND.DAT                       input Data file
      DICTOUT = AGGR.DIC                      output Dictionary file
      DATAOUT = AGGR.DAT                      output Data file
      $RECODE
          R100=COUNT(1,V20-V29)
          NAME R100'WEALTH INDEX'
      $SETUP
      AGGREGATION OF 4 INPUT VARIABLES AND 1 RECODED VARIABLE
      IDVARS=(V5,V7)  AGGV=(V31,V41-V43,R100)  STATS=(SUM, MEAN, SD)   -
         VSTART=1001  PRINT=DATA  TRANS=(V10,V11)