AGGREG aggregates
individual records (data cases) into groups defined by the user and
computes summary descriptive statistics on specified variables for
each group. The statistics include sums, means, variances, standard
deviations, as well as minimum and maximum values and the counts of
non-missing data values. An output IDAMS dataset is created, i.e.
the grouped (aggregated) data file described by an IDAMS dictionary;
the aggregated data file contains one record (case) per group with
variables that are the summary to the group level of each of the
selected input variables. Formulas for calculating mean, variance
and standard deviation can be found in Part "Statistical Formulas
and Bibliographic References", chapter "Univariate and Bivariate Tables".
However, they need to be adjusted since cases are not weighted and
the coefficient N/(N-1) is not used in computation of sample variance
and/or standard deviation. Note that the summary statistics are selected
for the entire set of aggregate variables. Thus, if there were 2 aggregate
variables and if 3 statistics were selected, there would be 6 computed
variables. AGGREG enables the user to change the level of aggregation
of data e.g. from individual family members to household, or from
district to regional level, etc. For example, suppose a data file
contains records on every individual in a household and that we wish
to analyze these data at the household level. AGGREG would permit
us to aggregate values of variables across all the individual records
for each household to create a file of household level records for
further analysis. If, to be more specific, the individual level data
file contained a variable giving the persons income, AGGREG could
create household level records with a variable on the total household
income. Grouping the data. The user specifies up to 20 group
definition (ID) variables which determine the level of aggregation
for the output file. For example, if one wanted to aggregate individual
level data to the household level, a variable identifying the household
would be the group definition variable. Each time AGGREG reads an
input record, it checks for a change in any of the ID variables. When
this is encountered, a record is output containing the summary statistics
on the specified aggregate variables for the group of records just
processed. Inserting constants into the group records.
Constants can be inserted into each group record using the parameters
PAD1, ... , PAD5, which specify so called pad variables. The value
of a pad variable is a constant. Transferring variables.
Variables can be transferred to the output group records. Note that
only the values of the first case in the group are transferred.
Case and variable
selection. The standard filter is available to select a subset
of the cases from the input data. ID variables defining the groups
and the variables to be aggregated are specified with the parameters.
The ID variables are automatically included in the output group dataset.
Transforming data. Recode statements may be used. Treatment
of missing data. Each aggregate variable value is compared to
both missing data codes and if found to be a missing data value, is
automatically excluded from any calculation. A user-supplied percentage,
the "cutoff point" (see the parameter CUTOFF) determines the number
of missing data values allowed before the summarization value is output
as a missing data code. Thus, for example, suppose the mean value
of an aggregate variable within a group was to be computed, and the
group contained 12 records and 6 of them had missing data values,
i.e. 50%. If the CUTOFF value was 75%, the mean of the 6 non-missing
values would be calculated and output for that group. If the CUTOFF
value was 25%, however, the mean would not be calculated and the first
missing data code would be output.
Missing data summary.
(Optional: see the parameter PRINT). For each variable in each group,
the input variable number, the output variable number, the number
of records with substantive data (i.e. non-missing data) and the percentage
of records with missing data are printed. Group summary.
(Optional: see the parameter PRINT). The number of input records for
each group. Input dictionary. (Optional: see the parameter
PRINT). Variable descriptor records, and C-records if any, only for
variables used in the execution. Output dictionary. (Optional:
see the parameter PRINT). Statistics. (Optional: see the
parameter PRINT). All of the computed variables can be printed for
each aggregate record. The variable number of the corresponding aggregate
variable and the ID variables are also given.
The grouped output dataset
is a Data file, described by an IDAMS dictionary. Each record contains
values of the ID variables, computed variables, transferred variables
and pad constants; there is one record produced for each group. Variable
sequence and variable numbers. The output variables are in the
same relative order as the input variables from which they were derived,
regardless of whether the input variable is used as an ID, aggregate,
or variable to be transferred. Thus, if the first variable in the
input is used, the variable(s) derived from it will be the first output
variable(s). Each input variable used as an ID or variable to be transferred
corresponds to one output variable; each aggregate variable corresponds
to from 1 to 7 output variables, according to the number of summary
statistics requested (these variables are output in the relative order:
sum, mean, variance, standard deviation, count, minimum, maximum).
The output variables are always renumbered, starting with the number
supplied in the parameter VSTART. Pad constants always come last.
Variable names. The output variables have the same names
as input variables from which they were derived except that for the
aggregate variables, the 23rd and 24th characters of the name field
are coded:
10.1  General Description
10.2  Standard IDAMS Features
10.3  Results
10.4  Output Dataset
| S | = | sum |
| M | = | mean |
| V | = | variance |
| D | = | standard deviation |
| CT | = | count |
| MN | = | minimum |
| MX | = | maximum. |
Pad constants are given names "Pad variable 1", "Pad variable 2", etc.
Variable type. ID variables and transferred variables are output in their input type. Computed variables are always output as numeric.
Field width and number of decimals. Field widths for output aggregated variables depend on the statistic, the input field width (FW), the input number of decimal places (ND) and the extra decimal places requested by the user with the DEC parameter. Field widths and decimal places are assigned as shown below, where FW=input field width and ND=input number of decimal places for input variables, and FW=6 and ND=0 for recoded variables.
| Statistic | Field Width | Decimal Places |
| SUM | FW + 3 * | ND |
| MEAN | FW + DEC ** | ND + DEC *** |
| VARIANCE | FW + DEC ** | ND + DEC *** |
| SD | FW + DEC ** | ND + DEC *** |
| MIN | FW | ND |
| MAX | FW | ND |
| COUNT | 4 | 0 |
Missing data codes. Missing data codes for ID variables and transferred variables are taken from the input dictionary. The second missing data code (MD2) for the computed variables is always blank. The value of the first missing data code (MD1) is allocated as follows:
| Output variable | Output MD1 |
| Output FW <= 7 | 9's |
| Output FW > 7 | -999999 |
| COUNT variable | 9999 |
Reference numbers. Computed variables are given the reference number of their base variable.
C-records. C-records in the input dictionary are transferred to the output dictionary for ID and transfer variables.
A note on computation of the statistics.
Before output, computed values are rounded up to the calculated width
and number of decimal places. If the computed value exceeds 999999999
or is less than -99999999, it is output as 999999999.
The input is a Data file
described by an IDAMS dictionary. Group-definition (ID) variables
and variables to be transferred may be numeric or alphabetic, although
numeric variables are treated as strings of characters, i.e. a value
of '044' is different from ' 44'. They cannot be recoded variables.
Variables to be aggregated must be numeric and may be recoded variables.
The file is processed serially and contiguous records with the
same value on the ID variables are aggregated. Thus, the input file
should be sorted on the ID variables prior to using AGGREG. Note that
AGGREG does not check the input file sort order. 10.5  Input Dataset
$RUN AGGREG
$FILES
File specifications
$RECODE (optional)
Recode statements
$SETUP
1. Filter (optional)
2. Label
3. Parameters
$DICT (conditional)
Dictionary
$DATA (conditional)
Data
Files:
DICTxxxx input dictionary (omit if $DICT used)
DATAxxxx input data (omit if $DATA used)
DICTyyyy output dictionary
DATAyyyy output data
PRINT results (default IDAMS.LST)
|
Refer to "The IDAMS Setup File" chapter for further descriptions of the program control statements, items 1-3 below.
Example: INCLUDE V1=10,20,30,50 OR V10=90-300
Example: AGGREGATION TEACHER/STUDENT DATA
Example: IDVARS=(V1,V2) STATS=(SUM,VARI) DEC=3 -
AGGV=(V5-V10,V50-V75) PAD1=80
INFILE=IN /xxxx
BADDATA=STOP /SKIP/MD1/MD2
MAXCASES=n
IDVARS=(variable list)
AGGV=(variable list)
STATS=(SUM, MEAN, VARIANCE, SD, COUNT, MIN, MAX)
SAMPLE /POPULATION
OUTFILE=OUT /yyyy
VSTART=1 /n
CUTOFF=100 /n
DEC=2 /n
TRANSVARS=(variable list)
PAD1=constant
PAD2=constant
PAD3=constant
PAD4=constant
PAD5=constant
PRINT=(MDTABLES, GROUPS, DATA, CDICT/DICT, OUTDICT /OUTCDICT/NOOUTDICT)
Output a dataset containing one aggregate case for each unique value of V5 and V7; the variables in each case are to be the sum, mean and standard deviation of 4 input variables and 1 recoded variable, aggregated over the cases forming the group (i.e. with the same values for V5, V7); values of V10, V11 for the first case of each group are to be transferred to the output records; a listing of the values output for each case is requested; in the output file, variables are to be numbered starting from 1001.
$RUN AGGREG
$FILES
PRINT = AGGR.LST
DICTIN = IND.DIC input Dictionary file
DATAIN = IND.DAT input Data file
DICTOUT = AGGR.DIC output Dictionary file
DATAOUT = AGGR.DAT output Data file
$RECODE
R100=COUNT(1,V20-V29)
NAME R100'WEALTH INDEX'
$SETUP
AGGREGATION OF 4 INPUT VARIABLES AND 1 RECODED VARIABLE
IDVARS=(V5,V7) AGGV=(V31,V41-V43,R100) STATS=(SUM, MEAN, SD) -
VSTART=1001 PRINT=DATA TRANS=(V10,V11)