10.1.1 Regression Trees

Means Analysis

The objective of Means Analysis is to create groups that allow the best prediction of the dependent-variable values from the group means. The splitting criterion is therefore based on group means.

The splitting process involves the following steps:

  A. Choose the unsplit group L that has the largest sum of squares.

The total sum of squares for the parent group L is

$$ SS_L \;=\; \sum_{i \in L} \left( Y_i - \bar{Y}_L \right)^2 $$

where $\bar{Y}_L$ is the mean of the dependent variable in group L.

  B. For each predictor Pi, find the division of the classes in Pi that provides the largest reduction in the unexplained sum of squares. That is, split L (the "parent" group) into two disjoint subgroups (or "descendants") L1 and L2 so as to maximize the between sum of squares

$$ BSS_i \;=\; N_{L_1}\left( \bar{Y}_{L_1} - \bar{Y}_L \right)^2 \;+\; N_{L_2}\left( \bar{Y}_{L_2} - \bar{Y}_L \right)^2 $$

where NL1 + NL2 = NL, and NL1, NL2 ≥ NMIN; NMIN is a minimum group size requirement.

  C. Select the predictor Pj such that BSSj ≥ BSSi for all i ≠ j, and if BSSj ≥ Pe·SS1, split L into the two groups L1 and L2 defined for the predictor Pj. The parameter Pe is an eligibility criterion (see criterion 1 below), and SS1 is the total sum of squares of the whole sample. If BSSj < Pe·SS1, then L is deemed a final group and is not split further. (A sketch of this split search follows the list.)
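The split search in steps B and C can be made concrete with a short sketch. The following Python is a minimal illustration, not the original program: the function names (bss, best_split), the integer coding of predictor classes, and the ordering of classes by their group means (the standard device for handling nominal predictors) are all assumptions of this sketch.

```python
import numpy as np

def bss(y, mask):
    """Between sum of squares for splitting y into y[mask] and y[~mask]:
    BSS = N1*(ybar1 - ybar)^2 + N2*(ybar2 - ybar)^2."""
    ybar = y.mean()
    y1, y2 = y[mask], y[~mask]
    return len(y1) * (y1.mean() - ybar) ** 2 + len(y2) * (y2.mean() - ybar) ** 2

def best_split(y, X, n_min=25):
    """Steps B and C combined: for every predictor (a column of X holding
    integer-coded classes), examine each eligible binary partition of its
    classes and return the predictor, class subset, and BSS of the best one."""
    best_j, best_left, best_bss = None, None, -np.inf
    for j in range(X.shape[1]):
        # Order the classes by their group means; the BSS-maximizing binary
        # partition of a nominal predictor is a cut point in this ordering,
        # so only k-1 candidate splits per predictor need to be checked.
        classes = sorted(np.unique(X[:, j]), key=lambda c: y[X[:, j] == c].mean())
        for cut in range(1, len(classes)):
            left = classes[:cut]
            mask = np.isin(X[:, j], left)
            n1 = int(mask.sum())
            if n1 < n_min or len(y) - n1 < n_min:
                continue  # violates the minimum group size N_MIN
            b = bss(y, mask)
            if b > best_bss:
                best_j, best_left, best_bss = j, left, b
    return best_j, best_left, best_bss
```

On a parent group L one would call best_split with that group's data and compare the returned BSS against Pe·SS1 before accepting the split.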

Steps A-C are performed recursively. The process stops when one or more of the criteria below are met:

  1. The marginal (added) reduction in error variance: if the best eligible split of a group accounts for less than some pre-stated fraction of the original variance around the mean (say 0.6%), the group is not split further. This is the best criterion to use.
  2. If at a split one or both subgroups have fewer than some pre-stated number of cases (e.g., 25), then the mean would be unreliable. However, this is usually a dangerous rule, since (a) the least squares criterion is very sensitive to extreme cases, (b) cases in subgroups can appear extreme even if they don't in the full sample, and (c) if this criterion is not used, the program can alert the researcher to the presence and impact of extreme cases by isolating a group of one or two cases that account for a substantial fraction of the variance.
  3. The total number of splits has already reached some pre-stated maximum (e.g., 30). This is a useful secondary safeguard against inadvertently generating too many groups, e.g., by setting the first, main criterion too low.

Together, these criteria stop the process before the reductions in error variance become unreliable; a small sketch of these checks follows.
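As a rough illustration, the three rules can be expressed as a single check; the parameter names (p_e, n_min, max_splits) and their default values mirror the examples above and are otherwise arbitrary.

```python
def should_stop(best_bss, original_ss, n1, n2, n_splits_so_far,
                p_e=0.006, n_min=25, max_splits=30):
    """Return True if the proposed split must be rejected: it explains less
    than a fraction p_e of the original variation (criterion 1), it leaves a
    subgroup smaller than n_min (criterion 2), or the split budget is
    exhausted (criterion 3)."""
    return (best_bss < p_e * original_ss
            or n1 < n_min
            or n2 < n_min
            or n_splits_so_far >= max_splits)
```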

It may seem odd that the splitting criterion is not based on statistical significance. In reality, however, so many splits are evaluated that statistical significance becomes an irrelevant criterion. Suppose there are m different predictors of k categories each. Even if all the predictors are monotonic, each split examines m(k − 1) possibilities, and by the time twenty-five such splits have been decided upon, the program has searched 25m(k − 1) possible splits. With twenty predictors of ten classes each, this figure would be 25 × 20 × 9 = 4,500. If the monotonicity of the predictors is not preserved (or is absent, as in the case of nominal variables), the number of possible splits explodes. Hence there is no point worrying about statistical significance or degrees of freedom.
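The count is easy to reproduce; the variable names below are illustrative and the figures are the ones used in the text.

```python
# m predictors of k classes each give m*(k-1) candidate splits per decision,
# so deciding 25 splits searches 25*m*(k-1) candidates in total.
m, k, decided = 20, 10, 25
print(decided * m * (k - 1))  # -> 4500
```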

Prediction with a Covariate

In certain situations, a particular predictor dominates the dependent variable to such an extent that hardly any other predictor matters. For example, in economic studies income or education may so dominate the dependent variable that the data are split on little else. In such a situation it may be desirable to remove the effect of the dominant variable in order to see the effects of the other variables. One could assume a linear relationship through the origin and simply divide the dependent variable by that predictor. This often has the added advantage of improving homogeneity of variance where the variance of the dependent variable is related to its level.
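A minimal sketch of this ratio adjustment, assuming a dominant predictor x with a linear-through-origin relation to y; the function name and the zero-divisor guard are additions of the sketch, not part of the original description.

```python
import numpy as np

def ratio_adjust(y, x):
    """Remove the effect of a dominant predictor x by assuming y ~ b*x through
    the origin and analyzing the ratio y/x instead of y itself.  This also
    tends to stabilize the variance when the spread of y grows with its level."""
    x = np.asarray(x, dtype=float)
    return np.asarray(y, dtype=float) / np.where(x != 0, x, np.nan)
```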

Similar problems arise in empirical research in sociology and psychology, where it becomes necessary to isolate the effect of a particular variable under a wide variety of circumstances.

Further, in the analysis of temporal changes in a phenomenon, the initial value of the phenomenon clearly affects its value measured at a subsequent time. That is why the residuals from the regression of its t2 value on its initial t1 value are often used as a measure of change instead of the raw increments. However, this "initial value" effect may not be the same for all subgroups in the population.
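A small sketch of this residualized change score; the function name and the use of an ordinary least-squares line (via np.polyfit) are illustrative assumptions.

```python
import numpy as np

def residual_change(y_t1, y_t2):
    """Residuals from regressing the t2 values on the initial t1 values,
    used as a measure of change in place of the raw increments y_t2 - y_t1."""
    slope, intercept = np.polyfit(y_t1, y_t2, 1)  # OLS fit of y_t2 on y_t1
    return np.asarray(y_t2) - (intercept + slope * np.asarray(y_t1))
```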

To deal with these covariate problems, a regression analysis is performed in which the sum of squares is explained by differences between the two subgroup regression lines rather than between the subgroup means.

This method can be used when analyzing a dependent variable with one covariate and several predictors. It aims to create groups that allow the best prediction of the dependent variable from the group regression equation and the value of the covariate. In other words, the created groups should show the largest differences between their regression lines. The splitting criterion (explained variation) is based upon the group regression of the dependent variable on the covariate:

$$ BSS \;=\; \sum_{g \in \{L_1, L_2\}} b_g^{\,2} \sum_{i \in g} \left( x_i - \bar{x}_g \right)^2 $$

where bg is the slope of the regression line of the dependent variable on the covariate in group g.
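Under the reconstruction above, the slope-based criterion can be sketched as follows; the function name and the assumption that each subgroup has positive variance in the covariate are additions of the sketch.

```python
import numpy as np

def regression_bss(y, x, mask):
    """Explained sum of squares from fitting separate regressions of y on the
    covariate x within the two subgroups: sum over g of b_g^2 * Sxx_g, where
    b_g = Sxy_g / Sxx_g is the within-group slope."""
    total = 0.0
    for m in (mask, ~mask):
        xg, yg = x[m], y[m]
        dx = xg - xg.mean()
        sxx = (dx ** 2).sum()            # assumes sxx > 0 in each subgroup
        b_g = (dx * (yg - yg.mean())).sum() / sxx
        total += b_g ** 2 * sxx
    return total
```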