Classification and Regression Trees

 

10

 

Introduction

Tree-based modeling is an exploratory data analysis technique for uncovering structure in large data sets.

Tree-based models are useful for both classification and regression problems. In these problems there is a set of classification or predictor variables (Xi) and a dependent variable (Y). The Xi variables may be a mixture of nominal and/or ordinal scales (or coded intervals of an equal-interval scale), and Y may be a quantitative or a qualitative (i.e., nominal or categorical) variable.

In classification trees the dependent variable is categorical, whereas in regression trees the dependent variable is quantitative. Regression trees parallel regression/ANOVA (analysis of variance) modeling. Classification trees parallel discriminant analysis.

The SEARCH module in IDAMS computes classification and regression trees. The SEARCH algorithm is an iterative procedure built around one question: what dichotomous split on which predictor variable will maximally improve the predictability of the dependent variable?

The SEARCH module carries out sequential binary splits according to a local optimization criterion, which varies with the measurement scale of the dependent variable.
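The split question above can be sketched in a few lines of Python. This is a minimal illustration, not the SEARCH implementation: it assumes a quantitative dependent variable, ordinal-coded predictors, and the between-group sum of squares as the measure of explained variation. The function names (`between_ss`, `best_binary_split`) and the toy data are hypothetical.

```python
from statistics import mean

def between_ss(left, right):
    """Between-group sum of squares for a two-way split of y values."""
    grand = mean(left + right)
    return (len(left) * (mean(left) - grand) ** 2
            + len(right) * (mean(right) - grand) ** 2)

def best_binary_split(rows, predictors, target):
    """Return (predictor, cut, gain) for the split with the largest gain.

    rows       -- list of dicts, one per case
    predictors -- names of ordinal-coded predictor variables (assumption)
    target     -- name of the quantitative dependent variable
    A split sends codes <= cut to the left group and the rest to the right.
    """
    best = None
    for p in predictors:
        codes = sorted({r[p] for r in rows})
        # Skip the largest code: cutting there would leave the right group empty.
        for cut in codes[:-1]:
            left = [r[target] for r in rows if r[p] <= cut]
            right = [r[target] for r in rows if r[p] > cut]
            gain = between_ss(left, right)
            if best is None or gain > best[2]:
                best = (p, cut, gain)
    return best

# Hypothetical toy data: income depends strongly on educ, weakly on region.
rows = [
    {"educ": 1, "region": 1, "income": 10},
    {"educ": 1, "region": 2, "income": 12},
    {"educ": 2, "region": 1, "income": 20},
    {"educ": 2, "region": 2, "income": 22},
]
split = best_binary_split(rows, ["educ", "region"], "income")
```

Applying `best_binary_split` again to each of the two resulting groups gives the sequential binary splitting described above. The criterion is local: each split is chosen without looking ahead to later splits.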

| Predictor Variable | Dependent Variable | Splitting Criterion | Program Option |
|---|---|---|---|
| Several (ordinal/nominal) | Quantitative | Explained variation based on group means | Means analysis |
| Several (ordinal/nominal) plus one covariate | Quantitative | Explained variation based on the regression of the dependent variable on the covariate | Regression analysis |
| Several (ordinal/nominal) | Nominal/ordinal or a set of dichotomous variables | Explained variation is the entropy of the dependent variable | Chi-square analysis |
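For a categorical dependent variable, an entropy-based criterion can be illustrated as the reduction in Shannon entropy produced by a split. The sketch below is an assumption about the general form of such a criterion; the exact statistic that SEARCH computes under the chi-square option is not given here, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def entropy_reduction(left, right):
    """Parent entropy minus the size-weighted entropy of the two child groups."""
    n = len(left) + len(right)
    parent = entropy(left + right)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return parent - children

# A split that separates the categories perfectly removes all uncertainty;
# a split that leaves both groups equally mixed removes none.
perfect = entropy_reduction(["a", "a"], ["b", "b"])
useless = entropy_reduction(["a", "b"], ["a", "b"])
```

Under this reading, the best split is the one with the largest reduction, which plays the same role that explained variation plays for a quantitative dependent variable.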