Searching for Structure (statistics)

55  Searching for Structure

Notation


y = value of the dependent variable
x = frequency (weighted) of the categorical dependent variable, or values (weighted) of dichotomous dependent variables
z = value of the covariate
w = value of the weight
k = subscript for case
j = subscript for category code of the dependent variable, or subscript for dichotomous dependent variables
m = number of codes of the dependent variable, or number of dichotomous dependent variables
g = subscript for group; g = 1 indicates the whole sample
i = subscript for final groups
t = number of final groups
$N_g$ = number of cases in group g
$W_g$ = sum of weights in group g
$N_i$ = number of cases in the final group i
$W_i$ = sum of weights in the final group i
N = total number of cases
W = total sum of weights.

55.1  Means Analysis

This method can be used to analyse one dependent variable (interval or dichotomous) with several predictors. It aims to create groups that allow the best prediction of the dependent variable values from the group average. In other words, the groups created should yield the largest differences in group means. The splitting criterion (explained variation) is therefore based on group means.

a)  Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.

i)  Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.

ii)  Mean y. Mean value of the dependent variable y in group g.



$$\bar{y}_g = \left( \sum_{k=1}^{N_g} w_k \, y_{gk} \right) / \, W_g$$

iii)  Var y. Variance of the dependent variable y in group g.


$$s_{yg}^2 = \left( \sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)^2 \right) / \left( W_g - \frac{W_g}{N_g} \right)$$

iv)  Variation. Sum of squares of the dependent variable (as in one-way analysis of variance) in group g.


$$V_g = \sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)^2$$

v)  Var expl. Explained variation is the difference between the variation in the parent group and the sum of the variations in its two child groups. For each predictor, the value reported is the variation explained by the best split on that predictor, i.e. the highest value obtained over all possible splits on it.

Let $g_1$ and $g_2$ denote the two subgroups (child groups) obtained by splitting the parent group g, and $V_{g_1}$ and $V_{g_2}$ their respective variations. The variation explained by this split of group g is calculated as follows:


$$EV_g = V_g - (V_{g_1} + V_{g_2})$$

This value is then maximized over all possible splits for the predictor.

vi)  Explained variation. This is the percent of the total variation explained by the final groups.


$$\text{Percent} = 100 \times \frac{EV}{TV}$$

where EV and TV are, respectively, the variation explained by the final groups and the total variation (see 1.b below).
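
The trace statistics above can be sketched in a few lines of Python. This is a minimal illustration rather than the program's own code: the function names are hypothetical, and the threshold rule "predictor <= cut" used to enumerate candidate splits is an assumption that suits an ordered predictor only.

```python
from typing import Sequence, Tuple


def group_stats(y: Sequence[float], w: Sequence[float]) -> dict:
    """Sum (wt), Mean y, Var y and Variation for one group g."""
    n_g = len(y)                      # N_g, number of cases
    w_g = sum(w)                      # W_g, sum of weights
    mean = sum(wk * yk for wk, yk in zip(w, y)) / w_g
    variation = sum(wk * (yk - mean) ** 2 for wk, yk in zip(w, y))  # V_g
    # Var y divides by W_g - W_g / N_g, which is zero for a single case.
    variance = variation / (w_g - w_g / n_g) if n_g > 1 else float("nan")
    return {"sum_wt": w_g, "mean": mean, "variance": variance,
            "variation": variation}


def explained_variation(y, w, in_first_child: Sequence[bool]) -> float:
    """EV_g = V_g - (V_g1 + V_g2) for one tentative split of group g."""
    v_children = 0.0
    for side in (True, False):
        idx = [k for k, flag in enumerate(in_first_child) if flag == side]
        if not idx:
            raise ValueError("both child groups must contain at least one case")
        v_children += group_stats([y[k] for k in idx],
                                  [w[k] for k in idx])["variation"]
    return group_stats(y, w)["variation"] - v_children


def best_split(y, w, predictor: Sequence[float]) -> Tuple[float, float]:
    """Largest EV_g over the splits 'predictor <= cut' (assumed split rule)."""
    cuts = sorted(set(predictor))[:-1]   # the top value would empty child 2
    return max((explained_variation(y, w, [p <= c for p in predictor]), c)
               for c in cuts)


# Toy data with unit weights: two well-separated clusters of y values.
y = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
w = [1.0] * 6
ev, cut = best_split(y, w, predictor=[1, 2, 3, 4, 5, 6])
print(ev, cut)
```

For this toy data the best cut separates the three low values from the three high ones. Note that `best_split` needs a predictor with at least two distinct values; otherwise no split exists and `max` raises an error.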

b)  One-way analysis of final groups. These are one-way analysis of variance statistics calculated for the final groups.

i)  Explained variation and DF. This is the amount of variation explained by the final groups and the corresponding degrees of freedom.


$$EV = TV - UV = TV - \sum_{i=1}^{t} V_i$$


$$DF = t - 1$$

ii)  Total variation and DF. Variation calculated for the whole sample, i.e. for group 1, and the corresponding degrees of freedom.


$$TV = V_1$$


$$DF = W - 1$$

iii)  Error and DF. This is the amount of unexplained variation and the corresponding degrees of freedom.


$$UV = \sum_{i=1}^{t} V_i$$


$$DF = W - t$$
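
As an illustration of how the three lines of the one-way table fit together, the following sketch computes EV, TV, UV and their degrees of freedom from a list of final groups. The function names and the dictionary layout are hypothetical, and the sketch assumes that the total variation is greater than zero.

```python
from typing import List, Sequence, Tuple

Group = Tuple[Sequence[float], Sequence[float]]  # (y values, weights)


def variation(y: Sequence[float], w: Sequence[float]) -> float:
    """Weighted sum of squares of y about its weighted mean."""
    w_sum = sum(w)
    mean = sum(wk * yk for wk, yk in zip(w, y)) / w_sum
    return sum(wk * (yk - mean) ** 2 for wk, yk in zip(w, y))


def one_way_summary(final_groups: List[Group]) -> dict:
    """Explained, total and error variation with degrees of freedom."""
    all_y = [yk for y, _ in final_groups for yk in y]
    all_w = [wk for _, w in final_groups for wk in w]
    t = len(final_groups)            # number of final groups
    total_w = sum(all_w)             # W, total sum of weights
    tv = variation(all_y, all_w)     # TV = V_1 (whole sample)
    uv = sum(variation(y, w) for y, w in final_groups)  # UV = sum of V_i
    ev = tv - uv                     # EV = TV - UV
    return {
        "EV": ev, "EV_df": t - 1,
        "TV": tv, "TV_df": total_w - 1,
        "UV": uv, "UV_df": total_w - t,
        "percent": 100.0 * ev / tv,  # assumes TV > 0
    }


summary = one_way_summary([
    ([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]),
    ([10.0, 11.0, 12.0], [1.0, 1.0, 1.0]),
])
print(summary)
```

With unit weights the degrees of freedom reduce to the familiar N - 1 and N - t.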

c)  Split summary table. The table provides group mean value, variance and variation of the dependent variable at each split as well as the variation explained by that split (see 1.a above).

d)  Final group summary table. The table provides mean value, variance and variation of the dependent variable for the final groups (see 1.a above).

e)  Percent of explained variation. The percent of total variation explained by the best split for each group is calculated as follows:


$$\text{Percent}_g = 100 \times \frac{EV_g}{TV}$$

Note that this value is equal to zero for the final groups (indicated by an asterisk).

f)  Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.


$$e_k = y_k - \hat{y}_k$$

As predicted value, a case is assigned the mean value of the dependent variable for the group to which it belongs, i.e.


$$\hat{y}_{ik} = \bar{y}_i$$

55.2  Regression Analysis

This method can be used to analyse a dependent variable (interval or dichotomous) with one covariate and several predictors. It aims to create groups that allow the best prediction of the dependent variable values from the group regression equation and the value of the covariate. In other words, the groups created should yield the largest differences in group regression lines. The splitting criterion (explained variation) is based on the group regression of the dependent variable on the covariate.

a)  Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.

i)  Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.

ii)  Mean y,z. Mean value of the dependent variable y and the covariate z in group g (see 1.a.ii above).

iii)  Var y,z. Variance of the dependent variable y and the covariate z in group g (see 1.a.iii above).

iv)  Slope. This is the slope of the dependent variable y on the covariate z in group g.


$$b_g = \frac{\sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g)}{\sum_{k=1}^{N_g} w_k \, (z_{gk} - \bar{z}_g)^2}$$

v)  Variation. This is the error or residual sum of squares from estimating the variable y by its regression on covariate in group g, i.e. a measure of deviation about the regression line.


$$V_g = \sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)^2 \; - \; b_g \times \sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g)$$

where $b_g$ is the slope of the regression line in group g.

vi)  Var expl. Explained variation (EV). See 1.a.v above for general information, and 2.a.v above for details on V (variation) used in regression analysis.

vii)  Explained variation. This is the percent of the total variation explained by the final groups. See 1.a.vi above and 2.b below.
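
The slope and the residual variation of a group can be computed together, since both use the same weighted sums of squares and cross-products. The sketch below is an illustration only; the function name is hypothetical, and raising an error when the covariate is constant within the group is an assumption, as the text does not describe that case.

```python
from typing import Sequence, Tuple


def regression_variation(y: Sequence[float], z: Sequence[float],
                         w: Sequence[float]) -> Tuple[float, float]:
    """Return (b_g, V_g): slope of y on z and residual sum of squares."""
    w_g = sum(w)
    y_bar = sum(wk * yk for wk, yk in zip(w, y)) / w_g
    z_bar = sum(wk * zk for wk, zk in zip(w, z)) / w_g
    s_yy = sum(wk * (yk - y_bar) ** 2 for wk, yk in zip(w, y))
    s_zz = sum(wk * (zk - z_bar) ** 2 for wk, zk in zip(w, z))
    s_yz = sum(wk * (yk - y_bar) * (zk - z_bar)
               for wk, yk, zk in zip(w, y, z))
    if s_zz == 0:
        # Assumption: a constant covariate leaves the slope undefined.
        raise ValueError("covariate has no variation in this group")
    b_g = s_yz / s_zz
    v_g = s_yy - b_g * s_yz   # deviation about the regression line
    return b_g, v_g


z = [0.0, 1.0, 2.0, 3.0]
w = [1.0, 1.0, 1.0, 1.0]
b_line, v_line = regression_variation([1.0, 3.0, 5.0, 7.0], z, w)  # exact line
b_fit, v_fit = regression_variation([1.0, 3.0, 5.0, 8.0], z, w)    # with scatter
print(b_line, v_line, b_fit, v_fit)
```

When the points lie exactly on a line the variation is zero, so a split that produces two such groups explains all of the parent group's variation.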

b)  One-way analysis of final groups. These are the summary statistics for the final groups. See 1.b above for general information, and 2.a.v and 2.a.vi above for details on V and EV measures used in regression analysis.

c)   Split summary table. The table provides group mean value, variance and variation of the dependent variable at each split as well as the variation explained by that split. It also provides mean value and variance of the covariate. See 2.a above for formulas. Moreover, the following regression statistics are calculated for each split:

i)  Slope. It is the slope of the dependent variable y on the covariate z in group g (see 2.a.iv above).

ii)  Intercept. It is the constant term in the regression equation.


$$a_g = \bar{y}_g - b_g \, \bar{z}_g$$

where $b_g$ is the slope in group g.

iii)  Corr. Pearson r correlation coefficient between the dependent variable y and the covariate z in group g.


$$r_g = \left( \sum_{k=1}^{N_g} w_k \, (y_{gk} - \bar{y}_g)(z_{gk} - \bar{z}_g) \right) / \sqrt{s_{yg}^2 \; s_{zg}^2}$$


d)  Final group summary table. The table provides the same information (except the explained variation) as in "Split summary table", but for final groups.

e)  Percent of explained variation. The percent of total variation explained by the best split for each group (see 1.e and 2.a.vi above).

f)  Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.


$$e_k = y_k - \hat{y}_k$$

Predicted values are calculated as follows:


$$\hat{y}_{ik} = a_i + b_i \, z_{ik}$$

where $a_i$ and $b_i$ are the regression coefficients for the final group i.
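
A short sketch of the residual calculation for one final group follows. It is illustrative only: the function names are hypothetical, and the group is assumed to contain at least two distinct covariate values so that the slope is defined.

```python
from typing import List, Sequence, Tuple


def fit_line(y: Sequence[float], z: Sequence[float],
             w: Sequence[float]) -> Tuple[float, float]:
    """Weighted intercept a_i and slope b_i for one final group."""
    w_i = sum(w)
    y_bar = sum(wk * yk for wk, yk in zip(w, y)) / w_i
    z_bar = sum(wk * zk for wk, zk in zip(w, z)) / w_i
    s_zz = sum(wk * (zk - z_bar) ** 2 for wk, zk in zip(w, z))
    s_yz = sum(wk * (yk - y_bar) * (zk - z_bar)
               for wk, yk, zk in zip(w, y, z))
    b_i = s_yz / s_zz               # assumes s_zz > 0
    a_i = y_bar - b_i * z_bar       # intercept
    return a_i, b_i


def residuals(y: Sequence[float], z: Sequence[float],
              a_i: float, b_i: float) -> List[float]:
    """e_k = y_k - (a_i + b_i * z_ik) for every case in the group."""
    return [yk - (a_i + b_i * zk) for yk, zk in zip(y, z)]


y = [1.0, 3.0, 5.0, 8.0]
z = [0.0, 1.0, 2.0, 3.0]
w = [1.0, 1.0, 1.0, 1.0]
a, b = fit_line(y, z, w)
e = residuals(y, z, a, b)
print(a, b, e)
```

As a quick check, the weighted sum of the residuals within a group is zero (up to rounding), because the fitted line passes through the point of group means.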

55.3  Chi-square Analysis

This method can be used to analyse one dependent variable (nominal or ordinal) or a set of dichotomous dependent variables with several predictors. It aims to create groups that allow the best prediction of the dependent variable category from its group distribution. In other words, the groups created should yield the largest differences in the dependent variable distributions. The splitting criterion (explained variation) is calculated from the frequency distributions of the dependent variable. Note that multiple dichotomous dependent variables are treated as categories of one categorical variable.

a)  Trace statistics. These are the statistics calculated on the whole sample (for g = 1), and on tentative splits for parent groups as well as for each group resulting from the best split.

i)  Sum (wt). Number of cases ($N_g$) if the weight variable is not specified, or weighted number of cases ($W_g$) in group g.

ii)  Variation. This is the entropy for group g, i.e. a measure of disorder in the distribution of the dependent variable.


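$$V_g = -2 \sum_{j=1}^{m} x_{jg\cdot} \times \ln \frac{x_{jg\cdot}}{x_{\cdot g \cdot}}$$

where


$$x_{jg\cdot} = \sum_{k=1}^{N_g} x_{jgk} \qquad\qquad x_{\cdot g \cdot} = \sum_{j=1}^{m} x_{jg\cdot}$$

and $x_{jgk}$ is the "frequency" (coded 0 or 1) of code j (or value of variable j) of case k in group g.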

iii)  Var expl. Explained variation (EV). See 1.a.v above for general information, and 3.a.ii above for details on V (variation) used in chi-square analysis.

iv)  Explained variation. This is the percent of the total variation explained by the final groups. See 1.a.vi above and 3.b below.
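
The entropy-based variation is easy to compute from the category frequencies of a group. The sketch below is an illustration with a hypothetical function name; it assumes the usual convention that a category with zero frequency contributes nothing to the sum, which the text does not state explicitly.

```python
import math
from typing import Sequence


def entropy_variation(freqs: Sequence[float]) -> float:
    """V_g = -2 * sum_j x_j * ln(x_j / x_total) for one group g.

    Written as 2 * sum_j x_j * ln(x_total / x_j), which is the same value.
    Assumption: categories with zero frequency are skipped (0 * ln 0 = 0).
    """
    total = sum(freqs)
    return 2.0 * sum(x * math.log(total / x) for x in freqs if x > 0)


# A parent group with two equally frequent codes, split into two pure groups.
v_parent = entropy_variation([5, 5])
v_child_1 = entropy_variation([5, 0])
v_child_2 = entropy_variation([0, 5])
ev = v_parent - (v_child_1 + v_child_2)   # explained variation of the split
print(v_parent, ev)
```

A group in which every case has the same code has zero variation, so a split into two pure groups explains the whole variation of the parent.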

b)  One-way analysis of final groups. These are the summary statistics for the final groups. See 1.b above for general information, and 3.a.ii and 3.a.iii above for details on V and EV measures used in chi-square analysis.

c)  Split summary table. The table provides variation of the dependent variable at each split as well as the variation explained by that split. See 3.a.ii and 3.a.iii above for formulas.

d)  Final group summary table. The table provides variation of the dependent variable for the final groups.

e)  Percent of explained variation. The percent of total variation explained by the best split for each group (see 1.e and 3.a.iii above).

f)  Percent distributions. A bivariate table showing percentage distributions of the dependent variable for all groups ($P_{jg}$).

g)  Residuals. The residuals are the differences between the observed value and the predicted value of the dependent variable.

For analysis with one categorical dependent variable, residuals are calculated for each category of the variable. Thus, the number of residuals is equal to the number of categories.


$$e_{jk} = x_{jk} - \hat{x}_{jik}$$

Observed values, $x_{jk}$, are created as a series of "dummy variables", coded 0 or 1.

As predicted value for category j, a case is assigned the proportion of cases being in this category for the group to which the case belongs, i.e.


$$\hat{x}_{jik} = P_{ji} \, / \, 100$$

For analysis with several dichotomous dependent variables, residuals are calculated for each variable. Thus, the number of residuals is equal to the number of dependent variables.


$$e_{jk} = x'_{jk} - \hat{x}_{jik}$$

Observed values are calculated as follows:


$$x'_{jk} = x_{jk} \, / \sum_{j=1}^{m} x_{jk}$$

As predicted value for variable j, a case is assigned the proportion of cases having value 1 for this variable in the group to which the case belongs, i.e.


$$\hat{x}_{jik} = P_{ji} \, / \, 100$$
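
Both kinds of residuals can be sketched for a single final group as follows. This is an illustration under stated assumptions: the function names are hypothetical, all cases have unit weight, category codes run from 0 to m - 1, and a case with value 0 on every dichotomous variable is given observed values of 0 (the text does not cover that case).

```python
from typing import List, Sequence


def categorical_residuals(codes: Sequence[int], m: int) -> List[List[float]]:
    """One categorical dependent variable: m residuals per case."""
    n = len(codes)
    # P_ji / 100: proportion of the group's cases falling in category j.
    prop = [sum(1 for c in codes if c == j) / n for j in range(m)]
    # Observed values are 0/1 dummy variables, one per category.
    return [[(1.0 if c == j else 0.0) - prop[j] for j in range(m)]
            for c in codes]


def dichotomous_residuals(rows: Sequence[Sequence[int]]) -> List[List[float]]:
    """Several dichotomous dependent variables: one residual per variable."""
    n, m = len(rows), len(rows[0])
    # P_ji / 100: proportion of cases having value 1 on variable j.
    prop = [sum(r[j] for r in rows) / n for j in range(m)]
    out = []
    for r in rows:
        s = sum(r)
        # x'_jk = x_jk / sum_j x_jk; assumed 0 when the case has no 1s.
        scaled = [x / s if s else 0.0 for x in r]
        out.append([scaled[j] - prop[j] for j in range(m)])
    return out


cat_res = categorical_residuals([0, 0, 1, 2], m=3)
dich_res = dichotomous_residuals([[1, 0], [1, 1]])
print(cat_res[0], dich_res)
```

In the categorical case the residuals of each case sum to zero, because both the dummy variables and the group proportions sum to one.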

55.4  References

Morgan, J.N., Messenger, R.C., THAID: A Sequential Analysis Program for the Analysis of Nominal Scale Dependent Variables, Institute for Social Research, The University of Michigan, Ann Arbor, 1973.

Sonquist, J.A., Baker, E.L., Morgan, J.N., Searching for Structure, Revised ed., Institute for Social Research, The University of Michigan, Ann Arbor, 1974.