Typology and Ascending Classification (statistics)

57     Typology and Ascending Classification

Notation


x = values of variables
k = subscript for case
v = subscript for variable
g, i, j = subscripts for groups
a = number of active variables (quantitative and dichotomized qualitative)
p = number of passive variables (quantitative and dichotomized qualitative)
t = number of initial groups
N_i = number of cases in group i (weighted if the case weight is used)
N_j = number of cases in group j (weighted if the case weight is used)
α = value of the variable weight
w = value of the case weight
W = total sum of case weights.

57.1  Types of Variables Used

The program accepts both quantitative and qualitative (categorical) variables. Qualitative variables are treated as quantitative after full dichotomization of their categories, i.e. after the construction of as many dichotomous (1/0) variables as there are categories. The variables used by the program may be either active or passive. The active variables are those on the basis of which the typology is constructed. The passive variables do not participate in the construction of the typology, but the program prints their main statistics within the groups of the typology.

The set of active variables is denoted here by $X_a$, and the set of passive variables by $X_p$.
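As an illustration of full dichotomization, the following minimal Python sketch builds one 1/0 indicator per observed category. The function name `dichotomize` and the sample data are assumptions made for this example and are not part of the program.

```python
def dichotomize(values):
    """Return (categories, rows): one 1/0 indicator per observed category."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


categories, rows = dichotomize(["red", "blue", "red", "green"])
print(categories)  # ['blue', 'green', 'red']
print(rows)        # [[0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

Each case has exactly one indicator equal to 1, so a qualitative variable with c categories contributes c columns to the profile.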

57.2  Case Profile

The profile of case k is the vector $P_k$ such that

$$P_k = (x_{k1}, x_{k2}, \ldots, x_{kv}, \ldots, x_{ka}) = (x_{kv})$$

where all $x_v \in X_a$.

If standardization of the active variables is requested, the profile of case k becomes

$$P_k = \left( \frac{x_{kv}}{s_v} \right)$$

where $s_v$ is the standard deviation of the variable $x_v$ (see 7.b below).

57.3  Group Profile

The profile of group i, also called the barycenter of the group, is the vector $P_i$ such that

$$P_i = (\bar{x}_{i1}, \bar{x}_{i2}, \ldots, \bar{x}_{iv}, \ldots, \bar{x}_{ia}) = (\bar{x}_{iv})$$

and in the case of standardized data it becomes

$$P_i = \left( \frac{\bar{x}_{iv}}{s_v} \right)$$

where the numerator is the mean of the variable $x_v$ for the cases belonging to group i and the denominator is the overall standard deviation of this variable.
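The group profile can be sketched in a few lines of Python. This is a minimal illustration, assuming unweighted cases; the function name `group_profile` and the sample values (including the overall standard deviations passed as `sd`) are hypothetical.

```python
def group_profile(cases, sd=None):
    """Barycenter of a group: the mean of each active variable over its cases.

    If `sd` (overall standard deviations, one per variable) is given,
    each mean is divided by it, as for standardized data.
    """
    n = len(cases)
    means = [sum(column) / n for column in zip(*cases)]
    if sd is None:
        return means
    return [m / s for m, s in zip(means, sd)]


group = [[1.0, 10.0], [3.0, 30.0]]  # two cases, two active variables
profile = group_profile(group)
std_profile = group_profile(group, sd=[2.0, 10.0])
print(profile)      # [2.0, 20.0]
print(std_profile)  # [1.0, 2.0]
```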

57.4  Distances Used

There are three basic types of distances used in the program, namely: city block distance, Euclidean distance and Chi-square distance of Benzécri. They may be used to calculate distances between two cases, between a case and a group of cases, and between two groups of cases. Below, these distances are defined as distances between two groups of cases (between two group profiles); the other distances can be obtained by adapting the respective formulas.

a)  City block distance.


$$d_{ij} = d(P_i, P_j) = \frac{\sum_{v=1}^{a} \alpha_v \, |\bar{x}_{iv} - \bar{x}_{jv}|}{\sum_{v=1}^{a} \alpha_v}$$

b)  Euclidean distance.


$$d_{ij} = d(P_i, P_j) = \sqrt{\frac{\sum_{v=1}^{a} \alpha_v \, (\bar{x}_{iv} - \bar{x}_{jv})^2}{\sum_{v=1}^{a} \alpha_v}}$$

c)  Chi-square distance.


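$$d_{ij} = d(P_i, P_j) = \sqrt{\sum_{v=1}^{a} \frac{1}{p_v} \left( \frac{p_{iv}}{p_i} - \frac{p_{jv}}{p_j} \right)^2}$$

where

$$p_v = \sum_{g=1}^{t} \bar{x}_{gv}, \qquad p_i = \sum_{v=1}^{a} \bar{x}_{iv}, \qquad p_j = \sum_{v=1}^{a} \bar{x}_{jv}$$

$$p_{iv} = \bar{x}_{iv} \Big/ \left[ \sum_{g=1}^{t} \sum_{v=1}^{a} \bar{x}_{gv} \right], \qquad p_{jv} = \bar{x}_{jv} \Big/ \left[ \sum_{g=1}^{t} \sum_{v=1}^{a} \bar{x}_{gv} \right]$$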
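The three distances can be sketched in Python as follows. This is a minimal illustration that follows the formulas of this section literally; the function names and the sample profiles are assumptions made for the example, and the Chi-square version assumes that no variable has a zero total over the groups.

```python
import math


def city_block(p_i, p_j, alpha):
    """Weighted city block distance between two profiles."""
    num = sum(a * abs(x - y) for a, x, y in zip(alpha, p_i, p_j))
    return num / sum(alpha)


def euclidean(p_i, p_j, alpha):
    """Weighted Euclidean distance between two profiles."""
    num = sum(a * (x - y) ** 2 for a, x, y in zip(alpha, p_i, p_j))
    return math.sqrt(num / sum(alpha))


def chi_square(profiles, i, j):
    """Chi-square distance between groups i and j.

    `profiles` holds the mean profiles of all t groups, because the
    formula needs totals over every group.
    """
    total = sum(sum(p) for p in profiles)          # sum over groups and variables
    p_v = [sum(col) for col in zip(*profiles)]     # sum over groups, per variable
    p_i, p_j = sum(profiles[i]), sum(profiles[j])  # sum over variables
    acc = 0.0
    for v, pv in enumerate(p_v):
        p_iv = profiles[i][v] / total
        p_jv = profiles[j][v] / total
        acc += (p_iv / p_i - p_jv / p_j) ** 2 / pv
    return math.sqrt(acc)


alpha = [1.0, 1.0]
print(city_block([1.0, 2.0], [4.0, 6.0], alpha))   # 3.5
print(euclidean([1.0, 2.0], [4.0, 6.0], alpha))    # square root of 12.5
print(chi_square([[1.0, 2.0], [2.0, 4.0]], 0, 1))  # zero: the profiles are proportional
```

The Chi-square distance compares the shapes of two profiles rather than their levels, which is why two proportional profiles are at distance zero.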

Moreover, the program can use a "weighted" distance, called displacement, which is defined as follows:

$$D_{ij} = D(P_i, P_j) = \frac{2 N_i N_j}{N_i + N_j} \, d_{ij}$$

Note that the displacement between two case profiles is equal to their distance, since $N_i = N_j = 1$.
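A short Python sketch of the displacement, under the assumption that the distance between the two profiles has already been computed; the function name `displacement` is hypothetical.

```python
def displacement(d_ij, n_i, n_j):
    """Distance between two profiles, weighted by the sizes of the two groups."""
    return 2.0 * n_i * n_j / (n_i + n_j) * d_ij


print(displacement(3.0, 1, 1))   # 3.0: for two single cases the factor is 1
print(displacement(3.0, 4, 12))  # 18.0: the factor is 2 * 48 / 16 = 6
```

For a fixed distance, the factor grows with the sizes of both groups, so merging two large groups counts as a bigger displacement than merging two small ones.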

57.5  Building of an Initial Typology

a)  Selection of an initial configuration. Before starting the process of aggregating the cases, the program selects the initial configuration, i.e. t initial group profiles, in either one of the following ways:

When the construction starts from t case profiles, the program considers this set of t vectors as a set of t 'starting cases' and distributes the remaining cases according to their distance to each of the starting cases.

Let us denote the set of t starting cases by

$$P_{\mathrm{starting}} = \{ P_{k_1}, P_{k_2}, \ldots, P_{k_t} \}$$

and the distance between groups and/or cases i and j by $D(P_i, P_j)$.

Note that $D(P_i, P_j)$ can be any distance defined in section 4 above.

For each case $i \notin P_{\mathrm{starting}}$ the program calculates

$$\beta = \min_{1 \le j \le t} \left[ D(P_i, P_{k_j}) \right]$$

$$\gamma = \min \left[ D(P_{k_1}, P_{k_2}), D(P_{k_1}, P_{k_3}), \ldots, D(P_{k_{t-1}}, P_{k_t}) \right]$$

There are two possibilities:

At the end of this procedure, the initial configuration is a set of t profiles

$$P_{\mathrm{initial}} = \{ P_1, P_2, \ldots, P_j, \ldots, P_t \}$$

where $P_j$ is the mean profile of all the cases belonging to group j.

At this stage the program does not take into account weighting of cases, if any.

b)  Stabilization of the initial configuration. The initial configuration is stabilized by an iteration process. During each iteration, the program redistributes the cases among initial groups taking into account their distances to each group profile.

Here again there are two possibilities:

After this operation, the group $P_j$ contains $N_j - 1$ cases and the group $P_{j'}$ contains $N_{j'} + 1$ cases.

Note that if the cases are weighted, then

$$N_j = N_j - w_i$$

$$N_{j'} = N_{j'} + w_i$$

$$P_i = w_i \, P_i$$

where $w_i$ is the weight of case i, and $N_j$ and $N_{j'}$ are the weighted numbers of cases in the groups $P_j$ and $P_{j'}$ respectively.

Stability of groups is measured by the percentage of cases that do not change groups between two subsequent iterations.

The procedure is repeated until the groups are stabilized or the number of iterations fixed by the user is reached.
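The quantities used in this section can be sketched in Python. This is only an illustration under stated assumptions, not the program's actual procedure: `beta_gamma` computes the two minima defined in 5.a, while `stabilize` assumes a simple rule (each case goes to its closest group profile, all cases are reassigned in one pass, cases are unweighted) because the exact reassignment rules are not reproduced in this text. All function names, the `target` parameter and the sample data are hypothetical.

```python
from itertools import combinations


def beta_gamma(case, starting, dist):
    """beta: distance from `case` to its closest starting case;
    gamma: smallest distance between any two starting cases."""
    beta = min(dist(case, s) for s in starting)
    gamma = min(dist(p, q) for p, q in combinations(starting, 2))
    return beta, gamma


def nearest(case, profiles, dist):
    """Index of the group profile closest to `case`."""
    return min(range(len(profiles)), key=lambda g: dist(case, profiles[g]))


def mean_profiles(cases, labels, old_profiles):
    """Unweighted mean profile per group; an empty group keeps its old profile."""
    new = []
    for g, old in enumerate(old_profiles):
        members = [c for c, lab in zip(cases, labels) if lab == g]
        new.append([sum(col) / len(members) for col in zip(*members)]
                   if members else old)
    return new


def stabilize(cases, profiles, dist, max_iter=10, target=100.0):
    """Reassign cases until `target` percent of them keep their group
    between two iterations, or until `max_iter` iterations have been run."""
    labels = [nearest(c, profiles, dist) for c in cases]
    stability = 0.0
    for _ in range(max_iter):
        profiles = mean_profiles(cases, labels, profiles)
        new_labels = [nearest(c, profiles, dist) for c in cases]
        unchanged = sum(a == b for a, b in zip(labels, new_labels))
        stability = 100.0 * unchanged / len(cases)
        labels = new_labels
        if stability >= target:
            break
    return labels, profiles, stability


def dist(p, q):
    # Unweighted city block distance, used here only for illustration.
    return sum(abs(x - y) for x, y in zip(p, q))


print(beta_gamma([1.0, 0.0], [[0.0, 0.0], [4.0, 0.0], [0.0, 10.0]], dist))  # (1.0, 4.0)
cases = [[0.0], [1.0], [10.0], [11.0]]
labels, profiles, stability = stabilize(cases, [[0.0], [1.0]], dist)
print(labels, profiles, stability)  # [0, 0, 1, 1] [[0.5], [10.5]] 100.0
```

In the sample run the first iteration moves one case (75 % of cases stay in place) and the second iteration moves none, so the loop stops with a stability of 100 %.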

57.6  Characteristics of Distances by Groups

a)  N. The number of cases in each group of the initial typology.

b)  Mean. Mean distance for each group, i.e. the mean of distances from the group profile over all cases belonging to this group.

c)  SD. Standard deviation of distance for each group.

d)  Classification of distances. Distribution of cases, both in terms of frequency and percentages, across 15 continuous intervals, which are different for each group.

e)  Total count. Total number of cases participating in the building of the initial typology.

f)  Mean. Overall mean distance.

g)  SD. Overall standard deviation of distance.

h)  Classification of distances (same limits for each group). Same as 6.d above except that the 15 intervals are of the same range for all groups.
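The statistics listed above can be sketched in Python for one group. This is a minimal illustration under assumptions that the text does not confirm: the intervals are taken to be 15 equal-width intervals between the smallest and largest distance, and the standard deviation divides by the number of cases. The function name `distance_summary` and the sample distances are hypothetical.

```python
import math


def distance_summary(distances, n_intervals=15):
    """N, mean, SD and the distribution of distances over equal-width intervals."""
    n = len(distances)
    mean = sum(distances) / n
    sd = math.sqrt(sum((d - mean) ** 2 for d in distances) / n)
    lo, hi = min(distances), max(distances)
    width = (hi - lo) / n_intervals or 1.0  # avoid a zero width if all distances are equal
    counts = [0] * n_intervals
    for d in distances:
        index = min(int((d - lo) / width), n_intervals - 1)  # the maximum goes in the last interval
        counts[index] += 1
    percents = [100.0 * c / n for c in counts]
    return n, mean, sd, counts, percents


n, mean, sd, counts, percents = distance_summary([0.0, 1.0, 2.0, 3.0])
print(n, mean, sd)
print(counts)
```

Passing the same `lo` and `hi` for every group, instead of computing them per group, would correspond to the variant in 6.h where the limits are common to all groups.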

57.7  Summary Statistics for Quantitative Variables and for Qualitative Active Variables

a)  Mean. Mean of quantitative $x_v \in (X_a \cup X_p)$. For categories of qualitative variables, it is the proportion of cases in the category.

$$\bar{x}_v = \left( \sum_{k} w_k \, x_{kv} \right) \Big/ W$$

b)  S. D. Standard deviation.

$$s_v = \sqrt{\left[ W \sum_{k} w_k \, x_{kv}^2 - \left( \sum_{k} w_k \, x_{kv} \right)^2 \right] \Big/ W^2}$$

c)  Weight. The value of variable weight calculated for each variable as follows:


$$\alpha_v = \begin{cases}
0 & \text{for quantitative passive variables} \\
1 & \text{for quantitative active variables} \\
\sqrt{(c+1)/3} \, \big/ \, c & \text{for categories of a qualitative active variable, where } c \text{ is the number of non-empty categories of the variable under consideration} \\
1 & \text{for categories of a qualitative active variable if Chi-square distance is used.}
\end{cases}$$
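The three statistics of this section can be sketched in Python. This is a minimal illustration of the formulas above; the function names and arguments are assumptions made for the example. The weight of categories of qualitative passive variables is not given in this section, so the sketch raises an error for that case rather than guessing.

```python
import math


def weighted_mean_sd(x, w):
    """Weighted mean (7.a) and standard deviation (7.b) of one variable."""
    W = sum(w)
    s1 = sum(wk * xk for wk, xk in zip(w, x))
    s2 = sum(wk * xk ** 2 for wk, xk in zip(w, x))
    variance = max((W * s2 - s1 ** 2) / W ** 2, 0.0)  # guard against rounding below zero
    return s1 / W, math.sqrt(variance)


def variable_weight(qualitative, active, c=None, chi_square=False):
    """Variable weight (7.c); `c` is the number of non-empty categories."""
    if not qualitative:
        return 1.0 if active else 0.0
    if not active:
        raise ValueError("weight of qualitative passive variables is not given in the text")
    if chi_square:
        return 1.0
    return math.sqrt((c + 1) / 3) / c


print(weighted_mean_sd([0.0, 2.0, 2.0], [2.0, 1.0, 1.0]))  # (1.0, 1.0)
print(variable_weight(qualitative=True, active=True, c=2))  # 0.5
```

With two categories the weight of each category is 0.5, so a dichotomized two-category variable counts in total as much as one quantitative active variable.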

57.8  Description of Resulting Typology

At the end of the initial typology construction, and also at the end of each step of the ascending classification, all variables, both active and passive, are evaluated by the amount of explained variance. This is a measure of the discriminant power of each quantitative variable and of each category of the qualitative variables. It is followed by an individual description of all groups of the typology.

a)  Proportion of cases. Percentage, multiplied by 1000, of cases belonging to each group of the typology.

b)  Explained variance.


$$EV(x_v) = 1000 \times \frac{\sum_{i=1}^{t_g} N_i \left( \bar{x}_{iv} - \bar{x}_v \right)^2}{\sum_{k} w_k \left( x_{kv} - \bar{x}_v \right)^2}$$

where

$t_g$ = number of groups in the typology
$\bar{x}_{iv}$ = mean of the variable v in group i
$\bar{x}_v$ = grand mean of the variable v.

c)  Grand mean.

For quantitative variables, mean values as described under 7.a above.

For each category of qualitative variables, percentage of cases in this category.

d)  Statistics for each group of the typology.

For quantitative variables:
first line: mean values as described under 7.a above;
second line: standard deviations as described under 7.b above.

For each category of qualitative variables:
first line: column percentage of cases;
second line: row percentage of cases.
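The explained variance of 8.b can be sketched in Python as the weighted between-group sum of squares divided by the weighted total sum of squares, multiplied by 1000. This is a minimal illustration; the function name `explained_variance`, the use of weighted group means and the sample data are assumptions made for the example, and the variable is assumed not to be constant.

```python
def explained_variance(x, w, groups):
    """EV of one variable: 1000 * between-group / total sum of squares."""
    W = sum(w)
    grand_mean = sum(wk * xk for wk, xk in zip(w, x)) / W
    size, total = {}, {}
    for xk, wk, g in zip(x, w, groups):
        size[g] = size.get(g, 0.0) + wk        # weighted number of cases N_i
        total[g] = total.get(g, 0.0) + wk * xk
    between = sum(size[g] * (total[g] / size[g] - grand_mean) ** 2 for g in size)
    overall = sum(wk * (xk - grand_mean) ** 2 for wk, xk in zip(w, x))
    return 1000.0 * between / overall


weights = [1.0, 1.0, 1.0, 1.0]
print(explained_variance([0.0, 2.0, 4.0, 6.0], weights, [0, 0, 1, 1]))  # 800.0
print(explained_variance([1.0, 1.0, 5.0, 5.0], weights, [0, 0, 1, 1]))  # 1000.0
```

A value of 1000 means that the groups account for all of the variation of the variable, as in the second call, where every case equals its group mean.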

57.9  Summary Table of the Amount of Variance Explained by the Typology

Similarly to the description of the resulting typology, a summary table is printed at the end of the initial typology construction and at the end of each step of ascending classification.

a)  Variables explaining 80 % of the variance. List of the most discriminating variables, i.e. those variables which, taken together, are responsible for at least 80 % of the explained variance, together with the amount of variance explained by each of them individually (see 8.b above).

b)  Mean variance explained by active variables.



$$\overline{EV}_{\mathrm{active}} = \frac{\sum_{v=1}^{a} \alpha_v \, EV(x_v)}{\sum_{v=1}^{a} \alpha_v}$$

c)  Mean variance explained by all variables.



$$\overline{EV}_{\mathrm{all}} = \frac{\sum_{v=1}^{a+p} \alpha_v \, EV(x_v)}{\sum_{v=1}^{a+p} \alpha_v}$$

d)  Mean variance explained by the variables which explain 80 % of the total variance. After each regrouping, the program looks for variables which explain at least 80 % of the total variance (see 9.a above) and prints mean variance explained by those variables before and after regrouping, and the percentage of such variables.
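The summary quantities of this section can be sketched in Python. `mean_explained_variance` follows the weighted means of 9.b and 9.c. `top_variables` is only one possible reading of 9.a: it assumes that the variables are taken in descending order of explained variance until their cumulative sum reaches 80 % of the sum over all variables. Both function names and the sample values are hypothetical.

```python
def mean_explained_variance(ev, alpha):
    """Mean of the explained variances, weighted by the variable weights."""
    return sum(a * e for a, e in zip(alpha, ev)) / sum(alpha)


def top_variables(ev, share=0.8):
    """Indices of the most discriminating variables whose explained
    variances add up to at least `share` of the sum over all variables."""
    order = sorted(range(len(ev)), key=lambda v: ev[v], reverse=True)
    target = share * sum(ev)
    chosen, running = [], 0.0
    for v in order:
        if running >= target:
            break
        chosen.append(v)
        running += ev[v]
    return chosen


ev = [800.0, 200.0, 500.0, 100.0]
print(mean_explained_variance(ev, [1.0, 0.5, 0.5, 0.0]))  # 575.0
print(top_variables(ev))                                  # [0, 2]
```

In the example the first and third variables together account for 1300 of 1600, which is the first cumulative sum to reach the 80 % threshold of 1280.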

57.10  Hierarchical Ascending Classification

After creation of the initial typology, the program performs a sequence of regroupings, reducing the number of groups one by one from the initial number down to the number specified by the user. At each regrouping, the program selects the two closest groups, i.e. the two groups with the smallest distance or displacement (see section 4 above), and calculates the profile of the new group.

a)  Group i + j. Profile of the new group, printed for up to 15 active variables in descending order of their deviation (see 10.d below). Note that if there are fewer than 15 active variables, or fewer than 15 variables with valid cases in the aggregated groups, the program completes the list using passive variables.

b)  Group i. Profile of the group i, printed for the same variables as above.

c)  Group j. Profile of the group j, printed for the same variables as above.

d)  Dev. Absolute value of the difference between profiles of groups i and j, printed for the same variables as above.


$$Dev(x_v) = |\bar{x}_{iv} - \bar{x}_{jv}|$$

e)  Weighted deviation. Deviation weighted by the variable weight and the variable standard deviation, printed for the same variables as above.


$$WDev(x_v) = Dev(x_v) \, \frac{\alpha_v}{s_v}$$
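One regrouping step can be sketched in Python. This is a minimal illustration under stated assumptions: the closest pair is chosen by displacement, the profile of the new group is taken to be the size-weighted mean of the two merged profiles, and the distance is an unweighted city block distance. The function names and the sample data are hypothetical.

```python
from itertools import combinations


def closest_pair(profiles, sizes, dist):
    """Pair of group indices (i, j) with the smallest displacement."""
    def displacement(i, j):
        factor = 2.0 * sizes[i] * sizes[j] / (sizes[i] + sizes[j])
        return factor * dist(profiles[i], profiles[j])
    return min(combinations(range(len(profiles)), 2),
               key=lambda pair: displacement(*pair))


def merge_report(p_i, p_j, n_i, n_j, alpha, sd):
    """Profile of group i + j, deviation and weighted deviation per variable."""
    merged = [(n_i * x + n_j * y) / (n_i + n_j) for x, y in zip(p_i, p_j)]
    dev = [abs(x - y) for x, y in zip(p_i, p_j)]
    wdev = [d * a / s for d, a, s in zip(dev, alpha, sd)]
    return merged, dev, wdev


def dist(p, q):
    # Unweighted city block distance, used here only for illustration.
    return sum(abs(x - y) for x, y in zip(p, q)) / len(p)


profiles = [[0.0, 0.0], [1.0, 2.0], [10.0, 10.0]]
sizes = [2, 2, 4]
i, j = closest_pair(profiles, sizes, dist)
merged, dev, wdev = merge_report(profiles[i], profiles[j], sizes[i], sizes[j],
                                 alpha=[1.0, 1.0], sd=[2.0, 4.0])
print((i, j))  # (0, 1)
print(merged)  # [0.5, 1.0]
print(dev)     # [1.0, 2.0]
print(wdev)    # [0.5, 0.5]
```

Repeating this step, each time replacing groups i and j by the merged group with size `sizes[i] + sizes[j]`, reduces the number of groups by one per regrouping.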

57.11  References

Aimetti, J.P., SYSTIT: Programme de classification automatique, GSIE-CFRO, Paris, 1978.

Diday, E., Optimisation en classification automatique, RAIRO, Vol. 3, 1972.

Ball, G.H. and Hall, D.J., A clustering technique for summarizing multivariate data, Behavioral Science, Vol. 12, No 2, 1967.