7.1.5 Divisive Analysis (Diana)

DIANA is a hierarchical clustering technique, but its main difference with the agglomerative method (Agnes) is that it constructs the hierarchy in the inverse order.

Initially (Step 0), there is one large cluster consisting of all n objects. At each subsequent step, the largest available cluster is split into two clusters until finally all clusters, comprise of single objects. Thus, the hierarchy is built in n-1 steps.

In the first step of an agglomerative method, all possible fusions of two objects are considered leading to combinations. In the divisive method based on the same principle, there are possibilities to split the data into two clusters. This number is considerably larger than that in the case of an agglomerative method.

To avoid considering all possibilities, the algorithm proceeds as follows.

  1. Find the object, which has the highest average dissimilarity to all other objects. This object initiates a new cluster– a sort of a splinter group.
  2. For each object i outside the splinter group compute
  3. Di = [average d(i,j) j Rsplinter group ] - [average d(i,j) j Rsplinter group]
  4. Find an object h for which the difference Dh is the largest. If Dh is positive, then h is, on the average close to the splinter group.
  5. Repeat Steps 2 and 3 until all differences Dh are negative. The data set is then split into two clusters.
  6. Select the cluster with the largest diameter. The diameter of a cluster is the largest dissimilarity between any two of its objects. Then divide this cluster, following steps 1-4.
  7. Repeat Step 5 until all clusters contain only a single object.

Divisive Coefficient (DC)

For each object i, let d ( i ) denote the diameter of the last cluster to which it belongs (before being split off as a single object), divided by the diameter of the whole data set.

The divisive coefficient (DC), given by

indicates the strength of the clustering structure found by the algorithm.

Graphical display

The hierarchy obtained from Diana can be represented graphically by the dissimilarity banner.

Dissimilarity banner consists of lines with stars and stripes, which repeat the identifiers of the objects. The banner is read from left to right, but the fixed scales above and below the banner range from 1.00 (corresponding to the diameter of the entire data set) to 0.00 (corresponding to the diameter of the singletons)

Each line with stars ends at the diameter at which the cluster is split. The actual diameter of the data set (corresponding to 1.00 in the banner) is displayed just below the banner.

The divisive coefficient defined above can be seen as the average width of the divisive banner.

where li is the length of the line containing the identifier of object i. Thus it can be seen as the counterpart of the agglomerative coefficient (AC):

DC = 0 No cluster structure

DC = 1 Clear cluster structure