#### 7.1.2 Clustering Large Applications (Clara__)__

Compared to Pam, Clara can
deal with much larger data sets. It also tries to find *k* representative
objects that are centrally located in the cluster, they define. Internally,
this is achieved by considering data subsets of fixed size, so that the overall
computation time and storage requirements become linear in the total number of
objects rather than quadratic.

In Pam, the collection of all
pair-wise distances between objects is stored in the central memory, thereby consuming
O(*n*^{2}) memory space. Therefore Pam cannot be used for large values of *n*.
To avoid this problem Clara does
not compute the entire dissimilarity matrix at a time. Therefore, Clara accepts only the actual
measurements (i.e,. *n **´ p* data matrix).

The clustering of objects with Clara
is carried out in two steps. First, a sample is drawn from the set of objects
and divided into *k* clusters, using the same algorithm as in Pam. The
algorithm consists of two parts - BUlLD and SWAP. In
BUILD, successive medoids are selected with the
purpose of obtaining the smallest possible average distance between the objects
of the sample and their most similar representative objects. In SWAP, an attempt
is made to decrease the average distance by replacing representative objects.
Then each object not belonging to the sample is assigned to the nearest medoid. This yields a clustering of all objects.

The quality of clustering is defined as the average distance between each
object and its medoid. This procedure is repeated
five times and clustering with the lowest average distance is retained for
further analysis.

The final average distance, the average and the maximum distance to each medoid are calculated in the same way as in Pam for all
objects. Also, the ratio of the maximum distance of the medoid
to the minimum distance of the medoid to another medoid.
This ratio gives information on the tightness of a cluster. A small value (0.20) indicates
a very tight cluster, while a value >
1 indicates a weak cluster. Finally, silhouettes of clusters are plotted, but
only for the best subset, since the entire silhouette plot will be too large to
print.