#### 7.1.1 Partitioning Around Medoids (Pam)

Compared to the well-known k-means algorithm, Pam has the following features:

• It operates on the dissimilarity matrix of the given data set; when it is presented with an n × p data matrix, the algorithm first computes a dissimilarity matrix.
• It is more robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
• It provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters.

In many clustering problems, one is interested in characterizing the clusters by means of typical objects, which represent the various structural features of the objects under investigation. The algorithm Pam first computes k representative objects, called medoids. A medoid can be defined as the object of a cluster whose average dissimilarity to all the objects in the cluster is minimal. In the classification literature, such representative objects are called centrotypes. After finding the set of medoids, each object of the data set is assigned to the nearest medoid. That is, object i is put into cluster v_i when medoid m_vi is nearer than any other medoid m_w:

d(i, m_vi) ≤ d(i, m_w)  for all w = 1, ..., k.

The k representative objects should minimize the objective function, which is the sum of the dissimilarities of all objects to their nearest medoid:

Objective function = Σ_i d(i, m_vi)

The algorithm proceeds in two steps:

• BUILD-step: This step sequentially selects k "centrally located" objects, to be used as initial medoids
• SWAP-step: If the objective function can be reduced by interchanging (swapping) a selected object with an unselected object, then the swap is carried out. This continues until the objective function can no longer be decreased.
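The two steps above can be sketched in a few lines of NumPy. This is an illustrative reconstruction (function and variable names are my own), not the actual Clusfind implementation:

```python
import numpy as np

def pam(D, k):
    """Illustrative sketch of the Pam BUILD and SWAP steps.
    D is an n x n symmetric dissimilarity matrix with zero diagonal."""
    n = D.shape[0]

    # BUILD: start with the most centrally located object, then greedily
    # add the object that most reduces the total dissimilarity of all
    # objects to their nearest medoid.
    medoids = [int(np.argmin(D.sum(axis=1)))]
    while len(medoids) < k:
        nearest = D[medoids].min(axis=0)
        costs = [np.minimum(nearest, D[c]).sum() if c not in medoids else np.inf
                 for c in range(n)]
        medoids.append(int(np.argmin(costs)))

    # SWAP: exchange a medoid with a non-medoid whenever that lowers the
    # objective; repeat until no improving swap remains.
    def objective(meds):
        return D[meds].min(axis=0).sum()

    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for c in range(n):
                if c in medoids:
                    continue
                trial = [c if x == m else x for x in medoids]
                if objective(trial) < objective(medoids):
                    medoids = trial
                    improved = True

    labels = D[medoids].argmin(axis=0)  # index of each object's nearest medoid
    return medoids, labels
```

Because every candidate swap is evaluated against the full objective, the sketch is O(k(n − k)) dissimilarity sums per pass, which matches the brute-force character of the original SWAP step.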

Isolated clusters

The program considers two types of isolated clusters: L-clusters and L*-clusters. Cluster C is an L-cluster if, for each object i belonging to C,

max { d(i, j) : j ∈ C } < min { d(i, h) : h ∉ C }

Cluster C is an L*-cluster if

max { d(i, j) : i, j ∈ C } < min { d(l, h) : l ∈ C, h ∉ C }

It is clear that L*-clusters are also L-clusters. It should be noted that the property of being isolated depends on the internal structure of a cluster as well as on its position relative to the other clusters.

Diameter of a cluster

The diameter of cluster C is defined as the largest dissimilarity between two objects belonging to C:

Diameter(C) = max { d(i, j) : i, j ∈ C }

Separation of a cluster

The separation of cluster C is defined as the smallest dissimilarity between two objects, one of which belongs to C while the other does not:

Separation(C) = min { d(l, h) : l ∈ C, h ∉ C }
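Diameter and separation can be read directly off the dissimilarity matrix, and comparing the definitions shows that C is an L*-cluster exactly when its diameter is smaller than its separation. A small sketch (helper names are my own):

```python
import numpy as np

def diameter(D, C):
    """Largest dissimilarity between two objects of cluster C."""
    idx = sorted(C)
    return float(D[np.ix_(idx, idx)].max())

def separation(D, C):
    """Smallest dissimilarity between an object of C and one outside C."""
    inside = sorted(C)
    outside = [i for i in range(D.shape[0]) if i not in C]
    return float(D[np.ix_(inside, outside)].min())

def is_L_star(D, C):
    """C is an L*-cluster exactly when its diameter is below its separation."""
    return diameter(D, C) < separation(D, C)
```

For example, with objects at positions 0, 1, 2, 10, 11, 12 on a line and C = {0, 1, 2}, the diameter is 2 and the separation is 8, so C is an L*-cluster.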

Average distance to a medoid

If object j is the medoid of cluster C, the average distance of the objects of C to j is calculated as follows:

Average distance(j) = (1/N) Σ d(i, j),  i ∈ C, i ≠ j

where N is the number of objects of C other than j.

Maximum distance to medoid

If object j is the medoid of cluster C, the maximum distance of the objects of C to j is calculated as follows:

Maximum distance(j) = max { d(i, j) : i ∈ C }
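Both medoid summaries are one-liners over a row of the dissimilarity matrix; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def medoid_distance_stats(D, C, j):
    """Average and maximum dissimilarity of the objects of cluster C to
    its medoid j; j itself is excluded, matching the definitions above."""
    others = [i for i in C if i != j]
    d = D[j, others]
    return float(d.mean()), float(d.max())
```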

Graphical display

Two of the most difficult tasks in cluster analysis are deciding on the appropriate number of clusters and distinguishing a bad cluster from a good one. The module Clusfind computes a set of values called silhouettes that provide key information about both tasks. We first explain how these values are calculated and then show how they are used.

Each cluster is represented by one silhouette, showing which objects lie within the cluster and which objects merely hold an intermediate position. The entire clustering is displayed by plotting all silhouettes into a single diagram, from which the quality of the clusters can be compared.

Silhouettes are constructed in the following way. Consider any object i of the data set, let A denote the cluster to which it has been assigned, and calculate

a(i) = average dissimilarity of i to all other objects of A

Now consider any cluster C different from A and define

d(i, C) = average dissimilarity of i to all objects of C

Compute d(i, C) for all clusters C ≠ A, and then select the smallest of these:

b(i) = min { d(i, C) : C ≠ A }

Let B denote the cluster that attains this minimum, i.e., d(i, B) = b(i); B is called the neighbour of object i. The value s(i) can now be defined as:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

It can easily be seen that s(i) lies between −1 and +1. The value of s(i) may be interpreted as follows:

• s(i) ≈ 1: the ‘within’ dissimilarity a(i) is much smaller than the smallest ‘between’ dissimilarity b(i). In other words, object i has been assigned to an appropriate cluster; the second-best cluster B is not nearly as close as the actual cluster A.
• s(i) ≈ 0: a(i) and b(i) are approximately equal, so it is not clear whether i should be assigned to A or B. It can be considered an ‘intermediate’ case.
• s(i) ≈ −1: object i is badly classified. Its dissimilarity to the other objects of its own cluster is much greater than its dissimilarity to the objects of the neighbouring cluster B, so one may wonder why it was not assigned to that cluster instead.

Hence, the silhouette value summarizes how appropriate each object’s cluster is.
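The construction of a(i), b(i), and s(i) translates directly into code. The following sketch (names are my own) computes silhouette values from a dissimilarity matrix and a label vector; it assigns s(i) = 0 to singleton clusters, a common convention that is an assumption here, not stated above:

```python
import numpy as np

def silhouettes(D, labels):
    """Silhouette value s(i) for every object, given a dissimilarity
    matrix D and a cluster-label vector. Singleton clusters get
    s(i) = 0 (assumed convention)."""
    labels = np.asarray(labels)
    n = D.shape[0]
    s = np.zeros(n)
    for i in range(n):
        same = labels == labels[i]
        same[i] = False
        if not same.any():                      # singleton cluster
            continue
        a = D[i, same].mean()                   # a(i): within dissimilarity
        b = min(D[i, labels == c].mean()        # b(i): nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
```

Sorting these values per cluster, in decreasing order, gives exactly the horizontal bars of the silhouette plot described next.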

The silhouette of a cluster is a plot of the s(i) values, ranked in decreasing order, of all the objects i in the cluster. Each object is drawn as a horizontal line whose length is proportional to s(i). The silhouette shows which objects lie well within the cluster and which ones are merely somewhere in between clusters. A wide silhouette indicates large s(i) values and hence a pronounced cluster. The height of a silhouette is simply equal to the number of objects in the cluster.

The entire silhouette plot shows the silhouettes of all clusters next to each other, so that the quality of the clusters can be compared. The overall average silhouette width of the plot is the average of s(i) over all objects in the data set.

The silhouette plot is quite useful for deciding on the number of clusters. One can run Pam several times, each time with a different value of k, and then compare the resulting silhouette plots. The average silhouette width can be used to select the ‘best’ number of clusters, by choosing the k that yields the highest average silhouette width; this maximal value is called the silhouette coefficient SC.
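As a toy illustration of this selection rule, suppose Pam has been run for k = 2, ..., 6 and the resulting average silhouette widths are the made-up numbers below; choosing k is then just an argmax:

```python
# Hypothetical average silhouette widths obtained by running Pam with
# k = 2..6; the numbers are invented purely for illustration.
avg_width = {2: 0.71, 3: 0.55, 4: 0.48, 5: 0.40, 6: 0.33}

best_k = max(avg_width, key=avg_width.get)  # k with the highest width
SC = avg_width[best_k]                      # the silhouette coefficient
```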

SC = max_k s̄(k),  where s̄(k) is the overall average silhouette width for k clusters

and the maximum is taken over all k for which the silhouettes can be constructed, which means k = 2, 3, ..., n − 1. This coefficient is a dimensionless measure of the amount of clustering structure that has been discovered by the clustering algorithm. The SC can be interpreted as follows:

| Range of SC | Interpretation |
| --- | --- |
| 0.71 – 1.00 | A strong structure has been found |
| 0.51 – 0.70 | A reasonable structure has been found |
| 0.26 – 0.50 | The structure is weak and could be artificial; try additional methods of data analysis |
| ≤ 0.25 | No substantial structure has been found |