Compared to the well - known kmeans algorithm, .Pam has the following features:
In many clustering problems, one is interested in the characterization of the clusters by means of typical objects, which represent the various structural features of objects under investigation. The algorithm Pam first computes k representative objects, called medoids. A medoid can be defined as that object of a cluster, whose average dissimilarity to all the objects in the cluster is minimal. In the classification literature, such representative objects are called centrotypes. After finding the set of medoids, each object of the data set is assigned to the nearest medoid. That is, object i is put into cluster vi, when medoid mvi is nearer than any other medoid mw.
d(i, mvi ) £ d (i,m w ) for all w = 1,..., k.
The k representative objects should minimize the objective function, which is the sum of the dissimilarities of all objects to their nearest medoid:
Objective function = S d(i, mvI)
The algorithm proceeds in two steps:
Isolated clusters
The program considers two types of isolated clusters. L-clusters and L* clusters. Cluster C is an L-cluster, if for each object i belonging to C:
max dij < min dih, j
C h
C
Cluster C is an L* - cluster, if
max dij < min dlh, ij
C, l
C,h
C
It is clear that L* - clusters are also L-clusters. It should be noted that the property of being isolated depends on the internal structure of a cluster as well as on its position relative to other clusters.
Diameter of a cluster
The diameter of Cluster C is defined as the largest dissimilarity between objects belonging to the Cluster C.
DiameterC =
max dij, i,j
C
Separation of a cluster
The separation of Cluster C is defined as the smallest dissimilarity between two objects; one of which belong to Cluster C and the other does not.
SeparationC =
min dlh, l
C,
h
C
Average distance to a medoid
If j is the medoid of Cluster C, the average distance of all objects of C to j is calculated as follows:
Average distancej
= ![]()
where N is the number of object minus other than j.
Maximum distance to medoid
If object j is the medoid of cluster C, the maximum distance of objects of C to j is calculated as follows:
Maximum distancej
= max dij , i
C
Graphical display
Two of the most difficult tasks in cluster analysis are: How to decide the appropriate number of clusters and how to distinguish a bad cluster from a good one. The module Clusfind computes a set of values called silhouettes that provide key information about these tasks. First, we will explain how these are calculated and then we will show how they are used
Each cluster is represented by one silhouette, showing which objects lie within the cluster and which objects merely hold an intermediate position. The entire clustering is displayed by plotting all silhouettes into a single diagram, from which the quality of the clusters can be compared.
Silhouettes are constructed in the following way. Consider any object i of the data set, and let A denote the cluster to which it is assigned, and then calculate
a ( i ) = average dissimilarity of i to all other objects of A
Now consider any cluster C different from A and define
d ( i, C ) = average dissimilarity of i to all objects of C
Compute d ( i, C ) for
all clusters C
A, and then select the smallest
of those
b = min d ( i,C ) , C
A
Let B denote the cluster which attains the minimum i.e., d ( i,B ) = b ( i ) is called the neighbour
of object i. The value
s ( i ) can now be defined as:
s(i) = ![]()
It can be easily seen that s ( i ) lies between -1 and +1. The value of s ( i ) may be interpreted as follows:
|
s ( i ) = 1 |
‘within’ dissimilarity a ( i
) is much smaller than the smallest ‘between’ dissimilarity.
In other words, object i has been assigned
to an appropriate cluster. The second best cluster B is not nearly as
close as the actual cluster A. |
|
s ( i ) = 0 |
a(i) and b(i) are
approximately equal. Hence, it is not clear whether i
should be assigned to A or B. It can be considered as an ‘intermediate’
case. |
|
s ( i ) = - 1 |
Object i is badly classified.
When s is close to negative one, the object is
poorly classified. Its dissimilarity with other objects in its cluster is much
greater than its dissimilarity with objects in the nearest cluster. Why
isn’t it in the neighboring cluster? |
Hence, the silhouette value summarizes how appropriate each object’s cluster is.
The silhouette of a cluster is a plot of the s ( i ) ranked in decreasing order of all the objects i. The plot is a horizontal line, whose length is proportioned to s ( i ). The silhouette shows which objects lie well within the cluster and which ones are merely somewhere in between clusters. A wide silhouette indicates large s ( i ) values and hence a pronounced cluster. The height of a cluster is simply equal to the number of objects in the cluster.
The entire silhouette plot shows the silhouettes of all clusters next to each other, so that the quality of clusters can be compared. The overall average silhouette width of the silhouette plot is the average of s( i ) over all object in the data set.
The silhouette plot is quite useful for deciding the number of clusters. One can run Pam several times, each times for different values of k and then compare the resulting silhouette plots. The average silhouette width can be used to select the ‘best’ number of clusters, by choosing that k which yields the highest silhouette width.
SC = max
(k)
where the maximum is taken over all k, for which the silhouette can be constructed, which means k = 2,3,-.., n. This coefficient is a dimensionless measure of the extent of clustering structuring that has been discovered by the clustering algorithm. The SC can be interpreted as follows:
|
Range of SC |
Interpretation |
|
0.71-1.0 |
A strong structure has been found |
|
0.51-0.70 |
A reasonable structure has been found |
|
0.26-0.50 |
The structure is weak and could be artificial. Try additional methods of data analysis. |
|
£ 0.25 |
No substantial structure has been found |