
English or languish - Probing the ramifications
of Hong Kong's language policy

CLUSTER ANALYSIS
Identification

After clusters have been formed and their number has been determined, each cluster must be appropriately identified. As the boundaries of a cluster are often unclear and cluster membership is usually diverse, identifying clusters can be difficult. The best way to limit problems of identification is to choose carefully the variables used to generate the clusters. Even so, the problem never goes away entirely, and for this reason several general guidelines for naming (profiling) clusters are recommended.

Cluster centroids - Although it is a mathematically generated description of a cluster's center, a centroid can be used as a representative profile of the cluster's members. No two observations (members) of a cluster are ever likely to be identical. Moreover, depending on the clustering procedure employed, it is unlikely that any one member of a cluster occupies its center.

This profiling technique is typically used with clustering techniques that generate clusters by minimizing within-group variance. See hierarchical routines.
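As a minimal sketch of the idea, a centroid can be computed as the variable-by-variable mean of a cluster's members. The function name `centroid` and the sample data are illustrative assumptions, not part of the original analysis.

```python
from statistics import fmean


def centroid(members):
    """Return the variable-wise mean of a cluster's members.

    `members` is a list of equal-length numeric tuples, one per observation.
    """
    if not members:
        raise ValueError("a cluster needs at least one member")
    # zip(*members) regroups the values by variable (column).
    return tuple(fmean(column) for column in zip(*members))


# Hypothetical cluster: three observations measured on two variables.
cluster = [(2.0, 10.0), (4.0, 14.0), (6.0, 12.0)]
profile = centroid(cluster)
```

In this made-up cluster the centroid does not coincide with any member, which is the situation described above.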

Core observations - As the centroid of a cluster is a mathematical construct, the observation that most resembles the centroid is often selected to represent the cluster's profile. This profiling method is especially useful for clustering techniques that use individual observations as the starting point for generating clusters. See threshold routines.
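One way to pick a core observation, sketched here under the assumption that "most resembles" means the smallest Euclidean distance to the centroid (the text does not name a distance measure), is shown below. The function name `core_observation` and the data are hypothetical.

```python
from math import dist
from statistics import fmean


def core_observation(members):
    """Return (index, observation) of the member nearest the centroid."""
    # Centroid: mean of each variable across the cluster's members.
    center = tuple(fmean(column) for column in zip(*members))
    # Assumed similarity measure: Euclidean distance to the centroid.
    index = min(range(len(members)), key=lambda i: dist(members[i], center))
    return index, members[index]


# Hypothetical cluster: three observations measured on two variables.
cluster = [(1.0, 1.0), (2.0, 2.0), (6.0, 6.0)]
core_index, core = core_observation(cluster)
```

Unlike the centroid, the result is always an actual member of the cluster, so it can be described in terms of a real observation.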

Variable analysis - The single most effective way to interpret a cluster is to understand the degree to which each variable of the characteristic set is represented in the cluster. Although centroids and core observations are useful, they do not provide a complete picture. In the following chart the value of variable X1 for each observation of the sample population is plotted against the mean of X1 and the cluster in which the observation appears in the final solution. This technique is less useful when the clusters overlap.

[Figure: Observation scattergram for variable X1 across clusters. Vertical axis: value of variable X1 (scale 0 to 17); horizontal axis: Cluster 1 to Cluster 4; each plotted point is labelled with its observation number. Mean = 10.3; standard deviation = 4.1.]

If there were five variables, there would be five such graphs, each with its own mean and its own distribution of observations across the four clusters. From the five graphs a summary chart would be produced.

Summary Chart

Variable Name   Cluster 1   Cluster 2   Cluster 3   Cluster 4
X1              mid         mid/low     high/mid    mid
X2              low         mid/low     high/mid    mid
X3              low         high        low         mid
X4              low         high        low         mid/low
X5              low/mid     high        mid/low     mid/low

The nonmetric values high, high/mid, mid, mid/low, and low summarize the relative position of the observations of a particular cluster according to where they fall relative to the attribute's mean value for the entire sample population. These values are assigned to each variable and cluster combination. From this table one can see at a glance the relative importance of each attribute in each cluster and how each attribute is distributed across all clusters. Under normal circumstances a more descriptive name for each variable would also appear in an effort to assist the researcher in understanding the cluster's overall make-up. This exercise is very similar to that used when naming the individual factors of a factor-analytical solution.
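A summary chart of this kind could be produced along the following lines. This is only a sketch: the text gives no numeric rule for assigning the labels, so the cut-offs in `CUTS` (expressed in standard deviations from the sample mean), the use of the cluster mean as the cluster's position, the population standard deviation, and all names and data are assumptions made for illustration.

```python
from statistics import fmean, pstdev

# Hypothetical cut-offs in standard-deviation units; the source gives no
# numeric rule, so these would need adjusting to suit the data.
CUTS = [(0.75, "high"), (0.25, "high/mid"), (-0.25, "mid"), (-0.75, "mid/low")]


def label(cluster_values, all_values):
    """Label a cluster's mean on one variable relative to the sample mean."""
    mean, sd = fmean(all_values), pstdev(all_values)
    if sd == 0:
        return "mid"  # no spread at all: every cluster sits at the mean
    z = (fmean(cluster_values) - mean) / sd
    for cut, name in CUTS:
        if z > cut:
            return name
    return "low"


def summary_chart(data, assignment):
    """Build {variable: {cluster: label}}.

    data: {variable name: list of values, one per observation}
    assignment: cluster id of each observation, in the same order
    """
    clusters = sorted(set(assignment))
    chart = {}
    for variable, values in data.items():
        chart[variable] = {
            c: label([v for v, a in zip(values, assignment) if a == c], values)
            for c in clusters
        }
    return chart


# Made-up data: six observations on one variable, assigned to three clusters.
data = {"X1": [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]}
assignment = [1, 1, 2, 2, 3, 3]
chart = summary_chart(data, assignment)
```

With these made-up values, cluster 1 is labelled low, cluster 2 mid, and cluster 3 high on X1. A different choice of cut-offs would shift the labels, which is why the thresholds should be stated alongside any chart built this way.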
