
English or languish - Probing the ramifications
of Hong Kong's language policy

CLUSTER ANALYSIS
Research Design Issues
cluster analysis (key terms)

Data Transformation
cluster analysis (key features | key terms)

Experimental studies that have tested the ability of clustering techniques and proximity measures to recover known clusters have shown that the choice of proximity measure is often not important. These studies have demonstrated a similar result for standardization procedures. To the extent that standardization and certain proximity measures reduce the influence of extreme outliers, they can improve the solutions of clustering techniques that are sensitive to outliers. In general, one should select proximity measures that are appropriate to the input data, but not be overly concerned about which measure is employed when more than one is appropriate for the same data.
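A minimal sketch of the point about standardization and outliers, using hypothetical data in NumPy: on the raw data the large-scale second variable dominates every Euclidean proximity, while z-scoring makes each variable contribute comparably.

```python
import numpy as np

# Toy data: 5 observations, 3 characteristic variables (hypothetical values).
X = np.array([[1.0, 200.0, 3.0],
              [2.0, 220.0, 2.5],
              [1.5, 210.0, 3.2],
              [9.0, 900.0, 8.0],   # extreme outlier
              [2.2, 215.0, 2.8]])

# Z-score standardization reduces the leverage of the large-scale
# second variable (and dampens the outlier) on Euclidean proximities.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

def euclidean_matrix(A):
    """Pairwise Euclidean proximity matrix."""
    d = A[:, None, :] - A[None, :, :]
    return np.sqrt((d ** 2).sum(axis=2))

D_raw = euclidean_matrix(X)   # dominated by the second variable
D_std = euclidean_matrix(Z)   # each variable contributes comparably
```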

Careful selection of variables is likely the best control for interdependency among the characteristic variables. In general, variables that are to be accorded equal importance should be independent of one another. The Mahalanobis D2 proximity measure is the single best proximity measure for removing partial correlations among correlated input variables. Alternatively, principal component analysis can be applied to the characteristic variable set before the clustering routine is started; the resulting factor scores are then employed as the input data for the computation of proximity measures other than the Mahalanobis D2 statistic. Because the input variables will no longer correspond precisely to those of the original variable set, this latter technique may create problems with the identification of individual clusters after the clustering routine has been completed.
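Both options above can be sketched in a few lines of NumPy; the data here are hypothetical, and the PCA step is done via SVD rather than a statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two strongly correlated characteristic variables (hypothetical data).
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=100)

# Option 1: Mahalanobis D^2 between two observations. The inverse
# covariance matrix discounts the shared (correlated) variance.
S_inv = np.linalg.inv(np.cov(X, rowvar=False))
def mahalanobis_sq(a, b):
    d = a - b
    return float(d @ S_inv @ d)

# Option 2: PCA on the variable set, then use the (uncorrelated)
# component scores as input for ordinary proximity measures.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T   # principal component scores, mutually uncorrelated
# Note: the score axes no longer match the original variables, which
# can complicate interpretation of the final clusters.
```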

When the correlations among variables are few in number, the exaggerated importance of highly correlated variables can be removed through appropriate weighting.

Solution Retrieval
cluster analysis (key features)

Experimental studies that have tested the ability of different clustering techniques and proximity measures to generate known clusters have shown that the choice of clustering routine plays a crucial role in the determination of the final solution.

Although there is a larger number of hierarchical methods available, research has demonstrated that iterative partitioning (nonhierarchical) methods can produce superior solutions. For these superior solutions to be achieved, however, one must already be well informed about the true nature of the clusters. This apparent paradox is overcome through two-stage clustering. Well suited for the first stage are the average linkage and Ward's minimum variance routines, both of which have been shown to be generally superior to other hierarchical techniques. From these techniques one can determine both a candidate number of clusters and a starting point for the iterative routine of the second stage. In addition, one can examine the order in which observations are clustered and the distances between individual observations; this information is useful in detecting distortive outliers and deciding on their eventual elimination.
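The two-stage procedure can be sketched with SciPy on hypothetical data: Ward's routine supplies the candidate number of clusters and provisional centres, which then seed an iterative k-means pass.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
# Hypothetical data: three well-separated groups in two variables.
X = np.vstack([rng.normal(c, 0.3, size=(30, 2))
               for c in ([0, 0], [4, 0], [2, 4])])

# Stage 1: Ward's minimum-variance hierarchical routine suggests a
# candidate number of clusters and provisional cluster centres.
Z = linkage(X, method='ward')
labels1 = fcluster(Z, t=3, criterion='maxclust')
seeds = np.vstack([X[labels1 == k].mean(axis=0) for k in (1, 2, 3)])

# Stage 2: iterative partitioning (k-means) started from the
# hierarchical centres refines the solution.
centres, labels2 = kmeans2(X, seeds, minit='matrix')
```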

Caution: Not all software vendors provide both hierarchical and non-hierarchical (iterative) clustering techniques in the same package.

Pivot and Defining Variables
cluster analysis (key terms)

If two-stage clustering is not possible, one can take advantage of a weakness common to most multivariate-analytical techniques - variable interdependence. A careful examination of the characteristic correlation matrix is likely to reveal variables that correlate highly with some variables but only weakly with others. These are called pivot variables. Variables that correlate highly with some variables but only weakly with other variables that in turn correlate highly with still other variables are called defining variables, because their shared and non-shared variance forms an important basis for cluster formation. By identifying pivot and defining variables in the correlation matrix, one can form a good general idea of the number of clusters that are likely to result before an iterative routine is run. Pivot and defining variable identification is thus an important step in determining the initial number of clusters for clustering techniques that require a fixed number.
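One way to operationalize this inspection is to scan the correlation matrix for blocks of highly intercorrelated variables. The matrix, the 0.7 cut-off, and the helper below are purely illustrative assumptions, not fixed conventions.

```python
import numpy as np

# Hypothetical correlation matrix for five characteristic variables.
R = np.array([
    [1.00, 0.85, 0.10, 0.05, 0.12],
    [0.85, 1.00, 0.08, 0.10, 0.15],
    [0.10, 0.08, 1.00, 0.80, 0.78],
    [0.05, 0.10, 0.80, 1.00, 0.75],
    [0.12, 0.15, 0.78, 0.75, 1.00]])

def high_correlation_partners(R, high=0.7):
    """For each variable, the set of others it correlates highly with."""
    n = R.shape[0]
    return [{j for j in range(n) if j != i and abs(R[i, j]) >= high}
            for i in range(n)]

blocks = high_correlation_partners(R)
# Variables 0-1 form one high-correlation block and variables 2-4
# another: a first hint that roughly two groupings underlie the data.
```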

Variable Selection
cluster analysis (key features)

Empirical findings have demonstrated that variable selection is crucial in the application of clustering procedures, as even one or two irrelevant variables can greatly distort the final solution. Thus, before cluster analysis is performed, one must carefully consider what one hopes to find. Unlike factor analysis, which is primarily a data reduction and summarization technique, classification analysis is more appropriately described as an analytical test or summary procedure. One cannot expect one's washing machine to sort one's laundry: much of the sorting occurs before the laundering takes place. In other words, one should choose the characteristic variable set according to what one expects to find.

As multivariate normality is a standard assumption of many cluster-analytical procedures, one should test the members of the attribute variable set for signs of skewness and kurtosis before including them in the characteristic variable set. Individual variables that behave non-normally are likely to distort the final solution.
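A minimal screening pass, assuming SciPy and hypothetical candidate variables: each variable is tested for skewness and kurtosis before being admitted to the characteristic variable set.

```python
import numpy as np
from scipy.stats import skewtest, kurtosistest

rng = np.random.default_rng(2)
candidates = {
    'symmetric': rng.normal(0, 1, 500),
    'skewed':    rng.exponential(1.0, 500),   # strongly right-skewed
}

# Screen each candidate variable; drop (or transform) clear offenders
# before they enter the characteristic variable set.
for name, x in candidates.items():
    p_skew = skewtest(x).pvalue
    p_kurt = kurtosistest(x).pvalue
    keep = p_skew > 0.05 and p_kurt > 0.05
    print(f"{name}: skew p={p_skew:.3f}, kurtosis p={p_kurt:.3f}, keep={keep}")
```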

In general, one should avoid mixing variables with different metric properties. Moreover, one should be judicious about standardizing entire variable sets, as this can entail a significant loss of information.

Statistical Significance and Validation
cluster analysis (key features) | statistical modelling (useful statistics)

Statistical Significance (C-Statistic) - Even after careful analysis of the data set and completion of the clustering routine, one cannot be sure that one has succeeded in identifying meaningful clusters. Since every clustering technique yields a clustered solution, one must test whether one's solution is statistically meaningful. A somewhat preferred statistic for testing significance is given by the following formula:

C = log(|T| / |W|)

where |T| = determinant of the total variance-covariance matrix
and |W| = determinant of the pooled within-group variance-covariance matrix

As a number of iterative partitioning methods seek to maximize |T| relative to |W|, this statistic is especially useful. Moreover, the statistic errs on the conservative side for those iterative procedures that do not seek to maximize this relationship. See the references to Arnold (1979) and Friedman and Rubin (1967) in Punj and Stewart (1983).
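The statistic is straightforward to compute from the cross-product matrices. Below is a sketch on hypothetical data; a well-separated partition yields a much larger C than an arbitrary split of the same observations.

```python
import numpy as np

def c_statistic(X, labels):
    """C = log(|T| / |W|), where T is the total and W the pooled
    within-group sums-of-squares-and-cross-products matrix. Larger
    values indicate stronger between-group separation."""
    X = np.asarray(X, dtype=float)
    d = X - X.mean(axis=0)
    T = d.T @ d                           # total SSCP matrix
    W = np.zeros_like(T)
    for g in np.unique(labels):
        dg = X[labels == g] - X[labels == g].mean(axis=0)
        W += dg.T @ dg                    # pooled within-group SSCP
    return float(np.log(np.linalg.det(T) / np.linalg.det(W)))

rng = np.random.default_rng(3)
# Two groups separated by 6 units in each of two variables.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])
good = np.repeat([0, 1], 50)   # the true grouping
bad = np.tile([0, 1], 50)      # an arbitrary alternating split
# c_statistic(X, good) substantially exceeds c_statistic(X, bad).
```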

Validation
cluster analysis (key features)

Demonstrating statistical significance, though an important first step, is insufficient to ensure the reliability of the solution. Cross-validation and external validation are two commonly accepted methods of ensuring reliability.

Cross-validation, in which the solution obtained from one subsample is used to classify a holdout subsample, is thought to provide an objective measure of reliability. For further reference see McIntyre and Blashfield (1980) in Punj and Stewart (1983).
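A minimal cross-validation sketch with SciPy on hypothetical data: half A is clustered, its centres classify half B, and the result is compared with an independent clustering of half B (allowing for label switching).

```python
import numpy as np
from scipy.cluster.vq import kmeans2, vq

rng = np.random.default_rng(4)
# Hypothetical data: two well-separated groups in two variables.
X = np.vstack([rng.normal(c, 0.4, (40, 2)) for c in ([0, 0], [5, 5])])
rng.shuffle(X)
A, B = X[:40], X[40:]          # split the sample into two halves

# Cluster half A, then assign half B to A's centres ...
centres_A, _ = kmeans2(A, 2, minit='++', seed=0)
assigned, _ = vq(B, centres_A)

# ... and compare with an independent clustering of half B.
centres_B, labels_B = kmeans2(B, 2, minit='++', seed=0)
agreement = max(np.mean(assigned == labels_B),
                np.mean(assigned != labels_B))   # handles label switching
# High agreement suggests the cluster structure replicates.
```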
