Fig. 1 Determining the dimensionality of a clustering problem. (A and B) Representation of the mRNA clustering problem consisting of >14,000 mRNAs measured across 89 cell lines. Data are from Lu et al. (6). When the mRNAs are clustered, the mRNAs are the objects and each cell line is a feature, resulting in an 89-dimensional problem (A). When attempting to classify normal and tumor cell lines using gene expression, the cell lines are the objects and each mRNA is a feature, resulting in a clustering problem with thousands of dimensions (B). (C) Effect of dimensionality on sparsity. (D) Effect of dimensionality on the fraction of the data covered within a given number of SDs from the mean.
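The coverage effect illustrated in (C) and (D) can be reproduced in a few lines. The sketch below is illustrative only (it is not the authors' analysis code) and uses simulated standard normal data rather than the mRNA data; the threshold of 2 SD is an arbitrary choice for demonstration.

```python
# Minimal sketch (not the authors' code): the fraction of points lying within
# 2 SD of the mean in *every* dimension shrinks rapidly as dimensionality grows,
# illustrated with simulated standard normal data.
import numpy as np

rng = np.random.default_rng(0)
n_points = 10_000

for n_dims in (1, 2, 5, 10, 50, 89):
    X = rng.standard_normal((n_points, n_dims))
    # A point is "covered" only if all of its coordinates fall within 2 SD of the mean.
    covered = np.all(np.abs(X) < 2, axis=1).mean()
    print(f"{n_dims:>3} dimensions: {covered:.1%} of points within 2 SD in every dimension")
```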
Fig. 2 Dimensionality reduction methods and effects. Comparison of PCA and subspace clustering. (A) Three clusters are plotted in two dimensions. The dashed red line indicates the one-dimensional line onto which the original two-dimensional data are projected, as determined by PCA. (B) The clusters are plotted in the new single dimension after reducing the dimensionality from two to one. (C) Three alternative one-dimensional projections (dashed red lines), each separating some clusters better than the projection identified by PCA. (D to F) Comparison of the original clustering results of 89 cell lines in ~14,000-dimensional mRNA data (D) to clustering results after PCA (E) and after subspace clustering (F). Blue bars, gastrointestinal cell lines; yellow bars, nongastrointestinal cell lines.
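A minimal sketch of the PCA-based reduction in (A) and (B), using synthetic two-dimensional data rather than the cell line data; the use of k-means and the choice of three clusters are illustrative assumptions, not the figure's workflow.

```python
# Minimal sketch (synthetic data, not the cell line data from the figure):
# reduce two-dimensional points to one principal component with PCA and
# cluster in the reduced space, as in panels A and B.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

X, true_labels = make_blobs(n_samples=300, centers=3, n_features=2, random_state=0)

# Project onto the single direction of maximum variance (the dashed line in panel A).
X_1d = PCA(n_components=1).fit_transform(X)

# Cluster in the reduced one-dimensional space (panel B).
labels_1d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_1d)

# Compare with clustering in the original two-dimensional space.
labels_2d = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes after PCA:", np.bincount(labels_1d))
print("Cluster sizes without PCA:", np.bincount(labels_2d))
```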
Fig. 3 Effect of transformations and distance metric on clustering results. (A) Demonstration of how transformations affect the relationship of data points in space. A toy data set (reference set, https://github.com/knaegle/clusteringReview) was clustered into four clusters with agglomerative clustering, average linkage, and Euclidean distance. The four reference clusters are shown without transformation (upper panel) and after log2 transformation (lower panel). (B) Transformations and distance metrics change clustering results when compared to the reference clustering result. With no transformation (upper panels), Euclidean and cosine distance do not change cluster identity, but with Manhattan distance, a new cluster A′ is added, and cluster C is merged into cluster B. With the log2 transformation (lower panels), the Euclidean and Manhattan metrics caused cluster C′ to emerge and cluster D to be lost. (C) Dendrogram from clustering 89 cell lines using measurements of 217 microRNAs (miRNAs) (6). Gastrointestinal-derived cell lines (blue bars) predominantly cluster together in the full-dimensional space. Note: The data were log2-transformed as part of the preclustering analysis. (D) Same miRNA data as in (C) but without log2 transformation.
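The metric and transformation comparison in (A) and (B) can be sketched as follows. Random positive data stand in for the reference set from the repository; the average linkage, the three metrics, and the four-cluster cut follow the legend, but this is not the notebook code that generated the figure.

```python
# Minimal sketch (random positive data standing in for the reference set at
# https://github.com/knaegle/clusteringReview): agglomerative clustering with
# average linkage and four clusters, comparing distance metrics with and
# without a log2 transformation.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)
X = rng.lognormal(mean=0.0, sigma=1.0, size=(60, 5))  # strictly positive toy data

def cluster(data, metric, n_clusters=4):
    # Agglomerative clustering with average linkage, cut into n_clusters groups.
    Z = linkage(pdist(data, metric=metric), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")

for metric in ("euclidean", "cosine", "cityblock"):  # cityblock = Manhattan distance
    raw = cluster(X, metric)
    logged = cluster(np.log2(X), metric)
    # Cluster labels are arbitrary, so compare the two partitions with the adjusted Rand index.
    print(f"{metric:>9}: agreement between raw and log2-transformed clustering = "
          f"{adjusted_rand_score(raw, logged):.2f}")
```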
Fig. 4 The effect of algorithm on clustering results. Four toy data sets (https://github.com/knaegle/clusteringReview) demonstrate the effects of different types of clustering algorithms on various known structures in two-dimensional data.
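A minimal sketch of this comparison, using scikit-learn's toy data generators in place of the repository's data sets; the algorithm parameters (number of clusters, DBSCAN eps) are illustrative assumptions, not tuned values.

```python
# Minimal sketch (not the repository's code): four algorithm families applied to
# toy structures similar to those in the figure.
import numpy as np
from sklearn.datasets import make_blobs, make_moons, make_circles
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
datasets = {
    "no structure": rng.uniform(size=(300, 2)),
    "spherical clusters": make_blobs(n_samples=300, centers=2, random_state=2)[0],
    "half-moons": make_moons(n_samples=300, noise=0.05, random_state=2)[0],
    "nested circles": make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=2)[0],
}

for name, X in datasets.items():
    X = StandardScaler().fit_transform(X)
    labels = {
        "k-means": KMeans(n_clusters=2, n_init=10, random_state=2).fit_predict(X),
        "Ward": AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X),
        "DBSCAN": DBSCAN(eps=0.3).fit_predict(X),
        "mixture model": GaussianMixture(n_components=2, random_state=2).fit_predict(X),
    }
    # Count clusters found by each algorithm (-1 marks DBSCAN noise points).
    found = {alg: len(set(l) - {-1}) for alg, l in labels.items()}
    print(name, "->", found)
```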
Fig. 5 Ensemble clustering overview. Finishing techniques were applied to random toy data (see file S1 for analysis details). (A) Set of clustering results obtained using the k-means algorithm with various values of k (a k-sweep). (B) Hierarchically clustered (Ward linkage) co-occurrence matrix for the ensemble of results in (A). The heatmap represents the percentage of times any pair of data points coclusters across the ensemble. (C) A majority vote analysis was applied (left panel) using a threshold of 50% on the co-occurrence matrix in (B). Six clusters (see dendrogram color groupings) result from the majority vote (right panel). (D) Application of fuzzy clustering to the ensemble. The left panel shows the details of the co-occurrence matrix for the blue, gray, and orange clusters, and the right shows the clustering assignments. The gray cluster provides an example of partially fuzzy clustering because it shares membership with the orange and dark blue clusters.
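The k-sweep, co-occurrence matrix, and 50% majority vote can be sketched as below. This is not file S1: synthetic data are used, and average linkage is used to cut the consensus tree (the heatmap in (B) is ordered with Ward linkage), so the details are illustrative choices.

```python
# Minimal sketch (not file S1): build an ensemble from a k-sweep of k-means,
# form the pairwise co-occurrence matrix, and apply a 50% majority-vote
# finishing step by cutting a hierarchical clustering of (1 - co-occurrence).
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

X, _ = make_blobs(n_samples=200, centers=4, random_state=3)
n = X.shape[0]

# (A) k-sweep: one clustering result per value of k.
ensemble = [KMeans(n_clusters=k, n_init=10, random_state=3).fit_predict(X)
            for k in range(2, 9)]

# (B) co-occurrence matrix: fraction of runs in which each pair of points coclusters.
cooccurrence = np.zeros((n, n))
for labels in ensemble:
    cooccurrence += (labels[:, None] == labels[None, :])
cooccurrence /= len(ensemble)

# (C) majority vote: keep pairs together only if they cocluster in >50% of runs,
# approximated by cutting the consensus dendrogram at a distance of 0.5.
distance = 1.0 - cooccurrence
np.fill_diagonal(distance, 0.0)
Z = linkage(squareform(distance, checks=False), method="average")
consensus = fcluster(Z, t=0.5, criterion="distance")
print("Consensus cluster sizes:", np.bincount(consensus)[1:])
```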
Fig. 6 Ensemble clustering on phosphoproteomic data. (A) Single clustering solution showing known interactors with EGFR (orange bars) and PDLIM1 (blue bar) coclustering in the phosphoproteomic data (blue heatmap). (B) Co-occurrence matrix heatmap demonstrating clustering of interactors with EGFR. The known interactors with EGFR (orange bars) and PDLIM1 (blue bar) are found in a single cluster (upper left). (C) Subset of clustering results across multiple distance metrics and clustering algorithms. Under the dendrogram, known interactors with EGFR are marked with orange bars and PDLIM1 is marked with a blue bar.
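A sketch of how coclustering of a labeled set, such as the known EGFR interactors, can be summarized from a co-occurrence matrix built as in Fig. 5. The phosphoproteomic data are not reproduced here; the function name and the indices in the demonstration are hypothetical.

```python
# Minimal sketch (the phosphoproteomic data are not reproduced here): given a
# co-occurrence matrix built as in Fig. 5 and the row indices of a set of known
# interactors, summarize how strongly that set coclusters across the ensemble.
import numpy as np

def set_cooccurrence(cooccurrence: np.ndarray, members: list[int]) -> float:
    """Mean pairwise co-occurrence among the given members (excluding self-pairs)."""
    sub = cooccurrence[np.ix_(members, members)]
    n = len(members)
    return (sub.sum() - n) / (n * (n - 1))  # subtract the diagonal of ones

# Toy demonstration with a random ensemble of label vectors (hypothetical indices):
rng = np.random.default_rng(4)
labels = rng.integers(0, 3, size=(10, 50))  # 10 clustering runs, 50 "proteins"
cooc = np.mean([l[:, None] == l[None, :] for l in labels], axis=0)
print(round(set_cooccurrence(cooc, [0, 1, 2, 3]), 2))
```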
Table 1 Validation metrics.
Validation metrics used for testing the quality of a clustering result and the measures on which they are based.
Metric | References | Measures
RMSSTD | (34) | Compactness
RS | (35) |
Modified Hubert's Γ statistic | (36) |
K-nearest neighbor consistency | (37) | Connectedness
Connectivity | (38) |
Determinant ratio index | (62) |
SD validity index | (35) | Separation
Pseudo-F statistic (Calinski-Harabasz) | (39) | Combination of compactness and separation
Dunn index | (40) |
Silhouette width | (41) |
Davies-Bouldin index | (41) |
Gap statistic | (43) |
Rand index | (44) | Similarity between solutions
Table 2 Validation metrics.
Results of validation metrics [RMSSTD (34), RS (35), determinant ratio index (62), and SD validity index (35)] applied to the data from Fig. 4. These metrics measure compactness, connectedness, or separation (Table 1). For each metric, the header indicates whether the best score is the minimum or the maximum.

Algorithm | Structure | RMSSTD (min) | RS (max) | Determinant ratio index (min) | SD validity index (min)
K-means | No structure | 0.38 | 0.23 | 4.04 | 6.67
Ward | No structure | 0.40 | 0.18 | 2.58 | 7.56
DBSCAN | No structure | 0.44 | 0.00 | 1.00 | N/A
Mixture models | No structure | 0.38 | 0.23 | 4.09 | 6.67
K-means | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32
Ward | Spherical clusters | 1.18 | 0.56 | 26.43 | 0.18
DBSCAN | Spherical clusters | 0.58 | 0.86 | 1683.53 | 2.52
Mixture models | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32
K-means | Long parallel clusters | 0.89 | 0.27 | 2.77 | 0.79
Ward | Long parallel clusters | 0.93 | 0.21 | 2.10 | 0.75
DBSCAN | Long parallel clusters | 0.84 | 0.21 | 22.36 | 4.32
Mixture models | Long parallel clusters | 0.89 | 0.27 | 2.79 | 0.79
K-means | Half-moons | 0.55 | 0.35 | 4.23 | 1.77
Ward | Half-moons | 0.57 | 0.30 | 3.77 | 1.98
DBSCAN | Half-moons | 0.60 | 0.21 | 3.03 | 2.61
Mixture models | Half-moons | 0.56 | 0.33 | 5.21 | 1.93
K-means | Nested circles | 0.52 | 0.17 | 2.71 | 4.04
Ward | Nested circles | 0.53 | 0.13 | 1.97 | 3.30
DBSCAN | Nested circles | 0.57 | 0.00 | 1.00 | 4059.71
Mixture models | Nested circles | 0.52 | 0.17 | 2.71 | 3.99
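Several of the metrics in Table 1, though not those reported in Table 2, have scikit-learn implementations. A minimal sketch on synthetic data, with illustrative parameter choices:

```python
# Minimal sketch: scikit-learn implementations of some metrics from Table 1
# (silhouette width, Calinski-Harabasz pseudo-F, Davies-Bouldin, Rand index).
# The metrics reported in Table 2 are not part of scikit-learn and are not
# reimplemented here.
from sklearn.datasets import make_moons
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score)

X, _ = make_moons(n_samples=300, noise=0.05, random_state=5)
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X)
dbscan_labels = DBSCAN(eps=0.2).fit_predict(X)

for name, labels in (("k-means", kmeans_labels), ("DBSCAN", dbscan_labels)):
    print(name,
          "silhouette:", round(silhouette_score(X, labels), 2),
          "pseudo-F:", round(calinski_harabasz_score(X, labels), 1),
          "Davies-Bouldin:", round(davies_bouldin_score(X, labels), 2))

# Similarity between the two solutions (a Rand-type index):
print("adjusted Rand index:", round(adjusted_rand_score(kmeans_labels, dbscan_labels), 2))
```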
Table 3 Ensemble perturbations.
Major perturbations applied to the data or to the clustering parameters in ensemble clustering and their intended purpose.
Perturbation | Reason behind perturbation | References
K | Identify the optimum number of clusters | (46, 60, 63)
Noise | Identify relationships within the data that are not affected by biological or experimental noise | (45, 47, 64)
Starting point (nondeterministic algorithms) | Identify those partitions that are independent of starting position or identify the set of minima | (23, 48)
Projections into lower dimensions | Increase robustness to clustering noise resulting from high dimensionality | (30, 63, 65–67)
Subsampling | Identify subsets of the data that cluster consistently | (68–70)
Parameters of clustering | Identify unpredicted biological information | (32)
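A minimal sketch of three of these perturbations (noise, starting point, and subsampling) applied to synthetic data; this is not file S1, and the perturbation magnitudes are illustrative assumptions.

```python
# Minimal sketch (not file S1): generating three of the perturbations in Table 3
# for an ensemble -- added noise, different random starting points, and subsampling.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=6)
rng = np.random.default_rng(6)
ensemble = []

for i in range(10):
    # Noise perturbation: add small Gaussian noise to every measurement.
    noisy = X + rng.normal(scale=0.1 * X.std(), size=X.shape)
    # Starting-point perturbation: a different random initialization for each run.
    ensemble.append(KMeans(n_clusters=4, n_init=1, random_state=i).fit_predict(noisy))

# Subsampling perturbation: cluster a random 80% subset of the data.
subset = rng.choice(len(X), size=int(0.8 * len(X)), replace=False)
subset_labels = KMeans(n_clusters=4, n_init=10, random_state=6).fit_predict(X[subset])
print(len(ensemble), "perturbed runs;", len(subset), "points in the subsample")
```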
Table 4 Summary of in-depth review articles.
A collection of reviews for more in-depth coverage of each topic.
In-depth review area | References
Specific clustering algorithms, their trade-offs, and how they function | (2, 3, 71–73)
Analysis of the effects of different distance metrics on clustering gene expression data | (74, 75)
Practical and mathematical implications of high dimensionality on clustering | (15, 16)
A thorough review of validation metrics | (35, 38, 62, 76)
The most common multiple hypothesis correction procedures, including Bonferroni correction and FDR correction | (55, 58)
The effects of specific distances on data clustering for lower-dimensional spaces | (77)
The effects of specific distances on data clustering for high-dimensional spaces | (15, 78)
How ensembles of algorithms that are individually incompatible with high-dimensional data can still be useful on high-dimensional data, even when a single clustering solution is uninformative | (23, 24)
A more in-depth analysis of ensembles, including evaluating the results of multiple clustering runs and determining consensus | (60, 63)
Supplementary Materials
www.sciencesignaling.org/cgi/content/full/9/432/re6/DC1
File S1. Output of the iPython Notebook that generates all examples in this review.
Supplementary Materials for: Avoiding common pitfalls when clustering biological data
Tom Ronan, Zhijie Qi, Kristen M. Naegle*
*Corresponding author. Email: knaegle@wustl.edu
Citation: T. Ronan, Z. Qi, K. M. Naegle, Avoiding common pitfalls when clustering biological data. Sci. Signal. 9, re6 (2016).