Avoiding common pitfalls when clustering biological data

Sci. Signal.  14 Jun 2016:
Vol. 9, Issue 432, pp. re6
DOI: 10.1126/scisignal.aad1932
  • Fig. 1 Determining the dimensionality of a clustering problem.

    (A and B) Representation of the mRNA clustering problem consisting of >14,000 mRNAs measured across 89 cell lines. Data are from Lu et al. (6). When the mRNAs are clustered, the mRNAs are the objects and each cell line represents a feature, resulting in an 89-dimensional problem (A). When attempting to classify normal and tumor cell lines using gene expression, the cell lines are the objects and each mRNA is a feature, resulting in a clustering problem with thousands of dimensions (B). (C) Effect of dimensionality on sparsity. (D) Effect of dimensionality on coverage of the data based on SD from the mean.
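
The coverage effect in panel (D) can be illustrated with a minimal sketch. It assumes independent, normally distributed features and defines coverage as the fraction of points lying within k SD of the mean on every feature; both the assumption and the function name `coverage_within_k_sd` are illustrative and are not taken from the review or file S1.

```python
import math


def coverage_within_k_sd(n_features, k=1.0):
    """Fraction of points within k SD of the mean on every feature.

    Assumes independent, normally distributed features (an illustrative
    assumption, not necessarily the calculation used in the review).
    """
    # Probability that a single normal feature falls within k SD of its mean.
    per_feature = math.erf(k / math.sqrt(2.0))
    # Independence lets the per-feature probabilities multiply.
    return per_feature ** n_features


if __name__ == "__main__":
    # 89 matches the mRNA-as-object problem in (A); 14000 approximates (B).
    for d in (1, 2, 10, 89, 14000):
        print(d, coverage_within_k_sd(d))
```

Under these assumptions the covered fraction shrinks geometrically with the number of features, which is one way to see why high-dimensional data are sparse around the mean.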

  • Fig. 2 Dimensionality reduction methods and effects.

    Comparison of PCA and subspace clustering. (A) Three clusters are plotted in two dimensions. The dashed red line indicates the one-dimensional line upon which the original two-dimensional data is projected as determined by PCA. (B) The clusters are plotted in the new single dimension after reducing the dimensionality from two to one. (C) Three alternate one-dimensional projections (dashed red lines) onto which the data can be projected, each demonstrating better separability for some clusters than the projection identified using PCA. (D to F) Comparison of the original clustering results of 89 cell lines in ~14,000-dimensional mRNA data (D) to clustering results after PCA (E) and after subspace clustering (F). Blue bars, gastrointestinal cell lines; yellow bars, nongastrointestinal cell lines.
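
The projection in panels (A) and (B) can be sketched as follows. This is a minimal example on hypothetical toy data (three Gaussian clusters generated here, not the data in the figure), using NumPy's singular value decomposition to find the first principal component; the helper name `pca_project_1d` is an assumption for illustration.

```python
import numpy as np


def pca_project_1d(points):
    """Project points onto their first principal component.

    Returns the 1D coordinates and the unit direction vector.
    """
    centered = points - points.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    return centered @ direction, direction


# Hypothetical toy data: three clusters of 20 points each in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.5], [10.0, 0.0]])
points = np.vstack([c + rng.normal(scale=0.3, size=(20, 2)) for c in centers])

coords, direction = pca_project_1d(points)

if __name__ == "__main__":
    print("direction:", direction)
    print("first five 1D coordinates:", coords[:5])
```

PCA chooses the direction of greatest overall variance, which need not be the direction that best separates any particular pair of clusters; that is the contrast panel (C) draws with alternative projections.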

  • Fig. 3 Effect of transformations and distance metric on clustering results.

    (A) Demonstration of how transformations affect the relationship of data points in space. A toy data set (reference set) was clustered into four clusters with agglomerative clustering, average linkage, and Euclidean distance. The four reference clusters are shown without transformation (upper panel) and after log2 transformation (lower panel). (B) Transformations and distance metrics change clustering results when compared to the reference clustering result. With no transformation (upper panels), Euclidean and cosine distance do not change cluster identity, but with Manhattan distance, a new cluster A′ is added, and cluster C is merged into cluster B. With the log2 transformation (lower panels), the Euclidean and Manhattan metrics cause cluster C′ to emerge and cluster D to be lost. (C) Dendrogram from the microRNA (miRNA) clustering experiment on 89 cell lines and 217 miRNAs (6). Gastrointestinal-derived cell lines (blue bars) predominantly cluster together in the full-dimensional space. Note: The data were log2-transformed as part of the preclustering analysis. (D) Same miRNA data as in (C) but without log2 transformation.
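
Why a transformation or a change of metric can alter cluster membership can be seen with three points. The sketch below uses hypothetical coordinates chosen for illustration (they are not the toy data in the figure) and asks which of two candidates is nearest to a target point under each combination of transformation and distance metric.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def log2_transform(point):
    return tuple(math.log2(x) for x in point)


def nearest(target, candidates, metric):
    """Return the name of the candidate closest to target under metric."""
    return min(candidates, key=lambda name: metric(target, candidates[name]))


# Hypothetical, strictly positive measurements for three objects.
target = (2.0, 64.0)
candidates = {"q": (32.0, 64.0), "r": (2.0, 8.0)}
log_target = log2_transform(target)
log_candidates = {name: log2_transform(p) for name, p in candidates.items()}

if __name__ == "__main__":
    for label, metric in [("euclidean", euclidean),
                          ("manhattan", manhattan),
                          ("cosine", cosine_distance)]:
        raw = nearest(target, candidates, metric)
        logged = nearest(log_target, log_candidates, metric)
        print(f"{label}: raw -> {raw}, log2 -> {logged}")
```

In this example the nearest neighbor under Euclidean and Manhattan distance switches once the data are log2-transformed, and cosine distance already disagrees with the other two on the untransformed data. Because agglomerative clustering merges objects by pairwise distance, shifts like these propagate into different dendrograms.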

  • Fig. 4 The effect of algorithm on clustering results.

    Four toy data sets demonstrate the effects of different types of clustering algorithms on various known structures in two-dimensional data.

  • Fig. 5 Ensemble clustering overview.

    Finishing techniques were applied to random toy data (see file S1 for analysis details). (A) Set of clustering results obtained using the k-means algorithm with various values of k (a k-sweep). (B) Hierarchically clustered (Ward linkage) co-occurrence matrix for the ensemble of results in (A). The heatmap represents the percentage of times any pair of data points coclusters across the ensemble. (C) A majority vote analysis was applied (left panel) using a threshold of 50% on the co-occurrence matrix in (B). Six clusters (see dendrogram color groupings) result from the majority vote (right panel). (D) Application of fuzzy clustering to the ensemble. The left panel shows the details of the co-occurrence matrix for the blue, gray, and orange clusters, and the right shows the clustering assignments. The gray cluster provides an example of partially fuzzy clustering because it shares membership with the orange and dark blue clusters.
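
The co-occurrence matrix in (B) and the majority vote in (C) can be sketched in a few lines. The ensemble below is a hypothetical k-sweep over five points, written out by hand rather than produced by k-means, and the majority vote is implemented as connected components of the thresholded matrix, which is one simple interpretation and an assumption here rather than the procedure documented in file S1.

```python
def cooccurrence(labelings):
    """Fraction of clustering results in which each pair of points coclusters."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    total = len(labelings)
    return [[c / total for c in row] for row in counts]


def majority_vote(co, threshold=0.5):
    """Group points linked by co-occurrence strictly above the threshold.

    Clusters are the connected components of the thresholded matrix
    (an illustrative choice of finishing rule).
    """
    n = len(co)
    labels = [-1] * n
    next_label = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


# Hypothetical ensemble: cluster labels for five points at k = 2, 3, and 4.
ensemble = [
    [0, 0, 0, 1, 1],
    [0, 0, 1, 2, 2],
    [0, 0, 1, 2, 3],
]
co = cooccurrence(ensemble)
consensus = majority_vote(co, threshold=0.5)

if __name__ == "__main__":
    for row in co:
        print(["%.2f" % v for v in row])
    print("consensus labels:", consensus)
```

Points that cocluster in more than half of the ensemble end up together, while a point that only sometimes joins its neighbors is left on its own; a fuzzy finishing step, as in (D), would instead keep those partial memberships.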

  • Fig. 6 Ensemble clustering on phosphoproteomic data.

    (A) Single clustering solution showing known interactors with EGFR (orange bars) and PDLIM1 (blue bar) coclustering in the phosphoproteomic data (blue heatmap). (B) Co-occurrence matrix heatmap demonstrating clustering of interactors with EGFR. The known interactors with EGFR (orange bars) and PDLIM1 (blue bar) are found in a single cluster (upper left). (C) Subset of clustering results across multiple distance metrics and clustering algorithms. Under the dendrogram, known interactors with EGFR are marked with orange bars and PDLIM1 is marked with a blue bar.

  • Table 1 Validation metrics.

    Validation metrics used for testing the quality of a clustering result and the measures on which they are based.

    Validation metric | Measure
    Modified Hubert’s Γ statistic (36) |
    K-nearest neighbor consistency (37) | Connectedness
    Determinant ratio index (62) |
    SD validity index (35) | Separation
    Pseudo-F statistic (39) | Combination of […] and separation
    Dunn index (40) |
    Silhouette width (41) |
    Davies-Bouldin index (41) |
    Gap statistic (43) |
    Rand index (44) | Similarity between solutions
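
As an example of a metric that measures similarity between solutions, the Rand index can be computed directly from two label assignments. The sketch below follows the standard pair-counting definition (the fraction of point pairs on which two clusterings agree); the function name and the example labels are illustrative.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs treated the same way by both clusterings.

    A pair agrees when both clusterings place the two points together,
    or both place them apart.
    """
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Identical partitions with different label names.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
    # Partitions that disagree on half of the pairs.
    print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because the index depends only on which points are grouped together, it is unaffected by how the clusters happen to be numbered.
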
  • Table 2 Validation metrics.

    Results of validation metrics [RMSSTD (34), RS (35), determinant ratio index (62), and SD validity index (35)] applied to the data from Fig. 4. The validation metrics indicate the type of best score (max or min). The bold values represent the best score for each structure and validation metric.

    Algorithm | Structure | RMSSTD | RS | Determinant ratio index | SD validity index
    K-means | No structure | 0.38 | 0.23 | 4.04 | 6.67
    Ward | No structure | 0.40 | 0.18 | 2.58 | 7.56
    DBSCAN | No structure | 0.44 | 0.00 | 1.00 | N/A
    Mixture models | No structure | 0.38 | 0.23 | 4.09 | 6.67
    K-means | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32
    Ward | Spherical clusters | 1.18 | 0.56 | 26.43 | 0.18
    DBSCAN | Spherical clusters | 0.58 | 0.86 | 1683.53 | 2.52
    Mixture models | Spherical clusters | 0.78 | 0.81 | 249.08 | 0.32
    K-means | Long parallel clusters | 0.89 | 0.27 | 2.77 | 0.79
    Ward | Long parallel clusters | 0.93 | 0.21 | 2.10 | 0.75
    DBSCAN | Long parallel clusters | 0.84 | 0.21 | 22.36 | 4.32
    Mixture models | Long parallel clusters | 0.89 | 0.27 | 2.79 | 0.79
    Mixture models | Half-moons | 0.56 | 0.33 | 5.21 | 1.93
    K-means | Nested circles | 0.52 | 0.17 | 2.71 | 4.04
    Ward | Nested circles | 0.53 | 0.13 | 1.97 | 3.30
    DBSCAN | Nested circles | 0.57 | 0.00 | 1.00 | 4059.71
    Mixture models | Nested circles | 0.52 | 0.17 | 2.71 | 3.99
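
To make the RS column concrete, the sketch below computes RS using its common definition, the fraction of the total sum of squares that is explained by the partition. This is a hedged illustration on hypothetical points; the exact implementation behind the table (file S1) may differ in detail.

```python
def sum_of_squares(points):
    """Sum of squared deviations from the centroid, over all dimensions."""
    n = len(points)
    dims = len(points[0])
    means = [sum(p[d] for p in points) / n for d in range(dims)]
    return sum((p[d] - means[d]) ** 2 for p in points for d in range(dims))


def r_squared(points, labels):
    """RS = (total SS - pooled within-cluster SS) / total SS."""
    total = sum_of_squares(points)
    within = 0.0
    for label in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == label]
        within += sum_of_squares(members)
    return (total - within) / total


# Hypothetical data: two tight, well-separated pairs of points.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

if __name__ == "__main__":
    print(r_squared(points, [0, 0, 1, 1]))  # two clusters
    print(r_squared(points, [0, 0, 0, 0]))  # everything in one cluster
```

Under this definition a result that places every point in a single cluster has an RS of exactly zero, because the within-cluster sum of squares equals the total.
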
  • Table 3 Ensemble perturbations.

    Major perturbations applied to the data or to the clustering parameters in ensemble clustering and their intended purpose.

    Perturbation | Reason behind perturbation | References
    K | Identify the optimum number of clusters | (46, 60, 63)
    Noise | Identify relationships within the data that are not affected by biological or experimental noise | (45, 47, 64)
    Starting point (nondeterministic […]) | Identify those partitions that are independent of starting position or identify set of minima | (23, 48)
    Projections into lower dimensions | Increase robustness to clustering noise resulting from high dimensionality | (30, 63, 65–67)
    Subsampling | Identify subsets of the data that cluster consistently | (68–70)
    Parameters of clustering | Identify unpredicted biological information | (32)
  • Table 4 Summary of in-depth review articles.

    A collection of reviews for more in-depth coverage of each topic.

    In-depth review area | References
    Specific clustering algorithms, their trade-offs, and how they function | (2, 3, 71–73)
    Analysis of the effects of different distance metrics on clustering gene expression data | (74, 75)
    Practical and mathematical implications of high dimensionality on clustering | (15, 16)
    A thorough review of validation metrics | (35, 38, 62, 76)
    The most common multiple hypothesis correction procedures, including Bonferroni correction and FDR correction | (55, 58)
    The effects of specific distances on data clustering for lower-dimensional spaces | (77)
    The effects of specific distances on data clustering for high-dimensional spaces | (15, 78)
    Ensembles of some algorithms incompatible with high-dimensional data can be useful on higher-dimensional data, even when a single clustering solution is uninformative | (23, 24)
    A more in-depth analysis of ensembles, including evaluating the results of multiple clustering runs and determining consensus | (60, 63)

Supplementary Materials

  • Supplementary Materials for:

    Avoiding common pitfalls when clustering biological data

    Tom Ronan, Zhijie Qi, Kristen M. Naegle*

    *Corresponding author. Email: knaegle{at}

    This PDF file includes:

    • File S1. Output of the iPython Notebook that generates all examples in this review.

    Technical Details

    Format: Adobe Acrobat PDF

    Size: 756 KB

    Citation: T. Ronan, Z. Qi, K. M. Naegle, Avoiding common pitfalls when clustering biological data. Sci. Signal. 9, re6 (2016).

    © 2016 American Association for the Advancement of Science