The Potential Cost of High-Throughput Proteomics

Science Signaling  15 Feb 2011:
Vol. 4, Issue 160, pp. pe8
DOI: 10.1126/scisignal.2001813


Improvements in speed and mass accuracy of mass spectrometers revolutionized proteomics, with high-throughput proteomics enabling the profiling of complete proteomes and thousands of posttranslational modification sites. The limits of high-throughput proteomics are constantly pushed to new frontiers, and mass spectrometry–based proteomics may eventually permit the analysis of protein expression profiles in less than a day. Increased data acquisition speed has led to a dramatic increase in the total number of tandem mass spectrometry (MS/MS) spectra, such that millions of MS/MS spectra are now acquired in a given set of analyses. Many of these spectra are insufficiently validated; instead, statistical tools are commonly used to estimate false-positive or false-discovery rates for these data sets. Many laboratories may not realize the costs associated with using these widely available, but minimally validated, data sets. The costs associated with use of these data can include missed opportunities for biological insight, the pollution of databases with increasing numbers of false-positive identifications, and time spent by biologists investigating false leads, resulting in a lack of faith in proteomics data. Improved strategies for data validation need to be implemented, along with a change in the culture of high-throughput proteomics, linking proteomics closer to biology.

Over the past two decades, mass spectrometry–based protein characterization has evolved from the analysis of single peptides or proteins to near-comprehensive characterization of proteomes from individual species (1–5) and identification of thousands of proteins from biological samples, such as cell lysates or plasma. Along with increased coverage of the unmodified proteome from these various biological settings, high-throughput (HT) proteomics is also used seemingly routinely to catalog thousands of protein posttranslational modification sites from microbes, cell lines, and tissues (6–9). This evolution in proteomics toward more-rapid analysis of larger numbers of species has followed a trend similar to that of many other techniques, including gene sequencing, transcriptional profiling, and RNA interference, among others. As with these other fields, the challenge facing proteomics over the past several years is one of throughput versus accuracy: Is it possible to radically increase throughput while maintaining high-fidelity protein and posttranslational modification site identification? The first part of this challenge has not been an issue; the proteomics field has successfully increased throughput and coverage. Unfortunately, increased throughput may have come at the cost of accuracy.

In the current implementation, most HT proteomics data sets are assessed by statistical analysis: Searching algorithm metrics for peptide identification are adjusted to yield a given percentage of false positives, as determined by searching the tandem mass spectrometry (MS/MS) spectra against a forward and a reversed or scrambled protein database (10). The resulting data set is then considered to contain an acceptable proportion of false-positive identifications, usually on the order of 1%. Although this approach provides for rapid data evaluation, is easy to adopt, and is commonly accepted by most proteomics groups, there is unfortunately no easy way to determine which of the peptides or proteins in the final data set are false positives versus correct identifications. Moreover, the frequency of various motifs that are commonly found in a given proteome may not be correctly represented in the reversed or scrambled database, thereby yielding an artificially low number of hits in the decoy database and a correspondingly artificially low false-positive rate. More importantly, this statistical tool does not actually validate the accuracy of the peptide assignments in the final data set; instead, it serves as a rough filter to remove a large number of poor-scoring peptide assignments. This is not to say that algorithmic methods have no place in the process. For example, through comparison of libraries of validated “correct” and “incorrect” spectral assignments, it is possible to construct a classifier that can remove a large number of false-positive peptide identifications, thereby facilitating the subsequent validation process (11). So how does one validate MS/MS spectra, and what is the cost of simply stating an estimate of the number of false-positive identifications?
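As an illustration, the target-decoy filtering described above can be sketched in a few lines of code. This is a minimal sketch, not any specific search engine's implementation; the peptide-spectrum-match scores below are hypothetical, and the estimator used (decoy hits divided by target hits above a score threshold) is one common form of the statistic.

```python
# Sketch of target-decoy false-discovery-rate (FDR) estimation: spectra are
# searched against a concatenated forward ("target") and reversed ("decoy")
# database, and the score threshold is raised until the estimated FDR falls
# below a chosen level. All (score, is_decoy) pairs here are hypothetical.

def estimate_fdr(psms, threshold):
    """Estimate FDR at a score threshold as decoy hits / target hits."""
    targets = sum(1 for s, is_decoy in psms if s >= threshold and not is_decoy)
    decoys = sum(1 for s, is_decoy in psms if s >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def threshold_at_fdr(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr."""
    for threshold in sorted({s for s, _ in psms}):
        if estimate_fdr(psms, threshold) <= max_fdr:
            return threshold
    return None

# Hypothetical peptide-spectrum matches, for illustration only.
psms = [(95, False), (90, False), (88, False), (85, True),
        (80, False), (78, False), (60, True), (55, False)]
print(threshold_at_fdr(psms, max_fdr=0.2))
```

Note that, exactly as the text argues, this procedure yields only a global estimate: it says nothing about *which* of the surviving identifications are the false positives.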

As a first step, spectral validation can be performed manually by checking the assignments of the various fragment ions against the predicted mass-to-charge (m/z) ratios of the fragment ions for the given peptide sequence (12). For correct peptide sequences, all of the abundant fragment ions (as a rule of thumb, all fragment ions over 10% of the base peak intensity in the MS/MS spectrum) should be assigned to a specific fragmentation site in the peptide (13). Unassigned fragment ions in the MS/MS spectrum are indicative of either incorrect sequence identification or a mixed spectrum; in either case, the confidence in the sequence identification is much lower, and the peptide should either be left out of the list or placed into a separate low-confidence list. For those peptides or posttranslational modification sites that are deemed critically important to the study, or for MS/MS spectra for which the peptide assignment is of low confidence, the next step would be to synthesize the peptide, with or without heavy isotope–labeled amino acids, and compare the MS/MS spectra for the synthetic peptide to those of the endogenous peptide. A third step to fully validate the identification is to perform a coelution experiment, in which the synthetic peptide is added (in excess) to the biological sample. Obviously, if the identification is correct, the synthetic peptide will coelute with the endogenous peptide.
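The 10%-of-base-peak rule of thumb above amounts to simple bookkeeping, which can be sketched as follows. This is a hypothetical illustration, not validation software: the peak list, predicted fragment m/z values, and tolerance are invented for the example.

```python
# Sketch of the rule-of-thumb check: in a correct assignment, every fragment
# ion above 10% of the base-peak intensity should match a predicted fragment
# m/z within tolerance. Unmatched abundant peaks flag either an incorrect
# sequence or a mixed spectrum. All values below are hypothetical.

def unassigned_abundant_peaks(peaks, predicted_mz, tol=0.5, rel_cutoff=0.10):
    """Return abundant peaks (>= rel_cutoff of base peak) with no match.

    peaks: list of (m/z, intensity) tuples from the MS/MS spectrum.
    predicted_mz: predicted fragment m/z values for the candidate sequence.
    """
    base = max(intensity for _, intensity in peaks)
    return [
        (mz, intensity)
        for mz, intensity in peaks
        if intensity >= rel_cutoff * base
        and not any(abs(mz - p) <= tol for p in predicted_mz)
    ]

# Hypothetical MS/MS peak list and predicted fragment-ion m/z values.
peaks = [(175.1, 900), (288.2, 1000), (401.3, 450), (532.9, 80), (617.4, 300)]
predicted = [175.12, 288.20, 401.29]
print(unassigned_abundant_peaks(peaks, predicted))
```

Here the peak at m/z 617.4 is abundant but unassigned, so by the criterion in the text this identification would be moved to a low-confidence list or excluded.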

Two alternative strategies have been proposed to improve the identification and validation of peptide sequence assignments. The latest generation of mass spectrometers provides more-accurate determination of the m/z ratio of the precursor peptide (improved mass accuracy), and improvements to triple quadrupole mass spectrometers have facilitated more rapid and sensitive multiple reaction monitoring (MRM) experiments. Although improved mass accuracy does narrow the range of possible peptide sequences (14), this strategy disregards the complexity of the proteome. The combinatorial effect of splice isoforms, more than 100 different possible posttranslational modifications, and a lack of proteolytic enzyme specificity radically increases the number of possible species that could match a given precursor peptide m/z ratio, even with highly accurate mass measurements. When this complexity is considered, it is obvious that high mass accuracy is not sufficient to confirm peptide sequence identification. MRM is a targeted strategy in which a given precursor m/z ratio is isolated and fragmented, and then only selected fragment ions are detected (15). Although this approach confirms the presence of a given precursor m/z ratio and fragment ion transition(s), this information was already available from the original analysis. Sequence assignment is not confirmed by MRM unless synthetic peptides are included in a coelution experiment. Thus, neither of these strategies is an acceptable alternative to the spectral validation techniques described above.
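The limitation of mass accuracy alone can be made concrete with a toy calculation. The candidate species and monoisotopic masses below are hypothetical, chosen only to show that even a tight parts-per-million (ppm) window can admit multiple sequence-level explanations of one precursor mass.

```python
# Sketch: why high mass accuracy alone cannot confirm a peptide sequence.
# Even a 5-ppm precursor window can be satisfied by several candidate
# species (sequence isomers, modified forms). Masses are hypothetical.

def within_ppm(observed, candidate, ppm=5.0):
    """True if the candidate mass falls within +/- ppm of the observed mass."""
    return abs(candidate - observed) / observed * 1e6 <= ppm

# Hypothetical monoisotopic masses (Da) for candidate peptide species.
candidates = {
    "PEPTIDEK": 1500.7401,
    "PETPIDEK (isomer)": 1500.7401,      # same composition, different sequence
    "QEPTIDEK (deamidated?)": 1500.7438,  # near-isobaric modified form
    "unrelated peptide": 1502.9120,
}
observed = 1500.7410
matches = [name for name, mass in candidates.items()
           if within_ppm(observed, mass, ppm=5.0)]
print(matches)  # several species survive the 5-ppm filter
```

Sequence isomers in particular have identical elemental composition and therefore identical mass, so no achievable mass accuracy can distinguish them; only fragmentation data can.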

The effort associated with spectral validation can be substantial, because each MS/MS spectrum requires time (minutes to hours) to manually validate, and peptide synthesis and coelution experiments can cost on the order of $100 and take several hours. This validation cost should then be compared with the cost associated with false-positive identifications. The latter cost is difficult to estimate, because it depends on the nature of the study and on the role of the incorrectly identified species. For instance, inclusion of a false-positive protein identification in a list of proteins present in a given tissue or biological fluid may not be important unless that protein is selected for further study because it is differentially expressed under selected conditions or is included as a candidate biomarker for a given disease state. For identification of posttranslational modification sites, it is more difficult to predict the impact of false positives. Simply including the posttranslational modification site in a list of thousands of such sites in a given study would seem to be fairly innocuous but can have negative unintended consequences. Most of the posttranslational modification sites contained in these unvalidated lists are included as identified sites in one or more of the posttranslational modification site databases, such as UniProt (16) or PhosphoSite (17), or posttranslational modification resources, such as PTMScout (18), without regard for whether the site has been validated. The hazard is that many biology research groups consider the posttranslational modification sites in these databases to be accurately assigned; therefore, for a given research laboratory focused on the role of a selected modified protein, the appearance of a previously unknown posttranslational modification site in one of these databases can be a major event, leading to a detailed investigation of the site, including the development of site-specific antibodies and mutation of the site.
These biological follow-up studies can be immensely time-consuming; determining the phenotypic role of a given phosphorylation site is often a manuscript by itself and can require months to years of dedicated effort. Obviously, in either case, the cost of a false-positive identification can be quite high, with years spent validating the biology of the posttranslational modification site on a protein or assessing the applicability of a biomarker in multiple tissues with different assays. There is a potential human cost as well, because these follow-up studies are typically performed by a graduate student or postdoctoral associate, whose career track may be substantially and negatively influenced by attempting to validate a false lead. Thus, the effort associated with spectral validation pales in comparison to the cost in lost time and effort associated with chasing down a false-positive protein or posttranslational modification site identification (Fig. 1).

Fig. 1

Approximate timeline for a typical proteomics experiment (not to scale). Note that the biological sample generation time can vary substantially depending on the source of material. For simple organisms such as Saccharomyces cerevisiae, this time may be on the order of weeks, whereas for mouse models or human tumors it may be on the order of years. The time scale for liquid chromatography (LC)–MS analysis will vary depending on the number of fractions to be analyzed and the number of mass spectral analyses composing a given experiment. Manual validation of a single MS/MS spectrum requires 10 to 20 min for experienced users; obviously, the time for validation scales with the number of MS/MS spectra, as indicated by the asterisk. Biological follow-up experiments will also vary greatly on the basis of the biological context but will typically require months to years to identify the function of a given posttranslational modification site or protein.


With these issues in mind, is there a way to proceed with HT proteomics while minimizing the number of false-positive identifications? As a first step, the goalposts for many proteomics experiments need to be moved. Instead of aiming to maximize the number of proteins, peptides, or posttranslational modification sites identified in a given analysis, the goal of these experiments should be to determine as accurately as possible which species are present in a given sample and to quantify accurately how their respective amounts change under different conditions. When proteomics experiments are closely tied to the goal of having biological impact and creating viable and accurate leads for further investigation, then the total number of peptides or modified sites becomes inconsequential, and minimizing the number of false positives takes precedence.

Given the pace of improvements in HT proteomics technologies and the sheer number of spectra that are being generated, continued publication of minimally validated data sets is likely. In this case, perhaps an appropriate strategy would be to have all of the MS/MS spectra with associated fragment ion assignments publicly available. Biologists interested in investigating a given posttranslational modification site or protein could either contact the proteomics laboratory to ask for further confirmation of a given identification or attempt to evaluate the quality of the MS/MS spectrum assignments themselves, with the above criteria in mind. In either case, proteomics needs to move from a “buyer-beware” culture of knowingly publishing data sets containing a substantial number of false-positive identifications to a culture where accuracy and data quality are more highly valued than the number of identifications.

