Perspective: Biochemistry

Effective Representation and Storage of Mass Spectrometry–Based Proteomic Data Sets for the Scientific Community

Science Signaling  15 Feb 2011:
Vol. 4, Issue 160, pp. pe7
DOI: 10.1126/scisignal.2001839

Abstract

Mass spectrometry–based proteomics has emerged as a technology of choice for global analysis of cell signaling networks. However, reporting and sharing of MS data are often haphazard, limiting the usefulness of proteomics to the signaling community. We argue that raw data should always be provided with proteomics studies, together with detailed peptide and protein identification and quantification information. Statistical criteria for the identification of peptides and their posttranslational modifications have largely been established for individual projects. However, the current practice of indiscriminately incorporating these individual results into databases such as UniProt is problematic. Because of the vast differences in underlying data quality, we advocate a differentiated annotation of data by level of reliability. Requirements for the reporting of quantitative data are being developed, but there are few mechanisms for community-wide sharing of these data.

Biological regulation involves protein interactions, changes in gene expression, or posttranslational modifications. Mass spectrometry (MS)–based proteomics, in particular high-resolution and quantitative implementations, has developed into a powerful technology to characterize all these aspects of protein behavior in a large-scale manner (1–4). Novel proteomic sample-preparation strategies, mass spectrometric instrumentation, and computational proteomics are rapidly advancing the field, and quantification of thousands of proteins is now possible (5). Of special interest to the signaling community are the efficient methods of mapping protein interactions (6–9) and the large-scale identification and quantification of peptides bearing posttranslational modifications (PTMs) (10–15). One of the challenges in making use of this quantum leap in the ability to characterize proteins in biological processes is the rigorous analysis, representation, and sharing of proteomics data sets. In addition to discussing current achievements and remaining problems faced by the community, we give recommendations for handling the coming data flood resulting from increasingly streamlined and available proteomics technology (16).

The typical pipeline in MS-based proteomics involves enzymatic digestion of the protein mixture of interest and separation of the peptides by liquid chromatography, coupled directly to mass measurement and fragmentation in a liquid chromatography–tandem mass spectrometer (LC-MS/MS). Early proteomics studies frequently used low-resolution machines, but today high-resolution instrumentation is readily available, robust, and sensitive. High mass spectrometric resolution helps in accurately measuring peptide masses, which greatly aids in their identification (17). Peptides are generally quantified by comparing peak intensities between stable isotope–labeled pairs of the same peptide or between the peak intensities of the same peptide in different LC-MS/MS runs, which is called “label-free” quantification (18–21). In either case, the ability to distinguish between coeluting peptides of similar masses is a precondition for accurate peptide quantification; thus, high-resolution instrumentation should be used in proteomics, particularly if the data are meant to serve as a resource to the community.
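To make the stable-isotope approach concrete, the minimal sketch below computes a heavy-to-light peptide ratio, assuming a SILAC-style Lys-8 label (+8.014199 daltons); the function names and example values are hypothetical, not part of any specific software.

```python
# Minimal sketch of stable isotope pair quantification (hypothetical values).
# The peptide ratio is the intensity of the "heavy" labeled peak divided by
# the intensity of the "light" unlabeled peak of the same peptide.

def heavy_mz(light_mz, charge, label_shift=8.014199):
    """Expected m/z of the heavy partner, assuming a SILAC Lys-8 label
    that adds 8.014199 daltons to the peptide mass."""
    return light_mz + label_shift / charge

def heavy_light_ratio(light_intensity, heavy_intensity):
    """Heavy/light ratio; None if the light peak was not detected."""
    if light_intensity <= 0:
        return None
    return heavy_intensity / light_intensity

# A doubly charged peptide observed at m/z 500.25:
print(heavy_mz(500.25, charge=2))       # ~504.257, where the heavy pair appears
print(heavy_light_ratio(1.2e6, 2.4e6))  # 2.0, i.e., a twofold change
```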

Peptide identification and quantification are the basic units of MS-based proteomics, and these data should therefore accompany any paper (Fig. 1). Reporting typically involves large tab-delimited data sets that include peptide masses and ratios, the database identification score, and information related to PTMs, if appropriate. In the case of PTMs, a “localization score” must also be given, which expresses the likelihood that the modification is attached to a given residue. If the PTMs themselves are of interest, such as in phosphoproteomic studies, an annotated spectrum for each site should be provided. This is already the policy of proteomics journals, such as Molecular and Cellular Proteomics (MCP) (22). In our opinion, it is sufficient if these annotations, which will often involve thousands of sites, are done in an automated manner, but others recommend that they should be checked manually (23).
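As an illustration of the reporting format described above, this sketch writes a minimal tab-delimited peptide table; the column names and values are invented for illustration, not a prescribed standard.

```python
import csv

# Hypothetical peptide-level report rows: sequence, measured mass, heavy/light
# ratio, search-engine identification score, and, for modified peptides, a PTM
# localization probability for the assigned residue.
peptides = [
    {"sequence": "VATVS(ph)LPR", "mass": 911.47,  "ratio_hl": 1.80,
     "id_score": 92.4, "ptm": "Phospho (STY)", "loc_prob": 0.99},
    {"sequence": "LLQDFFNGK",    "mass": 1080.56, "ratio_hl": 0.95,
     "id_score": 74.1, "ptm": "",              "loc_prob": None},
]

with open("peptides.txt", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=peptides[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(peptides)
```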

Fig. 1

Hierarchy of proteomics data types in shotgun proteomics. From the raw MS data, lists of identified and quantified peptides are extracted. In the case of modified peptides, annotated spectra should be provided along with a localization score, expressing the confidence in placing the modification on a specific amino acid in the peptide sequence. At the next level, peptide information is aggregated, allowing protein identification and quantification. Bioinformatic analyses of the results, such as PTM motif extraction, represented by the WebLogo plot, or network modeling stand at the top of the pyramid.

CREDIT: Y. HAMMOND/SCIENCE SIGNALING

Targeted proteomics in the form of multiple reaction monitoring (MRM) directs MS measurements to only a subset of peptides of interest, for which specific detection conditions have been determined beforehand (18, 24, 25). Results from MRM experiments should therefore be easier to document. However, signals in MRM can also originate from coeluting peptides with similar masses, and establishing robust and reliable false discovery rates (FDRs) for this technique should be a high priority.
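In shotgun proteomics, FDRs are commonly estimated with a target-decoy strategy: spectra are searched against real (“target”) and reversed (“decoy”) sequences, and the decoy hit rate estimates the error among target hits. The sketch below shows that calculation; the function names are ours, and an analogous scheme for MRM remains to be settled.

```python
# Target-decoy FDR sketch: at a given score cutoff,
# estimated FDR ~= (decoy hits above cutoff) / (target hits above cutoff).

def fdr_at_cutoff(target_scores, decoy_scores, cutoff):
    targets = sum(1 for s in target_scores if s >= cutoff)
    decoys = sum(1 for s in decoy_scores if s >= cutoff)
    return decoys / targets if targets else 0.0

def cutoff_for_fdr(target_scores, decoy_scores, max_fdr=0.01):
    """Lowest score cutoff keeping the estimated FDR at or below max_fdr."""
    for cutoff in sorted(set(target_scores)):
        if fdr_at_cutoff(target_scores, decoy_scores, cutoff) <= max_fdr:
            return cutoff
    return None

targets = [88.1, 75.0, 64.2, 52.9, 40.3]   # hypothetical search scores
decoys = [41.0, 33.5]
print(cutoff_for_fdr(targets, decoys))      # 52.9: accept the top four hits
```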

The next level of information is the list of identified and quantified proteins. This will typically also take the form of a spreadsheet, and it should accompany each paper on the journal’s Web site. All proteins identified within the stated FDR (usually 1%) should be provided. It is often useful to provide histograms of fold-changes for the entire protein population; this enables the reader to judge the behavior of the entire proteome. Further bioinformatic analyses and interpretations are generally performed, and these should accompany the paper in the form of supplementary information in the usual way.
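A fold-change histogram of the kind suggested above takes only a few lines; this sketch assumes matplotlib is available and uses made-up ratios.

```python
# Sketch of a proteome-wide fold-change overview: a histogram of log2
# protein ratios lets readers see how the whole population behaves.
import math
import matplotlib.pyplot as plt

ratios = [0.9, 1.1, 1.0, 2.5, 0.4, 1.05, 0.98, 3.1]  # hypothetical H/L ratios
log2_ratios = [math.log2(r) for r in ratios]

plt.hist(log2_ratios, bins=20)
plt.xlabel("log2(fold change)")
plt.ylabel("number of proteins")
plt.title("Proteome-wide fold-change distribution")
plt.savefig("fold_changes.png")
```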

With thousands of high-resolution mass spectrometers (usually of the linear ion trap–Orbitrap configuration) in operation, and a single proteomics project on one machine easily generating hundreds of LC-MS/MS files of a few hundred megabytes each (Fig. 2), these raw data files are too large to store on the journal’s Web site. They should nevertheless be available to the community, because improved computational proteomics tools may allow reanalysis in the future (as with microarray data repositories, for example). Furthermore, availability of the raw data to other researchers helps to ensure high standards of data quality and interpretation. Some journals, including MCP and Science Signaling, require that all raw MS data that are part of a manuscript be uploaded to a publicly accessible site before publication, and we suggest that other journals follow their lead. Unfortunately, the amount of raw MS data produced is overwhelming, and a simple solution for data storage and sharing is not yet available. Tranche at Proteome Commons is currently the only public data repository that can handle and store raw MS data files (https://proteomecommons.org/) (26). Tranche originates from the Andrews laboratory at the University of Michigan, which continues to operate and fund it with various other collaborators. Finding a stable, long-term solution to the raw data storage problem should be a high priority of the proteomics community.

Fig. 2

Hierarchy of data storage and presentation. At the bottom of the pyramid, high-resolution mass spectrometers produce enormous amounts of experimental data stored in LC-MS/MS data files. From these data, peptide peaks are recognized and identified, resulting in large spreadsheets containing peptide-level information. At the protein level, which is also typically presented in spreadsheet format, the amount of information is much reduced. Raw LC-MS/MS data should be stored in repositories, such as Tranche, whereas all other information should accompany the research communication. Specialized databases, such as PRIDE and PeptideAtlas, aggregate proteomic data, whereas Phospho.elm, Phosida, and Phosphosite are examples of PTM-centric databases that allow biologists to check modifications of their proteins of interest. Finally, UniProt is a database that serves as a common entry point for the biological community to retrieve proteomic information about any protein of interest.

CREDIT: Y. HAMMOND/SCIENCE SIGNALING

Because so many of the currently used LC-MS/MS machines are linear ion traps, specifically LTQ-Orbitraps, the practice of storing data in vendor-supplied formats causes few major problems at present. However, for long-term viability, data converters that transform the data into open standard formats without unduly increasing file sizes or losing information are needed.
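One such open standard is mzML, from the HUPO Proteomics Standards Initiative. As a sketch of what working with converted data looks like, the pyteomics library can iterate over spectra in an mzML file; the file name here is hypothetical, and this assumes a vendor-to-mzML converter has already been run.

```python
# Sketch of reading an open-format file: iterate spectra in an mzML file
# with pyteomics and report the base peak of the first spectrum.
from pyteomics import mzml

with mzml.read("run01.mzML") as spectra:        # hypothetical converted file
    for spectrum in spectra:
        mz = spectrum["m/z array"]              # numpy array of m/z values
        intensity = spectrum["intensity array"] # matching peak intensities
        print(spectrum["id"], len(mz), "peaks, base peak at",
              mz[intensity.argmax()])
        break  # just the first spectrum for illustration
```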

Processed proteomics data sets can be uploaded to and accessed from several public Web-based repositories, such as the Proteomics Identifications database [PRIDE (http://www.ebi.ac.uk/pride/)] (27), Peptidome (http://www.ncbi.nlm.nih.gov/peptidome), and PeptideAtlas (http://www.peptideatlas.org/), created by the Aebersold laboratory (28). These databases store peak lists and peptide identifications, as well as some metadata associated with each project. Apart from serving as a static repository, PRIDE allows users to query all recorded data sets, potentially revealing common and distinct aspects of individual projects. The focus of PeptideAtlas is to collect fragmentation information for as many peptides as possible. This information is used to characterize the proteomes of different organisms and to construct precursor-to-fragment transitions for targeted proteomics. Although these repositories are helpful to the community by gathering the scattered published proteomics data, they face the problem of inflated FDRs, which arises when accumulating many data sets of varying quality. Furthermore, even when individually well-controlled projects are combined, false-positive identifications accumulate much faster than true identifications in the combined data. Another major task for some proteomics data repositories is storage of quantitative information, which is increasingly the focus of proteomic experiments.
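The disproportionate accumulation of false positives is easy to see in a toy simulation: successive studies keep re-identifying much the same true peptides, whose union saturates, while each study contributes largely distinct random false positives, whose union keeps growing. All numbers below are invented for illustration.

```python
# Toy simulation of FDR inflation when merging data sets. Each study is
# well controlled on its own (~1% FDR) but the combined FDR creeps upward.
import random

random.seed(1)
TRUE_PEPTIDES = range(10_000)                # shared "real" peptide space
FALSE_SPACE = range(10_000, 10_000_000)      # huge space of possible false hits

combined_true, combined_false = set(), set()
for study in range(1, 11):
    combined_true.update(random.sample(TRUE_PEPTIDES, 5_000))
    combined_false.update(random.sample(FALSE_SPACE, 50))  # ~1% per study
    fdr = len(combined_false) / (len(combined_true) + len(combined_false))
    print(f"after {study:2d} studies: combined FDR ~ {fdr:.3%}")
# True identifications saturate near 10,000 while false ones keep adding up,
# so after ten studies the combined FDR is several times the per-study 1%.
```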

In addition to these general proteomics databases, there are a few cell signaling–oriented ones. For example, Phospho.elm (http://phospho.elm.eu.org/) (29), Phosphosite (http://www.phosphosite.org/) (30), and Phosida (http://www.phosida.com/) (31) record information about MS-identified phosphorylation sites, as well as other PTMs, such as glycosylation and lysine acetylation. They also contain information about the evolution and regulation of modification sites and about the enzymes that likely add the PTMs, such as kinases for phosphorylation sites. One strategy to avoid the accumulation of false-positive data in a proteomic repository is to include only data that have been obtained with similar or identical high-resolution workflows. Although effective for repositories representing the data of a particular laboratory, such as Phosida, this strategy is more difficult for repositories that contain data from multiple sources. In these more diverse data repositories, adequate annotation of the workflow and of data reliability is crucial. Rigorous statistical control of false positives furthermore requires that all acquired raw data be analyzed together, which is another argument for mandatory submission of raw data.

Most biologists obtain proteomic data on proteins of interest by searching well-established community databases, such as UniProt (32). UniProt lists PTM information abstracted from some of the proteomics literature and has become a gold standard for determining whether a given modification site has previously been identified. However, it does not attempt to discriminate among published data sets, and it is therefore exposed to the problem of accumulating false positives. Consequently, it is already “contaminated” by large-scale data that do not fulfill rigorous evaluation criteria. Although it is very difficult to remove errors from databases, doing so is clearly necessary to keep these community resources reliable and useful. We suggest “social network”–like mechanisms by which the reliability score of a site increases with the number of studies that have identified it. Qualifying criteria for such studies could include meeting a particular FDR, using instruments and procedures of a minimum resolution, or passing data-mining checks on the supplied raw data for consistency and plausibility.
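As one illustrative sketch of such a mechanism, a site could be scored by the probability that at least one supporting study is correct, treating the per-study FDR as the chance that a particular identification is false (a simplification) and the studies as independent; the scheme and numbers below are purely hypothetical.

```python
# Illustrative "social network"-like reliability score for a PTM site:
# confidence grows with each independent study that reports the site,
# weighted by that study's stated FDR.

def site_reliability(study_fdrs):
    """1 - product of per-study FDRs: the probability that at least one
    supporting identification is correct, assuming independence."""
    p_all_wrong = 1.0
    for fdr in study_fdrs:
        p_all_wrong *= fdr
    return 1.0 - p_all_wrong

print(site_reliability([0.01]))              # one study at 1% FDR -> 0.99
print(site_reliability([0.05, 0.05, 0.05]))  # three weaker studies -> ~0.9999
```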

There is now a large community developing and applying MS-based proteomics. Measurements that used to take months can now be done in a few days, and analysis is increasingly automated. This will lead to an exponential increase in proteomics data submitted to journals and deposited into databases. Some studies derive their value from reporting previously unknown modification sites, interactions, and so on, essentially providing a resource for the community. Other studies instead report quantitative changes in the abundance or the modifications of peptides already in the databases. Thus, the task of the next few years is to populate databases with high-quality peptide data and to find mechanisms to remove or at least flag results that are unlikely to be correct.

References
