Perspective

# "Bits" and Pieces


Science's STKE  20 Jun 2006:
Vol. 2006, Issue 340, pp. pe28
DOI: 10.1126/stke.3402006pe28

## Abstract

Many protein-protein interactions involved in signal transduction occur through the interaction of modular protein domains in one molecule with short linear sequences of amino acids ("motifs") in another. Although protein domains are recognized by a variety of computational tools, bioinformatic approaches alone have not been successful in identifying the short sequence motifs to which domains bind. A new approach, applying motif-determining algorithms to smaller subproteomic collections of proteins that are already known to associate with each other in high-throughput protein-protein interaction screens, now appears to be capable of capturing a reasonably large number of low-affinity core motif sequences. Application of this approach to the genomes of yeast, fruit flies, nematodes, and humans has doubled the number of known or suspected protein-protein interaction motifs.

Protein-protein interactions play essential roles in all processes within the cell. A detailed understanding of how protein-protein interactions occur at the molecular and atomic level, how they evolved, and how they are regulated is paramount to understanding the process of life itself. A less lofty, but perhaps more pressing goal in dissecting the molecular nature of protein-protein interactions lies in deciding which interactions are well suited for pharmacological manipulation, and drug companies appear to be making increasing efforts in this direction (1).

Certain types of protein-protein interactions, particularly those involved in the organization of complex molecular machines such as the proteasome, the centrosome, and the nuclear pore, involve large regions of protein-protein contact, burying several thousand square angstroms or more of the surface from each of the proteins. In contrast, other interactions, particularly the very dynamic and highly regulated multipartner interactions involved in signal transduction, for example, involve more limited areas of protein-protein contact (2, 3). In many of these cases, the protein-protein interaction involves binding of a modular domain in one protein to a short linear amino acid motif in the other. For the purposes of this article, I will define a "domain" as a stretch of protein sequence, typically between 40 and 300 amino acids in length, that autonomously folds into a compact, stable three-dimensional structure and performs some type of known or unknown biological function. A "motif" refers to a very short region within a protein, typically 3 to 12 amino acids in length, that is responsible for binding of the protein to some other biological molecule, that is, to a modular domain contained within its interacting partner. Within the context of their constituent proteins, motifs are generally thought to exist in a relatively unstructured state, but seem to adopt relatively stable conformations once they are bound to their interacting domains. Although additional sites of contact between the two proteins may extend beyond the specific domain-motif contact region to modulate the strength of the binding, it is the domain-motif association that dominates the interaction. 
Because these types of domain-motif interactions generally bury less surface area than other types of protein-protein interactions, they are ideally suited for use in dynamic cellular processes where particular protein-protein interactions need to be formed and then broken in a relatively short amount of time. Consequently, these rapidly reversible domain-motif associations are also potentially very amenable to disruption by small molecules, making these interactions of prime importance from the perspective of drug design.

Historically, identification of modular domains and their corresponding binding motifs emerged from experiments in individual labs focusing on the biology of one or a few specific protein-protein interactions (4), a process that was both slow and not necessarily focused on defining the general properties of the domain itself. Given our increased appreciation of the importance of domain-motif interactions and the ever-expanding number of complete genomic sequences available, is it possible to use computational methods alone to identify both modular domains and motif sequences in a rapid high-throughput way from proteome-sized collections of sequences? For domains, the problem is straightforward. Because modular domains such as SH3 (Src homology 3) domains or BRCT domains (named after the C-terminal domain of the breast cancer susceptibility protein BRCA1) typically occur in more than one protein and contain 40 to 300 amino acids or more, they can be readily identified computationally in searches for conserved sequences in this length range that are contained within several distinct proteins in a single proteome. Various algorithms have been optimized to accomplish this that use sequence alignments, position-specific scoring motifs, Markov models, or neural networks (5-8). Annotated searchable collections of modular domains are available over the Web (9-12). Although the computationally derived domains (that is, their sequence boundaries) frequently underestimate the true size of the domain when compared with what is determined experimentally [cf. (13, 14)], all of the computational methods work remarkably well. The key to their success lies in the fact that the central core of a modular protein domain, that is, the string of amino acids being recognized in the sequence comparisons, is reasonably large--big enough so that the amount of information contained in protein sequence alignments is statistically robust. 
To a computational biologist, this means that the computer algorithm used to find the domain generates "bit scores" for real domains that greatly exceed those that would arise from similarly sized chunks of randomly chosen sequences.
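To make the statistical point concrete, the rough calculation below compares how much information a domain-sized stretch of fixed positions carries with that of a three-residue pattern. It is only a sketch: it assumes that all 20 amino acids are equally likely at every position (real scoring methods use observed frequencies), and the function names and the search space of 8 million residues are illustrative choices, not values from the work discussed here.

```python
import math


def information_bits(n_fixed_positions, alphabet_size=20):
    """Bits carried by a pattern with n fully specified positions,
    assuming (for illustration) uniform amino acid frequencies."""
    return n_fixed_positions * math.log2(alphabet_size)


def expected_chance_matches(n_fixed_positions, search_space_residues,
                            alphabet_size=20):
    """Expected number of start positions at which a random sequence of
    the given total length matches the pattern purely by chance."""
    per_site_probability = (1.0 / alphabet_size) ** n_fixed_positions
    return search_space_residues * per_site_probability


# Hypothetical proteome-sized search space (an assumption for this sketch).
SEARCH_SPACE = 8_000_000

for n_fixed in (3, 6, 40):
    bits = information_bits(n_fixed)
    chance = expected_chance_matches(n_fixed, SEARCH_SPACE)
    print(f"{n_fixed:>2} fixed positions: {bits:6.1f} bits, "
          f"{chance:.3g} expected chance matches")
```

Under these assumptions, a pattern with three fixed positions is expected to match about 1,000 times by chance in 8 million residues, whereas a 40-position pattern essentially never does.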

Can these same types of computational tools that were so successful for domain hunting also be used to predict the specific motif sequences to which the domains bind? Herein lies the rub--in contrast to the relatively large sequence length of domains, the average length of domain-binding sequence motifs is quite short, typically three to eight amino acids in length. Consequently, from a computational point of view, these motifs live in the twilight of statistical significance when search algorithms are applied to proteome-sized data sets--the bit scores are just too low. Until recently, the only way to identify these motifs was by using experimental techniques such as peptide phage display (11) or oriented peptide library screening (12) that queried individual domains one at a time. Now, Russell and his colleagues have devised a clever computational technique for finding these motifs by applying the motif-determining algorithms to small subsets of the proteome defined by high-throughput protein-protein interaction data (17). In essence, the authors argued that when the search space is limited to a subset of proteins that all share a common binding partner (Fig. 1), any linear sequence motif that is involved in the binding is likely to be overrepresented in that subset. Noting that most linear sequence motifs are generally found in unstructured regions of proteins further narrowed the sequence space where the motifs were sought. Thus, parts of the proteins in the interaction subset comprising globular domains, coiled-coil regions, etc., were discarded before the motif search was applied. Combining the interaction data sets for proteins that shared a particular modular domain--that is, making a "superset" of all those that interact with proteins that contain SH2 domains, for example--also enhanced identification of motifs. 
Finally, the inclusion of orthologs of each protein-protein interaction data set found in related genomes further enhanced the likelihood of motif discovery. Details of the algorithm and the methods for statistically evaluating the results are described in Appendix 1, Fig. 2, and Table 1. Using this joint experimental and computational approach, the authors were able to identify 11 linear sequence motifs in yeast, 26 in fruit fly, 27 in nematode, and 112 in humans, roughly doubling the number of motifs that were known previously.
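The core idea of overrepresentation can be illustrated with a few lines of code. The sketch below is not the authors' algorithm: it simply counts what fraction of proteins contain a given motif in an interaction subset and in a background set. The sequences are invented toy strings, the function names are hypothetical, and the filtering of globular domains and homologous sequences described above is assumed to have been done already.

```python
import re


def motif_to_regex(motif):
    """Translate a motif such as 'PxxP' into a regular expression.
    Lowercase 'x' means any residue; uppercase letters are fixed."""
    return re.compile("".join("." if ch == "x" else ch for ch in motif))


def fraction_with_motif(motif, sequences):
    """Fraction of sequences containing at least one match to the motif."""
    pattern = motif_to_regex(motif)
    hits = sum(1 for seq in sequences if pattern.search(seq))
    return hits / len(sequences)


# Toy data, invented for illustration only.
interaction_set = ["MKAPSLPAAT", "GGPTRPLLSA", "TTPAAPQQ", "MSDNE"]
background_set = ["MKLLVAG", "APQRTT", "GGSSTTA", "LLKKEED",
                  "PAAPGG", "MNNQQ", "STSTST", "AGAGAG"]

in_subset = fraction_with_motif("PxxP", interaction_set)
in_background = fraction_with_motif("PxxP", background_set)
print(f"PxxP: {in_subset:.3f} of subset vs {in_background:.3f} of background")
```

A motif that is much more frequent among the partners of a common protein than in the background is a candidate for further statistical evaluation, as described in the Appendix.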

How good were the motifs that the authors identified? Short peptides containing three of the motifs were subsequently tested experimentally, and two of them, a VxxxRxYs (18) motif predicted to interact with fly Translin and a DxxDxxxD motif predicted to interact with yeast protein phosphatase 1 (PP1), were shown to bind to their respective proteins with dissociation constants (KD values) of 43 and 22 μM, respectively. Although these values indicate interactions that have somewhat low affinity, it appears that at least part of the observed protein-protein interactions were successfully captured by these computationally identified motifs.

Some well-known domain-binding motifs, however, were missed with this approach. Why might that be? In some cases, there may have been too few sequences containing the motifs to reach statistical significance. In other cases, motif recognition may have been complicated by low sequence complexity. For example, WW domains bind strongly to proline-rich sequences locally embellished by one or two particular amino acid residues. Group 1 WW domains, for example, bind to PPxY sequences, whereas group 2 WW domains bind to PPLP sequences (19). If the background probabilities for degenerate proline-rich regions within a complete proteome are high, these motifs are likely to be missed in a computational search if only a small number of interacting partners are found to contain these motifs, as apparently occurred for WW domains in the yeast interaction data set. Likewise, we have observed frequent runs of serine-proline– and serine-glutamine–rich sequences in mammalian proteomes, suggesting that computer-based identification of motifs that contain these sequences would likely also be difficult.

Not all regions of low sequence complexity, however, pose equivalent difficulties for motif identification. In the case of fly and nematode, for example, Russell and colleagues (17) were able to identify a number of low-complexity motifs, including His-, Ser-, Lys-, and Glu/His-rich motifs. In those organisms, at least, these low-complexity sequences may not be overly represented in the global sequence background frequencies. The true significance of these low-complexity motifs awaits further experimental evaluation, because their identification could equally represent an artifact of the yeast two-hybrid method used to generate the high-throughput protein-protein interaction data set.

Two additional points merit comment. First, because regions of large sequence identity between proteins in the interaction set were routinely discarded before motif discovery (to prevent sequence bias overrepresentation), smaller linear motifs may have been lost if they were contained within a larger conserved sequence context. Second, because the definition of linear sequence motifs that Russell and colleagues used did not include conservative amino acid substitutions (i.e., Lys substitution for Arg within a fixed position in a motif was not permitted), some additional motifs may have gone unrecognized.
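The second point can be illustrated with a small sketch of strict versus substitution-tolerant matching. The substitution groups, the function name, and the RxxS example motif below are all assumptions made for illustration; they are not taken from the study, which used strict matching only.

```python
import re

# Illustrative groups of chemically similar residues (an assumption;
# other groupings are equally defensible).
SUBSTITUTION_GROUPS = ["KR", "DE", "ST", "ILVM", "FYW"]


def motif_to_regex(motif, allow_conservative=False):
    """Build a regular expression from a motif such as 'RxxS'.
    Lowercase 'x' matches any residue. With allow_conservative=True,
    each fixed residue also matches the other members of its group."""
    parts = []
    for ch in motif:
        if ch == "x":
            parts.append(".")
        elif allow_conservative:
            group = next((g for g in SUBSTITUTION_GROUPS if ch in g), ch)
            parts.append(f"[{group}]")
        else:
            parts.append(ch)
    return re.compile("".join(parts))


# Hypothetical peptide in which Lys sits where the motif specifies Arg.
peptide = "AAKGGSAA"
strict = motif_to_regex("RxxS").search(peptide)
relaxed = motif_to_regex("RxxS", allow_conservative=True).search(peptide)
print("strict match:", bool(strict), "| relaxed match:", bool(relaxed))
```

Relaxing the definition in this way recovers sites that strict matching misses, but it also raises the chance-match rate, so the background probabilities would need to be recomputed with the same relaxed pattern.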

What are the "bottom line" outcomes of this study? One seminal finding is that high-throughput experimental data, even if they are of relatively low accuracy, combined with computational analysis, are more powerful for characterizing the molecular nature of protein-motif interactions than either method alone. Another important revelation is that capturing relatively low affinity motifs (i.e., binding constants in the 10 to 100 μM range) may be the best that computational approaches can do, in the absence of more specific data. At least part of this limitation is probably a direct consequence of the biology of domain-motif interactions within isolated genomes themselves. In yeast, for example, the SH3 domain of Sho1 binds only to a single PxxP motif found in the protein Pbs2. In fact, the Sho1 SH3 domain and the Pbs2 motif appear to have coevolved, with negative discrimination during evolution ensuring against any other promiscuous SH3:PxxP interactions in the yeast cell (20). Thus, if one groups all 27 SH3 domain-containing proteins in yeast with their ligands, it should not be surprising that only the most common denominators of binding (i.e., a common PxxP motif) will emerge, because each SH3-PxxP motif interaction may be tailored for specificity by selecting unique residues in the nonproline positions. Finally, because some domains often coexist with others (e.g., SH2 and SH3 domains, FHA and BRCT domains), it is not generally possible using computational methods alone to assign a motif as the binding partner for a specific domain.

What comes next? Based on conservative extrapolation of the Neduva et al. data (17), it appears that at least 100 additional new motifs await discovery in the human proteome--perhaps even as many as 500. How will these new sequence motifs be found? Some may emerge from this same computational approach as protein-protein interaction databases are purged of false-positive and false-negative data, and better annotated with respect to physiological significance. Others may be revealed by structural considerations where possible motif-binding sites on the common interacting protein are evaluated in a computational manner (21). Still more motifs may be revealed by context-dependent searches limited to proteins sharing a common subcellular compartment, or in which a specific post-translational modification is known to enhance a particular protein-protein interaction. The hope for the future is that some or all of these will allow us to better capture the secret language of proteins that we are just now beginning to decode.

Appendix

Assessing statistical significance. The statistical significance of newly discovered motifs was measured in the following way. First, a "background" protein data set was constructed with 15,000 randomly selected proteins from SWISS-PROT. Sequences were filtered to remove domains, transmembrane segments, collagen regions, and signal peptides, as well as homologous sequences defined by BLAST. This left a background distribution of nondomain sequences in which to assess motif probabilities.

The random probability of newly discovered motifs was estimated using the binomial distribution. The probability of observing a motif n times in a data set of M proteins by random chance is given by the following:

$$P(n \mid M) = \binom{M}{n}\, p^{n} (1-p)^{M-n}$$

where n = number of times the motif was found in a set of M proteins, $\binom{M}{n}$ is the binomial coefficient denoting the number of ways to place n motifs in M proteins, and p is the probability of finding the motif in the background database of 15,000 proteins. This binomial probability, P(n|M), is the chance that, in the genome studied, the motif would occur in the set of interacting proteins as frequently as it was observed solely by chance.
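The binomial formula above can be evaluated directly with the standard library. The sketch below uses hypothetical function names and made-up input values. The upper-tail variant (the probability of n or more occurrences) is an addition for illustration and is not described in the text.

```python
import math


def binomial_probability(n, M, p):
    """P(n | M): probability of observing the motif exactly n times
    in M proteins when each protein contains it with probability p."""
    return math.comb(M, n) * p ** n * (1 - p) ** (M - n)


def binomial_tail(n, M, p):
    """Probability of observing the motif n or more times in M proteins.
    This upper-tail form is an assumption added for illustration."""
    return sum(binomial_probability(k, M, p) for k in range(n, M + 1))


# Made-up example: a motif found in 6 of 20 interacting proteins, with a
# background probability of 0.05 per protein.
print(binomial_probability(6, 20, 0.05))
print(binomial_tail(6, 20, 0.05))
```

The smaller the resulting probability, the less likely it is that the observed frequency of the motif in the interaction set arose by chance.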

Statistical enrichment of motif discovery using evolutionary conservation. For proteins in the "interacting set," their orthologs were found in related genomes. One can then search for occurrence of the motif within the ortholog interaction set for each of the genomes and assign binomial probabilities for the occurrence of that motif within each genome.

From the multiple genomes examined, the "motif conservation score," $S_{cons}$, was defined as the product of the individual binomial probabilities for each genome, that is,

$$S_{cons} = P_{1} \times P_{2} \times P_{3} \times \ldots \times P_{y}$$

where y is the total number of genomes studied.
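As a minimal sketch, the conservation score is simply the product of the per-genome binomial probabilities. The function names and the example probabilities are hypothetical. Working with the sum of logarithms is a practical choice made here, not something stated in the text: it avoids numerical underflow when many small probabilities are multiplied.

```python
import math


def conservation_score(per_genome_probabilities):
    """Product P_1 x P_2 x ... x P_y of the binomial probabilities
    obtained for the motif in each of the y genomes."""
    return math.prod(per_genome_probabilities)


def log10_conservation_score(per_genome_probabilities):
    """The same score on a log10 scale (a numerical convenience)."""
    return sum(math.log10(p) for p in per_genome_probabilities)


# Made-up probabilities for one motif in three genomes.
probabilities = [0.1, 0.2, 0.5]
print(conservation_score(probabilities))
print(log10_conservation_score(probabilities))
```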

Next, the background distribution of $S_{cons}$ values that would emerge solely by chance if one searched for that motif in a similarly sized set of randomly selected proteins in those genomes is calculated. To do this, for each genome, 50 sets of proteins are randomly constructed to be of the same number and length as those in the orthologous interaction set. Binomial probabilities and $S_{cons}$ values are then calculated for each of these random genome data sets.

Finally, the $S_{cons}$ value that was actually observed for the motif in the real interacting set is compared with the $S_{cons}$ distribution in the random data sets, and only those motifs whose $S_{cons}$ values have P values < 0.001 are chosen.
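One simple way to compare an observed score with scores from random sets is an empirical P value, sketched below. This is an assumption about how such a comparison could be made, not a description of the authors' procedure; the function name and the values are hypothetical. Because a smaller conservation score indicates a less probable (more significant) motif, the function counts random scores that are as small as or smaller than the observed one.

```python
def empirical_p_value(observed_score, random_scores):
    """Fraction of random-set scores at least as extreme (as small) as
    the observed score, with a +1 correction so the estimate is never 0."""
    as_extreme = sum(1 for s in random_scores if s <= observed_score)
    return (as_extreme + 1) / (len(random_scores) + 1)


# Made-up scores from four random protein sets.
random_scores = [0.5, 0.2, 0.1, 0.05]
print(empirical_p_value(0.001, random_scores))
print(empirical_p_value(0.1, random_scores))
```

With only 50 random sets, counting in this way cannot give an estimate below 1/51 (about 0.02), so reaching a threshold of 0.001 would presumably require fitting a distribution to the random scores or using more random sets. The text does not specify which approach was used.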
