Research Article
Phosphoproteomics

Deciphering Protein Kinase Specificity Through Large-Scale Analysis of Yeast Phosphorylation Site Motifs


Sci. Signal.  16 Feb 2010:
Vol. 3, Issue 109, pp. ra12
DOI: 10.1126/scisignal.2000482


Phosphorylation is a universal mechanism for regulating cell behavior in eukaryotes. Although protein kinases target short linear sequence motifs on their substrates, the rules for kinase substrate recognition are not completely understood. We used a rapid peptide screening approach to determine consensus phosphorylation site motifs targeted by 61 of the 122 kinases in Saccharomyces cerevisiae. By correlating these motifs with kinase primary sequence, we uncovered previously unappreciated rules for determining specificity within the kinase family, including a residue determining P−3 arginine specificity among members of the CMGC [CDK (cyclin-dependent kinase), MAPK (mitogen-activated protein kinase), GSK (glycogen synthase kinase), and CDK-like] group of kinases. Furthermore, computational scanning of the yeast proteome enabled the prediction of thousands of new kinase-substrate relationships. We experimentally verified several candidate substrates of the Prk1 family of kinases in vitro and in vivo and identified a protein substrate of the kinase Vhs1. Together, these results elucidate how kinase catalytic domains recognize their phosphorylation targets and suggest general avenues for the identification of previously unknown kinase substrates across eukaryotes.


As one of the most widespread posttranslational modifications, protein phosphorylation is involved in virtually every basic cellular process, including DNA replication, gene transcription, protein translation, cell growth and metabolism, differentiation, and intercellular communication. With the advent of whole genome sequencing, the entire complement of kinases, or “kinome,” for multiple organisms has been cataloged, revealing that most eukaryotes devote ~2% of their protein-coding capacity to these enzymes (1). Unraveling the function of each member of such a large family remains a challenge. Advances in phosphoproteomic methodologies, such as large-scale mass spectrometry (MS)–based phosphorylation site discovery, targeted small interfering RNA screens, the use of analog-sensitive kinase alleles that are engineered to accept specific inhibitors and adenosine triphosphate (ATP) analogs, and protein microarray analyses, have shed considerable light on the scope and complexity of phosphorylation-based signal transduction pathways in eukaryotes (2–5).

However, one aspect of protein kinase biology that remains poorly understood is how kinases achieve specificity for their target substrates. Understanding rules for substrate recognition by kinases has important applications in the mapping of phosphorylation sites in protein substrates, discovery of previously unknown substrates, and production of model substrates for small-molecule inhibitor screening (6). In addition, a detailed understanding of how kinases interact with their substrates enables both deciphering and genetic rewiring of kinase specificity, thereby uncovering fundamental ways in which signaling pathways are organized and propagated (7, 8).

In a typical eukaryotic cell, there are hundreds of thousands of Ser, Thr, and Tyr residues among the thousands of proteins. To ensure signaling fidelity, kinases must somehow discriminate among these vast numbers of potential phosphorylation sites. Mechanisms that influence substrate selection by a protein kinase include subcellular localization, substrate docking interactions, and binding to scaffold proteins (9). An important aspect of substrate recognition, however, is that the phosphorylation site on the substrate falls within a consensus amino acid sequence that is complementary to the active site of the kinase.

Consensus phosphorylation site motifs for protein kinases have been previously established on an individual basis through either the inspection of known phosphorylation sites, the systematic mutagenesis of protein and peptide substrates, or the screening of peptide libraries (10, 11). Although these studies have provided valuable insight into substrate recognition, such data are only available for a subset of known protein kinases. NetPhorest, which is the most comprehensive repository for kinase phosphorylation site motifs reported to date, includes motifs for only 35% of all human kinases (12). The incompleteness of available data and the heterogeneity of data collection methods limit their application to elucidating cellular signaling pathways and modeling larger phosphorylation networks. For example, the use of motif scanning approaches to link specific kinases to the thousands of in vivo phosphorylation sites discovered through MS-based phosphoproteomics has proven difficult in targeted kinase studies because multiple kinases can potentially target the same or similar motifs.

We thus set out to catalog consensus phosphorylation site motifs for the kinome of the model organism Saccharomyces cerevisiae. We adapted a peptide library screening approach (13) to a miniaturized format that would enable rapid analysis of large numbers of kinases. With this method, we determined consensus phosphorylation motifs targeted by 61 of the 122 yeast kinases. This large collection of phosphorylation site motifs provided new insight into the structural basis for substrate recognition by protein kinases as a family in a manner not possible through analyses of individual kinases. Furthermore, we used our motif collection to predict new kinase-substrate relationships through database scanning and integration with other yeast proteomic and genomic data sets.


A rapid peptide-based approach for the high-throughput determination of kinase consensus phosphorylation site motifs

To determine phosphorylation motifs for yeast protein kinases, we developed a high-throughput approach using our previously reported positional scanning peptide library (13). This library consisted of 200 distinct peptide mixtures in which each 16-mer peptide contained a central fixed phosphorylation acceptor (phosphoacceptor) site (an equimolar mixture of Ser and Thr) flanked by degenerate positions consisting of equimolar mixtures of the 20 amino acids excluding Ser, Thr, and Cys, and a C-terminal biotin tag (Fig. 1A). For each of the nine positions surrounding the phosphoacceptor site, there were 22 peptide mixtures in which each of the 20 unmodified amino acids, as well as phosphothreonine (pT) and phosphotyrosine (pY), were fixed. In addition to these 198 (9 × 22) peptide mixtures, two control peptide mixtures bearing either Ser or Thr alone as the fixed phosphoacceptor residue in the context of a fully degenerate sequence were also included. These control mixtures served as indicators of any preference the kinase had for either Ser or Thr residues at the phosphoacceptor site. Peptides were incubated with the kinase of interest in the presence of radiolabeled ATP. At the end of the incubation period, aliquots of each reaction were spotted simultaneously with a capillary pin–based liquid transfer device onto a streptavidin-coated membrane that captured the peptide substrates through their C-terminal biotin tags. After extensive washing, the membrane was dried and exposed to a phosphor screen, allowing the extent of radiolabel incorporation for each peptide to be visualized and quantified. To enable high-throughput analysis, all steps were performed in a 1536-well format, thereby reducing the amount of kinase and peptide required and enabling simultaneous analysis of four kinases.
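The layout of the library can be checked with a short enumeration. The following is only an illustrative sketch in Python: the position labels (P-5 through P+4) and the `P0` label for the control mixtures are assumptions introduced here for bookkeeping, not identifiers from the study.

```python
# Sketch of the positional scanning library layout described above.
# The position labels are assumed for illustration; the text states only
# that nine positions surround the phosphoacceptor site.
AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")      # 20 unmodified residues
FIXED_RESIDUES = AMINO_ACIDS + ["pT", "pY"]     # plus phosphothreonine and phosphotyrosine
POSITIONS = ["P-5", "P-4", "P-3", "P-2", "P-1",
             "P+1", "P+2", "P+3", "P+4"]         # assumed labels

def build_library():
    """Return one (position, fixed_residue) tuple per peptide mixture."""
    mixtures = [(pos, res) for pos in POSITIONS for res in FIXED_RESIDUES]
    # Two control mixtures fix only the phosphoacceptor (Ser alone or Thr alone).
    mixtures += [("P0", "S"), ("P0", "T")]
    return mixtures

library = build_library()
print(len(library))  # 200 = 9 x 22 + 2
```

Each kinase screen therefore yields one radiolabel signal per tuple, which is the raw input for the quantification steps described later.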

Fig. 1

Miniaturized peptide array approach enables high-throughput analysis of kinase consensus phosphorylation motifs. (A) Scheme for kinase peptide screening. Capillary pin–based liquid transfer devices were used to add components to reactions (2 μl per well) and spot 0.2-μl aliquots onto the streptavidin-coated membrane after incubation. The 1536-well format allows four kinases to be analyzed simultaneously. (B) Representative peptide screening results for Atg1, Gin4, Mps1, and Prk1. (C) Phosphorylation of consensus peptide substrates by Atg1, Gin4, Mps1, and Prk1. The sequence of each peptide is as follows: ATGtide, YANWLAASIYLDGKKK; GINtide, YALRRSRSMWNLGKKK; MPStide, YADHDDDTMHFRGKKK; and PRKtide, YALKPQYTGPRGKKK. Abbreviations for the amino acids are as follows: A, Ala; C, Cys; D, Asp; E, Glu; F, Phe; G, Gly; H, His; I, Ile; K, Lys; L, Leu; M, Met; N, Asn; P, Pro; Q, Gln; R, Arg; S, Ser; T, Thr; V, Val; W, Trp; and Y, Tyr. Peptide phosphorylation was assayed at 10 μM concentration by radiolabel kinase assay. Incorporation of radiolabeled phosphate into peptides was determined by phosphocellulose filter binding assay. Maximal rates for each kinase in these assays were as follows: Atg1, 250 nM/min; Gin4, 510 nM/min; Mps1, 130 nM/min; and Prk1, 330 nM/min. (D) Rates of Atg1 phosphorylation of ATGtide variants with individual point substitutions. Peptide phosphorylation was assayed as for (C).

Three yeast kinases (Tpk1, Tpk2, and Ste20) were assayed with both the miniaturized and the large-volume formats, and we performed multiple replicates with one of these kinases, Tpk1. Identical results were observed with the two formats and in replicate assays with the 1536-well format (data for Tpk1 are shown in fig. S1). These kinases also recapitulated preferences of their mammalian orthologs for basic residues upstream of the phosphorylation site (13, 14). These results confirm that the miniaturized peptide library screening system is reproducible and provides data that are quantitatively equivalent to lower-throughput approaches.

Screening yeast kinases for their consensus phosphorylation site motifs

With our peptide array method, we screened 111 of the 122 yeast kinases. Kinases were initially purified from yeast strains that harbor galactose-inducible expression plasmids bearing either a C-terminal tandem affinity purification tag or an N-terminal glutathione S-transferase (GST) tag (15, 16). In a number of instances, it was necessary to perform the assay in the presence of known activating subunits [for example, cyclins for cyclin-dependent kinases (CDKs)], phosphorylate the kinase in vitro or coexpress it with an activating kinase, or purify the kinase from yeast grown under activating conditions. For kinases with which poor yields were obtained from yeast, we used alternative bacterial and mammalian cell expression systems. Each kinase was assayed on the peptide substrates in duplicate on separate days. In total, we generated reproducible phosphorylation motifs for 61 of the 111 yeast kinases screened (Fig. 1B and table S1). Three distinct motifs were generated for the CDK Pho85 by analyzing its activity separately in complex with different cyclin subunits (Pho80, Pcl1, and Pcl2). The remaining kinases were not sufficiently active to phosphorylate the peptides above background levels. These kinases may be highly specific for particular protein substrates and thus do not phosphorylate peptides efficiently. For example, in keeping with previous observations for their mammalian orthologs (17), we did not observe activity on our peptide substrates for the eight kinases in the mitogen-activated protein kinase (MAPK) kinase (MAPKK) and MAPK kinase kinase (MAPKKK) families. Other kinases were likely simply inactive under exponential growth conditions or when assayed in the absence of obligate binding partners and may be suitable for analysis once their activation mechanisms are more completely understood.

Approximately half of the phosphorylation site motifs that we determined for yeast kinases were identical to known motifs, because they corresponded to yeast homologs of mammalian kinases that have been previously characterized (11, 12). In contrast, the remaining kinases and their mammalian homologs have either not been previously characterized (table S2 lists mammalian homologs and indicates which kinases have previously known motifs) or in one instance (Tos3) yielded a motif different from that reported. Representative spot arrays produced by four kinases for which phosphorylation motifs were not previously known (Atg1, Gin4, Mps1, and Prk1) are shown in Fig. 1B. Spot intensities from the peptide arrays were quantified, background-corrected, and normalized to provide the selectivity values shown in Table 1. We verified the consensus phosphorylation motifs for these kinases by performing kinase assays with optimized peptide substrates (named ATGtide, GINtide, MPStide, and PRKtide, respectively) consisting of those residues that were most highly selected at each position. As shown in Fig. 1C, each kinase was highly specific for its corresponding peptide substrate, thus providing independent validation of our mixture-based peptide library screening approach.
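The quantification step can be sketched as follows. The Table 1 caption states that values were normalized to an average of 1 within a position and that residues with values greater than 1.5 are reported as positively selected. The background handling (simple subtraction, clipped at zero), the function names, and the toy numbers below are assumptions made for illustration and are not measured data.

```python
import numpy as np

def normalize_selectivity(intensities, background=0.0):
    """Background-correct and scale each position (row) to a mean of 1.

    intensities: array of shape (n_positions, n_residues) of raw spot signals.
    background: scalar or array subtracted before scaling (assumed form).
    Rows whose corrected signals are all zero are not handled in this sketch.
    """
    corrected = np.clip(np.asarray(intensities, dtype=float) - background, 0.0, None)
    row_means = corrected.mean(axis=1, keepdims=True)
    return corrected / row_means

def positively_selected(selectivity, residues, threshold=1.5):
    """List (row_index, residue, value) for entries above the threshold."""
    hits = []
    for i, row in enumerate(selectivity):
        for residue, value in zip(residues, row):
            if value > threshold:
                hits.append((i, residue, round(float(value), 2)))
    return hits

# Toy example with made-up intensities for four residues at two positions.
residues = ["A", "D", "L", "R"]
raw = [[10, 10, 10, 50],
       [5, 5, 5, 5]]
selectivity = normalize_selectivity(raw)
print(positively_selected(selectivity, residues))  # [(0, 'R', 2.5)]
```

Because each row is scaled independently, a value of 1 means a residue is neither favored nor disfavored relative to the average at that position.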

Table 1

Quantified selectivity values for protein kinases discussed in the text. Peptide array data were quantified and normalized to an average value of 1 within a position. Positively selected residues with values greater than 1.5 are shown. Complete quantified PWMs for all kinases are provided as data set S1.


Notably, the autophagy-linked kinase Atg1 had an atypical motif exhibiting selections for hydrophobic residues at multiple positions. We verified this motif by making targeted substitutions to the ATGtide substrate. As anticipated, substituting a different favorable hydrophobic residue (Met) at the most selective position (P–3) had no significant effect on the rate of ATGtide phosphorylation. Moreover, substituting unfavorable charged residues at any of three most strongly selective positions dramatically reduced the reaction rate (Fig. 1D).

Overall features of kinase phosphorylation signatures

Normalized, background-corrected phosphorylation signals for each kinase were assembled into position weight matrices (PWMs), which are quantitative representations of the phosphorylation motif. We scored each position for its total selectivity, and a specificity heat map of all kinases and positions revealed the wide range of selectivity exhibited by kinases (Fig. 2). At one extreme, Yck1 and Cka1 (yeast casein kinase 1 and casein kinase 2 homologs) were highly sequence-specific, with requirements for particular amino acids at multiple positions. At the other extreme, Cak1 and Rad53 were the least selective in that, although the extent of substrate phosphorylation by these kinases is clearly dependent on peptide sequence, there were no residues that were required at any position surrounding the phosphoacceptor. Most kinases fell between these extremes, with a combination of required residues and more subtle propensities that influence the overall efficiency of phosphorylation. Furthermore, although selectivity was noted at each position surrounding the phosphorylation site, kinases were most frequently selective at the P–3 position, followed by the P–2 and P+1 positions. By contrast, few kinases were selective at the P–1 position.

Fig. 2

Heat map ranking kinases by their specificity quotients as calculated from their average PWMs. Kinases are ranked from least specific (top) to most specific (bottom). The specificity in each position is defined as the information content in each position, equivalent to the total height of the sequence logo (see table S1 for logos).
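The per-position specificity defined in the caption, information content, can be computed from one row of a PWM. The sketch below assumes a uniform background over the residues in the row and applies no small-sample correction; the exact formula used to build the heat map is not given in this section, so this is one common convention rather than the authors' implementation.

```python
import numpy as np

def position_information(pwm_row):
    """Information content (bits) of one PWM position, assuming a uniform background.

    pwm_row: non-negative selectivity values, one per residue, with a positive sum.
    """
    p = np.asarray(pwm_row, dtype=float)
    p = p / p.sum()                      # convert selectivity values to frequencies
    nonzero = p[p > 0]                   # treat 0 * log2(0) as 0
    entropy = -np.sum(nonzero * np.log2(nonzero))
    return float(np.log2(p.size) - entropy)

# A position with no preference carries no information; a position that
# accepts a single residue out of 20 carries log2(20) bits.
print(position_information([1.0] * 20))
print(position_information([1.0] + [0.0] * 19))
```

Summing this quantity over all positions gives a single number per kinase that could be used to rank kinases from least to most specific.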

The 61 yeast kinases were clustered into groups on the basis of phosphorylation site selectivity (Fig. 3). Thirty-five kinases were observed to target “basophilic” motifs. Thirty-one of these showed a classic basophilic signature (10), with a strong selectivity primarily for an Arg residue at the P–3 position. This was the single most common feature found among all motifs (Fig. 3 and table S1). Four other basophilic kinases, Ipl1, Skm1, Ste20, and Cla4, were selective for Arg at the P–2 position but did not show strong selectivity for Arg at the P–3 position (Fig. 3 and table S1). The basophilic kinases, however, diverged with respect to the residues selected at other positions. For example, basophilic kinases are often reported to be selective primarily for either Leu or Arg at the P–5 position, as well as selective for Arg at P–3 (13, 18–20). Among the various kinases that selected Arg at the P–3 position, we observed a spectrum of residues selected at the P–5 position, including not only Leu (Cmk1 and Cmk2) and Arg (Ypk1), but also Met (Vhs1), Val or Ile (Prr1), and His (Psk2) (Fig. 3 and table S1). The seven proline-directed kinases, which primarily selected for Pro at the P+1 position, were also distinguishable on the basis of selectivity at other positions. For example, Kss1, Hog1, and Fus3 all showed a secondary selectivity for proline at the P–2 position that was not observed in Pho85 or Cdc28. Other motifs were less common, and include multiple distinct “acidophilic” motifs in which the strongest selectivity was for Asp, Glu, or pThr. Such acidophilic motifs have been previously seen for various mammalian kinases, including GSK3 (glycogen synthase kinase 3; selectivity for acidic amino acids at the P+4 position), CK1 (casein kinase I; P–5 through P–3), PLK (Polo-like kinase; P–2), and CK2 (P+1 through P+3) (21–23). All yeast orthologs of these kinases recapitulated the motif found in their mammalian orthologs (table S2), but we also found additional yeast acidophilic kinases that were not anticipated (Mps1, Gcn2, and Cdc7). In addition, three kinases, Atg1, Kin1, and Kin3, exhibited their strongest selectivities for hydrophobic residues. The remaining kinases exhibited multiple strong selectivities and could not easily be categorized.

Fig. 3

Dendrogram of yeast kinases clustered by specificity. Specificity categories are indicated by shading: red, acidophilic; orange, Pro-directed; cyan, P–3 Arg selecting; blue, P–2 Arg selecting; green, other. Because there were multiple distinct acidophilic motifs in which selectivity varies by position, some kinases selecting primarily acidic residues clustered in the “other” category. Sequence logos (74) are shown for selected kinases from each group.

Connecting phosphorylation site motifs to kinase specificity–determining residues

Yeast kinases have been classified into five groups on the basis of sequence homology: AGC [adenosine 3′,5′-monophosphate–dependent protein kinase (PKA)/guanosine 3′,5′-monophosphate–dependent protein kinase (PKG)/protein kinase C (PKC)], CAMK (calcium/calmodulin-regulated and structurally similar kinases), CMGC (CDK, MAPK, GSK, and CDK-like kinases), STE11/STE20, and STE7/MEK (MAPKK) (24). These groups have then been classified further into families that share a high degree of sequence similarity within their catalytic domains. Although related kinases generally recognized similar phosphorylation site motifs, kinases within the same family occasionally exhibited differences, both subtle and striking. One family that illustrates striking differences is the Snf1 kinase family, which belongs to the CAMK group. In yeast, the Snf1 [also known as the AMPK, adenosine monophosphate (AMP)–activated protein kinase] family has six family members: Gin4, Hsl1, Kcc4, Kin1, Kin2, and Snf1. We identified consensus phosphorylation site motifs for each of these kinases with the exception of Kin2 (Table 1 and table S1). All five kinases had common features in their motifs, which are also shared with mammalian AMPKs (25, 26). For example, each one had preferences for a Ser residue as the phosphoacceptor site, a Ser residue at the P–2 position, an Asn residue at the P+3 position, and hydrophobic residues at the P+4 position (Gin4, Snf1, and Kin1 are summarized in Table 1; see data set S1 for quantitative data for Hsl1 and Kcc4). Strikingly, however, only four of the five Snf1 family kinases exhibited the hallmark basophilic P–3 Arg selectivity of the CAMK group, with Kin1 lacking this conserved feature. Instead, Kin1 had an additional preference for an Asn residue at the P–2 position. This difference correlated with a single amino acid substitution within the kinase catalytic domain (Fig. 4A). Gin4, Hsl1, Kcc4, and Snf1 each have a conserved Glu residue (corresponding to Glu127 in PKA; Fig. 4B). Crystal structures of multiple basophilic kinases in complex with peptide substrates have shown that this residue forms a salt bridge with the guanidino group of the P–3 Arg residue of the bound substrate (27–30). Unlike the other family members, Kin1 has a Gln residue in place of this conserved Glu. These observations are thus consistent with a role for Glu127 as the critical specificity-determining residue for Arg at the P–3 position in substrates, at least within the Snf1 family.

Fig. 4

Comparison of kinase consensus phosphorylation site motifs to primary sequence reveals specificity-determining residues. (A) Sequence alignment of the regions surrounding residues 127 and 170 (human PKA numbering) in the catalytic domain of representative Snf1 family kinases (Gin4, Snf1, Kin1), and the CMGC kinases Yak1 and Kss1. The presence of an acidic residue at position 127 correlates with Arg selectivity at the P–3 position for the Snf1 family, but not the CMGC group. Conversely, a Glu residue at position 170 correlates with Arg selectivity for CMGC group kinases, but not for the Snf1 family. (B) Stereo view of the crystal structure of PKA with bound pseudosubstrate peptide (shown in cyan in stick representation; for clarity, only the portion falling within the active-site cleft is shown) highlighting predicted specificity-determining residues (in sphere representation). Residues 127 and 170 are shown in yellow and magenta, respectively. The figure was generated with PyMOL from the coordinates in PDB code 1ATP. (C) Kss1 mutagenesis. Mutation of Kss1 Ser147 to Glu confers selectivity for Arg at P–3. The bar graph shows normalized spot intensities for the P–3 position taken from screens of the full peptide library (shown in fig. S2).

However, crystallographic insight into specificity determinants in protein kinases is limited to a handful of cases where structures of kinase-peptide complexes have been solved. Although computational approaches have offered additional insight into structural features that control specificity (31, 32), the existence of alternative binding modes, even between kinases with similar specificity (30), makes it difficult to make general conclusions regarding the relationship of kinase sequence to specificity. Indeed, multiple sequence alignment of the yeast kinome and comparison with our experimentally determined motifs indicated that the presence of an acidic residue at position 127 is neither necessary nor sufficient to direct selectivity for Arg at the P–3 position in substrates. For example, within the CMGC group, members of the MAPK and CDK families (Fus3, Kss1, Hog1, Cdc28, and Pho85), which are proline-directed kinases, have an Asp residue at that position, despite a lack of selectivity for Arg at the P–3 position. Conversely, Yak1 within the same group is basophilic, yet lacks an acidic residue at that position (Table 1 and Fig. 4A). Presumably, other residues within the catalytic domain are responsible for dictating a basophilic signature within this group of kinases.

With our large collection of kinase motifs, we identified previously unknown specificity-determining residues, including, but not restricted to, residues that might confer P–3 Arg selectivity for kinases that are not part of the Snf1 family. We used an approach based on the idea of covariation (33): we identified residues whose variation in the primary sequence of the catalytic domain significantly correlated with the variation in phosphorylation site specificity across kinases. To measure sequence variation, we used a simple pairwise similarity matrix, and to compare specificities, we calculated the Frobenius norm of the differences in PWMs (Table 2 and Fig. 4B). This approach reproduced several specificity-determining residues previously known from both structural and mutagenesis studies, including Glu127. In addition, we uncovered many previously unknown candidate specificity-determining residues, seven of which were predicted to be within 10 Å of a bound protein substrate. Among these, an acidic Glu residue at position 170 (PKA numbering) correlated with P–3 Arg selectivity among CMGC kinases. This result contrasts with a previous prediction based on modeling of DYRK1A, the human homolog of Yak1 (34). To test our predictions, we examined the role of residue 170 in substrate selection. Indeed, a Ser to Glu mutation at the analogous position in the MAPK Kss1 (residue 147) conferred a basophilic signature (Fig. 4C and fig. S2). This result validates our ability to predict previously unknown specificity-determining residues on the basis of our large motif data set.
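The covariation idea can be illustrated with a small sketch. The Frobenius norm of the PWM difference is taken from the text; everything else is an assumption made for illustration. In particular, sequence variation is reduced here to a 0/1 mismatch at a single alignment column (the study used a pairwise similarity matrix), Pearson correlation stands in for the unspecified correlation statistic, and the function names are hypothetical.

```python
import itertools
import numpy as np

def pwm_distance(pwm_a, pwm_b):
    """Frobenius norm of the difference between two PWMs of equal shape."""
    diff = np.asarray(pwm_a, dtype=float) - np.asarray(pwm_b, dtype=float)
    return float(np.linalg.norm(diff, ord="fro"))

def column_covariation(column_residues, pwms):
    """Pearson correlation, over all kinase pairs, between residue mismatch at
    one alignment column and PWM distance (simplified stand-in statistic)."""
    pairs = list(itertools.combinations(range(len(pwms)), 2))
    seq_diff = np.array([float(column_residues[i] != column_residues[j]) for i, j in pairs])
    spec_diff = np.array([pwm_distance(pwms[i], pwms[j]) for i, j in pairs])
    if seq_diff.std() == 0 or spec_diff.std() == 0:
        return 0.0  # no variation, so no measurable covariation
    return float(np.corrcoef(seq_diff, spec_diff)[0, 1])

# Toy data: two kinases share motif A and residue E, two share motif B and residue Q.
motif_a = [[3.0, 0.5], [1.0, 1.0]]
motif_b = [[0.5, 3.0], [1.0, 1.0]]
pwms = [motif_a, motif_a, motif_b, motif_b]
print(column_covariation(["E", "E", "Q", "Q"], pwms))  # residue tracks specificity
print(column_covariation(["E", "E", "E", "E"], pwms))  # invariant column
```

Columns whose correlation stands out against the rest of the alignment would be the candidate specificity-determining residues; assessing significance would require a null model that this sketch does not include.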

Table 2

Computationally predicted kinase specificity–determining residues. Correlation values and peptide-kinase distance measurements are defined in Materials and Methods.


Connecting kinases to substrates on the basis of phosphorylation site motifs

Because in vivo phosphorylation sites on protein substrates tend to fall within the context of the phosphorylation site motif for a particular kinase, database scanning has been used to predict previously unknown substrates and to pinpoint sites of phosphorylation (14, 26, 35–39). However, simple sequence-matching approaches are prone to false positives, because predicted sites may not be accessible for phosphorylation, and kinases can also depend on docking or scaffolding interactions for substrate recruitment. In addition, false negatives are frequent for kinases with low sequence specificity because their motifs occur in many proteins and are thus present with high frequency in databases (14, 18). To increase the accuracy of such predictions, we generated and used a motif analysis pipeline, MOTIPS. MOTIPS scans sequence databases for sites that most closely match the PWM for a particular kinase with a modified algorithm based on the program Scansite (40). Predicted sites are then scored on the basis of a panel of features (evolutionary conservation, predicted surface accessibility, and disordered structure) that are characteristic of known phosphorylation sites (41–43).
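The motif-matching step can be sketched in a few lines. This is a deliberately simplified log-odds scan and not the Scansite-derived algorithm used by MOTIPS, whose details are not given in this section. The PWM representation (a dictionary keyed by offset from the phosphoacceptor), the function names, and the weights in the toy example are all hypothetical; the weights are made-up numbers loosely patterned on the MXRXXS motif reported for Vhs1, not measured selectivity values.

```python
import math

def score_site(sequence, index, pwm):
    """Sum of log2 selectivity values over the flanking offsets present in the PWM.

    pwm maps an offset (for example -3 for P-3) to a {residue: selectivity} dict.
    Residues absent from the dict are treated as neutral (selectivity 1.0), and
    offsets that run past either end of the sequence are skipped.
    """
    score = 0.0
    for offset, weights in pwm.items():
        j = index + offset
        if 0 <= j < len(sequence):
            score += math.log2(weights.get(sequence[j], 1.0))
    return score

def scan_sequence(sequence, pwm, acceptors="ST"):
    """Return (index, residue, score) for every candidate phosphoacceptor, best first."""
    hits = [(i, aa, score_site(sequence, i, pwm))
            for i, aa in enumerate(sequence) if aa in acceptors]
    return sorted(hits, key=lambda hit: hit[2], reverse=True)

# Hypothetical weights: Met favored at P-5, Arg favored at P-3.
toy_pwm = {-5: {"M": 2.0}, -3: {"R": 4.0}}
print(scan_sequence("AMARAASAAS", toy_pwm))  # [(6, 'S', 3.0), (9, 'S', 0.0)]
```

In a fuller pipeline, each scored site would then be combined with the additional features named above (conservation, surface accessibility, and disorder) before ranking.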

We first analyzed established kinase substrates for the presence of their respective phosphorylation site motifs with MOTIPS. From a sampling of 174 in vivo kinase-substrate relationships curated from the literature, 99 of the substrates ranked among the top 0.5% of predicted sites for their respective kinase, with 27 substrates falling within the top 200 sites (Fig. 5A). We next analyzed predicted substrates for each of the 61 yeast kinases for their associated biological processes and respective localization according to Gene Ontology (GO) assignments in the Saccharomyces Genome Database (44) (Fig. 5B; the full list of predicted substrates for each kinase with associated GO terms and MOTIPS features is provided as data set S2). We found that predicted substrates were more likely to be associated with the same biological process and to localize to the same subcellular compartment as their respective kinases than a randomly chosen set of proteins. Together, these observations suggest that motif scanning with our set of phosphorylation site motifs enriches for authentic kinase-substrate pairs.

Fig. 5

MOTIPS ranking of known and predicted kinase-substrate pairs. (A) Bar graph showing the number of protein substrates reported in the literature (true positives) that have at least one phosphorylation site falling within the indicated rank value of predicted substrates for its respective kinase. Shown are the 99 sites of 174 known kinase-substrate pairs analyzed that fall within the top 0.5% predicted sites for that kinase among all Ser or Thr residues in the yeast proteome. (B) GO analysis of predicted kinase-substrate relationships that fall within the top 100 predicted substrates for all 61 kinases analyzed. The graph shows the ratio of predicted kinase-substrate pairs sharing either an annotated biological process (left bars) or a subcellular compartment (right bars) in comparison to pairs of proteins chosen at random. For both pairs, the probability that the observed value falls within the random distribution is extremely low (P < 10^−35) based on the calculated area under the Gaussian curve corresponding to the random distribution.
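The significance estimate described in the Fig. 5B caption, an area under a Gaussian fitted to the random distribution, can be sketched as a one-sided tail probability. The function name and the toy numbers are assumptions; the sketch takes the Gaussian to be fitted from the sample mean and standard deviation of the randomized values, which the caption does not spell out.

```python
import math
import statistics

def gaussian_upper_tail(observed, random_values):
    """Return (z, p): the z-score of the observed value against the random
    distribution and the one-sided upper-tail probability under a Gaussian
    with the same mean and sample standard deviation."""
    mean = statistics.fmean(random_values)
    sd = statistics.stdev(random_values)
    z = (observed - mean) / sd
    p = 0.5 * math.erfc(z / math.sqrt(2.0))
    return z, p

# Toy numbers only: a random distribution centered on 10 with unit spread.
z, p = gaussian_upper_tail(23.0, [9.0, 10.0, 11.0])
print(z, p)
```

Using `math.erfc` rather than `1 - cdf` keeps the result accurate for very small tail probabilities, where direct subtraction would round to zero.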

To establish directly that our bioinformatics analysis had uncovered authentic substrates, we examined more closely the predicted substrates of the protein kinase Prk1. Prk1 is a member of a small family of kinases conserved throughout eukaryotes that mediates reorganization of the actin cytoskeleton during endocytosis (45). Our peptide array analysis revealed an unusual phosphorylation site motif that included strong preferences for aliphatic residues at the P–5 position, Gly at the P+1 position, and Thr as the phosphoacceptor (Fig. 1B and Table 1). We selected 107 Prk1 candidate substrates identified by MOTIPS for further analysis. These substrates contained sites of high, middle, and low rank among the top 2000 scoring sites. Because all five known Prk1 substrates undergo multisite phosphorylation (45–47), candidates were also chosen for having at least three predicted Prk1 phosphorylation sites. Of the 107 candidate substrates, we observed phosphorylation of 19 candidates in vitro with wild-type Prk1 but not with a Prk1 inactive mutant (fig. S3). To identify additional candidates, we used these 19 candidates as positive data points in a training set to educate MOTIPS by machine learning. Negative data points in the training set included 81 of the original Prk1 candidates that were unambiguously not substrates in vitro, as well as about 400 proteins identified in the yeast protein database as localizing solely to noncytosolic compartments (48).

This set of positive and negative data points was used to retrain the Bayesian algorithm in MOTIPS to integrate the motif matching, conservation, surface accessibility, and disorder scores for each site, along with an additional score based on the number of predicted sites. The five known in vivo substrates of Prk1, which were excluded from the training set, all fell within the top seven targets (Fig. 6A). Five additional candidates taken from the top 15 putative substrates in the new Prk1 hit list were tested by an in vitro kinase assay that used the purified candidates as substrates. These in vitro assays revealed three additional substrates for Prk1: Gon7, a protein component of the EKC/KEOPS (endopeptidase-like kinase chromatin-associated/kinase, putative endopeptidase and other proteins of small size) complex involved in telomere regulation; Gph1, a protein involved in the mobilization of glycogen; and the key endocytic protein Las17. One of the five additional candidates tested was Ypl150w, which is a putative kinase that autophosphorylated in our assay and thus could not be confirmed or excluded as a substrate of Prk1. This second round of in vitro assays provides additional evidence that retraining our algorithm increased our success rate in predicting authentic kinase substrates. Furthermore, among the 22 in vitro confirmed Prk1 substrates, seven proteins (Bem2, Ede1, Las17, Sac3, Sla2, Syp1, and Yap1801) are reported to have roles in endocytosis or the regulation of the actin cytoskeleton, suggesting that they may be subject to regulation by Prk1 (Table 3).
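One way to picture the retraining step is as a classifier that combines per-candidate feature scores into a single ranking. The sketch below uses a Gaussian naive Bayes model written with NumPy. This model is an assumption chosen for illustration: the text says only that a Bayesian algorithm integrated the scores, and the feature values shown are invented, not taken from the training set.

```python
import numpy as np

def fit_gaussian_nb(features, labels):
    """Fit per-class priors, means, and variances for a Gaussian naive Bayes model.

    features: array of shape (n_candidates, n_features), for example motif match,
              conservation, accessibility, disorder, and number of predicted sites.
    labels:   True for confirmed substrates, False for negatives.
    """
    X = np.asarray(features, dtype=float)
    y = np.asarray(labels, dtype=bool)
    model = {}
    for name, mask in (("pos", y), ("neg", ~y)):
        model[name] = {
            "prior": mask.mean(),
            "mean": X[mask].mean(axis=0),
            "var": X[mask].var(axis=0) + 1e-6,  # small floor avoids division by zero
        }
    return model

def log_odds(model, x):
    """Log posterior odds that a candidate with feature vector x is a substrate."""
    x = np.asarray(x, dtype=float)

    def log_joint(c):
        ll = -0.5 * (np.log(2 * np.pi * c["var"]) + (x - c["mean"]) ** 2 / c["var"])
        return np.log(c["prior"]) + ll.sum()

    return float(log_joint(model["pos"]) - log_joint(model["neg"]))

# Invented two-feature training data: positives score high, negatives score low.
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [True, True, False, False]
model = fit_gaussian_nb(train_x, train_y)
print(log_odds(model, [0.85, 0.85]) > 0, log_odds(model, [0.15, 0.15]) < 0)  # True True
```

Ranking all candidates by this log-odds score would produce a retrained hit list analogous in spirit to the one described above; held-out known substrates, such as the five excluded from training, give a natural check on the ranking.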

Fig. 6

Prediction and confirmation of kinase-substrate relationships. (A) Top 15 hits from the trained Prk1 MOTIPS output. The Prk1 hit list of candidate substrates was subjected to machine learning with a training set consisting of 19 true positives (experimentally derived) and ~480 true negatives (experimentally derived and supplemented with those proteins that are known to solely localize to noncytosolic compartments). Known in vivo substrates of Prk1 are highlighted in yellow. (B) Electrophoretic mobility shift analyses of Bem2 and Ede1. TAP (tandem affinity purification)–tagged Bem2 and Ede1 were purified from wild-type (WT) or prk1Δ ark1Δ strains by immobilized immunoglobulin G and then incubated in the presence or absence of phosphatases followed by immunoblotting against the TAP tag. (C) Mobility shift confirms Sol2 as an in vivo substrate of Vhs1. Lysates from wild-type (WT) or vhs1Δ strains expressing TAP-tagged Sol2 were fractionated on denaturing polyacrylamide gels impregnated with Phos-tag (57), which retards the mobility of phosphoproteins, followed by immunoblotting against the TAP tag.

Table 3

Proteins phosphorylated by Prk1 in vitro. Proteins functionally associated with actin rearrangement or endocytosis are Bem2, Ede1, Las17, Sac3, Sla2, Syp1, and Yap1801. Ub, ubiquitin; RhoGAP, Rho guanosine triphosphatase–activating protein; SCF, Skp1-Cullin-F-box; UPR, unfolded protein response; WASP, Wiskott-Aldrich syndrome protein.


We next investigated whether our predicted Prk1 candidate substrates represented bona fide substrates. Because a closely related kinase, Ark1, has an overlapping biological function and shares a nearly identical phosphorylation site motif with Prk1, we examined the phosphorylation state of candidate substrates in yeast strains deleted for both PRK1 and ARK1. Changes in phosphorylation were monitored by electrophoretic mobility shifts in immunoblots of purified substrates, with phosphatase-treated samples serving as a control for the unphosphorylated species. We observed a change in mobility for two candidate substrates, Bem2 and Ede1, suggesting that they are in vivo targets of Prk1 or Ark1, or both (Fig. 6B). Although we did not observe gel shifts for other substrates, it is likely that some are authentic Prk1/Ark1 substrates as well but simply do not change mobility upon phosphorylation. Notably, previous MS phosphoproteomic analysis identified three of the in vitro Prk1 substrates (Ede1, Syp1, and Rpl5) as phosphorylated at Prk1 consensus sites in vivo (49–54) (the MOTIPS output for all kinases, which is available as data set S2, indicates which candidate phosphorylation sites have been identified by MS).

We also validated kinase-substrate pairs through integration with other proteomic data sets. We found that the kinase Vhs1, for which limited functional information is known, exhibited selectivity for the phosphorylation site motif MXRXXS (Table 1 and table S1). Fourteen in vitro substrates for the kinase Vhs1 (55) were previously identified by protein microarray analysis (4), and six of these, Mga1, Pfk26, Sef1, Sol1, Sol2, and Utr1, contain the Vhs1 consensus phosphorylation site motif. MS phosphoproteomic analysis (49) revealed that Sef1 was phosphorylated in vivo at a Vhs1 consensus phosphorylation site, and in an immunoprecipitation-MS analysis Sef1 and Vhs1 physically interacted (56). In addition, MS phosphoproteomic analysis identified Sol1 as phosphorylated at a Vhs1 consensus phosphorylation site in vivo (50), and its homolog Sol2 was the most highly phosphorylated Vhs1 in vitro substrate identified by protein microarray analysis (4). Mobility shift analysis of VHS1 deletion strains using Phos-tag SDS–polyacrylamide gel electrophoresis (SDS-PAGE) (57) was consistent with Sol2 as a substrate for Vhs1 in vivo (Fig. 6C). Although the presence of multiple Sol2 species in the presence and absence of Vhs1 indicates phosphorylation at multiple sites, likely by more than one kinase, the mobility shift indicates that in vhs1 mutant cells, Sol2 is phosphorylated at fewer sites. Sol2, which promotes nucleocytoplasmic transfer RNA (tRNA) transport (58), is the first reported in vivo substrate for Vhs1 and suggests a role for this kinase in regulating this process. These results illustrate how integration of data from multiple proteomic approaches can shed light on the biology of poorly characterized molecules.


The elucidation of the mechanisms underlying kinase specificity remains an integral part of understanding phosphorylation-based signal transduction pathways. Previous methods for determining consensus phosphorylation site motifs have not been suitable for large-scale screening of a eukaryotic kinome. Here, we have described an approach for the high-throughput identification of consensus phosphorylation site motifs in which multiple kinases, with no previously known substrates, can be analyzed simultaneously. We have used this approach to provide comprehensive analysis of kinase specificity in a single eukaryotic organism, the yeast S. cerevisiae. Among other applications, this large data set has provided much broader insight into the structural basis for kinase selectivity than has been possible through individual analyses of single kinases.

With our data, we linked protein kinases to previously unknown substrates, thus elucidating mechanisms of phosphorylation-dependent signaling. A limitation to our approach, however, is that the peptide arrays treat each position in the substrate independently, and thus the potential interdependence between multiple positions is ignored. This approach is nonetheless a valuable first-pass screen for analyzing kinase specificity because it involves the systematic and exhaustive analysis of each amino acid residue at each position surrounding the phosphorylation site. Preferences observed with this approach can provide the basis for the design of kinase-specific peptide libraries to uncover positional interdependence. Furthermore, the presence of a consensus phosphorylation sequence alone is insufficient to direct phosphorylation of a protein by a particular kinase, and, accordingly, identification of previously unknown substrates on the basis of motif scanning is difficult. However, integration with other proteomic data sets provides a means of increasing confidence in predicted kinase-substrate relationships. In addition, specific kinase-substrate pairs can be inferred through computational methods that make use of non–sequence-based “contextual” features, such as subcellular localization and molecular function (38). For example, prediction of substrates targeted by relatively nonspecific kinases with phosphorylation site motifs alone is unlikely to be successful because these sequences occur frequently in proteomes. In such cases, selection of authentic substrates is driven by docking or scaffolding interactions, and consensus sequences for substrate recruitment can be used in combination with phosphorylation site motifs to identify previously unknown substrates (59, 60).

For previously characterized kinases, we observed a high degree of conservation of phosphorylation site motifs between yeast and mammalian orthologs. These similarities suggest that the many previously unknown consensus motifs reported here are also conserved. Therefore, this data set will serve as a resource for studies of phosphorylation-dependent signaling in higher eukaryotes, as well as yeast.

Materials and Methods

Details regarding yeast strain information, kinase preparation, characterization of purified kinases, in vitro kinase assays, and electrophoretic mobility shift analyses are available in the Supplementary Materials.

Peptide library screening

The peptide library (Anaspec Inc.) has been previously reported (13). For this study, fresh stock solutions were made from 5 mg of powder by dissolving peptides in dimethyl sulfoxide (DMSO), quantifying by absorbance at 280 nm, and adjusting to a stock concentration of 10 mM by adding the appropriate volume of DMSO. Stock solutions were stored at −20°C in microcentrifuge tubes. Working 0.6 mM aqueous stocks were prepared by diluting the DMSO stock in 20 mM Hepes (pH 7.4) and arrayed into 1536-well stock plates containing 5-μl aliquots in each well. Plates were sealed with adhesive foil and stored at −20°C.

Peptides (0.2 μl per well) were transferred to assay plates containing 2 μl of kinase reaction buffer [generally 20 mM Hepes (pH 7.4), 10 mM MgCl2, 1 mM dithiothreitol (DTT), 0.1% Tween 20] from stock plates manually with a 48 × 6 slot pin replicator (VP Scientific). Reactions were initiated by adding a solution (0.2 μl per well) containing purified kinase and [γ-33P]ATP (0.55 mM, 0.3 to 0.4 μCi/μl; Perkin Elmer) with a 48 × 1 slot pin replicator (VP Scientific). Plates were sealed and incubated for 1 to 8 hours at 30°C. The final concentrations of the reaction components in each well were 50 μM peptide and 50 μM ATP at a specific activity of 0.55 to 0.73 mCi/μmol. After incubation, 0.2 μl from each well was spotted onto streptavidin-coated membrane (SAM2 Biotin Capture Membrane, Promega) simultaneously using the 48 × 6 slot pin replicator. Membranes were washed three times with 10 mM tris-HCl (pH 7.5) with 140 mM NaCl and 0.1% SDS, twice with 2 M NaCl, twice with 2 M NaCl with 1% H3PO4, and twice with water, then dried and exposed to a phosphor storage screen. Processing of final images of the spot arrays consisted of copying the 4 × 22 grid corresponding to the P+1, P+2, P+3, and P+4 peptide mixtures and pasting it below the 5 × 22 grid corresponding to the P–5, P–4, P–3, P–2, and P–1 peptide mixtures with Adobe Photoshop to provide the 9 × 22 spot grids shown in Fig. 1 and table S1.
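The final concentrations quoted above can be checked with simple dilution arithmetic. The sketch below is only an illustrative calculation, not part of the published protocol; it assumes a total well volume of 2.4 μl (2 μl buffer plus two 0.2-μl additions) and uses the 0.6 mM peptide stock and 0.55 mM ATP stock given in the text. The function names are hypothetical.

```python
def final_conc_uM(stock_uM, added_ul, total_ul):
    """Concentration (uM) after diluting `added_ul` of stock into `total_ul`."""
    return stock_uM * added_ul / total_ul


def specific_activity_mCi_per_umol(uCi_per_ul, atp_mM):
    """Specific activity of the ATP stock.

    1 mM equals 1 nmol/ul, and 1 uCi/nmol equals 1 mCi/umol, so the
    ratio of the two stock values is already in mCi/umol.
    """
    return uCi_per_ul / atp_mM


# Assumed total volume: buffer + peptide transfer + kinase/ATP transfer.
total_ul = 2.0 + 0.2 + 0.2

peptide_uM = final_conc_uM(600.0, 0.2, total_ul)  # 0.6 mM working stock
atp_uM = final_conc_uM(550.0, 0.2, total_ul)      # 0.55 mM ATP stock

low = specific_activity_mCi_per_umol(0.3, 0.55)
high = specific_activity_mCi_per_umol(0.4, 0.55)

print(f"peptide: {peptide_uM:.1f} uM, ATP: {atp_uM:.1f} uM")
print(f"specific activity: {low:.2f} to {high:.2f} mCi/umol")
```

Under these assumptions the peptide comes out at 50 μM and the ATP at roughly 46 μM, which is consistent with the nominal 50 μM reported, and the specific activity range matches the stated 0.55 to 0.73 mCi/μmol.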

PWM generation

For each array, peptide phosphorylation signals were quantified with Genepix Pro 6.0 (Molecular Devices) by manually aligning a 48 × 8 grid of circles onto each scanned phosphorimage to calculate the median intensity for each spot. These median intensity values were then background-corrected by subtracting the median intensity value corresponding to the negative control spot (reaction carried out in the absence of any peptide substrate). Signal scores for each amino acid at each position were then normalized by the following equation:

$$Z_{ca} = \frac{S_{ca}}{\sum_{i=1}^{m} S_{ci}} \times m$$

where $Z_{ca}$ stands for the normalized score of amino acid $a$ at position $c$ having a signal score $S_{ca}$, and $m$ stands for the total number of amino acids. $S_{ci}$ is the signal score of amino acid $i$ at position $c$, where $i$ runs over all $m$ amino acids in the summation. The PWM is an N × 20 matrix of N positions with the normalized, background-corrected value given as the weight for each amino acid at each position. To account for spurious phosphorylation of Ser and Thr residues at other positions, the PWM entries in all Ser and Thr positions were set to 1 (equivalent to neutral selection at that position) with subsequent renormalization of the PWM.
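The normalization described above (each signal divided by the sum of signals at that position, then multiplied by the number of amino acids) can be sketched as follows. This is a minimal illustration and not the authors' code. The text does not say exactly how the PWM was renormalized after the Ser and Thr entries were set to 1, so the rescaling used here (holding Ser and Thr at 1 and rescaling the remaining entries so the row again sums to m) is an assumption, as are the function names and the four-letter toy alphabet.

```python
def normalize_position(signals):
    """Apply Z = S / sum(S) * m to one position.

    `signals` maps amino acid -> background-corrected signal and is
    assumed to be nonnegative with a positive sum.
    """
    m = len(signals)
    total = sum(signals.values())
    return {aa: s / total * m for aa, s in signals.items()}


def neutralize(z, neutral=("S", "T")):
    """Set Ser/Thr weights to 1 and rescale the rest (assumed scheme).

    The remaining entries are scaled so that the row still sums to m.
    """
    m = len(z)
    fixed = [aa for aa in z if aa in neutral]
    free_total = sum(v for aa, v in z.items() if aa not in neutral)
    scale = (m - len(fixed)) / free_total
    return {aa: (1.0 if aa in neutral else v * scale) for aa, v in z.items()}


# Toy position with a hypothetical four-letter alphabet.
toy = {"A": 1.0, "R": 5.0, "S": 3.0, "T": 1.0}
z = normalize_position(toy)
row = neutralize(z)
print(z)
print(row)
```

With this normalization a weight of 1 means no selection, values above 1 indicate preferred residues, and values below 1 indicate disfavored residues.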

Proteome scanning

The entire yeast proteome was scanned to identify the best matches to each PWM. Our approach used a window-sliding method based on the normalized PWM similar to the method used in Scansite (40). Briefly, it extracted every possible 15-mer sequence from the yeast proteome and calculated the match score to the PWM, based on the formula:

$$S = \sum_{i} \sum_{a = r_i} \log(M_{ia})$$

where $i$ stands for the position in the motif and $r_i$ stands for the residue that is present at position $i$ in the peptide in question. $M_{ia}$ is the normalized PWM as described above. The resulting score was then normalized, such that zero stands for an optimal match to the motif and larger positive scores correspond to weaker matches. The top 10,000 potential phosphorylation sites for each kinase are reported in data set S2. This algorithm was implemented in a modular form in Java. All sequences and features were loaded into an SQL database that is interactively queried by the Java search module.
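A window-sliding scan of this kind can be sketched in a few lines. The following is not the authors' Java module; it is a simplified illustration with hypothetical function names. Two details are assumptions: the score is normalized by subtracting it from the best achievable score (one simple way to make zero the optimal match and larger values weaker matches), and a small floor is applied to weights to avoid taking the logarithm of zero. The sketch also scores every window rather than only those centered on Ser or Thr.

```python
import math


def match_score(peptide, pwm, floor=1e-6):
    """Sum of log PWM weights for the residue observed at each position."""
    return sum(
        math.log(max(pwm[i].get(res, floor), floor))
        for i, res in enumerate(peptide)
    )


def best_score(pwm, floor=1e-6):
    """Score of the optimal peptide (highest weight at every position)."""
    return sum(math.log(max(max(row.values()), floor)) for row in pwm)


def scan(sequence, pwm):
    """Score every window; 0 is an optimal match, larger is weaker.

    The normalization (best minus observed) is an assumed scheme.
    """
    width = len(pwm)
    best = best_score(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        hits.append((start, window, best - match_score(window, pwm)))
    return sorted(hits, key=lambda h: h[2])


# Toy two-position PWM over a hypothetical two-letter alphabet.
toy_pwm = [{"A": 0.5, "R": 1.5}, {"A": 1.5, "R": 0.5}]
hits = scan("ARAR", toy_pwm)
print(hits[0])
```

In practice the list would be truncated to the top-ranked windows per kinase, as is done for the 10,000 sites reported per kinase in data set S2.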

Feature collection

A number of different genomic features were gathered to supplement the initial match score. To compute the conservation score, we collected all orthologs for 13 proteomes of related yeast species (Saccharomyces paradoxus as the closest and Schizosaccharomyces pombe as the farthest) using the comparative genomics algorithm implemented in INPARANOID (61). We then aligned these orthologs using the automated alignment method MUSCLE (62) (the full set of alignments is available as data set S3). For each PWM hit, we calculated the conservation score by estimating the entropy at each position based on the aligned orthologs with the AL2CO program. The disorder score was based on the prediction program DISOPRED (63). DISOPRED was run for each protein in the yeast proteome. We used the DISOPRED probability score, corresponding to the likelihood of the residue in question being in a disordered region, as the measure of disorder. Finally, the surface accessibility score was calculated with the prediction program SABLE for each protein in the yeast proteome (64). The simple numerical surface score was used as the measure of surface accessibility.
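To illustrate what an entropy-based conservation score measures, the sketch below computes the Shannon entropy of a single alignment column. It is a stand-in for illustration only and does not reproduce AL2CO, which offers its own weighting and scoring options; the function names and the toy alignment are hypothetical. Lower entropy means a more conserved position.

```python
import math
from collections import Counter


def column_entropy(column, gap="-"):
    """Shannon entropy (bits) of the residues in one alignment column.

    Gap characters are ignored; returns None for an all-gap column.
    """
    residues = [c for c in column if c != gap]
    if not residues:
        return None
    n = len(residues)
    counts = Counter(residues)
    return -sum((k / n) * math.log2(k / n) for k in counts.values())


def site_entropy(aligned_seqs, pos):
    """Entropy of alignment column `pos` across aligned ortholog sequences."""
    return column_entropy([seq[pos] for seq in aligned_seqs])


# Toy alignment of four hypothetical orthologs.
alignment = ["RSPT", "RSAT", "RTP-", "RTAT"]
print([site_entropy(alignment, p) for p in range(4)])
```

A fully conserved column gives an entropy of 0, and a column split evenly between two residues gives 1 bit.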

Feature integration

An integration algorithm based on the naïve Bayes framework was used to integrate the four features. We used a number of experimentally determined gold-standard kinase-substrate pairs, “positives,” to train the algorithm. For gold-standard negatives, we supplemented a set of experimentally determined negatives with a set of randomly chosen protein pairs. Each of these pairs consists of two proteins that are annotated to localize exclusively to two different compartments (for example, nucleus only and cytoplasm only). Thus, we biased the randomly chosen set of protein pairs further toward a set that was highly unlikely to contain any spurious positive interactions. The conditional probability was calculated from the four features according to the following formula:

$$p(I \mid D_1, D_2, D_3, D_4) = p(I)\,p(D_1 \mid I)\,p(D_2 \mid I)\,p(D_3 \mid I)\,p(D_4 \mid I)$$

where $I$ denotes either interaction or noninteraction and $D_1$ through $D_4$ denote the four features. Data were thus integrated under the assumption that the four features are independent. To formally assess independence of the features, we calculated pairwise correlation coefficients. The pairwise correlation coefficients ranged from 0.01 to 0.57 in absolute value, with an average of 0.18, indicating that the features are to a large extent independent (see table S3). Moreover, we performed principal components analysis (PCA) using the statistical software R to transform the possibly correlated values of the five features (hits per protein, match score, disorder score, accessibility score, and conservation score) of the Prk1 targets into uncorrelated values. The first three vectors were chosen to build a naïve Bayes model followed by a 10-fold stratified cross-validation. The area under the curve (AUC; 75.9%) of the receiver operating characteristic (ROC) curve resulting from the PCA validation was then compared to the AUC (78.6%) of Prk1 without the PCA transformation. The similar performance of the two models further indicated that the features are largely independent.
Bayesian integration was implemented with the Java machine learning package Weka (65). The entire methodology is available as the modularized software packages MOTIPS (
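The naïve Bayes combination of features can be illustrated with a small self-contained sketch. This is not the Weka implementation used in the study. It assumes that each feature has already been discretized into integer bins, uses add-one (Laplace) smoothing, and divides by the sum over both classes so that the posteriors add up to 1. All names and the toy training data are hypothetical.

```python
def train_naive_bayes(examples, n_bins, alpha=1.0):
    """Estimate p(I) and p(D_d | I) from (label, feature_bins) examples.

    `alpha` is an assumed Laplace smoothing constant.
    """
    labels = sorted({lab for lab, _ in examples})
    n_feat = len(examples[0][1])
    prior, cond = {}, {}
    for lab in labels:
        rows = [f for l, f in examples if l == lab]
        prior[lab] = len(rows) / len(examples)
        cond[lab] = []
        for d in range(n_feat):
            counts = [alpha] * n_bins
            for f in rows:
                counts[f[d]] += 1
            total = sum(counts)
            cond[lab].append([c / total for c in counts])
    return prior, cond


def posterior(model, features):
    """p(I | D_1..D_n): product of prior and likelihoods, then normalized."""
    prior, cond = model
    joint = {}
    for lab in prior:
        p = prior[lab]
        for d, b in enumerate(features):
            p *= cond[lab][d][b]
        joint[lab] = p
    z = sum(joint.values())
    return {lab: p / z for lab, p in joint.items()}


# Toy data: four binary features (1 = favorable bin, 0 = unfavorable bin).
toy = [("pos", (1, 1, 1, 1))] * 3 + [("neg", (0, 0, 0, 0))] * 3
model = train_naive_bayes(toy, n_bins=2)
post = posterior(model, (1, 1, 1, 1))
print(post)
```

Because the likelihoods are simply multiplied, correlated features would be counted more than once, which is why the independence checks described above matter.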

Covariation calculation to estimate specificity-determining residues

Sequences of the 61 yeast kinase catalytic domains (obtained from the database) were initially aligned with ClustalW2 (66). A high-quality sequence alignment was generated by manual editing of the initial alignment in Jalview (67) on the basis of multiple pairwise alignments with kinases of known three-dimensional (3D) structure and conserved catalytic residues (table S4). In addition, 89 orthologous kinases from S. pombe, Dictyostelium discoideum, and Homo sapiens were added and manually aligned. For these orthologs, the PWM was inferred to be identical to its yeast counterpart. A correlation-based methodology was implemented to identify specificity-determining residues:

For each (n, m) pair of sequence positions (n) and positions in the PWM (m), two vectors were generated, each with one entry per pair of kinases [k(k − 1)/2 entries, where k is the total number of kinases in the alignment and is equal to the number of PWMs]. The first vector contained all pairwise similarities between the primary sequences of the kinases in that position, based on the McLachlan matrix (that is, the similarity of the amino acid in position X in kinase A to the amino acid in the same position in kinase B) (68). The McLachlan matrix was chosen because it scores residue substitutions on the basis of chemical similarity (that is, physicochemical properties). The second vector contained the pairwise similarity of all PWMs to each other, based on the Frobenius norm (69):

$$\sqrt{\sum_{i,j} \left(M_{1,ij} - M_{2,ij}\right)^2}$$

Each position was then scored with the Pearson correlation coefficient of these two vectors (listed under “correlation” in Table 2). This method was implemented in the programming package MATLAB. Distances of the residue in question from bound peptide were estimated by mapping the residue onto the PKA-PKI structure [Protein Data Bank (PDB) ID 1ATP] with the program VMD. Peptide-kinase distances were measured as the shortest distance from the geometric center of the kinase residue, as mapped onto the PKA structure, to the bound peptide in that structure.
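The covariation score for a single alignment column can be sketched as follows. This is not the authors' MATLAB code. The McLachlan matrix is not reproduced here, so a hypothetical identity similarity (1 for identical residues, 0 otherwise) stands in for it, and the function names and toy data are assumptions. Note that the Frobenius norm of a difference is a distance, so in this sketch a position that tracks specificity yields a strongly negative correlation; how the published method converted this quantity to a similarity is not stated in the text.

```python
import math
from itertools import combinations


def frobenius_distance(m1, m2):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(
        sum((a - b) ** 2 for r1, r2 in zip(m1, m2) for a, b in zip(r1, r2))
    )


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def position_correlation(column, pwms, similarity):
    """Correlate residue similarity with PWM distance over all kinase pairs."""
    pairs = list(combinations(range(len(column)), 2))
    seq_sim = [similarity(column[i], column[j]) for i, j in pairs]
    pwm_dist = [frobenius_distance(pwms[i], pwms[j]) for i, j in pairs]
    return pearson(seq_sim, pwm_dist)


def identity(a, b):
    """Hypothetical stand-in for a McLachlan similarity lookup."""
    return 1.0 if a == b else 0.0


# Four toy kinases: the residue at this column tracks the PWM exactly.
column = ["R", "R", "E", "E"]
pwms = [[[2.0, 0.0]], [[2.0, 0.0]], [[0.0, 2.0]], [[0.0, 2.0]]]
r = position_correlation(column, pwms, identity)
print(r)
```

Applying `position_correlation` to every alignment column and every PWM position would give the full table of scores from which candidate specificity-determining residues are ranked.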


Funding: This study was supported by US NIH grants to M.S., B.E.T. (GM079498), N.M.H. (GM50717), W.A.L. (GM55040), and D.F.S. (CA82257); by Howard Hughes Medical Institute (W.A.L.); and by a Swiss National Science Foundation grant to C.D.

Author contributions: J.M., S.P., G.R.J., D.L.S., S.A.P., V.D., and B.E.T. performed experiments; P.M.K., H.Y.K.L., and M.B.G. performed computational work; J.M., X.Z., G.R.J., S.A.P., V.D., M.J., E.C., H.N., M.G., A.R., J.-L.N.M., Y.-J.S., H.E.S., R.S., C.S.M.C., C.D., N.M.H., W.A.L., D.F.S., B.S., and B.J.A. prepared and characterized protein kinases and expression constructs; and J.M., P.M.K., H.Y.K.L., M.B.G., M.S., and B.E.T. designed experiments, analyzed data, and wrote the paper.

Competing interests: M.S. consults for Affomix, which has an interest in proteomics, including phosphoproteomics.

Supplementary Materials

Materials and Methods

Fig. S1. Assay reproducibility.

Fig. S2. Representative peptide array screening results for the Kss1 S147E mutant.

Fig. S3. Representative Prk1 in vitro assays.

Table S1. Representative peptide array screening results and sequence logos for each of the 61 kinases assayed (also available as a separate PDF with high-resolution images as 2000482tableS1.pdf).

Table S2. Protein kinases analyzed in this study.

Table S3. Pairwise correlation coefficients for each of four genomic features and the Scansite match score.

Table S4. Alignment of yeast kinases analyzed in this study.


Data set S1. Average PWMs for each of the 61 kinases assayed (plain text files).

Data set S2. MOTIPS output for each of the 61 kinases assayed (plain text files).

Data set S3. MUSCLE alignment of all predicted S. cerevisiae ORFs with orthologs from 12 other yeast species (clustal alignment files).

Data set S4. Alignment of yeast kinases analyzed in this study (clustal alignment file).

References and Notes
