Research Resource | Biochemistry

Genomics and evolution of protein phosphatases

Science Signaling  11 Apr 2017:
Vol. 10, Issue 474, eaag1796
DOI: 10.1126/scisignal.aag1796

Evolution of the diverse phosphatome

Protein kinases and protein phosphatases are two sides of a major posttranslational mode of regulation, and dysfunction in either underlies many human diseases. Relative to kinases, protein phosphatases are more structurally and catalytically diverse. Using genomic data on protein phosphatases across eukaryotic animal kingdoms, Chen et al. created a human protein phosphatome and a map of its evolution. They also tabulated disease associations with various human phosphatases. The findings provide a rich resource for exploring the phosphatome in development, physiology, and human disease.

Abstract

Protein phosphatases are the essential opposite to protein kinases; together, these enzymes regulate all protein phosphorylation and most cellular processes. To better understand the global roles of protein phosphorylation, we cataloged the human protein phosphatome, composed of 189 known and predicted human protein phosphatase genes. We also identified 79 protein phosphatase pseudogenes or retrogenes, some of which may have residual function. We traced the origin and diversity of phosphatases by building protein phosphatomes for eight other eukaryotes, from the protist Dictyostelium to the sea urchin. We classified protein phosphatases from all nine species into a hierarchy of 10 protein folds, 21 families, and 178 subfamilies. We found that >80% of the 101 human subfamilies were conserved across the animal kingdom, but show substantial differences in evolution, including losses and expansions of individual subfamilies and changes in accessory domains. Protein phosphatases show similar evolutionary dynamics to those of kinases, with substantial losses in major model organisms. Sequence analysis predicts that 26 human protein phosphatase domains are catalytically disabled and that this disability is mostly conserved across orthologs. This genomic and evolutionary perspective on protein phosphatases provides a framework for global analysis of protein phosphorylation throughout the animal kingdom.

INTRODUCTION

Protein phosphatases catalyze the removal of phosphate groups from proteins by hydrolysis, reversing the action of protein kinases. Protein kinases and protein phosphatases are key joint regulators of many cellular processes and human diseases (1, 2). Unlike protein kinases, whose catalytic domains are mostly from a single protein structural fold (3), protein phosphatase activity is associated with many different protein folds and catalytic mechanisms (4, 5), and several of these folds also have other hydrolase activities or are present in nonenzymatic proteins (6).

We collated human phosphatase genes from the literature, databases, and genomic data to create a comprehensive human protein phosphatome. We traced their evolution by defining the protein phosphatomes of another eight eukaryotic organisms from major taxonomic groups, including three other bilaterians (the sea urchin Strongylocentrotus purpuratus, fruit fly Drosophila melanogaster, and nematode Caenorhabditis elegans), two basal metazoans (the sea anemone Nematostella vectensis and the sponge Amphimedon queenslandica), a unicellular choanoflagellate (Monosiga brevicollis), a yeast (Saccharomyces cerevisiae), and Dictyostelium discoideum (fig. S1). We also profiled human pseudophosphatases, phosphatase pseudogenes, and disease associations of the protein phosphatome and kinome. This study provides a comprehensive catalog and classification of human protein phosphatases and traces their complex evolutionary history. All results and extended analyses are available at http://phosphatome.net (7).

RESULTS

Definition of the human phosphatome and protein phosphatome

We defined the human phosphatome as the set of known phosphatases (with protein or nonprotein substrates) and their close homologs. We identified proteins that contain phosphatase catalytic domains from the literature and databases (1, 2, 4, 8–16) and used protein sequence similarity to detect additional homologs (text S1). This generated a human phosphatome of 264 phosphatases (table S1), accounting for 1.17% of the human proteome. We classified phosphatases into 19 folds and 45 families based primarily on the SCOP (Structural Classification of Proteins) structural database (fig. S2 and tables S1 to S3) (17).

We defined the human protein phosphatome liberally, by including all members of any family with any protein phosphatase activity. We used this approach because many phosphatases are poorly characterized and to provide a comprehensive genomic and evolutionary context. These families are from 10 different folds, indicating that protein phosphatase activity has many evolutionary origins (Fig. 1, tables S1 to S3, and text S2). Three different folds—CC1 (Cys-based class I), CC2 (Cys-based class II), and CC3 (Cys-based class III)—share a common catalytic motif of CX5R (14), suggesting convergent evolution.

Fig. 1 Human protein phosphatases are from 10 structural folds.

Numbers of human protein phosphatases per fold are shown. Three folds share a common cysteine-based catalytic motif. Protein Data Bank IDs of the representative structures: CC1 (1VHR), CC2 (5PNT), CC3 (1QB0), PPM (1A6Q), PPPL (1AUI), HAD (2GHT), AP (alkaline phosphatase) (1ZEB), HP (histidine phosphatase) (1ND6), PHP (protein histidine phosphatase) (2HW4), and RTR1 (4FC8).

More than half of the human protein phosphatases belong to the CC1 fold (Fig. 2). This includes the PTP (protein tyrosine phosphatase) family of receptor and nonreceptor tyrosine phosphatases that play a major role in intercellular communication and intracellular signal transduction, as well as the DSP (dual specificity phosphatase) family and the PTEN family of lipid phosphatases. The PPM (metal-dependent protein phosphatase) and PPP-like (PPPL) fold phosphatases are pSer/pThr-specific and have a wide variety of functions, and the remaining folds have more specialized roles (18).

Fig. 2 Phylogenetic relationships of human protein phosphatases.

Ten trees show the relationships among the 189 genes of the human protein phosphatome. Trees were generated from reference multiple alignments of phosphatase domains from all nine species, edited and rendered by hand, as described in text S1. Branch lengths are not drawn to scale. Subfamilies are denoted by dark circles and families by triangles. Fold names are underlined, and families are labeled in gray. Branches are colored by evolutionary origin (table S6). D1 and D2 represent dual domains of PTPs. Detailed alignments and scaled trees are available in data S1 and at http://phosphatome.net/tree/.

In the human phosphatome (Fig. 2 and table S1), we included several CC1 families that are best known as nonprotein phosphatases that are important in signaling: the phosphoinositide phosphatases PTEN, myotubularin (19), and Sac (20), and Paladin (21), for which no catalytic activity has been reported (fig. S3). The HAD (haloacid dehalogenase) fold has many enzymatic activities (6); we excluded ~50% of the HAD families because they have no known substrate (6), although some may have undiscovered phosphatase activity. The resulting protein phosphatome consists of 189 genes from 10 folds and 20 families (Fig. 2), which account for 0.84% of the human proteome (Table 1, tables S1 and S4, and fig. S4). One hundred five of these have documented protein phosphatase activity in databases and literature (table S1). Thirty more have nonprotein substrates, mostly phosphoinositide lipids, and 54 (including 16 verified or predicted pseudophosphatases) have no known substrate. To find protein phosphatases missed in protein databases, we searched human transcript, expressed sequence tag (EST), and chromosomal sequences with a custom library of hidden Markov model (HMM) profiles (see Materials and Methods). We found 79 protein phosphatase pseudogenes and retrogenes but no additional functional phosphatases. These pseudogenes and retrogenes are omitted from Fig. 2.

Table 1 Protein phosphatase and kinase repertoires in the nine organisms.

The numbers of kinases are from KinBase (81); the sizes of the proteomes were estimated from the literature (see text S9).

Identification and classification of protein phosphatases in eight eukaryotes

We constructed protein phosphatomes from eight other species that reflect major stages of evolution (fig. S1) and used these to track the evolution of human protein phosphatases. We identified protein phosphatase homologs from chromosomal, predicted protein, and EST sequences (tables S4 and S5 and text S3), followed by manual curation (see Materials and Methods). We found 1236 protein phosphatases in the eight organisms (Table 1), including 16 previously unannotated genes, and improved the predicted sequences of 315 genes (tables S4 and S5). In addition to the 20 human protein phosphatase families, we found an additional CC1-fold family, OCA, which is present from Dictyostelium to sponge but absent from all five eumetazoans used in the study except sea urchin (Fig. 3). OCA phosphatases are involved in oxidant-induced cell cycle arrest in yeast. All nine species have more protein kinases (average 2.4% of proteome) than protein phosphatases (1.0% of proteome) (Table 1).

Fig. 3 Phylogenetic profile of protein phosphatase genes by family.

Numbers are count of family members by species. Cells are colored by gene number. The dendrogram on the top shows the species phylogeny.

We classified the 1425 protein phosphatases of the nine organisms into 10 folds and 21 families based on sequence similarity with SCOP domains and human phosphatase domains. We further classified 1336 of these into 178 subfamilies, using a hybrid approach that includes sequence similarity of the phosphatase domain and full-length protein, domain combination, known functions, and phylogeny (see Materials and Methods, table S4, and fig. S5). One hundred one subfamilies are found in humans (Fig. 2).

Origin and evolution of the protein phosphatome

We compared the nine phosphatomes to map the history of all human phosphatases and trace the evolutionary dynamics of all phosphatases. We reconstructed ancestral states by Dollo parsimony (Fig. 4), showing that their common ancestor had an estimated 40 phosphatase subfamilies, from 10 folds and 19 families (Fig. 3), 32 of which are found in human (Fig. 4 and table S6). At every branch point in the tree, we see the creation of new subfamilies. We also see losses, often of phosphatase subfamilies that are strongly conserved in other clades (Fig. 3 and table S6).
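The Dollo principle behind this kind of reconstruction can be illustrated with a minimal sketch. It is not the software or data used in the study: the five-leaf tree, the leaf names, and the presence sets are all hypothetical. Under Dollo parsimony a subfamily is gained once, at the most recent common ancestor of the species that have it, and every absence below that point is explained by loss.

```python
# Toy illustration of Dollo parsimony (hypothetical tree and data).
# A tree is a leaf name (str) or a tuple of subtrees.

def leaf_set(node):
    """Return the set of leaf names below a node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))

def has_trait(node, present):
    """Infer whether a node carries the subfamily under Dollo parsimony."""
    if not (leaf_set(node) & present):
        return False          # no descendant has it: absent (lost or never gained)
    if isinstance(node, str):
        return True           # an extant species that has the subfamily
    # A node is a strict ancestor of the single gain point if one child
    # already contains every species that has the subfamily.
    return not any(present <= leaf_set(child) for child in node)

def count_events(node, present, parent_has=False):
    """Return (gains, losses) on the branches at and below this node."""
    here = has_trait(node, present)
    gains = int(here and not parent_has)
    losses = int(parent_has and not here)
    if not isinstance(node, str):
        for child in node:
            g, l = count_events(child, present, here)
            gains += g
            losses += l
    return gains, losses

# Hypothetical ladder-like tree with leaves A..E.
toy_tree = ("A", ("B", ("C", ("D", "E"))))
events_ce = count_events(toy_tree, frozenset({"C", "E"}))
events_ae = count_events(toy_tree, frozenset({"A", "E"}))
```

In this toy case a subfamily seen only in C and E is inferred to arise at their common ancestor and to be lost once (in D), whereas a subfamily seen only in A and E must be placed at the root and lost three times. That asymmetry is why Dollo reconstructions can assign many subfamilies to deep ancestors.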

Fig. 4 Evolutionary gains and losses of protein phosphatase and kinase subfamilies.

Gains of subfamilies are in blue, and subfamily losses are in red. The inferred subfamily counts of ancestral species are in blue circles, and extant counts are in squares.

More than one-quarter (11 of 40) of phosphatase subfamilies are lost from yeast, in agreement with the loss of one-third of all kinase subfamilies (Fig. 4) and its generally reduced genome. The largest gains are seen at the three clades at the base of the animals (holozoa, metazoa, and eumetazoa), with the number of subfamilies increasing by more than a half (26/45) in holozoa, a quarter (19/69) in metazoa, and 11% (10/88) in eumetazoa (Fig. 4). Whereas yeast has 26% of human subfamilies, basal metazoa had 81% (fig. S6). This period also saw a massive radiation in kinase subfamilies from 90 to 213 (Fig. 4 and fig. S6). These expansions are dominated by the CC1 fold (35 of the 55 new subfamilies in these three clades) and the PPM fold (10 new subfamilies) (table S6 and fig. S5) and include the emergence of receptor PTPs and their cognate tyrosine kinases (table S6) (22). Seventy-five percent (12 of 16) of human PTP subfamilies and 64% (20 of 32) of human tyrosine kinase subfamilies emerged in this period. Receptor PTPs first emerge in holozoa, including the first dual-domain receptor PTP.

Both Monosiga and sponge have major independent PTP expansions (fig. S5): Monosiga has 38 PTPs (the same number as human) and sponge has 103, including a massive expansion of the PTPRD subfamily to 76 members. Eleven Monosiga PTPs and 84 sponge PTPs are predicted to be receptors (table S4). Both also have major expansions of clade-specific tyrosine kinases (22, 23).

Few additional phosphatases or kinases emerged in bilateria, but marked losses of both kinases and phosphatases are seen in ecdysozoa (the common ancestor of Drosophila and C. elegans), which lost 14% of phosphatase subfamilies and 12% of kinase subfamilies. Drosophila and C. elegans individually have even more reduced phosphatomes and kinomes. Even within nematodes and insects, these two species are noted for their gene losses (24, 25). This is reflected in Drosophila, which has the smallest kinome and phosphatome of the animals studied (Table 1). Losses in C. elegans are even greater (15 subfamilies) but are more than balanced by extensive expansion of the PTP and PPP families (27 of 28 subfamily gains; table S6) and duplication within ancestral subfamilies, such as the 21 members of the ACP2 acid phosphatase subfamily (table S4), leading to the largest known phosphatome, of 244 genes. This parallels a major expansion in specific areas of the kinome, including several classes of tyrosine kinase (26).

The sea urchin has the most subfamilies shared with human (fig. S6), and the expansions seen in the human phosphatome and kinome are largely due to the whole genome duplications at the base of the vertebrates, which led to a selective increase in signaling genes (27). Of the 51 subfamilies with multiple members in human, 39 have a single member in the closest organism to human that has the subfamily (fig. S5 and table S7). We found three subfamilies only in human (all receptor PTPs): PTPRA, PTPRC, and PTPRK (fig. S5 and table S6). All are found throughout the vertebrates, and PTPRA and PTPRK are also found in the invertebrate chordate Ciona intestinalis. PTPRC (also known as CD45) modulates antigen signaling in B and T cells, and, like PTPRC, both PTPRA subfamily members, PTPRA and PTPRE, dephosphorylate and activate SRC-family kinases (28), functions that are unique to or expanded in vertebrates. PTPRK members (PTPRK, PTPRM, PTPRT, and PTPRU) are involved in cell adhesion, particularly in the nervous system (29). We also found four conserved subfamilies that were lost from human and other jawed vertebrates (text S4).

Across all clades, we see correlation between gains and losses of kinases and phosphatases, in both total numbers and as a percentage of the ancestral subfamily count (fig. S7). This partially reflects changes in total proteome size but also suggests coordinated losses of opposing kinase and phosphatase functions.

Human protein phosphatase pseudogenes and retrogenes

We identified 79 human pseudogenes and retrogenes that are derived from protein phosphatases and contain a predicted protein phosphatase domain, including 18 not found in Entrez Gene (30) (Table 2 and table S8). Pseudogenes have open reading frames (ORFs) that are disabled due to nonsense or frameshift mutations or are fragmentary copies of a functional phosphatase. The 69 pseudogenes come from 32 protein-coding genes of 28 subfamilies, 12 families, and 7 folds. Several phosphatases have many pseudogenes (for example, PGAM1 has 13 and TPTE2 has 9), whereas 20 phosphatases each have only a single pseudogene. Some pseudogenes may be functional (31); for example, PTENP1 is expressed, may regulate cellular abundance of PTEN, and is selectively deleted or repressed in several human cancers (32), and TPTEP1 is fused to an endogenous retrovirus, translated into protein, and silenced in kidney cancer by DNA methylation (33). Only one pseudogene (PTPRVP; OST-PTP) is a degenerate functional gene, which has active orthologs involved in bone formation in other mammals. We also see 10 retrogenes—retrotransposed copies of functional genes that have a complete or near-complete ORF. These may represent “young” pseudogenes or may be functional copies of other protein phosphatases. SSU72 has seven full-length retrogenes and one pseudogene; CDC14C is also believed to be functional (34).

Table 2 Summary of human protein phosphatase pseudogenes by subfamily.

Each row shows a subfamily, the numbers of protein-coding genes and pseudogenes in the subfamily, the parental genes, and the number of pseudogenes derived from the parental genes (in parentheses). Subfamilies are organized by fold and family.

Protein phosphatases and diseases

Forty of the 189 protein phosphatases (21%) have disease-associated variants annotated in OMIM (35), UniProt (36), ClinVar (37), and cancer gene sets (text S5, Table 3, and table S9). Twelve (6%) are involved in cancer, according to the cancer gene census (38), and four cancer gene sets from pan-cancer analysis (39–42). By comparison, 35% (187 of 538) of human kinases were disease-associated, including 107 (20%) involved in cancer (table S10). Kinases were more frequently annotated as associated with cancer and other diseases than other genes (Fisher’s exact test, P < 0.01), but phosphatases were not (text S5).
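For readers who want to reproduce this style of comparison, the sketch below implements a two-sided Fisher's exact test using only the Python standard library. The background gene counts used for the published test are not given in this section, so the example instead contrasts phosphatases (40 of 189) directly with kinases (187 of 538). That is a different question from the enzyme-versus-other-genes test reported above, and the function name is an illustrative choice.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x first-column counts in row 1,
        # with all table margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table at least as unlikely as the
    # observed one (small tolerance guards against float round-off).
    total = sum(p for p in map(prob, range(lowest, highest + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)

# Counts taken from the text: disease-associated vs. not, per enzyme class.
p_value = fisher_exact_two_sided(40, 189 - 40, 187, 538 - 187)
```

Testing each class against all other human genes, as described in the text, would only require swapping the second row for the disease-associated and non-associated counts of the background gene set.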

Table 3 Disease-related protein phosphatases.

Each row shows a human protein phosphatase gene, its classification, disease(s), and whether it is a cancer gene.

Accessory domains

Most protein phosphatases function in signaling networks. Noncatalytic domains and motifs within phosphatases link them to other signaling proteins or subcellular sites. We found 54 types of accessory domain or motif in 93 of the 189 human protein phosphatases (tables S4 and S11), using published profiles and a set of 13 in-house HMM profiles. These include 31 proteins with lipid-binding domains (C2, FERM, PH, FYVE, and GRAM), 5 with membrane or other subcellular targeting domains or motifs (PDZ, CRAL_TRIO, and BRO1), and 29 with protein- or phosphoprotein-interacting domains (SH2, SH3, PTB, fn3, immunoglobulin, leucine-rich repeat, carbonic anhydrase-like, TPR, PTP N-terminal, BRCT-like, RA, and IQ). Twenty-three of these domains are also found in kinases, such as pTyr-binding domains SH2 and PTB, lipid signaling domains (C2 and PH), and calcium signaling (IQ). Forty-six of the 101 subfamilies have accessory domains, 35 of which belong to the CC1 fold.

Closely related subfamilies often share accessory domains, such as the fn3 repeats found in five PTP subfamilies and the FERM and PDZ domains in PTPN3 and PTPN13. Domains and motifs can also be recruited to unrelated subfamilies by convergent evolution. For instance, SH2 domains are found in the PTPN6 (SHP) subfamily of PTPs and the tensin subfamily of DSPs. Accessory domains are often gained and lost within subfamilies, indicating shifts in function (Table 4 and table S12). Sixteen of the 46 multidomain subfamilies present in human have members that have clearly gained or lost domains relative to their canonical domain structure (Table 4 and table S12). For instance, the myotubularin MTMR5 conserves a diacylglycerol-binding C1 domain from sponge to sea urchin but loses it in vertebrates. STS phosphatases from C. elegans and all other nematodes lost their SH3 and ubiquitin-associated (UBA) domains and also completely lack the Syk family of kinases, which are STS substrates (43, 44), suggesting that these domains may be specifically involved in targeting Syk. Rhodanese domains mediate the interactions between many DSPs and their mitogen-activated protein kinase (MAPK) substrates but have been lost from Drosophila puckered (DSP10 subfamily) and C. elegans lip-1 (DSP6 subfamily), apparently without affecting their MAPK phosphatase activity (45–47).

Table 4 Reliable domain gains and losses.

Each row is a domain gain or loss event. Columns include gene symbol, species, protein phosphatase classification at subfamily level, structural domain combination, domain changes, category (gain or loss), and conservation. Catalytic domains are in bold and gained or lost domains are in square brackets. The conservation column shows whether the domain gains or losses are present in other organisms. See also table S12.

We also found a surprising degree of structural diversity even within the catalytic domains. CC1 fold phosphatases share a core of a five-stranded β sheet (β2 to β4; β11 to β12) flanked by five helices (α2 to α6) (48, 49) but have extensive additional structural elements in specific families or subfamilies (Fig. 5, fig. S8, tables S13 to S18, and text S6). The most remarkable differences are in the region between the β4 and β11 strands. PTPs usually have a multistranded sheet inserted, whereas myotubularins and Sacs usually have one helix inserted, and PTENs have a short loop. Within 33 DSP structures, 9 have one helix inserted between β4 and β11, 3 have a two-helix insert, 1 has a β strand, and the remaining 20 have loop insertions. This conserved diversity likely contributes to the unique functions of each protein phosphatase class.

Fig. 5 Secondary structures of representative phosphatase domains of CC1 families.

The common secondary structure elements are labeled at the bottom, numbered according to (82). α Helices and β sheets are denoted in red and blue, respectively. Inserted gaps are not marked.

Pseudophosphatases

Many phosphatase domains have alterations to catalytic motifs that render them catalytically inactive. Most of these alterations are evolutionarily conserved, and noncatalytic functions have been suggested for several of these domains (50–53). We predicted 26 human pseudophosphatase domains based on loss of any of a set of key residues in each fold (tables S19 and S20 and text S7). As seen in kinases, some enzymes circumvent the loss of key residues to maintain activity: 14 of the 26 predicted pseudophosphatase domains have been demonstrated to be catalytically inactive, but 4 have at least one report of activity. Five pseudophosphatase domains are in proteins that encode two phosphatase domains, one of which is catalytically active, including three receptor PTPs and two CDC14s. The accessory rhodanese domains in MAPK phosphatases are catalytically inactive but were not included in the pseudophosphatase list because most active rhodanese domains (other than those of CDC25) are not phosphatases (54).

CC1 fold pseudophosphatases were defined as lacking either the Cys or Arg of the CX5R motif (Fig. 6, A to C, and figs. S9 to S11). No pseudophosphatases were seen in the CC2 or CC3 folds that share the same catalytic motif. The only HAD fold pseudophosphatase, TIMM50, has substitutions at both Asp residues of the DXD catalytic motif (Fig. 6D and fig. S12) (6). TAB1 is the only experimentally verified PPM fold pseudophosphatase. In human, it has substitutions at three of the six residues that coordinate the metal ions mediating the reaction (Fig. 6E and fig. S13) (5). PHLPP1, PHLPP2, and PP2D1 also have substitutions at these residues (Fig. 6E). However, PHLPPs have been shown to be catalytically active. By contrast, the PPIP5K1 and PPIP5K2 phosphatase domains, members of the HP2 family, have all the phosphate pocket conserved residues of active HP2s (Fig. 6F and fig. S14) but experimentally lack phosphatase activity (16).
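The CC1 rule above can be expressed as a short check. This is a simplified sketch, not the prediction pipeline used in the study: it assumes the seven residues aligned to the CX5R motif have already been extracted from a multiple alignment, and the example windows are made-up toy strings. As the PHLPP and PPIP5K cases show, the presence or absence of motif residues is only a predictor of activity.

```python
def cx5r_status(window: str) -> str:
    """Classify a 7-residue window aligned to the CX5R motif.

    Position 1 should be the catalytic Cys and position 7 the Arg;
    the five residues in between are not constrained.
    """
    if len(window) != 7:
        raise ValueError("expected a 7-residue window (C, five X, R)")
    window = window.upper()
    has_cys = window[0] == "C"
    has_arg = window[6] == "R"
    if has_cys and has_arg:
        return "motif intact"
    missing = [name for name, ok in (("Cys", has_cys), ("Arg", has_arg)) if not ok]
    return "predicted pseudophosphatase (lacks " + " and ".join(missing) + ")"

# Hypothetical aligned windows, for illustration only.
intact = cx5r_status("CSAGIGR")
no_cys = cx5r_status("DSAGIGR")
no_cys_no_arg = cx5r_status("DSAGIGK")
```

Scanning raw sequence for any `C.{5}R` match would be less reliable than this alignment-based check, because unrelated Cys and Arg pairs with that spacing can occur by chance outside the catalytic site.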

Fig. 6 Loss of catalytic residues in pseudophosphatases.

(A to F) Sequence logos show conservation pattern of catalytic motifs in major families, followed by logos of subfamilies containing predicted pseudophosphatases. Catalytic residues are marked by asterisks, and red triangles show predicted inactivating substitutions. PPIP5K proteins are known to be inactive but conserve all known catalytic residues. Sequence logos were built by WebLogo (83) and colored by hydrophobicity (y axis is information content in bits). See text S7 and figs. S9 to S14 for details.

Twelve receptor PTPs have dual catalytic domains, a membrane proximal D1 domain, and a membrane distal D2 domain. The D2 domains of PTPRG and PTPRZ1 are definite PTP pseudophosphatase domains, with substitutions at both Cys and Arg of CX5R (Fig. 7 and fig. S15E). Other D2 domains have been shown to lack activity against pNPP (para-nitrophenylphosphate) or pTyr but retain the CX5R motif, so we looked for additional sequence features that might explain their catalytic inactivity. We found conserved substitutions in two other conserved motifs, KNRY and WPD, in all D2 domains other than those from three Nematostella genes. PTPRA loses the Tyr of the KNRY motif and changes WPD to WPE (fig. S15B). Back mutation of both these residues in human PTPRA, PTPRE, and PTPRF restored PTP activity (55–57). D2 domains from other subfamilies have related conserved substitutions (fig. S15). However, these substitutions were also found in active PTPs that are either protein or lipid phosphatases (table S21 and fig. S15): PTPRQ and PTPRN2 (fig. S15, G and J) are lipid phosphatases (text S8), whereas PTPN14, PTPN23, and PTPRU D1 are protein phosphatases (fig. S15, H, I, and K), indicating that this is an imperfect predictor of inactivity. Although the D2 domains themselves lack tyrosine phosphatase activity, they are required for the enzymatic activity of the proteins as a whole (58). They may bind and regulate D1 domains (59) and act as redox sensors (60). The strong conservation of CX5R and selective changes to substrate specificity motifs suggest that some D2 domains may be active on nonprotein or nonstandard protein substrates.

Fig. 7 PTP pseudophosphatases.

Most PTPs with altered CX5R, KNRY, or WPD motifs lack PTP activity. The top sequence logo shows motif conservation in active PTPs. Residues required for PTP phosphatase activity are starred in black, and those required for general phosphatase activity are starred in red. Additional sequence logos show motif conservation in subfamilies with altered motifs. Altered critical residues are highlighted by black and red triangles in the same way as the asterisks. Sequence logos were built by WebLogo and colored by hydrophobicity (y axis is information content in bits). See text S8 and fig. S15 for details.

DISCUSSION

We built upon several previous studies (table S1) to create a unified human protein phosphatome (Fig. 2) and traced its history through key stages of eukaryotic evolution. This provides a genomic and evolutionary context for all protein dephosphorylation and a catalog for large-scale experimental studies.

We defined 264 human phosphatases and focused on the 189 genes of the protein phosphatome. These were defined liberally, including catalytically inactive pseudophosphatases and relatives of protein phosphatases that dephosphorylate nonprotein substrates. This allowed for evolutionary and structure-function comparisons. We found diverse activities within folds, families, and even subfamilies, ranging from protein phosphatase to nonprotein phosphatase to other hydrolases and even nonenzymatic functions. This suggests that additional protein phosphatases may yet be found in the ranks of nonprotein phosphatase folds (table S1), other hydrolases, or even additional folds.

Many phosphatase families have varied substrate specificity. The human protein phosphatome includes 10 phosphatases that have both protein and nonprotein substrates and another 30 that only have nonprotein substrates (table S1). Many of these are phosphoinositide phosphatases that are also involved in signal transduction. This is a similar situation to kinases, where most protein and phosphoinositide kinases belong to the same fold. We include these nonprotein phosphatases, as well as known pseudophosphatases, due to the possibility of cryptic protein phosphatase activity, their close family relationships to protein phosphatases, and their functional involvement in protein phosphatase–based signaling pathways. Conversely, our analysis of the “inactive” D2 domains of receptor PTPs shows that most of them strongly conserve the catalytic residues but alter substrate recognition motifs, suggesting that these might have some cryptic enzymatic activity.

Thousands of phosphorylation sites have now been associated with upstream kinases, but only 7% of these have been mapped to a cognate phosphatase (fig. S16 and table S22). A major challenge in understanding the phosphatome will be mapping each phosphatase to its substrates, to understand the specificity and roles of all phosphatases. The smaller number of phosphatase genes relative to kinase genes does not simply mean that phosphatases are less specific than kinases, because phosphatases use additional regulatory mechanisms to achieve specificity. For instance, PPPs use a large number of different regulatory subunits to confer specificity and localization (2, 61, 62). As with kinases, some phosphatases are promiscuous, and others are highly selective (table S22).

The 92 subfamily losses seen in these nine species (table S6) challenge our understanding of conserved biological pathways and may provide functional insights. For instance, cofilin is phosphorylated by the LIMK and TESK kinases and dephosphorylated by slingshot and chronophin (63). Both kinases and both phosphatases are absent from C. elegans, although cofilin is retained, suggesting that C. elegans has lost this mode of cofilin regulation and that these kinases and phosphatases did not have other essential roles in the ancestor of nematodes. Another example comes from the loss of both phosphatase PTPN21 and kinase Tec in worms. These are known to physically interact in human (64), and the coordinated loss also suggests the lack of major unshared functions for at least one of these genes. The functional impact of other gene losses is unclear but offers an evolutionary tool to explore the functions of genes and the consequences of their losses (table S23).

Genome-scale analyses often simultaneously provide partial insight into many biological questions. To make these genomic data easier to explore, we have established an online database at http://phosphatome.net, which includes a wiki for sharing additional specific findings on each phosphatase class.

MATERIALS AND METHODS

Determination of the human protein phosphatome

We curated known human phosphatases from the literature (see text S1 and table S1). We used PSI-BLAST to find additional homologs of known phosphatases. We classified human phosphatases into folds, superfamilies, and families by comparing their protein sequences against the SCOP database. The longest protein isoform of each human phosphatase was extracted from RefSeq (65) and BLASTed against SCOP domain sequences, with an E-value threshold of 10^-5 and coverage above 70% of the subject domain sequence (17). Each query was assigned the fold and family of its top BLAST hit and checked manually. Phosphatases with no significant hits to SCOP were manually classified by similarity to classified phosphatases in sequence and/or structure and by referring to previous studies on phosphatase classification (1, 2, 4, 8–16). Three putative folds were not found in SCOP (version 1.75C) (table S2). We split the PTEN family from the dual specificity phosphatase-like family in SCOP because of its functional and sequence differences from other DSPs.
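The automated part of this assignment amounts to a filter followed by a best-hit lookup, which the sketch below illustrates. Several details are assumptions rather than statements about the original pipeline: hits are represented by a hypothetical `Hit` record, coverage is computed from a single aligned span on the subject, the "top" hit is taken to be the passing hit with the highest bit score, and the example identifiers and scores are invented.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

@dataclass
class Hit:
    """One query-versus-SCOP-domain alignment (hypothetical record layout)."""
    subject_id: str
    fold: str
    family: str
    evalue: float
    bitscore: float
    subject_start: int   # 1-based, inclusive
    subject_end: int     # 1-based, inclusive
    subject_length: int

    @property
    def subject_coverage(self) -> float:
        # Fraction of the SCOP domain sequence covered by the alignment.
        return (self.subject_end - self.subject_start + 1) / self.subject_length

def assign_fold_family(hits: Iterable[Hit],
                       max_evalue: float = 1e-5,
                       min_coverage: float = 0.70) -> Optional[Tuple[str, str]]:
    """Return (fold, family) of the best passing hit, or None for manual review."""
    passing = [h for h in hits
               if h.evalue <= max_evalue and h.subject_coverage > min_coverage]
    if not passing:
        return None
    best = max(passing, key=lambda h: h.bitscore)   # assumption: rank by bit score
    return best.fold, best.family

# Invented example hits for a single query.
example_hits = [
    Hit("toy_dom_1", "CC1", "PTP", 1e-40, 150.0, 5, 250, 260),   # passes both filters
    Hit("toy_dom_2", "CC1", "DSP", 1e-60, 210.0, 1, 60, 150),    # coverage too low
    Hit("toy_dom_3", "HAD", "FCP", 1e-3, 35.0, 1, 180, 190),     # E-value too high
]
assignment = assign_fold_family(example_hits)
```

Returning `None` rather than a weak best guess mirrors the text: queries without a significant SCOP hit were classified manually.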

We cataloged phosphatases into protein and nonprotein phosphatases based on the information of their substrates from the HuPho and DEPOD databases (downloaded in March 2013) and from the literature (8, 9) (table S1). The human protein phosphatome was the collection of all the members of the families that contain protein phosphatases plus the families of PTEN, myotubularin, Sac, and Paladin. To find possible protein phosphatase genes missed in RefSeq, we masked the RefSeq protein sequences of protein phosphatases in human chromosomal and EST sequences using BLAT (66) and then searched the masked sequences against a customized HMM library of protein phosphatases (see below) by using HMMER (67) and DeCypherHMM (a hardware-accelerated implementation of HMMER).

HMM library of protein phosphatases

We created an HMM library to identify protein phosphatase domains from sequences. We found all HMM profiles from the Pfam (68), SMART (69), and SUPERFAMILY (70) databases that matched human protein phosphatase domains by HMMER search (67). We searched these HMMs against all human protein sequences (longest isoforms) from RefSeq and selected HMMs that matched human protein phosphatases with a precision higher than 90%. We used these HMMs to search human protein phosphatases again and built our own HMMs for phosphatases missed by published HMMs. The final HMM library has 105 semiredundant profiles that can detect all human protein phosphatases. We also built HMM profiles for each protein phosphatase subfamily and family, which can be used to classify protein phosphatases and for remote homolog detection. All HMM profiles are available at http://phosphatome.net/download/.
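The precision filter used to select HMMs can be illustrated as follows. This is a sketch under stated assumptions: precision is taken to be the fraction of proteins matched by a profile that are known protein phosphatases, and the function names and input layout (a mapping from profile name to the set of matched protein identifiers) are hypothetical.

```python
def hmm_precision(matched: set[str], phosphatases: set[str]) -> float:
    """Fraction of proteins matched by one HMM that are known protein phosphatases."""
    if not matched:
        return 0.0
    return len(matched & phosphatases) / len(matched)


def select_hmms(
    hits_by_hmm: dict[str, set[str]],
    phosphatases: set[str],
    min_precision: float = 0.90,  # "precision higher than 90%" in the text
) -> list[str]:
    """Return the names of HMMs whose precision exceeds the cutoff, sorted by name."""
    return sorted(
        name
        for name, matched in hits_by_hmm.items()
        if hmm_precision(matched, phosphatases) > min_precision
    )
```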

Identification of pseudogenes and retrogenes

We defined retrogenes as retrotransposed copies of functional genes and pseudogenes as genes whose ORFs are interrupted by stop codons, frameshifts, or truncations or that lack a start codon. We only included pseudogenes that contained a protein phosphatase domain. We identified protein phosphatase pseudogenes as follows. First, we masked the protein sequences of protein phosphatases from RefSeq in human chromosomal and EST sequences. We then searched the masked sequences against our HMM library of protein phosphatases by using HMMER (67) and DeCypherHMM. We also included annotated human pseudogenes from Entrez Gene (30) that were named by a phosphatase gene symbol followed by a “P” and optionally a number. For each pseudogene, we built a gene model using GeneWise and GeneDetective (a hardware-accelerated implementation of GeneWise), which predict gene structure from nucleic acid sequence using a related protein sequence as template (71). As input, we used the protein sequence of the parental gene and the chromosomal sequence containing the pseudogene and its neighboring genes, except for the pseudogene PTPRVP, whose model was built using the protein sequence of its protein-coding ortholog in mouse. We annotated each pseudogene as processed or nonprocessed based on its intron-exon structure. If a pseudogene had the same intron-exon structure as its parental gene, it was a nonprocessed pseudogene; if it had no introns, or it had very few introns and an intron-exon structure distinct from that of its parental gene, it was a processed pseudogene. We browsed the pseudogenes in the University of California, Santa Cruz (UCSC), genome browser (72) and annotated their expression based on human mRNAs and spliced EST tracks.
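The processed versus nonprocessed rule can be written as a small decision function. The sketch below is only an approximation of what was a manual annotation: it assumes intron positions are given as coordinates on the parental protein sequence, it treats "same structure" as exact equality of those positions, and the cutoff for "very few introns" (`FEW_INTRONS`) is a hypothetical value, since the text gives none.

```python
from enum import Enum


class PseudogeneType(Enum):
    PROCESSED = "processed"
    NONPROCESSED = "nonprocessed"
    UNRESOLVED = "unresolved"  # left for manual review


FEW_INTRONS = 2  # hypothetical cutoff for "very few introns"; not given in the text


def classify_pseudogene(
    pseudo_introns: tuple[int, ...],
    parent_introns: tuple[int, ...],
    few: int = FEW_INTRONS,
) -> PseudogeneType:
    """Classify a pseudogene from intron positions on the parental protein sequence."""
    if not pseudo_introns:
        # No introns at all: consistent with retrotransposition of a spliced mRNA.
        return PseudogeneType.PROCESSED
    if pseudo_introns == parent_introns:
        # Same intron-exon structure as the parent: a duplicated genomic copy.
        return PseudogeneType.NONPROCESSED
    if len(pseudo_introns) <= few:
        # Very few introns, and a structure distinct from the parent.
        return PseudogeneType.PROCESSED
    return PseudogeneType.UNRESOLVED
```

Cases that match neither rule are returned as `UNRESOLVED` rather than forced into a class, because the text does not say how such cases were handled.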

Protein phosphatase identification in other organisms

We used the HMM library to identify protein phosphatases from predicted protein sequences by HMMER and DeCypherHMM. We masked these sequences in chromosomal, EST, and other nucleotide sequences (table S5) by BLAT (sequence identity 99%) (66) and searched these masked sequences by HMM to identify additional phosphatases. We merged genomic and EST predictions by chromosomal location and manually curated each gene model. We excluded clear pseudogenes but retained fragmentary ORFs. We attempted to improve and extend gene models using GeneWise and GeneDetective, using multiple close homologs of the starting gene model as templates, and further extended and corrected gene models using EST sequences from closely related species. Alternative gene models were manually evaluated by comparing them with multiple homologs among the nine organisms and/or their most closely related organisms found by BLAST against the National Center for Biotechnology Information (NCBI) NRAA data set. We evaluated gene models against their homologs in terms of the overall similarity of the full-length and phosphatase domain sequences, domain combination, and key catalytic residues (fig. S17).

Classification and phylogeny of phosphatases

We classified phosphatases into folds, superfamilies, and families by comparing their protein sequences against the SCOP database. We defined subfamilies as close homologs that are generally conserved between species (text S1). We initially detected subfamilies by OrthoMCL (73) and elaborated with BLAST and multiple alignments to cover cases of fragmentary gene models and misassignment. We classified fragmentary gene models into subfamilies and confirmed or changed the subfamilies of other genes using the sequence similarity of the phosphatase domain and full-length protein, domain combination, and/or known functions. A number of species-specific subfamilies are arguably members of other subfamilies but not definitively so.

We assigned protein phosphatases to subfamilies in a hybrid approach mainly using domain sequence similarity and domain architecture. We extracted domain sequences by alignment to HMMs or representative well-studied proteins and then aligned them with the PROMALS3D web server (74) guided by crystal structures. All alignments were manually inspected and adjusted to remove large inserts and short gene models and to correct obvious local misalignments, particularly around catalytic motifs and other conserved regions. We inferred phylogenetic trees of protein phosphatases of human and of all nine organisms by maximum likelihood (ML) in RAxML (version 8.2.9) (75) with the following parameters: -m MODEL -f a -x 1234 -p 5678 -# 100, where “-m MODEL” sets the model, “-f a” selects rapid bootstrap analysis and a search for the best-scoring ML tree in a single program run, “-x 1234” specifies the random seed for rapid bootstrap, “-p 5678” sets the random number seed for the parsimony inference, and “-# 100” sets the number of bootstrap replicates to 100. We also used PhyML (version 3.0) with the following parameters: -d aa -b 100 -m MATRIX -f [em] -v e -a e -s SPR --rand_start 5 --n_rand_starts 5 --r_seed 1234, where “-d aa” sets the data type as amino acid sequence, “-b 100” sets the number of bootstrap replicates as 100, “-m MATRIX -f [em] -v e -a e” sets the model, “-s SPR” sets the tree topology search operation, “--rand_start 5” sets the initial tree to random, “--n_rand_starts 5” sets the number of initial random trees as 5, and “--r_seed” sets the seed used to initiate the random number generator. We used the JTT substitution model in both programs. We also generated the best model from ProtTest (76) using the Bayesian information criterion as an alternative (data S1 and http://phosphatome.net/tree/). All differences between JTT and ProtTest and between RAxML and PhyML were manually evaluated before creation of Fig. 2, based on the topologies of the individual trees, the statistical support given by bootstrap values, domain combinations, key residues, the similarity of their orthologs in additional organisms, and known functions.
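For reproducibility, the RAxML invocation described above can be assembled programmatically. The sketch below only builds the argument list; it does not run RAxML. The flags -m, -f a, -x, -p, and -# are taken from the text, whereas the binary name `raxmlHPC`, the model string `PROTGAMMAJTT` (the text says only that JTT was used, so the rate model is an assumption), the input and run-name options (-s, -n), and the file names are illustrative assumptions.

```python
import shlex


def raxml_command(
    alignment: str,
    run_name: str,
    model: str = "PROTGAMMAJTT",   # assumption: JTT with gamma rate heterogeneity
    bootstrap_seed: int = 1234,    # -x, rapid bootstrap seed from the text
    parsimony_seed: int = 5678,    # -p, parsimony seed from the text
    replicates: int = 100,         # -#, number of bootstrap replicates from the text
    binary: str = "raxmlHPC",      # assumption: name of the installed executable
) -> list[str]:
    """Build the RAxML argument list for rapid bootstrap plus best-tree search (-f a)."""
    return [
        binary,
        "-s", alignment,
        "-n", run_name,
        "-m", model,
        "-f", "a",
        "-x", str(bootstrap_seed),
        "-p", str(parsimony_seed),
        "-#", str(replicates),
    ]


if __name__ == "__main__":
    # Hypothetical file and run names, printed as a shell-safe command line.
    print(shlex.join(raxml_command("phosphatase_domains.phy", "phosphatome_jtt")))
```

Keeping the seeds as explicit arguments makes it easy to repeat the analysis with the values reported in the text or to vary them when checking the stability of a topology.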

Nomenclature for gene and subfamily naming

We named protein phosphatase genes as follows. (i) Use the gene symbol from the data source, if present. (ii) Use the subfamily name as gene symbol, if there is a single gene of that subfamily in a genome. If there are multiple genes of the subfamily in a genome, use the subfamily name + “-” + number (for example, PPM1K-2). (iii) If a gene is not classified into any subfamily, use genome symbol (first character of genus name and first three characters of species name) + family name + “-Un-” + number (for example, SpurPPPc-Un-1).
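Rules (ii) and (iii) of this naming scheme are mechanical enough to express in code. The following is a minimal sketch with hypothetical function names; it assumes that numbering starts at 1 and that rule (i), an existing gene symbol from the data source, has already been checked by the caller. The species names in the usage example are given for illustration.

```python
def genome_symbol(genus: str, species: str) -> str:
    """First character of the genus name plus first three characters of the species name."""
    return genus[0].upper() + species[:3].lower()


def name_subfamily_genes(subfamily: str, n_genes: int) -> list[str]:
    """Rule (ii): a single gene takes the subfamily name; multiple genes are numbered."""
    if n_genes == 1:
        return [subfamily]
    return [f"{subfamily}-{i}" for i in range(1, n_genes + 1)]


def name_unclassified_gene(genus: str, species: str, family: str, number: int) -> str:
    """Rule (iii): genome symbol + family name + "-Un-" + number."""
    return f"{genome_symbol(genus, species)}{family}-Un-{number}"


if __name__ == "__main__":
    print(name_subfamily_genes("PPM1K", 2))
    print(name_unclassified_gene("Strongylocentrotus", "purpuratus", "PPPc", 1))
```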

We generally named protein phosphatase subfamilies after human members. Species-specific subfamilies of other organisms that have unclear functions are named by the genome symbol (first character of genus name and first three characters of species name) + family name + number, unless a better name is available that reflects the functions, evolutionary origin, and/or domain combinations of the subfamily (for example, LRR-DSP of Dictyostelium has a signature domain combination; DdisMTMR1-like of Dictyostelium is closer to the MTMR1 subfamily than to other myotubularin subfamilies; egg, eak, and hpo-7 of C. elegans are named after the genes whose functions have been characterized).

Inference of subfamily gains and losses

Subfamily gains and losses were inferred by Dollo parsimony, using the PHYLIP package program dollop (77), with the phylogeny shown in fig. S1 and default parameters. The result was parsed and visualized using a custom R script.
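The logic of Dollo parsimony on a fixed species tree can be illustrated with a short sketch. It is not a reimplementation of dollop: it handles a single presence/absence character on a user-supplied rooted tree, places the one permitted gain at the most recent common ancestor of all species that carry the subfamily, and counts one loss for each maximal clade below that ancestor in which the subfamily is absent. The tree encoding (nested tuples of species labels), the function names, and the toy taxa A to E are assumptions and do not reproduce the phylogeny of fig. S1.

```python
def leaf_set(node) -> frozenset:
    """All species labels under a node; a node is a label (str) or a tuple of child nodes."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaf_set(child) for child in node))


def dollo_events(tree, present):
    """Return (clade where the subfamily was gained, list of clades where it was lost)."""
    present = frozenset(present) & leaf_set(tree)
    if not present:
        return None, []  # the subfamily is absent from every species in the tree

    # Descend while only one child carries the subfamily: the single gain
    # is placed at the smallest clade that contains all carriers.
    node = tree
    while not isinstance(node, str):
        carriers = [child for child in node if leaf_set(child) & present]
        if len(carriers) > 1:
            break
        node = carriers[0]

    # Below the gain, every maximal clade without the subfamily is one loss.
    losses = []

    def collect(n):
        if not (leaf_set(n) & present):
            losses.append(leaf_set(n))
        elif not isinstance(n, str):
            for child in n:
                collect(child)

    collect(node)
    return leaf_set(node), losses


if __name__ == "__main__":
    toy_tree = ((("A", "B"), "C"), ("D", "E"))  # hypothetical five-species tree
    print(dollo_events(toy_tree, {"A", "C"}))
```

In the toy example, a subfamily found in A and C is gained once in the ancestor of A, B, and C and lost once in B; clades D and E never had it and so contribute no events.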

Domain annotation

Protein domains were detected using HMMER searches on Pfam, SMART, and in-house HMMs with local models (available at http://phosphatome.net/download/). E-value cutoffs of <10 were used to pick up repeated elements whose individual scores were very low. For nonrepeated elements, scores of >10⁻³ were discarded and those >10⁻⁵ were inspected manually. Overlapping domains from different profiles of the same family were merged. Signal peptides were predicted with SignalP (78), and transmembrane (TM) segments were predicted with TM-HMM (79).
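The cutoffs and the merging step above can be sketched as two small functions. This is an illustration under stated assumptions: the "scores" for nonrepeated elements are interpreted as E-values, hits at or below 10⁻⁵ are treated as accepted without inspection, domain spans are closed intervals of residue positions, and all names are hypothetical.

```python
def triage_hit(evalue: float, repeated: bool) -> str:
    """Return "accept", "inspect", or "discard" for one domain hit."""
    if repeated:
        # Permissive cutoff so that weak individual repeat units are still picked up.
        return "accept" if evalue < 10 else "discard"
    if evalue > 1e-3:
        return "discard"
    if evalue > 1e-5:
        return "inspect"  # flagged for manual inspection
    return "accept"


def merge_overlaps(spans: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (start, end) spans from different profiles of the same family."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous span: extend it rather than adding a new domain.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

`merge_overlaps` is meant to be called once per domain family on a single protein, so that spans from different families are never fused.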

Statistics

All statistics were performed using R (80).

SUPPLEMENTARY MATERIALS

www.sciencesignaling.org/cgi/content/full/10/474/eaag1796/DC1

Text S1. Cataloging the human phosphatome.

Text S2. Protein substrates of human protein phosphatases and kinases.

Text S3. Cross-references to Entrez Gene models.

Text S4. Losses of protein phosphatase subfamilies in humans.

Text S5. Human protein phosphatases and disease.

Text S6. Structural diversity within the CC1 fold phosphatase domains.

Text S7. Human pseudophosphatases and their evolutionary origins.

Text S8. PTP family pseudophosphatases.

Text S9. Estimating the sizes of the proteomes and kinomes.

Fig. S1. Phylogeny of the species used in this study.

Fig. S2. Classification of the human phosphatome.

Fig. S3. Paladin domain combination and conserved sequence motifs.

Fig. S4. Classification of the human protein phosphatome.

Fig. S5. Phylogenetic profile of protein phosphatase subfamilies in nine species.

Fig. S6. Evolutionary history of protein phosphatases and kinases.

Fig. S7. Comparison of the gains and losses of protein phosphatase and kinase subfamilies.

Fig. S8. Structure-based sequence alignment of representative CC1 fold phosphatase domains.

Fig. S9. The CX5R motif of DSP family pseudophosphatases shows conserved loss of catalytic activity.

Fig. S10. The CX5R motif of PTEN family pseudophosphatases.

Fig. S11. The CX5R motif of myotubularin family pseudophosphatases, showing conserved loss of catalytic residues.

Fig. S12. Loss of the catalytic motif in the TIM50 subfamily.

Fig. S13. Comparison of conserved sequence motifs of PPMs.

Fig. S14. Comparison of conserved sequence motifs between the PPIP5K subfamily and other HP2 phosphatases.

Fig. S15. Predicted PTP pseudophosphatases have alterations to CX5R, KNRY, and WPD motifs.

Fig. S16. Substrate overlap of protein phosphatases and kinases at the gene and residue levels.

Fig. S17. Workflow for building gene models.

Table S1. Classification, literature annotation, and published substrate specificity of the human phosphatome.

Table S2. Substrate preference for each phosphatase fold.

Table S3. Summary of human phosphatases by family and substrate type.

Table S4. Catalog of protein phosphatases from nine selected eukaryotes.

Table S5. Sequence data sets used to build protein phosphatomes.

Table S6. Subfamily gains and losses of protein phosphatases and kinases.

Table S7. Protein phosphatase subfamilies expanded in human.

Table S8. Human protein phosphatase pseudogenes.

Table S9. Human protein phosphatases and disease.

Table S10. Human kinases and disease.

Table S11. Accessory domains in human protein phosphatases.

Table S12. Domain gains and losses.

Table S13. Diversity in secondary structure profiles of CC1 fold phosphatase domains.

Table S14. Secondary structure profile of PTP family phosphatase domain structures.

Table S15. Secondary structure profile of DSP family phosphatase domain structures.

Table S16. Secondary structure profile of PTEN family phosphatase domain structures.

Table S17. Secondary structure profile of myotubularin family phosphatase domain structures.

Table S18. Secondary structure profiles of the Sac family phosphatase domain structure.

Table S19. Catalytic motifs by fold.

Table S20. Catalytically inactive human protein phosphatase domains.

Table S21. PTP domains known or predicted to lack PTP activity.

Table S22. Protein substrates of phosphatases and kinases.

Table S23. Protein phosphatase and kinase subfamilies gained and lost together.

Data S1. Sequence alignments and trees of protein phosphatases.

References (84–139)

REFERENCES AND NOTES

Acknowledgments: We thank Y. Zhai and M. Dacre from the Manning Lab for technical support and T. Hunter and members of the Manning Lab for critical discussions. We thank A. Ashley for Fig. 2. We thank P. Hornbeck and B. Zhang from PhosphoSitePlus for providing the phosphatase-substrate relationships and O. Mayba for reviewing statistical tests. Funding: This work was supported by the NIH (grant HG4164 to G.M. and grants DK18024 and DK18849 to J.E.D.) and by the Genentech postdoctoral program (to M.J.C.). Author contributions: M.J.C. designed, performed, and analyzed the computational work and co-wrote the manuscript. J.E.D. provided scientific guidance and co-wrote the manuscript. G.M. designed and reviewed analyses and co-wrote the manuscript. Competing interests: The authors declare that they have no competing interests. Data and materials availability: All data are available at http://phosphatome.net/.