Review | Biochemistry

Emerging concepts in pseudoenzyme classification, evolution, and signaling


Science Signaling  13 Aug 2019:
Vol. 12, Issue 594, eaat9797
DOI: 10.1126/scisignal.aat9797

Abstract

The 21st century is witnessing an explosive surge in our understanding of pseudoenzyme-driven regulatory mechanisms in biology. Pseudoenzymes are proteins that have sequence homology with enzyme families but that are proven or predicted to lack enzyme activity due to mutations in otherwise conserved catalytic amino acids. The best-studied pseudoenzymes are pseudokinases, although examples from other families are emerging at a rapid rate as experimental approaches catch up with an avalanche of freely available informatics data. Kingdom-wide analysis in prokaryotes, archaea and eukaryotes reveals that between 5 and 10% of proteins that make up enzyme families are pseudoenzymes, with notable expansions and contractions seemingly associated with specific signaling niches. Pseudoenzymes can allosterically activate canonical enzymes, act as scaffolds to control assembly of signaling complexes and their localization, serve as molecular switches, or regulate signaling networks through substrate or enzyme sequestration. Molecular analysis of pseudoenzymes is rapidly advancing knowledge of how they perform noncatalytic functions and is enabling the discovery of unexpected, and previously unappreciated, functions of their intensively studied enzyme counterparts. Notably, upon further examination, some pseudoenzymes have previously unknown enzymatic activities that could not have been predicted a priori. Pseudoenzymes can be targeted and manipulated by small molecules and therefore represent new therapeutic targets (or anti-targets, where intervention should be avoided) in various diseases. In this review, which brings together broad bioinformatics and cell signaling approaches in the field, we highlight a selection of findings relevant to a contemporary understanding of pseudoenzyme-based biology.

INTRODUCTION

Genomic sequencing and annotation and the subsequent mining of datasets from varied organisms confirm that most well-characterized enzyme families encode pseudoenzyme homologs, which are predicted to be enzymatically inactive due to the loss of at least one key catalytic amino acid residue. This basic description, garnered from primary sequencing data, has allowed the bioinformatic identification of pseudoenzyme genes (all of which are transcribed as mRNAs and translated into proteins) in >20 different protein families (1), including well-studied paradigms among the pseudokinases, pseudophosphatases, and pseudoproteases (2–10). As detailed in Table 1, subtle changes in catalytic and substrate-binding sites have likely led to the appearance of pseudoenzymes from classical enzyme templates, almost certainly after gene duplication events. As expected, therefore, pseudoenzymes share a similar overall protein fold when compared with catalytically active enzymes from the same family (11, 12). Of further interest, detailed comparison between pseudoenzymes and related enzyme counterparts can also unearth physiological noncatalytic biological functions of enzymes, driven by the adoption of regulated inactive conformation(s). Such additional functions are also broadly encapsulated in the related concept of protein “moonlighting” (13), where proteins are able to “multitask,” by performing different cellular functions as a consequence of their distinct environmental interactomes. Although an absence of conserved catalytic residues in a pseudoenzyme does not unequivocally prove catalytic deficiency, very high sequence and/or structural conservation suggests that pseudoenzyme sequences have been functionally selected across all the existing branches of life and have been preserved to regulate specific aspects of cell biology through catalytically independent mechanisms.
Despite the clear presence of pseudoenzymes in a substantial percentage of proteomes, we still understand very little about their individual function, especially relative to their active enzyme counterparts. Two recent international pseudoenzyme meetings, which were held in Liverpool, United Kingdom (2016) and Sardinia, Italy (2018), served as complementary hubs to bring together expertise from across the breadth of international biological communities and to discuss and dissect how computational, theoretical, and experimental data can be combined to make rapid progress in this new field. Below, we review, evaluate, and discuss some of these bioinformatic, structural, and biochemical approaches, which are rapidly revealing the extent, evolution, and distribution of pseudoenzymes across the kingdoms of life.

Table 1 The remarkable diversity of pseudoenzymes.

Examples of pseudoenzymes from across the kingdoms of life, organized by class and function. Pseudoenzymes are highlighted in blue, whereas relevant conventional enzymes are in black. A broad selection of well-studied pseudoenzymes are discussed; the list is not meant to be comprehensive. ADP, adenosine diphosphate; ROS, reactive oxygen species; IFN, interferon; GTPase, guanosine triphosphatase; RIPK3, receptor interacting serine/threonine protein kinase 3; COP1, constitutive photomorphogenic 1; TRIB1, Tribbles 1; DUSP4, dual specificity phosphatase 4; MTMR13, myotubularin-related protein 13; SP, serine protease; PPO, prophenoloxidases.


Experimentally established functional classes of pseudoenzymes

Pseudoenzymes are found among many metabolic and signaling classes of enzyme superfamilies (see Table 1 for an annotated selection). In terms of signaling outputs, pseudoenzyme actions can be rationalized based on four types of functional mechanisms: (i) regulating catalytic outputs of conventional (canonical) enzymes, (ii) acting as integrators of signaling events and/or toggling between signaling states as molecular switches, (iii) controlling assembly and localization of signaling hubs, or (iv) binding substrates/subunits to control the activity of conventional enzymes (14).

Pseudoenzymes that act through the first of these mechanisms, for which an increasing number of examples are available in the protein kinase, phosphatase, and ubiquitin signaling literature, all retain clear “enzyme-like” overall folds. These pseudoenzymes evolved the ability to regulate catalysis of a bona fide enzyme-associated partner to generate a graded biological output in which the pseudoenzyme and enzyme interaction remains the crucial controlling factor. The kinase-pseudokinase pair is perhaps the best-known example from this pseudoenzyme class (15). However, this recurring theme has recently been exemplified in the ubiquitin field, with the analysis of the BRCC36 isopeptidase complex (BRISC) and the BRCA1-A complex, in which the catalytically active JAMM (JAB1/MPN/Mov34 metalloenzyme) domain Lys63-specific deubiquitinase (DUB) BRCC36 is physically partnered with the pseudo-DUBs Abraxas1 and Abraxas2 (16). BRISC permits cooperation between a newly discovered moonlighting function of the metabolic enzyme serine hydroxymethyltransferase 2 (SHMT2), the pseudo-DUB Abraxas2, and three tandem pseudo–ubiquitin E2 variant (UEV) domains of BRCC45 (Fig. 1). Complex assembly is regulated by SHMT2, which requires pyridoxal-5′-phosphate (PLP) to form a tetramer that catalyzes its canonical role in nucleotide and amino acid metabolism. Catalytically inactive SHMT2 dimers, but not active, PLP-bound tetramers, inhibit BRISC DUB activity, and intracellular PLP abundance controls this newly discovered moonlighting SHMT2 function, which leads to the coordinated regulation of immune signaling in cells (17).

Fig. 1 Regulated assembly of the macromolecular BRISC-SHMT2 pseudoenzyme-containing complex.

(Top) Schematic of the SHMT2 dimer-tetramer oligomerization transition, which is regulated by pyridoxal-5′-phosphate (PLP), the active cofactor form of vitamin B6. The SHMT2 dimer, which is inactive as a methyltransferase (left), specifically interacts with, and inhibits, the pseudoenzyme-containing BRISC complex (bottom), revealing a previously unknown moonlighting role for SHMT2. The PLP-bound SHMT2 tetramer, which is active as a methyltransferase, is unable to bind or inhibit the BRISC complex. (Bottom) Schematic of the multimeric BRISC-SHMT2 DUB complex, which contains the active DUB BRCC36 [MPN+ (MPR1, PAD1 N-terminal)] and the inactive Abraxas2 pseudo-DUB (MPN). BRCC45 contains three pseudo-E2 domains (UEVs) and is discussed further in Table 1 and in the text.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

A second class of pseudoenzymes acts as binary “switches” by integrating signals in the form of posttranslational modifications (PTMs) or by binding to metabolic ligands (perhaps ancient substrates), triggering interconversion between active and inactive conformations. Several of these are presented in more detail in Table 1.

The third category of pseudoenzymes possesses distinct biological functions by operating as protein interaction scaffolds, generating (sub)cellular focal points that nucleate assembly of protein complexes, or regulating the localization of a binding partner. Examples are shown in Table 1, with the most profligate example to date being pre-mRNA processing-splicing factor 8 (PRPF8), which contains four pseudoenzyme domains (Table 1) that act as scaffolds for assembly of the enzymatic spliceosome complex (16, 18). The further study of protein interactions made by pseudoenzyme domains (“pseudoenzyme interactomes”) will provide interesting information about the role of these proteins in the cell. Well-known examples include the interaction of the pSer/Thr/Tyr interacting (STYX) family of pseudophosphatases, specifically MK-STYX/DUSP24, with components of the canonical extracellular signal–regulated kinase (ERK) pathway (19) and independent targeting to F-box/WD repeat-containing protein 7 (FBXW7), a subunit of the Skp, Cullin, F-box containing (SCF) ubiquitin E3 ligase (20). In a broad sense, molecular interaction data analysis can now be facilitated by using the detailed curation model used by the International Molecular Exchange Consortium of molecular interaction databases (21), which accurately maps interaction data to specific binding regions of proteins and captures the effects of site-directed mutagenesis on potential protein-protein interactions (22).

The final major pseudoenzyme category contains examples in which the protein has repurposed canonical features of a fold that is shared with active enzyme relatives, probably to act as competitors for either substrate binding (as catalytic “traps”) or higher-order complex assembly. A good example of this fourth category is pseudophosphatases, which usually retain the ability to bind to “substrates,” such as phosphorylated peptides, but are no longer able to enzymatically process them, providing a trapping or localization mechanism in situ (5, 23). The same might also be true of pseudoproteases (2), which have reasonable binding affinity for substrates (and can therefore control their fate) but do not cleave their substrates, performing a catalytically independent regulatory role. Outside of these notable pseudoenzyme families, relatively few examples of competitor and/or “decoy” pseudoenzymes have been identified to date. However, new data confirm their expansion within plant, fungal, and pathogen proteomes (24–27), including, for example, a new viral pseudoenzyme that provides the pathogen with an advantage by directly competing with host cell defense mechanisms (28).

Below, we present the fruits of the findings from the first two international meetings held between members of the pseudoenzyme community, including new state-of-the-art computational approaches that can exploit both genomic and proteomic (and in the future metabolomic) datasets. Together, these are revealing a vast number of pseudoenzymes embedded within genomes and helping to create the first advanced analytical frameworks to probe and understand how pseudoenzymes function in a multitude of biological niches.

Bioinformatic resources for pseudoenzyme analysis: How computational scientists can identify and curate pseudoenzymes

The UniProt knowledgebase (www.uniprot.org) provides the scientific community with a comprehensive, high-quality, and freely accessible resource of protein sequence and functional information for many species (29). For enzymes, which represent between 20 and 40% of most proteomes, UniProtKB provides additional information about Enzyme Commission (EC) classification, catalytic activity, cofactors, enzyme regulation, kinetics, and pathways, all based on critical assessment of published experimental data. Bioinformatic and structural data are also used to enrich the annotation of the sequence with the identification of active sites and binding sites. Whereas the annotation of enzymes is very well defined, the curation of pseudoenzymes has proven much more challenging. The main issue resides in their identification, especially when the only evidence usually available to define a pseudoenzyme is the absence of critical active site residues. To infer a lack of catalytic activity based solely on sequence analysis can be misleading. Experimental evidence can also be difficult to interpret, especially when it contradicts sequence analysis–based prediction, and catalysis may be present or absent among orthologs. During the curation process, all the available evidence must be manually assessed to decide whether the protein could be considered a pseudoenzyme or not. Another important challenge is how to translate this information into consistent, meaningful annotation to facilitate the subsequent retrieval of pseudoenzymes by users. After the two international pseudoenzyme conferences held in 2016 (Liverpool, United Kingdom) and 2018 (Sardinia, Italy), the annotation of pseudoenzymes in UniProt underwent a revision to reflect advances in the pseudoenzyme field. For example, the UniProt curators improved the usage of protein names to directly reflect their inactive status.
In addition to providing the enzyme family to which they are related, as well as the position of the “catalytically inactive” domain, UniProt explains the reason why a protein is considered a pseudoenzyme in a “Caution” comment in the enzyme function section (Fig. 2). Importantly, the source of the evidence used to infer the lack of catalytic activity—experimental, sequence analysis, or orthology based—has also been added to make pseudoenzyme definitions as evidence-based as currently practicable (see Table 1).

Fig. 2 Computational annotation of enzymes and pseudoenzymes in UniProt.

A worked example of the inactive annotation for the C. elegans pseudophosphatase egg-4 (UniProtKB O01767), including the protein name, the protein-coding evidence (status), and the caution comment. The image is recreated and adapted from the UniProt website.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

As an experimental test bed for the curation and annotation procedures needed to create pseudoenzyme databases, UniProtKB has recently reviewed and updated the annotation of the complete kinome (30) and phosphatome of the model worm Caenorhabditis elegans, which contain 438 kinases and 237 phosphatases, respectively. Among 208 kinase genes whose function has been experimentally characterized, 41 are annotated as pseudokinases, although supporting evidence is often lacking beyond predicted sequence information. For >90% of worm pseudokinases, a lack of catalytic activity is based only on sequence analysis (Fig. 3A). For genes that lack functional characterization, the sequence analysis that predicted lack of kinase activity can be supported by experimental evidence in orthologs. However, this is not always possible, especially for families that contain mainly C. elegans–specific genes. This is the case for the two most abundant kinase families, Casein Kinase 1 (CK1) and TK. For the CK1 family, 11 of the 83 members (13%), and for the TK-KIN-16 group, 4 of the 15 members (27%), are predicted to be inactive based on the loss of the active site. The TK-Fer group is predicted to contain only active members. A similar situation is found in the C. elegans phosphatome. Of the 237 phosphatases, 8 are annotated as “inactive” pseudoenzymes, although only 1 pseudophosphatase has documented experimental evidence supporting the facile informatics–based prediction of catalytic sterility (Fig. 3B). These two examples illustrate the need to improve experimental characterization of pseudoenzymes, because inferences based only on sequence analysis could result in the incorrect attribution of inactivity, as the well-known example of the with-no-lysine 1 (WNK1; UniProtKB Q9JIH7) family of serine/threonine-protein kinases ably demonstrates (31, 32).
On the basis of sequence analysis, WNK enzymes would be predicted to be inactive; however, under various experimental conditions tested, they are catalytically active (6). The WNK members demonstrate the importance of biochemical and structural analysis in understanding enzymes and pseudoenzymes in enzyme superfamilies, such as phosphotransferases; in the case of the WNK “pseudokinases,” this has led to their more accurate reclassification to atypical kinases.
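
The proportions quoted above for the C. elegans kinome and phosphatome can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the counts given in the text; the function name `percent` and the dictionary labels are illustrative choices, not part of any UniProt tool.

```python
def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    if whole <= 0:
        raise ValueError("whole must be a positive count")
    return round(100.0 * part / whole, 1)

# Counts taken from the text (predicted inactive members, group size).
groups = {
    "characterized kinases annotated as pseudokinases": (41, 208),
    "CK1 family": (11, 83),
    "TK-KIN-16 group": (4, 15),
    "phosphatome": (8, 237),
}

for label, (inactive, total) in groups.items():
    print(f"{label}: {inactive}/{total} = {percent(inactive, total)}%")
```

Because nearly all of these assignments rest on sequence analysis alone, such percentages are best read as predictions rather than measured rates of inactivity.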

Fig. 3 Analyzing the (pseudo)kinome and (pseudo)phosphatome in a model worm.

(A) Calculated percentage of pseudokinases in the C. elegans kinome (orange). (B) Percentage of pseudophosphatases in the C. elegans phosphatome (orange).

CREDIT: A. KITTERMAN/SCIENCE SIGNALING

Perhaps the most difficult challenge facing the curation field is the vast number of protein sequences that continue to be made available through genome sequencing projects. For example, in just 1 month between UniProt releases in October and November of 2018, more than 3 million new protein sequences were added to the database. To provide users with information on these proteins (without experimental evidence in nearly all cases), automatic annotation rule–based systems have been devised to enrich the annotation of protein sequences based on protein family membership. These identification systems, on the basis of related manually curated protein entries, provide users with basic predicted functional annotation; one example is UniProtKB H9J2B7, a predicted tyrosine-protein kinase receptor based on automatic sequence detection analysis. Despite the development of rules and tools aimed at identifying differences and similarities in natural variation, such rules may not recognize pseudoenzymes even within a closely related enzyme family.

Evolutionary biology as a driver in the pseudokinase field

Whereas several workflows exist for annotating enzymes, the “rules” for automatic annotation of pseudoenzymes are currently lacking. Focused progress needs to be made on this issue across multiple different enzyme families to fully appreciate pseudoenzyme diversity and potential biological ubiquity. Reliable identification of catalytic site residues that can be used to predict a loss of enzymatic activity remains difficult. However, increased experimental characterization of pseudoenzymes in the last decade, and new evolutionary analyses of catalytic sites and catalytic domains, continue to provide invaluable information that improves manual curation and expands the automatic identification/annotation of pseudoenzymes. A case in point is the pseudokinases, for which a kinome-wide database has now been assembled across all known eukaryotic, bacterial, and archaeal proteomes (24). This analysis, published in Science Signaling, has revealed that pseudokinases are present across all three domains of life; in total, about 30,000 eukaryotic, 1500 bacterial, and 20 archaeal pseudokinase sequences were classified into 86 distinct pseudokinase families, including ~30 well-conserved pseudokinase families that were not previously reported (24). The rich diversity of pseudokinases that occurs across the kingdoms of life exhibits notable family-specific expansions in animals, plants, fungi, and bacteria, where pseudokinases (and pseudoenzymes in general) had previously received cursory attention. Pseudokinase expansions are often accompanied by domain shuffling, which appears to have promoted new roles in plant innate immunity, modulation of plant-fungal interactions, and bacterial signaling. Mechanistically, the ancestral kinase fold, an ideal template for the generation of new functions in pseudoenzymes, has diverged in many distinct ways through the enrichment of unique signature sequence motifs, generating, in turn, a slew of new pseudokinase families.
The catalytic kinase domain is also repurposed for noncanonical nucleotide binding or atypical catalysis or to stabilize unique, catalytically inactive kinase conformations associated with catalytically independent types of signaling (see below). To conveniently compare these complex datasets, an annotated, searchable collection of all known predicted pseudokinase sequences, and their evolutionary relationships, has been captured in the freely available protein kinase ontology (http://vulcan.cs.uga.edu/prokino/hierarchy/ProteinPseudokinaseDomain) (24, 33).

Structural biology as a key driver in the pseudokinase and pseudoenzyme fields

Structural studies continue to play a pivotal role in advancing our knowledge of functional evolution among enzyme superfamilies, most notably the kinase superfamily, which is composed of a very broad range of small-molecule, antibiotic, glycan, and protein kinases (34). Notable examples of these important pseudoenzymes are discussed here briefly, beginning with the remarkable finding that nucleotides can bind in different modes in predicted pseudoenzymes, in some cases, leading to unusual (and completely unexpected) types of catalysis. For example, an inverted conformation of adenosine triphosphate (ATP) was observed in the pseudokinase FAM20A (Fig. 4A) (35, 36), which is a catalytically inactive pseudoenzyme regulator of the conventional protein kinase FAM20C (37). However, a structure of Selenoprotein-O (SelO) revealed a similar atypical ATP-binding mode, but rather than being catalytically inactive, SelO has evolved the capacity to “AMPylate,” rather than phosphorylate, a broad spectrum of protein substrates (Fig. 4B) (38). In addition, several (pseudo)kinases have been confirmed to phosphorylate sugar (as opposed to amino acid) residues, including FAM20B (Fig. 4C) (37, 39) and SgK196/protein O-mannose kinase (POMK) (Fig. 4D) (40). Intriguingly, the bacterial pseudokinase SidJ has recently been shown to catalyze protein glutamylation of the SidE family of ubiquitin E3 ligases, inhibiting their catalytic output (41, 42).

Fig. 4 Diversity in ATP-binding mode and the acquisition of noncanonical catalytic functions among bioinformatically annotated pseudokinases.

(A) The catalytically inactive pseudokinase, FAM20A, still binds ATP (red sticks) but in an inverted conformation [Protein Data Bank (PDB): 5wrs] (35). Like the related FAM20B, the position of the αC helix (orange) differs from that in canonical protein kinases. (B) The highly atypical annotated pseudokinase SelO (PDB: 6eac) (38) can catalyze protein AMPylation via an unusual catalytic mechanism. Like FAM20A (A), SelO binds ATP in an inverted conformation but, in addition, it catalyzes AMP transfer to protein substrates. ATP analog AMP-PNP is shown as red sticks; Mg2+ and Ca2+ are shown as yellow and green spheres, respectively. (C) The predicted pseudokinase, FAM20B (PDB: 5xoo), is actually a catalytically active xylose kinase (37) involved in proteoglycan synthesis. The sugar substrate is shown as green sticks; an adenine (red sticks) is also modeled in the structure. (D) Like FAM20B, the predicted pseudokinase SgK196/protein O-mannose kinase (POMK; PDB: 5gza) is actually a sugar kinase (40). SgK196/POMK closely resembles a typical protein kinase fold with conventional αC helix (orange) position, nucleotide-binding mode (red sticks), and Mg2+ cofactors (yellow spheres), and excitingly, this structure captures the protein bound to its sugar substrate (green sticks).

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

Although not classically considered to be a specific catalytic motif, the variable glycine-rich loop sequence in kinases (Gly-X-Gly-X-X-Gly) serves as a flap-like structural feature to promote and lock ATP binding. Interestingly, this motif has been found to be dispensable for catalytic activity in an atypical catalytically active coccidian kinase, which has been termed WNG1 (with no glycine-1) (43). A recent study suggests a catalytically active, druggable conformation in Drosophila BubR1 that can directly phosphorylate the motor protein centromere-associated protein E (CENP-E) (44). This likely differentiates it from human BubR1, which is reported to be an inactive pseudokinase (45) lacking a Gly-rich loop and that does not bind detectably to ATP (46). Although it is more common that proteins annotated as pseudokinases based on the absence of the traditional catalytic residues (which do not, by definition, include the Gly-rich loop) are indeed catalytically defective when scrutinized at the appropriate biochemical level, these examples underscore the versatility of the kinase fold to evolve diverse functions, especially when released from the evolutionary pressure of maintaining a catalytically competent fold (47). More broadly among pseudoenzymes, such functions likely speak to a noncatalytic substrate binding to an ancestral enzyme, which has evolved to perform a catalytic function, as has been demonstrated during in vitro evolution experiments (48, 49) and by using structural features to evolve canonical activity in the pseudokinase CASK (Ca2+/calmodulin-activated serine/threonine kinase), where four amino acid substitutions are required to regenerate an Mg-ATP–dependent enzyme (50). 
In a similar vein, improving the curation of (nonkinase) pseudoenzyme families will provide the scientific community with valuable information to understand the evolution of these proteins, the etiology of related diseases, and the development and repurposing of pseudoenzyme-targeted drugs, a useful bonus in carefully conducted pseudoenzyme studies (7, 51, 52).
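
The glycine-rich loop consensus described above (Gly-X-Gly-X-X-Gly) lends itself to a simple pattern search. The sketch below is illustrative only: the function name and both toy sequences are invented for the example, and, as the WNG1 case shows, the absence of a match cannot by itself be taken as evidence of catalytic inactivity.

```python
import re

# Gly-X-Gly-X-X-Gly, where X is any residue. The lookahead allows
# overlapping matches to be reported.
GLY_RICH_LOOP = re.compile(r"(?=(G.G..G))")

def find_gly_rich_loops(sequence: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched hexapeptide) for each motif hit."""
    seq = sequence.upper()
    return [(m.start(), m.group(1)) for m in GLY_RICH_LOOP.finditer(seq)]

# Toy sequences invented for illustration; they are not real kinase domains.
with_loop = "MKLGEGTFGKVA"
without_loop = "MKLAEATFAKVA"

print(find_gly_rich_loops(with_loop))     # one hit starting at index 3
print(find_gly_rich_loops(without_loop))  # no hits
```

A hit list of this kind is only a starting point for curation; the traditional catalytic residues and any available biochemical evidence still need to be assessed separately.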

A new approach for the identification of pseudoenzymes using UniProt

To help take the pseudoenzyme field forward, we have designed and tested a simple computational pipeline to identify and annotate pseudoenzymes using sequence alignments, UniProt annotations assembled from the primary literature (53), and information from the Mechanism and Catalytic Site Atlas (M-CSA), a database that currently contains defined catalytic residues and mechanistic data for 964 enzymes (54). We have limited our current analysis to those enzymes for which we have a good or detailed knowledge of their catalytic mechanism (such as those in M-CSA) and therefore know the specific residues involved. To identify pseudoenzymes among these data, we start by finding all the SwissProt sequences that are homologous to entries in M-CSA, by using phmmer with an E-value cutoff of 10⁻¹⁰ (55). This value is reasonably stringent and will act as a filter to include only close homologs. More distant relatives are better identified using the structure-based approaches developed by Orengo and colleagues (12, 56), as described below and in related work. Our procedure yields a broad collection of sequences from all the domains of life, although the sample can be biased by the (current) uneven representation of sequences in SwissProt and enzymes in M-CSA. After their identification, we categorize each homolog as enzyme or nonenzyme according to its annotation in SwissProt. There are at least three types of annotation in SwissProt that we can use to identify enzymes: EC numbers, UniProt keywords, and Gene Ontology (GO) terms. These annotations provide us with three possible rules. A sequence can be categorized as an enzyme if it (i) has at least one EC number, (ii) is annotated with at least one catalytic UniProt keyword (oxidoreductase, transferase, hydrolase, lyase, isomerase, ligase, translocase, or their hierarchical children), or (iii) is annotated with a catalytic GO term (“catalytic activity” and its child terms).
Unfortunately, the three rules are not currently entirely consistent. For example, there are about 3000 sequences with a catalytic keyword but no EC number, and about 35,000 sequences have a catalytic GO term but have no catalytic keyword. These differences may be due to out-of-date annotation, no available EC number in the EC classification, or GO terms that are automatically transferred on the basis of homology, which will label pseudoenzymes as enzymes. Overall, rule (ii), based on the UniProt keywords, seems to be the most comprehensive without using extended annotation based uniquely on homology, and so we have applied it for the purpose of this review. There are no pseudoenzymes in 669 of the enzyme families curated in M-CSA. In 237 of the enzyme families where pseudoenzymes are found, they account for less than 10% of the sequences in the family, whereas in 88 families they account for more than 10% (the sum of the last nine columns; Fig. 5). However, the percentage of families identified as including pseudoenzymes is currently a minimum number; more can and will be found with deeper searches. Below, we highlight how we can use data from individual enzyme families to try to understand how pseudoenzymes have evolved, an important central question in evolutionary biology.
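
The three annotation rules, and the way rule (ii) is used to estimate the share of pseudoenzymes in a family, can be sketched as follows. This is a simplified illustration rather than the pipeline itself: the `Entry` class, the function names, and the toy accessions are hypothetical, the homology search step (phmmer) is assumed to have been run already, and the hierarchical children of each keyword are ignored.

```python
from dataclasses import dataclass, field

# Top-level catalytic UniProt keywords named in the text.
# Hierarchical child keywords are not handled in this sketch.
CATALYTIC_KEYWORDS = {
    "Oxidoreductase", "Transferase", "Hydrolase", "Lyase",
    "Isomerase", "Ligase", "Translocase",
}

@dataclass
class Entry:
    """Hypothetical, simplified view of one SwissProt homolog."""
    accession: str
    ec_numbers: list = field(default_factory=list)
    keywords: set = field(default_factory=set)
    has_catalytic_go: bool = False

def rule_ec(entry: Entry) -> bool:
    """Rule (i): at least one EC number."""
    return len(entry.ec_numbers) > 0

def rule_keyword(entry: Entry) -> bool:
    """Rule (ii): at least one catalytic UniProt keyword."""
    return bool(entry.keywords & CATALYTIC_KEYWORDS)

def rule_go(entry: Entry) -> bool:
    """Rule (iii): a catalytic GO term ('catalytic activity' or a child)."""
    return entry.has_catalytic_go

def pseudoenzyme_fraction(family: list) -> float:
    """Fraction of a family classed as pseudoenzymes under rule (ii)."""
    if not family:
        raise ValueError("family must contain at least one entry")
    n_pseudo = sum(1 for e in family if not rule_keyword(e))
    return n_pseudo / len(family)

def family_bin(fraction: float) -> str:
    """Coarse version of the binning used to summarize families."""
    if fraction == 0:
        return "no pseudoenzymes"
    return "<10%" if fraction < 0.10 else ">=10%"

# Toy family; accessions are placeholders, not real UniProt entries.
toy_family = [
    Entry("TOY_A", ec_numbers=["3.2.1.2"], keywords={"Hydrolase"},
          has_catalytic_go=True),
    Entry("TOY_B", keywords={"Hydrolase"}, has_catalytic_go=True),  # no EC
    Entry("TOY_C", has_catalytic_go=True),  # GO term only
    Entry("TOY_D"),                         # no catalytic annotation
]

fraction = pseudoenzyme_fraction(toy_family)
print(fraction, family_bin(fraction))
```

In this toy family, TOY_B illustrates a sequence with a catalytic keyword but no EC number, and TOY_C a sequence with a catalytic GO term but no keyword, the two kinds of disagreement between rules noted above.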

Fig. 5 Estimating the ratio of pseudoenzymes in known enzyme families.

Estimated proportion of pseudoenzymes within known enzyme families. A family is defined here as the group of SwissProt sequences that are homologous to one enzyme/entry in M-CSA. Each entry in M-CSA corresponds to one enzyme with a unique enzyme mechanism, so the same EC number can be represented more than once if it evolved independently with distinct mechanisms. Sequences are categorized as enzymes if they have a catalytic UniProt keyword and as pseudoenzymes otherwise. The orange bar corresponds to enzyme families that contain only enzymes; the blue bars correspond to enzyme families that contain a variable percentage of pseudoenzymes.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

Evolution of pseudoenzymes

Knowing the number of pseudoenzymes associated with each M-CSA enzyme entry is an important starting point for understanding how these proteins came to “lose” (or repurpose) their catalytic function, assuming that they evolved from canonical enzymes in the first place. The reconstruction of their evolutionary trees is necessary to tackle some of the fundamental questions that the field would need to answer. These include the following: (i) How common are the evolutionary “jumps” that transform either enzymes into pseudoenzymes or pseudoenzymes into enzymes? (ii) Which of these transformations is more common? Last, (iii) can we identify the conserved mutation(s) in the catalytic residues that drive these changes? Below, we have attempted to answer the last question by generating and annotating phylogenetic trees for all the enzymes in M-CSA. In the phylogenetic tree for β-amylase (Fig. 6), which contains two pseudoenzymes, the homologs belong mostly to plants, except for five that have a bacterial origin. The red and green circle on the tree represents the point where one pseudoenzyme diverged from an ancestral enzyme. The two existing pseudoenzymes evolved from this pseudoenzyme ancestor, so there is only one “loss-of-function” event represented in the tree. The green circle at the base of the tree means that the last common ancestor for all the homologs in the tree was most likely an enzyme, consistent with the directionality of evolution from enzyme to pseudoenzyme. The pseudoenzymes, which are both classified as inactive in UniProtKB and have no associated EC numbers, have the most mutations in their catalytic residues of all the homologs identified. Furthermore, the pseudoenzymes are the only homologs in the tree with substitutions replacing the catalytic residue, Glu381. 
This acidic side chain acts as the general base that activates the nucleophilic water molecule to a reactive hydroxide ion, whereas other residues such as Thr343, Leu384, and Asp102 have stabilization roles that are not critical for the reaction and so can be (and are) readily replaced during evolution without affecting enzyme function. It is currently impossible to say whether these mutations were the original cause of the loss of function or whether they happened after the loss of function, which may have occurred through other means such as simple accumulation of mutations without corrective selective pressure. After the loss-of-function event, a lack of corrective selection pressure will lead to the accumulation of mutations in the catalytic residues. For this enzyme set, the mutation of catalytic residues alone does not necessarily lead to loss of function, as some of the other active β-amylases have catalytic mutations. Gain of function may also be generated through means other than catalytic residue mutations, as illustrated by the β/α-amylase protein (P21543), which has an additional domain with α-amylase activity (Fig. 6, dark blue).
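The reasoning used to count loss-of-function events on an annotated tree can be illustrated with a small sketch. The text does not state how ancestral states were assigned, so the example below substitutes Fitch parsimony, a standard technique, purely for illustration. The tree shape and leaf names are hypothetical and are not taken from Fig. 6.

```python
def fitch(tree, leaf_state):
    """Fitch parsimony on a rooted binary tree.

    tree is either a leaf name (str) or a pair (left, right).
    leaf_state maps each leaf name to "E" (enzyme) or "P" (pseudoenzyme).
    Returns (candidate states at this node, minimum number of state changes below it).
    """
    if isinstance(tree, str):
        return {leaf_state[tree]}, 0
    left, right = tree
    left_states, left_cost = fitch(left, leaf_state)
    right_states, right_cost = fitch(right, leaf_state)
    shared = left_states & right_states
    if shared:
        # Children agree on at least one state: no change is needed at this node.
        return shared, left_cost + right_cost
    # Children disagree: one change must have occurred on a branch below this node.
    return left_states | right_states, left_cost + right_cost + 1


# Hypothetical toy tree: two pseudoenzymes nested inside a clade of enzymes.
TOY_TREE = ((("pseudo_1", "pseudo_2"), "enzyme_1"), ("enzyme_2", "enzyme_3"))
TOY_STATES = {
    "pseudo_1": "P",
    "pseudo_2": "P",
    "enzyme_1": "E",
    "enzyme_2": "E",
    "enzyme_3": "E",
}

root_states, n_changes = fitch(TOY_TREE, TOY_STATES)
print(root_states, n_changes)
```

On this toy input the root is reconstructed as an enzyme and a single state change explains both pseudoenzymes, which mirrors the argument that two sister pseudoenzymes imply only one loss-of-function event.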

Fig. 6 Annotated phylogenetic tree for soybean β-amylase.

Phylogenetic tree for homologous sequences of soybean β-amylase (P10538) annotated with the catalytic residues identified in M-CSA, their EC numbers, and protein families domains. The tree contains 20 homologs from plants (of which 6 are chloroplastic) and 5 from bacteria. The two inactive pseudoenzymes identified (Q9FM68 and Q8VYW2) belong to Thale cress (Arabidopsis thaliana), and one of these (Q9FM68) is known to have physiological regulatory functions in this organism.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

Use of GO terminology to estimate the global extent of pseudoenzymes

There are currently a large number of entries in UniProt (about 35,000) that do not have a catalytic keyword but do have a catalytic GO term. This inconsistency is brought about by GO annotation that is automatically extended from enzymes to their homologs, regardless of the presence or absence of actual catalytic activity. Consequently, many of the sequences annotated with a catalytic GO term are themselves pseudoenzymes. To illustrate this contradiction, we also performed an analysis in which we removed all the sequences whose annotation is contradictory. This led to a smaller total number of pseudoenzymes identified within the set of M-CSA entry homologs, as would be expected (fig. S1), and the number of enzyme families with pseudoenzyme members drops from 325 (32.7% of all the families) to 132 (13.3%). This type of analysis remains rather subjective unless backed up with supporting data and represents one of the major experimental challenges in the pseudoenzyme field, wherein biochemical and cellular analysis is often complex, time consuming, and costly.
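The effect of excluding contradictory annotations can be sketched as follows. This is a minimal illustration, assuming each sequence is reduced to two flags (catalytic keyword present, catalytic GO term present). The records, family names, and accessions are hypothetical and do not come from UniProt or M-CSA.

```python
# Hypothetical toy records:
# (family, accession, has_catalytic_keyword, has_catalytic_go_term)
RECORDS = [
    ("fam1", "A1", True, True),
    ("fam1", "A2", False, True),   # contradictory: no keyword, but a catalytic GO term
    ("fam2", "B1", True, True),
    ("fam2", "B2", False, False),  # no catalytic annotation of either kind
    ("fam3", "C1", True, True),
]


def label(has_keyword, has_go_term, drop_contradictory):
    """Return 'enzyme', 'pseudoenzyme', or None if the sequence is excluded."""
    if has_keyword:
        return "enzyme"
    if has_go_term and drop_contradictory:
        return None  # keyword and GO term disagree, so the sequence is left out
    return "pseudoenzyme"


def families_with_pseudoenzymes(records, drop_contradictory=False):
    """Return (families with at least one pseudoenzyme, total families)."""
    labels_by_family = {}
    for family, _accession, has_keyword, has_go_term in records:
        labels = labels_by_family.setdefault(family, set())
        assigned = label(has_keyword, has_go_term, drop_contradictory)
        if assigned is not None:
            labels.add(assigned)
    hits = sum("pseudoenzyme" in labels for labels in labels_by_family.values())
    return hits, len(labels_by_family)


print(families_with_pseudoenzymes(RECORDS))
print(families_with_pseudoenzymes(RECORDS, drop_contradictory=True))
```

In this toy set, two of three families contain a pseudoenzyme under the keyword-only rule, and one of three remains once the contradictory sequence is removed, the same direction of change as the drop from 325 to 132 families reported above.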

Identifying pseudoenzymes in proteomic databases by exploiting the class, architecture, topology, homology resource

Pseudoenzymes have now been identified in many major enzyme families across the tree of life; predictably, this has been predominantly through computational sequence–based analyses (Table 1) (24, 30). Most studies have traditionally compared the sequence of a relative of unknown structure and functional residues against sequences of relatives, which have been structurally characterized and annotated with known catalytic residues that have been confirmed experimentally (14). However, one study has indicated that homologous proteins that share a common core domain structure can often acquire new enzyme functions by changes both in the nature of their catalytic residues and in the absolute position of these catalytic residues in the protein scaffold (56). As a result, it can be very difficult to provide confident predictions of pseudoenzymes by only looking at deviations from known catalytic sites, most notably if this is done so in the absence of experimental characterization and/or phenotypic or mutational information. More complex approaches are therefore deemed suitable wherever possible.

As an extension of the approach described above using enzyme terminology, we have also systematically investigated the distribution of pseudoenzymes in the protein universe, using protein families from the CATH (class, architecture, topology, homology)–Gene3D resource (57, 58), which links protein domain sequences to structures and experimental functions. CATH-Gene3D classifies ~435,000 domain structures and ~95 million protein domain sequences into ~6100 evolutionary superfamilies (57). These can then be subclassified into functional families (FunFams) that share highly similar structures and functions based on sequence patterns, specificity-determining positions, and other conserved positions (Fig. 7A) (59). Although it is difficult to achieve complete separation in some extremely diverse superfamilies, the accuracy of these functional family classifications has been validated by comparison against experimental data and by endorsement through blind, independent assessment in the Critical Assessment of protein Function Annotation algorithm–based evaluation of functional annotations (60). The number of functional families reflects the functional diversity of a particular superfamily and can be used to explore how protein function is modulated in diverse superfamilies.

Fig. 7 Analysis of the distribution of pseudoenzymes.

(A) Sequences in CATH superfamilies are subclassified into functional families predicted to share similar structures and functions, and which can be used to understand protein function evolution. (B) Distribution of the number of enzyme superfamilies (containing catalytic domains) that have varying proportions of functional families with enzyme annotations. (C) The number of putative pseudoenzyme families in enzyme superfamilies (containing catalytic domains). These remain to be confirmed by further analysis and experimental testing.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

We examined 383 enzyme superfamilies in CATH-Gene3D v.4.2 that contain well-known (experimentally validated) catalytic domains (54, 56) and identified the proportion of functional families that have enzyme annotations and compared them to those that lack any enzyme annotation. These are highly populated superfamilies, accounting for 64% of sequences in all CATH enzyme superfamilies and 60% of all sequences in CATH. A functional family was considered to have enzyme annotations if it has at least one relative that has an EC annotation in UniProtKB (29) and an experimental GO (61) annotation for catalytic activity. For a third (131) of these enzyme superfamilies, all functional families were annotated as enzymes (Fig. 7B). However, about 252 enzyme superfamilies (two-thirds) had varying proportions of functional families that had no enzyme annotations in the EC classification or GO (Fig. 7C), suggesting that these are very likely to be pseudoenzymes. To collate this information in a searchable manner, we have created the first online list of putative pseudoenzyme superfamilies (https://uclorengogroup.github.io/cath-pseudoenzymes/index.html). To explore some of these pseudoenzyme-containing families, we also developed a protocol that first generates structure-guided multiple sequence alignments of multiple functional families within a superfamily that share very similar structures [such as where relatives superimpose with a root mean square deviation (RMSD) of <5 Å] and are, therefore, likely to share catalytic mechanisms. For these structural clusters, grouping structurally similar relatives from different functional families, we next examined whether there was at least one functional family within them with known catalytic residues (54) and at least one family with no enzyme annotation. 
A functional family lacking enzyme annotation was identified as a putative pseudoenzyme family if we observed a loss or substantial change in known catalytic site residues (identified in the enzyme family) in all relatives of the putative pseudoenzyme family, determined using the comprehensive structural cluster alignment. In other words, the putative pseudoenzyme family has conserved residues in the active site that differ substantially in their physicochemical nature or three-dimensional location from the known catalytic residues of the enzyme family. The final confirmation of the sequence as a pseudoenzyme can only come from exhaustive experimental testing, as is now becoming much more commonplace in major signaling superfamilies, such as the phosphatases and pseudophosphatases, kinases and pseudokinases, and proteases and pseudoproteases (5, 14). Last, analysis of some of these structural clusters in superfamilies enabled us to identify families containing previously reported pseudoenzymes, such as nuclear transport factor 2 (NTF2) and calsequestrin (62). For example, the CATH superfamily 3.10.450.50 contains ~38% of functional families with enzyme annotations, whereas 62% of the families do not have any enzyme annotations. The distinct functional family containing NTF2 is structurally very similar (<4-Å RMSD, same structural cluster) to the enzyme families containing scytalone hydratases and steroid δ-isomerases but lacks the catalytic machinery of those enzyme relatives (Fig. 8). Another interesting example is N-Myc downstream-regulated gene 2, which is structurally similar to the enzyme families in the α/β-hydrolase superfamily (CATH 3.40.50.1820) and is thought to be a tumor suppressor (63).
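The criterion described above, a loss or substantial change of known catalytic residues in all relatives of an unannotated family, can be sketched in a few lines. This is an illustration under stated assumptions, not the protocol used for the CATH analysis. The physicochemical grouping, the `min_changes` threshold, the alignment columns, and the toy sequences are all hypothetical.

```python
# Coarse physicochemical classes; this grouping is an assumption made for illustration.
CLASSES = {
    "acidic": "DE",
    "basic": "KRH",
    "polar": "STNQCY",
    "hydrophobic": "AVLIMFWPG",
}
RESIDUE_CLASS = {aa: name for name, residues in CLASSES.items() for aa in residues}


def substantially_changed(reference_aa, observed_aa):
    """True if the observed residue is a gap or belongs to a different class."""
    if observed_aa == "-":
        return True
    return RESIDUE_CLASS.get(observed_aa) != RESIDUE_CLASS.get(reference_aa)


def is_putative_pseudoenzyme_family(catalytic_sites, aligned_members, min_changes=1):
    """Flag a family whose members ALL differ from the enzyme's catalytic residues.

    catalytic_sites maps an alignment column (0-based) to the catalytic residue
    found there in the annotated enzyme family.
    aligned_members are rows of the same structure-guided alignment.
    """
    def count_changes(row):
        return sum(
            substantially_changed(reference_aa, row[column])
            for column, reference_aa in catalytic_sites.items()
        )

    return bool(aligned_members) and all(
        count_changes(row) >= min_changes for row in aligned_members
    )


# Hypothetical catalytic Asp at column 2 and His at column 5.
SITES = {2: "D", 5: "H"}
CANDIDATE_FAMILY = ["MKALSHG", "MK-LSYG"]  # every member has lost the Asp
MIXED_FAMILY = ["MKALSHG", "MKELSRG"]      # second member keeps an acid and a base

print(is_putative_pseudoenzyme_family(SITES, CANDIDATE_FAMILY))
print(is_putative_pseudoenzyme_family(SITES, MIXED_FAMILY))
```

Only the first toy family is flagged, because the second contains a member with conservative substitutions at both catalytic columns. As the text stresses, a flag of this kind is a prediction that still requires experimental confirmation.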

Fig. 8 Pseudoenzymes in the NTF2 family.

An example comparing the known pseudoenzyme family of NTF2 (blue) and the related enzyme family of scytalone hydratases (orange) in CATH superfamily 3.10.450.50. The established catalytic residues for the scytalone hydratase family (EC 4.2.1.94) are shown in red. The structural alignment between the structures 1stdA00 and 1ounA00 from the enzyme and pseudoenzyme families was generated using CATH-superpose (https://cath-tools.readthedocs.io/en/latest/tools/cath-superpose/). On the right-hand panel, the highly conserved residues in the two families are shown for structurally equivalent positions lying in the active site of the enzyme family. The height of the characters reflects the degree of conservation, and the colors change according to physicochemical characteristics.

CREDIT: BASED ON A. RIBIERO BY A. KITTERMAN/SCIENCE SIGNALING

The future of pseudoenzyme research

The evolutionary conservation and prevalence of genes that encode pseudoenzymes are now abundantly clear in the natural world, from well-studied model organisms to newly sequenced genomes from previously uncharacterized species. However, one major barrier to accurate pseudoenzyme cataloging before experimental triage is the broad diversity of terminologies previously used in their literature description, differing definitions of what constitutes a bona fide pseudoenzyme, and a historical prejudice that proteins lacking catalytic activity are unlikely to serve biologically important functions. Over the past 20 years, various studies have referred to pseudoenzymes as “nonenzymes,” “prozymes,” “dead” or “catalytically defective/inactive,” or, even more confusingly, “atypical” or “noncanonical” enzymes. We therefore propose the broad adoption of “pseudoenzyme” as the essential descriptor that will behave as a critical rate-determining step in allowing the direction of this relatively new field to be strategically plotted and accurately delivered. Formally, we propose that pseudoenzymes be defined as “the predicted catalytically defective counterparts of enzymes owing to an absence of one or more catalytic residues.” It should come as no surprise that, in some cases in which this simple description is applied, residual catalytic activities have been reported among proteins defined as pseudoenzymes (mainly pseudokinases), including the isolated pseudokinase domains for human epidermal growth factor receptor 3 (HER3, also known as EGFR3 or ERBB3), Janus kinase 2 [JAK2; pseudokinase domain called JAK homology-2 (JH2)], CASK, and Tribbles 2 (TRIB2) (7, 50, 64, 65). However, because these very modest catalytic activities are either dispensable for biological function or not conserved in vitro among paralogous proteins across species (such as HER3/EGFR3) (24), this vestigial (or residual) catalytic activity is probably not a defining feature of their biological function.
Technically, establishing and quantifying such activities remain enormously challenging, because ab initio prediction of substrates and cofactors is not facile and even trace copurifying/contaminating proteins in recombinant pseudoenzyme preparations (which may lack auxiliary endogenous factors) can lead to erroneous attributions of catalytic functions.

Other cases discussed in this review pose a more serious definitional challenge: Diverse proteins predicted to be “catalytically dead” pseudokinases, such as SelO, SidJ, and SgK196/POMK, have instead evolved distinct, and quantifiable, catalytic functions, illustrating a potential weakness of ab initio prediction of protein function based on a comparative assessment of sequence. However, these important findings serve to highlight the versatility of protein domains as ancestral folds for the creation of newly evolved activities. Moreover, studying these “odd” folds, and various adaptations at the amino acid level, can potentially lead to paradigm-shifting discoveries, including the three examples listed above. Rather than being defined solely as pseudoenzymes by virtue of the unconventional positions of their catalytic residues, such catalytic proteins can more accurately be considered as enzymes that exhibit atypical or previously unidentified catalytic mechanisms. Historically, this is best exemplified by WNK1, which, guided by structural studies, was revealed to contain a compensatory ATP-positioning lysine in the β2 rather than in the β3 strand found in the vast majority of protein kinases (32). The discovery of catalytically active WNG1 (with-no-Gly-loop), one of several secreted Toxoplasma atypical kinases that together form a clade [termed rhoptry organelle protein 33 (ROP33), ROP34 and ROP35], comes as an additional pleasant surprise (43), especially given that the related pseudokinase Bradyzoite pseudokinase 1, which has no detectable nucleotide binding in vitro (46), was previously shown to be involved in both the development and infectivity of Toxoplasma cysts (66). The discovery of non–Gly-rich loop-containing atypical protein kinases adds an interesting new twist to the WNK1 paradigm, because it has generally been assumed that the complete absence of a Gly-rich loop would preclude ATP binding and catalysis among phosphotransferases (46, 67–69).
However, it is now abundantly clear from these, and other, examples that a detailed knowledge of protein structure aligned with robust biochemical assays continues to be essential for understanding and validating the novel catalytic or pseudoenzyme-based mechanisms that are at play.

A rich diversity of pseudoenzymes in genomes revealed by bioinformatics

Databases such as UniProtKB, and the biocurators that populate new and existing entries with information extracted from the biomedical literature, play a critical role in cataloging experimentally evaluated pseudoenzymes. These characterized proteins, however, represent only a small percentage of the pseudoenzymes believed to be present in even well-studied model genetic organisms. Moreover, their identification is critical in enabling the recognition of pseudoenzymes in all proteomes and for the development of more sophisticated recognition criteria, which should allow these proteins to be distinguished from catalytically active family members. While bioinformatic studies can readily predict a lack of conventional activities based on founder members of a particular enzyme class or mechanism, they cannot exclude the acquisition of novel enzymatic functions or predict the biological nature of the noncatalytic functions a pseudoenzyme might perform, which is the major reason to study them. Nonetheless, the evolution of noncatalytic functions by pseudoenzymes provides a window into formerly unrecognized regulatory functions mediated by conventional enzymes, such as ATP-binding site occupation in the absence of catalysis in protein kinase C (70), and the stabilization of NMYC by catalytically inactive Aurora A (71). While this moonlighting allows enzymes to perform additional functions beyond their first-characterized catalytic function(s), it is likely that gene duplications have allowed pseudoenzymes to evolve specific functions, often within the same pathway as the parental enzyme. Duplication allows a parental enzyme to perform an essential function, because it relieves the selective pressures on the duplicated enzyme that would otherwise constrain active site geometries for catalysis and substrate recognition. Divergence arising from relief of such selective pressure is very well illustrated by structural plasticity among pseudokinase domains. 
For example, the pseudokinase domains of mouse and human mixed lineage kinase domain–like (MLKL) exhibit divergence in the structures of their “pseudoactive” site clefts and the relative contributions of pseudoactive site residues to ATP binding, which is a relatively common feature of many catalytically inactive pseudokinases (47, 72). In addition, the position of one of the core elements in active protein kinases, the αC helix, whose Glu positions the ATP-binding lysine in β3 of the N-lobe for catalysis, is highly variable among both kinase and pseudokinase domains. In some cases, such as the pseudokinase Sgk223, the αC helix is absent in crystal structures (73) and/or has evolved alternative functions, such as serving as a regulated platform for peptide binding (74, 75) among the Tribbles pseudokinases (76), which can themselves be targeted with small molecules originally developed as inhibitors of canonical kinases (52, 75).

We believe that bioinformatics approaches, including those under development discussed here, will permit the identification of many new pseudoenzymes among the large number of enzyme families in annotated proteomes. Initially, this will be by leveraging our knowledge of protein structure and catalytic mechanisms among active enzyme counterparts to predict deficiencies in catalytic residues within their pseudoenzyme cousins. Furthermore, the continued massive expansion of the sequence repositories, particularly with many hundreds of millions of novel enzyme sequences coming from metagenome studies, combined with increasing protein structure data, will enhance the power of those bioinformatic methods that analyze the presence or absence of highly sequence-conserved residue positions in experimentally uncharacterized functional families. Expanding the size and diversity of family membership gives clearer, more accurate sequence patterns. These can then be more easily compared against sequence patterns of known enzyme families in the same superfamily to more accurately predict loss of essential catalytic machinery. In addition, they might also identify novel (and experimentally testable) catalytic machinery, spatially located in the active site pocket that is potentially indicative of a shift or change in signaling function. Such approaches will facilitate broad mining of protein sequences to identify candidate pseudoenzymes but will ultimately rely on a combination of structural and biochemical studies, ideally including the mapping of enzyme ↔ pseudoenzyme evolutionary trajectories, to formally evaluate catalytic deficiency and/or the acquisition of noncanonical catalytic mechanisms.

Improved use of mechanistic data to better understand pseudoenzyme evolution

In this review, we have shown how UniProt data can be used to identify pseudoenzymes associated with a dataset of enzymes of interest. One such dataset is the M-CSA, which contains information about the catalytic residues and the reaction mechanisms of 964 enzymes. By creating annotated phylogenetic trees, we are able to show where in the evolutionary past the loss-of-function events occur and which catalytic residue mutations are associated with those events. In the particular example shown, we observe that not all catalytic mutations lead to loss of function and that applying this rule to all enzyme families, in general, may be too simplistic. Our future major aim is to analyze the phylogenetic trees of all enzymes in M-CSA that have pseudoenzyme relatives. By using annotation specific to M-CSA, we will understand whether the loss-of-function events are related to particular catalytic residue functions and the specific chemistry the enzyme catalyzes. For example, a mutation in a residue acting as an electrostatic stabilizer may be tolerated, whereas a mutation in a nucleophile may not be. Furthermore, this tolerance to mutations may be dependent on the specific reaction and the types of chemical groups in substrate(s), which adds another level of complexity.

Overview of different approaches to identify and characterize pseudoenzymes

In this review, three ways are described to identify and characterize pseudoenzymes across protein families. The first is manual curation by UniProt using the literature or manually checked computational annotation; although this method is labor intensive, it is the most accurate of the three. Efforts are presented to characterize all the kinases and phosphatases in C. elegans, for which about 20 and 3% of these family members, respectively, are pseudoenzymes. The pseudoenzymes identified and annotated in this manner will eventually form the “gold standard” from which other automated methods will develop. The second method uses sequence homology to identify close relatives and existing SwissProt data to identify pseudoenzymes, which are then annotated with the catalytic residues of the original enzyme, to check their conservation. This sequence-based method identifies pseudoenzymes in about one-third of the enzymatic families in M-CSA. The final method uses the CATH structural database as a starting point to identify related proteins and UniProt and GO annotation to categorize them as either enzyme or nonenzyme. Protein structure is more conserved during evolution than sequence; hence, structural methods can see further into the past to uncover more ancient relationships. Therefore, as expected, this method detects more families with pseudoenzymes than the previous one, identifying about two-thirds of a set of enzymatic CATH superfamilies as pseudoenzyme-containing families. The Achilles heel of the second and third methods is their reliance on the lack of catalytic annotation as a useful criterion for defining nonenzymes. This is a general problem in the pseudoenzyme field, which must use negative evidence, or lack of observed experimental catalytic activity, as a benchmark. The problem is compounded when using databases, where lack of annotation does not mean that catalytic activity was evaluated.
More experiments are, therefore, needed to test more broadly for enzymatic activity, and manual curation of the absence of evidence needs to be distinguished from clear experimental proof of a lack of catalysis. Until then, comparative computational approaches (including those described above) are the most powerful method for pseudoenzyme identification that we currently have.

Concluding remarks

The next decade will be an exciting, and potentially transformative, period of rapid development in the pseudoenzyme field, as experimental and bioinformatic findings rapidly merge, creating new databases that bring together exploitable information for specialists and nonspecialists alike. Feeding this information into experimental workflows will rapidly lead to a revolution in our understanding of enzyme and pseudoenzyme evolution and inform fundamental fields ranging from protein folding and enzyme mechanism to cell signaling, metabolism, and drug discovery. Such endeavors will require much broader comparative studies between enzyme families from diverse species, rather than the piecemeal approaches currently favored for studying enzymes and pseudoenzymes in isolation away from their physiological environment. Beyond readily searchable comparative datasets, a major outcome for such studies will also be the creation of benchmarks for studying, and predicting, the effects of evolutionary and disease-associated mutations that take place in enzymes and pseudoenzymes, especially when these changes are conserved in molecular “hot spots.” More data together with more sophisticated methods will facilitate the development of highly accurate tools that can identify pseudoenzymes computationally. The assembly of these new sets of rules (and their allied rule breakers) will then create truly useful, and biologically informative, outputs. This will help to loosen current phenomenologically complex classifications that can constrain, and often blur, the numerous strands of enzyme-based research taking place. In particular, the fruits of these labors are likely to be a complete species-level catalog of pseudoenzymes across hundreds of distinct enzyme families and superfamilies and the prioritization of biochemical, cellular, and guided evolution frameworks to study pseudoenzymes that are of interest across scientific disciplines.

SUPPLEMENTARY MATERIALS

stke.sciencemag.org/cgi/content/full/12/594/eaat9797/DC1

Fig. S1. Estimating the ratio of pseudoenzymes in enzyme families.

REFERENCES AND NOTES

Acknowledgments: We thank all the scientists and support staff who organized, attended, and contributed to the scientific input at “Pseudoenzymes 2016: from signaling mechanisms to disease,” which took place at the Liverpool Maritime Museum, United Kingdom, in September 2016 and “Pseudoenzymes 2018: From molecular mechanisms to cell biology,” which took place in May 2018 in Sardinia, Italy, both of which informed this review. Funding: This work was initially funded by a Royal Society Research Grant (to P.A.E.) and subsequently by North West Cancer Research grants CR1088 and CR1097 (to P.A.E.). A.J.M.R. is supported by an EMBL postdoctoral fellowship. We thank the Biochemical Society and EMBO for awarding dedicated conference and workshop funding to support the development of the pseudoenzyme field. Author contributions: A.J.M.R., S.D., N.D., R.Z., S.O., J.M.T., and C.O. performed bioinformatics analyses and created databases and websites. E.Z., J.M.M., and P.A.E. assembled the information in Table 1. All authors co-wrote the manuscript, and all authors approved the final version before submission. Competing interests: The authors declare that they have no conflicts of interest related to the content of this review.