Research ArticleEVOLUTION

Eukaryotic Protein Domains as Functional Units of Cellular Evolution

See allHide authors and affiliations

Science Signaling  24 Nov 2009:
Vol. 2, Issue 98, pp. ra76
DOI: 10.1126/scisignal.2000546


Modular protein domains are functional units that can be modified through the acquisition of new intrinsic activities or by the formation of novel domain combinations, thereby contributing to the evolution of proteins with new biological properties. Here, we assign proteins to groups with related domain compositions and functional properties, termed “domain clubs,” which we use to compare multiple eukaryotic proteomes. This analysis shows that different domain types can take distinct evolutionary trajectories, which correlate with the conservation, gain, expansion, or decay of particular biological processes. Evolutionary jumps are associated with a domain that coordinately acquires a new intrinsic function and enters new domain clubs, thereby providing the modified domain with access to a new cellular microenvironment. We also coordinately analyzed the covalent and noncovalent interactions of different domain types to assess the molecular compartment occupied by each domain. This reveals that specific subsets of domains demarcate particular cellular processes, such as growth factor signaling, chromatin remodeling, apoptotic and inflammatory responses, or vesicular trafficking. We suggest that domains, and the proteins in which they reside, are selected during evolution through reciprocal interactions with protein domains in their local microenvironment. Based on this scheme, we propose a mechanism by which Tudor domains may have evolved to support different modes of epigenetic regulation and suggest a role for the germline group of mammalian Tudor domains in Piwi-regulated RNA biology.


The sequencing of diverse eukaryotic genomes has indicated that the numbers of protein-coding genes that they contain are remarkably similar, raising the question of how organisms have become more complex during the course of evolution. In this regard, much attention has focused on the role of gene regulatory networks in yielding increasingly sophisticated patterns of combinatorial gene expression (1, 2). A distinct mechanism by which genes can develop new functions is through recombination events that endow their protein products with new domains that confer novel biological properties.

Protein domains are cassettes of ~35 to 250 amino acids that typically retain their structure and biochemical properties when expressed in isolation (3). Most eukaryotic proteins are predicted to have one or more domains, with primary functions either in directing molecular interactions or in catalysis (46). Such domains can be viewed as portable units of biological function that provide a mechanism for the evolution of new cellular activities (711). Following duplication of the relevant coding sequences, domains can potentially gain new functions through mutations that alter their intrinsic properties; alternatively, genetic recombination can generate new domain combinations, most commonly by the insertion of a new domain at the N or C terminus of a preexisting protein (12, 13). The acquisition of a new domain can potentially provide a protein with new molecular connections within the cell, create a new mode of allosteric regulation, or endow a catalytic domain with access to new substrates (14, 15). Comparisons of the proteomes of different organisms have suggested that proteins have evolved increasingly complex functions primarily by the acquisition of preexisting domains, resulting in the formation of new multidomain architectures, whereas the emergence of an entirely new domain is a relatively rare event (16).

Eukaryotic organisms encode a limited number of protein domain families and each family is usually present in tens to hundreds of copies in different proteins (17). This suggests that domains with useful functions, or with folds that can yield diverse biochemical activities, have expanded during the course of evolution. Additionally, although a given domain may potentially be linked to any other domain type, only a rather small fraction of possible domain combinations are represented in naturally occurring proteins (4). In terms of their covalent linkage to one another, domain families follow a power law distribution, such that most domains have only a few partners, whereas a smaller subset of abundant domains are more highly connected (18, 19). In this regard, domains with roles in protein-protein interactions and signal transduction appear particularly versatile in their association with other families of domains (20). Some combinations of multiple domains, in which the domains have a conserved spacing, are represented more frequently than expected from the versatility of the individual component domains, and thereby have likely been duplicated as a unit (8); the Src homology 3 (SH3)–Src homology 2 (SH2)–kinase domain combination found in cytoplasmic tyrosine kinases is an example of such a supradomain (21, 22).

Bioinformatic analysis suggests that domains have been linked through a relatively random shuffling process during the course of evolution, but that a few domain combinations with beneficial properties have then been selected and propagated based on their contribution to the fitness of cells and organisms. However, the mechanisms by which this selection process operates remain to be clearly established. Here, we consider the selective forces that may have shaped the functions of individual domains and the emergence of particular domain combinations, and thus the evolution of new cellular functions used by multicellular animals.


Protein-domain profiling

The functional relationships among proteins, based on the similarities of their domain composition and architecture, have been previously explored using various approaches, including graph theory–based methods and hierarchical clustering (23, 24). To profile the domain compositions of a set of proteins, such as regulators of Rho guanosine triphosphatases (GTPases), protein kinases, or full proteomes, we use a matrix format (Fig. 1A) in which a row of the matrix identifies the domains contained within a specific protein (the protein’s “domain-composition”). We do not consider the ordering of these domains along the polypeptide chain. Conversely, each column of the matrix yields the collection of proteins containing a particular domain (the domain’s “protein-context”). Figure 1A shows a matrix of a set of contrived proteins (p1 to p10), each with a subset of domains (d1 to d10), and Fig. 1B shows their resulting domain-composition and protein-context profiles. In this example, proteins p1 and p9 are directly related because they share a common domain (d5, colored gray) as are p5 and p9, which also share a common domain (d1, colored red); whereas p1 and p5 are indirectly related because both share a domain in common with p9 but do not share any domains in common with each other. The protein-context profiles show that domains d1, d3, and d5 are present in one common protein, p9 (enclosed by a dashed outline in Fig. 1B, bottom panel); whereas d3 and d5 are found together in two proteins, p7 and p9 (enclosed by a dashed outline). The co-occurrence of domains d3 and d5 in multiple proteins would suggest that they are associated with a common biological process (details in Methods).

Fig. 1

Protein domain profiling. (A) Simulated proteins, p1 to p10, were each assigned a combination of individual domains selected from d1 to d10. In the matrix representation, “1” and “0” represent the presence or absence of a domain in a protein, respectively. Domain types are color-coded. (B) The domain composition profile of a given protein is the collection of its constituent domains (top; “beads-on-a-string” and pie chart representations; domain types are color-coded). The protein context profile of a given domain is the collection of proteins containing that domain (bottom). (C) Similarities among proteins and among domains are measured in distances between every pair of proteins, and every pair of domains, respectively, and computed using the Jaccard coefficient. (D) In a two-way hierarchical clustergram (black pixels representing “1” and white pixels representing “0”), proteins and domains are aggregated according to their relatedness, as indicated by the trees.

A domain-profiling matrix can be reordered by hierarchical clustering to display domain-to-protein relationships. To this end, we compute Jaccard similarity coefficients between every pair of proteins, using their domain-composition profiles, and every pair of domains, using their protein-context profiles (Methods). This yields statistical descriptions of the relatedness of any two proteins, based on their domain compositions, and of the relationship between any two domains, based on their co-occurrence among proteins. Average linkage is then used to cluster both rows and columns of the matrix and produce a two-way clustergram of the given protein set (Fig. 1, C and D). This procedure gives a “protein-domain profile” of the gene products under investigation.

Protein-domain profiling of Rho GTPase regulators

Before applying this approach to full proteomes, we investigated its use in analyzing a limited group of signaling proteins. The Rho family of GTPases, which control many aspects of cytoskeletal organization, are activated by guanine nucleotide exchange factors (GEFs) that induce the exchange of GDP for GTP and are inhibited by GTPase-activating proteins (GAPs), which accelerate the rate of GTP hydrolysis (25). The number of predicted RhoGEFs (~80) and RhoGAPs (~70) exceed the number of Rho GTPases (~20) in the human genome. In addition to their conserved catalytic domains, RhoGEFs and RhoGAPs frequently have several interaction domains, which may coordinate the spatial and temporal regulation of Rho GTPase signaling (26). To investigate this point, we enumerated the domains found either in human RhoGEFs, omitting the catalytic GEF (DH) domain and linked pleckstrin homology (PH) domain, or in RhoGAPs, omitting the GAP domain. We did not include DOCK180-related GEFs (27) in this analysis because they represent a different family with a distinct catalytic domain and an associated regulatory subunit. Protein-domain profiling revealed subgroups of GEFs and GAPs that are clustered because they have related noncatalytic domains (Fig. 2, boxes A to G). Tiam1, Tiam2, and ARHGAP21, for example, cluster together because both have PH and PDZ domains (Fig. 2, box A). Co-clustered GEFs and GAPs might be involved in related cellular pathways. For example, Vav and p85 proteins (PIK3R1 and PIK3R2) (Fig. 2, box G) are functionally related because both are involved in phosphotyrosine signaling by virtue of their SH2 domains (28).

Fig. 2

Domain profiling of human RhoGAP and RhoGEF proteins based on their noncatalytic domains. (Left) Two-way hierarchical clustering of human RhoGAP and RhoGEF proteins using only their noncatalytic domains (domain definitions from SMART) and excluding the RhoGAP domain or RhoGEF (also termed DH) and linked PH domains (listed in separate columns at left). Results are shown as color pixels on the matrix. GEF proteins that have at least one domain in common with a GAP, and vice versa, are highlighted in the yellow boxes. (A to G) Selected clusters of RhoGEF and RhoGAP proteins are illustrated in more detail. (H) The prevalence of domain-sharing between GEFs and GAPs is significantly higher than expected in a random simulation, with P values in both cases less than 1 × 10−10.

Some co-clustered GEFs and GAPs have not previously been functionally linked, such as those in clusters B and C, and we have therefore tested whether their activities may intersect (details in Supplementary Section S-I and fig. S1). The proteins in cluster B share a BAR (Bin1, amphiphysin, Rvs167) domain, implicated in the recognition of curved membranes (29), and belong to the Tuba family of GEFs and Rich1 family of GAPs, both regulators of the Cdc42 GTPase. Consistent with a functional relationship, Tuba and Rich1 interact with an overlapping set of proteins that regulate cell junctions and endocytic trafficking (fig. S1B). Proteins in box C are related because they have a DEP (disheveled, Egl-10, pleckstrin) domain in common. Overexpression of the DEP-containing RacGEF P-Rex1 resulted in the relocalization of DEPDC1, which has a RhoGAP domain, from the nucleus to the plasma membrane, indicating a functional linkage between these proteins (fig. S1C and Supplementary Section S-I). Together, these results indicate that particular groups of RhoGEFs and RhoGAPs share specific subsets of interaction domains, through which they may be targeted to similar subcellular sites to coordinately regulate local Rho family GTPase activity. The frequency of domain co-occurrence in RhoGEFs and RhoGAPs greatly exceeds that expected from a random distribution (Fig. 2H). The domain compositions of RhoGEFs and RhoGAPs, therefore, appear to have been under similar selection pressures, potentially because they interact with a common set of cellular targets, the Rho GTPases.

In a similar fashion, tyrosine phosphorylation is mediated by specific kinases and reversed by phosphatases. Co-clustering of these enzymes, based on their noncatalytic domains, yielded groups of tyrosine kinases and tyrosine phosphatases that have similar targeting or regulatory modules, such as SH2, FERM (4.1 protein, ezrin, radixin, and moesin; also termed B41), FN3 (fibronectin type III), and immunoglobulin (Ig) domains (fig. S2). A requirement for recognition of the same substrates may have promoted selection of related interaction domains linked to tyrosine kinases and phosphatases.

As a larger test case, we performed protein-domain profiling on the full set of human protein kinases after excluding their kinase domains (Supplementary Section S-II). This unbiased hierarchical clustering of protein kinases, based on their noncatalytic domains, produced groupings that resemble families of protein kinases derived from the sequence and function of their kinase domains (fig. S3) (30). This supports a model in which the catalytic and interaction domains of protein kinases have co-evolved because of a requirement to ensure appropriate substrate recognition and kinase regulation. We also compared the kinomes from Saccharomyces cerevisiae (yeast), Dictyostelium discoideum (slime mold), Caenorhabditis elegans (nematode worm), Drosophila melanogaster (fly), Mus musculus (mouse), and Homo sapiens (human), which shows marked expansions and contractions of domain-based kinase clusters between distinct organisms (fig. S4).

Domain combinations in eukaryotic evolution

Because these analyses indicated that protein-domain profiling can identify functionally relevant clusters, we extended this technique to encompass full proteomes and cross-proteome comparisons. Because the complexity of the resulting clustergrams makes them difficult to interpret, we developed a comparative approach based on groupings of linked domains that we call “domain clubs.”

To this end, we selected seven well-annotated eukaryotic species—S. cerevisiae (yeast), D. discoideum (slime mold), C. elegans (nematode worm), D. melanogaster (fly), Danio rerio (fish), Gallus gallus (chicken), and H. sapiens (human)—and amalgamated the proteins encoded by their genomes. We used these genomes as representatives of eukaryotic evolution to obtain insight into the organization of protein domains and their functions in cellular communication and multicellularity. We performed protein-domain profiling on this combined protein set, using 623 domains that correspond to all SMART domains found in these organisms (details in Methods). A protein tree was extracted from the resulting two-way clustergram, which we truncated at a cutoff point corresponding to a natural jump in the number of domain clubs (fig. S5). Moving the cutoff point did not alter the regions of the tree in which particular domains were found, but refined the number of domain types and combinations in a club (fig. S5). Each terminal branch of the resulting tree contains a group of proteins that are related because they have a similar domain composition. The set of all domains in such a group comprises a club. This approach yielded 1245 domain clubs, each of which was indexed by a number according to its order in the truncated protein tree.

The domain clubs split roughly into two groups (Fig. 3): The larger set of clubs (#1 to ~1080) contains proteins with multiple distinct domains, and the proteins show rich interrelationships, as indicated by the hierarchical structure in this region of the tree. In contrast, the remaining clubs branch directly from the root and typically contain only a single domain that is never linked to other domain types in the proteins that we have analyzed. This analysis was based on SMART definitions of protein domains, and the results might conceivably be different if alternative domain databases were used (see Methods for details). Domains within these clubs are considered as “social” or “isolated,” respectively. Sampling of the tree at two well-separated locations (clubs #53 to 56 and clubs #420 to 423) shows that social clubs with related subsets of domains cluster in the same neighborhood of the tree, whereas clubs with different groups of domains are well separated. For example, clubs #53 to 56 all have Bromo domains in conjunction with other domains implicated in chromatin binding and transcriptional regulation, such as PHD (plant homeodomain) and PWWP (Pro-Trp-Trp-Pro motif domain), whereas clubs #420 to 423 contain domains associated with growth factor receptor signaling (such as SH2, SH3, and tyrosine kinase). This framework identifies the linkage of a particular domain with other domain types, both within a proteome and across species. To validate the profile results, we applied the same approach to an extended nine-proteome data set. The consistency in the results obtained from the seven-proteome and nine-proteome data sets was statistically significant (fig. S6), providing support for the robustness of this approach and the reliability of the inferred clustergram.

Fig. 3

Domain clubs across seven eukaryotic proteomes. Domain-based clustering of the proteins encoded by seven Ensembl genomes yields a hierarchical structure (truncated at the cutoff shown in fig. S5). The set of proteins belonging to a terminal branch have a group of domains, termed a domain club. The interconnectivity among clubs distinguishes social domain clubs from isolated domain clubs. The colored boxes give examples of clubs from two distinct locations of the tree (indicated with arrows). Domain compositions and architectures are indicated for each club.

We then determined the distribution of these 623 domains across the clubs present in each of the seven species. More precisely, for each domain (X), we computed the number of X-containing proteins associated with each club and plotted this number (y axis) across all clubs for each proteome. The results for six domains (actin, death, Gal4, SH3, SH2, and FN3), illustrating different aspects of protein and domain organization and their evolution, are shown in Fig. 4A. This comparison shows how the number of clubs that contain a particular domain, and the number of proteins within each club, vary between species, as well as the location of these clubs within neighborhoods of the tree.

Fig. 4

Conservation, expansion, and elimination of domain clubs. (A) Domain clubs containing Actin, Death, GAL4, SH3, SH2, and FN3 domains across seven proteomes. The hierarchical structure of all domain clubs (1 to 1245) is indicated on the top of each panel. The total number of proteins in each domain club (y axis) and the position of each club on the tree (x axis) are indicated for each of the seven proteomes. (B) Domain clubs containing all human SH2 domain proteins. Domains in brackets are those shared by some, but not all, club proteins. The arched arrow indicates a possible transition of the yeast SH2 club (#146, in which SH2 is linked to S1 and YqgFc domains for transcriptional regulation) to the cluster of SH2 clubs with signaling functions. Two clubs (#924 and #968) marked with open circles comprise SH2 domain combinations unique to the Dictyostelium genome and are shown for comparison. (C) With ClustalW2, each human SH2 domain sequence was compared to that of the Spt6 (also known as YGR116W) protein of S. cerevisiae. The resulting plots of sequence divergence (y axis, with 1 being the greatest) form a pattern of separation (indicated by two circles). A lone pairing of SH2 domains from human Supt6H and yeast Spt6 (arrowed, with divergence score at 0.63) is well segregated from pairings of the remaining human SH2 domains against the yeast SH2 domain (top cluster of dots, divergence score at 0.78 to 0.94).

The actin domain appears in only one isolated club that is conserved across the seven proteomes, because of its exclusive presence in actin and related proteins. Most domains of such isolated clubs are strongly conserved in sequence, suggesting that the corresponding proteins are likely to mediate core functions in eukaryotic cells.

By contrast, the domains in social clubs are linked to other domain types, as exemplified by members of the death domain (DD) superfamily, which includes DDs, death effector domains (DED), CARD domains, and pyrin domains. Domains within these subfamilies undergo heterotypic interactions to assemble signaling complexes involved in apoptosis, innate immunity, and inflammation (31), and the clubs to which they belong differ from the actin domain club in several ways. For example, they are present in the five metazoan species that we analyzed, but are absent from yeast and slime mold. In addition, the number of clubs with DDs expands as multicellular animals become more complex, increasing from 4 clubs in worm to 10 in human. Among the recently emerging DD clubs, club #163 contains the RAIDD protein, which is composed of a CARD domain followed by a DD, and club #956 contains the PIDD precursor protein that has a leucine-rich repeat (LRR) domain, a ZU-5 domain, and a DD. These two proteins form a complex with caspase-2 called the PIDDosome. Complex formation is mediated by oligomeric DD and CARD domain interactions, and the PIDDosome is implicated in the cellular response to genotoxic stress and DNA damage (32). In addition, club #488, containing the DD protein nuclear factor κB p100 subunit (also termed NFκB2 or Rel), is absent from worm, fly, and yeast, but present in fish, chicken, and human. These observations suggest that DDs and their attendant clubs have formed and expanded coordinately with the evolution of multicellular animals to mediate increasingly complex responses to apoptotic and inflammatory signals.

In contrast, the numbers of some domains and their corresponding clubs are diminished, or entirely lost, in the proteomes of more complex species. For example, the GAL4 domain, which is the DNA-binding module of the yeast transcriptional activator protein GAL4 (33), is present in yeast and Dictyostelium but is absent from metazoans, indicating that this function has decayed or been usurped by another domain during eukaryotic evolution.

SH2 and SH3 domains are prototypic interaction domains that direct intracellular signaling pathways important for cellular communication in metazoans (34, 35). SH3 domains, which often bind proline-rich peptide motifs, are found in 10 domain clubs in yeast, of which 6 are within a close distance on the tree. The corresponding proteins are commonly involved in cytoskeletal regulation and vesicular trafficking (36). Multicellular animals have three- to fourfold more clubs with SH3 domains as compared with yeast, and the number of proteins within these clubs is also increased. The combination of SH3 domains with new domain partners in metazoans suggests the emergence of proteins that make novel interactions, and thereby have new biological properties. Most of the more recently evolved SH3 domain clubs remain in the same neighborhood of the tree as those present in yeast. In metazoans, this region of the tree also has the majority of clubs with SH2 domains, indicating that these domains have converged on related signaling processes. In support of this argument, clubs with SH2 and SH3 domains share several other domains involved in signaling, such as PTB (phosphotyrosine binding), C1, PH, FCH (FER/CIP4-homology), RhoGAP, RhoGEF, and protein kinase domains (table S1).

Despite these similarities, SH2 and SH3 domains have distinct evolutionary patterns. The yeast S. cerevisiae has a single protein, Spt6, with an SH2-like sequence that reportedly lacks phosphotyrosine-binding activity and is linked to S1 and YqgFc domains, both commonly found in ribosomal and RNA-associated proteins. Indeed, the Drosophila Spt6 protein interacts with RNA polymerase II to regulate messenger RNA (mRNA) processing (37, 38). Spt6 is represented by a domain club (#146) that is conserved among the seven proteomes and does not expand beyond the single Spt6 protein. In the social amoeba, D. discoideum, the number of clubs with SH2 domains has grown to six, and the number of SH2 domain–containing proteins to 14. Compared with yeast, Dictyostelium SH2 domains have jumped into distinct areas of the domain club tree as a result of their association with a new set of domains involved in signal transduction, including dual-specificity protein kinase, EF hand and Ring domains, LRR repeats, F-box and ankyrin repeats, and STAT-related domains (Fig. 4A). This expansion correlates with the emergence of SH2 domains that bind phosphotyrosine sites, such as those found in Dictyostelium STAT proteins (39). This acquisition of phosphotyrosine-binding activity likely generated a new signaling circuitry and allowed the selection of new SH2 domain clubs that direct phosphotyrosine signals to intracellular targets. Of interest, the sequence of the SH2 domain in the human Supt6H protein is highly related to yeast Spt6, whereas the other human SH2 domains are more weakly related to yeast Spt6 (Fig. 4C). One interpretation of this observation is that a domain, once freed from the constraints of its ancestral function, may explore new areas of sequence and function space.

Dictyostelium likely represents an early stage in the development of tyrosine kinase pathways, because it only has a small number of SH2-containing proteins and domain clubs and has no conventional tyrosine kinases (40). Two of the six SH2 domain clubs present in Dictyostelium have not been found in metazoans. These contain proteins in which an SH2 domain is linked to LRRs or to an F-box and ankyrin repeats (Fig. 4B), which likely subserve functions uniquely important to the Dictyostelium lineage. The other four Dictyostelium SH2 clubs, present in STAT, Cbl-like, kinase, and Spt6 proteins, correspond more closely to architectures found in metazoan animals. Phosphotyrosine-binding SH2 domains have therefore been exploited in varied ways by different phyla, as indicated by the selection of overlapping but distinct sets of SH2 domain clubs in organisms such as Dictyostelium, the protozoan Monsiga brevicollis, and metazoans (41, 42).

Clubs with the FN3 domain also show varying domain combinations between different species, suggestive of evolutionary jumps accompanied by functional divergence (Fig. 4A). S. cerevisiae has two genes that encode FN3-containing proteins, Chs5 (chitin synthase 5) (also known as YLR330W) and YEL043W. The YEL043W product is a single domain protein of unknown function that is assigned to club #902. This single FN3 domain architecture resembles that of the mammalian interleukin (IL) receptors, which are also grouped in club #902. However, the yeast protein lacks an obvious transmembrane region and, therefore, is unlikely to be a true member of the IL receptor family. Chs5p is required for the traffic of Chs3p from the Golgi to the plasma membrane during chitin biosynthesis (43) and has an FN3 domain immediately followed by a BRCT domain (club #493). This is the only example of such domain combination among the seven proteomes. In higher eukaryotes, chitin biosynthesis and consequent cell wall formation are not required, likely explaining the disappearance of the FN3 and BRCT domain combination. This exemplifies a domain combination that serves the idiosyncratic needs of a particular organism.

In multicellular organisms, most FN3 domains are found in the extracellular regions of transmembrane proteins and matrix proteins, including receptor tyrosine kinases and cytokine receptors, often as repeating units of FN3 domains in combination with domains of the Ig superfamily, both of which share a similar fold (44). They can, in some cases, interact heterotypically with each other, as shown for the FGFR1 (fibroblast growth factor receptor 1) and NCAM (neural cell adhesion molecule) (45), and, in other cases, mediate binding of the extracellular matrix to integrins (44). In metazoans, FN3 domain clubs are centered on club #902, which is situated in a neighborhood of the tree crowded with transmembrane receptors. In addition, there is a broad expansion in the number of FN3 domain clubs throughout the tree. These data suggest that FN3 domains may have evolved from a limited role in organisms such as yeast to assume prominent functions in the extracellular regions of transmembrane proteins in metazoans and were combined with other modules such as Ig domains to recognize extracellular ligands in a fashion required for multicellularity.

Together, this analysis suggests that the expansion or contraction of particular domain families, and the varying combination of domain types, correlates with the gain or loss of cellular functions, such as those important for intercellular communication (gain) or construction of a cell wall (loss). The pattern of domain clubs occupied by each domain type in the proteomes of the seven selected species can be viewed at (Supplementary Section S-III and fig. S6 showing clubs that contain domains found in human Dicer1 protein as an example).

Comparative analysis of the molecular environments of interaction domains

A possible explanation for the limited number of eukaryotic protein-domain clubs is that particular combinations of domains are used in specific aspects of cellular function, for example, growth factor signaling or chromatin organization (Fig. 3). The selection of such domain linkages during evolution could, therefore, mark proteins involved in a defined cellular process and provide them with a common currency of molecular recognition, while at the same time insulating them from undesirable crosstalk with functionally distant subsystems that use other domain types. Conversely, domain combinations that cross these boundaries might usually be selected against.

To test this hypothesis, we developed a two-step approach to analyze the relationships of human proteins (Figs. 5 and 6). First, we used protein-domain profiling to cluster 16,282 human proteins with identifiable domains and thereby obtain a matrix in which these proteins are positioned relative to one another based on their domain composition (Fig. 5). Data in the preceding sections have suggested that the distance between different proteins on such a matrix correlates with their functional similarity, in the sense that proteins that cluster together are more likely to participate in a common functional process. Therefore, a local neighborhood of the human protein-domain matrix can be viewed as representing a molecular compartment that supports a particular area of cellular function.

Fig. 5

Correlative analysis of the molecular environments of 70 domain families. (A) To compare the protein-domain environments of selected domain families, we plot the interactions of proteins containing a domain of interest onto a template map of a protein-domain clustergram of the human proteome (schematized in the left panel). A small area of the map is depicted in greater detail (right panel). Proteins are first clustered based on the similarities of their domains (shown as the white and gray pixels). Proteins on the matrix that interact with proteins containing a particular domain are then highlighted (bright pixels). In the example, RasGAP, PLC-γ-1, and PLC-γ-2 reportedly form complexes with the Shc1 adaptor molecule (HPRD database), which has a PTB domain and an SH2 domain. These three proteins, and their respective domains, are then considered to form part of the molecular environment for both the PTB and the SH2 domains of Shc1 (see fig. S8A for further detail). In a full analysis for a domain family, such as SH2, all of the proteins on the matrix that interact with proteins containing SH2 domains would be highlighted. Before analyzing the resulting maps, we use spreading on graph (SOG) clustering, in which the effect of each pixel in the scatter pattern is spread to neighboring areas in the clustergram (bottom right) to reflect the relatedness of the pixels, yielding a smoothed distribution of density, suitable for vector scanning (details in fig. S8B and Methods). (B) The smoothed distributions for the actual molecular environments of several domain families (SH2, SH3, Ring, and DED) are visualized on the clustergram as heat maps.

Fig. 6

Clustering protein domains into related molecular environments. (A) After vectorization of the smoothed density distribution of the protein-domain environment recognized by each domain family, the Euclidean distance between every two such vectors is computed and then used in a subsequent clustering procedure to obtain a hierarchical organization of the domain families that shows the relatedness of the molecular environments for each domain. Domain families in gray text (ANK, RING, WD40) are highly connected. (B) Three branches of domain families, indicated by the green, blue, and red shading in (A), were further examined for their linked domains. For every domain type (pfamA and pfamB definitions) in the human proteome, we recorded the total number of proteins in which they are linked to any domain from the branch (# in branch), and the total number of proteins containing the domain in the whole proteome (# in proteome). These domains, excluding the 70 selected seeding domains [from the tree in (A)], are listed in a box colored according to the relevant branch (listed in order of P values indicating the level of nonrandomness at which the domain falls into the branch). N.B.: The Ded_cyto domains in the green box are the Dedicator of cytokinesis domains found in the DOCK proteins; and Pkinase domains, including the Pkinase_Tyr domains, listed in the blue box, are the kinase domains of the IRAK family proteins.

Second, we mapped known protein-protein interactions onto this human protein-domain matrix, using information from the Human Protein Reference Database (HPRD; To this end, we selected 70 families of domains that mediate molecular interactions ( (fig. S9, bottom graph), and then we identified all of the polypeptides arrayed on the matrix with which proteins having a given domain reportedly interact (illustrated in Fig. 5A and fig. S8A). The total numbers of such interactions for individual domain families are shown in fig. S9 (top graph), albeit these are likely biased by existing experimental analysis and the accuracy of curation. Every such interaction is then scored on the matrix by highlighting the row that represents the relevant interacting protein and its constituent domains. The resulting data are then smoothened to yield a pattern that can be quantified by scanning each image row-by-row (using a procedure termed vector scanning) (Fig. 5A, bottom-right; fig. S8B; and Methods). In this approach, the clustering of proteins on the matrix provides a way of visualizing the domain-based environment within which interacting proteins carrying a particular domain may operate.

We mapped the interactions of proteins containing each of the 70 domain families onto the matrix obtained by protein-domain profiling. Examples of the distributions yielded by proteins with SH2, SH3, RING, or DED domains are given in Fig. 5B. Superficially, the interaction profile of proteins with SH2 domains appears related to that of SH3 domain proteins, whereas RING domains and DEDs domains give different signatures. To directly compare the profiles for the 70 different domain families (Fig. 6A), we first quantified the individual density plots by vector scanning and then computed the pairwise similarities between any two domain interaction sets. The resulting pairwise Euclidean distance between any two density-grams, based on the same proteome-wide clustering matrix, was then represented by an un-rooted tree (fig. S8B and Fig. 6A; see details in Methods). This yields a hierarchical organization that relates the various domains based on the similarities of the protein-protein interactions in which they are involved and the grouping of their interaction partners on the human protein-domain matrix (Fig. 6A).

The resulting tree shows that domains involved in growth factor–related signaling from the plasma membrane [that is, SH2, SH3, PH, PTB, B41 (also known as FERM), C2, and C1] are clustered, as are domains involved in apoptosis and inflammation (DD, DED, CARD, BIR, and TRAF), or chromatin organization (Bromo, Chromo, BRCT) (depicted in green, blue, or red, respectively, in Fig. 6A), but that these three clusters are distant from one another on the tree. Interestingly, one branch of the tree has a number of domains involved in protein and vesicle trafficking, many of which have phospholipid- or ubiquitin-binding activities.

To pursue the concept that particular domain groupings are characteristic of specific functional processes, we investigated how domains not included in the original 70 families map onto three of the major branches (growth factor signaling, apoptosis, and chromatin remodeling). For each branch of the domain-interaction tree, we obtained a list of all domains that are linked in the same protein with at least one known domain in the branch; in this case, domains were predicted by running a hidden Markov model (HMM) based on Pfam domain definitions. We note that Pfam provides more comprehensive coverage of peptide sequence notations and is less subject to literature or case study bias by running HMMs (Methods) and is, therefore, suited for such screening purposes. For each domain in the list, we computed the frequency with which it occurs in conjunction with the other domains in the branch and its frequency in the proteome. We thereby ranked these additional domains according to the likelihood that they belong to each of the functional branches (ranked by P values in Fig. 6B). This analysis identified a number of domains that are known to have selective roles in the relevant cellular processes. For example, the Peptidase_C14 family domain, which mediates the protease activity of caspases, ranks at the top of apoptosis group (P value, 6.71 × 10−17; blue box). This approach also identifies a number of domains for which little information is available, but which map to one of the three relevant branches. These include several elements annotated by Pfam as “domains of unknown function” (DUFs, asterisks in Fig. 6B), which we infer to have functions in the indicated areas of cellular activity. These findings warrant further investigation into the biochemical functions of these DUF domains.

These results are consistent with the notion that distinct subsystems within the cell use different sets of interaction domains, which may act both in a cooperative fashion to mediate specific processes, such as growth factor signaling at the plasma membrane, and also in an exclusionary mode to separate functionally discrete processes. One caveat to these conclusions is that domains with a large number of combinatorial partners, and thus diverse cellular functions, could be artificially forced into a particular branch of the tree because of a well-known limitation of hierarchical representations (details in Methods). Modules, such as ANK, ARM, WD40, and TPR repeat domains, as well as RING and EF hand domains, are highly connected, and therefore likely have multiple potential functions and would not be restricted to any particular functional branch.


Interactions of protein domains as a potential evolutionary force

Gene duplication provides a mechanism to generate a reservoir of protein domains, which may then experience a number of different fates. For example, such duplicated domains may be highly conserved in sequence and function or may be inactivated because of degenerative mutations and genetic drift (2, 46, 47). Alternatively, genetic alterations that change a domain’s intrinsic biochemical properties or that link it to new domain types, or both, may confer an improved biological fitness and, therefore, be fixed by positive selection. The consequences of such variations in domain function likely depend on the reciprocal interactions of the altered domain with other proteins and their corresponding domain clubs within the cell (Fig. 7A). Therefore, the cluster of domains that constitute a particular subsystem (for example, Fig. 6A) may constrain the selection of evolving domains and domain combinations to those compatible with this niche. Conversely, the rare movement of a domain into a new region of the domain club tree (Fig. 4A) may restructure the domain’s new environment. Based on these arguments, we consider different evolutionary paths potentially followed by distinct domains (schematic illustration in Fig. 7B).

Fig. 7

A conceptual model for domain evolution. (A) Proposed evolutionary interrelationship between protein domains and their molecular environment. Individual domains (step 1) acquire specific intrinsic functions, and are linked together in various combinations to form protein-domain clubs (step 2). These clubs mediate selective interactions with specific proteins and other biomolecules, yielding particular functional subsystems (step 3) within the cell. Each such molecular niche, in turn, preferentially interacts with particular subsets of protein domains. The selection of a new domain once it enters a given molecular environment through the formation of a new domain combination will depend on the functional compatibility of the domain with its new environment. (B) Possible paths for domain-based protein evolution. In each case, the evolving domain under consideration is illustrated as an open box, and its linked partner domains are indicated by solid boxes. (Left) The interaction of an isolated domain with its environment (gray background) is conserved due to functional constraints. (Middle) In an existing, functional protein, a social domain is linked to a partner domain (blue) and the molecular environment of that domain (light blue background). Linkage of the evolving domain to a new domain (dark blue) with similar properties to its original partner may lead to modified functions within the constraints of that particular environment. (Right) The evolving domain becomes linked to a partner domain (orange) that resides in a different molecular compartment (orange background); it is thereby exposed to a different biochemical milieu. Typically, the evolving domain is eliminated (through nonfunctionalization), because it is incompatible with this environment. Rarely, the evolving domain acquires new intrinsic properties that allow it to adapt to its new environment and to contribute a new function in the environment. The domain then expands within this environment, as in the middle panel. (C) A domain drives the evolution of its molecular environment. A particular molecular environment is organized by its resident domains (orange). When these are linked to a new domain (blue), arriving from a different subnetwork, the properties of the environment may be expanded through new connections provided by the immigrant domain. This hybrid environment may then be populated by domains and domain combinations with emergent functions (blue background shading).

In the first scenario (conservation), isolated domains, such as actin or 14-3-3, may expand in numbers, but retain conserved sequences and functions, and are excluded from multidomain proteins. A plausible explanation is that they mediate central cellular functions and, therefore, cannot tolerate any linkage to other domain types, because this could prove deleterious to their core activity.

In a second scenario (negative selection), alterations in domain function could result in negative selection. Mutations that inactivate or modify the intrinsic function of a domain will usually be detrimental, as seen in numerous disease states and, therefore, will not be sustained. Similarly, fusion with a new domain type will most commonly result in aberrant connections that are inimical to cellular function [discussed in (48)].

In a third scenario (modified functions), a domain may expand its functions within a particular subsystem by mutations that modify its intrinsic properties without altering its core activity or through mutations that yield new linkages to domains that already contribute to the relevant cellular process. The cross-species analysis of domain clubs (Fig. 4A) provides examples of such incremental increases in domain function, such as SH3 domains.

In a fourth scenario (neofunctionalization), the acquisition of an entirely new intrinsic function by an existing domain type, for example, phosphotyrosine-binding by an SH2 domain, may signal an evolutionary advance. Such neofunctionalization may be assisted by genetic recombination that places the modified domain in combination with a different domain type that occupies a distinct subcellular environment. On the one hand, this process would relieve the domain of a requirement to maintain its original function and, on the other hand, could allow it to evolve novel activities that favorably expand the properties of the new subcellular environment to which it has been recruited. This altered domain function would then be selected for, if it provided an advantage in the context of its new molecular niche.

For example, as noted, phosphotyrosine-binding SH2 domains are found in Dictyostelium, which lacks a conventional tyrosine-specific protein kinase. The observation that 5 of the 14 SH2 domain proteins in Dictyostelium are linked to TKL (tyrosine kinase-like) dual-specificity protein kinase domains suggests that the interplay between these two domain types may have favored the emergence of fully specific tyrosine kinases and the expansion of phosphotyrosine signaling seen in metazoans and their close relatives.

Once a domain has been localized to a new cellular environment, it may form the basis for additional domain combinations that, in turn, recruit new molecular components into the evolving subsystem (illustrated in Fig. 7, B and C). The marked expansion of SH2 domain clubs involved in phosphotyrosine signaling in metazoans (Fig. 4, A and B) and choanoflagellates (42) illustrates this phenomenon. After the evolution of SH2 domains as phosphotyrosine binders and of bona fide tyrosine kinases that efficiently generate phosphotyrosine signals, the linking of SH2 domains to new catalytic and interaction modules likely provided a simple mechanism to couple phosphotyrosine sites to new downstream pathways.

In summary, this hypothetical model suggests a mutual interplay between the intrinsic properties of individual domains, their linkage to other domain types, and the domain-based organization of the molecular subnetworks within which they function.

The trajectory of Tudor domains

To pursue the validity of this model of protein and cellular evolution, we searched for additional instances of a domain that may have acquired a novel function coordinately with its recruitment into a new molecular environment as a result of domain shuffling and mutation. To this end, we investigated the expression patterns of genes that encode particular domains to obtain clues regarding specialized domain functions. This approach revealed that 9 of 27 mouse genes encoding proteins with Tudor domains are selectively expressed in spermatocytes, as compared with other genes for Tudor domain proteins, which are broadly expressed (Fig. 8A).

Fig. 8

Proposed evolutionary trajectory of Tudor domains. (A). Expression of genes encoding Tudor family proteins profiled against previously characterized tissue-specific genes for skeletal muscle (SK), cardiac muscle (CA), adipocyte (AD), endothelium (EN), smooth muscle (SM), epithelium (EP), neuron (NE), B cell (BC), T cell (TC), liver (LI), pancreas (PA), kidney (KI), stem cell (ST) and spermatocyte (SP) (data retrieved from the human microarray repository of GEO). Genes selectively expressed in spermatocytes (SP) are highlighted in brown (asterisks mark genes encoding multi-Tudor domain repeat proteins—domain architectures to the right). The SND1 gene (black arrow) does not exhibit such selective expression. (B) Classes of Tudor domain types (JMJD, Smn, SND1, germline Tudor) are indicated, together with their localization and known or predicted functions. Black arrows indicate a potential trajectory of Tudor domains from Smn to SND1 to germline Tudor. The structural organization of Tudor domain classes is compared within the colored boxes, with the Tudor domain core marked in gray, and the N-terminal α helix and β strand extensions (known or predicted) in red and blue, respectively.

Tudor domains belong to the “Royal Family” of domains, which also includes Chromo, MBT, PWWP, and Agenet-like domains (fig. S10). They contain ~60 amino acids that form a β-barrel–like core structure, which, in some cases, binds selectively to methylated lysine or methylated arginine sites. They are implicated in diverse epigenetic functions from chromatin remodeling and histone-binding to RNA processing (49). On the basis of their distinct functions and sequences, we distinguish four groups of Tudor domains (Fig. 8B): One group that binds modified histone tails is represented by the JMJD family of histone demethylase (blue box) (50); a second group is found in mRNA splicing factors, such as the Smn proteins (pink box) (51, 52); a third group is represented by the Tudor domain in SND1, a component of the RNA-induced silencing complex (RISC) (yellow box) (53); and the last is the germline group, the functions of which are unclear (brown box) (54).

In considering the relationships between Tudor domains, we noted that the various groups can differ in the nature of the sequences flanking the core domain (Figs. 8B and 9A). The acquisition of such extensions may have provided Tudor domains with new intrinsic properties, which when coupled to their linkage to novel domain types may have generated new biological functions. We summarize the domain architectures of Tudor-containing proteins in fig. S10.

Fig. 9

Proposed stepwise variation in Tudor domain function through evolution of the Tudor domain. (A) Proposed stepwise variation in Tudor domain function. An Smn-type Tudor domain, with a predicted N-terminal α helix, is inserted into an SN nuclease domain (SN5 in SND1) downstream of its N-terminal double β strands (blue). Germline Tudor domain proteins can have multiple Tudor repeats, some of which are predicted to have both an α helix and β strands at the N terminus; these could have derived from an SND1-type protein or evolved independently. Arrows indicate a potential evolutionary trajectory of these Tudor domain classes. (B) Multiple alignments of germline Tudor domains and closely related domain sequences (human), showing 70 amino acids N-terminal to the Tudor core; β strands and α helices (marked in blue and red, respectively) were predicted with the Proteus Structure Prediction Server ( (56) or are based on known structures (57). The SND1 sequence is shown in red type. Corresponding sequences of staphylococcal nuclease (SNase) of Staphylococcal aureus and human Smn1 are shown at top. Sequences in gray are Tudor domains that are not predicted to share these secondary structural features.

The known or predicted structural features to the N terminus of various human Tudor domain members are depicted in Fig. 9, A and B. This reveals a seemingly stepwise accumulation of structural features (Fig. 9A). The JMJD family proteins have two conventional core Tudor domains in tandem, which form an integrated structure (50); a similar double Tudor domain configuration is present in 53BP1 (55). By contrast, predictions of secondary structure (56) suggest that the Smn Tudor domain found in the Smn1 protein involved in small nuclear ribonucleoprotein biogenesis for mRNA splicing has an α helix N-terminal to the core domain. Experimental structural analysis of a distinct Tudor domain protein, SND1, has revealed an α helical arm preceding a Tudor core domain. SND1 is conserved from fungi to human and has five tandem staphylococcal nuclease (SN)–like domains. The SND1 Tudor domain is inserted into the fifth SN domain (SN5), after the second β strand of SN5 (Fig. 9A), thereby creating a composite SN-Tudor structure (57). This marks a transition in Tudor domain function from regulating mRNA splicing machinery in the context of the Smn proteins to an involvement in processing small noncoding RNAs, including siRNAs (small interfering RNAs) and microRNAs, in SND1 (Fig. 8B). The modified SND1 Tudor domain is directly linked to intact SN domains 1 to 4, forming a conserved SND1 domain architecture present in virtually all modern-day organisms in which RNA interference functions can be detected. These observations suggest a model in which the SND1 Tudor domain, through insertion into a protein with tandem SN domains, may have acquired a new intrinsic function, and concomitantly have been recruited into a new molecular environment, yielding a new cellular activity involved in small RNA processing, as outlined in Fig. 7.

In SND1, the Tudor domain with its α helical extension is preceded to the N-terminal side by two β strands from SN5. Secondary-structure predictions suggest that all nine germline Tudor proteins also have at least one Tudor domain with a pair of predicted β strands N-terminal to the α helical bundle (Fig. 9B). One possibility is that this class of Tudor domains may have originated from a precursor domain of the SND1 class that escaped from its environment of SN domains, carrying with it the double β strand sequence; alternatively, the SND1 and germline Tudor classes may have converged on the same N-terminal structural motif. One feature that distinguishes the germline class of Tudor domain proteins from Smn and SND1 is that many of their members form Tudor domain repeat proteins (Fig. 8, A and B). In mice, these proteins have been previously localized to the germinal body–like structures of the male germ cells, such as the chromatoid bodies and nuage (Fig. 8B, right panel) (54). It is interesting that a third class of noncoding small RNAs, known as piwi-interacting RNAs (piRNAs), is predominately expressed in male germ cells (58). Their biogenesis involves the Piwi clade of Argonaute proteins, which are also found in these germinal bodies (59, 60) in the male germ cells (Fig. 10A).

Fig. 10

Arginine methylation of Piwi proteins and specific interaction between Piwil1/Miwi and Tdrd1. (A) Testis-specific expression of the mouse Piwi clade of Argonaute proteins (marked brown), including Mili, Miwi, and Miwi2, as compared to distantly related Ago clade of proteins (Ago1 to 4), which are more widely expressed. To the right are the domain architectures of the relevant proteins. (B) Comparison of the N-terminal (broken box in A) sequences of Argonaute family proteins (full alignment using ClustalW programming in fig. S11). The Piwi clade proteins have multiple arginine-glycine motifs (RG boxes, highlighted) in their N termini, which are absent from the Ago proteins. Arginine residues shown in red boxes were found to be mono- or dimethylated or both by mass spectrometry analysis of full-length proteins isolated from testis lysate (for Miwi, immunoprecipitated with antibodies against Miwi) or from transfected HEK293T cells (for Flag-tagged Miwi, Mili, and Miwi2, immunoprecipitated with an antibody against Flag). (C) Germline Tudor protein Tdrd1 selectively coprecipitates with Miwi. HEK293T cells were cotransfected with an EGFP-Tdrd1 expression plasmid together with either an empty vector (lane 1), FLAG-Miwi (lane 2), or FLAG-mAgo1 (lane 3). The anti-FLAG immunoprecipitations from the cell lysates (IP:FLAG) and whole-cell lysate (WCL) inputs were blotted with antibodies against GFP and FLAG as indicated.

In this regard, we note that the Piwi clade of the Argonaute family proteins have multiple distinctive arginine-glycine repeat sequences (RG boxes), which are absent from Ago proteins (Fig. 10B and fig. S11). Such RG motifs are typical of methylation sites and, in some cases, can be recognized by members of the Tudor family (61). Indeed, we found by mass spectrometry that endogenous Piwil1 (Piwi-like 1; more commonly termed Miwi in the mouse) isolated from mouse testis is dimethylated on Arg53, which is located in an RG motif (Fig. 10B and fig. S12). We also identified methylation of Arg53 of Miwi and of four other RG sites (in the form of either mono- or dimethylation, or both) in the N termini of Piwil2/Mili and Piwil4/Miwi2, in transfected human embryonic kidney (HEK) 293T cells (Fig. 10B, red boxes). In addition, mass spectrometry–based analysis of Miwi-associated proteins, after immunoprecipitation of endogenous Miwi from murine testis lysate, identified multiple germline Tudor domain proteins as Miwi-binding partners (62). To test the idea that germline Tudor domain proteins selectively interact with RG-containing Piwi-clade proteins, we cotransfected HEK293T cells with the multi-Tudor domain protein Tdrd1 and either Miwi or mAgo1. We found that green fluorescent protein (GFP)–tagged Tdrd1 coprecipitated with Flag-tagged Miwi, but not with mAgo1 (Fig. 10C). These results are consistent with evidence showing that Piwil2 forms a complex with Tdrd1 and that two antibodies against symmetrically methylated arginines of the Sm proteins can recognize the N termini of the Drosophila Piwi proteins (6366). Although the precise mechanism through which Miwi and Tudor domain proteins associate remains to be established, we hypothesize that the emergence of germline Tudor domains, in the context of proteins with tandem Tudor domain repeats, played a role in structuring the molecular environment of germinal bodies in a fashion that facilitated the evolution of a subsystem for generating Piwi-regulated RNAs (Fig. 9B).

This analysis suggests that alterations to the Tudor domain itself, and its linkage to varying domain types, has been associated with the development of several modes of epigenetic regulation, including chromatin organization, RNA splicing, and processing of small RNAs.

Concluding remarks

Domains characteristic of eukaryotic proteins represent fundamental units of function that facilitate the evolution of biological activities both through the acquisition of new intrinsic properties and through the formation of new domain combinations. We suggest a reciprocal interaction between the domain-based organization of individual proteins and of the proteins that form the molecular environment in which they operate, which both promotes the development of new protein functions and helps to pattern subcellular structures and pathways.


Additional experimental methods are provided in Supplementary Methods (section S-IV).

Data sources

Protein sequences were obtained from the Ensembl database ( except for the sequences of kinases (figs. S2 to S4), which were obtained from ( (30), and of interacting proteins (Figs. 5 and 6 and fig. S9), which were obtained from HPRD ( (67). Protein domains correspond to SMART’s definition of domains ( and Pfam’s definition of families ( Protein sequences and genome identifiers (Gene ID and HGNC symbols) were collected from Ensembl ( and dictyBase (, and in case of the kinome and the protein tyrosine phosphatases, from (30) and (68), respectively. As a general rule, the longest peptide sequence was automatically selected to represent each gene. To identify nonredundant domains in a protein sequence, either one of two methods was used. (i) With help from a distributed program, protein sequences were searched in “batch access” against SMART’s precalculated domain database. SMART was also used to select some nonredundant Pfam definitions (69). The resulting domain list was filtered to removed the low-confidence but overrepresented domains, which include low_complexity_regions, signal_peptide, coiled_coil_regions, transmembrane_domains, LRR, LRR_TYP, LRRCT, LRRNT, Pfam:7tm_1 (In the analysis of smaller set of proteins, that is, in Fig. 2 and fig. S1, these domains are included.). In the kinome study (supplementary material), kinase domain sequences are defined by their SMART notations either as S_TKc, S_TK_X, TyrKc, STYKc, or PFAM:Pkinase. (ii) In the domain frequency study (Fig. 6B), we directly identified domains by their sequence-consensus using HMMs to search sequence databases with the HMMER package (

Computer programming environment

The algorithms for data analysis were programmed with resources such as MATLAB, C, Perl, R, and modified source code in Cluster 3.0 developed by M. Eisen ( (70). Protein secondary-structure predictions were conducted with Proteus, a Web-based tool for protein structure prediction (56).

Computer simulation of co-occurrence frequency of domains

A statistical method was used to address whether the frequency of domains shared by RhoGEFs and RhoGAPs is significantly higher than can be explained by a random combination of domains. Let G denote the set of all proteins encoded by the human genome. Let Q denote a given set of proteins in the genome and T denote a given set of domains. We are testing, and hopefully rejecting, the null hypothesis H0 that the proteins in Q contain domains in T in a way that has no difference from a randomly selected set of proteins in G, where the size of Q is larger than 50.

Denote by S a random set of m proteins selected uniformly at random from G. Define the statistic X(S) as the fraction of proteins in S that contain at least one domain in T. Noting when m is reasonably large (for example, m > 50), X(S) is well approximated as a Gaussian random variable, where we will use μm and σm to denote the mean and standard deviation of X(S), respectively, and use N(x; μ, σ) to denote the Gaussian distribution of X(S).

The example of domains shared by RhoGEFs and RhoGAPs is used to illustrate the computation required to reject the null hypothesis. With computer Monte Carlo simulations, we compute μm and σm (for example, T was chosen as the union of the domains contained in RhoGEF and RhoGAP proteins). The P value of the statistical test for H0 is then the tail probability of the N(x; μ, σ), namely,Pvalue=X(Q)+N(x;μm,σm)dx where m is the size of Q. When the P value is smaller than a prescribed significance level, the null hypothesis is rejected.

In the context of this analysis, we choose T to be the union of the domains contained in RhoGEF and RhoGAP proteins and perform two statistical tests, where in one test Q is chosen to be the set of all GEF proteins and in the other Q is chosen to be the set of the of all GAP proteins. To reject the hypotheses with higher confidence, we could choose T as the set of domains contained in GAP proteins in the first test and as the set of domains contained in GEF proteins in the second test. But for the ease of presentation, T is chosen as the union of the two sets of domains. Although the ability to reject the null hypotheses is reduced by this choice of T, as shown in Fig. 2H, the null hypotheses are still rejected with a large margin, with the computed P values in both cases less than 1 × 10−10.

Microarray data retrieval

The experimental data sets using human and mouse Affymetrix probe sets were downloaded from the Gene Expression Omnibus (GEO) Web site ( We then calculated the means and the standard deviations of the Pearson’s correlations of gene pairs in each experiment, from which we established a standard score (Z) of 2.0 as the threshold. For the purpose of identifying tissue (or cell type) specifically expressed genes, we first used a set of selected tissue-specific marker genes as the reference to select relevant data sets in GEO. The qualified data sets had to satisfy the criterion that the median correlations among the marker genes were above the threshold. Using these preselected data sets, our prediction score for any tissue-specific gene expression is defined as the correlation between the genes and the known marker genes for each tissue or cell type. We applied the procedure to each human and mouse gene in the Affymetrix probe sets and established two large score matrices that profile the global tissue-specific expression patterns of human and mouse genes. These score matrices can be broken down into individual domain groups as in the examples of Tudor and PAZ domain genes (Figs. 8A and 10A, respectively).

Two-way clustering

In protein-domain profiling, a matrix representation of a given set of proteins is derived, where each entry of the matrix is {0,1} valued. The rows of the matrix are indexed by the proteins and the columns are indexed by domains. An entry of the matrix equal to 1 indicates that the corresponding domain is contained in the corresponding protein. Every row of the matrix is the domain-composition profile of the indexed protein, and likewise, each column of the matrix is the protein-context profile of the indexed domain. A protein’s domain composition suggests its structural and functional relationship to other proteins, whereas a proteome-wide comparison of a domain’s protein context reflects its neighborhood relationship among multiple domain types.

Standard hierarchical clustering may be performed for both rows and columns of the matrix. We use Jaccard similarity coefficient as the distance metric for clustering. Specifically, for any two sets A and B, the Jaccard similarity coefficient is defined as the cardinality of the intersection of A and B divided by the cardinality of the union of A and B. Note that every domain-composition profile, although represented as a vector, may also be written as a set of domains. Therefore, the Jaccard similarity coefficient between two domain-composition profiles is well defined. Compared with other measures of similarity (for example, that based on the Hamming distance), the Jaccard similarity coefficient has the advantage of eliminating the artifact caused by elements commonly absent in A and B.

In addition to computing the distances (average linkage) between every two rows and between every two columns, two-way clustering results (the hierarchical structure) may be used to simultaneously permute the rows and columns of the matrix to better visualize the aggregation patterns. Here, we refer to the resulting matrix as a two-way clustergram.

It should be noted that although two-way clustergrams are the most intuitive way to visualize the results of clustering in terms of functional and evolutionary relatedness, such a means of visualization is limited by the fact that the proteins or domains are forced into a linear order along the axes. This occasionally results in some related proteins or domains being placed an artificially large distance apart.

Domain subnetwork correlation in the global context of domain profiling by spreading on graph clustering

We developed a method referred to as spreading on graph (SOG) clustering to correlate multiple sets of protein networks in a global context.

Specifically, denote S1, S2, … SK be K sets of proteins, where each set is a subset of a larger collection of proteins G, which typically is a proteome of interest. The purpose of SOG clustering is to correlate (and as well differentiate) the functional networks of S1, S2 … SK within the global network of proteome G.

With appropriate annotation of the domains involved in G, let Δ denote the set of all domains involved in the proteins in G. Two-way clustering is first performed, yielding not only the clustergram (for visualization purposes), but also the distances between every pair of domains and between every pair of proteins. For every pair of proteins (p1, p2), we use DP (p1, p2) to denote the distance between p1 and p2, and for every pair of domains (d1, d2), we use DD (d1, d2) to denote the distance between d1 and d2. The notion of distance here is not the Euclidean distance on the grid, but the index of Jaccard coefficient obtained from clustering the proteome. For each protein set Pk, k=1, 2, … K, we compute a two-variable function Qk(p,d) as follows, where p ranges in all proteins in G and d ranges in all domains in Δ:

Qk(p,d)=(p,d)[pd]exp{(ωdDd2(d,d)+ωpDp2(p,p))} (1)ωp=ωd|G|/|Δ|(2)

In Eq. 1, notation [p′ ~ d′] is the function that evaluates to 1 if domain d′ is contained in protein p′, and evaluates to 0 otherwise. Parameters ωd and ωp are appropriately chosen positive numbers. Effectively, Qk(p, d) is a “density” at coordinate (p, d) as sketched in Fig. 5A and fig. S8B, where ωd and ωp govern the spread along the “domain direction” and along the “protein direction,” respectively.

It is worth noting that such a spreading procedure is a necessary step to compare two sets of proteins on the same template of a clustering matrix. Such a comparison is essentially a comparison between two sets of (protein, domain) pairs, each respectively characterizing one of the two protein sets. However, a naive comparison of the two binary matrices representing the domain profiles of the two protein sets is not sufficient to capture the similarity and dissimilarity of the two protein sets. Specifically, one should not regard two different proteins as being completely dissimilar. Rather, the similarity or relatedness between proteins, between domains, and between (protein, domain) pairs should be taken into account when comparing the two sets of proteins.

For instance, one way to relate a pair of proteins sharing varying degrees of similarity in domain composition can be achieved by spreading the impact on each other following an appropriate decay formula (Eqs. 1 and 2, and as illustrated in fig. S8B). Given that these two proteins belong to distinct protein sets under consideration, their relatedness as a result of spreading contributes to the relatedness of the two sets in a quantitative way. As a simple example, two proteins having identical domain architectures would occupy different sets of coordinates in the two-way clustergram. When the two proteins are assigned to two different protein sets of interest and spreading is not applied, these two proteins would, undesirably, enhance the dissimilarity between the two protein sets.

Heuristically, we have chosen Eq. 2 (where |G| and |Δ| denote the cardinality of a set) so as to appropriately reflect the rather significant difference between the number of considered proteins and the number of involved domains. The value chosen for ωd plays no essential role in the processing, serving only to globally scale each density function Qk. With this processing, the Qk values across the two-way clustergram G can be visualized as an image where coordinates on the horizontal and vertical axes are ordered respectively according to their respective linear orders in the clustergram, and the pixel attributes are the values of Qk. Any two such images with the identical coordinate system (for example, Fig. 5B) are readily compared using standard mathematical tools. Specifically, we scanned each image row by row to form a vector and then compared two such vectors with Euclidean distance to capture the relatedness of the two protein sets represented by the two images. Finally, the relatedness across all images can be reflected in a tree presentation (Fig. 6A) obtained with a standard clustering procedure using average linkage based on the Euclidean distance of the scan profiles.


We acknowledge G. Gish, L. Taylor, K. Colwill, J. Min, C. Wells, and D. Wishart for advice; S. Zhang and E. Tenaglia for technical assistance; G. Wiggin, R. Bagshaw, G. Bader, M. Kofler, J. Parkinson, J. Scott, J. Dennis, and I. Taylor for critical reading of the manuscript; L. Taylor, P. Metalnikov, V. Nguyen, and A. Pasculescu for technical assistance with mass spectrometry; and H. Welch for P-Rex1 complementary DNA. J.J., C.C., and J.G.P. are recipients of fellowships from the Canadian Institutes of Health Research. This work is supported by grants from Genome Canada, through the Ontario Genomics Institute, and the Canadian Institutes for Health Research (grant MOP-6849), the Canadian Cancer Society, the Ontario Research Fund, and the Premier’s Summit Award.

Supplementary Materials

Section S-I. Characterization of the functional relationships between RhoGEF/RhoGAP pairs Tuba/Rich1, and DEPDC1/P-Rex1.

Section S-II. Correlation between protein kinase regulatory and catalytic domains.

Section S-III. Web site for domain clubs across seven genomes.

Section S-IV. Supplementary methods.

Section S-V. Supplementary references.

Table S1. Domains found in common versus domains that are unique to SH2 domain proteins or SH3 domain proteins.

Fig. S1. Functional characterization of RhoGEF and RhoGAP protein pairs.

Fig. S2. Domain profiling of human protein kinases (PTK) and protein tyrosine phosphatases (PTP).

Fig. S3. Domain profiling of the human kinome.

Fig. S4. Domain profiling across six eukaryotic kinomes.

Fig. S5. Domain club cut-off selection.

Fig. S6. Test of the robustness of the domain profiling method for applications in large data sets.

Fig. S7. A Web site for domain club analysis—example of Dicer1 and its DSRM domain.

Fig. S8. General procedure for spreading-on-graph (SOG) clustering in the analysis of the molecular environments of domain families.

Fig. S9. The interactomes of the 70 domain families employed in the analysis of domain-based functional compartments.

Fig. S10. An overview of the Royal Family of domains.

Fig. S11. Sequence characteristics of the mouse Piwi clade of the Argonaute proteins.

Fig. S12. An MS/MS spectrum of a tryptic peptide containing dimethylated (2me) Arg53 of endogenous Miwi protein from testis.

References and Notes

View Abstract

Navigate This Article