COMPUTATIONAL BIOLOGY

Rules for Modeling Signal-Transduction Systems

See allHide authors and affiliations

Science's STKE  18 Jul 2006:
Vol. 2006, Issue 344, pp. re6
DOI: 10.1126/stke.3442006re6

Abstract

Formalized rules for protein-protein interactions have recently been introduced to represent the binding and enzymatic activities of proteins in cellular signaling. Rules encode an understanding of how a system works in terms of the biomolecules in the system and their possible states and interactions. A set of rules can be as easy to read as a diagrammatic interaction map, but unlike most such maps, rules have precise interpretations. Rules can be processed to automatically generate a mathematical or computational model for a system, which enables explanatory and predictive insights into the system’s behavior. Rules are independent units of a model specification that facilitate model revision. Instead of changing a large number of equations or lines of code, as may be required in the case of a conventional mathematical model, a protein interaction can be introduced or modified simply by adding or changing a single rule that represents the interaction of interest. Rules can be defined and visualized by using graphs, so no specialized training in mathematics or computer science is necessary to create models or to take advantage of the representational precision of rules. Rules can be encoded in a machine-readable format to enable electronic storage and exchange of models, as well as basic knowledge about protein-protein interactions. Here, we review the motivation for rule-based modeling; applications of the approach; and issues that arise in model specification, simulation, and testing. We also discuss rule visualization and exchange and the software available for rule-based modeling.

Introduction

Many diseases are caused by molecular changes that affect signal-transduction systems, and some of these diseases, such as chronic myelogenous leukemia, can now be treated by drugs that target signaling proteins, such as kinases (1). Thus, in addition to our curiosity about the fascinating mechanisms that cells use to respond to signals, there is practical motivation to better understand the processes of cellular signaling, in which protein-protein interactions play a central role. Indeed, we would like to make accurate predictions about the functional roles of proteins and the effects of modifying the interactions of proteins in particular systems. Our ability to make such predictions is a measure of our basic understanding of molecular cell biology and is likely to have practical consequences in drug discovery (2, 3).

The behavior of a signal-transduction system depends on dynamic interactions among its proteins (4, 5). The combined effects of these interactions are difficult to predict from intuition alone. When intuition is insufficient, a mathematical model is often useful for acquiring a quantitative and predictive understanding of a complex dynamical system, and mathematical modeling is being increasingly used to aid in studies of cellular signaling (6, 7). Models have now been developed and tested for signaling events mediated by several well-studied cell surface receptors, including the epidermal growth factor receptor (EGFR) (8) and two of the antigen-recognition receptors of the immune system (9).

However, current models are still far from capturing all of the relevant mechanistic details of signal-transduction systems that must be considered to provide realistic and complete pictures of how these systems work (10). In particular, models often fail to account for the complexities of protein-protein interactions, such as how these interactions depend on contextual details at the level of protein sites. Also, few models account for the enormous number of possible posttranslational covalent modifications of proteins and for formation of all the possible protein complexes (11). A major reason for this shortcoming is the strain on conventional modeling approaches caused by the combinatorial potential of protein-protein interactions. New modeling approaches that address this problem involve the use of rules to represent protein-protein interactions. (Rules are also useful for representing other types of biomolecular interactions, but we will focus on protein-protein interactions for purposes of discussion.) The introduction of rules greatly eases the task of specifying a model that incorporates details at the level of protein sites. A rule—such as "ligand binds receptor with rate constant k whenever ligand and receptor have free binding sites"—describes the features of reactants that are required for a particular type of chemical transformation to take place. Rules simplify the specification of a model when the reactivity of a component in a system is determined by only a subset of its possible features. Here, we review rule-based modeling of signal-transduction systems.

Modeling Goals and Challenges

What do we expect from a model?

A model should incorporate enough details to make testable predictions about quantities that can be measured and controlled in experiments. Whenever possible, the parameters of a model should have a physical rather than phenomenological basis, such that parameters are independent of system behavior. A model should provide insights and guide experimentation by revealing the logical consequences of knowledge and assumptions about the mechanistic details of a system, including the proteins in the system and their binding sites, enzymatic activities, and sites of posttranslational modification. The development of models at this level of resolution is justified by current and emerging capabilities for characterizing systems at the level of protein sites and for monitoring the dynamics of multiple readouts of protein interactions (1216). Thus, although other levels of abstraction may prove useful, we are most interested in models with physicochemical parameters that capture details about proteins and their interactions at the level of protein sites, which we take to include motifs, domains, and subunits.

General features of proteins and protein-protein interactions

Protein-protein interactions triggered by a signal can lead to various covalent posttranslational protein modifications, such as phosphorylation, ubiquitination, acetylation, or methylation (4, 5, 17). The effect of such modification, which is often reversible, may be a change in a protein’s enzymatic or binding activity, location, or rate of turnover. Signal-regulated protein-protein interactions also mediate the assembly of heterogeneous protein complexes. A common effect of complex formation is an increase in the activity and specificity of an enzyme through colocalization of the enzyme with a substrate (18, 19). Signal-induced protein complexes are also known to regulate enzymatic activity through allosteric mechanisms (20, 21).

The enzymatic and binding activities of proteins involved in signaling tend to be localized to modular domains (19, 22), such as the protein tyrosine kinase (PTK) and Src homology 2 (SH2) domains of a Src family kinase. Proteins may also contain smaller parts, like immunoreceptor tyrosine-based activation motifs (ITAMs) (23), that are sites of modification or binding or both (24). A signaling protein generally has multiple domains (25, 26) and binding sites (and binding partners, which may have overlapping specificities), as well as multiple sites subject to posttranslational modifications (which may include the binding sites themselves) (27) (Fig. 1). Such multiplicity may permit many combinations of modification and binding events, which can be difficult to track in a model (11, 28, 29).

Fig. 1.

Frequency distribution of multisite protein phosphorylation. This plot is based on information about phosphorylation of serine (S), threonine (T), and tyrosine (Y) residues of 1663 proteins documented in version 3.0 of the Phospho.ELM database (http://phospho.elm.eu.org/). The vertical axis indicates the fraction of proteins; the horizontal axis shows the number of phosphorylation sites. More than half the proteins are phosphorylated at two or more sites.

Combinatorial complexity

As a modeler considers more proteins and protein sites of a signal-transduction system, the number of possible protein complexes and combinations of protein modifications tends to increase exponentially. Moreover, hundreds to thousands of chemical species may be generated by the interactions of only a few proteins (28, 3033). Thus, a signal-transduction system can be quite large when viewed as a chemical reaction network. We refer to this potential for large size as "combinatorial complexity."

One source of combinatorial complexity, to which we have already alluded, is multisite protein modification (27). Consider the EGFR. According to a recent review (34), at least nine tyrosines in EGFR are phosphorylated during signaling (Fig. 2). Phosphorylation of these tyrosines is mediated by a nonreceptor PTK, Src, or EGFR itself when the PTK domain of one EGFR in a ligand-induced receptor dimer catalyzes phosphorylation of the paired EGFR in the dimer. As a simplification, let us assume that only one of nine tyrosines of EGFR can be phosphorylated at a time. Under this assumption, there are still 10 different phosphoforms of EGFR and 55 distinct combinations of these phosphoforms for a receptor dimer. These numbers grow to 512 and 131,328 if more than one tyrosine of EGFR can be phosphorylated at a time. The number of phosphorylation states relevant for signaling in particular contexts is unknown, and it seems unlikely that all states are functionally relevant or even realized. Indeed, some phosphorylation states may be prohibited (35). However, a modeler may wish to consider the full spectrum of possible phosphorylation states to assess their relevance in an unbiased manner. The states included in a model will depend on the questions asked and the opportunities a modeler may have to make meaningful simplifications.

Fig. 2.

Multisite phosphorylation of EGFR. The nine tyrosine residues indicated here are phosphorylated by EGFR or Src (34). Phosphorylation of particular sites regulates interactions of EGFR with intracellular binding partners, eight of which are indicated in the figure [including PLC-γ (phospholipase C–γ)]. These binding partners bind EGFR through their SH2 and/or phosphotyrosine binding (PTB) domains. Other tyrosine residues of EGFR are also subject to phosphorylation, and additional binding partners of EGFR are known (14, 16). Numbering of residues is based on the full-length sequence of EGFR.

Another source of combinatorial complexity is multivalent binding. Consider the death domain (DD), a protein interaction domain found in many proteins (36, 37). A model of a complex of DDs of the death receptor Fas (also called CD95) and the Fas-associated protein with death domain (FADD) are shown in Fig. 3 (38). The model, which is based on interfaces seen in crystal structures (39, 40), predicts that the Fas and FADD DDs form a hexamer through three interfaces. In fact, interactions through the three interfaces can generate a distribution of homo- and heterooligomers: 6 homodimers, 6 heterodimers, 20 homotrimers, 66 heterotrimers, and so on. Without regard for physicochemical limitations, the reaction network implied by the DD-DD interactions is of infinite size. Of course, the actual distribution of oligomers will be limited by a number of factors, such as protein copy numbers, binding affinities, steric effects, and cooperativity. However, to elucidate how these factors shape the distribution, a modeler may wish to account for all possibilities.

Fig. 3.

Multivalent binding of the DD components of Fas and FADD. (A) View of a three-dimensional model of a 3:3 heterohexamer of the DDs derived from knowledge-based docking of homology models of the individual DDs (38) [reproduced by C.-S. Tung, Los Alamos National Laboratory]. The configuration of DDs is planar in this complex. (B) A schematic model of the same complex indicating that the DDs interact through three types of interfaces. The type I and II interfaces are seen in crystal structures of other protein complexes composed of members of the DD superfamily (39, 40). These interfaces were used to guide docking. The type III interface is a prediction of the model (38). Binding of Fas and FADD DDs through the different interfaces can produce a variety of protein complexes.

Dynamical models that account comprehensively for the possible species in a system can help determine which of these species are relevant for signaling, under what conditions, and why. In an analysis of such a model (41), Faeder et al. (42) found that only a small fraction of the possible species in the model is effectively populated, although there are enough proteins to populate all species in the model at a nontrivial level. When parameters that affect system dynamics (such as protein concentrations) change, the populated species can shift dramatically. These shifts are not predicted by reduced models that omit species. Other theoretical analyses also indicate that the populated species in a system are influenced by parameters that affect system dynamics (4345). Thus, if certain species of a system appear to be favored, as in temporal ordering of phosphorylation events (35), a dynamical explanation might be suggested and evaluated with the aid of a model that has the capability to predict how the favored species will be affected by perturbations of system dynamics. A model that has such a capability is one that accounts for all possible species.

Approaches to Model Specification

Combinatorial complexity raises a series of problems when one is modeling signal-transduction systems as chemical reaction networks. One problem is model specification, the task of stating one’s knowledge and assumptions about a system in a way that enables mathematical analysis.

Conventional approach to model specification

The textbook approach to modeling a biochemical system is to draw a reaction-scheme diagram depicting the chemical species and reactions in the system and then to translate this diagram, which is essentially an organized layout of a list of reactions, manually into a set of equations (46), such as a system of coupled ordinary differential equations (ODEs). Reaction-scheme diagrams that have been drawn for signal-transduction systems, some of which are quite large (47, 48), are based mostly on knowledge and assumptions about protein-protein interactions. An example of a simple scheme is provided in Fig. 4. A drawback of a reaction-scheme diagram is that it obscures the underlying protein-protein interactions by not explicitly representing them. Still, a diagram is far easier to interpret than the corresponding equations, and the readability of a diagram can be improved by using iconographic annotation. For example, Fig. 5 shows how the scheme of Fig. 4 can be made more readable with the notations of Kitano et al. (49) and Faeder et al. (50). In some cases, a reaction scheme can serve the purpose of model specification well.

Fig. 4.

A reaction scheme. The scheme includes seven chemical species, labeled R2, RP1, RP2, Grb2, RP1-G, RP2-G, and RP2-G2, and 12 reactions, labeled vp1, vp2, vp3, vd1, vd2, vd3, v+1, v–1, v+2, v–2, v+3, and v–3. Such notation is typical. Each label must be defined before this reaction scheme can be understood. Also, to derive a mathematical model from this scheme, a rate law must be assigned to each reaction.

Fig. 5.

Elaborations of the reaction scheme shown in Fig. 4. (A) A process diagram drawn according to the conventions of Kitano et al. (49). (B) The same scheme drawn according to the conventions of Faeder et al. (50) and Blinov et al. (61) for graph-based representation of chemical species and reactions. Below each chemical species graph, a corresponding definition in the BioNetGen Language (BNGL) is given. The meaning of the scheme when drawn as in either of these two panels is more self-evident than when drawn in the conventional way, as in Fig. 4.

Software tools have been developed to help biologists to draw reaction-scheme diagrams, to translate these diagrams into mathematical equations, and to perform model-based calculations by using standard methods of scientific computing (http://sbml.org/). Examples include CellDesigner (51) and the Virtual Cell (52). However, these tools are useful only if the textbook approach to model specification can be applied. In many cases, combinatorial complexity makes this approach prohibitively time-consuming and error-prone, even with the aid of powerful software tools for drawing reaction-scheme diagrams (53).

Limits of conventional modeling

Although the conventional approach to model specification is problematic for signal-transduction systems, which are marked by combinatorial complexity, it is nevertheless the approach most often used to specify models for these systems, but at a cost. Models derived in this way are invariably based on assumptions, which may be difficult to justify, that limit the chemical species and reactions considered to a fraction of those possible. An example is the model of Kholodenko et al. (54) for EGFR signaling, which has been extended by a number of researchers (5558). This model is based on mechanistic assumptions that result in a selective focus on only a fraction of the protein complexes and phosphorylation states that could potentially arise from the protein-protein interactions considered in the model. For example, one assumption is that ligand-induced dimers of EGFR are unable to dissociate when receptors are phosphorylated, which seems unlikely. This assumption arises from a description of signaling events as an ordered pathway, which is consistent with the way this system is presented in typical diagrammatic interaction maps, but inconsistent with rapidly reversible reactions and multiple branching possibilities. Lifting this and other assumptions causes a combinatorial explosion in the number of possible reactions and species (59), which makes manual model specification impractical.

Rule-based approach to model specification

Given that protein-protein interactions can generate large reaction networks, what can be done to capture the essence of these interactions without ignoring their combinatorial complexity? To address this question, a number of research groups have suggested, in one form or another, a new starting point for model specification. The basic idea is to specify protein-protein interactions as rules that serve as generators of chemical reactions (or reaction events) and species. A rule specifies the features of proteins that are required for or affected by a particular protein-protein interaction. A rule can be viewed as a definition of a reaction class, a generalized reaction. The relevant features of an interaction can often be identified, at least to a first approximation, because of the modularity of protein catalytic and interaction domains (19).

In one approach to rule-based modeling (60, 61), which is implemented in the BioNetGen software package, rules are used as generators of chemical species and reactions as follows. A rule comprises patterns for recognizing reactants, a mapping of reactants to products, and a rate law. Given a set of initial chemical species, represented with text strings or graphs, each rule is used to identify, through pattern matching, the species that have features required to undergo the transformation from reactants to products specified in the rule. Each transformation is performed to obtain product species and the transformation is assigned a rate law, the one associated with the corresponding rule. By convention, it is assumed that the interaction represented in a rule is independent of features not explicitly indicated. Thus, multiple species may qualify as reactants in a type of reaction defined by a rule, and multiple reactions may be generated that have the same characteristic rate law, although parameters of the rate law may need to be adjusted for a variety of contextual reasons (60, 62). The exact number of reactions generated by a rule depends, in general, on the entire set of rules in which the rule of interest is embedded and also on the set of species to which rules are initially applied. The species and reactions may either be generated in advance of a simulation, the "generate-first" approach, or during the course of a simulation, the "on-the-fly" approach (60, 62, 63). With the generate-first approach, a network of species and reactions is generated through iterative application of the rules until a specified, arbitrary termination condition is satisfied or no new reactions are generated (60). With the on-the-fly approach, reactions are generated as new species become populated, which may be advantageous when the network is large or unbounded, as is the case when rule application is nonterminating in the absence of an arbitrary halting condition (60, 62). An example of a rule for which rule evaluation is nonterminating is provided in Fig. 6.

Fig. 6.

An example of a BioNetGen model specification, made up of a single rule and set of two seed species, for which the iterative rule evaluation procedure of Faeder et al. (60) and Blinov et al. (61) is nonterminating. (A) The seed species, a free ligand (oval) and a free receptor (box with rounded corners). Both the ligand and receptor have two identical binding sites (circles). The species are represented as graphs according to the conventions of Faeder et al. (50) and Blinov et al. (61). (B) The rule, which is a graph-rewriting rule, defines a class of irreversible bimolecular association reactions in which the products are polymer-like chains of ligands and receptors. The rule indicates that a bond can form between a ligand with a free binding site and a receptor with a free binding site if the ligand and receptor are not already associated directly or indirectly. The plus symbol in the rule serves to specify a molecularity of two for reactions generated by the rule. (C) Reactions generated after two rounds of rule evaluation. The first reaction is generated after one round of rule evaluation. The product of this reaction is a 1:1 ligand-receptor aggregate, a new species. The next round of rule evaluation generates two new reactions, the products of which are two new species (1:2 and 2:1 ligand-receptor aggregates). Each subsequent round of rule evaluation generates at least one new reaction in which a product is a new ligand-receptor aggregate composed of more molecules than any aggregate previously generated. Thus, generation of all possible reactions implied by a rule for a protein-protein interaction is not always possible.

Applications of rule-based modeling and first software tools for model specification

To date, rule-based modeling has been applied to only a handful of systems, but there is an increasing awareness of the need for such models. In considering whether a particular application represents an example of a rule-based model, we ask the following question: Are the chemical species and reactions in the model drawn from a fixed user-specified list or are they generated automatically in some way by a computer algorithm from a set of rules? The latter represents what we call rule-based models. The programs used to generate rule-based models can be divided into two categories: those that are specific to a particular application and those that generalize to a range of problems.

Rule-based modeling has been applied to bacterial chemotaxis mediated by the Tar-receptor complex [for reviews of the biology, see (64, 65)]. The complexity of models for receptor-mediated phosphorylation events (66) and the aggregation of receptors (67) led Bray and co-workers to develop two general-purpose computer programs, OLIGO (68) and StochSim (69). These programs divide the modeling problem into two parts—formation of multisubunit signaling complexes, and phosphorylation cascades.

OLIGO uses graphs to represent multiprotein complexes, with each node representing a protein and each edge representing a noncovalent interaction between two proteins. A user draws a contact graph with the help of a graphical user interface (GUI). OLIGO then generates a reaction network by disassembling the complex in all possible ways. For example, the six-subunit complex that is active in Tar-mediated signaling is decomposed into its three protein constituents, and in the process, a network of 14 reversible binding reactions is generated. A subsequent program, oligo-d (43), generates a network in the opposite order, starting with a list of proteins and binding sites. Both programs generate networks on the basis of the assumption that rings form whenever possible. This simplification prevents polymerization through chain propagation. Although this assumption may be valid in some cases, there are cases where extensive oligomerization of signaling proteins occurs (70). A further and more severe limitation of these software tools is that regulation of protein interactions through covalent posttranslational modifications, such as phosphorylation, cannot be considered.

Use of oligo-d provided a qualitative explanation for how signaling networks can exhibit a decrease in signaling when complex-forming proteins are overexpressed, a phenomenon we will refer to as high-dose inhibition. Bray and Lay (43) found that bridging subunits that connected separate parts of a complex were particularly likely to exhibit such effects, but also showed that any subunit with multiple bonds to a complex could exhibit high-dose inhibition if the binding constants in the network were optimized for such behavior. Levchenko et al. (44) modeled high-dose inhibition for the particular contact geometry of a scaffold protein that binds kinases of a mitogen-activated protein (MAP) kinase cascade. Although this model was constructed manually, a Mathematica-embedded biochemical modeling package called Cellerator (71), an early example of software for rule-based modeling, was later used to generate automatically and to solve the differential equations for a generalized model of kinase-scaffold interactions (72). Specific functions (rules) were written to generate the species and reactions for scaffolds and scaffold dimers with various numbers of binding sites and other properties. The effects of previous assumptions made to restrict network complexity, such as a rapid second phosphorylation step for kinase activation, were investigated and found to have a quantitative, but not qualitative, effect on the observed behavior of the system. In particular, the scaffold concentration for optimal downstream activation was found to be insensitive to modeling assumptions.

StochSim (73, 74), the other main modeling package developed by Bray and co-workers, is complementary to OLIGO in that it allows detailed modeling of the kinetics of covalent modifications, but has fairly severe limitations for modeling complex formation. Further discussion of StochSim’s design and capabilities is provided below. An interesting feature of StochSim is that it can be used to consider nearest-neighbor interactions on regular two-dimensional lattices. This capability was developed to study lateral interactions among receptors in a static configuration (75, 76).

Goldstein and co-workers have developed a detailed model of early events in signaling by FcεRI (41), the high-affinity receptor for immunoglobulin E (IgE) antibody, which plays an important role in allergic reactions. The model considers phosphorylation and binding events initiated by receptor aggregation and encompasses 354 chemical species and 3680 chemical reactions, but it is based on only 25 parameters (21 rate constants and 4 protein concentrations). This economy of parameters is achieved by using reaction rules describing conditional protein binding and phosphorylation events to generate the species and reactions that arise in the network model. Each reaction generated by a given rule uses the same rate constant. Thus, reaction rates are assumed to depend on a subset of the features of reactant species. A typical assumption is that the rate constant for ligand-receptor binding is not affected by the cytoplasmic state of the receptor. Analysis of this model has provided a number of insights into how the rate constants and affinities that govern particular interactions within receptor aggregates affect the outcome of signaling (41). For example, it was found that the multiple requirements for transphosphorylation within a receptor aggregate enable the system to discriminate effectively between ligands that bind with short and long half-lives (9), an effect known as kinetic proofreading (77).

Chakraborty and co-workers have developed two distinct models of early events in T cell receptor (TCR) signaling that have led to interesting insights and hypotheses about the effects of spatial organization (45) and cooperative interactions (78) on signal amplification. In the model of Lee et al. (45), the junction between a T cell and a stimulatory antigen-presenting cell is represented by a three-dimensional lattice on which various proteins involved in early signaling events diffuse and react. A set of interaction rules based on a detailed reaction scheme is used to determine which biochemical events can occur between a given pair of particles. At each simulation step, one of three possible update moves is selected at random with equal probability: diffusion of a molecule or complex, association or dissociation involving a pair of molecules chosen at random, or a state transition involving a pair of molecules chosen at random. This method avoids the problem of explicitly generating species and reactions but is restrictive in that the computational cost scales with the number of particles squared, limiting simulations to relatively small numbers. A more severe problem that is specific to this model is that although the model includes both reaction and diffusion effects, little attempt is made to simulate the correct rates of reaction and diffusion relative to each other or the correct absolute rates of reactions. Because quantitative simulations can be performed by using physically correct and computationally tractable methods (79), it might be interesting to do so to see if conclusions inferred from the original model change.

A different, nonspatial model was used by Li et al. (78) to model cooperative effects involving the TCR, its coreceptor CD4, and its ligands [peptide–major histocompatibility complex (MHC) complexes]. The model was designed to test the hypothesis that low-affinity binding between MHC molecules bound to endogenous (self) peptides could amplify signaling by a small number of high-affinity peptide-MHC complexes containing antigenic (foreign) peptides. The basic mechanism for cooperative enhancement was proposed to be the formation of a pseudo-dimer complex, in which a high-affinity bond between antigenic peptide-MHC and a TCR acts as a scaffold to recruit a CD4 molecule, its constitutively associated molecule of the kinase Lck, and a second peptide-MHC molecule, most likely bound to endogenous peptide. The pseudo-dimer complex serves as an efficient machine for TCR activation, because TCRs that transiently bind the free peptide-MHC in the complex come into close proximity with the Lck, which phosphorylates TCR as the initial step in TCR activation. Reaction rules are given that describe the association and dissociation of these components, as well as phosphorylation and dephosphorylation, up to hexameric complexes (the size of the pseudodimer loaded with TCR), generating a network of 741 distinct reactions. A special-purpose program was used to generate the network, and dynamics were simulated by using the Gillespie algorithm (80, 81). The model is constructed and parameter values are chosen so as to favor the formation of pseudo-dimer complexes only when high-affinity peptide-MHC is present. Under these conditions, cooperative enhancement of TCR signaling is observed at low densities of antigenic peptide, and the strength of this enhancement depends strongly on the affinity of interactions between TCR and the low-affinity (endogenous) peptide-MHC, which suggests a possible biological role for positive selection of TCRs that exhibit a relatively high affinity for endogenous MHC. On the basis of new experimental evidence, Krogsgaard et al. (82) have proposed a modification of the pseudodimer model in which the second TCR is recruited to the complex through its interaction with CD4, and it would be interesting to see how this change might affect the behavior of the network model.

Woolf and Linderman (83) used rules to model binary interactions among sets of G protein–coupled receptors (GPCRs), which they took to be in active or inactive states. Using Monte Carlo methods, they determined how rules for GPCR dimerization and changes of these rules affected the spatial distribution of receptors on a two-dimensional grid and receptor signaling. Using their rule-based models, they were able to make predictions consistent with experimental measurements of receptor clustering and second messenger signaling.

Haugh et al. (84) used a series of rule-based models to investigate mechanisms by which protein tyrosine phosphatases (PTPs) might regulate signaling through their association with membrane-bound receptors. Several cytosolic PTPs contain SH2 domains that allow them to bind directly to phosphorylated receptors. These interactions can increase PTP activity through at least three mechanisms: (i) direct allosteric activation, (ii) indirect allosteric activation through phosphorylation of the PTP by a receptor-associated kinase, and (iii) proximity to receptor phosphotyrosine substrates. Haugh et al. (84) found that the relation between PTP activity and receptor activation could be tuned by alterations in the relative expression levels of proteins that competed with PTPs for binding sites, leading to high-dose inhibition (with respect to the degree of receptor activation) under some conditions. The differential equations of the model were generated by a special-purpose program. The model also included algebraic equations, because Michaelis-Menten rate laws were used to characterize the kinetics of phosphorylation and dephosphorylation reactions with multiple substrates. These equations can be avoided by using elementary reactions of the Michaelis-Menten mechanism to model catalytic steps (85), at the cost of increasing the size of the reaction network (86).

An important application of rule-based modeling has been to examine the validity and implications of assumptions that have traditionally been made to limit the extent of combinatorial complexity. Conzelmann et al. (87) and Blinov et al. (59) have both developed rule-based versions of the EGFR signaling model originally developed by Kholodenko et al. (54). Because the primary focus of Conzelmann et al. (86) was model reduction, which we discuss elsewhere, we focus here on the efforts of Blinov et al. (59).

The rule-based version of the EGFR model was developed by generalizing the individual reactions in the Kholodenko model with the BioNetGen software package (88). Each reaction was assumed to be a specific instance of an interaction that could be expressed more generally as a reaction rule by using the rate constant of the original reaction. Thus, there is a one-to-one correspondence between reactions in the original model and reaction rules in the expanded model, and no new rate parameters are introduced. As an example, in the original model, dimerization of EGFR bound to the epidermal growth factor (EGF) is described by a single reaction with forward and reverse rate constants k+2 and k–2, respectively. In the expanded model, this reaction is converted to a reaction rule that specifies that a molecule of EGFR that is bound to EGF but has a free dimerization site can bind to another EGFR molecule meeting the same criteria with the same forward and reverse rate constants as in the original model. Because EGFR has several additional binding and phosphorylation sites, this rule applies to a large number of EGFR-containing species and generates a large number of potential dimerization reactions. In addition, this rule lifts an assumption implicit in the original model that dimers of EGFR could only break up if both receptors were unphosphorylated. This assumption precluded the existence of phosphorylated EGFR monomers, which have been shown to be significant contributors to signaling through EGFR under certain conditions (89).

For the same values of the rate parameters, the original and expanded models were found to predict nearly identical behavior for the outputs examined by Kholodenko et al. (54), but the expanded model makes new predictions that can be tested experimentally. For example, the expanded model predicts different phosphorylation kinetics for the two primary EGFR tyrosines included in the model. Recent proteomic measurements indicate that different phosphorylation sites on EGFR exhibit different kinetics and that the kinetics depend on the interaction partners at the various sites (14). These results provide motivation for considering individual tyrosines in models. Further motivation is provided by results indicating that small-molecule kinase inhibitors have different effects at different tyrosines (90). In summary, recasting a heuristic model of a signaling pathway increased the computational complexity, but did not require additional model parameters and also facilitated understanding of the model by unveiling hidden assumptions in the original model. The rule-based model better reflects our understanding of protein-protein interactions, is more easily extendable, and generates additional predictions.

Model Simulation, Reduction, and Checking

In addition to model specification, a number of other problems arise from the large number of possible reactions in a signal-transduction system. In this section, we briefly review how rules have been used to improve the efficiency of procedures for simulating the dynamics of a system. We also discuss systematic methods for reducing a model to a simpler one and for verifying that the dynamical behavior of a model is consistent with specified properties.

Simulation

The cost of simulating the kinetics of a reaction network depends on the size of the network. The exact scaling of cost with network size depends on a number of factors and varies from problem to problem, but the scaling is nonlinear for practical methods. Thus, if a network is large, the simulation cost can be expensive in terms of computation time and/or computer memory. To address this problem, Lok and Brent (62) proposed a method of simulation that takes advantage of rules, the on-the-fly method mentioned earlier, which is similar to approaches that have been used to simulate chemical systems (9193). In this method, rule evaluation is embedded in a discrete-event Monte Carlo simulation of reaction kinetics, and reactions are generated only when a species is first populated during a simulation. Thus, parts of a network unconnected to populated species are ignored, which may speed calculations and reduce memory requirements. This technique of on-the-fly network generation provides a practical benefit only when the populated fraction of a network is sufficiently small and branching is limited. For a highly branched network, the cost of updating the list of reactions connected to populated species can be an important factor even if few of the possible species are populated (Fig. 7).

Fig. 7.

The number of potential reactions versus the number of populated species when trivalent ligands interact with bivalent cell surface receptors. These results are generated through on-the-fly network generation with BioNetGen. Reactions are generated by six rules, which are similar to the rule shown in Fig. 6. The parameters of the simulation are as follows. The total ligand and receptor concentrations are 7 and 0.5 nM, respectively. Rate constants associated with elementary rate laws for three of the rules are k+1 = 0.0004 nM−1s−1, k+2 = k+3 = 0.04 nM−1s−1, where k+1 is the rate constant for binding of a receptor site to a site on a free ligand, k+2 is the rate constant for binding of a receptor site to a site on a ligand bound once, and k+3 is the rate constant for binding of a receptor site to a site on a ligand bound twice. The corresponding reverse rate constants for the elementary rate laws associated with the three remaining rules are each 0.01 s−1. As can be seen, a few hundred species have the potential to participate in hundreds of thousands of distinct reactions. Thus, simulation methods that rely on reaction generation, including on-the-fly methods, will not be computationally feasible for large-scale networks in some cases.

The problem of simulating a highly branched network is addressed by another Monte Carlo method that takes advantage of rules (69, 94) and is implemented in the StochSim software package (73). In the StochSim algorithm, rules are used during a simulation to generate discrete reaction events (that is, to execute reactions), rather than to generate reactions. A list of reactions is unnecessary, and the cost of generating one is avoided. Thus, the StochSim algorithm may provide a computational cost savings when many reactions are possible but only a fraction actually occur. However, its applicability is limited, because StochSim rules explicitly represent only state changes of proteins. Representing changes of connectivity can be problematic. Thus, for example, StochSim cannot be used in a straightforward way to model the system of Fig. 6.

New simulation capabilities are needed for rule-based modeling, because reaction networks derived from rules are unusually large, which is problematic for conventional simulation procedures. It may be possible to develop additional methods, like those mentioned above, that take advantage of the underlying rules of a rule-derived reaction network to speed simulation. For example, it may be possible to extend on-the-fly methodology from stochastic simulations to ODE-based simulations (95, 96), which would likely be more efficient. Current on-the-fly methods essentially consider a species or set of species to be relevant if it is populated by just one molecule, which triggers reaction generation. It would be desirable to refine these methods to incorporate more stringent definitions of relevance such that the trigger for reaction generation can be tuned for efficiency. Finally, we wonder whether the method of StochSim could be made to work with more general rules or could be implemented in hardware (97, 98). A general conclusion we draw from our experiences with simulation to date is that, because there are trade-offs involved in all of the methods that have been developed, it is highly desirable for a software platform to provide access to multiple simulation approaches through the same interface.

Model reduction

Another way to address the issue of combinatorial complexity is through model reduction. Some researchers have considered ways to obtain a model that is smaller than one needed to represent all microscopic details, yet behaviorally equivalent to it (42, 87, 99101). For example, Borisov et al. (99) suggested lumping microscopic species of a model together into new variables to obtain an equivalent dynamical model in terms of coarse-grained variables. Recently, Conzelmann et al. (101) have developed a systematic method for obtaining reduced models in this way. Exact time courses of the macroscopic variables are obtained by solving a reduced set of differential equations that includes both the macroscopic variables and possibly a set of auxiliary (mesoscopic) variables. The extent of model compression depends on the degree of independence among the protein sites being modeled, because correlation among sites requires tracking of additional auxiliary variables. A number of factors, such as cooperative binding or enzyme-substrate relations within a complex, can give rise to coupling among sites. For example, the phosphorylation states of two sites in a receptor are correlated if binding of a cytosolic PTK to one site depends on phosphorylation of the site, and the PTK, once recruited to the receptor, catalyzes phosphorylation of the second site. The variable transformation approach has so far only been applied to the case of a single scaffold protein containing multiple sites of modification and binding. More work is needed to extend the approach to more complex situations. In addition, it would be desirable to develop a systematic method for developing approximate reduced models (for example, by dropping some of the auxiliary variables), which might be helpful when exact compression is not possible because of correlations between sites (100). The problem of model reduction through aggregation of variables is as difficult as it is necessary, in particular for systems that operate on several spatial and temporal scales.

Model checking

Model checking (102) refers to a suite of techniques deployed to prove that a model of a system behaves in a specified manner. The techniques originate in computer science and engineering, where they are also known as program verification. For example, an objective of program verification is to prove that a highly complex piece of software meets design requirements. A useful step in this direction is to construct a simplified program (a model) that preserves essential design characteristics, and then proving (or disproving) its compliance with specifications. This is typically done before the actual piece of software is built, precisely to avoid early design flaws. In the present biological context, the "simplified program that preserves essential design characteristics" is a rule-based model of a signal-transduction system. Various researchers have sought to apply the powerful tools developed in the context of program verification to biological models, particularly when they can be cast in the form of rules that are executed stochastically on the fly. Consider a volume in which molecular species collide randomly and react when an appropriate rule becomes applicable, much as in StochSim described above. Unlike the unique deterministic trajectories of ODEs, a stochastic simulation will only yield a particular history at each run. Even when many runs are gathered to perform a statistical analysis, observing a time series of concentrations (or molecule numbers) does not necessarily lead to an understanding of the model (and the system it is intended to elucidate). Moreover, a detailed time series may not be the most appropriate representation to match against empirical observations that are qualitative or expressed in terms of salient events. Here is where model checking becomes useful.

The model checking process requires two inputs: (i) a suitable representation of the signal-transduction model and (ii) a property whose truth within the model we wish to check. In the process, a rule-based reaction network is translated (automatically) into an automaton. An automaton is a graph whose nodes represent global states of the system (a global state is determined by the numbers of all molecular species) and whose edges represent transitions between states due to possible reactions. An automaton so defined represents, at once, all possible trajectories the system can take and may therefore be very large, even infinite. Yet, it does not have to be held in computer memory in its entirety. As with on-the-fly methods described previously, the automaton graph can be expanded when needed by computing the partial graph of accessible states (nodes) from the current state. As the nodes of the automaton are generated, they are traversed systematically (breadth- or depth-first); at the same time, the automaton is checked in a divide-and-conquer fashion (that is, by recursively breaking the problem down into simpler subproblems) to determine whether the given property holds. This procedure reflects a strategy similar to that implemented in programs that play chess. Most important, when a property fails to hold, the procedure will reveal why the property fails by exhibiting the transition that led to its violation. Model checking is not a simulation, but a clever exhaustive search of the state space of a model. Naturally, model checking has limitations with regard to the type of properties that can be effectively checked.

Model checking requires properties to be expressed as formulas in a logical framework, such as standard first-order logic (predicate calculus) augmented with operations that capture temporal relations between events ("before," "after"); temporal modalities ("sometimes," "always," "next time," "until"); constraints; and perhaps even structural properties (such as graph connectedness).

The application of model checking to biological systems is in its infancy. At present, biological applications constitute more an exploration of methods than breakthroughs in biological insight. An illustrative example is the analysis of Kohn’s map of the cell cycle (103). The properties checked for the Kohn map were static assertions, such as, "The activation of CDC25C is necessary to generate the CDK1-cyclin B complex," as well as dynamic ones, such as oscillations in the abundance of cyclin A (which translates into, "There is a path such that whenever cyclin A is present it eventually disappears, and whenever it is absent, it eventually appears"). These formulas do not push the boundaries of biological insight. Yet, they illustrate the potential of queries for consistency checks, behavioral comparisons across models, and compliance with empirical observations.

Rule-based systems may be translated into different types of automata that reflect various levels of resolution. The class of hybrid automata defines a particularly important level. A hybrid automaton represents a mixture of continuous dynamics and discrete events, as in a system that is well described by a differential equation model until a condition is reached that triggers a discrete event altering the dynamics. For example, the net rate of production of a protein may switch from negative to positive when the concentration of another protein falls below a threshold value. In the class of piecewise affine systems, the dynamical equations are linear with an offset. Such automata can serve as qualitative approximations of complicated nonlinear reaction systems. They effectively transform a nonlinear system into a linear one, but with a structure that can be switched. This type of model yields a representation of a reaction network that facilitates model checking. Batt et al. (104) provide an interesting and predictive application of model checking to a hybrid-automaton representation of the nutritional stress response in Escherichia coli. The queries have illuminated the role of mutual inhibition between the transcription factors FIS and CRP, which are involved in regulating metabolism. Mishra and co-workers have applied hybrid automata techniques and model checking to a stylized Delta-Notch signaling circuit (105). We expect to see increasing activity in this area.

Software for Rule-Based Modeling

General-purpose software is needed to enable rapid rule-based modeling of diverse systems. Problem-specific programs can be difficult to reproduce independently, and the time and effort required to write and debug such a program can be expensive. Also, modifications of rules may require reprogramming, which discourages the consideration of alternative hypotheses about mechanisms of signaling or extensions of a model. In recent years, various general-purpose software tools and corresponding methods based on rules for protein-protein interactions have been developed. Software capabilities and methodology are advancing quickly, and it seems likely that rule-based modeling will soon be a practical, mature, and accessible method of computational systems biology. Below, we discuss several specific software tools and end with a short summary of the lessons learned from the development of these tools.

StochSim 1.4

The salient feature of this tool (ftp://ftp.cds.caltech.edu/pub/dbray/) is the representation of proteins as computational objects or agents, which have states (73). Model specification is accomplished by using a wizard-like GUI, which helps a user create a set of formatted input files defining proteins, their states, individual reactions, and rules for generalized reactions that change the states of proteins. A protein is represented by a list of binary flags encoding the modification or binding states of its sites. In the state-change rules of StochSim, state flags are specified as 0, 1, or "?," which is a wildcard, and plus (+) and minus (–) symbols are used to direct the switching of flags from 0 to 1 and vice versa. A limitation of StochSim is its representation of complexes. A complex is either represented implicitly in terms of protein states, in which case the topology of a complex may be difficult to track, or it is represented explicitly, in which case the molecular composition must be declared manually beforehand in an input file.

An interesting feature of StochSim is its method for discrete-event Monte Carlo simulation of reaction kinetics. The method involves picking two agents and then checking to see whether they react. The method is approximate because time is incremented in fixed steps (69, 94). Thus, a check for accuracy, against the results of exact Monte Carlo methods (80, 81) or a StochSim run with a smaller time step, is necessary. The procedure may be slow because, for correct results, the time step must be small enough such that pairs of reactants most often do not react. However, as discussed above, the method avoids generating reactions, which can be an advantage essential for computability in some situations. Shimizu and Bray (94) discuss the scenario of a protein with n sites of modification and m binding partners. This protein has up to 2n modified forms and may participate in as many as m2n distinct bimolecular association reactions. For such cases and cases like that of Fig. 7, methods that avoid the cost of reaction generation, such as the StochSim method, may be essential for computability.

Moleculizer 1.0

As with StochSim, a set of input files is used to specify a model in Moleculizer 1.0 (http://www.molsci.org/~lok/moleculizer/) (62); these files are written in XML. Template input files are provided that correspond to a fixed set of rule types (reaction generators), which are coded as separate software modules. The templates are used to define individual rules and their parameters, such as rate constants. Although the types of rules available to a modeler are fixed, they provide sufficient flexibility to represent a wide array of protein-protein interactions. A feature of Moleculizer, not available in StochSim, is the ability to represent the topology of a protein complex explicitly in terms of pairwise connections of binding sites. Also, the connectivity of a complex can be considered in a rule definition.

As can StochSim, Moleculizer can be used to perform a discrete-event Monte Carlo simulation of reaction kinetics, but an exact method, Gillespie’s direct method (80, 81), is used. This method relies on a list of reactions, which Moleculizer generates on the fly during a simulation. When a species is first populated, rules are evaluated and new reactions involving this species are generated if necessary (that is, if they are not already generated and stored in memory). Given the parameters that govern a simulation of network dynamics, Moleculizer provides a principled means for identifying the relevant portion of the reaction network (that is, the populated species and reactions connected to these species). Moleculizer provides other simulation methods as well, such as a τ-leap method for stochastic simulations (106). A new simulation feature is "look ahead" rounds of reaction generation (107). If the number of look-ahead rounds is set to a large enough number, then Moleculizer can generate a network without simulating it. This method is then equivalent to the generate-first method mentioned earlier.

BioNetGen 2.0 and BNGL

Earlier versions of this tool (1.0 and 1.1), BioNetGen (http://cellsignaling.lanl.gov/bionetgen/), which are based on the use of term or string rewriting rules to represent protein-protein interactions, are discussed elsewhere (60, 88). We will focus on version 2.0, which is based on the use of graphs to represent proteins and protein complexes and the use of graph rewriting rules to represent protein-protein interactions (50, 61). For historical perspective and a review of applications of graph transformation in molecular biology, see Rosselló and Valiente (108). Graph rewriting rules are generalizations of term rewriting rules. Unlike StochSim and Moleculizer, BioNetGen interprets a formal model-specification language, called the BioNetGen Language (BNGL). An advantage of introducing a language is that model specification becomes independent of software implementations and therefore portable. In subsequent sections, we will discuss several languages for rule-based modeling that have been proposed. Interesting languages that we will not discuss include Bioglyphics (http://www.bioglyphics.org/) (109), Dynamical Grammars (http://www.arxiv.org/abs/cs.AI/0511073), and "little b" (http://www.littleb.org/).

In BNGL, chemical species are represented by using graphs, which are encoded as structured strings. The basic unit of representation is a molecule string, which may contain component strings enclosed in parentheses. In a model specification, molecule strings actually serve several purposes, one of which is definition of the types of molecules included in a model. An example of a string used for this purpose is EGFR(ECD,aa1092~Y~pY), which defines a molecule called EGFR containing components called ECD (ectodomain) and aa1092. The string also indicates, by the prefix "~" for state labels, that the latter component has two possible internal states called Y and pY. This definition corresponds to one type of molecule represented in the reaction network of Fig. 5B, which also shows how molecule strings are concatenated to represent complexes.

In BNGL, rules for protein-protein interactions are represented by using text-encoded graph rewriting rules, which contain graph matching patterns. Patterns composed of (incomplete) molecule strings explicitly indicate the features required of reactants and implicitly indicate, through omission, the features that are irrelevant for a reaction. An example of a rule is Grb2(SH2)+EGFR(aa1092~pY) -> Grb2(SH2!1).EGFR(aa1092~pY!1). This rule indicates that the adaptor protein Grb2 and EGFR can associate if the SH2 domain of Grb2 is free and residue 1092 in EGFR is free and phosphorylated. By omission, the rule also indicates that association of Grb2 and EGFR is independent of the bound state of the ECD of EGFR. The "!" character is a prefix for bond labels, which indicate how components of proteins are connected in a complex. The bond labels in this rule indicate that the SH2 domain of Grb2 binds pY1092 in EGFR. The rule for Grb2-EGFR association and others are illustrated in Fig. 8 by using the conventions of Faeder et al. (50).

Fig. 8.

Representation of protein-protein interactions by using graphical reaction rules. The reactions illustrated in Figs. 4 and 5 can be generated by rules for the underlying protein-protein interactions, which are illustrated here with the conventions of Faeder et al. (50) and Blinov et al. (61). The rules define generalized reactions for 1, autophosphorylation of EGFR; 2, dephosphorylation of EGFR mediated by a phosphatase assumed to be present in excess; 3, association of Grb2 and EGFR, which depends on phosphorylation of Y1092; and 4, dissociation of Grb2 and EGFR. Below each graph-rewriting rule, a corresponding definition in BNGL is given.

A model is specified as a set of graph-rewriting rules, which are associated with rate laws, and a set of seed-species graphs to which the rules are initially applied. Using the iterative algorithm for processing rules outlined above (60, 61), BioNetGen can generate a parameterized reaction network (meaning that the reactions are assigned rate laws) without performing a simulation of the network dynamics. Once a network has been generated, it can serve as the basis for ODE-based or stochastic simulations (60). Pregenerating a network and then performing an ODE-based simulation, if possible, may be more efficient than the stochastic simulation methods of StochSim and Moleculizer, because ODE-based calculations are usually less expensive than Monte Carlo calculations for a "small" model (63). If the size of a network is too large to permit ODE-based calculations, then BioNetGen also provides the capability to generate a network on the fly during a simulation, as Moleculizer does.

One problem that BioNetGen attempts to solve is how to automatically adjust the rate law of a rule to account for contextual differences among reactions in the class of reactions defined by the rule. Contextual effects on rate laws can be specified explicitly in an expanded set of rules, but this approach may be laborious, and the problem arises often enough that general automatic solutions merit pursuit. The contextual factors that may necessitate modifications of a rate law for a class of reactions are numerous: (i) Collision frequency can vary (compare A+A→ and A+B→ reactions); (ii) reaction path degeneracy (that is, the number of distinct reaction paths from reactants to products) can vary, giving rise to different statistical factors (compare A.A→A.B and A→B reactions); (iii) the turnover frequency of an enzymatic reaction may depend on the numbers of enzymes and substrates in a complex; (iv) a factor equal to a volume ratio may arise for reactions in different compartments; (v) diffusion-limited reactions may be affected by masses (or equivalently, diffusivities) of reactants (60, 62); and so on. BioNetGen handles statistical factors, for example, by considering the symmetries of graphs in reaction rules and the reactions generated by rules (61). Assigning correct statistical factors is essential for a self-consistent model. Figure 6C illustrates reactions of the same class that have different statistical factors.

BioSPI and other tools of process algebra

BioSPI interprets a model-specification language, a variant of the π-calculus, and provides the capability to perform a stochastic simulation (http://www.wisdom.weizmann.ac.il/~biopsi/) (110). Similar tools, which have also been used to specify models for biological systems, include SPiM (111) and the PEPA Workbench (112). The π-calculus is a general and minimalist model-specification language originally designed to capture essential features of concurrent and distributed systems in computer science (113). It is one of many process calculi (algebras) studied in this field (114, 115). These languages are axiomatic, which allows properties of a model specification to be formally proven and facilitates certain types of model checking (for example, the observational equivalence of two model specifications can be determined).

Use of π-calculus to model protein-protein interactions was suggested by Regev et al. (116), who provided, loosely speaking, the following translations of biochemistry to π. Proteins are mobile processes (meaning agents that exchange messages that can affect their behavior); protein sites are communication channels (ports through which messages are passed from senders to receivers); and protein-protein interactions are communications. The channels of a system and the messages that can be sent and received along these channels are essentially rules for protein-protein interactions. The proteins in a complex are linked by a backbone (that is, by colocation in a communication compartment), and the topology of a complex is indicated by pair-wise communications. In this framework, signal transduction can be viewed as asynchronous concurrent computation and studied by using pertinent methods from computer science (117119).

Because π-calculus is a minimalist language and some of its notational features are irrelevant and even inappropriate for biological applications (for example, communication has directionality, whereas physical association does not), researchers have sought to develop a more congruent higher-level language for modeling biochemical systems that retains the mathematical formality of π. One proposal is κ-calculus (120). In this language, a protein is represented as a collection of sites, which can be visualized as a box with nodes, representing sites, on the border. Coincidentally, the appearance of such a box is much like that of a state node in a process diagram (49). Also, protein complexes are represented as graphs, in which edges connect interacting sites of proteins. For simulation purposes, the κ-calculus can be reduced to π-calculus.

Although process algebra has been used to specify biological processes in some detail, such as cell adhesion (121, 122), it is unclear at this time whether methods of process algebra can reach fundamentally beyond the level of understanding provided by ODEs and stochastic simulations. Process algebra may, however, provide a formal foundation for rule-based modeling that enables principled mathematical reasoning about system behavior and its dependency on interaction capabilities of system components.

Pathway Logic Assistant

Pathway Logic Assistant provides an interface to "Pathway Logic" models of signal-transduction systems (http://www.csl.sri.com/users/clt/PLweb/pl.html) (123) defined by using the Maude specification language (http://maude.cs.uiuc.edu/). Pathway Logic models can formally represent protein-protein interactions at different levels of resolution, ranging from details about abstract protein states (124) to details about functional domains and binding sites (125, 126). Eker et al. (127) proposed conventions for using Maude to specify term rewriting rules for protein-protein interactions. Later, conventions for specifying graphs and graph rewriting rules were introduced (126), which allow the topology of a protein complex to be explicitly considered. The models obtained with Maude are essentially unparameterized chemical reaction networks, meaning that rate laws are not specified for reactions. With the addition of rate information, a reaction network can be converted to ODEs, for example, but this capability is not native to Pathway Logic Assistant. Instead, this tool, through the formalism of Petri nets, enables a qualitative analysis of a reaction network, such as visualization and interactive exploration of the network. Also, a modeler can specify a formula in linear temporal logic (LTL) that defines a putative property, and then evaluate this formula in a model-checking query to determine whether the property holds for the system. Properties that can be specified in LTL are qualitative and relate to stable states of paths. For example, a query can be specified to determine whether a particular protein complex is potentially present in a signal-transduction system and to find a sequence of reactions leading to it. Of course, the value of qualitative analysis is limited, as the dynamics of protein-protein interactions are important for system behavior, for example, dynamics influence which protein complexes are populated during signaling (42). Still, it will be interesting to see what biological insights can be obtained from qualitative analysis alone, because quantitative rate parameters for protein-protein interactions are often unavailable or uncertain.

BIOCHAM 2.4

This tool, BIOCHAM 2.4, interprets a model-specification language in which term-rewriting rules are used to represent protein-protein interactions (http://contraintes.inria.fr/BIOCHAM/) (128). Structured strings are used to represent chemical species. Rules, which may contain string-matching patterns, indicate how strings representing reactants are rewritten to obtain strings representing products. As is also true for other term-rewriting approaches (60, 88, 127), the topology of a protein complex cannot be explicitly considered in a rule, which can be a limitation. Unlike the π-calculus or the general Maude programming language, but like BNGL, the κ-calculus, and Pathway Logic, the BIOCHAM language has been designed specifically for the purpose of modeling biological systems. The language is simple by design but still expressive enough to specify meaningful models. For example, a model has been specified for the cell cycle–control system described by Kohn (32) (103).

Like BioNetGen, BIOCHAM can process a set of rules associated with rate laws to generate a parameterized reaction network and the corresponding ODEs, and then perform and analyze ODE-based simulations. In addition, like Pathway Logic Assistant, BIOCHAM can evaluate model-checking queries, which can be specified by using Computation Tree Logic (CTL) as well as LTL. CTL and LTL are closely related; however, CTL queries can be specified that cannot be expressed in LTL and vice versa (129). Model checking in the framework of BIOCHAM is provided through an interface to the NuSMV model checker (http://nusmv.irst.itc.it/). BIOCHAM essentially converts a chemical reaction network to a NuSMV input file. Thus, model-checking capabilities are likewise accessible, in principle, from any rule-based modeling tool that generates a reaction network. An intriguing feature of BIOCHAM is a machine-learning system for modifying rules or parameters to obtain consistency with behavioral constraints specified in CTL or LTL (130).

Lessons learned

Many of the software tools discussed above are based on a formal model-specification language, which has the desirable property of being application-independent. A standard language has yet to emerge, but the idea of using rewriting rules to represent protein-protein interactions has been suggested multiple times. Likewise, in several approaches, graphs are used to represent the contact map of protein complexes. One needs to consider the configuration of a complex when connectivity affects reactivity, as exemplified in Fig. 9. Chemical reaction network models derived from rules tend to be large. It may therefore be nontrivial to confirm that the model produced by a set of rules is the one intended. A high standard of software reliability, including careful consideration of physical chemistry, is essential. To cope with the complexity of models, we need software tools that assist in reasoning about a model and that provide means for automatically monitoring system properties at a quantitative and qualitative level. For example, BioNetGen provides output rules for calculating properties of sets of species, such as a sum of concentrations of species containing a particular protein. Finally, rules can be used in different ways to study a system. The different approaches now available appear to be complementary, and new approaches may be useful.

Fig. 9.

Two protein complexes with identical composition but different connectivity. (A) A chain of bivalent ligands, represented as ovals, and bivalent receptors, represented as boxes with rounded corners. Circles represent ligand-binding and receptor-binding sites, and lines connect sites that are bound to each other. (B) The ring formed through closure of the chain. The reactivities of the chain and ring differ. The chain can close through intramolecular binding of ligand and receptor sites at the chain ends or elongate by binding either a ligand or receptor with a free site. In contrast, the ring cannot react with ligands or receptors. It can only break apart through the opening of one of its ligand-receptor bonds. Thus, tracking the connectivity of proteins in a complex can be important for modeling protein-protein interactions.

Extension of SBML for Rules

The Systems Biology Markup Language (SBML) (131), which is based on XML, is a popular standardized format for the electronic storage, exchange, and reuse of mathematical models of biochemical systems. A number of software packages are now available that import or export models specified in SBML (http://sbml.org/), and a public repository for annotated SBML-encoded models has been established (132, 133). However, SBML, like most SBML-aware software, is based on the assumption that a model can be specified adequately in terms of a reaction scheme, which is not likely to hold for a model of a signal-transduction system because of the context-sensitivity and combinatorial complexity of protein-protein interactions. Versions of SBML up through Level 2 Version 2, which is presently being finalized, do not provide direct support for compactly representing multiple protein complexes or multiple states of proteins. Likewise, there is no support for defining rules for protein-protein interactions or other generalized reactions. To specify a model in strict SBML, one must enumerate all of the individual chemical species and reactions included in the model. Rules used to derive these species and reactions could actually be included in a SBML file as annotation, but such annotation would be nonstandard. Thus, for interpretability, SBML can only be used to represent a rule-based model as a list of rule-derived reactions, which means that SBML encodings of rule-based models tend to be exceedingly verbose and difficult to comprehend or modify (50, 88). In light of this limitation, several extensions of SBML that incorporate rules have been proposed (1342137). It is anticipated that such extensions will be available with the SBML update to Level 3 (138, 139).

SBML Level 3 is anticipated to introduce a modular language extension capability, which will allow different language features to be added to a common language core, which will be based on SBML Level 2. Language extensions will add syntax and semantics for software tools that share a common theme, such as the tools for rule-based modeling reviewed here. Models specified in SBML Level 3 will include a declaration of the feature sets (beyond the core Level 3 features) required for proper interpretation. The presence of a feature tag will inform a software tool reading the model that the model uses that particular feature and will permit the tool to quit gracefully if it does not have the necessary interpretive capabilities.

The latest proposed Level 3 extensions for rule-based modeling (136, 137), like several of the software tools reviewed here, are based on the concept of using graphs to represent proteins and graph-rewriting rules to represent protein-protein interactions. As discussed above, such rules are expressive enough to convey information about the contextual constraints on a protein-protein interaction and the parts of proteins responsible for and affected by an interaction. The same is not true of the standardized exchange format widely used today to store information about protein-protein interactions in electronic databases (140). Thus, extension of SBML to incorporate rules for protein-protein interactions has the potential to improve not only our ability to model the dynamics of protein-protein interactions, but also our ability to archive basic knowledge about them. Rules for large numbers of protein interactions might be specified on the basis of high-throughput assays (141), data extracted with literature-mining tools (142), or statistical analysis of large data sets (143). An important next step in the development of SBML is software for converting a putative SBML Level 3 specification of a rule-based model into Level 2 form.

Visualization of Rules

A mathematical model provides at least two benefits. Its implications, no matter how hidden from intuition, can usually be revealed through computer-aided analysis. A model also constitutes a precise statement of a modeler’s understanding and assumptions about a system. Thus, it is a powerful tool for summarizing knowledge about a system, although a model may be difficult to comprehend. If it is represented as a list of equations or reactions, the formalism of the model obscures the underlying grammar of protein interactions that generates the chemical species and reactions included in the model. A widely used and more comprehensible way of representing knowledge about a signal-transduction system is a diagrammatic interaction map in which proteins and their interactions (or the functional consequences of these interactions) are indicated by labeled cartoons and arrows. Most maps are ambiguous and lack a mathematical interpretation. However, formal conventions have been proposed for drawing interaction maps such that they have precise meanings (144, 145). An example of a simple map drawn according to these conventions is provided in Fig. 10. It illustrates interactions of EGFR and Grb2. No automatic procedures for translating this type of map into a mathematical model are yet available (145), although software intended to provide such a capability has been developed on the basis of similar diagrammatic formalisms (146149). Nevertheless, automatically translating a map like that of Fig. 10 into a mathematical model remains an open problem when combinatorial complexity is present.

Fig. 10.

A diagrammatic interaction map drawn according to the conventions of Kohn et al. (145). This map illustrates protein-protein interactions underlying the reaction scheme illustrated in Figs. 4 and 5.

An alternative to representing a model as a list of reactions or equations is to express a model as a set of rules, which resemble reactions in many ways but are highly compressive. If the rules are graph-based, then they can also be visualized, and therefore, they provide a new means of visualizing knowledge of protein-protein interactions. An example of rules that represent interactions considered in Fig. 10 is shown in Fig. 8. Similar illustrations would be obtained by using any of the other graph-based model specification approaches discussed above. An advantage of visual rules over interaction maps is that the rules can be evaluated automatically to obtain a mathematical model. When the rules of Fig. 8 are applied to a seed set of species composed of unphosphorylated dimeric EGFR and cytosolic Grb2, they generate the reaction network illustrated in Figs. 4 and 5. Rules can also be made more readable by enhancing them with graphical notation developed for annotating protein-protein interactions (49, 53) (http://www.sbgn.org/).

Outlook

New technologies are enabling an increasingly quantitative characterization of signal transduction events at the level of protein sites. The resultant data streams challenge our capacity to interpret on the basis of intuition alone. Understanding the dynamics of signaling systems hinges on appropriate mathematical methods and models. A mathematical model tells a story that converts facts into putative knowledge by animating facts with dynamical behavior, making assumptions explicit, and enabling formal manipulations whose correctness can be assessed. Presumed knowledge generated in this way is more useful, predictive, and empirically testable. A model is a bookkeeping device that helps us to assess whether we know enough about a system.

The multiplicity of sites at which a signaling protein can be modified and a protein’s potential for assembling into diverse complexes result in a combinatorial explosion of molecular states that stymies conventional modeling approaches. Models of signal-transduction therefore require software tools grounded in a mathematical framework that represents proteins as grammatical structures with specific context-dependent actions. Well-known challenges to modeling include the problem of compositionality, that is, the need for a framework for growing models in a cumulative fashion by plugging together independently developed models of subsystems. A different compositionality problem arises from the need to juxtapose systems that operate at different spatial and temporal scales, or at vastly different molecular densities. Some of these challenges are beginning to be addressed with the introduction of rules for protein-protein interactions and the development of methods for rule-based modeling, like those reviewed here.

Rule-based modeling is a fundamental departure from traditional dynamical systems based on ordinary differential equations (117). Dynamical systems require all possible interactions to be explicitly specified ahead of time. In contrast, rule-based approaches can be used to spawn reactions and molecular states only if the dynamics of the system generates the appropriate context (on-the-fly operation). Unlike differential equations, rule-based approaches represent molecules with an internal structure that encodes potential behavior that may never be triggered in a given system but that could be triggered if the system were changed. This aspect of rules is what makes rule-based approaches extensible on the go, much like dropping a new molecule into a reaction mixture alters the intrinsic possibilities of that mixture. (Note that a modeler should avoid specifying rules that are overly general, because such rules may generate unintended reactions when a model is extended.) Rule-based approaches do not jettison differential equations. In their simplest mode of operation, they enable the automatic generation of ODE systems that are too large to be written down by hand. Although rules have been used to formalize interactions among objects of various types for over two decades (150), they only recently have been applied to signal-transduction systems.

General-purpose tools for rule-based modeling would benefit from the adoption of a standard specification language for defining models in software-independent ways. SBML now serves as such a standard, but it conceives of a model as an exhaustive list of reactions. Because this view fails to recognize the linguistic character of complex proteins and protein complexes, it becomes quickly impractical. In response to this shortcoming, extensions of SBML consider the adoption of rules for representing protein interactions. Much remains to be done to devise languages capable of expressing more fine-grained aspects of protein structure and contextual constraints, including those of a geometric kind. Early examples of capturing spatial effects in a rule-based fashion were phenomenological models of morphogenesis (151) and models of virus capsid assembly (152, 153).

There is a clear need for making the modeling process more accessible to nonspecialists. Likewise, it is highly desirable to invite exploration of the possible, such as neighboring signaling networks that are accessible from actual ones by a few shuffling operations. Deriving mechanistic network models from empirical observations, evolving or designing networks to behave in specified ways (154), and predicting the consequences of interventions are all tasks that depend on efficient ways of exploring alternative network models. These objectives require the partial automation of the modeling process. Such automation would be critically enabled by the adoption of a formal language to express properties of proteins and their interactions or properties of whole systems, as extracted from models (via model checking) or empirical observations. This language could be used to annotate the behavior of a protein or a small system and could be stored alongside other information in databases. A statement in a formal language has a well-defined semantics, that is, a clear and precise interpretation, which could perhaps be quickly grasped through visualization. At the same time, such a statement would lead a double life as a codelet, a fragment of computer code, which can be downloaded and used as a component in a model of any system in which the corresponding protein or protein-protein interaction plays a role. Formal statements about system behavior (originating in empirical observations or trusted models) could be used as test-suites for new models or refinements. We believe the progress in rule-based modeling summarized in this review may lead ultimately to a common language for systems biology, much like the four-letter code for DNA sequences provides a common language for bioinformatics.

References

  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.
  37. 37.
  38. 38.
  39. 39.
  40. 40.
  41. 41.
  42. 42.
  43. 43.
  44. 44.
  45. 45.
  46. 46.
  47. 47.
  48. 48.
  49. 49.
  50. 50.
  51. 51.
  52. 52.
  53. 53.
  54. 54.
  55. 55.
  56. 56.
  57. 57.
  58. 58.
  59. 59.
  60. 60.
  61. 61.
  62. 62.
  63. 63.
  64. 64.
  65. 65.
  66. 66.
  67. 67.
  68. 68.
  69. 69.
  70. 70.
  71. 71.
  72. 72.
  73. 73.
  74. 74.
  75. 75.
  76. 76.
  77. 77.
  78. 78.
  79. 79.
  80. 80.
  81. 81.
  82. 82.
  83. 83.
  84. 84.
  85. 85.
  86. 86.
  87. 87.
  88. 88.
  89. 89.
  90. 90.
  91. 91.
  92. 92.
  93. 93.
  94. 94.
  95. 95.
  96. 96.
  97. 97.
  98. 98.
  99. 99.
  100. 100.
  101. 101.
  102. 102.
  103. 103.
  104. 104.
  105. 105.
  106. 106.
  107. 107.
  108. 108.
  109. 109.
  110. 110.
  111. 111.
  112. 112.
  113. 113.
  114. 114.
  115. 115.
  116. 116.
  117. 117.
  118. 118.
  119. 119.
  120. 120.
  121. 121.
  122. 122.
  123. 123.
  124. 124.
  125. 125.
  126. 126.
  127. 127.
  128. 128.
  129. 129.
  130. 130.
  131. 131.
  132. 132.
  133. 133.
  134. 134.
  135. 135.
  136. 136.
  137. 137.
  138. 138.
  139. 139.
  140. 140.
  141. 141.
  142. 142.
  143. 143.
  144. 144.
  145. 145.
  146. 146.
  147. 147.
  148. 148.
  149. 149.
  150. 150.
  151. 151.
  152. 152.
  153. 153.
  154. 154.
  155. 155.
View Abstract

Navigate This Article