Bayesian Network Approach to Cell Signaling Pathway Modeling

See allHide authors and affiliations

Science's STKE  03 Sep 2002:
Vol. 2002, Issue 148, pp. pe38
DOI: 10.1126/stke.2002.148.pe38


The modeling of cellular signaling pathways is an emerging field. Sachs et al. illustrate the application of Bayesian networks to an example cellular pathway involving the activation of focal adhesion kinase (FAK) and extracellular signal-regulated kinase (ERK) in response to fibronectin binding to an integrin. They describe how to use the analysis to select from among proposed models, formulate hypotheses regarding component interactions, and uncover potential dynamic changes in the interactions between these components. Although the data sets currently available for this example problem are too small to definitively point to a particular model, the approach and results provide a glimpse into the power that these methods will achieve once the technology for obtaining the necessary data becomes readily available.


Advances in the understanding and prediction of how complex biomolecular signaling pathways operate in regulating cell functions will increasingly depend on the emerging spectrum of multivariable, high-throughput, quantitative, integrative methodologies gathered under the term "systems biology" (1). This field encompasses interrelated developments in experimental measurement and manipulation that allow the collection of larger, more complete data sets, coupled with computational analysis of the data to find patterns and formulate new hypotheses. Modeling approaches for signaling pathways can range from Boolean algebraic (2) to differential equation (3) frameworks. It can be envisioned that a hierarchical matrix of "high-level" (showing which components influence which other components) and "low-level" (including details such as the kinetics and stoichiometries of the interactions) models might usefully be connected for comprehensive, yet efficient, analysis purposes (4). An appealing approach falling between Boolean algebra and differential equation frameworks on this modeling spectrum is the graphical modeling methodology known as Bayesian network analysis (5). This methodology has recently proven to be very useful for hypothesis generation and testing in gene expression regulation problems (6-9). In this article, we describe the novel application of graphical modeling to model discovery and model selection, an important preliminary step in hypothesis testing, in protein signaling networks. Our purpose here is to outline the procedure for analysis of this kind of problem and illustrate it with a very simple preliminary example.

We have begun to explore the application of Bayesian networks to represent multivariate probabilistic relationships among variables of interest (in this context, phosphorylated receptors, activated enzymes, and other biomolecules). Bayesian networks can uncover statistical relationships among variables from a set of data. Revealed relationships are not restricted to pairwise or linear functions and, in fact, can be arbitrarily complex, so there is no limitation on the number of factors involved in each dependence relationship or on the form that this dependence must take. One example is an "exclusive or" (XOR) dependence relationship. If an XOR relationship describes the dependence of variable C on variables A and B, C will be in the active state when A is on and B is off, and when A is off and B is on, but not when A and B are both on or when both are off. This slightly more complex instance of combinatorial regulation comprises a nonlinear relationship between the variables. If such a relationship, or even one of far greater complexity, does in fact exist among the considered variables, the Bayesian network approach will be able to uncover it, given sufficient data. Because statistical relationships may imply a physical or functional connection, we can use Bayesian networks to refine existing knowledge or uncover potential relationships in signaling pathways.

In a Bayesian network, a directed acyclic graph structure (DAG) summarizes dependence relationships among variables: Graph nodes (the geometric shapes) depict variables, and edges (the lines) represent dependencies among variables (more precisely, the lack of an edge indicates conditional independence) (Fig. 1) (5, 10). The DAG enables a biologically interpretable model structure, although dependence relationships must be interpreted with care, as described below. Graph structures depicting interactions among biomolecules can be scored against data to determine their relative probability. The Bayesian score of a model S is the conditional probability that the model S is the correct one, given specific data D (Fig. 2). Thus, when more than one model is scored using the same data, it is possible to judge which model best explains the observed data. A more detailed description of the Bayesian scoring metric is given elsewhere (5, 10).

Fig. 1.

A DAG showing dependence relationships among variables A, B, and C. C is dependent on B, and B is dependent on A. Variables A and C are conditionally independent of each other. B is the child of parent node A, and C is the child of parent node B.

Fig. 2.

Equations that define the Bayesian Score (Eq. 1). The Bayesian score of a graph is the relative probability of that graph structure (S), given the available data (D). Eq. 2 comes from application of Bayes' rule, and Eq. 3 defines the P(D|S) term. P(D|S) is the likelihood; P(S) is the prior, which incorporates any bias that the experimenter has for a particular structure(s) due to prior knowledge; and c is the constant term, which cancels in the comparison between models. P(S|D) is termed the posterior, because it is determined after observation of the data. θi is the parameter vector for the conditional probability distribution over variable I, conditioned on its parent variables. The likelihood term is computed as an average, with a distribution over the parameters of each local conditional probability distribution. Given a set of assumptions, including discrete valued variables, the multivariate integral can be computed analytically.

Because the Bayesian scoring metric determines the relative probability of each model structure, it can be applied to rank competing models according to their ability to explain available data. This is a form of model selection in which the data set defies interpretation through classic application of biological intuition and instead necessitates a computational tool to determine which hypothesis the data support best. In the case of Bayesian networks, that selected hypothesis is simply the model with the highest relative probability, given the data. When we extend this approach to model discovery, however, we examine features common to the set of highest scoring models (rather than the single highest scoring model) in order to avoid selecting one particular high-scoring model when others may score nearly as well.

Model Selection

Model selection can be an important stage in hypothesis testing, when a single model out of a number of equally favored candidates must be singled out for wet-lab verification--in particular, when parallel verification of all candidate models is infeasible or prohibitively expensive. As a simple preliminary example, we illustrate here an attempt at model selection analysis within the mitogen-activated protein kinase (MAPK) cascade, a ubiquitous pathway vital for such processes as the survival, proliferation, differentiation, and migration of eukaryotic cells.

In particular, we address the activation of focal adhesion kinase (FAK) and extracellular signal-regulated kinase (ERK) that results from the interaction between the extracellular matrix protein fibronectin (fn) and the integrin α5β1. There has been literature controversy about the pathway(s) relating FAK and ERK activation, in terms of the relative directions of influence of FAK activation on ERK activation and vice versa; for example, see (11-15). We consider several possible models of the signaling events. To score them, we take advantage of an existing, relatively large-scale, quantitative, dynamic data set of this system (16). We emphasize here that, to date, efforts to generate quantitative dynamic data on protein signaling pathway component activities in a systematic multivariable mode are exceedingly scarce in comparison to gene expression (and protein level) studies. Asthagiri et al. cultured CHO-B2 cells, transfected with human α5 integrin, on fn-coated plates. Two levels of integrin expression and five concentrations of fn were employed. A transient time course of ERK2 activity measurements was made for each of the 10 possible fn-integrin combinations, using a high-throughput kinase activity assay. FAK activation time courses were measured by Western blotting. We use two metrics (data sets) for FAK and ERK2 from Asthagiri et al.: an overall activation level and the initial rate of activation under the 10 fn-integrin combinations (Table 1).

We pose four candidate models (Fig. 3). It is important to keep in mind that links indicate a statistical dependence of relative influence, but they do not necessarily imply a direct functional or physical connection (17). For instance, a link from variable A to B may imply that A activates B, but it may also exist if an additional, unobserved variable regulates both A and B, resulting in a statistical dependence between them. On the other hand, a functional connection will generally result in statistical dependence, and some level of prior knowledge in the domain may help us decipher which connections are likely to be causal rather than merely statistical.

Fig. 3.

Candidate and control models. "Cue" represents the signal from the interaction betwen fn and integrin, F is FAK, and E is ERK2. M0 is the control model. An additional model identical to M4, but with the edge from F to E reversed, is not represented because it is in the same equivalence class as M4 and will, therefore, always score the same.

Model 1 (M1) shows a link from fn and integrin to FAK, and one from FAK to ERK2. Such a dependence structure is expected if fn and integrin activate (activate is used to indicate regulation in the general sense, and may include repression or combinatorial regulation) FAK, which activates ERK2. Model 2 (M2) is consistent with a pathway in which fn and integrin activate ERK2, which activates FAK. Model 3 (M3) shows ERK2 and FAK both linked to fn and integrin but conditionally independent of each other. Finally, Model 4 (M4) is a mixture of the previous models and has no conditional independencies. Our goal was to score the models against data from Asthagiri et al. to discover which model might be best supported by these data.

In addition to the four candidate models, we also consider the independence model, M0, as a control. Because some dependence is known to exist between fn-integrin signaling and FAK and ERK2 activity, this model lacks at least one dependence relationship, so it is intentionally designed to score poorly. A total of 8 measurements of initial activation rate and 10 measurements of overall activation for FAK and ERK2 were used. Data were grouped to two or three levels using a discretization method that strives to preserve mutual information between pairs of variables while reducing the number of discretization levels (6). Resulting model score comparisons are presented (Table 2). M0, the control, scores poorly, as anticipated. Of the four hypothesized structures, data set I (initial activation rate) best supports M1, which scores ~3 times higher than M2 and ~4 times higher than M3 and M4. Data set II (overall activation level) supports M4 ~100 times better than M1, ~70 times better than M2, and ~10 times better than M3. These differences indicate to what degree the high-scoring model explains the data set better than the other models. These results support distinct dependence structures for the processes resulting in initial activation rates and overall activation of FAK and ERK2.

Although M1 and M4 emerged as the most likely model structures of the candidates tested under data sets I and II, respectively, they are superior in likelihood to the remaining models by only small or moderate margins (large or decisive margins may be on the order of 1000-fold or more). Therefore, based on these results, we cannot select M1 or M4 with high degrees of confidence. A useful conclusion is that we will need a larger data set, acquired through a similar systematic, quantitative, dynamic, and multivariable approach.

Hypothesis Discovery

We can extend our use of the Bayesian scoring metric to search for high-scoring graphs among all possible model structures. Although we cannot find the single highest scoring model (because this problem is computationally infeasible), we can look for features common to high-scoring models and consider them as hypothesized dependencies, automatically generating hypotheses of links, subpathways, and networks of biomolecules of interest. In contrast to model selection, this approach can be used even in a very poorly characterized domain where few or no hypotheses of network structure exist.

To search for high-scoring graphs, we pick at random an edge to add, delete, or reverse; score the new graph; and retain the change if the score improves. To avoid local minima, we also incorporate changes that reduce the score with some small probability. Finally, to minimize overfitting to a particular high-scoring model, we consider features common to a set of highest scoring models.

We searched for high-scoring dynamic graphs using time series data from Asthagiri et al. We used a dynamic Bayesian network graph expanded in time, in such a way that each time point of each variable is represented as a node (18). This representation confers a number of advantages, including time resolution of dependencies. The search is constrained to forbid dependence of early time points on later ones and of the cue node on downstream signals. We sample from the high-scoring region of the model space; thus, the search algorithm will find a subset of the highest scoring models. (Results shown here represent 200 runs of 1000 or more iterations each. The high-scoring graph in each run was typically found by approximately iteration 500.) The resulting graph comprises a weighted average of high-scoring models visited over 200 runs (Fig. 4). Darker arcs represent edges with higher posterior probability (0.86 to 1); lighter arcs indicate a posterior probability of 0.50 to 0.85.

Fig. 4.

Features common to high-scoring graphs. The model presented comprises a weighted average of high-scoring graphs from 200 runs of the search algorithm. "Cue" represents the signal from fn and integrin, E is ERK2, and F is FAK. Subscripts indicate time in minutes. Thinner arrows are features with a posterior probability of 0.5 to 0.85; thicker arrows represent consensus arcs or arcs with posterior probability >0.85. Arcs 1 through 10 are numbered for convenience.

The data appear to support a dependence between the signaling cue (the signal generated by fn binding integrin) and FAK levels at the early time point, as well as ERK levels at 5, 10, and 15 min (Fig. 3, arcs 1 through 4). An arc connects later time points with their respective predecessor(s) for ERK nodes (Fig. 3, arcs 8 through 10). The graph suggests an interesting dynamic between FAK and ERK: A dependence is seen between ERK at 5 and 10 min and FAK at 90 min, and between FAK at 7.5 min and ERK at 15 min. Thus, the model predicts a bidirectional dependence between the two molecules. This cyclic relationship could not be elucidated with static Bayesian networks, which do not permit cycles. It is interesting to appreciate that the apparent complexity of these dynamic ERK-FAK interactions is quite likely responsible for the difficulty in determining clear "upstream" versus "downstream" influence relationships by means of standard molecular cell biology methods.


Given a set of data, Bayesian networks can determine the relative probability of statistical dependence models of arbitrary complexity. For this and other reasons, they are well suited for analyzing data in a biological domain. We have applied them here as a tool for performing hypothesis selection and formulation, in the context of a simple preliminary problem in the realm of cell signaling pathways for illustration purposes. This problem is the influence relationship between FAK and ERK after fn-mediated integrin activation, and we used two dynamic data sets incorporating multiple levels of integrin expression and concentrations of fn (16).

Of the static graphs considered for model selection analysis (Fig. 3), data set I (initial activation rate) best supports M1, in which FAK is dependent on the fn-integrin cue and ERK is dependent on FAK, but conditionally independent of the fn-integrin cue. Data set II (activation level) supports M4, in which there are no conditional independencies (that is, M4 indicates dependence among all the variables). This result implies that the dependence structure leading to initial activation rate is distinct from that resulting in overall activation levels of FAK and ERK. We have no a priori reason to doubt this result, which could in fact implicate an interesting mechanism for short-term versus long-term effects of fn-integrin signaling. However, in this case we cannot determine with certainty that M1 and M4 are the most probable dependence structures for initial and overall FAK and ERK activation, respectively, for a number of reasons. All the noncontrol models are defeated by a small or moderate margin, which may be attributed to noise. Moreover, the data sets are very small in probabilistic terms, in particular in comparison to the number of model parameters. A larger data set (five times larger at least) would be expected to yield larger differences between scored models. Thus, we conclude that we currently have insufficient data to select M1 or M4 definitively.

The use of dynamic Bayesian networks enables the elucidation of regulatory loops and dynamic dependencies not evident in static networks. The data used to search for dynamic graphs are a measurement of the level of active FAK and ERK at each time point, in contrast to the rate of initial activation and overall activation levels used for scoring the static models (Table 1). Because the nodes physically represent different measurements, comparing the static and dynamic models is not completely straightforward. Nevertheless, the link between the cue and early time points of FAK and ERK and later dependence between FAK and ERK in the dynamic graph (Fig. 4) are consistent with M4, a top-scoring static graph, suggesting dependence among all the variables. However, the early dependence of both FAK and ERK on the cue alone lends support, for the initial activation data set, to static M3, showing conditional independence between ERK and FAK, yet M3 scores about four times lower than the top-scoring model, M1. This discrepancy may point to inaccuracies in either model.

Table 1.

Data summary. For model selection, data sets I and II are used to score static models. Data set I is the initial rate of activation from (16); data set II is overall activation. Model discovery examines dynamic models and therefore employs the unprocessed time course data.

Table 2.

Model scores. Data set I is initial rate of activation from (16); data set II is overall activation. Columns 3 and 5 indicate to what degree the top-scoring model explains the data better than the indicated model. This fold difference is equal to e to the power of (difference in model scores).

The search results (Fig. 4) suggest an interesting model of FAK-ERK dependence, combining a number of hypotheses previously proposed (11-15). However, this has resulted from an average of high-scoring models rather than the definitive best model, and it is very likely overfitting to the small data set. (Indeed, the number of model parameters in some places exceeded the number of available data points.) Although the data set we studied here is relatively large by traditional molecular cell biology standards, it nonetheless appears to be too limited to compellingly advance the current state of understanding regarding FAK and ERK activation. We thus emphasize again that this example was provided for illustration purposes only. The effective utility of Bayesian network approaches to modeling cell signaling pathways may depend on higher throughput measurement methods, such as flow cytometry, protein arrays, and microfabricated analytical devices. Because the systems biology field generally expects rapid developments in these measurement techniques, we anticipate that Bayesian networks will become increasingly useful in this problem area, as they are in the analysis of gene expression problems.


  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
View Abstract

Navigate This Article