## Abstract

High-throughput proteomic data can be used to reveal the connectivity of signaling networks and the influences between signaling molecules. We present a primer on the use of Bayesian networks for this task. Bayesian networks have been successfully used to derive causal influences among biological signaling molecules (for example, in the analysis of intracellular multicolor flow cytometry). We discuss ways to automatically derive a Bayesian network model from proteomic data and to interpret the resulting model.

## Introduction

With the advent of high-throughput proteomic technologies, molecular signaling biology is experiencing an explosion of new experimental results. For example, intracellular multicolor flow cytometry (1, 2) allows for quantitative, simultaneous observation of multiple signaling molecules in many thousands of individual cells. A major challenge is to reveal a coherent systems-level view of the signaling network from such data.

We proffer Bayesian networks (3) as suitable models for signaling pathways. We believe that it is essential for such pathway models to be of a probabilistic nature to accommodate the noise inherent in biologically derived data. Additionally, Bayesian networks are relatively robust to the existence of unobserved variables and can explicitly handle the uncertainty in such unobserved events (for example, current proteomic technology simultaneously measures only 12 molecules in individual cells, although there are many more molecules involved in a typical signaling response). Bayesian networks have been used for automatic reconstruction of causal signaling network models from data derived from individual primary human immune cells (1). The purpose of this primer is to provide a better mathematical understanding of Bayesian networks and how they can be used to derive causal influences between biological signaling molecules.

Bayesian networks provide a compact graphical representation of the joint probability distribution over the random variables **X** = *X*_{1}, …, *X*_{n} (each such random variable represents the protein expression or activity level of a signaling molecule). Even for binary-valued variables (on or off), the joint distribution requires specification of the probabilities for the 2^{n} different assignments to *X*_{1}, …, *X*_{n}. The key property of Bayesian networks is that they use simplifying structure in the joint distribution (by explicit encoding of conditional independencies; see below) to represent such high-dimensional data in a compact manner. Furthermore, the structural aspects of the joint distribution (modeled as a graph) might correspond to the graph structure of the signaling network itself.

Consider a finite set **X** = *X*_{1}, …, *X*_{n} of random variables, where each variable *X*_{i} may take on a value *x*_{i} from the domain Val(*X*_{i}). We use italic capital letters such as *X*, *Y*, *Z* for variable names; specific values taken by these variables are denoted *x*, *y*, *z*. Sets of variables are denoted by boldface capital letters **X**, **Y**, **Z**; assignments of values to the variables in these sets are denoted **x**, **y**, **z**.

At the core of Bayesian networks is the notion of conditional independence. This concept can be explained with an example from classical genetics. Assume that we are studying a certain mutation that appears in the Springfield population at a frequency of 0.001. If we sample a random resident of the city, the probability (Pr) that Bart has the mutation is 0.001. Now, assume that we know that Bart's Grandpa has the mutation. In this case, Pr(Bart has mutation given Grandpa has mutation) = 0.25. Grandpa's genotype was informative toward Bart's genotype; the two genotypes are clearly dependent. Next we learn that Homer (Bart's father) has the mutation too. In this case, Grandpa's genotype is irrelevant, so Pr(Bart has mutation given Homer and Grandpa have mutation) = Pr(Bart has mutation given Homer has mutation but Grandpa does not) = 0.5. Likewise, if Homer does not have the mutation, the probability that Bart has it is 0.001 whether Grandpa has the mutation or not. Conditioned on Homer's genotype, Grandpa's genotype does not affect the probability of Bart's genotype. Thus, we say that Bart's genotype is conditionall*y* independent of Grandpa's genotype, given Homer's genotype.

**Definition 1:** *X* is conditionally independent of *Z* given **Y** if the probability distribution of *X* conditioned on both **Y** and *Z* is the same as the probability distribution of *X* conditioned only on **Y**:
$$mathtex$$\[P(X{\mid}\mathrm{\mathbf{Y}},Z)\ =\ P(X{\mid}\mathrm{\mathbf{Y}})\]$$mathtex$$(1)
We represent this statement as (*X* ⊥ *Z*∣**Y**).

Bayesian networks encode these conditional independencies with a graph structure.

**Definition 2:** A graph *G* = (**X**,*E*) consists of a set of nodes **X**, depicted as dots, and a set of edges *E* that connect the nodes, drawn as lines between pairs of nodes. Each edge *X*–*Y* represents a pair of nodes from **X**. In a directed graph, each edge is ordered and *X*→*Y* denotes an edge from *X* into *Y*.

## Model Semantics

A Bayesian network is a structured directed graph representation of relationships between variables. The nodes represent the random variables in our domain, and the edges represent the influence of one variable on another.

**Definition 3:** A Bayesian network (3) is a representation of a joint probability distribution consisting of two components. The first component, *G*, is a directed acyclic graph (DAG) whose nodes correspond to the random variables *X*_{1}, …, *X*_{n}. Let *Pa*_{i} denote the parents of *X*_{i} in *G* (all nodes coming into *X*_{i}). The second component, 𝛉, describes a conditional probability distribution *P*(*X*_{i}∣*Pa*_{i}) for each variable *X*_{i} in **X**.

An important property of the graph *G* is that it represents conditional independencies between variables. In the genetics example above, Bart is independent of his ancestors conditioned on his parents. The Markov assumptions generalize this concept to any directed graph.

**Definition 4:** In a Bayesian network, the graph *G* encodes the Markov assumptions: Each variable *X*_{i} is independent of its nondescendants, given its parents in *G*.

As an example, a Bayesian network can represent the relations among five different proteins (Fig. 1). Assume that A is a kinase that phosphorylates protein B. If A activates B, we expect that in most cases when A is active, so is B. The Bayesian network indicates this dependency in activity levels by drawing a directed edge from A into B. The protein activities of A and B are statistically dependent; thus, knowing the value of A provides information that can help predict the value of B.

In addition, B is a kinase that phosphorylates C; thus, the network model has an edge from B into C. On the basis of pairwise correlations alone, we expect the activity of C to be correlated not only with its direct regulator (B), but also by its indirect regulator (A). In our example, if we know the value of B, A does not provide additional information that can improve our predictions for C, and so we say “The effect of A on C is mediated through B,” that is, A and C are conditionally independent given B. Such a relation can be inferred from interventional data; for example, A and C are correlated under normal conditions, but when B's activity is inhibited, this correlation disappears.

Furthermore, kinase A also activates kinase D, creating a correlation between the activities of B and D. If B is active, we can reason that its activity might be a result of A's activity, and therefore D is more likely to be active as well. This is another example of conditional independence encoded in the graph structure: The activities of B and D are conditionally independent given their common regulator A. Such a relation can be inferred from the data if, for example, D is activated by another (unmeasured) kinase. In this case, A would be a more reliable predictor for B's activity than would D.

Finally, kinase E inhibits kinase B, providing B with two proteins that control its activity. A and E are B's parents in the Bayesian network. This leads into the second component of the Bayesian network, 𝛉: Each node has a conditional probability 𝛉 that describes the probability of its levels conditioned on the levels of its parents.

The two components *G* and 𝛉 specify a unique distribution on *X*_{1}, …, *X*_{n}. The chain rule of probabilities claims that any joint distribution can be expressed as a product of conditional probabilities, so that each variable *X*_{i} is conditioned on all the variables that precede it, *X*_{1}, …, *X*_{i–1}:$$mathtex$$\[P(X_{1},{\ldots},X_{n})\ =\ {{\prod}_{i\ =\ 1}^{n}}P(X_{i}{\mid}X_{1},{\ldots},X_{i{-}1})\]$$mathtex$$(2)With the use of the conditional independencies derived from the Markov assumptions of Definition 4, the product form can be further simplified so that each variable *X*_{i} is only conditioned on its parents *Pa*_{i}:$$mathtex$$\[P(X_{1},{\ldots},X_{n})\ =\ {{\prod}_{i\ =\ 1}^{n}}P(X_{i}{\mid}Pa_{i})\]$$mathtex$$(3)This is called the chain rule for Bayesian networks. This product economizes on the number of parameters, thus making the Bayesian network representation of a joint probability compact. As an example, consider the joint probability distribution *P*(A,E,B,C,D) represented in Fig. 1. By the chain rule of probability, without any independence assumptions,$$mathtex$$\[P(\mathrm{A},\mathrm{E},\mathrm{B},\mathrm{C},\mathrm{D})\ =\ P(\mathrm{A})P(\mathrm{E}{\mid}\mathrm{A})P(\mathrm{B}{\mid}\mathrm{A},\mathrm{E})P(\mathrm{C}{\mid}\mathrm{A},\mathrm{E},\mathrm{B})P(\mathrm{D}{\mid}\mathrm{A},\mathrm{E},\mathrm{B},\mathrm{C})\]$$mathtex$$(4)Assuming that all variables are binary, this representation requires 1 + 2 + 4 + 8 + 16 = 31 parameters. Taking the conditional independencies into account,$$mathtex$$\[P(\mathrm{A},\mathrm{E},\mathrm{B},\mathrm{C},\mathrm{D})\ =\ P(\mathrm{A})P(\mathrm{E})P(\mathrm{B}{\mid}\mathrm{A},\mathrm{E})P(\mathrm{C}{\mid}\mathrm{B})P(\mathrm{D}{\mid}\mathrm{A})\]$$mathtex$$(5)which only requires 1 + 1 + 4 + 2 + 2 = 10 parameters. More generally, given *n* binary variables and *G* whose indegree (that is, maximal number of parents) is bounded by *k*, then, instead of representing the joint distribution with 2^{n} – 1 independent parameters, we can represent it with at most 2^{k}*n* independent parameters. This reduction in parameters is critical when estimating a model from empirical data. Robust estimation of a model with many parameters requires many more data points than does estimation of a model comprising fewer parameters. Fortunately, flow cytometry can measure protein expression/activity levels in thousands of individual cells, providing enough data for the estimation of Bayesian network models.

A graph *G* specifies a product form as in Eq. 3. To fully specify a joint distribution, we need to specify the conditional probability distributions *P*(*X*_{i}∣*Pa*_{i}) for each variable *X*_{i}. 𝛉 represents the parameters that specify these distributions. *P*(*X*_{i}∣*Pa*_{i}) can be viewed as a probabilistic function of *X*_{i} whose inputs are *X*_{i}'s parents in *G*. Any distribution *P* satisfying the conditional independencies in Definition 4 can be encoded as a Bayesian network with structure *G* and associated conditional probability distributions.

In this primer, we focus on discrete variables and describe the conditional probability distributions used by Sachs *et al*. (1). In the more general case, these conditional distributions can be almost any computable representation. For instance, many continuous conditional probability distributions have been used with Bayesian networks (4–6).

We use a conditional probability table that specifies a probability distribution over *X*_{i} for each possible combination of value assignments to its parents *Pa*_{i}. Each joint value assignment *pa*_{i} to *Pa*_{i} corresponds to a row in the table. This row specifies the probability vector for *X*_{i} conditioned on *Pa*_{i} = *pa*_{i}. For example, if *Pa*_{i} consists of *k* binary valued variables, the table will specify 2^{k} distributions. This representation can describe any discrete conditional distribution. But this flexibility comes at a price: The number of free parameters is exponential in the number of parents.

As an example, assume that A and E each weakly activate protein B, and together they strongly activate protein B. Table 1 is a conditional probability table that represents such a relation.

## The Graph Structure: Independence, Dependence, and Causality

We now discuss the properties of the Bayesian network structure *G*. We describe the relationship between the graph structure and the conditional independencies it implies.

### d-separation

If we assume that molecular interactions can be modeled by probabilistic dependencies, then we can use independence queries on the distribution *P*(*X*_{1}, …, *X*_{n}) to infer these interactions. The Bayesian network structure *G* greatly facilitates efficient multivariate independence queries. For instance, the structure *X*→*Y*→*Z* implies that although *X* and *Z* are dependent, the variable *Y* renders them conditionally independent, (*X* ⊥ *Z*∣*Y*). The variable *Y* dismisses the dependence between *X* and *Z*, indicating that the interaction between *X* and *Z* is indirect. Thus, a Bayesian network can be used to distinguish between direct and indirect relationships. For example, consider the Raf→Mek→Erk cascade of protein kinases (1). The activity of Raf is informative toward the activity of Erk, but Mek activity renders the Raf and Erk activities independent. We infer that Raf influence on Erk proceeds though the intermediate Mek.

Although this independence statement was easy to infer from the simple substructure (*X*→*Y*→*Z*), it is natural to ask about queries that relate to variables that are further apart in *G*. It is possible to automatically derive such conditional independence relations between variables from the graph structure itself. Before continuing, we present a graph substructure that plays a key role both for the notion of d-separation and for the notion of equivalence described in this section.

**Definition 5:** A v-structure (3) is an induced subgraph (a subset of nodes and all the edges between these nodes in *G*) of the form *X*→*Y*←*Z* so that no edge exists between *X* and *Z*.

The v-structure implies an interesting set of dependencies. In the previous cases, *X* and *Z* were dependent only when *Y* was unobserved; in a v-structure, given the value of *Y*, two possibly independent variables become dependent. Following is a classic example of such a dependency from genetics: Consider a random variable *Y* representing the existence of a rare mutation in some child, where *Y* = 0 indicates that *Y* does not have the mutation and *Y* = 1 indicates that *Y* has the mutation. Let *X* and *Z* represent the existence of the same mutation in each of that child's two biological parents. The genotypes of each of the parents *X* and *Z* are independent of one another. But if we know that *Y* = 1 (that is, the child has the rare mutation), this means that one of the two parents must have the mutation. Now, if we are also given that *X* = 0, we can infer that *P*(*Z* = 1∣*Y* = 1, *X* = 0) = 1, and given *X* = 1 we can infer that *P*(*Z* = 1∣*Y* = 1, *X* = 1) is very low. Therefore, given the value of *Y*, the independent variables *X* and *Z* become dependent.

Intuitively, one can view dependence as a property that can “flow” between the nodes representing *X* and *Z* through paths that connect them in *G*. For instance, in *X*→*Y*→*Z*, dependence can “flow” from *X* to *Z* through *Y*, unless *Y* “blocks” this flow. Because (*X* ⊥ *Z*∣*Y*), the path is “blocked” only when *Y* is given. In an opposite case, *X*→*Y*←*Z*, dependence can “flow” from *X* to *Z* only if *Y* is given. To generalize these notions to longer paths, we say that the graph has a trail from *X*_{1} to *X*_{n}, denoted *X*_{1}–…–*X*_{n}, if for every *i* = 1 … *n* – 1, *G* contains either *X*_{i}→*X*_{i+1} or *X*_{i}←*X*_{i+1}.

**Definition 6:** Let *G* be a Bayesian network structure and *X*_{1}–…–*X*_{n} be a trail in *G*. Let **E** be a subset of nodes from **X**. There is an active trail between *X*_{1} and *X*_{n} given evidence **E** if:

Whenever we have a v-structure

*X*_{i–1}→*X*_{i}←*X*_{i+1}, then*X*_{i}or one of its descendants are in**E**.No other node along the trail is in

**E**.

Intuitively, this means that the dependence can “flow” through every triplet *X*_{i–1}–*X*_{i}–*X*_{i+1}.

**Definition 7:** Let **X**, **Y**, **Z** be three sets of nodes in *G*. We say that **X** and **Z** are d-separated given evidence **Y**, denoted d-sep_{G} (**X**;**Z**∣**Y**), if there is no active trail between any node *X* in **X** and *Z* in **Z** given evidence **Y** (7).

d-sep_{G} is a property of the graph structure *G* that corresponds to the notion of conditional independence in the corresponding probability distribution *P*. Ind(*G*) is defined as the set of independence statements (of the form “*X* is independent of *Z* given **Y**”) that are implied by *G*. Using this formulation of d-separation, we can use an efficient graph algorithm whose running time scales linearly with the number of nodes in *G* (3) to check whether any such conditional independence statement holds.

### Equivalence classes

More than one graph can imply exactly the same set of independencies. For example, consider the graphs *X*→*Y* and *X*←*Y*; both imply the same set of independencies (that is, *X* and *Y* are dependent).

**Definition 8:** Two graphs *G*_{1} and *G*_{2} are equivalent if Ind(*G*_{1}) = Ind(*G*_{2}). That is, both graphs are alternative ways of describing the same set of independencies.

This notion of equivalence is crucial, because when we examine observations from a distribution, we cannot distinguish between equivalent graphs. Following the example above, given a statistical dependence between the activities of proteins *X* and *Y*, we cannot determine whether *X* activates *Y* or whether *Y* activates *X*. We can characterize equivalence classes of graphs with the use of a simple representation (8). Equivalent graphs have the same underlying undirected graph but might disagree on the direction of some of the arcs.

**Theorem 1:** Two Bayesian network structures are equivalent if and only if they have the same underlying undirected graph (termed “skeleton”) and the same v-structures (8).

For example, the skeleton *X*–*Y*–*Z* can be partitioned into two equivalence classes, one containing three graphs representing [(*X* ⊥ *Z*∣*Y*), ¬(*X* ⊥ *Z*∣∅)] and the other containing the v-structure representing [¬(*X* ⊥ *Z*∣*Y*), (*X* ⊥ *Z*∣∅)], where ¬ denotes the logical operator NOT (Fig. 2).

Moreover, an equivalence class of network structures can be uniquely represented by a partially directed acyclic graph (PDAG) P, where a directed edge *X*→*Y* denotes that all members of the equivalence class contain the directed edge *X*→*Y*; an undirected edge *X*–*Y* denotes that some members of the class contain the directed edge *X*→*Y*, whereas others contain the directed edge *X*←*Y*. (In the final section, we discuss the relation between directed edges in the PDAG and causality.)

Assume that we are given some Bayesian network structure *G* and wish to derive the PDAG P representing *G*'s equivalence class. Because Theorem 1 states that all equivalent graphs must agree on their v-structures, it is obvious that we need to orient any edge that participates in a v-structure. What is less obvious is that there are other edges in the graph that need to be oriented. These are the edges that form new v-structures when reversed. They can be derived through one of the following three propagation rules for compelled edges in P (Fig. 3

Consider the subgraph

*X*→*Y*–*Z*, where no edge exists between*X*and*Z*. Each edge direction between*Y*and*Z*defines a different equivalence class. The edge*Y*←*Z*forms a v-structure, whereas*Y*→*Z*does not. Therefore, the edge in the corresponding PDAG P is compelled to be directed as*Y*→*Z*.Consider the subgraph

*X*→*Y*,*Y*→*Z*, and*X*–*Z*. If we direct the edge as*Z*→*X*, a cycle is formed. Therefore, to ensure acyclicity, the edge is compelled to be directed as*X*→*Z*.Consider the subgraph

*X*–*Y*,*X*–*W*,*X*–*Z*,*Y*→*Z*, and*W*→*Z*. The edge*X*–*Z*is compelled to be directed as*X*→*Z*. Assume that the edge is directed as*Z*→*X*. Then, to avoid acyclicity, the edges*X*←*Y*and*X*←*W*are compelled, thus forming a new v-structure.

Given a DAG *G*, the PDAG representation of its equivalence class can be constructed as follows. We begin from the underlying skeleton of *G* and orient all edges that participate in a v-structure. Then we continue applying the propagation rules of Fig. 3 until no more subgraphs corresponding to one of the rules exist.

**Proposition 1:** If we apply the procedure described above to *G*, the resulting PDAG represents the equivalence class of *G*.

## An Algorithm for Inferring Bayesian Networks from Experimental Data

Our goal is to automatically infer the structure of the molecular signaling network from a large data set of phosphorylated protein activity levels. For example, flow cytometry can measure multiple protein expression levels simultaneously in many thousands of individual cells. We assume that influences between molecules in the signaling network enforce statistical dependencies between protein expression/activity levels and thus also impose a distribution *P** underlying the measurements. Recall that each Bayesian network represents a unique distribution (as defined by Eq. 3) that assigns a probability to each joint value assignment to variables (measured levels of phosphorylated proteins). The process of “learning” a Bayesian network is finding a Bayesian network that gives the samples in our data a high probability.

Thus, given flow cytometry data *D* = {*X*[1], …, *X*[*M*]}, we treat each individual cell as a sample and assume that these are independently drawn from some unknown generating Bayesian network *G**, with an underlying distribution *P**. Our goal is to recover *G**. Because *D* is only a noisy sample derived from *P**, we cannot detect with complete reliability which dependencies are present in the underlying distribution. Instead, we search for a relatively simple model, *B* = (*G*,𝛉) (with few edges), that is likely to have generated the data—that is, a model whose underlying distribution is close to the empirical distribution of the data *D*.

More precisely, we search for an equivalence class of networks that best matches *D*. Recall that all structures in an equivalence class represent the same dependencies and are equally close to the empirical distribution of *D*. We cannot distinguish between them solely on the basis of *D*. Thus, the best we can hope for is to recover a structure that is equivalent to *G**.

The theory of learning networks from data has been examined extensively over the past decade. In this primer, we take a score-based approach to learning. We define a hypothesis space of potential network models, introduce a statistically motivated scoring function that evaluates each network with respect to the training data, and search for the highest scoring network.

To do so, we first assume that the graph structure *G* is given and describe an appropriate score for the conditional probability distribution parameters 𝛉 and a closed-form solution for the highest scoring parameters. Then we describe an appropriate score for the graph structure itself. Finally, we describe a greedy algorithmic approach that finds a high-scoring network structure. (The reader unfamiliar with probability theory may skip ahead to the section on model averaging.)

### Maximum likelihood estimation of parameters

In this section, we assume that the structure of the graph *G* is known. Although this is not a reasonable assumption for our domain, the theory of parameter estimation is a basic building block for the structure learning to come. In the Bayesian network learning task, we implicitly assume that there is some Bayesian network *B** that generated the data *D*, and our goal is to use these data to try to reconstruct *B**. A good Bayesian network *B* is one that is likely to have generated *D*. If we assume that *G* is already known, our task then is to find the conditional probabilities 𝛉 that maximize the likelihood that *D* was generated by the Bayesian network *B** = (*G*,𝛉).

**Definition 9:** We define a likelihood function, *L*(𝛉:*D*), that measures the likelihood that the parameters 𝛉 generated the data *D*. Because the samples are independent, the likelihood decomposes into a product of probabilities, one for each sample:$$mathtex$$\[L({\theta}:D)\ =\ {{\prod}_{m\ =\ 1}^{M}}P(X[m]{\mid}{\theta})\]$$mathtex$$(6)In maximum likelihood estimation, given a data set *D*, we wish to choose parameters 𝛉̂ that maximize the likelihood of the data (as above):$$mathtex$$\[{\hat{{\theta}}}\ =\ \mathrm{argmax}L({\theta}:D)\]$$mathtex$$(7)Optimizing Equation 7 could potentially be a computationally hard problem (exponential in the number of parameters) because of the high dimensionality of 𝛉 and the large number of parameters that need to be concurrently optimized. One of the big advantages of the Bayesian network representation is that this likelihood separates into local likelihood functions, one for each variable. This simplifies the calculation of the likelihood; more important, it renders finding its optimal parameters tractable. Each local likelihood can be optimized in an independent manner, thus decomposing a complex global problem into smaller subproblems:$$mathtex$$\[L({\theta}:D)\ =\ {{\prod}_{m\ =\ 1}^{M}}P(X[m])\]$$mathtex$$$$mathtex$$\[=\ {{\prod}_{m\ =\ 1}^{M}}{{\prod}_{i\ =\ 1}^{n}}P(X_{i}[m]{\mid}Pa_{i}[m],{\theta})\]$$mathtex$$$$mathtex$$\[=\ {{\prod}_{i\ =\ 1}^{n}}\left[{{\prod}_{m\ =\ 1}^{M}}P(X_{i}[m]{\mid}Pa_{i}[m],{\theta})\right]\]$$mathtex$$$$mathtex$$\[=\ {{\prod}_{i\ =\ 1}^{n}}L_{i}({\theta}_{x_{i}{\mid}Pa_{i}}:D)\]$$mathtex$$(8)where$$mathtex$$\[L_{i}({\theta}_{X_{i}{\mid}Pa_{i}}:D)\ =\ {{\prod}_{m\ =\ 1}^{M}}P(X_{i}[m]{\mid}Pa_{i},{\theta})\]$$mathtex$$(9)is the local likelihood function for *X*_{i}.

If we have a variable *X* with its parents *Pa*_{X}, the conditional probabilities for each joint assignment of values to *X* and *Pa*_{X} are associated with the Bayesian network. In the case of conditional probability tables, for each combination of value assignments *x* ∈ Val(*X*) and **u** ∈ Val(*Pa*_{X}), a parameter 𝛉_{x∣u} represents *P*(*X* = *x*∣*Pa*_{X} = **u**). In conditional probability tables, these parameters are independent for each **u** ∈ Val(*Pa*_{X}); therefore, the local likelihood can be further decomposed into a yet simpler form. The idea behind the decomposition is to group together all the instances in which *X* = *x* and *Pa*_{X} = **u**. We define *M*[*x*,**u**] as the number of instances in which *X* = *x*, *Pa*_{X} = **u**, and *M*[**u**] = ∑_{x∈Val(X)}*M*[*x*,**u**]. Then, by rearranging the order of the product, we can write$$mathtex$$\[L_{i}({\theta}_{X{\mid}Pa_{x}}:D)\ =\ {{\prod}_{\mathrm{\mathbf{u}}{\in}\mathrm{Val}(Pa_{X})}}{{\prod}_{x{\in}\mathrm{Val}(X)}}{\theta}_{x{\mid}\mathrm{\mathbf{u}}}^{M[x,\mathrm{\mathbf{u}}]}\]$$mathtex$$(10)

**Proposition 2:** The maximum likelihood estimate for a Bayesian network with multinomial conditional probability tables is given by$$mathtex$$\[{\hat{{\theta}}}_{x{\mid}\mathrm{\mathbf{u}}}\ =\ \frac{M[x,\mathrm{\mathbf{u}}]}{M[\mathrm{\mathbf{u}}]}\]$$mathtex$$(11)We call the counts *M*[*x*,**u**] and *M*[**u**] sufficient statistics. Given these counts, the actual data instances *X*[1], …, *X*[*M*] themselves are no longer needed. The sufficient statistics extract from the data all the relevant information needed to calculate the likelihood. Note that the optimal parameters are based on the empirical counts observed in our data. Thus, optimizing the likelihood is equivalent to finding the best approximation for the empirical distribution constrained to the independencies of *G*.

### Bayesian approach to parameter estimation

Although the maximum likelihood estimation approach would seem to be a suitable way to measure the fit of a Bayesian network to the data, it has a number of disadvantages. Its main drawback is that it tends to overfit the model to the particular data instance at hand. This problem is especially critical when studying signaling in rare cellular subsets, for which we have a relatively small number of samples. We illustrate this problem with an example from the medical domain: Assume that we are performing a study on the effect of smoking on lung cancer. Our sample contains 30 nonsmokers, none of whom contracted lung cancer. Maximum likelihood estimation would construct a model that postulates *P*(lung cancer = YES∣smoker = NO) = 0. However, we think that the correct answer is that there is a small chance for a nonsmoker to develop lung cancer, but our small sample did not contain such a case.

We therefore turn to the Bayesian approach, which formulates this concept of prior belief in a principled manner. The idea is that in addition to the observed data *D*, we have some initial distribution *P*(𝛉), termed “the prior,” which encodes our beliefs regarding the domain prior to our observations. When we have little prior knowledge of our domain, this distribution is often flat and mostly ensures that every event has some nonzero probability. On the other hand, if we do have specific information about our domain, this distribution can be more peaked over certain values.

After we observe some data *D*, we update the distribution *P*(𝛉) to reflect the combination of both our prior belief and our observations. This updated distribution, *P*(𝛉∣*D*), is called the posterior distribution:$$mathtex$$\[P({\theta}{\mid}D)\ =\ \frac{P(D{\mid}{\theta})P({\theta})}{P(D)}\]$$mathtex$$(12)The term *P*(*D*), termed “the marginal likelihood,” averages the probability of the data over all possible parameter assignments. Because it is a normalizing constant, which is independent of 𝛉, we ignore it in our score calculations.

In this primer, we use the Dirichlet priors (9) for multinomial distributions. A Dirichlet prior is specified by a set of hyperparameters α_{x1∣u}, …, α_{xK∣u}, with one such hyperparameter corresponding to each *x*^{j} ∈ Val(*X*). The Dirichlet distribution is specified by$$mathtex$$\[P({\theta})\ =\ \mathrm{Dirichlet}({\alpha}_{x^{1}{\mid}u},{\ldots},{\alpha}_{x^{K}{\mid}u})\ {\sim}\ {{\prod}_{j}}{\theta}_{x^{j}{\mid}\mathrm{\mathbf{u}}}^{{\alpha}_{x^{j}{\mid}\mathrm{\mathbf{u}}}{-}1}\]$$mathtex$$(13)Dirichlet priors have a number of desirable properties; they satisfy global parameter independence and local parameter independence. This means that the prior decomposes into a product of independent terms, in a manner similar to decomposition in maximum likelihood estimation.

**Definition 10:** A parameter prior *P*(𝛉) is said to satisfy global parameter independence if it decomposes into the following form:$$mathtex$$\[P({\theta})\ =\ {{\prod}_{i\ =\ 1}^{n}}P({\theta}_{X_{i}{\mid}Pa_{i}})\]$$mathtex$$(14)

**Definition 11:** Let *X* be a variable with parents *Pa*_{X}. Then the prior *P*(𝛉_{X∣PaX}) has local parameter independence if$$mathtex$$\[P({\theta}_{X{\mid}Pa_{X}})\ =\ {{\prod}_{\mathrm{\mathbf{u}}{\in}\mathrm{Val}(Pa_{X})}}P({\theta}_{X{\mid}\mathrm{\mathbf{u}}})\]$$mathtex$$(15)The prior *P*(𝛉) satisfies parameter independence if it satisfies both global and local parameter independence.

In addition, Dirichlet priors are conjugate priors; that is, the posterior distribution has the same functional form as the prior. This property provides an intuitive interpretation for the hyperparameters. One can view α_{x∣u} as imaginary counts; that is, before we observed *D*, on α_{x∣u} occasions we “observed” *X* = *x* and *Pa*_{X} = **u**.

**Proposition 3:** If *P*(𝛉) is Dirichlet(α_{x1∣u}, …, α_{xK∣u}), then the posterior *P*(𝛉∣*D*) is Dirichlet(α_{x1∣u} + *M*[*x*^{1},*u*], …, α_{xK∣u} + *M*[*x*^{K},*u*]), where *M*[*x*,**u**] are the sufficient statistics derived from *D*.

We say that α* = ∑_{j}α_{xj∣u} is our effective sample size, meaning that we assume that our prior is based on α* observations. This reflects how strongly we believe in our prior. As we accumulate more samples in *D*, the effect of the prior on the posterior grows weaker.

Even when our prior is flat, the full effect of the Bayesian approach comes into play when predicting the probability of future samples. In the Bayesian approach, the probability of a future observation is calculated not on the basis of only one set of parameters, but using the expectation over the entire distribution of parameters. Thus, the probability of a new sample *X*[*M* + 1] is$$mathtex$$\[P(X[M+1]{\mid}D)\ =\ {\int}P(X[M+1]{\mid}D,{\theta})P({\theta}{\mid}D)P(D)dD\]$$mathtex$$(16)When we use conditional probability tables and Dirichlet priors, this integral has a closed-form solution:$$mathtex$$\[P(X_{i}[M+1]{\mid}D)\ =\ P(X_{i}[M+1]\ =\ x^{i}{\mid}Pa_{i}[M+1]\ =\ \mathrm{\mathbf{u}},D)\ =\ \frac{{\alpha}_{x^{j}{\mid}\mathrm{\mathbf{u}}+M[x^{i},\mathrm{\mathbf{u}}]}}{{\sum}_{j}({\alpha}_{x^{j}{\mid}\mathrm{\mathbf{u}}}+M[x^{j},\mathrm{\mathbf{u}}])}\]$$mathtex$$(17)

### Structure learning: A score-based approach

Previously, we showed how one can learn the Bayesian network parameters given a known structure *G*, but in a real scenario we do not know *G*. Our goal is to understand the structural relationships between the variables in our domain. For instance, we would like to be able to distinguish between direct and indirect interactions between molecules. Therefore, it is exactly this graph structure that we wish to reconstruct from the observed data. We can then use this reconstructed graph structure to answer queries regarding the interactions between the proteins and other structural properties. On the basis of observational data alone, it is not possible to distinguish between equivalent structures. Thus, at best our reconstruction procedure can reconstruct an equivalence class of networks.

We take a score-based approach to this problem. We define a model space of candidate models that we are willing to consider, along with a scoring function that measures how well each model fits the observed data. Then we use an optimization algorithm that searches for a high-scoring model.

Our scoring function is based on the same Bayesian principles described above, following a basic principle: Whenever we have uncertainty about something, we place a probability distribution over it. We therefore define a structure prior *P*(*G*) over the different graph structures and a parameter prior *P*(𝛉∣*G*) over the parameters once the graph is given. The particular choice of the priors *P*(*G*) and *P*(𝛉∣*G*) determines the exact Bayesian score. Our score evaluates the posterior probability of the graph given the data:$$mathtex$$\[\mathrm{Score}_{B}(G:D)\ =\ \mathrm{log}P(D{\mid}G)+\mathrm{log}P(G)\]$$mathtex$$(18)where *P*(*D*∣*G*) takes into consideration our uncertainty over the parameters by averaging the probability of the data over all possible parameter assignments to *G*,$$mathtex$$\[P(D{\mid}G)\ =\ {\int}P(D{\mid}G,{\theta})P({\theta}{\mid}G)d{\theta}\]$$mathtex$$(19)The Bayesian score inherently handles the problem of overfitting a small sample to a complex model. Because of the integration over all possible parameters, structures with many parameters (that is, many parents for each variable) are penalized, unless the probability of the true parameters is very peaked (which happens when the sample size is large). Although the Bayesian score is biased to more simple structures, as more data accumulate, the score will support more complex structures (when the generating distribution is indeed complex). We explain this intuition with the following thought experiment: Assume that we are playing a gambling game on the value of *X*; each time we guess the correct value, we gain $10.Assume that for $1000 we can purchase a subscription to the value of *Y* and this value improves our ability to guess *X*, so that each time we guess, we expect to be correct 1 out of 5 times. For only 100 games it does not pay to purchase *Y*, but for a series of 700 games it would be advantageous to purchase *Y*. This is the “game” our scoring function is playing. In a biological context, if we observe that *Y* holds a small amount of information on *X* in only 100 samples, this might be a spurious dependency, but if this remains consistent over 1000 samples, then we begin to believe the relationship and add the edge to our model.

We next show how to choose good priors and demonstrate how these priors lead to desirable properties in our score. An important characteristic of the Bayesian score is that when we restrict ourselves to a certain class of factorized priors (10, 11), the Bayesian score decomposes.

**Definition 12:** A parameter prior satisfies parameter modularity when for any two graphs *G*_{1} and *G*_{2}, if *Pa*_{i}^{G1} = *Pa*_{i}^{G2}, then$$mathtex$$\[P({\theta}_{X_{i}{\mid}Pa_{i}^{G_{1}}}{\mid}G_{1})\ =\ P({\theta}_{X_{i}{\mid}Pa_{i}^{G_{2}}}{\mid}G_{2})\]$$mathtex$$(20)This relationship means that the parameter prior depends only on the local structure of the graph.

**Proposition 4:** If the prior *P*(𝛉∣*G*) satisfies global parameter independence and parameter modularity, then$$mathtex$$\[P(D{\mid}G)\ =\ {{\prod}_{i}}{\int}_{{\theta}_{X_{i}{\mid}Pa_{i}}}{{\prod}_{m}}P(X_{i}[m]{\mid}Pa_{i}[m],{\theta}_{X_{i}{\mid}Pa_{i}})P({\theta}_{X_{i}{\mid}Pa_{i}})d{\theta}_{X_{i}{\mid}Pa_{i}}\]$$mathtex$$(21)Therefore, we can decompose the score into the local contributions of each variable (denoted FamScore), where the contribution of every variable *X*_{i} to the total network score depends only on the sufficient statistics of *X*_{i} and its parents *Pa*_{i}:$$mathtex$$\[\mathrm{Score}_{B}(G:D)\ =\ {{\sum}_{i}}\mathrm{FamScore}_{B}(X_{i},Pa_{i}:D)\]$$mathtex$$(22)As we will see, this decomposition plays a crucial role in our ability to efficiently search for high-scoring network structures.

One of the big advantages of using Dirichlet priors is that the family score has a simple closed-form formula.

**Definition 13:** The gamma function Γ(*x*) is defined to be an extension of the factorial function to real-number arguments. The gamma function is defined as the integral$$mathtex$$\[{\Gamma}(x)\ =\ {\int}_{0}^{{\infty}}t^{x{-}1}\mathrm{exp}({-}x)dt\]$$mathtex$$(23)If *n* is a natural number, then Γ(*n*) = (*n* – 1)!

**Theorem 2:** Let *G* be a network structure and *P*(𝛉∣*G*) be a parameter prior satisfying parameter independence. Further assume conditional probability tables and Dirichlet priors with hyperparameters {α_{xij∣u}}. Then$$mathtex$$\[\mathrm{FamScore}_{B}(X_{i},Pa_{i}:D)\ =\ \mathrm{log}{{\prod}_{\mathrm{\mathbf{u}}{\in}\mathrm{Val}(Pa_{i})}}\frac{{\Gamma}({\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}})}{{\Gamma}({\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}}+M[\mathrm{\mathbf{u}}])}{{\prod}_{x_{i}^{j}{\in}\mathrm{Val}(X_{i})}}\frac{{\Gamma}({\alpha}_{x_{i}^{j}{\mid}\mathrm{\mathbf{u}}}+M[x_{i}^{j},\mathrm{\mathbf{u}}])}{{\Gamma}({\alpha}_{x_{i}^{j}{\mid}\mathrm{\mathbf{u}}})}\]$$mathtex$$(24)where Γ is the gamma function and$$mathtex$$\[{\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}}\ =\ {{\sum}_{j{\in}\mathrm{Val}(X_{i})}}{\alpha}_{x_{i}^{j}{\mid}\mathrm{\mathbf{u}}}\]$$mathtex$$(25)(10). A desired property is that the score reaches its optimum on the true generating structure; that is, given a sufficiently large number of samples, graph structures that exactly capture all dependencies in the distribution will receive a higher score than all other graphs (12). This means that given a sufficiently large number of instances, learning procedures can pinpoint the correct equivalence class of network structures.

**Definition 14:** Assume that our model is generated by some true model *G**. We say that our score is consistent if, as *M* → ∞, the following properties hold with probability asymptotic to 1 (over possible choices of data set *D*):

The structure

*G** will maximize the score.All structures that are not equivalent to

*G** will have a strictly lower score.

**Theorem 3:** The Bayesian score is consistent.

Recall that given *D* it is impossible to distinguish between two different networks in the same equivalence class. Thus, another desirable property in our score is that equivalent structures receive the same score. That is, if *G*_{1} and *G*_{2} are equivalent graphs, they are guaranteed to have the same score. Such a property is called structure equivalence. To achieve structure equivalence, we devise a set of hyperparameters so that our prior will not bias the score between equivalent structures. This is achieved with the use of a BDe prior (13). We define a probability distribution *P*′ over *X* and an equivalent sample size *M*′ for our set of imaginary samples. The hyperparameters are then defined to be$$mathtex$$\[{\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}_{i}}\ =\ M^{{^\prime}}P^{{^\prime}}(x_{i},\mathrm{\mathbf{u}}_{i})\]$$mathtex$$(26)

**Theorem 4:** When the data are complete and Dirichlet BDe priors are used, the score is structure-equivalent (10).

### Search algorithm for maximizing scores

Once the score is specified and the data are given, learning amounts to finding the structure *G* that maximizes the score. This problem is known to be computationally hard (exponential in the number of variables) to solve exactly (14); thus, we resort to a heuristic search. We define a search space, so that each state in this space is a network structure. We define a set of operators that transform one network structure into another. This defines a graph structure on the states: Neighboring states are those that are one operator away. We start with some initial structure (for example, the empty graph) and, using the operators, traverse this space searching for high-scoring structures.

A natural choice of neighboring structures is a set of structures that are identical to the base structure except for local modifications. We use the following operators that change one edge at each step:

Add an edge

Remove an edge

Reverse an edge

Note that we only consider operations that result in allowed networks. Networks must be acyclic and must satisfy any other constraints we specify (for example, maximal indegree constraints).

A local search procedure can efficiently evaluate the gains made by adding, removing, or reversing a single edge. The decomposition of the score is crucial for the efficiency of this procedure. Decomposition allows us to reevaluate only those components of the score that involve the variables affected by our local step. For instance, if we add an edge to the variable *X*_{i}, we only need recalculate *X*_{i}'s family score; the family scores for all other variables remain unchanged.

We use the following greedy hill-climbing algorithm for our search procedure: At each step, we evaluate all possible local moves and perform the change that results in the maximal gain, until we reach a local maximum. (A sketch for such an algorithm appears in Fig. 4.) Although this procedure does not necessarily find a global maximum, it does perform well in practice. A number of heuristics can be included to overcome some of the local maxima (for example, random restarts). Examples of other search methods that advance by single edge changes include beam-search, stochastic hill-climbing, and simulated annealing (15).

Any implementation of these search methods involves caching of computed counts to avoid unnecessary passes over the data. This cache also allows us to marginalize counts. Thus, if *M*[*X*,*Y*] is in the cache, we can compute *M*[*X*] by summing over values of *Y*. This is usually much faster than making a new pass over the data. One of the dominating factors in the computational cost is the number of passes actually made over the data.

## Model Averaging

Given a flow cytometry data set, we use the learning algorithm described above to search for the Bayesian network *G* that best explains the data. A simple approach would be to accept *G* as a correct model of our domain. Then we could use *G* to infer relations between proteins (for example, kinase A directly influences kinase B). Such analysis would rely on the assumption that the network *G* correctly represents the interactions in the underlying domain, but how reasonable is this assumption? A sufficiently large number of samples (hundreds of thousands of cells) would (almost) provide that the network inferred is a good model of the data (12). However, given a smaller number of training instances (thousands), there may be many models that explain the data almost equally well.

Figure 5 shows five high-scoring networks in relation to data set *D*. The networks vary in structure, but their scores are almost the same. In our example, *G*_{3} is the highest scoring network. In a simple approach, we would infer a direct interaction between the proteins A and C (because the edge A→C exists in *G*_{3}), but the edge is absent from the other high-scoring networks. Therefore, it is more likely that the dependence between A and C is a spurious artifact in *D*. Thus, we cannot depend on a single network to provide an accurate description of the relations in our biological domain.

Instead of querying a single structure, we search for common feature*s* that most of the high-scoring network structures agree on. For example, the ensemble of high-scoring networks in Fig. 5 all agree on the edge E→A. Therefore, it is likely to represent a real biological signal. We find such features by examining the posterior probability of the feature given the data. A network feature is a property such as “*X*→*Y* is in the network” or “*d* – sep_{G}(*X*;*Y*∣**Z**) is in the network.” We associate the feature *f* with an indicator function, *f*(*G*), that has the value 1 when *G* satisfies the feature and value 0 otherwise. The posterior probability of *f* is defined as$$mathtex$$\[P[f(G){\mid}D]\ =\ {{\sum}_{G}}f(G)P(G{\mid}D)\]$$mathtex$$(27)This probability reflects our confidence in *f* given *D*.

The straightforward way of calculating Eq. 27 is by enumerating all high-scoring networks. Unfortunately, the number of such networks can be exponential in the number of variables, so exact computation of the posterior probability is not feasible. Instead, we can estimate this posterior probability by sampling representative networks and then estimating the fraction that contain the feature of interest. Ideally, we would like to sample networks from *P*(*G*∣*D*) and use the sampled networks to estimate this quantity. The general solution to this problem is to build a Markov chain Monte Carlo sampling procedure (16). Unfortunately, this sampling procedure is also computationally costly.

Instead, we use an effective and relatively simple bootstrap method (17) as an approximation of the posterior probability. Our networks are generated using nonparametric bootstrap (18). We generate “perturbed” versions of the original data set and learn a Bayesian network structure from each of them. In this way, we collect many networks, all of which are fairly reasonable models of the data. These networks reflect the effect of small perturbations to the data on the learning process. We use the following procedure:

For

*i*= 1, …,*m*, construct a data set*D*_{i}by sampling, with replacement,*M*instances from*D*. Then, apply the learning procedure on*D*_{i}to induce a network structure*G*_{i}.For each feature

*f*of interest, calculate

$$mathtex$$\[\mathrm{conf}(f)\ =\ \frac{1}{m}{{\sum}_{i\ =\ 1}^{m}}f(G_{i})\]$$mathtex$$(28)[See (19) for an evaluation of this bootstrap approach on simulated data.] These simulation experiments show that features induced with high confidence are rarely false positives, even in cases where the data sets are small relative to the size of the system being learned.

## Interventions and Causality

### Causality

Recall that a Bayesian network is a model of dependencies between multiple variables. However, we are also interested in modeling the mechanisms that generated these dependencies. Thus, we want to model the causality in the system of interest (for example, protein *X* activates protein *Y*). A causal network is a model of such causal processes. Having a causal interpretation facilitates predicting the effect of an intervention in the domain.

Although at first glance there would seem to be no direct connection between probability distributions and causality, causal interpretations for Bayesian networks have been proposed (8, 20). A causal network is mathematically represented similarly to a Bayesian network, a DAG where each node represents a random variable along with a local probability model for each node. However, causal networks have a stricter interpretation on the meaning of edges: The parents of a variable are its immediate causes.

A causal network models not only the distribution of the observations, but also the effects of interventions. If *X* causes *Y*, then manipulating the value of *X* affects the value of *Y*. On the other hand, if *Y* causes *X*, then manipulating *X* will not affect *Y*. Thus, although *X*→*Y* and *X*←*Y* are equivalent Bayesian networks, they are not equivalent causal networks.

A causal network can be interpreted as a Bayesian network when we are willing to make the causal Markov assumption, which states that given the values of a variable's immediate causes, it is independent of its earlier causes. When the causal Markov assumption holds, the causal network satisfies the Markov independencies of the corresponding Bayesian network. But when can we derive a causal network from data? This issue has received a thorough treatment in the literature (8, 21, 22); we briefly review some relevant results [for a more detailed treatment of the topic, see (20, 23)].

First, it is important to distinguish between an observation (a passive measurement of our domain; that is, a sample from **X**) and an intervention [setting the values of some variables with the use of forces outside the causal model, such as chemical inhibition or small interfering RNA (siRNA)]. Interventions are an important tool for inferring causality, but occasionally some causal relations can be inferred from observations alone.

To learn causality, we require several assumptions. First, we require a modeling assumption: We assume that the (unknown) causal structure of the domain satisfies the causal Markov assumption. Thus, we assume that causal networks can provide a reasonable model of the domain. The second assumption is that there are no latent or hidden variables that affect several of the observable variables. Unfortunately, neither assumption holds in our domain. Thus, causal conclusions from our learning procedure must be treated with caution.

If we do make these two assumptions, then we essentially assume that one of the possible DAGs over the domain variables is the “true” causal network. However, as discussed above, observations alone do not permit us to distinguish between causal networks that specify the same independence properties (i.e., belong to the same equivalence class). Thus, at best we can hope to learn a description of the equivalence class that contains the true model. In other words, we will learn a PDAG description of this equivalence class.

Once we identify such a PDAG, we are still uncertain about the true causal structure in the domain. However, we can draw some causal conclusions. For example, if there is a directed path from *X* to *Y* in the PDAG, then *X* is a causal ancestor of *Y* in all the networks that could have generated this PDAG, including the “true” causal model. Thus, in this situation we can recover some of the causal directions.

### Modeling interventions

To infer the causal direction of more edges, we need to apply external interventions to the data. These interventions can be genetic mutations, siRNA, small chemical interventions (inhibitors or activators), or any other intervention that directly influences one of the molecules observed in the data. The use of data containing external interventions contradicts a basic assumption made by our Bayesian network learning algorithm: that each data instance was sampled from the same underlying distribution. For instance, using a small chemical to inhibit *X*, we replace the original molecular control on *X*'s activity by an external one. Thus, any consequent measurement (in which *X*'s value is constantly set to 0) will behave differently from *X*'s conditional distribution on its parents in observational data. Therefore, it is important to explicitly model this intervention into our learning algorithms.

Formally, we model an intervention, denoted do(*X* = *x*), as an ideal intervention (20) that deterministically sets the value of *X* to *x*. This intervention disables the natural causal mechanisms that affect *X* and replaces them with an external deterministic mechanism. In addition, we assume that the intervention only affects *X*'s causal mechanism and leaves intact all other causal mechanisms in the model; that is, all other variables behave according to their respective conditional distribution. More formally, given a causal network *G*, an ideal intervention defines a new causal network *G*_{do(X = x)}, identical to *G* except that all incoming edges into *X* are removed. In *G*_{do(X = x)}, *X* becomes a root node associated with the probability distribution Pr(*X* = *x*) = 1 (Fig. 6). Note that *X*'s outgoing edges are not affected by such an intervention. When a number of different variables are manipulated in the same sample, we remove the edges incoming to each of these variables.

Such interventions can be used for causal inference. We illustrate this point with the following thought experiment: Consider the pair of networks *X*→*Y* and *Y*→*X*. As Bayesian networks, the two are equivalent and cannot be distinguished on the basis of observational data alone. If we inhibit *X* [i.e., do(*X* = *x*)], then, as causal models, we expect each of the two models to respond differently. If the causal model is *X*→*Y*, then *Y*'s causal mechanism remains intact.

Therefore, the same conditional distribution is measured in both the observed and inhibited samples: *P*[*Y*∣do(*X* = *x*)] = *P*(*Y*∣*X* = *x*). On the other hand, if *Y*→*X* is the causal model, the inhibition of *X* disables the causal mechanism responsible for the dependency between the two variables. When *X* is inhibited, the variables *X* and *Y* become independent: *P*[*Y*∣do(*X* = *x*)] = *P*(*Y*). Therefore, we expect to measure a different conditional distribution in the observational and inhibited samples. Whereas we could not distinguish between the two models with the use of observations alone, we can differentiate between them with the use of interventional data.

The global nature of our reasoning allows us to reach a causal conclusion upstream of an intervention. In the above example, if we inhibit *X* and observe a different conditional distribution—*P*[*Y*∣do(*X* = *x*)] ≠ *P*(*Y*∣*X* = *x*)—we infer that the causal mechanism between *X* and *Y* has been disrupted. Because inhibition of *X* disrupts the mechanism of its direct incoming causes, we infer that *Y* causes *X*, or in our terminology, *Y* regulates *X*.

### Scoring with interventions

The Bayesian scoring function described above assumes that all samples are drawn from the same network structure. We adapt that score to properly handle interventions (under the model of an ideal intervention). In such data sets, the underlying Bayesian network associated with each sample differs depending on the perturbation administered to the sample.

Similarly to (24), we make the following set of assumptions:

Our samples are independent random samples from a causal network. This causal network represents both the probability distribution sampled and the causal relationships in the data. We note that this assumption does not hold in our domain, as this assumes acyclicity of the generating network.

Each perturbation is an ideal intervention; that is, it disables the normal causal mechanism of the inhibited/activated protein and an independent causal mechanism deterministically sets its value. Furthermore, this intervention does not directly affect the causal mechanism of any other protein.

The data are complete; there are no missing or hidden variables.

Global and local parameter independence (Definitions 10 and 11).

Parameter modularity (Definition 12).

The prior distribution of all parameters is Dirichlet.

We define *M*[*m*] as the set of interventions occurring in the *m*th sample. Let *M* be the collection of all sets of interventions occurring in *D*. Then, *G*^{M} is the set of all graphs derived from *G* based on *M*, that is,$$mathtex$$\[G^{M}\ =\ \left\{G_{\mathrm{do}(M[m])}\right\}_{m\ =\ 1}^{M}\]$$mathtex$$(29)One of the key properties of the Bayesian scoring function is that the score decomposes into a product of local entities, each depending only on *X* and *Pa*_{X}. Unfortunately, when *D* contains interventions, the instances are not associated with a single structure. The probability of each sample *m* is calculated using the appropriate graph, *G*_{do(X = x)}. Therefore, at first glance, such a decomposition might seem problematic.

We briefly sketch how this problem can be overcome [see (19) for a full derivation of the appropriate Bayesian score under interventions]. The assumptions of parameter independence and parameter modularity allow us to calculate the contribution of each variable independently, without being affected by the choice of parent sets for other variables. Thus, the expression for each variable *X*_{i} decomposes as a separate integral that depends only on its parents in *G*_{do(X = x)} and the associated parameters. The key point to notice is that although *G*^{M} might contain many different graphs, if we focus our attention on a single variable *X*_{i}, we find only two possible sets of parents: *Pa*_{i}^{G} for samples where *X*_{i} is not intervened, and a root node in samples where it is intervened (Fig. 6). Furthermore, the parameters of *X*_{i} in the nonintervened samples are independent of the intervened samples.

When assuming Dirichlet priors, this leads to the same closed-form formula derived in Eq. 24. The only difference is that the counts *M*[**u**] and *M*[*x*_{i}^{j},**u**] are tallied only over the samples *X*_{i} ∉ *M*[*m*]:$$mathtex$$\[\mathrm{FamScore}_{B}(X_{i},Pa_{i}:D)\ =\ \mathrm{log}{{\prod}_{\mathrm{\mathbf{u}}{\in}\mathrm{Val}(Pa_{i})}}\frac{{\Gamma}({\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}})}{{\Gamma}({\alpha}_{x_{i}{\mid}\mathrm{\mathbf{u}}}+M[\mathrm{\mathbf{u}}])}{{\prod}_{x_{i}^{j}{\in}\mathrm{Val}(X_{i})}}\frac{{\Gamma}({\alpha}_{x_{i}^{j}{\mid}\mathrm{\mathbf{u}}}+M[x_{i}^{j},\mathrm{\mathbf{u}}])}{{\Gamma}({\alpha}_{x_{i}^{j}{\mid}\mathrm{\mathbf{u}}})}\]$$mathtex$$(30)

### Inferring causality with interventions

The score of Eq. 30 is not structure-equivalent: The score of two equivalent graphs *G* and *G*′ is no longer necessarily the same. This should not come as a surprise, because the score was derived with the intent of using interventional data to differentiate between equivalent graphs. We use the equivalent graphs *X*→*Y* and *Y*→*X* to demonstrate this point. As an example, assume that our domain contains the variables (*X*,*Y*) and we measure the samples$$mathtex$$\[D\ =\ \{(0,0),(0,1),(1,1),(1,1),(1,0),(0,0),[\mathrm{do}(X\ =\ 1),1],[\mathrm{do}(X\ =\ 1),1],[\mathrm{do}(X\ =\ 1),0],[\mathrm{do}(X\ =\ 0),0]\}\]$$mathtex$$(31)For these data, the score for the graph structure *X*→*Y* is –6.46. In contrast, the score for the graph structure *Y*→*X* is –6.78. Because the score for *X*→*Y* is better, we conclude that it is more likely that *X* causes *Y*.

Although interventions help us to determine the causal direction of some edges, usually these leave the causal direction of the others undetermined. This motivates the development of a notation of equivalence suited for the interventional setting.

**Definition 15:** For a set of interventions *M*, the graphs *G*^{1} and *G*^{2} are M-equivalent if, for any set of interventions *m* ∈ *M* (always including *m* = ∅), the graph structures *G*^{1}_{do(m)} and *G*^{2}_{do(m)} are equivalent.

Using the empty set of interventions, M-equivalence implies equivalence as defined in Definition 8. The notion of M-equivalence is a more restrictive refinement of Definition 8 in which a larger set of edges are compelled.

**Definition 16:** An edge *X*→*Y* in *G* is M-compelled for a set of interventions *M* if all graphs that are M-equivalent to *G* contain the directed edge *X*→*Y*.

**Theorem 5:** The following edges are M-compelled in *G*:

All edges participating in a v-structure in

*G*.For each set of interventions

*m*∈*M*, all edges entering or leaving any variable*X*intervened in*m*.All edges compelled by the repeated application of the propagation rules specified in Fig. 3.

We can use Theorem 5 to identify the edges that can be given a causal interpretation when interventional data are used (19). Even when the criteria specified by Theorem 5 are met, care must be taken in their causal interpretation. Signaling networks do not meet all the assumptions required for the correctness of Theorem 5. For example, the underlying signaling network is not acyclic. Although some assumptions do not hold, these criteria do work well in practice. For example, in the network derived in (1), the correct directionality was inferred for all edges except one. It is interesting to note that all edges, except for this reversed edge, met the criteria of Theorem 5.

## Conclusion

The task of inferring signaling pathway architectures is an important challenge of the postgenomic era, and the method described here is only a small step toward this goal. There are many ways in which the basic framework proposed here can be extended and developed further.

The method described infers an acyclic network and does not consider the timing of signaling events. Given appropriate time-series data, both of these limitations can be overcome by the use of dynamic Bayesian networks (25, 26). Another important area for improvement is to develop more realistic local probability distributions that are capable of modeling the actual kinetics of interactions between signaling molecules (27, 28). Finally, one of the advantages of probabilistic graphic models is that these can be explicitly used to model unobserved variables and infer their structure and activity (29, 30). This feature could be used to discover previously unknown components as well as potential drug targets.