## Abstract

This Focus tackles the issue of technical versus biological replicates, what constitutes appropriate biological replicates, and appropriate statistical analysis for data with small sample sizes.

A mathematician, a physicist, and a statistician went hunting for deer. When they chanced upon one buck lounging about, the mathematician fired first, missing the buck’s nose by a few inches. The physicist then tried his hand and missed the tail by a wee bit. The statistician started jumping up and down saying, “We got him! We got him!” (from the website CrossValidated)

## Technical versus biological replicates

Reproducibility in biological science is critical to scientific progress. Here, we discuss the issue of reproducibility at the level of individual experiments. First, it is important to distinguish between technical and biological replicates. Technical replicates tell us something about the reproducibility of an assay, not the reproducibility of the phenomenon under study. Done in duplicate or triplicate, these technical replicates provide a glimpse of whether the technique used to measure something is robust or noisy, and, if it is noisy, whether the extent of that noise negates the ability to distinguish the effect from the control. Technical replication data, however, need to be interpreted in the context of biological replicates. Poor technical replication may still enable discovery if the biological replicates show that the phenomenon is strong and easily distinguishable from the controls. Conversely, tight technical replication is necessary if the goal is to discover an effect when the phenomenon itself is weak and highly variable between individual experiments.

The concept of biological replicates is even trickier. What exactly is a “biological replicate,” and why are these important? The definition and the importance of biological replicates depend on exactly what conclusions are being drawn. One may argue that biological replicates should be designed to be as similar as possible to each other—that is, parallel plates of cells in the same incubator, harvested on the same day at the same time. In fact, we would argue that by those criteria, the replicates would most often be technical and not biological. But it depends on what kind of truth you are seeking. If the intent of a biological replicate is to prove that the phenomenon you observed was real—in those cells, in that incubator, on that day—then maybe those can be considered biological replicates inasmuch as they are biological samples evaluated under identical conditions. But, in general, most biologists are interested in discovering biological processes that constitute a fundamental truth about nature, not just something that might be true only when the humidity in the tissue culture hood is 63% and the pollen count in the lab is high. We want to learn something general about life itself—day in and day out—and that means, at the least, that we can observe it again and again, at different times, on different days, and maybe even in different thawed aliquots or passages of the same cell line.

## The biological *n* for cells in culture

Just as important as understanding the difference between technical and biological replicates is understanding what constitutes appropriate biological replicates. Is it enough to study a phenomenon in a single cell line? The more general a phenomenon is, the more “universal” the biological truth that is being unveiled, and thus the more important the discovery is likely to be. Showing that the p38 mitogen-activated protein kinase (MAPK) pathway crosstalks with the nuclear factor κB (NF-κB) pathway in one cell line (say, U2OS cells) is interesting, but this result may have more impact and more important implications if it also occurs in other cultured cell lines, such as 293 cells and HeLa cells, and in primary cells, such as human endothelial cells, macrophages, and breast epithelial cells. However, it can be equally important to know the specificity of signaling events to understand in what contexts a particular regulatory process is relevant: Is the crosstalk between the p38 MAPK and NF-κB pathways limited to mesenchymal cells and tissues? This may be critical information not only for the basic biological researcher but also for the medical researcher looking to translate the results into clinical practice. Thus, some types of experiments require testing in multiple cell types and in primary cells, if possible. However, providing outlets for publication of these kinds of data may require a change in the scientific value placed on negative results. Publishing the lack of an effect in some cell types becomes just as important as publishing the positive data in other cell types.

Some scientists prefer to do an *n* = 1 experiment in four different cell lines rather than an *n* = 4 experiment in a single cell line, arguing that if they see the same phenomenon once in four cell lines, then it must be genuine. However, that approach fails to provide quantitative details about the mean magnitude of the effect and its variation. Thus, to determine generalizability, it is important to perform the same experiment multiple times (biological replicates performed with technical replicates) in multiple types of cells. Another concern is that scientists will use different cell lines in an opportunistic manner. In published studies, the authors may show a blot of protein X shifting its migration pattern in one cell line, the coimmunoprecipitation of protein X with protein Y in a different cell line, and then an effect of knocking down protein X on some phenotype in a third cell line. Sometimes using different cellular systems is necessary for technical reasons, but we suspect that often it is because each of the effects was best detected in one specific cell line. Inevitably, it is then hard to know if the resulting model defines a coherent phenomenon within individual cells or is some agglomerate of effects, the entire sum of which never actually occurs within any single cell. We argue that reliability and generalizability are higher if the experiments are all done within one cell line where technically feasible. Ideally, the key points can then be demonstrated in other cell lines, in relevant primary cells, or in vivo.

## One-sided versus two-sided *t* tests

Ultimately, researchers must turn to statistical tests to understand whether the measurements indicate that there are significant differences between the conditions. The *t* test is frequently used to assess significance. As with any calculation of a *P* value, we are asking, “If there were truly no difference between the conditions, how likely is it that a difference at least as large as the one we measured would arise by chance alone?” If such a difference would arise by chance no more than 5% of the time (*P* value ≤ 0.05), then by convention we consider the difference statistically significant and place an asterisk (*) on the graph. As an example, imagine that we want to understand whether kinase activity is significantly increased in one cell line versus another. In this case, the hypothesis is that the kinase activity in cell line A is increased compared with that in cell line B, which means that the null hypothesis is that the kinase activity is the same. Because the hypothesis states that the change is unidirectional (cell line A has increased kinase activity compared with cell line B), we can use a one-sided *t* test. But, if we had a different hypothesis involving several transfected cell lines and were comparing each of them to a control and we hypothesized that the transfected cell lines would have different (increased or decreased) kinase activity compared with the control, then we would use a two-sided *t* test, because the difference could be significant in either direction. As in the first experiment, the null hypothesis is that the kinase activities in all of the cell lines are the same as in the control: There is no difference. If the hypothesis in the second experiment was that all of the transfected cells would have increased kinase activity compared with the control, then we would use a one-sided *t* test. This is important to consider carefully, because using a two-sided *t* test when a one-sided test is more appropriate amounts to doubling the *P* value because *t* distributions are symmetric (Fig. 1).
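The factor of two between one-sided and two-sided *P* values can be checked numerically. The following is a minimal sketch using only the Python standard library; the *t* statistic of 2.5 and the 4 degrees of freedom are hypothetical values chosen for illustration, and the tail areas are obtained by simple numerical integration of the *t* density rather than by a statistics package.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def simpson(f, a, b, steps=20000):
    """Integrate f over [a, b] with Simpson's rule (steps must be even)."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Hypothetical values, not taken from any experiment in the text.
t_stat, df = 2.5, 4

def density(x):
    return t_pdf(x, df)

# One-sided: area in the upper tail only (half the total area lies above 0).
p_one_sided = 0.5 - simpson(density, 0.0, t_stat)
# Two-sided: area outside the interval [-t, t], i.e. in both tails.
p_two_sided = 1.0 - simpson(density, -t_stat, t_stat)

print(f"one-sided P = {p_one_sided:.4f}")
print(f"two-sided P = {p_two_sided:.4f}")
```

With these assumed numbers, the same *t* statistic falls below the 0.05 threshold in the one-sided test but not in the two-sided test, which is why the choice should follow from the hypothesis stated before the data are examined.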

How does a *t* test work, and what is a *t* distribution? The *t* statistic and its distribution of values are useful for comparing the mean values from a normally distributed set of numbers (perhaps measurements or average values from replicate sets of measurements) when the total number of observations is small. The *t* distribution shows the probability distribution for a *t* statistic (see Eq. 1), and both the *t* distribution and the *t* statistic are related to the number of samples measured (Fig. 1). As the *t* statistic increases, the *P* value decreases (Fig. 1). Using the equation for Welch’s *t* test (Eq. 1), which can be used regardless of whether a different number of samples was collected for each condition, it is clear that the statistic, *t*, would get larger, and therefore the *P* value would decrease (becoming more significant) if any of the following happened: (i) the differences between the means (*m*1 and *m*2) became larger, (ii) the standard deviations (*s*1 and *s*2) became smaller, or (iii) more samples were used to calculate the sample means and standard deviations.
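Equation 1 is not reproduced in this text. For reference, the standard textbook form of Welch's *t* statistic is sketched below, on the assumption that this is the form the authors intend. The sample counts *n*1 and *n*2 and the degrees of freedom ν are symbols introduced here for illustration; ν is given by the usual Welch–Satterthwaite approximation.

```latex
t = \frac{m_1 - m_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

Written this way, the three effects are visible directly: a larger numerator (difference between the means), smaller standard deviations, or larger sample counts each increase *t*.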

How good is our estimate of the mean? That depends on how you designed the experiment. In the kinase activity example, if you have nine samples and use all of them to calculate the mean, then you would calculate a single sample mean and the standard deviation of the nine measurements around it. However, if instead you had three replicates from day 1, three from day 2, and three from day 3, then you could calculate the sample mean and standard deviation for each day's *n* = 3 samples, giving you three estimates of the mean. The standard deviation of these three mean estimates provides an empirical estimate of the standard error of the mean. These values could then be analyzed with the *t* test.
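The two ways of summarizing nine measurements can be compared side by side. The sketch below uses Python's standard `statistics` module; the kinase activity values and the day labels are invented purely for illustration.

```python
from statistics import mean, stdev

# Hypothetical kinase activity readings (arbitrary units), three per day.
days = {
    "day 1": [9.8, 10.4, 10.1],
    "day 2": [9.2, 9.6, 9.5],
    "day 3": [10.6, 10.9, 10.3],
}

# Option 1: pool all nine measurements into one mean and one standard deviation.
all_values = [v for values in days.values() for v in values]
pooled_mean = mean(all_values)
pooled_sd = stdev(all_values)

# Option 2: one mean per day, then the spread of those three mean estimates.
day_means = [mean(values) for values in days.values()]
mean_of_means = mean(day_means)
sd_of_day_means = stdev(day_means)

print(f"pooled: mean = {pooled_mean:.3f}, SD = {pooled_sd:.3f} (n = 9)")
print(f"by day: mean = {mean_of_means:.3f}, SD of day means = {sd_of_day_means:.3f} (n = 3)")
```

Because each day contributes the same number of samples, the two means agree; what differs is the measure of spread and the *n* that would enter a subsequent *t* test.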

## The magic of 3

How many biological replicates are enough? Most statisticians will tell you that *n* = 30 is a good number from which to get a “feel” for the mean and its distribution. However, this is generally unrealistic in biological experiments, both for practical and financial reasons. So why, then, have most researchers settled on three biological replicates? Where does this magic *n* = 3 number come from? There have to be at least two samples; otherwise, you cannot calculate a standard deviation and therefore cannot use a *t* test (nor should you want to, because with only one sample, you have no idea as to how good your estimate of the mean is). Furthermore, only when *n* is at least 3 does the value of the standard deviation actually begin to tell you anything other than the arithmetic difference between the measurements. Is *n* = 3 that much better than *n* = 2? To see this, imagine taking kinase activity measurements in two cell lines; the first time, we have two samples for each cell line (*n* = 2), and the second time, three samples for each cell line (*n* = 3). To simplify the situation, assume that the means and standard deviations are the same between experiments with *n* = 2 and *n* = 3 samples (for example, the mean in cell line one is 10 with a standard deviation of 0.5, and the mean in cell line two is 5 with a standard deviation of 0.2). Although the difference between the two means and the standard deviations are identical in both experiments (*n* = 2 versus *n* = 3), the *P* value in the case of *n* = 3 will be smaller than with *n* = 2 for two reasons. First, the *t* statistic increases, because the denominator decreases as the number of samples increases. Second, the number of degrees of freedom increases for the *t* distribution and, therefore, the tails decrease and the central peak rises, meaning that for the same *t* statistic on this *n* = 3 curve, versus the *n* = 2 curve, there is less area under the tails (Fig. 1).
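The *n* = 2 versus *n* = 3 comparison can be worked through with the numbers given above (means of 10 and 5, standard deviations of 0.5 and 0.2). This is a minimal sketch, assuming the standard form of Welch's *t* test with Welch–Satterthwaite degrees of freedom and reporting a two-sided *P* value; the helper names are hypothetical, and the tail area is computed by numerical integration with the standard library only.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def two_sided_p(t, df, steps=20000):
    """Area under both tails beyond |t|, via Simpson's rule on [0, |t|]."""
    t = abs(t)
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1.0 - 2.0 * (total * h / 3)

def welch(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic, approximate degrees of freedom, two-sided P."""
    v1, v2 = s1**2 / n1, s2**2 / n2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df, two_sided_p(t, df)

# Same means and standard deviations; only the number of samples changes.
t2, df2, p2 = welch(10, 0.5, 2, 5, 0.2, 2)
t3, df3, p3 = welch(10, 0.5, 3, 5, 0.2, 3)

print(f"n = 2: t = {t2:.2f}, df = {df2:.2f}, P = {p2:.4f}")
print(f"n = 3: t = {t3:.2f}, df = {df3:.2f}, P = {p3:.4f}")
```

Both effects described in the text appear in the output: the *t* statistic is larger for *n* = 3, and the degrees of freedom are larger as well, so the *P* value drops on both counts.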

Is *n* = 4 better than *n* = 3? Absolutely! You get more statistical power (Fig. 1). However, is it worth the expense and time of more samples? That depends on many factors, particularly the effect size and the standard deviation among the samples, which in turn reflects the noise of the underlying biology and the specific assay being used. Very large differences will be significant even with low sample numbers. Conversely, statistical significance can emerge between different sample populations when many samples are examined (for example, in the case of automatic microscopy measurements), even when the difference between the means is exceedingly small and potentially not biologically relevant. Evaluating statistical significance is clearly important, but ultimately we care more about the significance of the finding as it relates to biology.
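To illustrate how a very small difference becomes statistically significant once enough samples are measured, the sketch below compares the same 0.5% difference in means at *n* = 3 and *n* = 10,000 per group. All numbers are hypothetical. For the large-*n* case only, the *P* value is approximated with the normal distribution (via `math.erfc`), which is an assumption that is reasonable when the degrees of freedom are in the thousands but not for small *n*.

```python
import math

def welch_t(m1, s1, n1, m2, s2, n2):
    """Welch's t statistic from summary values (standard textbook form)."""
    return (m1 - m2) / math.sqrt(s1**2 / n1 + s2**2 / n2)

# Hypothetical summary values: a 0.5% difference in means, SD of 1.0 in both groups.
mean_a, mean_b, sd = 10.05, 10.00, 1.0

t_small = welch_t(mean_a, sd, 3, mean_b, sd, 3)          # three samples per group
t_large = welch_t(mean_a, sd, 10000, mean_b, sd, 10000)  # e.g. automated imaging

# Two-sided P for the large-n case, using the normal approximation to the
# t distribution (appropriate only because n is very large here).
p_large = math.erfc(abs(t_large) / math.sqrt(2))

print(f"n = 3:     t = {t_small:.3f}")
print(f"n = 10000: t = {t_large:.3f}, approximate two-sided P = {p_large:.5f}")
```

The difference between the means is identical in both cases; only the sample count changes, so a small *P* value here says nothing about whether a 0.5% change matters biologically.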

## A future of reliable and reproducible science

We suspect that confusion about standard deviation, standard error of the mean, number of samples, and one- versus two-sided *t* tests, which we have only lightly touched upon here, is common. Many cell biologists learned statistics by necessity when dealing with their own primary data and may have a somewhat limited knowledge base. Others formally learned the basics of statistics in a mathematics course using artificial examples like riverboat gamblers with loaded dice rather than realistic data from wet-lab biological experiments. Only by first raising awareness of these issues among ourselves and then providing better and more practical training in statistical methods to the next generation of biologists can we hope both to enhance scientific progress and to retain the trust of the lay community that supports our endeavors.