Copyright © 2006 by the American Association for the Advancement of Science
The Transcriptome of the Sea Urchin Embryo

Manoj P. Samanta,1 Waraporn Tongprasit,2,3 Sorin Istrail,4,5 R. Andrew Cameron,5 Qiang Tu,5 Eric H. Davidson,5 Viktor Stolc2*

Abstract: The sea urchin Strongylocentrotus purpuratus is a model organism for study of the genomic control circuitry underlying embryonic development. We examined the complete repertoire of genes expressed in the S. purpuratus embryo, up to late gastrula stage, by means of high-resolution custom tiling arrays covering the whole genome. We detected complete spliced structures even for genes known to be expressed at low levels in only a few cells. At least 11,000 to 12,000 genes are used in embryogenesis. These include most of the genes encoding transcription factors and signaling proteins, as well as some classes of general cytoskeletal and metabolic proteins, but only a minor fraction of genes encoding immune functions and sensory receptors. Thousands of small asymmetric transcripts of unknown function were also detected in intergenic regions throughout the genome. The tiling array data were used to correct and authenticate several thousand gene models during the genome annotation process.
1 Systemix Institute, Los Altos, CA 94024, USA.

* To whom correspondence should be addressed. E-mail: vstolc{at}arc.nasa.gov

Embryogenesis in the sea urchin occurs rapidly and is relatively simple in form (1). By 2 days after fertilization, when the embryo is in the late gastrula stage, there are about 800 cells and 10 to 15 cell types. Thus, genes expressed in individual cell types or territories represent a larger fraction of the total number of transcripts than do genes expressed in adult organs of vertebrates or in more complex embryos such as that of Drosophila. Earlier studies have provided extensive quantitative evidence on transcript prevalence for sea urchin embryos, both for populations of mRNA (and nuclear RNA) and for many individual transcripts, measured by quantitative polymerase chain reaction (QPCR) (2–4). The genome sequence of Strongylocentrotus purpuratus (5) enabled these advantages to be exploited for a whole-genome tiling array analysis of the embryonic transcriptome.

Transcriptome analysis by whole-genome tiling array (6–9) has three advantages relative to standard microarray analysis with oligonucleotide probes constructed on the basis of known or predicted protein-coding genes: (i) The genes identified are not limited a priori by the gene predictions used to design the probes and therefore are not biased in favor of more prevalent or more conserved sequences; (ii) the transcripts detected will include noncoding as well as protein-coding RNAs; and (iii) intron-exon boundaries plus untranslated regions (UTRs) are revealed. In comparison with expressed sequence tag (EST) or cDNA-based approaches, whole-genome tiling arrays offer an unbiased and complete view of the transcriptional activity of the genome in the developmental state examined and in addition display the intron and exon structures of expressed genes.
By themselves, tiling array data cannot assign a distant exon to its gene, but this shortcoming can be overcome by integrating tiling and EST/cDNA data for genome annotation. Tiling array experiments have traditionally been performed only several years after genome sequencing (9). However, maskless array synthesizer technology permitted us to develop custom arrays from preliminarily assembled draft sequence. This initiative enhanced the genome project while it was still in progress, by substantially reducing the gap between sequencing and comprehensive annotation of the genome.

To sample transcriptional activity throughout early sea urchin development on a single set of high-density microarrays, we prepared polyadenylated RNA from egg, early blastula (15 hours), early gastrula (30 hours), and late gastrula stage (45 hours) embryos. Samples were mixed in equal quantities, reverse transcribed, fluorescently labeled, and hybridized. The tiling array probes were designed from the initial draft assembled sequence, which at that time was based on 6x whole-genome shotgun sequence coverage (5). A total of 10,133,868 50-nucleotide (nt) probes were selected to uniformly represent the entire sea urchin genome, maintaining an average spacing of 10 nt between consecutive probes (table S1). Repetitive sequences and simple sequence tracts were excluded. The probes were synthesized on 27 glass-based microarrays. To avoid any potential bias due to cutoff selection based on unexpressed genomic probes, we also added to each array a set of 1000 random sequences not represented anywhere in the genome. The cutoff was set such that only 1% of those random probes were falsely scored as expressed. Additionally, each array included an identical small set (2000) of genomic control probes used for normalization purposes.
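The probe layout and the random-probe cutoff described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names are hypothetical, "spacing" is assumed to mean the gap between the end of one probe and the start of the next, and the cutoff is assumed to be read directly off the ranked intensities of the random probes.

```python
def tile_positions(region_length, probe_length=50, spacing=10):
    """Start coordinates of probes tiled across a repeat-free region.

    Assumption: `spacing` is the gap between consecutive probes, so probe
    starts are probe_length + spacing nucleotides apart.
    """
    step = probe_length + spacing
    return list(range(0, region_length - probe_length + 1, step))


def cutoff_from_random_probes(random_intensities, false_positive_percent=1):
    """Intensity cutoff exceeded by at most the given percentage of random probes.

    The random probes match nothing in the genome, so any of them that
    exceeds the cutoff counts as a false positive.
    """
    ranked = sorted(random_intensities)
    n_allowed = len(ranked) * false_positive_percent // 100
    # Everything except the top n_allowed values lies at or below the cutoff.
    return ranked[len(ranked) - n_allowed - 1]


if __name__ == "__main__":
    print(tile_positions(170))                 # three 50-nt probes, 10-nt gaps
    toy_intensities = list(range(1000))        # stand-in for 1000 random probes
    print(cutoff_from_random_probes(toy_intensities))
```

Deriving the cutoff from sequences absent from the genome, rather than from genomic probes presumed silent, avoids having to decide in advance which genomic regions are unexpressed.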
After hybridization, data from all arrays were normalized according to the control probes, mapped back to the latest genome sequence assembly, and mounted on a genome browser together with the optimal set of computationally derived gene models [OGS set in (5); for visual presentation of all transcriptome results as in Fig. 1A, see www.systemix.org/sea-urchin]. Details of the methods used are available in the Supporting Online Material (10), and the microarray designs and experimental data have been deposited in the National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo) under the accession code GSE6031.
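One simple way to normalize arrays against a shared control set is sketched below. The text states only that normalization used the control probes; the median-scaling rule, the data layout (one dictionary of probe id to intensity per array), and the function name are all assumptions made for illustration.

```python
from statistics import median


def normalize_arrays(arrays, control_ids):
    """Rescale each array so its control-probe median matches a common target.

    arrays      -- list of dicts mapping probe id -> raw intensity
    control_ids -- ids of the control probes present on every array
    Assumption: a single multiplicative factor per array is adequate.
    """
    control_medians = [median(a[c] for c in control_ids) for a in arrays]
    target = median(control_medians)
    return [
        {probe: value * target / m for probe, value in a.items()}
        for a, m in zip(arrays, control_medians)
    ]


if __name__ == "__main__":
    toy = [
        {"ctrl1": 100.0, "ctrl2": 200.0, "probeA": 50.0},
        {"ctrl1": 200.0, "ctrl2": 400.0, "probeB": 100.0},
    ]
    for array in normalize_arrays(toy, ["ctrl1", "ctrl2"]):
        print(array)
```

Because the same control probes appear on every array, differences in their overall signal reflect array-to-array variation rather than biology, which is what makes them a reasonable reference for rescaling.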
Analysis of signals for 28 well-characterized genes (11) (table S2) showed that the array measurements were highly sensitive. When mapped against the known structure of these genes, it was apparent that transcribed regions were clearly distinguished from silent regions, and no intronic transcripts were detected. Intron-exon boundaries of expressed genes were thus clearly distinguishable (e.g., Fig. 1A, fig. S1). To establish a conservative statistical criterion of expression, we first established the background variance and chose a cutoff value about 2.5 times that of the mean background. At this value, about 1% of random control probes displayed apparently artifactual noise, e.g., single-point peaks over background surrounded by probes at the background level (as in the single-probe intron peak of fig. S1A). We determined whether a gene is actually expressed in the 0- to 45-hour embryo by assessing the significance of transcriptional activity in the set of probes that lie within the predicted exons of that gene (10). For each gene in the OGS set (5), a Poisson calculation was performed, based on the number of probes in the array overlying the exons of the gene that score as active, to estimate the probability that the observed profile was artifactual. Above about 12,000 to 13,000 active gene models, the probability of false-positives rose rapidly (Fig. 1B, table S3). Some genuinely active genes are no doubt excluded by this cutoff: for example, genes that consist entirely of very small exons, or genes that are represented by very few probes (<3) because of sequence features that precluded choice of those sequence elements for representation in the probe set (10), or genes not represented in the genome assembly.
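The per-gene Poisson calculation can be illustrated as follows. The exact formulation is given in the Supporting Online Material and is not reproduced here; this sketch assumes that each exonic probe independently scores as active by chance with probability 0.01 (the random-probe false-positive rate) and asks how likely it is to see at least the observed number of active probes. The function name and the truncation parameter are hypothetical.

```python
import math


def poisson_tail(k_active, n_probes, false_positive_rate=0.01, extra_terms=200):
    """P(X >= k_active) for X ~ Poisson(n_probes * false_positive_rate).

    The upper tail is summed directly (in log space) to avoid the loss of
    precision of 1 - CDF when the probability is tiny. The sum is truncated
    after `extra_terms` terms, which is ample when the expected number of
    false positives is small, as it is for a single gene.
    """
    lam = n_probes * false_positive_rate
    if k_active <= 0:
        return 1.0
    if lam == 0:
        return 0.0
    terms = (
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k_active, k_active + extra_terms)
    )
    return min(1.0, sum(terms))


if __name__ == "__main__":
    # A gene with 20 exonic probes: compare 1, 3, and 10 active probes.
    for k in (1, 3, 10):
        print(k, poisson_tail(k, 20))
```

Under these assumptions a gene covered by only one or two probes can never reach a very small probability, which is consistent with the remark above that genes represented by fewer than three probes tend to be excluded.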
To estimate the number of genes expressed in the embryo up to the late gastrula stage, several corrections were required. Of the approximately 12,000 to 13,000 OGS gene models unequivocally scored as expressed (Fig. 1B), 1400 were duplicates, an artifact of high genomic polymorphism in the initial assembly process (5). A further 250 active gene models were excluded, because they are single-exon reverse transcriptase genes (mobile elements). On the other hand, this measurement detected a number of active open reading frames not represented in the gene model set used in this study (5). Where these were near one another, they were clustered, and the probability of accidental occurrence of these open reading frames in an 800-Mb genome was calculated. In total, We may compare the end result, about 11,000 to 12,000 genes expressed, to the conclusion derived a quarter of a century ago from saturation single-copy sequence hybridization of embryo polysomal mRNA (2). This conclusion was that the same embryo uses about 8500 different genes (counting all the members of any given repetitive class of genes as 1) at the gastrula stage and, if other stages are added in (as they are here), about 10% more. Given that genes with high sequence similarity in large (more than 100-member) gene families would have been excluded from the earlier hybridization results, the two values are reasonably consistent. In any case, these measurements demonstrate that even by conservative estimates, a very large number of protein-coding informational units are required for the construction of this embryo, simple as it is, amounting to at least half of the total number of genes predicted in the S. purpuratus genome (12,000 out of 23,500) (5). In S. purpuratus, the embryo gives rise to a larva after 3 days of development, within which the adult form develops during the successive weeks of larval feeding. 
By the late gastrula stage, only some small patches of undifferentiated cells set aside from the processes of embryonic specification for adult body formation (12), and the midgut, will contribute to the adult body plan in the postembryonic period. The descendants of most of the 48-hour embryo cells will be jettisoned at metamorphosis. In contrast, in other embryos for which we have array-based transcriptome measurements, such as Drosophila (9) and Caenorhabditis elegans (13), the development of adult body parts begins immediately upon gastrulation, and there is no point after the very earliest stage at which embryonic gene use per se can be separated from gene use to construct the adult body plan.
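The gene-count bookkeeping described above can be checked with a few lines of arithmetic. This is only a worked example using the figures quoted in the text; the function names are hypothetical, and the number of additional active open reading frames outside the gene model set is not given here, so the sketch stops at the subtractive corrections.

```python
def corrected_range(scored=(12_000, 13_000), duplicates=1_400, mobile_elements=250):
    """Gene models left after removing duplicate models and
    single-exon reverse transcriptase genes."""
    return tuple(n - duplicates - mobile_elements for n in scored)


def earlier_estimate(gastrula_genes=8_500, extra_percent=10):
    """Hybridization-based estimate (2): gastrula genes plus about 10% for other stages."""
    return gastrula_genes + gastrula_genes * extra_percent // 100


def embryonic_fraction(expressed=12_000, predicted=23_500):
    """Share of predicted genes used in the embryo."""
    return expressed / predicted


if __name__ == "__main__":
    print(corrected_range())
    print(earlier_estimate())
    print(round(embryonic_fraction(), 3))
```

With the defaults, the subtractive corrections leave 10,350 to 11,350 gene models, the earlier hybridization estimate comes to 9350 genes, and 12,000 of 23,500 predicted genes is about 51%. The reported total of 11,000 to 12,000 also includes the active open reading frames that lie outside the gene model set.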
More than 9200 OGS gene models were functionally annotated in the course of the genome project (5). In Fig. 2 we report the fractions of these genes expressed during embryogenesis, according to their functional classes (table S3). Most notable is the high embryonic usage of transcription factor and signaling genes. In other work (14), Howard-Ashby et al. showed by QPCR measurements that nearly 80% of all genes encoding transcription factors other than putative Zn-finger transcription factors are expressed by 48 hours [in Fig. 2, Zn finger proteins in the "Trans." category are probably not all transcription factors (15)]. Thus, it requires most of the "regulome" just to construct the single-cell-thick gastrula embryo. These same genes must, in general, be used repeatedly in the construction of the far more complex adult body plan. Genes related to basic cellular processes (e.g., intermediary metabolism) and cytoskeletal structure (e.g., actins and myosins) were also highly expressed; these would be expected to be required in cells of both embryo and adult tissues. This is true as well of detoxification and other xenobiotic defense molecules (the price of existence in the marine environment) and of biomineralization and neuronal molecules partially shared by the respective embryonic and adult differentiated cell types. By contrast, the immune genes (5, 16) are largely expressed in the coelomocytes, which are the adult immune effector cell types. There is an elemental embryonic and larval immune defense system as well, mediated by certain embryonic mesenchymal cells, and this may account for the
Qualitatively, the transcription profiles enabled thousands of the >9200 gene models annotated by the Consortium (5) to be directly checked or corrected. Table S5 presents the results of our comparison between each predicted gene model and the transcribed regions derived from the tiling data. The gene models were mainly accurate, but missing exons were often identified by reference to these profiles. On average, the OGS genes expressed in the sea urchin embryo were 15.8 kb long and contained 9 exons, whereas the OGS genes on average were 11.9 kb long with 6.6 exons. Lack of tiling probes on short OGS genes with few exons may have contributed to the difference. The transcriptome data also indicated the dimensions of the 3'-UTR sequences (table S3), as well as the approximate transcription start sites. Many of the subgroups of sea urchin annotators used the high-resolution array data to manually curate their genes of interest (5). It was thus particularly useful for the subsequent analysis that the transcriptome measurements were carried out at a relatively early stage of the genome sequencing project as a whole, as soon as the initial assembly permitted.
Finally, as in all other whole-genome array hybridizations, many enigmatic transcripts were observed that are not included in protein-coding genes (table S6). A major class of these is composed of short (

The number of active genes sets in concrete terms the dimensions of the regulatory task of the genomic control apparatus driving embryogenesis. It cannot be said that the transcriptome is functionally understood until the individual roles and interactions of each component are revealed. Assessment of transcriptional activity across the whole genome represents the essential beginning of that process.
References and Notes
Supporting Online Material
www.sciencemag.org/cgi/content/full/314/5801/960/DC1
Materials and Methods
Fig. S1
Tables S1 to S7
References
Received for publication 29 June 2006. Accepted for publication 23 October 2006.