Research Resource: Nuclear Receptors

Discovering relationships between nuclear receptor signaling pathways, genes, and tissues in Transcriptomine

Sci. Signal.  25 Apr 2017:
Vol. 10, Issue 476, eaah6275
DOI: 10.1126/scisignal.aah6275

Data mining to understand nuclear receptor signaling

Transcriptomic data are potentially useful for generating mechanistic hypotheses beyond the original experiment in which they were generated and for independently validating unrelated studies. However, data are often generated and presented in disparate contexts and formats, making it difficult to draw connections between different researchers’ findings. Becnel et al. provide an updated version of Transcriptomine, a data-mining web tool that focuses on nuclear receptor pathway data sets. This tool has been redesigned to be easily used by bench scientists to access and complement data from the published scientific literature. The resource curates more than 500 data sets to allow users to cross-reference information about how different genetic or pharmacological manipulations affect gene expression in different organs or physiological systems and to visualize pathway-gene-tissue relationships. The approach used by the authors can be expanded to other pathways and types of ‘omics data sets.

Abstract

We previously developed a web tool, Transcriptomine, to explore expression profiling data sets involving small-molecule or genetic manipulations of nuclear receptor signaling pathways. We describe advances in biocuration, query interface design, and data visualization that enhance the discovery of uncharacterized biology in these pathways using this tool. Transcriptomine currently contains about 45 million data points encompassing more than 2000 experiments in a reference library of nearly 550 data sets retrieved from public archives and systematically curated. To make the underlying data points more accessible to bench biologists, we classified experimental small molecules and gene manipulations into signaling pathways and experimental tissues and cell lines into physiological systems and organs. Incorporation of these mappings into Transcriptomine enables the user to readily evaluate tissue-specific regulation of gene expression by nuclear receptor signaling pathways. Data points from animal and cell model experiments and from clinical data sets elucidate the roles of nuclear receptor pathways in gene expression events accompanying various normal and pathological cellular processes. In addition, data sets targeting non-nuclear receptor signaling pathways highlight transcriptional cross-talk between nuclear receptors and other signaling pathways. We demonstrate with specific examples how data points that exist in isolation in individual data sets validate each other when connected and made accessible to the user in a single interface. In summary, Transcriptomine allows bench biologists to routinely develop research hypotheses, validate experimental data, or model relationships between signaling pathways, genes, and tissues.

INTRODUCTION

Signaling pathways involving members of the nuclear receptor (NR) superfamily of transcription factors, their regulatory small molecules (ligands), and transcriptional coregulators coordinate regulation of gene expression across multiple physiological systems and organs (1, 2). Over the past 15 years, researchers in this field have frequently used transcriptome-scale expression profiling approaches to characterize the biology of NR signaling pathways (3). Although reuse of these data sets has considerable potential value in filling gaps in knowledge in areas of research that the original investigators did not envisage, various factors have restricted such reuse. Rates of public deposition of these data sets are low (4), and many data sets are available only in unpredictable and inconsistent file types. Those that are available in public archives are presented in formats that are intimidating to many bench biologists and are frequently annotated without regard for metadata standards (5). To make these data more routinely accessible to scientists, we previously developed a search tool, Transcriptomine, that aggregated and annotated transcriptomic experiments relevant to NR signaling (6). Here, we describe advances in the biocuration, scale of coverage, and usability of our resource and illustrate its value with reference to a series of Use Cases of evidence gathering, hypothesis generation, and model testing.

RESULTS

Transcriptomine allows bench scientists to discover regulatory relationships between signaling pathways, genes, and tissues

The Transcriptomine tool enables data set discovery, reuse, and attribution through two user interfaces (UIs): (i) browsable data set pages, which connect data sets to their associated journal articles and to data set search engines (Fig. 1, Data Set Pages), and (ii) the gene Regulation Report, which allows visualization of biological relationships between genes, signaling pathways, and tissues across the universe of data points (Fig. 1, Regulation Reports). We designed a UI that would enable seamless, bidirectional navigation by the user between the data set pages and Regulation Reports for a gene. The data set pages can be reached from the associated research articles (where publishers support such links), thereby extending the value of the original studies (Fig. 1A). Gene Regulation Reports are accessible through links embedded in gene lists in the data set pages (Fig. 1B), from gene-centric knowledge databases such as Entrez Gene, GeneCards, and the Pharmacogenomics Knowledgebase (PharmGKB) (Fig. 1C), and through queries constructed in the query form itself (https://goo.gl/oscYup) (Fig. 1D). Gene Regulation Report data points link directly to a Fold Change Details (FCD) window that provides the experimental details that led to a fold change value (Fig. 1E) and connects it back to its parental data set (Fig. 1F). To complete the cycle of reuse and attribution, the data set page allows one-click citation of the data set in research manuscripts (Fig. 1G), which, in turn, drives discovery of data sets through the reference lists of articles in which they are cited (Fig. 1H).

Fig. 1 Transcriptomine enables a cycle of data set discovery, reuse, and attribution to illuminate uncharacterized biology of NR signaling pathways.

(A to H) The UI is designed to allow for seamless bidirectional navigation between browsable Data Set Pages and gene Regulation Reports. The Data Set Pages (browsable at the Data Set Directory) link to associated journal articles (A) to extend the value of the original study. The Regulation Report is accessible from the data set pages (B), through links embedded in external gene and small-molecule knowledge databases (C), or directly through a user-configurable query form (D) and enables hypothesis generation through visualization of pathway-gene-tissue relationships. Regulation Report data points link to detailed contextual windows (E) that, in turn, close the loop back to the parental data set (F). Finally, the data set pages enable one-click citation in manuscripts and research proposals (G), which, in turn, drives further discovery of data set citations in article reference lists (H).

Classifying transcriptomic data sets by signaling pathway and tissue or cell line enhances their accessibility to bench researchers

The current version of Transcriptomine introduces two newly developed biocuration steps, namely, classifying experiments by signaling pathway and by biosample categories. Transcriptomic data sets in the NR field are based on experiments involving small molecules (physiological ligands, synthetic organics, and others) or genetic manipulations, such as knockout, knockdown, and mutant knockins, of NRs and coregulators. This diversity in experiment design creates obstacles for the user in evaluating evidence for regulation of a gene or gene set by a signaling pathway or in comparing the transcriptomic end points of different signaling pathways. A simple scenario—determining if a gene of interest is regulated by a given signaling pathway—poses two challenges. First, the user must have previous knowledge of all the small molecules or regulatory proteins affecting the pathway, and second, they must run multiple independent queries for all possible experimental manipulations. To eliminate these obstacles to usability, we mapped small molecules and their cognate NRs to a specific pathway (table S1). Tissue specificity is a well-characterized facet of NR signaling pathways (7). Although they perform experiments in specific cell lines or tissues, cell biologists frequently interpret the results in the context of major organs and their associated physiological systems, for example, the role of adipose signaling pathways in metabolism or the role of uterine signaling in female reproduction. To allow users to interact with Transcriptomine using these familiar terms, we therefore assigned biological samples (tissue or cell type) to specific organs (for example, prostate) and their associated mammalian physiological system (for example, male reproduction) (table S2).
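The two classification steps described above can be pictured as simple lookup tables. The following is a minimal sketch only: the dictionary names, the function names, and the handful of entries are assumptions chosen for illustration and are not taken from tables S1 and S2 or from the Transcriptomine code base.

```python
# Hypothetical, hand-picked entries; the curated mappings are in tables S1 and S2.
MOLECULE_TO_PATHWAY = {
    "17betaE2": "ERs and estrogens",
    "fulvestrant": "ERs and estrogens",
    "ESR1 knockdown": "ERs and estrogens",
    "ATRA": "RARs and retinoids",
}

# Biosample (tissue or cell line) -> organ (illustrative assignments).
BIOSAMPLE_TO_ORGAN = {
    "MCF-7": "mammary gland",
    "LNCaP": "prostate",
}

# Organ -> physiological system (illustrative assignments).
ORGAN_TO_SYSTEM = {
    "mammary gland": "female reproduction",
    "uterus": "female reproduction",
    "prostate": "male reproduction",
}


def classify(manipulation, biosample):
    """Return (pathway, organ, physiological system) for one experiment."""
    pathway = MOLECULE_TO_PATHWAY[manipulation]
    organ = BIOSAMPLE_TO_ORGAN[biosample]
    return pathway, organ, ORGAN_TO_SYSTEM[organ]


def manipulations_for(pathway):
    """List every manipulation mapped to a pathway, so that one pathway
    query can stand in for many manipulation-by-manipulation queries."""
    return sorted(m for m, p in MOLECULE_TO_PATHWAY.items() if p == pathway)
```

With mappings of this kind, a user who selects a pathway does not need to know in advance which small molecules or gene manipulations belong to it; `manipulations_for("ERs and estrogens")` collects them in one step.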

The Transcriptomine data set directory enables browsing of data sets and connects them to their associated research articles

The latest version of Transcriptomine contains a reference library of more than 500 data sets, encompassing more than 2000 experimental contrasts and about 45 million data points, with broad coverage of signaling pathways (Fig. 2A) and physiological systems and organs (Fig. 2B). Data sets are browsed in a data set directory (https://goo.gl/8ekO59) that can be filtered using drop-down menus for any combination of signaling pathway, physiological system, and organ.

Fig. 2 Breakdown of Transcriptomine data sets by signaling pathway and physiological system and organ.

(A) Data sets relevant to ERs and estrogens signaling constitute the largest pathway class in Transcriptomine, followed by, in descending order, studies of signaling by PPARs and fatty acids, GR and glucocorticoids, and AR and androgens. Reflecting cross-talk between NRs and cytoplasmic kinase pathways (141), data sets involving manipulations of cell surface receptors, signaling enzymes, and non-NR transcription factors are an expanding sector of the Transcriptomine data set library. FXR, farnesoid X receptor; TRs, thyroid hormone receptors; VDR, vitamin D3 receptor; CAR, constitutive androstane receptor; PXR, pregnane X receptor; ERR, ER-related receptor; RARs, retinoic acid receptors. (B) Data sets involving female reproductive and metabolic tissue model systems constitute nearly two-thirds of the database. The prominence of female reproductive biosamples reflects the popularity of mammary epithelial cell line models, whereas the large number of metabolic biosamples is due, in part, to our curation of data sets emerging from TG-GATEs (142), a large-scale toxicotranscriptomic screen in liver and kidney model systems. GI, gastrointestinal; CNS, central nervous system; PNS, peripheral nervous system; UC, umbilical cord.

The data set pages provide for stable digital object identifier (DOI)–based links to external sources, including the associated journal article (Fig. 1A) and citations of data sets in article bibliographies (Fig. 1H) (5). Data set pages (such as https://goo.gl/wEyHGs) contain detailed experimental information and are designed with the bench researcher in mind (fig. S1). In addition to the data set name and description (see Materials and Methods), the Overview section (fig. S1A, top) identifies the repository and accession number of the primary data set. The Overview section also allows one-click citation of the data set in a user’s reference manager of choice and, to enable the user to readily identify the context in which the data set was originally generated, includes the full citation of the associated research article. Below the Overview section is an Experiments section, partitioned into tabs for (i) Data Points (fig. S1A, bottom), displaying the top induced (red) and repressed (blue) data points in a scatterplot with fold change on the horizontal axis and gene symbols [according to Human Genome Organization (HUGO) gene nomenclature] on the vertical axis, and (ii) Conditions (fig. S1B, top), displaying the experiment name and description and specific parameters such as small-molecule dose or concentration and duration of exposure. A pull-down menu displaying experiment names allows the user to toggle between different experiments of interest (fig. S1A, bottom). Clicking on an individual data point of interest displays a pop-up Fold Change Information (FCI) window containing a link to the Regulation Report for that gene (see section Gene Regulation Reports below and Fig. 1B). This feature allows the user to identify other pathways that might affect a particular gene of interest. The utility of this feature for hypothesis generation and evidence gathering is illustrated in Use Case 3 below. 
A third section, Related Data Sets, displays Transcriptomine data sets related by regulatory molecule or biosample, options for which can be selected from another pull-down menu (fig. S1B, bottom).

The query form allows bench scientists to ask biological questions in Transcriptomine

Rather than looking at a specific data set, users often have more open-ended questions in mind. What signaling pathways regulate my gene or function of interest? What genes are most frequently regulated by my pathway of interest in different tissues? How does regulation of my cellular function of interest by a given signaling pathway differ between different tissues? Users can ask these types of questions using the Transcriptomine query form (https://goo.gl/oscYup) (fig. S2). The pathway and biosample biocuration described above allowed us to simplify the query form to three principal elements: pathway, biosample category, and gene(s) of interest (fig. S2). Drop-down menus representing the pathway (fig. S2A) and biosample (fig. S2B) classes serve the dual purposes of (i) conveying the overall scope of the resource and (ii) eliminating possible user confusion or input error associated with manual text input. Selecting a given NR pathway option (fig. S2A) will retrieve all experiments that map to that pathway, whereas selecting a biosample physiological system displays an additional menu that allows the user to select individual organs (fig. S2B). Options for Gene of Interest (fig. S2C) are as follows: Any, Single Gene (text autocomplete allows for disambiguation of synonyms to a single approved gene symbol), Gene List (upload of up to 5000 approved gene symbols), Gene Ontology Term [querying by a user-specified Gene Ontology (8) term], and Disease Term [querying by a disease curated by the Online Mendelian Inheritance in Man (OMIM) curation initiative (9)]. A fourth module (fig. S2D) enables modification of the default P value cutoff (≤0.05) for fold change results.
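The three principal query elements and the P value cutoff can be thought of as a filter over curated data points. The sketch below is one possible reading under stated assumptions: the `DataPoint` and `Query` classes, the `run_query` function, and the example records (placeholder gene symbols with invented fold changes and P values) are hypothetical and do not reproduce the Transcriptomine implementation or its results. Only the default cutoff of 0.05 comes from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class DataPoint:
    gene: str
    pathway: str
    system: str
    organ: str
    fold_change: float
    p_value: float


@dataclass(frozen=True)
class Query:
    # None stands for the "Any" option in a drop-down menu.
    pathway: Optional[str] = None
    system: Optional[str] = None
    organ: Optional[str] = None
    genes: Optional[FrozenSet[str]] = None
    p_cutoff: float = 0.05  # default cutoff stated in the text


def run_query(points: List[DataPoint], query: Query) -> List[DataPoint]:
    """Keep the data points that satisfy every element the user has set."""
    def matches(dp: DataPoint) -> bool:
        return (
            dp.p_value <= query.p_cutoff
            and query.pathway in (None, dp.pathway)
            and query.system in (None, dp.system)
            and query.organ in (None, dp.organ)
            and (query.genes is None or dp.gene in query.genes)
        )
    return [dp for dp in points if matches(dp)]


# Invented placeholder records, not Transcriptomine results.
EXAMPLE_POINTS = [
    DataPoint("GENE_A", "ERs and estrogens", "female reproduction", "mammary gland", 3.2, 0.001),
    DataPoint("GENE_A", "ERs and estrogens", "female reproduction", "uterus", 1.4, 0.20),
    DataPoint("GENE_A", "AR and androgens", "male reproduction", "prostate", 2.1, 0.01),
    DataPoint("GENE_B", "ERs and estrogens", "female reproduction", "mammary gland", 0.5, 0.03),
]

hits = run_query(
    EXAMPLE_POINTS,
    Query(pathway="ERs and estrogens", genes=frozenset({"GENE_A"})),
)
```

Leaving an element unset corresponds to choosing Any, so a query that sets only `genes` would return every significant data point for those genes across all pathways and biosamples.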

Gene Regulation Reports allow for varied perspectives on pathway-gene-tissue relationships

In the following section, Transcriptomine data points are cited using paired references in which the Transcriptomine data set is immediately followed by its associated research article. Transcriptomine query results are displayed in a gene Regulation Report (figs. S3 and S4), a scatterplot similar to that used in the data set pages. The vertical axis enables four views of the data points, of which Pathway is the default for single gene queries. In the Pathway view for the Transcriptomine GREB1 Regulation Report (https://goo.gl/dCHoA6) (fig. S3 and data file S1, Query 1), top-level Pathways are organized into individual rows identified by a symbol representing the corresponding experiment regulatory molecule(s) (see table S1 and Materials and Methods), making biological patterns in the data readily visible. For example, numerous data points confirmed the published regulation of GREB1 by the estrogen receptors (ERs) and estrogens pathway (10), the androgen receptor (AR) and androgens pathway (11), and the glucocorticoid receptor (GR) and glucocorticoids pathway (12). In addition, the search results indicated previously uncharacterized regulation of GREB1 by the peroxisome proliferator–activated receptors (PPARs) and fatty acid–regulated signaling pathway (fig. S3).

Segregation of data points by their corresponding regulatory molecules within a given pathway allows for the comparison of the pharmacology of different small molecules with respect to a given gene in a specific biosample. GREB1, for example, was shown by Transcriptomine to be induced in mammary gland model systems by the physiological ER subfamily agonist 17β-estradiol (17βE2) (13, 14) and the mammary estrogens genistein (15, 16) and PCB54 but was repressed by the mammary anti-estrogens 4-hydroxytamoxifen and raloxifene (17, 18), as well as by the pure anti-estrogen fulvestrant (19, 20). Similarly, the inclusion of data points from knockout studies provides information on the possible receptor dependence of a given pathway-gene regulatory link. Illustrating this, repression (blue) and induction (red) of GREB1 in MCF-7 cells by depletion (21, 22) and overexpression (23, 24), respectively, of ERα/ESR1 indicated that this receptor is required for estrogenic regulation of GREB1 in these cells (fig. S3). Note that according to our previously proposed convention (25), to make our resource accessible to users in both the NR and non-NR fields, individual regulatory molecules (receptors and transcriptional coregulators) are referred to using the symbol in common use in the field and the species-appropriate approved symbol (for example, ERα/ESR1).

Coregulators are a diverse class of molecules that are required by NRs for efficient regulation of gene expression (1, 26). Many coregulators have broad specificity for multiple NRs, so to avoid unduly implying a specific association with a pathway in regulation of a gene, data points involving manipulations of individual coregulators are categorized in a separate Coregulators section (fig. S3). This segregation allows visualization of potential functional roles of coregulators in regulation of a given gene or group of genes by a specific NR pathway. For example, repression of GREB1 in HepG2 cells overexpressing a dominant-negative PGC1α/PPARGC1A (27, 28) indicated that induction of GREB1 by the PPARs and fatty acids signaling pathway involves the transcriptional coregulator PGC1α/PPARGC1A (fig. S3 and data file S1, Query 1).

In addition to the Pathway view, the Regulation Report also allows viewing of data points categorized by biosample, gene, or species. In the Biosample view, individual rows containing all data points from experiments in a specific tissue or cell line are organized into organs and physiological systems (table S2). The Biosample view for the Transcriptomine CA12 Regulation Report (https://goo.gl/PvmNsD) (fig. S4) illustrates how this view helps the user identify tissue-specific regulation of gene expression by a pathway. Consistent with its function in bone (29), CA12 is shown in Transcriptomine to be frequently regulated in bone model systems, principally U2OS cells. In contrast, the abundant data points in mammary gland model systems visualized in the CA12 Regulation Report indicate a function in this organ that is less well characterized. In the Gene view, alphabetical listing by symbol is the default display for multigene queries. To provide a less visually cluttered UI, we used the HUGO human gene symbol for all orthologs of a receptor. Finally, the Species view enables comparison of patterns of regulation of gene(s) by a pathway between different species.
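The four views of a Regulation Report amount to grouping the same set of data points by a different attribute. The following sketch illustrates that idea only; the `group_by_view` function, the record layout, and the example records (a placeholder gene symbol with invented fold changes) are assumptions for this example rather than a description of how Transcriptomine renders its reports.

```python
from collections import defaultdict

# The four views named in the text.
VIEWS = ("pathway", "biosample", "gene", "species")


def group_by_view(points, view):
    """Arrange data points into rows keyed by the chosen view."""
    if view not in VIEWS:
        raise ValueError(f"unknown view: {view!r}")
    rows = defaultdict(list)
    for point in points:
        rows[point[view]].append(point)
    # Sort row labels alphabetically for a stable display order.
    return {label: rows[label] for label in sorted(rows)}


# Invented placeholder records, not Transcriptomine results.
EXAMPLE_POINTS = [
    {"gene": "GENE_A", "pathway": "ERs and estrogens", "biosample": "MCF-7",
     "species": "human", "fold_change": 2.5},
    {"gene": "GENE_A", "pathway": "ERs and estrogens", "biosample": "U2OS",
     "species": "human", "fold_change": 1.8},
    {"gene": "GENE_A", "pathway": "AR and androgens", "biosample": "LNCaP",
     "species": "human", "fold_change": 0.6},
]

by_pathway = group_by_view(EXAMPLE_POINTS, "pathway")
by_biosample = group_by_view(EXAMPLE_POINTS, "biosample")
```

Switching from `by_pathway` to `by_biosample` leaves the underlying data points unchanged and alters only the rows in which they appear, which is what allows tissue-specific patterns to become visible in the Biosample view.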

Transcriptomic approaches are frequently used to evaluate the effect of small-molecule or genetic manipulations on gene expression programs involved in animal or cell models, such as the Zucker diabetic rat or adipogenesis in the NIH 3T3-L1 cells (table S3). When collated in a single interface, control experiments from these data sets are a convenient reference for identifying transcripts that are consistently regulated in these models. To enable such discovery, we categorized these data points in a distinct section of the Regulation Report (fig. S3, Animal and Cell Models). Use Case 2 (see below) illustrates the utility of these model data points in modeling NR pathway–biological process relationships. In addition to these model experiments, Transcriptomine contains a small but growing number of small-scale, case-control studies documenting transcriptomic signatures of pathological conditions. To enable the discovery of relationships between NR signaling pathways and disease states, we mapped these data sets to the same parental terms used for biosample (table S2) and model experiment mappings (table S3). In addition, to allow for the identification of cross-talk between NRs and other signaling pathways, a dedicated section of the Regulation Report displays data points from experiments involving small-molecule or genetic manipulation of receptors, enzymes, and transcription factors not related to NR signaling pathways (fig. S3, Other Pathways). Use Case 3 (see below) illustrates how these data points can be used to model potential cross-talk between different signaling pathways.

Drilling down places data points in context and enables their citation

The query form and Regulation Reports are designed to give users the biological “big picture.” To drill down on specific data points, users can first audit results by clicking on a data point in the Regulation Report to display an FCI window (fig. S3) containing the gene symbol, fold change and P value, experiment name, biosample, and species. Having identified a data point of interest, the user clicks on the More Information link in the FCI window to display the FCD window (Fig. 1E and fig. S5). The FCD window contains an Experiment Information section, which details gene manipulations, or small-molecule dose and concentration or duration of treatment. To place the experiment in the context of the larger data set, additional information is provided in the Data Set Information section, which includes the data set name and description fields, which define any gene or small-molecule symbols present in the experiment name and description. To encourage users to provide attribution to the original data set creators, the FCD window provides for one-click download of the data set citation into a user’s reference manager (5). Finally, to close the “logical loop” between the data mining and data set interfaces (Fig. 1F), the FCD window contains a link back to the data set page from which the data point originated.

Use Cases demonstrate linking signaling pathways, genes, and tissues using Transcriptomine

We next describe a series of Use Cases demonstrating the ability of Transcriptomine Regulation Reports to initiate or substantiate research hypotheses that address underappreciated relationships between NR signaling pathways, genes, and tissues. Three of these (Use Cases 1, 2, and 4) involve NR signaling pathways, whereas a fourth (Use Case 3) involves a non-NR pathway to demonstrate the expanding utility of our resource to investigators in other signaling disciplines. Links to the relevant Transcriptomine gene Regulation Reports are embedded in the text, and Use Cases are visually summarized (Fig. 3). The Supplementary Materials contain summaries of all Use Case query parameters and the corresponding gene list downloads (data file S1). Transcriptomine data points supporting each Use Case are cited using paired references in which the Transcriptomine data set is immediately followed by its associated research article.

Fig. 3 Use Cases illustrating development of research hypotheses in Transcriptomine.

Use Case 1: ER and estrogen signaling pathway regulates the UPR. KO, knockout. Use Case 2: Regulation of 3T3-L1 adipogenesis by the PPARγ/PPARG and RORα/RORA signaling pathways involves antagonistic regulation of gap junction formation. Use Case 3: Combination Rev-erbAα/NR1D1 agonism and MET/HGF antagonism in chemoresistant gastric cancer. Use Case 4: The spermine oxidase (SMOX) gene is regulated by multiple NR signaling pathways in different physiological contexts. All mechanistic relationships are inferred from data points in Transcriptomine data sets. Use Case parameters and corresponding search results are contained in data file S1. Direct links to relevant gene Regulation Reports are embedded in the respective sections in the main text.

Use Case 1 shows how biology described in an article is indicated by data points in data sets that predated the article but existed in isolation from each other and were therefore difficult to integrate into a coherent biological narrative. The unfolded protein response (UPR) is a highly conserved program of gene expression regulated by multiple signaling pathways that alleviate endoplasmic reticulum stress associated with the accumulation of unfolded proteins (30). A 2015 article in Oncogene described how 17βE2 induced key components of the UPR, including HSPA5, SERP1, and XBP1, in breast cancer cells in an ERα/ESR1-dependent manner (31). The authors postulated that induction by 17βE2 of this pathway might constitute an adaptive stress response that promoted resistance to tamoxifen therapy, and described it as a novel 17βE2-ERα/ESR1-regulated pathway. Transcriptomine contained numerous data points from multiple independent, pre-2015 data sets pointing to transcriptional regulation by the ERs and estrogens pathway of components of the UPR pathway, including HSPA5 (https://goo.gl/5n3hW6) [mouse uterus (32–37) and rat vagina (38, 39); data file S1, Query 2], XBP1 (https://goo.gl/MiJdTy) [MCF-7 cells (21, 22, 40–45) and mouse testis (46, 47); data file S1, Query 3], and SERP1 (https://goo.gl/XwFbPo) [MCF-7 cells (21, 22) and mouse uterus (32–35); data file S1, Query 4] (Fig. 3, Use Case 1). Collectively, these data points form a reasonable rationale for designing experiments to determine the mechanistic basis for regulation of the UPR by the ERs and estrogens signaling pathway.
These queries also indicate regulation of UPR genes by the AR and androgens pathway [HSPA5 in PC3 (48, 49) and LNCaP (50, 51) cells and mouse liver (52, 53), XBP1 in mouse testis (46, 47), and SERP1 in LNCaP cells (54–57) and mouse testis (46, 47)], by the GR and glucocorticoids pathway [HSPA5 in mouse liver (58, 59) and rat kidney (60, 61) and SERP1 in mouse bone marrow–derived macrophages (62, 63) and human HLE B-3 cells (64, 65)], and by the PPARs and fatty acids pathway [XBP1 during 3T3-L1 adipogenesis (66, 67)].

Use Case 2 focuses on the utility of Transcriptomine Animal and Cell Model experiments in building confidence in a hypothesis previously uncharacterized in the research literature, for the role of two distinct NR signaling pathways in the regulation of 3T3-L1 adipogenesis. Gap junction channels are cellular structures that play important roles in cell-cell communication by enabling the propagation of chemical and electrical signals. Although the literature on the relationship between gap junctions and adipocyte differentiation is conflicting, a consensus appears to be that the presence of gap junctions early during adipogenesis is required for synchronous initiation of the differentiation program (68–70). Although published evidence links NR signaling pathways to regulation of fat cell differentiation (71), there have been no published studies linking such pathways to regulation of gap junction formation during this process. Consistent with the increased expression in white adipose tissue (WAT) of the GJA1 gene (72), the Transcriptomine Regulation Report for GJA1 in adipose tissue (https://goo.gl/HXBLA7) (data file S1, Query 5) indicated that it is dynamically regulated during 3T3-L1 adipogenesis and that this regulation is mediated in part by PPARγ/PPARG, a master regulator of adipogenesis (66, 67, 73, 74). Transcriptomine data points reflecting the induction of Gja1 in WAT of mice lacking Smad3 (75, 76), a mediator of transforming growth factor–β inhibition of adipocyte differentiation (77), provide further evidence for a putative role for Gja1 in adipogenesis.

To find further evidence that the PPARs and fatty acids pathway regulates adipogenesis in part by regulating gap junction formation, we next asked Transcriptomine whether other genes involved in gap junction formation were regulated in adipose tissue. Versican (encoded by the Vcan gene) modulates gap junction communication in 3T3-L1 cells (78). The Transcriptomine Regulation Report for VCAN in adipose tissue (https://goo.gl/OBCqy9) (data file S1, Query 6) suggested its regulation early in 3T3-L1 adipogenesis (79, 80) and PPARγ/PPARG-dependent repression in mature 3T3-L1 adipocytes (66, 67, 73, 74). Of possible translational relevance, data points from clinical data sets highlighted differences in VCAN expression in subcutaneous and omental fat from obese Pima Indian subjects relative to lean individuals (81, 82).

An antagonistic relationship exists between gap junctions and members of the extracellular matrix metalloproteinase (MMP) family (83–85). If the emerging hypothesis that GJA1 is induced during adipogenesis in a PPARs and fatty acids pathway–dependent manner is well founded, we anticipated that genes in the MMP family would be repressed during adipogenesis. Transcriptomine queries showed that five members of the MMP family [Mmp2 (https://goo.gl/VF7rn4), Mmp3 (https://goo.gl/Y0tzvn), Mmp9 (https://goo.gl/KmD3p6), Mmp14 (https://goo.gl/my63vI), and Mmp23b (https://goo.gl/3AF9dx)] were subject to PPARγ/Pparg-dependent repression during adipogenesis (Fig. 3, Use Case 2, and data file S1, Queries 7 to 11) (66, 67). Transcriptomine queries also indicated PPARγ/Pparg-dependent repression during 3T3-L1 adipogenesis of five members of the ADAMTS (a disintegrin and metalloprotease with thrombospondin motif) family of zinc proteases [Adamts1 (https://goo.gl/zvLyjm), Adamts2 (https://goo.gl/8fvA8r), Adamts4 (https://goo.gl/4I3DII), Adamts5 (https://goo.gl/Dmz9Se), and Adamts10 (https://goo.gl/rlQJMS)] (data file S1, Queries 12 to 16) (66, 67, 79, 80). Although gap junction proteins are not known substrates for ADAMTS family members, the overlapping sensitivity of ADAMTS and MMP family members to certain inhibitors (86) suggests that this may be the case. In support of our Transcriptomine-generated hypothesis, two studies unknown to us that were published while our manuscript was in revision (87, 88) reported an antagonistic relationship between ADAMTS1 and adipogenesis.

A final source of evidence for our hypothesis related to signaling by RORα/Rora, which inhibits adipocyte differentiation (89, 90). Adipogenesis is also inhibited in mice overexpressing hepatic cholesterol sulfotransferase (91), which catalyzes the formation of cholesterol sulfate, a RORα/Rora agonist (92). Given these observations, we anticipated that RORα/Rora signaling would antagonize the expression of genes under the control of the PPARs and fatty acids pathway in adipose tissue. Consistent with this, a total of 10 genes encoding MMP (data file S1, Queries 7 to 11) and ADAMTS (data file S1, Queries 12 to 16) metalloproteinase family members, as well as Vcan (data file S1, Query 6) were shown by Transcriptomine to be repressed in RORα/RORA-depleted mouse WAT (93, 94).

Use Case 3 illustrates the utility of the direct link to a gene Regulation Report from a gene list on a data set page (Fig. 1B) to extend the scope of a data set beyond its associated research article. It also emphasizes the value of Transcriptomine to investigators of signaling pathways other than those involving NRs. Aberrant activation of the pathway mediated by receptor tyrosine kinase MET and its physiological ligand, hepatocyte growth factor (HGF), occurs in gastric cancer (95–97). Although MET pathway inhibitors have been evaluated as therapeutic options in gastric cancer (95, 98), such strategies are frequently associated with the acquisition of resistance. A 2009 Science Signaling article probing the mechanistic basis of this resistance (99) included an expression profiling data set evaluating the transcriptomic response of a panel of human gastric cancer cell lines to the MET inhibitor PHA-665752 (PHA665) (100). A gene that is not discussed in the article but is consistently induced by PHA665 across all experiments in the associated data set is n-Myc downstream-regulated gene 1 (NDRG1), which has been implicated in tumorigenesis and metastasis in various cancers (101). The potential role of NDRG1 in resistance of gastric cancer cells to MET pathway inhibition implicated by these data points is suggested by the role of NDRG1 in multidrug resistance in neuroblastoma cells (102). Moreover, its stabilization of the epithelial state in nasopharyngeal cancer cells (103) is consistent with the association between resistance to the MET inhibitor KRC-108 and transition to an epithelial phenotype (104). The Transcriptomine NDRG1 Regulation Report (https://goo.gl/F76hnk) (accessed by clicking on the Transcriptomine query link in the FCI window of one of the NDRG1 data points) highlighted repression of Ndrg1 by the NR Rev-erbAα/Nr1d1 (data file S1, Query 17) (105, 106).
These data points are consistent with the transcriptional repression of NDRG1 by iron (107) and the identity of heme as a Rev-erbAα/NR1D1 agonist (108, 109). Moreover, heme oxygenase, which depletes cellular heme abundance, has been associated in gastric cancer cells with resistance to apoptosis (110), itself a mechanism of cancer drug resistance. Epidemiological evidence indicates an inverse relationship between the incidence of gastric cancer and body iron stores (111). Collectively, the Transcriptomine NDRG1 data points and the subsequent focused literature mining suggest the hypothesis that repression by Rev-erbAα/NR1D1 signaling of NDRG1 is a possible basis for combination therapy to circumvent acquisition of resistance to MET inhibitors in gastric cancer (Fig. 3, Use Case 3).

Use Case 4 focuses on spermine oxidase, encoded by the SMOX gene, which catalyzes the oxidation of spermine to spermidine, and is an important component of the polyamine stress response (PSR) (Fig. 3, Use Case 4) (112). The Transcriptomine SMOX Regulation Report (https://goo.gl/tQzDxN) contained data points suggesting its regulation by multiple NR signaling pathways in various biological contexts (Fig. 3, Use Case 4, and data file S1, Query 18). First, the RAR and retinoids signaling section of the SMOX Regulation Report contained nearly 30 data points, indicating its induction during a time course study of all-trans retinoic acid (ATRA) in HL-60 leukocytes (113, 114). This is in contrast to a conventional literature search, which failed to locate any published studies connecting SMOX with the RARs and retinoids signaling pathway. The RARs and retinoids pathway induces terminal differentiation of HL-60 cells (115–117), and spermidine is required for this differentiation (118, 119). The Transcriptomine data points therefore constitute a basis for the hypothesis that ATRA drives terminal differentiation in HL-60 cells in part through induction of SMOX to maintain cellular spermidine concentrations. Second, data points in the SMOX Regulation Report reflected its induction by the GR and glucocorticoids signaling pathway and suggested a mechanistic basis for the relationship between glucocorticoids and the PSR (120). Finally, data points in the AR and androgens pathway section of the SMOX Regulation Report indicated repression of SMOX after disruption of AR signaling in skeletal muscle (121, 122). Given the contribution of androgen signaling to skeletal muscle development (123) and the status of SMOX as a marker of myogenic differentiation (124), Transcriptomine helped develop the hypothesis that the promotion of skeletal muscle differentiation by the AR and androgens pathway is mediated in part by induction of SMOX.

ChIP-Seq evidence from the Gene Transcription Regulation Database supports Transcriptomine Use Cases

Finally, we wished to supplement the transcriptomic evidence used to model the signaling pathway–gene relationships described above using cistromic evidence from the Gene Transcription Regulation Database (GTRD) (125), a compendium of uniformly processed chromatin immunoprecipitation sequencing (ChIP-Seq) data sets from the National Center for Biotechnology Information (NCBI) Short Read Archive. Of the 15 receptor-gene or gene family relationships modeled in the Use Cases, and for which comparable data sets were available in GTRD (regulation of GREB1 by PPARγ/PPARG; regulation of genes encoding UPR pathway members by ERα/ESR1, AR, GR/NR3C1, and PPARγ/PPARG; regulation of GJA1 and VCAN by PPARγ/PPARG; regulation of genes encoding MMP and ADAMTS family members by PPARγ/PPARG and RORα/RORA; regulation of NDRG1 by Rev-erbAα/NR1D1; and regulation of SMOX by GR/NR3C1, RARα/RARA, and AR), the GTRD provided evidence of receptor promoter binding sites in 13. Specifically, GTRD queries identified binding sites for PPARγ/PPARG in the GREB1 gene (fig. S6A); for ERα/ESR1 in the human HSPA5 (fig. S6B), XBP1 (fig. S6C), and SERP1 (fig. S6D) genes (Fig. 3, Use Case 1); for AR in the HSPA5 (fig. S6E), XBP1 (fig. S6F), and SERP1 (fig. S6G) genes; for GR/NR3C1 in the HSPA5 (fig. S6H) and SERP1 (fig. S6I) genes; for PPARγ/PPARG in the XBP1 gene (fig. S6J); for PPARγ/PPARG in the mouse Gja1 (fig. S6K) and human VCAN (fig. S6L), MMP3 (fig. S6M), MMP14 (fig. S6N), ADAMTS1 (fig. S6O), ADAMTS2 (fig. S6P), ADAMTS4 (fig. S6Q), and ADAMTS5 (fig. S6R) genes (Fig. 3, Use Case 2); for Rev-erbAα/Nr1d1 in the mouse Ndrg1 gene (Fig. 3, Use Case 3, and fig. S6S); and for GR/NR3C1 (fig. S6T), RARα/RARA (fig. S6U), and AR (fig. S6V) in the human SMOX gene (Fig. 3, Use Case 4).
Although no evidence was found in the GTRD for RORα/RORA binding sites in genes in the MMP or ADAMTS families, previous reports of DNA binding–independent regulation of gene expression by RORα/RORA (126) indicate that direct transrepression of PPARγ/PPARG or another pathway by RORα/RORA cannot be excluded as a mechanism in these contexts.
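
The tally of 13 supported relationships out of 15 can be checked with a short script. The following is a minimal sketch: the split of the parenthetical list into 15 receptor-gene or receptor-gene family pairs is our reading of the text (it reproduces the stated totals), and the dictionary layout is illustrative rather than part of Transcriptomine or the GTRD.

```python
# Receptor-gene (or gene family) relationships modeled in the Use Cases.
# True = GTRD binding-site evidence reported in the text; False = none found.
# The split into 15 pairs is an assumed reading of the list in the text.
relationships = {
    ("PPARG", "GREB1"): True,
    ("ESR1", "UPR genes"): True,
    ("AR", "UPR genes"): True,
    ("NR3C1", "UPR genes"): True,
    ("PPARG", "UPR genes"): True,
    ("PPARG", "GJA1"): True,
    ("PPARG", "VCAN"): True,
    ("PPARG", "MMP family"): True,
    ("PPARG", "ADAMTS family"): True,
    ("RORA", "MMP family"): False,
    ("RORA", "ADAMTS family"): False,
    ("NR1D1", "NDRG1"): True,
    ("NR3C1", "SMOX"): True,
    ("RARA", "SMOX"): True,
    ("AR", "SMOX"): True,
}

total = len(relationships)
supported = sum(relationships.values())
unsupported = [pair for pair, found in relationships.items() if not found]

print(f"{supported} of {total} relationships have GTRD binding-site evidence")
print("No evidence found for:", unsupported)
```

Under this reading, the two unsupported pairs are the RORα/RORA relationships with the MMP and ADAMTS families.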

DISCUSSION

Although other transcriptomic databases exist (127, 128), Transcriptomine differs from these by organizing the data points in the original data sets into biologically meaningful pathway and biosample classes. Our Use Cases demonstrate the routine but detailed insights into mechanistically underdeveloped aspects of NR signaling biology that can be achieved through reuse of appropriately biocurated transcriptomic data sets. That transcriptomic data sets have biological value beyond the specific contexts in which they were originally generated is not entirely surprising and is not the primary focus of this article. Rather, it is the ease and intuition with which their underlying data points are made available to the community for reuse, sharing, and citation that represents the true value of Transcriptomine. It should be stressed that Transcriptomine data points do not per se meet the rigorous standards for unambiguously establishing mechanism and require detailed validation in subsequent bench experiments. That said, the reduction in time and effort realized through Transcriptomine in locating, connecting, and interpreting these data points greatly enhances their accessibility and usability by the research community.

Aside from their enhancement of Transcriptomine usability, our biocuration efforts have additional value in shedding light on the relative volumes of data points representing different pathways and physiological systems. Certain experimental paradigms, most notably the study of the ERs and estrogens pathway in MCF-7 cells, are heavily overrepresented among expression array data sets. Such redundancy is not necessarily undesirable because the statistical power of numerous independent data sets helps suppress the technical and biological noise inherent in such experiments. That said, our high-level overview highlights large discrepancies in coverage of particular signaling pathways and physiological systems by ‘omics data sets. Given the major public health impact in Western nations of diseases of the renal system, for example, the underrepresentation of archived data sets involving manipulations of the mineralocorticoid receptor (MR) and mineralocorticoid pathway (Fig. 1A), or performed in kidney model systems (classified in Fig. 1B in the Other category), is surprising. Similarly, publicly archived ‘omics data sets relevant to VDR signaling in bone are in remarkably short supply, whereas others targeting receptors such as TR2 and TR4, for example, are virtually nonexistent. We suggest that this disparity in coverage could be addressed by redistributing funding to generate and archive reference discovery-scale data sets for signaling pathways that, although currently less well characterized, have the potential to provide important insights into the regulation of physiology by the NR superfamily.

Although recognized as a reliable surrogate for cellular protein abundance (129, 130), global measures of transcript relative abundance constitute only one of the ‘omics modalities currently available for interrogating the biology of NR signaling pathways. Data sets documenting the impact of NR pathway manipulations on transcription factor DNA binding events (ChIP-Seq), protein-protein interactions (interactomics), posttranslational modification (PTMomics), and cellular metabolite levels (metabolomics) are being generated in increasing quantities. Equally, NR signaling pathways constitute only a subset of the signal transduction pathways that affect mammalian cells, and investigators studying other pathways have been equally prodigious in their generation of discovery-scale data sets. Adding to the growing volume of ‘omics-scale data sets, the clinical research community is generating and making publicly available discovery-scale profiles of molecular events accompanying pathogenesis in cancer and other disease states (131). The biocuration and web design principles outlined in this paper are readily abstractable to integration of these diverse classes of ‘omics data sets and, pending future funding, will support expansion of our resource into a pan-omics discovery tool for cellular signaling pathways and human disease.

MATERIALS AND METHODS

Designation of a data set

Designation of a group of experiments as a data set occurs on a case-by-case basis after closely studying the associated publication. In general, a data set is defined as a group of experiments associated with a single PubMed Identifier (PMID) in a single unique biosample. The term biosample was adopted to align our terminology with that currently in use by NCBI (132) and European Bioinformatics Institute (133) and refers here to the tissue or cell type from which the assay starting material (in the case of transcriptomic data sets, mRNA) was derived. The exception to this rule is where the data set is specifically designed to compare the transcriptomes of two or more distinct biosamples. On occasion, however, GEO data set depositions contain groups of experiments from two different biosamples that bear no relation to each other in the context of the associated article. In these instances, experiments from distinct biosamples are treated as distinct data sets, with the same PMID.
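
The general rule described above can be sketched as a grouping of experiments by PMID and biosample. This is an illustration only: actual designation is made case by case by a curator, and the record fields (`pmid`, `biosample`, `cross_biosample_design`), the function name, and all values are hypothetical placeholders, not the NURSA data model.

```python
from collections import defaultdict


def designate_data_sets(experiments):
    """Group experiment records into data sets keyed by (PMID, biosample).

    Experiments flagged as belonging to a design that deliberately compares
    biosamples are kept together under a single key for their PMID.
    """
    data_sets = defaultdict(list)
    for exp in experiments:
        if exp["cross_biosample_design"]:
            key = (exp["pmid"], "multiple biosamples")
        else:
            key = (exp["pmid"], exp["biosample"])
        data_sets[key].append(exp["name"])
    return dict(data_sets)


# Hypothetical records; the PMIDs and names are placeholders.
experiments = [
    {"pmid": "PMID-A", "biosample": "liver", "cross_biosample_design": False, "name": "exp1"},
    {"pmid": "PMID-A", "biosample": "liver", "cross_biosample_design": False, "name": "exp2"},
    {"pmid": "PMID-A", "biosample": "MCF-7", "cross_biosample_design": False, "name": "exp3"},
    {"pmid": "PMID-B", "biosample": "liver", "cross_biosample_design": True, "name": "exp4"},
    {"pmid": "PMID-B", "biosample": "kidney", "cross_biosample_design": True, "name": "exp5"},
]

data_sets = designate_data_sets(experiments)
for key, names in data_sets.items():
    print(key, names)
```

Here the unrelated liver and MCF-7 experiments under PMID-A become two data sets sharing one PMID, whereas the deliberate liver-versus-kidney comparison under PMID-B stays a single data set.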

Data set naming convention

Data set names contain four key elements: (i) the ‘omics category of the data set: currently predominantly transcriptomic but positioned for expansion into other ‘omics data set categories; (ii) key regulatory molecules (full names spelled out along with familiar and approved symbols, where these exist, and which are used in the data set description and in its constituent experiment names and descriptions); (iii) a brief reference to the organ or organ system—providing either the cell line name or the organ or tissue name for animal studies; and (iv) where appropriate, a reference to specific data set designs (for example, time course or dose dependence).

Data set description convention

Data set descriptions provide the orientation and exposition required for a user to place the experimental results in a clear biological context: This is particularly important in the case of animal and cell models with which the user may not be familiar. In addition to enhancing their discoverability through text-based search engines, the adoption of these semantic standards for data set names and descriptions is designed to ensure a consistent, predictable experience for users of Transcriptomine as they browse from one data set to the next. This is in contrast to the names of the primary archived data sets, which are often simply the name of the associated article, and often give only an impression of the overall experimental design.

Experiment naming convention

Disambiguation of experiment names in the Nuclear Receptor Signaling Atlas (NURSA) transcriptomic data sets is an essential part of the user experience, for example, when experiment names are viewed side by side in the gene list drop-down on individual data set pages or when fold changes are compared in a Transcriptomine scatterplot. Equally, removing redundancy in experiment names makes for a cleaner, less visually cluttered UI. To accommodate the high degree of complexity in experimental design in a consistent human- and machine-readable manner, we developed an experiment naming convention in which up to four defined elements are combined to convey the essential elements of an experimental contrast while making the experiment name unique within a specific data set (table S4). The first element is the regulatory molecule core contrast. Regulatory molecules are small molecules or genes subjected to one or more manipulations, identified using codes (table S5). Genes in the NR signaling field are referred to using familiar symbols and, with increasing frequency, by their approved gene symbols. Accordingly, genes are identified using the name most commonly in use in the field and the species-appropriate approved gene symbol. The second element is the small-molecule treatment concentration/dose and duration. Small molecule–regulated transcriptomes are often compared across multiple concentrations and doses and durations of treatment, each represented by an individual experiment in a larger data set. In these experiment names, the regulatory molecule is separated by a pipe (|) from the concentration/dose and duration information. The third element relates to molecules on both sides of a contrast. Experimental contrasts are often set up with molecules on both sides of the contrast to provide for comparison of a given differential expression data point between two or more biological contexts. 
Molecules that appear on both sides of the contrast are indicated in parentheses after the core contrast (and the Concentration/Dose and Duration, if present), along with a manipulation code as required. Multiple molecules are separated by a “+” sign. In the case of data sets comparing gene manipulations with the wild-type state, control experiments are designated (WT) to indicate an animal wild-type for the gene of interest. The fourth element relates to Additional Disambiguators, which appear after a dash (-) at the end of the Experiment Name to make an Experiment Name unique within a Data set. Examples of additional disambiguators include an abbreviated biosample, duration, gender, strain, or some other relevant annotations.
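
The four-element convention can be illustrated by a small helper that assembles a name from its parts. This is a sketch under stated assumptions: the exact spacing around the pipe, parentheses, and dash, the manipulation code `KO`, and the molecule names are hypothetical (the authoritative formats and codes are in tables S4 and S5), and the function names are ours.

```python
def build_experiment_name(core_contrast, dose_duration=None,
                          both_sides=(), disambiguator=None):
    """Assemble an experiment name from up to four elements.

    1. core_contrast: regulatory molecule core contrast
    2. dose_duration: concentration/dose and duration, after a pipe
    3. both_sides: molecules on both sides of the contrast, in parentheses,
       joined by "+"
    4. disambiguator: additional disambiguator, after a dash
    """
    name = core_contrast
    if dose_duration:
        name += f" | {dose_duration}"
    if both_sides:
        name += " (" + " + ".join(both_sides) + ")"
    if disambiguator:
        name += f" - {disambiguator}"
    return name


def all_unique(names):
    """Check that every experiment name is unique within a data set."""
    return len(names) == len(set(names))


# Hypothetical experiments in one data set.
names = [
    build_experiment_name("MoleculeA", "10 nM 4 h", ["MoleculeB"], "male"),
    build_experiment_name("MoleculeA", "10 nM 4 h", ["MoleculeB"], "female"),
    build_experiment_name("GeneX KO", both_sides=["MoleculeA", "MoleculeB"]),
]
for name in names:
    print(name)
print("Unique within data set:", all_unique(names))
```

The first two names differ only in the trailing disambiguator, which is the role the fourth element plays in keeping names unique within a data set.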

Receptor-ligand biocuration

Where available, we adopted mappings curated by the International Union of Pharmacology Guide to Pharmacology (GtoPdb) resource (134) or DrugBank (135). Where mappings were not available from either of these resources, we created these de novo based on biocuration of the literature on a specific ligand or regulatory small molecule (136).
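
The fallback logic described here can be sketched as an ordered lookup. The order in which GtoPdb and DrugBank are consulted is an assumption (the text does not state a precedence between them), and the tables, ligand names, and receptor names below are placeholders rather than real curated mappings.

```python
def resolve_mapping(ligand, gtopdb, drugbank, de_novo):
    """Return (receptor, source) for a ligand, trying each source in turn.

    The GtoPdb-before-DrugBank order is assumed for illustration; de novo
    literature-based mappings are used only when neither resource has one.
    """
    sources = (("GtoPdb", gtopdb), ("DrugBank", drugbank), ("de novo", de_novo))
    for source_name, table in sources:
        if ligand in table:
            return table[ligand], source_name
    return None, None


# Placeholder tables; not real mappings.
gtopdb = {"ligand_1": "receptor_A"}
drugbank = {"ligand_1": "receptor_A", "ligand_2": "receptor_B"}
de_novo = {"ligand_3": "receptor_C"}

for ligand in ("ligand_1", "ligand_2", "ligand_3", "ligand_4"):
    print(ligand, resolve_mapping(ligand, gtopdb, drugbank, de_novo))
```

A ligand absent from all three tables returns `(None, None)`, marking it as still requiring curation.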

Biosample (tissue and cell line) biocuration

Biosample biocuration adopts a practical, functional approach that is predicated less upon the anatomical origin of a particular organ or tissue and more on the systemic physiological function of its constituent tissues and cell types. Leukocytes, for example, are located primarily in the cardiovascular compartment, but are almost invariably studied in the context of the immune system, and are classified as such. Biosamples occasionally undergo modifications designed to recapitulate a specific functional or physiological context, for example, treatment of cells with adipogenic (66) or inflammatory (137) chemical cocktails. To suppress complexity and clutter in the query and visualization UIs, such modifications are not encoded in the biosample vocabulary but rather are described in detail in the Experiment-level metadata section of Materials and Methods.

Regulatory molecule unique identifiers

Experiments are mapped to regulatory molecules using unique identifiers for small-molecule (PubChem) or gene (Entrez Gene ID) manipulations. Similarly, all experiments are mapped to unique identifiers for major anatomical and cell ontologies, including UBERON (138), CLO (139), and BRENDA (140).
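
One way to picture these mappings is as a small annotation record per experiment. The sketch below is illustrative only: the class and field names are hypothetical, the identifier strings are placeholders rather than real PubChem, Entrez Gene, or ontology IDs, and it does not reflect the actual NURSA database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegulatoryMolecule:
    name: str
    kind: str        # "small_molecule" or "gene"
    identifier: str  # placeholder ID in the namespace implied by `kind`

    @property
    def namespace(self):
        # Small molecules map to PubChem, genes to Entrez Gene, as in the text.
        return {"small_molecule": "PubChem", "gene": "Entrez Gene"}[self.kind]


@dataclass(frozen=True)
class ExperimentAnnotation:
    experiment: str
    molecules: tuple      # RegulatoryMolecule instances
    biosample_terms: dict  # ontology name -> placeholder term ID


# Placeholder values throughout.
ligand = RegulatoryMolecule("MoleculeA", "small_molecule", "PUBCHEM-PLACEHOLDER")
gene = RegulatoryMolecule("GeneX", "gene", "ENTREZ-PLACEHOLDER")
annotation = ExperimentAnnotation(
    experiment="GeneX KO (MoleculeA)",
    molecules=(ligand, gene),
    biosample_terms={"UBERON": "UBERON:PLACEHOLDER"},
)

for molecule in annotation.molecules:
    print(molecule.name, "->", molecule.namespace, molecule.identifier)
```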

Experiment descriptions

Experiment descriptions are assigned using a systematic approach similar to that adopted for the description of their parent data set, which provides for a consistent and predictable user experience when browsing between different experiments (fig. S1B).

Experiment Numbers

The assignment of Experiment Numbers to Experiments is an important step in the annotation and curation process because it determines how experiments will be grouped relative to each other in the UI, for example, in the pull-down menus that will be used to browse the individual gene lists (fig. S1B). The order of display of experiments is particularly important in data sets documenting the dose/concentration or time dependence of a given small molecule–regulated transcriptome.
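
For a dose- or time-dependence data set, the effect of Experiment Numbers on display order can be sketched as a sort followed by numbering. The sort key (dose first, then duration), the field names, and the example values are assumptions for illustration; the text does not specify how curators order experiments.

```python
def assign_experiment_numbers(experiments):
    """Number experiments so that they display by increasing dose, then
    increasing duration. Returns a dict of experiment number -> name."""
    ordered = sorted(experiments, key=lambda e: (e["dose_nM"], e["duration_h"]))
    return {number: e["name"] for number, e in enumerate(ordered, start=1)}


# Hypothetical experiments, listed in arbitrary deposition order.
experiments = [
    {"name": "MoleculeA | 100 nM 4 h", "dose_nM": 100, "duration_h": 4},
    {"name": "MoleculeA | 10 nM 24 h", "dose_nM": 10, "duration_h": 24},
    {"name": "MoleculeA | 10 nM 4 h", "dose_nM": 10, "duration_h": 4},
]

numbered = assign_experiment_numbers(experiments)
for number, name in numbered.items():
    print(number, name)
```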

NURSA web application

The NURSA hub is a gene-centric Java Enterprise Edition 6, web-based application around which other gene, RNA, protein, drug/ligand, clinical trial, disease, and other data from dozens of external databases are collected. This resource was redesigned to improve the UI and experience, particularly through implementation of responsive design frameworks. Transcriptomine is one component of this resource and links with NURSA Molecule pages, on which gene, RNA, protein, disease, and other data are summarized, and Data Set pages. All software is freely available at www.github.com/BCM-DLDCC/nursa.

After undergoing semiautomated processing and biocuration as described above, the data and annotations are stored in NURSA’s Oracle 11g database. RESTful web services expose Transcriptomine data, which are served to responsively designed views in the UI. These views were created using a Flat UI Toolkit with a combination of JavaScript, D3.JS, AJAX, HTML5, and CSS3. JavaServer Faces and PrimeFaces are the primary technologies behind the UI. Transcriptomine has been optimized for Firefox 24+, Chrome 30+, Safari 5.1.9+, and Internet Explorer 9+, with validations performed in BrowserStack and load testing in LoadUIWeb. XML describing each data set and experiment is generated and submitted to CrossRef to mint DOIs.

Programmatic access through API

Application programming interface (API) documentation is available on the NURSA website at www.nursa.org/nursa/rs/index.jsf.

Literature searches

All literature searches carried out in developing the Use Cases involved reasonable effort on the part of the curator and employed resources typically used by bench scientists, such as PubMed and Google.

SUPPLEMENTARY MATERIALS

www.sciencesignaling.org/cgi/content/full/10/476/eaah6275/DC1

Fig. S1. The NURSA data set page.

Fig. S2. Transcriptomine query form.

Fig. S3. Transcriptomine GREB1 Regulation Report (Pathway view).

Fig. S4. Transcriptomine CA12 Regulation Report (Biosample view).

Fig. S5. FCD window.

Fig. S6. Corroborating receptor promoter binding evidence from the GTRD.

Table S1. Receptor and small-molecule signaling pathway mappings.

Table S2. Cell line and tissue biosample mappings to physiological systems and organs.

Table S3. Examples of animal and cell models and clinical data sets.

Table S4. Experiment naming convention.

Table S5. Nonstandard abbreviations in NURSA experiment names.

Data file S1. Use Case Query parameters and search results.

REFERENCES AND NOTES

Acknowledgments: We thank current members of the NURSA Beta Testing Group and appreciate valuable discussions with D. Steffen. We thank the past and present NURSA Program staff, R. Margolis and C. Silva [National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK)] and K. Yoshinaga [National Institute of Child Health and Human Development (NICHD)]. We also thank A. Means and S. Mullican for critical reading of the manuscript. Funding: NURSA is supported by grants from the NIDDK (DK097748), with supplemental funding from the Eunice Kennedy Shriver NICHD and the NIH Big Data to Knowledge (BD2K) program (DK097748-S1), and by support from the Biomedical Informatics Group of the Biostatistics and Informatics Shared Resource of the Dan L. Duncan Cancer Center, funded by National Cancer Institute grant P30CA125123. Author contributions: L.B.B. led the database and web development team. S.A.O. led the biocuration team. Y.F.D. was the day-to-day project manager. W.H.K. developed the UI. A.M. and A.N. were the software architects. M.D. was the systems administrator. N.J.M. and S.A.O. created the concept of Transcriptomine. L.B.B., S.A.O., Y.F.D., and N.J.M. designed the project. L.B.B. and N.J.M. wrote the manuscript. N.J.M. led the project. Competing interests: The authors declare that they have no competing interests.