ST NetWatch: Bioinformatics Resources

AltAnalyze is a multifunctional tool that can be used to predict alternative splicing from splicing-sensitive microarrays, such as exon and splice junction arrays. Users input raw or processed array data to determine alternative splicing or alternative promoter usage within the data set and assess how these variations affect protein sequence, domain composition, and sequences targeted by microRNAs. Although its primary use is for analyzing data from splicing-sensitive microarrays, AltAnalyze can also be used with data from conventional microarrays for normalization, statistical analyses, and determining pathway and Gene Ontology (GO) class overrepresentation. The site includes instructions and tutorials for using the software, FAQs, and a wiki. The AltAnalyze software is free and can be run with Windows, Mac, or Linux operating systems.
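The GO class overrepresentation analysis mentioned above can be illustrated with a one-sided hypergeometric test, a common choice for this kind of question. This is only a minimal sketch: the function name and the counts are hypothetical, and nothing here is taken from AltAnalyze's own code or documentation.

```python
from math import comb  # requires Python 3.8 or later


def hypergeom_enrichment_p(list_hits, list_size, category_size, background_size):
    """Probability of seeing at least `list_hits` category members in a gene
    list of `list_size` genes drawn without replacement from a background of
    `background_size` genes, `category_size` of which belong to the category.

    Illustrative helper only; not part of any tool described in this listing.
    """
    max_hits = min(list_size, category_size)
    total = comb(background_size, list_size)
    # Sum the upper tail of the hypergeometric distribution.
    tail = sum(
        comb(category_size, k) * comb(background_size - category_size, list_size - k)
        for k in range(list_hits, max_hits + 1)
    )
    return tail / total


# Hypothetical counts: 3 of 3 listed genes fall in a 4-gene category,
# against a background of 10 genes.
p_value = hypergeom_enrichment_p(3, 3, 4, 10)
print(p_value)
```

In practice the resulting p-values would also need correction for the number of categories tested, which this sketch leaves out.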
aMAZE is a system for representing and analyzing molecular interactions and cellular processes. The project's mission is to provide an efficient system that makes this information ready for computer analysis. The signaling pathways (SigTrans) are still in the "Old Light Bench" and will be ported to the "Light Bench" in the near future. The metabolic pathways are not completely available through the "Light Bench" interface.
The Biological General Repository for Interaction Datasets (BioGRID) is a collection of protein-protein and genetic interactions curated from published data from both high-throughput and more-focused studies. The interactions include those documented in many species, such as common model organisms and humans. Entering a protein or gene name into the search form returns a list of binding partners for the protein of interest and a list of genetic interactions for the gene of interest. The method by which each interaction was identified and references for each reported interaction are indicated. The “Publication Summary” for each reference notes the interactions that were reported in that study. The contents of the database can be downloaded, and plugins that facilitate the import of BioGRID data into Cytoscape are available. A help wiki provides support for searching, defines terms used in the database, and explains the curation procedures and how to contribute to BioGRID.
Bioinformatics Resource Manager (BRM)
Managing and processing large data sets can be challenging, even more so if the data sets to be compared are heterogeneous. The Bioinformatics Resource Manager (BRM) is a suite of tools for combining data from multiple types of experiments into a single spreadsheet and converting data between various formats, thus allowing different software tools to share the same information. For example, a BRM user may import microarray and mass spectrometry data sets, use automated data mining to annotate each piece of data, merge the data sets, and then analyze them together. BRM was developed and is distributed and updated by the Pacific Northwest National Laboratory (PNNL) of the Department of Energy (DOE). The BRM software is freely available for use, though users must complete an online licensing agreement in order to obtain a username and password for downloading and installing the client that will allow them to run the software.
Biological Abstraction Layer for Simulation Analysis (BALSA)
BALSA is a project of the Cell Systems Initiative of the University of Washington. BALSA is a symbolic language for representing biological processes, including cell signaling. One of the examples is the canonical Wnt/beta-catenin Pathway from Science Signaling. BALSA is currently a prototype of a Java-based cell systems modeling environment. A cursory explanation of BALSA and its support software, Bioglyphics, can be found on the presentation poster, which is available for viewing. Additional information can be obtained from Paul Loriaux.
Biology Workbench
Maybe you ought to think twice before you buy that bioinformatics package. After all, there are some pretty sophisticated tools freely available on the web. At the Biology Workbench, the San Diego Supercomputer Center brings you free access (with registration) to a range of on-line databases and bioinformatics tools. At the Biology Workbench you can import your own sequence or search multiple databases in one operation and download sequences of interest. Then you'll be able to edit and manipulate your sequences and select from over 30 Protein Tools, over 20 Nucleic Tools, and 10 Alignment Tools. There is no need to worry about file format compatibility and controlling multiple interfaces. That's done behind the scenes, leaving you with a single, relatively simple point-and-click interface. (Hint: if you're faced with a bewildering array of buttons, click the banner at the top to toggle to a more informative menu display.) The Workbench will even store your data, ready for your next session. A tutorial with material particularly suited to educators is available at the Biology Workbench Investigation Portal.
BioPAX: Biological Pathways Exchange
The BioPAX site offers extensive documentation and developmental plans for BioPAX, an open-source representation of metabolic pathway information. Available at the site are user guides, a tutorial, a sizable list of presentation PDFs, a document repository, and a basic roadmap for the software. BioPAX's developers encourage feedback by inviting users to join a discussion list. They also provide full access to the project's past, present, and future plans for development.
Cancer Gene Census
The Cancer Gene Census is a catalog of genes with mutations that have been causally linked to cancer and is compiled and maintained by the Sanger Institute's Cancer Genome Project. View the entire working list or subsets of the collection sorted by criteria such as chromosomal map position or type of genetic lesion that has been associated with cancer. Information for each gene includes a link to its Entrez Gene entry, chromosomal map position, and the types of tumors or syndrome associated with mutations in that gene. The catalog is not searchable, but the lists are presented in alphabetical order by gene symbol, and an Excel file containing the complete census may be downloaded for searching and browsing offline. Supplemental tables, which are also downloadable Excel files, include information about Pfam protein domains present in the proteins encoded by genes included in the census. Genes are included in the census if there are at least two independent published reports of mutations associated with cancer samples from patients; genes for which changes in timing or abundance of expression have been associated with cancer are not included. The census is ongoing and updated regularly.
Cancer Genome Anatomy Project (CGAP)
The National Cancer Institute's Cancer Genome Anatomy Project (CGAP) provides a series of interconnected modules that allow access to all CGAP data, bioinformatic analysis tools, and biological resources. The stated goal is "to determine the gene expression profiles of normal, precancer, and cancer cells, leading eventually to improved detection, diagnosis, and treatment for the patient." The pathways module utilizes the pathways from BioCarta and KEGG (Kyoto Encyclopedia of Genes and Genomes). Each protein encoded by a human gene in a BioCarta pathway and each human enzyme in a KEGG pathway is then linked to its CGAP Gene Info page, and each intermediary metabolite in the KEGG pathways is linked to a CGAP Compound Info page. The SAGE Genie module contains various tools for querying information about human and mouse gene expression profiles in normal and cancerous tissues and cell lines. The SAGE anatomic viewer allows users to view the relative expression of a given gene in normal and malignant tissues of the human body. The output is an intuitive visual display and the queries can be performed using keywords and gene names, which are automatically mapped to SAGE or SNP tags. The RNAi section allows users to search for and map onto the target gene interfering RNAs that can be used for silencing specific genes. The output is user friendly and easy to understand.
Catalogue of Somatic Mutations in Cancer (COSMIC)
The Catalogue of Somatic Mutations in Cancer (COSMIC), part of the Sanger Institute’s Cancer Genome Project, is a manually curated database of somatic mutations that have been identified in various human cancers. The goal of the database is to provide an overview of the frequencies of mutations in specific genes by collating information from published reports. Readers can search the database by gene name, tissue type, or tumor sample identifier, or browse the collection by tissue type or gene name. Genes that are also included in the Cancer Gene Census, a collection of mutations causally linked to cancer, are noted. The nature of each mutation is described, as well as information about the tissue samples in which the mutations have been documented. Links are provided to information about the affected genes and their genomic neighborhood in external databases, such as Ensembl, Entrez Gene, Swiss-Prot, and OMIM. References describing the identification and study of the mutation and its effect on cell growth and tumor progression are cited. The "Additional Information" page explains how genes were selected for inclusion in the database and how genetic information from multiple sources was combined.
cBioPortal for Cancer Genomics
The cBio Cancer Genomics Portal provides tools for analyzing and visualizing data from large-scale cancer genomics projects. The database includes data sets from multiple sources, notably The Cancer Genome Atlas (TCGA). Genomic data types integrated by cBioPortal include somatic mutations, DNA copy-number alterations (CNAs), mRNA and microRNA expression, DNA methylation, protein abundance, and phosphoprotein abundance. Users may search for information about user-specified genes of interest within a single data set or within all available data sets. Searches of single data sets can be refined by patient and tumor attributes such as tumor subtype, type of data available, and clinical outcome. Search results may be analyzed and visualized by several different tools, each of which offers refinement options to customize how the results are presented. Results of the analysis and queries can be downloaded in various formats. A feature of the portal is the patient view, which summarizes and visualizes all relevant data about a specific tumor, including clinical characteristics, summaries of the extent of mutations and copy-number alterations, as well as details about mutated, amplified, and deleted genes. Programmatic access to the data is available, and information for querying the database through the R and MATLAB data analysis tools is provided. A Protocol in the 2 April 2013 issue of Science Signaling and tutorials on the site explain how to submit queries and describe the various analysis tools and results displays. The cBio Cancer Genomics Portal is developed and maintained by the Computational Biology Center at Memorial Sloan-Kettering Cancer Center and the Information Visualization Research Group at Bilkent University.
CellML is an XML-based markup language being developed by the Bioengineering Institute at the University of Auckland. CellML is being developed to facilitate the storage and exchange of computer-based mathematical models, including biological models. CellML incorporates existing languages including MathML and RDF to encode mathematical information. The site provides an extensive repository of viewable pathway models, a number of different tutorials and best practice guides, and free downloads of CellML-supporting tools.
ChEMBL is part of the European Molecular Biology Laboratory’s European Bioinformatics Institute (EMBL-EBI) collection of databases and data analysis tools. This database includes structural, chemical, bioactivity, and target information for over a million small bioactive compounds. Use the quick search box on the home page to search the database with the name of a compound or target or with an assay keyword; use the advanced search functions to search the database by target protein sequence, compound identifier, or chemical structure. Users may browse a list of commercially marketed drugs, a list of recently approved drugs, and a list of target proteins sorted by type (for example, receptors, transporters, transcription factors, and adhesion proteins). The data in ChEMBL are abstracted and curated from published reports and include four specialty databases in addition to the main ChEMBL database. Kinase SARfari and GPCR SARfari include structure-activity relationship (SAR) information for compounds that target kinases and G protein–coupled receptors (GPCRs), respectively. Neglected tropical disease (NTD) and malaria databases include data on compounds relevant to endemic tropical diseases. ChEMBL also includes a structure-based search engine, DrugEBIlity, for identifying potentially druggable protein domains from published or user-supplied three-dimensional protein structures. ChEMBL is funded by the Wellcome Trust and hosted by EMBL-EBI.
Database for Annotation, Visualization and Integrated Discovery (DAVID)
DAVID provides online functional annotation tools for the analysis of lists of genes and may be especially useful in interpreting microarray data and high-throughput proteomics data. The site was created and is maintained by investigators at the National Institute of Allergy and Infectious Diseases, NIH. Lists of genes are available on the site to demonstrate the functionality. Some of the annotation uses cryptic labels; however, help is available with each of the tools, and the FAQ section of the technical center provides useful information for getting started. The tools rely on information from several public databases for clustering and annotating the genes.
Developmental Therapeutics Program (DTP)
For signaling researchers who focus on cancer, the Developmental Therapeutics Program (DTP) from the National Cancer Institute (NCI) and the National Institutes of Health (NIH) is a source of reagents and services for identifying compounds with anti-tumor activity. Researchers may order cells from the tumor catalog, acquire microarrays of sections of formalin-fixed, paraffin-embedded clinical tumor samples, or obtain samples from various NIH-sponsored repositories of chemical and biological substances. "Discovery Services" offered by the DTP also include in vitro and in vivo screening of compounds for anti-tumor activity and toxicity. Researchers may submit their own compounds for screening or choose from those available from the NIH-sponsored repositories. Data from anti-tumor screens are publicly available, as are data from screens for compounds that inhibit yeast growth and for compounds that show anti-HIV activity. A database of structural information includes compounds tested by the DTP or deposited in NIH-sponsored repositories, and there is a database of quantitative and qualitative measurements of molecular targets, such as changes in gene expression or protein activity, collected in DTP screens. The DTP provides links to tools for mining DTP data, including the COMPARE tool, with which users can search the DTP databases for compounds that show an activity profile similar to a compound or molecular target of interest. "Development Services" include information on partner organizations for every step of getting test compounds to clinical trial: designing and testing synthesis strategies, formulating dosages, and evaluating the toxicity, metabolism, and pharmacokinetics of test compounds.
DisEMBL Protein Disorder Prediction/Predictor
DisEMBL is a computational tool for predicting the probability that particular regions within a protein sequence will be disordered (unstructured). Such unstructured regions frequently contain short motifs involved in protein-protein interactions and targeting. Because these regions may also affect protein stability and expression, DisEMBL may be useful in designing protein constructs.
A simple search of DrugBank by drug name (the default setting on the search tool) will tell you what biological processes a particular drug affects. More useful for most signaling researchers, however, is the ability to search the site by gene name or sequence to find drugs implicated in a particular signaling pathway. More sophisticated search tools allow users to search the database by chemical structure, formula or molecular weight. The database includes information for many FDA-approved and experimental drugs, biotech drugs, illicit substances and small molecule drugs. DrugBank is supported by Genome Alberta and Genome Canada in cooperation with GenomeQuest, Inc.
EMBL-EBI: European Bioinformatics Institute
EMBL-EBI provides database searching and DNA and protein analysis tools. There are at least 20 databases ranging from genomic data to protein data (sequences, structures, and domains) to gene expression data. Users may query the database, perform analysis, or upload data through this site.
One-stop shopping for information regarding sequences, proteins, 3D structures, genomes, taxonomy, and literature. Entrez is an ambitious and powerful front-end to such diverse databases as GenBank, EMBL, DDBJ, PIR, SWISS-PROT, PRF, PDB, RefSeq, and PubMed, produced by the National Center for Biotechnology Information.
Eukaryotic Linear Motif resource (ELM)
ELM, the Eukaryotic Linear Motif server, provides a resource for predicting small functional sites in eukaryotic proteins, such as those subject to post-translational modification or involved in protein-protein interactions. Putative functional sites are identified by matching their patterns. Context-based rules and logical filters are applied to reduce the incidence of false positives and thereby improve the predictive power of the server.
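The idea of matching motif patterns and then filtering the hits by context can be sketched with a regular expression scan. This is only an illustration: the pattern, the sequence, and the masked region below are made up for the example and are not taken from ELM, and skipping hits inside a masked region is a simple stand-in for the server's context-based rules.

```python
import re


def find_motif_sites(sequence, pattern, masked_regions=()):
    """Return (start, end, matched_text) for every pattern hit that does not
    lie entirely inside one of the masked (start, end) regions.

    Coordinates are 0-based with an exclusive end, as in Python slicing.
    """
    hits = []
    # A lookahead with a capture group reports overlapping matches too.
    for match in re.finditer(f"(?=({pattern}))", sequence):
        start, end = match.start(1), match.end(1)
        inside_mask = any(lo <= start and end <= hi for lo, hi in masked_regions)
        if not inside_mask:
            hits.append((start, end, match.group(1)))
    return hits


# Hypothetical inputs chosen only to exercise the function.
SEQUENCE = "MKRSASPLLRTPSEPAARKTLSPQ"
PATTERN = r"R..[ST]"

print(find_motif_sites(SEQUENCE, PATTERN))
print(find_motif_sites(SEQUENCE, PATTERN, masked_regions=[(8, 16)]))
```

The second call shows how a context filter trims the raw pattern matches: the hit that falls inside the masked region is dropped, and only the remaining site is reported.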
Expression2Kinases (X2K)
Determining how differences in experimental conditions can lead to differences in gene expression patterns is an important first step in understanding how different stimuli affect cell biology, metabolism, or fate. Genome-wide expression data are generally gathered by mRNA profiling through microarrays or RNA sequencing and may provide information about how the phenotypic changes induced by different stimuli come about. However, determining the mechanisms that are responsible for genome-wide gene expression patterns can be challenging. Expression2Kinases (X2K) is a Java-based application for identifying the proteins and signaling pathways that are likely to be involved in mediating these observed differences in gene expression. The software integrates information about protein-protein and kinase-substrate interactions with experimental data from chromatin immunoprecipitation (ChIP), position weight matrices (PWMs), or both, to identify the transcription factors that are likely to be involved and the cell signaling components that regulate the activities of these transcription factors. The software was developed by the laboratory of Avi Ma’ayan and is freely available for download. The site includes a user manual and sample case studies in which X2K was used to identify the upstream factors involved in gene expression programs associated with liver fibrosis and activity-dependent changes in a hippocampal neural network.
Gene Expression Dynamics Inspector (GEDI)
GEDI, the Gene Expression Dynamics Inspector, is a microarray data analysis tool that compares the changes in gene expression between samples and is especially useful for analyzing gene expression patterns that change over time. For example, gene expression profiles in samples representing different stages of development or samples from different time points after experimental manipulation can be compared. In contrast to microarray analysis tools that group genes into clusters of similar ontology or cellular function, GEDI creates self-organizing maps (SOMs), which are groups of genes called "metagenes" that share dynamic expression patterns, and the expression profiles of the metagenes are then displayed as color-coded tiles in two-dimensional mosaics. The color of each tile indicates the relative expression of the genes in that cluster compared to those in other clusters; thus, these portraits can be visually compared to identify similarities and differences in global gene expression between samples. Animating the assembled mosaics makes changes in gene expression over time readily visible. Although this systems approach may be especially useful for identifying genes that are coordinately regulated between samples over time, such as during development or in response to single or multiple pharmacological agents, GEDI can also be used to identify similarities and differences in gene expression between static samples, such as diseased and normal tissues. The software may be freely downloaded for academic use after registration.
Gene Expression Omnibus (GEO)
The Gene Expression Omnibus (GEO), part of the National Center for Biotechnology Information (NCBI), is a repository for gene expression and protein abundance data from microarray experiments and other high-throughput assays. Data from experiments in various bacterial, plant, fungal, and animal species are submitted by researchers and assembled by GEO curators into DataSet records, each of which includes a description of the experiment performed, the data obtained from the experiment, and access to tools for viewing and analyzing the data. Users can search the DataSets by keyword to identify studies that use methods or materials of interest, or search gene expression profiles to identify DataSets that include a gene or protein of interest. The DataSet Browser provides tools for visualizing, searching, and comparing data within the DataSet to identify expression profiles of interest. From the DataSet Browser, users may also access a complete list of expression profiles obtained from a particular experiment. Each of these GEO Profiles contains the experimental data plus links to information about the relevant gene in other NCBI databases such as OMIM, PubMed, and Entrez. Additionally, each GEO Profile includes links through which users can identify genes that share similar expression profiles, map near one another, or are related by homology. Information on how to download DataSets for further analysis and how to contribute data from high-throughput experiments is provided.
A searchable database of human genes administered by the Weizmann Institute and DoubleTwist, Inc. Searches return a summary of the results with access to a more complete record, which may include the chromosome location, protein function, medical relevance, synonyms, links to homologs in model organisms, and much more.
The Gene Map and Pathway Profiler, or GenMAPP, is a computer application for analyzing gene expression data. The software was developed by the Conklin lab at the University of California, San Francisco in collaboration with Cytoscape. GenMAPP organizes gene expression data by first classifying genes into groups that share biological functions and then placing the genes into known biological pathways, such as signal transduction networks, metabolic pathways, or collections of genes related to one another by subcellular localization, homology, or gene ontology (GO) classification. The software allows users to superimpose expression data from microarray experiments, for example, onto known biological and biochemical pathways that include links to GenBank and PubMed entries for each gene. The software also allows users to classify components and draw pathways based on user-defined attributes. MAPPFinder is a GenMAPP accessory that identifies trends in gene expression data and relates microarray data to GO terms. There is an extensive interactive tutorial and a downloadable Powerpoint presentation that explain the features of the software. GenMAPP is Windows-compatible and freely available, though users must complete a free registration before downloading the software.
GlobPlot is a web service that allows the user to plot the tendency within the query protein for order/globularity and disorder. It successfully identifies inter-domain segments containing linear motifs, and also apparently ordered regions that do not contain any recognised domain. GlobPlot may be useful in domain hunting efforts. The plots indicate that instances of known domains may often contain additional N- or C-terminal segments that appear ordered. Thus GlobPlot may be of use in the design of constructs corresponding to globular proteins, as needed for many biochemical studies, particularly structural biology. GlobPlot has a pipeline interface, GlobPipe, for the advanced user to do whole proteome analysis.
Information Hyperlinked Over Proteins (iHOP)
Search the PubMed literature a new way using this site, which relies on text-mining strategies to find sentences containing a particular protein or gene. The sentences are hyperlinked to interacting proteins or genes, and biological function words are also highlighted. Once a search has been performed, the user can then view the abstract containing the sentence of interest, perform various additional searches based on hyperlinked terms in the sentences, or build a network by adding the sentences to a "Gene Model". A "Gene Model" is a custom interaction network based on the sentences selected by the user. When the user selects particular sentences to include, all of the interacting proteins or genes mentioned in those sentences appear in this user-defined interaction network. The site is the work of Robert Hoffmann.
Jena Library of Biological Macromolecules
Look up various biological molecules deposited in the Protein Data Bank or the Nucleic Acid Data Bank. The site allows searching by various terms or unique identifiers and provides an integrated result for any biological molecule with structural information, including a visual representation of the molecule's structure and an interactive molecule viewer based on the Jmol plug-in.
Kyoto Encyclopedia of Genes and Genomes (KEGG)
Kyoto Encyclopedia of Genes and Genomes (KEGG) is an effort to computerize current knowledge of molecular and cellular biology in terms of the information pathways that consist of interacting molecules or genes and to provide links from the gene catalogs produced by genome sequencing projects. The site does have signal transduction maps that link to the nucleotide and protein sequences for the proteins in the pathways. From there, one can use KEGG's sequence analysis tools to perform a variety of functions such as predicting transmembrane domains or locating consensus motifs. The KEGG project is being undertaken in the Institute for Chemical Research, Kyoto University as part of the Japanese Human Genome Program. (Current version KEGG Release 21.0, January 2002.)
Melanoma Molecular Map Project (MMMP)
The MMMP is a collection of open access databases relevant to the biology and treatment of melanoma. News items culled from the recent literature appear on the home page, as well as a section titled "Melanoma: An Introduction," which gives an overview of melanoma biology, including epidemiology, pathology, prevention, and risk assessment. Schematic illustrations of the signaling and metabolic pathways implicated in the development or progression of melanoma are available in the Biomaps section of the site. The Biomaps section also includes tables of information about small molecule inhibitors of biomolecules relevant to melanoma, including many inhibitors of molecules involved in cell signaling. The Melanoma Molecular Profile page consists of a list of molecules present or altered in melanoma with a link to each molecule's Biocard, a page that contains information such as sequence, alleles, expression, function, demonstrated or predicted role in cancer, and therapies that target the molecule. The complete collection of Biocards includes an extensive list of molecules implicated in melanoma, most of which participate in signaling pathways. Users can browse or search a clinical trial database that contains both primary data and meta-analyses. The database of drugs includes information such as mechanism of action, clinical information, and interactions for both established and experimental drugs used to treat melanoma, though complete information is not available for all drugs listed. A glossary includes all the molecules and drugs cited in the databases plus cancer biology terms. Users can search and view the information without registration but must obtain a free account in order to contribute or comment on information presented on the site.
miRBase is a useful resource about microRNA (miRNA) for researchers seeking to use these molecules in their own labs. The site comprises three databases and is maintained by the Wellcome Trust Sanger Institute. The first is a searchable database of published miRNA sequences, their genomic locations, and associated annotations. Both the hairpin precursor (miR*) and mature forms of the miRNA are included. The miRBase Registry allows researchers to submit a sequence of a new miRNA that is described in an accepted manuscript for naming and indexing in the miRNA sequence database. The miRBase Targets database is a collection of computationally defined miRNA target genes. The Sequence and Targets databases are searchable through the web interface or available for download.
NetPhorest is an online tool for matching phosphorylated sequences with kinases and predicting phosphorylation-dependent protein interactions, such as those mediated by SH2 domains, PTB domains, 14-3-3 proteins, BRCT domains, and WW domains. In addition to submitting a set of sequences for analysis, users may use the "Tree" tabs to create or view phylogenetic trees of the human kinome, proteins with SH2 domains, and proteins with PTB domains. The trees are constructed using the iTOL (Interactive Tree of Life) tool. Trees may be modified by the user to include or exclude particular nodes and may be exported as PDF files. A description of NetPhorest, the methods by which the predictions are determined, and information about the data on which NetPhorest is based are available in the Research Article by Miller et al.
NetPhos is a tool for identifying consensus phosphorylation sites in proteins of interest. This artificial neural network was trained on eukaryotic proteins and can be searched by entering one or more sequences into the search field or by uploading a file containing sequences in FASTA format. NetPhos was developed at the Center for Biological Sequence Analysis at the Technical University of Denmark.
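For readers preparing an upload, the FASTA format mentioned above is plain text in which each sequence is preceded by a header line that begins with ">". The following is a minimal sketch of writing sequences in that layout; the function name, the record identifier, the example sequence, and the line width are all hypothetical, and any limits or conventions specific to the NetPhos submission form are not covered here.

```python
def format_fasta(records, width=60):
    """Format (identifier, sequence) pairs as FASTA text.

    Each record becomes a ">" header line followed by the sequence,
    wrapped at `width` residues per line.
    """
    lines = []
    for identifier, sequence in records:
        lines.append(f">{identifier}")
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    # Terminate every line, including the last, with a newline.
    return "".join(line + "\n" for line in lines)


# Hypothetical record used only to show the layout.
records = [("example_protein_1", "MKRSASPLLRTPSEPAARKTLSPQ")]
print(format_fasta(records, width=10), end="")
```

The returned string can be written to a file with the usual file-handling calls before uploading.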
More than 500 human kinases and thousands of phosphorylation sites have now been identified. It’s been difficult, however, to determine exactly which protein phosphorylation sites are substrates for which kinases based on information on consensus motifs alone. The NetworKin site uses an algorithm that combines predictions of kinase families likely to phosphorylate particular motifs with contextual information on substrates and kinases to predict which kinases mediate phosphorylation of specific sites. Users can conduct searches for the predicted partners of substrates and kinases of interest or can explore interactions by means of the comprehensive map of the human phosphorylation network.
Oncomine Research Platform
The Oncomine Research Platform is a group of web-based applications for acquiring and analyzing cancer gene expression profiles. Oncomine was originally developed at the University of Michigan but is now hosted by Compendia Bioscience, which maintains this free version for academic and non-profit users. Oncomine's transcriptome database is curated and includes data from many different human cancers, thus enabling researchers to compare data from different studies. Users can search the database by gene name or expression pattern, compare expression profiles between normal and cancer tissues, and perform meta-analysis of gene expression patterns. The Data Newsletter page contains information about data releases and changes to existing datasets plus a helpful "How to" section.
Online Mendelian Inheritance in Man (OMIM)
Many heritable human diseases, including tuberous sclerosis, Noonan syndrome, and spondylocostal dysostosis, are caused by genetic defects that disrupt signaling cascades. The Online Mendelian Inheritance in Man (OMIM) database is a collection of information about human genes and inherited diseases that is updated daily. Information in the database is divided into gene records and disease records that are searchable by disorder, gene name, map position, or other keyword. Disease pages include clinical and biochemical information about the disease phenotype plus information pertaining to treatment, population genetics, and animal models of the disease, map position of the disease locus, and identification of genetic lesions or alleles associated with the disease. Gene pages include information on human genes that have been implicated in inherited disorders such as chromosomal mapping information, gene structure and function, research using animal models, and documented human phenotypes associated with specific alleles. OMIM also provides links to resources useful for researchers studying inherited disorders; these resources include mapping, mutation, animal model, and phenotype databases. OMIM is curated, updated, and maintained by the McKusick-Nathans Institute of Genetic Medicine at The Johns Hopkins University School of Medicine and is part of the National Center for Biotechnology Information (NCBI).
The PANTHER (Protein Analysis Through Evolutionary Relationships) Classification System is a method for classifying genes based on function – either experimentally determined or predicted based on evolutionary relationships. The PANTHER database includes a library of protein families and subfamilies (PANTHER/LIB) presented as interactive phylogenetic trees focusing on the relationships between molecules within subfamilies. An information page for each subfamily includes molecular function, biological process, links to relevant signaling pathways, and information about genes assigned to that group. Biological processes are described using PANTHER/X, a controlled vocabulary similar to Gene Ontology (GO), which can be converted to GO terminology using a key provided on the site. PANTHER Pathway is a collection of interactive pathways drawn with CellDesigner that can be exported in SBML format and includes a Pathway Resources page with links to online and downloadable tools for viewing and manipulating pathway models. Users can perform batch searches using gene, transcript, or protein identifiers to determine which are included in the PANTHER database or use the Java applet Prowler to browse and search the database by molecular function, biological process, pathway, or species. Users may input their own protein sequences for scoring with the PANTHER system or run the PANTHER system locally on their own computer to predict functions and identify molecular phylogenetic relationships for their proteins of interest. There are also tools available for analyzing large data sets, such as comparing the functional categories represented in different lists of genes, or analyzing expression data from microarray experiments. The site may be used to explore the function of phylogenetically related proteins and also to use known protein relationships and molecular functions to make predictions about the activity, regulation, or potential binding partners of a protein of interest. 
Users may browse and search the data freely or complete a free registration to create a user account that enables saving and manipulating search results.
This NIH-funded database has been developed as a research tool, a resource for students, and an ongoing interactive forum on the use of pharmacological compounds. Registered users can access detailed information about various pharmacological compounds. The database is also browsable and searchable.
PharmGKB: Pharmacogenomics Knowledge Base
Developed at Stanford University as part of the National Institutes of Health's Pharmacogenetics Research Network (PGRN), this public database integrates information about pharmacological agents, disease phenotypes, and genes to help researchers better understand the genetic basis of differences in drug responses between individuals. The data collected here can be accessed by gene name, allele ("variants of interest"), drug name, pathway, or disease. Drug data include details about dosage, metabolism, interactions, toxicity, relevant clinical datasets, cellular pathway targeted, and information about genetic factors that appear to affect the drug's activity. The database contains phenotypic information, such as clinical outcome and results from molecular and functional assays, plus genetic information about genes and alleles associated with disease processes or drug responses. Curated information about relationships between genes, drugs, and disease is clearly labeled to distinguish it from non-curated lists of relevant literature generated by an automated search engine. All of the information is interlinked, so, for example, users can find information on genes involved in a specific disease process by searching for information on a drug used to treat that disease. Interactive signaling and metabolic pathway diagrams show where specific drugs affect the pathway. Users must complete a free registration process in order to view or download clinical data that contain individual identifiers or access some sets of restricted-access patient data.
Plant Metabolic Network (PMN)
The Plant Metabolic Network (PMN) is a hub for information on plant metabolic pathways and the community of researchers that generate and curate that information. The database contents are curated by experts and include information from the literature plus computational models of biochemical pathways, with a focus on agriculturally important species. In addition to the reference database PlantCyc, PMN also includes species-specific databases for Arabidopsis thaliana, cassava, maize, poplar, soybean, and wine grape. Some information from externally hosted species-specific databases (such as those for coffee, tobacco, tomato, potato, and rice) is included in PlantCyc. The pathways in PMN are predicted and displayed using the Pathway Tools software, which users may download to expand their analyses of the database contents and build their own metabolic databases. Users can browse the collections of enzymes, proteins, and compounds, or search the databases by compound, gene, RNA, protein, biochemical reaction, or pathway. The Tools section includes various viewers for displaying and interacting with the contents of the database, and the Tools Overview page includes links to tutorials for using PMN’s search features and viewers. PMN is funded by the National Science Foundation and hosted by the Department of Plant Biology at the Carnegie Institution.
This site is devoted to information about phosphorylation in plants. The search interfaces are easy to use and the results for motif searching are customizable, allowing you to display only those results you deem relevant.
Progenetix is a database of genome copy number aberrations (CNAs) that have been documented in various human cancers. This curated collection includes published gains and losses of genomic information detected by comparative genomic hybridization (CGH) assays on individual cancer samples. Users may search the collection by publication, by morphological or clinical characteristics of the cancer sample, by the array used to analyze the sample, or by CNAs that affect a specific gene or chromosomal region. Progenetix includes separate databases for genomic information on specific types of cancers in the Diffuse Intrinsic Pontine Glioma (DIPG) and Cutaneous Non-Hodgkin Lymphoma (CNHL) genomic repositories. The site also links to ArrayMap, a collection of genomic copy number array data that can be mined for meta-analyses. The laboratory of Michael Baudis at the University of Zurich developed and maintains Progenetix.
Protein Information Resource (PIR)
The Protein Information Resource (PIR) is produced by collaborating groups at the University of Delaware and Georgetown University Medical Center and is part of the UniProt consortium. The purpose of PIR is to assist researchers in integrating proteomic and genomic information and promote standardization of protein annotation. Users may access several databases and tools from the site, including PRO, the Protein Ontology tool, which represents classes of proteins that are evolutionarily related to one another, that interact with one another to form a complex, or that represent different isomers or modified forms of proteins produced from the same coding sequence. iProClass integrates protein sequence, expression, family, function, and structure information by cross-referencing information from many databases, including UniProt, Pfam, GO, PDB, and OMIM. A tutorial shows users how to conduct individual protein or batch searches with iProClass and how to use and interpret the search results. PIR SuperFamily (PIRSF) is a system for classifying proteins based on their evolutionary relationships and can be used to identify phylogenetic relationships, functional convergence and divergence, or relationships between proteins that share structural similarities or have similar domain architecture. The iProLINK (integrated protein literature, information, and knowledge) tool helps users mine the published literature for proteomic information.
Protein Kinase Substrate Prediction
This easy-to-use site lets users ask whether a protein is a kinase, what kind of kinase it is, and which consensus residues support that classification, and then test for possible substrates based on the predicted phosphorylation consensus motif. Users can also start from a known kinase and ask whether a test protein sequence could be a substrate for that kinase. The software is written in JavaScript and runs only in Internet Explorer, and the text could use a little proofreading, but the interface is extremely easy to use and the output is easy to interpret.
PubChem is a repository for information about the chemical, structural, and functional properties of small bioactive molecules grouped into three interlinked databases. PubChem Substance contains data on the biological activity and chemical compositions of bioactive substances, and PubChem Compound contains the chemical and structural data for these substances. Data from biological activity screens performed with these substances are available in the PubChem BioAssay database. PubChem is part of the National Center for Biotechnology Information (NCBI) family of databases, and users can initiate searches of these three databases using keywords, BioAssay ID number, synonym, or chemical name from the search tool on the NCBI main page. Chemical structures and molecular formulas may also be used to query PubChem Compound through the Structure Search feature, where users can upload a structure or use the Structure Editor to draw a structure that can be used to search the database for compounds with similar structures. PubChem3D is a free, downloadable application that generates 3-dimensional figures of the molecules in PubChem and allows users to manipulate them. PubChem is an open-access database, and information about how to submit new data to PubChem is available at the Deposition Gateway. The contents of the databases are available for downloading, and there is a Power User Gateway for those who wish to submit queries and receive search results in XML format. The PubChem help page provides an overview of the types of data available from PubChem and how to mine it.
Robetta Server
The Robetta server provides automated computational tools for predicting and analyzing protein structures. For structure prediction, sequences submitted to the server are parsed into predicted domains and structural models are generated using comparative modeling if PSIBLAST or fold recognition methods detect a confident match to a protein of known structure, or ab initio structure prediction using the Rosetta fragment insertion method if no match is found. Other capabilities include predicting the effects of mutations on protein-protein interactions (computational alanine scanning).
Search Tool for Interactions of Chemicals (STITCH)
STITCH, the Search Tool for Interactions of Chemicals, is a database of demonstrated and predicted interactions between chemical compounds and proteins. It may be used to identify pathways that might be affected by a drug of interest or to identify drugs that could be used to modulate the activity of a protein or signaling pathway of interest. This database is built on protein-protein interaction data from STRING combined with information about bioactive compounds culled from experiments, databases, and the literature. Users can search the database by protein name or sequence or by chemical name, ATC (Anatomical Therapeutic Chemical) identifier, or structure to access interaction networks that include protein-protein, protein-chemical, and chemical-chemical interactions. Predicted protein and chemical interaction partners are displayed both as an interactive graphical network and in a table. Protein nodes in a STITCH interaction network are linked to information about that protein, such as sequence, homologs, structure, and GeneCards, and chemical compound nodes are linked to the PubChem Compound record for that chemical. There are several ways to view a network: the actions view, in which color coding of the interactions between nodes represents the mode of action (inhibitory versus stimulatory); the evidence view, in which color coding indicates the type of evidence on which the predicted interaction is based; and the confidence view, in which color coding indicates the type of interaction (protein-protein, chemical-protein, chemical-chemical) and thickness of the lines indicates the relative confidence of the predicted interactions. Lines representing interactions between nodes are linked to evidence supporting the interactions, including information from the literature and external databases.
Users may alter the parameters of the network prediction to change the confidence threshold for inclusion in the network or to include or exclude partners predicted by specific methods. The "Help/Info" section shows the user how to get the most out of the search and parameter-setting functions and how to navigate through the data, plus it includes descriptions of the types of data used to predict the interactions and the methods available for viewing the networks.
Search Tool for the Retrieval of Interacting Genes/Proteins (STRING)
STRING, the Search Tool for the Retrieval of Interacting Genes/Proteins, is maintained by the European Molecular Biology Laboratory (EMBL), the Swiss Institute of Bioinformatics, and the University of Zurich. STRING is a database of direct and indirect protein-protein interactions experimentally demonstrated or predicted from genomics, high-throughput screens, expression studies, or text mining of published literature. Search the database by protein name, accession number, or sequence to access an interactive graphical representation of that protein's interaction network. Protein nodes are linked to information about that protein such as sequence, homologs, structure, and GeneCards. The STRING interaction network can be visualized according to confidence, in which the thickness of the lines representing interactions indicates the strength of the evidence supporting the interaction, or according to evidence, in which the lines representing interactions are color-coded to indicate the type of evidence used to predict the interaction. Lines representing interactions between nodes are linked to evidence supporting that interaction, including relevant literature and external databases. Below the graphical depiction of the network, the predicted partners of the protein of interest are listed in a table format through which users may access information about similar proteins. Users may alter the parameters of the prediction to change the confidence threshold for inclusion in the network or to include or exclude partners based on the type of evidence that supports the interaction. Although this database includes direct interactions between components within a pathway, it also includes indirect interactions between components of different pathways. For example, the BMP4 network not only includes interactions between BMP4 and binding partners such as Chordin and BMP receptors but also predicts that BMP4 and FGF4 are functional partners.
The "Help/Info" section shows the user how to get the most out of the search and parameter-setting functions and how to navigate through the data, and it includes descriptions of the types of data used to predict the interactions and the methods available for viewing the networks.
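The effect of changing the confidence threshold or excluding a type of evidence can be pictured with a small sketch. The edge list, scores, and evidence labels below are invented for illustration (only the BMP4 partners Chordin, BMP receptors, and FGF4 come from the description above), and the function is a hypothetical helper, not part of STRING or STITCH.

```python
# Hypothetical records: (protein_a, protein_b, confidence, evidence_type).
# All scores and evidence labels are made up for this example.
interactions = [
    ("BMP4", "CHRD", 0.99, "experiments"),
    ("BMP4", "BMPR1A", 0.98, "experiments"),
    ("BMP4", "FGF4", 0.72, "textmining"),
    ("BMP4", "GENE_X", 0.35, "coexpression"),
]


def filter_network(edges, min_confidence=0.4, excluded_evidence=()):
    """Keep edges at or above the threshold whose evidence type is allowed."""
    return [
        edge
        for edge in edges
        if edge[2] >= min_confidence and edge[3] not in excluded_evidence
    ]


# Raising the threshold keeps only the best-supported partners.
for a, b, score, evidence in filter_network(interactions, min_confidence=0.9):
    print(a, b, score, evidence)

# Excluding one evidence type drops partners supported only by that type.
no_text = filter_network(interactions, excluded_evidence={"textmining"})
print([edge[1] for edge in no_text])
```

In this toy network, the indirect BMP4-FGF4 link survives a moderate threshold but disappears once text-mining evidence is excluded, which mirrors how the parameter settings reshape the displayed network.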
SenseLab, which is part of the Human Brain Project and produced by Gordon Shepherd’s lab at Yale University, includes several interconnected databases: the Cell Properties Database (CellPropDB), Neuron Database (NeuronDB), Model Database (ModelDB), Olfactory Receptor Database (ORDB), Odor Molecules Database (OdorDB), and Olfactory Bulb Odor Map Database (OdorMapDB). Two additional databases were under development as of July 2008: the Microcircuit Database, which includes computational models of microcircuits and networks, and BrainPharm, which includes information about drug interactions with specific types of neurons. Information in the databases is largely devoted to molecules, cells, and networks involved in olfaction, although there is information on other neuronal cell types in CellPropDB and NeuronDB. CellPropDB includes information about the membrane properties of 32 different types of neurons, such as the kinds of current a cell carries and which receptors, channels, and neurotransmitters it expresses. NeuronDB provides information about the properties of particular membrane regions (e.g., axons and dendrites) for this same collection of neurons. Of particular interest to researchers interested in olfaction are ORDB, which provides a comprehensive list of vertebrate olfactory receptor gene sequences, and OdorDB, which contains information about odorant molecules. Functional magnetic resonance imaging (fMRI) and radiolabeling maps of neural activity in the rodent olfactory bulb are archived in OdorMapDB, where they are classified by odorant and type of experiment. Both of these databases are browsable as well as searchable by name of odorant or type of neuron. SenseLab also includes ModelDB, a collection of published computational neuroscience models that users can download and execute with the free, open-source simulator NEURON.
SignaLink is a database of directional signaling interactions in several major developmental signaling pathways: transforming growth factor-β (TGF-β), Janus kinase-signal transducer and activator of transcription (JAK-STAT), receptor tyrosine kinase (RTK), nuclear hormone receptor, Wnt, Hedgehog, and Notch pathways. These pathways were manually curated from the published literature for the nematode Caenorhabditis elegans, the fruit fly Drosophila melanogaster, and Homo sapiens. The database highlights mechanisms of crosstalk between pathways and features a layered structure, with core pathway components at the center, followed by additional layers detailing pathway regulators, such as scaffolding and trafficking proteins, posttranslational modifiers of pathway components, and transcription factors and microRNAs that modulate the expression and activity of pathway components. The pathway viewer enables the user to display or hide each layer of interaction to focus on the signaling events of interest. The contents of the SignaLink database can be downloaded in various formats, including SQL, CSV, BioPAX, SBML, and Cytoscape.
SigPath is an information management system designed to support quantitative studies on the signaling pathways and networks of the cell. The system is web-based and allows users to submit and edit information directly about reactions, molecules and their measured concentrations, pathways and quantitative models. Reactions can be assembled into models and exported to a variety of modeling tools, including SBML-compliant tools. SigPath is an open-source bioinformatics project. Details about how to interact with SigPath are described in an STKE Protocol.
Stanford Microarray Database
The Stanford Microarray Database is a collection of freely available raw and normalized microarray data. The site provides access to tools for analysis and exploration of the data. As of January 2012, the database contained information from thousands of experiments in 24 organisms. The source code is also available for technically inclined users.
Systems Biology Experiment Analysis Management System (SBEAMS)
The Institute for Systems Biology developed the Systems Biology Experiment Analysis Management System (SBEAMS) to help researchers integrate information from different types of experiments. SBEAMS is a relational database management system for building databases of microarray and mass spectrometry data, integrating these different types of data, and analyzing them together. SBEAMS has a Web-based interface for accessing the tools to store, manage, access, analyze, and annotate user-created databases. The Test Drive page provides instructions for logging in with a demo account to explore SBEAMS, tutorials for both the microarray and proteomics modules, and access to a user discussion forum.
Systems Biology Markup Language (SBML)
SBML is an XML-based language for representing models of biochemical reaction networks, including signaling pathways, in computer-readable format. STKE has chosen an SBML-compatible file format for machine-readable versions of the Connections Maps database. The website provides assistive documents, schemas, and presentations, as well as a comprehensive FAQ list and two SBML discussion forums. Other highlights are an extensive model repository and a community-editable SBML Wiki area, where users can describe their own experiences with the software and make suggestions for future development. SBML tools include those for validation, visualization, and conversion of SBML files.
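To give a sense of what "XML-based" and "computer-readable" mean in practice, the sketch below reads a tiny SBML-style document with Python's standard XML parser and lists its species and reactions. The model itself (a kinase phosphorylating a substrate) is invented for illustration and has not been checked with an SBML validator; dedicated libraries such as libSBML are the usual choice for real models.

```python
import xml.etree.ElementTree as ET

# Namespace for SBML Level 2 Version 4; the toy model below is hypothetical.
SBML_NS = "http://www.sbml.org/sbml/level2/version4"
NS = {"sbml": SBML_NS}

SBML_TEXT = f"""<sbml xmlns="{SBML_NS}" level="2" version="4">
  <model id="toy_phosphorylation">
    <listOfCompartments>
      <compartment id="cytosol" size="1"/>
    </listOfCompartments>
    <listOfSpecies>
      <species id="Kinase" compartment="cytosol" initialAmount="1"/>
      <species id="Substrate" compartment="cytosol" initialAmount="10"/>
      <species id="Substrate_P" compartment="cytosol" initialAmount="0"/>
    </listOfSpecies>
    <listOfReactions>
      <reaction id="phosphorylation" reversible="false">
        <listOfReactants>
          <speciesReference species="Substrate"/>
        </listOfReactants>
        <listOfProducts>
          <speciesReference species="Substrate_P"/>
        </listOfProducts>
        <listOfModifiers>
          <modifierSpeciesReference species="Kinase"/>
        </listOfModifiers>
      </reaction>
    </listOfReactions>
  </model>
</sbml>"""


def species_refs(reaction, list_name, tag):
    """Collect the species ids referenced in one list of a reaction."""
    path = f"sbml:{list_name}/sbml:{tag}"
    return [ref.get("species") for ref in reaction.findall(path, NS)]


def summarize(sbml_text):
    """Return (species ids, {reaction id: participants}) for a document."""
    root = ET.fromstring(sbml_text)
    species = [s.get("id") for s in root.findall(".//sbml:species", NS)]
    reactions = {}
    for reaction in root.findall(".//sbml:reaction", NS):
        reactions[reaction.get("id")] = {
            "reactants": species_refs(reaction, "listOfReactants", "speciesReference"),
            "products": species_refs(reaction, "listOfProducts", "speciesReference"),
            "modifiers": species_refs(reaction, "listOfModifiers", "modifierSpeciesReference"),
        }
    return species, reactions


species, reactions = summarize(SBML_TEXT)
print(species)
print(reactions)
```

Because every element and attribute has a defined meaning, the same file can be passed between validation, visualization, and simulation tools without manual re-entry of the model.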
The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA), sponsored by the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI), generates and archives genomic data from human cancers. The program is still in its pilot phase, with datasets for only glioblastoma and ovarian carcinoma available as of February 2009. Tumor samples are contributed by several participating clinical institutions, and new data from the specimens are released regularly. Data sets for each type of cancer include clinical information, such as how the tumor was diagnosed and its histology, as well as genomic data such as methylation patterns, transcriptome information, and sequences from oncogenes and other regions of interest identified by karyotyping or microarray analysis. Cell signaling proteins are common in the list of 600 oncogenes sequenced in each sample. The data are available through two tiers: an open-access data tier, which is publicly accessible and is limited so that it cannot be compiled to form a complete data set from any one individual, and a controlled-access data tier, which includes additional clinical, demographic, and molecular information that could be used to generate complete, unique data sets from anonymous individuals. Researchers may gain access to the controlled data by completing a certification process. Users can browse and download TCGA data sets through the Data Portal and access tools for analyzing these data sets through the cancer Biomedical Informatics Grid (caBIG), where the Gene View may be used to view only data from genes involved in specific signaling or metabolic pathways.
The Gene Ontology
The Gene Ontology (GO) is a bioinformatics project that develops standardized terminology for describing the activities of gene products to facilitate integration of data from multiple species and different databases. GO annotation, for example, may help someone who is interested in cyclin-dependent kinases in one organism to identify gene products with similar functions in other species. GO has developed three ontologies (controlled vocabularies for describing the biochemical function of gene products and their subcellular site of action). The cellular component ontology provides terminology to describe the intracellular location at which a gene product acts. The biological process ontology includes terminology for describing cell physiological, regulatory, developmental, or metabolic processes in which gene products act. The molecular function ontology is used to describe the biochemical functions of gene products. Each ontology is hierarchical; for example, a protein that functions in the G1 phase of the cell cycle would be associated with the biological process ontology terms "cell cycle," "interphase," and "G1 phase." The ontologies do not include genes or gene products, or information about protein domains, protein-protein interactions, specific signaling pathways, or cell and tissue types. The ontologies are under continuous development by the GO Consortium, a group of laboratories, institutes, and databases, including FlyBase, the Zebrafish Information Network, the Mouse Genome Informatics database, and many others. Several data sets are available for download, including files describing the ontologies themselves, the gene product-GO term associations submitted by individual GO Consortium members, the entire database of gene products annotated with GO terminology (updated weekly), and resources for teaching and learning about GO.
The site includes tools developed both within and outside the GO Consortium for searching, browsing, and editing the GO ontologies, and for using the ontologies to annotate data or analyze microarray data sets.
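The hierarchical nature of the ontologies can be sketched with a toy example based on the G1 phase case described above. The parent links below are a simplified, hypothetical stand-in for the real ontology (actual GO terms have identifiers, several relationship types, and often more than one parent); the point is only that annotating a gene product with a specific term implies the broader terms above it.

```python
# Toy child -> parents map; not taken from the actual GO files.
PARENTS = {
    "G1 phase": ["interphase"],
    "interphase": ["cell cycle"],
    "cell cycle": [],
}


def term_with_ancestors(term, parents=PARENTS):
    """Return the term followed by every broader term reachable from it."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(parents.get(current, []))
    return found


print(term_with_ancestors("G1 phase"))
# ['G1 phase', 'interphase', 'cell cycle']
```

This is why a search for gene products annotated to a broad term such as "cell cycle" can also return those annotated only to a more specific term.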
The Signaling Gateway
The Signaling Gateway is a database hosted by the University of California San Diego (UCSD). The site includes the Molecule Pages, which are created using bioinformatics tools and biochemical data and represent the protein components involved in signal transduction. Presently, bioinformatics tools are being used to build an impressive set of related information from multiple databases about a particular protein as defined by its sequence. In some cases, a scientist has "adopted" the protein and provided a "Mini-molecule" page to accompany the derived data.
Transcription Regulatory Region Database (TRRD)
TRRD is a database of transcriptional regulatory factors and the genes that they regulate. The site includes a "Viewer" that requires a Java plug-in, which allows the transcriptional regulatory region (protein binding sites, promoter, silencer, and enhancer regions) of a gene to be viewed as a map. Some of the "Functionally Important Gene Systems" are becoming dated, but the site remains a useful collection of information.
This is a database of proteins involved in transcriptional regulation. The newest release of the database is version 5.0. The database and tools are free to non-profit users. The database provides detailed information about transcriptional regulators, organized in a variety of ways, including structural and sequence information and literature references linked to Medline. Access to Transpath, a limited database of signal transduction pathways, and Cytomer, a database of human and mouse organs and systems, is also provided.
TRANSPATH Signal Transduction Browser
TRANSPATH is an information system on gene-regulatory pathways, and an extension module to the TRANSFAC database. It focuses on pathways involved in the regulation of transcription factors in different species, mainly human, mouse and rat. Elements of the relevant signal transduction pathways are stored together with information about their interaction and references. All information is validated with references to the original publications. Also, references to other databases are provided (TRANSFAC, Swissprot, EMBL, PubMed and others). Selected pathways are displayed as clickable graphic maps.
UCSC Genome Bioinformatics
The UCSC Genome Bioinformatics site contains a number of tools. For example, with the Gene Sorter tool, all of the members of a gene family in a given species (human, mouse, rat, worm, fly, and yeast) can be found and sorted by homology, expression, chromosome position, or Gene Ontology (GO) annotation. The genes are linked to pages that provide details such as expression data, alleles, phenotypes, and links to the gene's entry in its model organism database. The site also includes tools for browsing genomes, including the In Silico PCR resource for obtaining predicted product sizes in many species (primates, rodents, sea urchins, flies, and more) based on user-defined primer sequences.
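The idea behind predicting a product size from primer sequences can be sketched as follows. This is a simplified illustration, not the method used by the UCSC In Silico PCR tool: it searches a single strand for exact matches only, and the template, primers, and function names are all hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def predicted_product_size(template, forward, reverse):
    """Return the amplicon length in bases, or None if a primer is not found.

    The forward primer must match the template exactly, and the reverse
    primer must match the opposite strand downstream of the forward primer.
    """
    template = template.upper()
    forward = forward.upper()
    start = template.find(forward)
    if start == -1:
        return None
    reverse_site = reverse_complement(reverse)
    end = template.find(reverse_site, start + len(forward))
    if end == -1:
        return None
    return end + len(reverse_site) - start


# Invented 37-base template and primers, for illustration only.
template = "TTGACCTGAAGGCTAGCTAGGATCCGATTACGCTTAA"
print(predicted_product_size(template, "GACCTGAAG", "GCGTAATC"))
```

A genome-scale tool must additionally handle both strands, multiple chromosomes, and imperfect primer matches, which this sketch deliberately leaves out.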
Website for Alternative Splicing Prediction (WASP)
The Website for Alternative Splicing Prediction (WASP) is a tool for predicting whether, when, or where particular exons are alternatively spliced. WASP was developed by the Frey and Blencowe laboratories at the University of Toronto and consists of two parts: a database of exons for which alternative splicing data are available, and a collection of alternative splicing predictions. Tissue-specific splicing profiles of a collection of over 3,500 mouse exons were used to identify sequence and structural motifs that correlated with whether, how often, and in what tissues specific exons were alternatively spliced. This curated collection of exons was used to build an algorithm for making predictions about the frequency and tissue-specificity of alternative splicing. The collection of de novo predictions includes predicted splicing information for over 11,000 mouse exons for which cDNA sequence data suggest alternative splicing. The predicted and experimentally determined alternative splicing data are maintained in separate databases that can be searched by gene name, exon sequence, or chromosomal coordinates. A query for which there is a high-quality splicing prediction in the database returns a graphical feature map that includes information on potential regulatory elements in or near the alternatively spliced exon, and these features can be explored through the UCSC Genome Browser. If an exon of interest is not in the predictions database, researchers may submit its sequence for analysis by someone on the team that developed the splicing code algorithm.