Other biological databases
Biological systems (diagram of data types): taxonomic data; literature; protein folding and 3D structure; small molecules; pathways and networks; protein families and domains; whole genome data; sequence data; ontologies (GO)
Other Biological Databases
- Transcription factor binding sites: TRANSFAC
- Protein structure databases: PDB, SCOP, CATH
- Protein family databases: Pfam, PRINTS, PROSITE, etc.
- Chemicals and small molecules: ChEBI
- Gene expression databases: GEO, ArrayExpress
- Metabolic pathways: Reactome, KEGG
- Genome databases: Ensembl, FlyBase, WormBase, etc.
- Human genetics-related databases: HapMap, dbSNP
Transcription factor binding sites
- TRANSFAC: database of eukaryotic transcription factors (regulation.com/pub/databases.html#transfac)
- TESS (Transcription Element Search System): predicts transcription factor binding sites, uses TRANSFAC
- TFsearch: searches for transcription factor binding sites
Protein structure databases
- Main resource is the Protein Data Bank (PDB)
- Contains the spatial coordinates of the atoms of macromolecules whose 3D structure has been determined by X-ray crystallography or NMR
- Proteins represent more than 90% of available structures (others are DNA, RNA, sugars, viruses, protein/DNA complexes…)
- Can be searched by PDB code
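To make "spatial coordinates of macromolecule atoms" concrete, here is a minimal sketch that reads atom coordinates from text in the fixed-column PDB file format. The `Atom` type, the `parse_atoms` function and the two sample records are illustrative assumptions, not part of any PDB tool, and the coordinate values are for demonstration only.

```python
from typing import List, NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def parse_atoms(pdb_text: str) -> List[Atom]:
    """Extract atoms from ATOM/HETATM records using the fixed PDB columns."""
    atoms = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip HEADER, REMARK, END and other record types
        atoms.append(Atom(
            serial=int(line[6:11]),           # columns 7-11
            name=line[12:16].strip(),         # columns 13-16
            residue=line[17:20].strip(),      # columns 18-20
            chain=line[21],                   # column 22
            residue_number=int(line[22:26]),  # columns 23-26
            x=float(line[30:38]),             # columns 31-38
            y=float(line[38:46]),             # columns 39-46
            z=float(line[46:54]),             # columns 47-54
        ))
    return atoms


# Illustrative records, truncated after the z coordinate.
SAMPLE_PDB = "\n".join([
    "HEADER    EXAMPLE",
    "ATOM      1  N   THR A   1      17.047  14.099   3.625",
    "ATOM      2  CA  THR A   1      16.967  12.784   4.338",
    "END",
])

for atom in parse_atoms(SAMPLE_PDB):
    print(atom.name, atom.residue, atom.residue_number, atom.x, atom.y, atom.z)
```

The parser slices by column rather than splitting on whitespace because PDB fields can run together when values are wide.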
Searching MSD: search by PDB code
Protein structure-related databases
- Structural family databases based on PDB: SCOP and CATH
- Predicted structures in SWISS-MODEL ( MODEL.html)
Protein family databases
- Databases that produce signatures for identifying protein families or domains
- Used for functional classification of proteins
- E.g. Pfam, PROSITE, PRINTS, SMART, TIGRFAMs, etc.
- Integrated into a single resource, InterPro
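One of the simplest kinds of signature is a PROSITE-style pattern, which can be translated into a regular expression and scanned against a sequence. The sketch below handles only the common pattern elements (`x`, `[...]`, `{...}`, repeat counts and the `<`/`>` anchors). The function name `prosite_to_regex` and the test sequences are assumptions made for illustration, and this is not how InterPro itself runs its member-database signatures.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")

        # Split off an optional repeat count such as "(3)" or "(2,4)".
        repeat = ""
        if element.endswith(")"):
            element, _, count = element.rpartition("(")
            repeat = "{" + count[:-1] + "}"

        if element == "x":
            core = "."                          # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"   # any residue except these
        else:
            core = element                      # one residue or a [..] set

        parts.append(
            ("^" if anchored_start else "") + core + repeat
            + ("$" if anchored_end else "")
        )
    return "".join(parts)


# N-glycosylation site pattern in PROSITE syntax.
regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)

# Hypothetical sequences, used only to exercise the pattern.
print(bool(re.search(regex, "ANGTA")))
print(bool(re.search(regex, "ANPTA")))
```

Family databases such as Pfam use profile hidden Markov models rather than patterns, so this covers only the pattern-based end of the signature spectrum.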
InterProScan sequence search: stand-alone version available
InterPro text search: search by keyword, protein accession or InterPro accession
Results for protein acc
Example InterPro entry
Chemicals and small molecules
- Chemical Abstracts
- ChEBI
- KEGG: part of it includes chemicals
- ChemIDplus: chemicals cited in NLM databases (dlite.jsp)
- MSD-Chem: ligands and chemicals in MSD
ChEBI example entry
Hierarchy for chemicals
Gene expression databases
- NCBI Gene Expression Omnibus (GEO)
- ArrayExpress
- Stanford Microarray Database: www5.stanford.edu/
- Can usually search for experiments or for particular expression profiles
GEO search page
Profiles search results
Specific entry and experiment info
ArrayExpress search results
What does the data look like?
- Information on the experiment, the array used, etc.
- Raw or processed tab-delimited file containing spots and their intensities (Cy3/Cy5 ratios) across different samples
- Files with metadata, e.g. sample information, annotation and the coordinates of each spot on the array
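As a rough illustration of working with such a file, the sketch below reads a small tab-delimited table of spot intensities and computes a log2 Cy5/Cy3 ratio per spot. The column names (`spot_id`, `gene`, `cy3`, `cy5`) and the sample values are assumptions; real GEO or ArrayExpress files use their own column layouts.

```python
import csv
import io
import math
from typing import Dict, TextIO

# Hypothetical two-channel data: one row per spot on the array.
SAMPLE_TABLE = (
    "spot_id\tgene\tcy3\tcy5\n"
    "s1\tGENE_A\t500\t2000\n"
    "s2\tGENE_B\t1000\t250\n"
    "s3\tGENE_C\t0\t300\n"
)


def log2_ratios(handle: TextIO) -> Dict[str, float]:
    """Return log2(Cy5/Cy3) per spot, skipping spots without usable signal."""
    ratios = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        cy3 = float(row["cy3"])
        cy5 = float(row["cy5"])
        if cy3 <= 0 or cy5 <= 0:
            continue  # a ratio is undefined for zero or negative intensities
        ratios[row["spot_id"]] = math.log2(cy5 / cy3)
    return ratios


for spot, ratio in log2_ratios(io.StringIO(SAMPLE_TABLE)).items():
    print(spot, ratio)
```

A positive value means the spot is brighter in the Cy5 channel, a negative value that it is brighter in Cy3.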
Proteomics: SWISS-2DPAGE
Enzymes and metabolic pathways
- Contain information describing enzymes, biochemical reactions and metabolic pathways
- ENZYME and BRENDA: nomenclature databases that store information on enzyme names and reactions
- IntEnz: Integrated relational Enzyme database
Enzyme nomenclature
- E.C. (Enzyme Commission) numbers are assigned based on the reactions the enzymes catalyze
- Hierarchy, high-level groups:
  - EC 1: Oxidoreductases
  - EC 2: Transferases
  - EC 3: Hydrolases
  - EC 4: Lyases
  - EC 5: Isomerases
  - EC 6: Ligases
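The hierarchy can be read directly off an EC number, since each dot-separated field narrows the class. The sketch below maps the first field to the high-level groups listed above and lists every level of the hierarchy. The function name `describe_ec` and the returned dictionary layout are illustrative assumptions.

```python
from typing import Dict, List, Union

# The high-level groups exactly as listed on the slide.
EC_CLASSES: Dict[int, str] = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}


def describe_ec(ec_number: str) -> Dict[str, Union[str, List[str]]]:
    """Split an EC number into its top-level class and hierarchy levels."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four dot-separated fields: {ec_number!r}")

    # Partial numbers such as "3.4.-.-" stop at the first unassigned field.
    assigned = []
    for field in fields:
        if field == "-":
            break
        assigned.append(field)
    if not assigned:
        raise ValueError(f"no class assigned: {ec_number!r}")

    levels = [".".join(assigned[:i]) for i in range(1, len(assigned) + 1)]
    class_name = EC_CLASSES.get(int(assigned[0]), "class not in the table above")
    return {"class_name": class_name, "levels": levels}


# EC 1.1.1.1 is alcohol dehydrogenase.
print(describe_ec("EC 1.1.1.1"))
print(describe_ec("3.4.-.-"))
```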
EC example
Metabolic pathway databases
- Pathguide: lists more than 200 pathway resources
- KEGG (Kyoto Encyclopedia of Genes and Genomes):
  - Database of chemicals, genes and networks (metabolic, regulatory, etc.)
  - Well curated and quite specific
- EcoCyc (Encyclopedia of E. coli K12 genes and metabolism): curation of the entire genome
- Reactome: curated biological pathways
- GenMAPP: pathways contributed by users
Different pathways in different species -> comparison
Pathway in Reactome
Example of a pathway in BioCyc
Protein-protein interaction databases
- Store pairwise interactions or complexes
- A single publication can report from 1 to more than 20,000 interactions
- IntAct
- DIP (Database of Interacting Proteins): mbi.ucla.edu/
- BIND (Biomolecular Interaction Network Database)
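Pairwise interactions map naturally onto an undirected graph. The sketch below builds an adjacency map from a list of pairs and counts the unique interactions. The protein labels are hypothetical placeholders, and the flat list of pairs is an assumed input shape rather than the export format of IntAct, DIP or BIND.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_network(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from pairwise interactions."""
    network: Dict[str, Set[str]] = defaultdict(set)
    for a, b in pairs:
        # Record the interaction in both directions; sets absorb duplicates.
        network[a].add(b)
        network[b].add(a)
    return dict(network)


def count_interactions(network: Dict[str, Set[str]]) -> int:
    """Count unique interactions in the map."""
    # Self-interactions appear once in the map, other pairs twice.
    selfs = sum(1 for protein, partners in network.items() if protein in partners)
    total = sum(len(partners) for partners in network.values())
    return (total - selfs) // 2 + selfs


# Hypothetical interactions; the same pair reported twice counts once.
PAIRS = [
    ("protA", "protB"),
    ("protB", "protA"),
    ("protA", "protC"),
    ("protD", "protD"),
]

network = build_network(PAIRS)
print(sorted(network["protA"]))
print(count_interactions(network))
```

A complex with more than two members would need an extra modelling decision, for example expanding it into all pairs, which this sketch leaves out.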
Protein-protein interactions in IntAct
Integrated functional interactions in STRING
Genome browsers
- Integrate sequence and functional data for a genome
- Ensembl: genome browser for major eukaryotic genomes, e.g. human, mouse, etc.
- UCSC browser
- FlyBase: Drosophila genome database
- WormBase: C. elegans
- PlasmoDB: Plasmodium (malaria)
- Etc.
Ensembl genome browser
Ensembl gene view 1
Ensembl gene view 2
Gene within context on chromosome
Human genetics databases
- GeneCards
- HapMap
- OMIM
- HGDP: Human Genome Diversity Project
Mutation/polymorphism databases
- Most of the databases are disease- or gene-centric, e.g. p53
dbSNP
- Repository of known short genetic variations such as SNPs (human and other organisms)
Where to find the databases
- Table of addresses for major databases and tools
- Nucleic Acids Research Database issue: January each year
- Nucleic Acids Research Software issue: new
- Expasy list of tools
Large scale data retrieval
- Programmatic access to many databases
- MySQL access to some
- BioMart access: public and private
- FTP sites: large data downloads
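As one possible form of programmatic access, the sketch below looks up a gene by stable identifier through the Ensembl REST service. REST is not among the access routes named above, so treat it as an assumed alternative to MySQL, BioMart or FTP. The helper names `lookup_url` and `fetch_gene` are illustrative, and `fetch_gene` needs network access, so it is defined here but not called.

```python
import json
from typing import Any, Dict
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed service root for the public Ensembl REST API.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(stable_id: str) -> str:
    """Build the lookup URL for an Ensembl stable identifier."""
    return f"{ENSEMBL_REST}/lookup/id/{quote(stable_id)}"


def fetch_gene(stable_id: str, timeout: float = 30.0) -> Dict[str, Any]:
    """Fetch the JSON record for one identifier (requires network access)."""
    request = Request(
        lookup_url(stable_id),
        headers={"Content-Type": "application/json"},  # ask for JSON output
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


# ENSG00000141510 is the Ensembl gene identifier for human TP53.
print(lookup_url("ENSG00000141510"))
```

For whole-genome or whole-database downloads, the FTP sites remain the better fit, since a per-record web service is slow at that scale.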
Other tutorials ex.html