©M. Thollesson, 2001 Bioinformatics – Biological databases Mikael Thollesson Evolutionary Biology Centre and Linnaeus Centre for Bioinformatics, Uppsala.


1 ©M. Thollesson, 2001 Bioinformatics – Biological databases Mikael Thollesson Evolutionary Biology Centre and Linnaeus Centre for Bioinformatics, Uppsala University

2 What is a database?
The data itself (i.e., information)
The organisation of the data (database structure)
  – Flat-file databases
      Mark-up and tags
  – Relational databases: records and fields
  – Object-oriented databases
Database Management System (DBMS)
  – Queries and retrievals
Interfaces
  – User interfaces, e.g. web pages or dedicated clients
  – Computer interfaces
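As a rough illustration of mark-up and tags in a flat-file database, the sketch below parses a made-up file in which every line starts with a two-letter tag and a line containing only `//` closes a record. The layout is loosely modelled on the style of EMBL entries, but the record content, the choice of tags, and the function name `parse_flat_file` are assumptions for illustration only; real formats have many more line types.

```python
from collections import defaultdict

# Hypothetical flat-file content: a two-letter tag, spaces, then the value.
# A line containing only "//" closes a record.
FLAT_FILE = """\
ID   EXAMPLE1
DE   A made-up entry used only
DE   for illustration
SQ   acgtacgt
//
ID   EXAMPLE2
DE   Another made-up entry
SQ   ttgaccaa
//
"""


def parse_flat_file(text):
    """Return a list of records; each record maps a tag to its joined value."""
    records = []
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.strip() == "//":
            # End-of-record marker: store the collected fields, start afresh.
            records.append({tag: " ".join(vals) for tag, vals in fields.items()})
            fields = defaultdict(list)
        elif line.strip():
            # Split the tag from the rest of the line; repeated tags
            # (continuation lines) are collected and joined with spaces.
            tag, _, value = line.partition(" ")
            fields[tag].append(value.strip())
    return records


records = parse_flat_file(FLAT_FILE)
for rec in records:
    print(rec["ID"], "|", rec["DE"], "|", rec["SQ"])
```

The structure lives entirely in the tags, so retrieving a field means scanning and parsing text, which is one reason a DBMS is often put on top of such files.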

3 Relational databases
A relational database consists of tables with homogeneous content; each table contains records (items), and each record has one or several fields (properties)

4 Relational databases
Records in different tables are related by key fields
Contents from different tables are brought together using these key values
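A minimal sketch of how key fields relate records in different tables, using Python's built-in `sqlite3` module with an in-memory database. The table names, column names, and rows are invented for illustration and are not taken from any of the databases discussed here.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,   -- key field
    name        TEXT NOT NULL
);
CREATE TABLE sequence (
    sequence_id INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    residues    TEXT NOT NULL
);
INSERT INTO organism VALUES (1, 'Homo sapiens'), (2, 'Mus musculus');
INSERT INTO sequence VALUES (10, 1, 'ACGT'), (11, 1, 'TTGA'), (12, 2, 'GGCC');
""")

# The join brings together contents from both tables via the shared key value.
rows = conn.execute("""
    SELECT o.name, s.sequence_id, s.residues
    FROM sequence AS s
    JOIN organism AS o ON o.organism_id = s.organism_id
    ORDER BY s.sequence_id
""").fetchall()

for row in rows:
    print(row)
```

Each organism name is stored once, and every sequence record refers to it through `organism_id`; the query reassembles the combined view on demand.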

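5 [Diagram: client and server]
Client and server communicate over http/https; the server returns html
Server components
  – OS, e.g. Linux
  – WWW-server, e.g. Apache
  – Interface code, e.g. Perl or PHP
  – DBMS, e.g. MySQL
The interface code sends an SQL query to the DBMS and receives an SQL reply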

6 One view of Bioinformatics
[Diagram. Databases shown: sequence, structure, genome, expression, metabolic, phylogenetic, and literature databases. Analyses and results shown: BLAST, pairwise/multiple alignment, contig assembly, phylogenetic inference, phylogenies, predictions on DNA, predictions on proteins, gene function and localisation, expression patterns, regulatory mechanism, and an open "?"]

7 Sequence databases
Nucleic acid sequence databases
  – Contain primary nucleotide sequence data
  – Repositories, i.e. the content of these databases is not curated
Protein sequence databases
  – Contain secondary and primary protein sequence data
  – Some are curated, others are just extracts from other databases
Several kinds of interfaces/search engines are available to retrieve data, e.g. SRS (Sequence Retrieval System) and the Entrez browser

8 Nucleotide sequence repositories
Three primary centres, which exchange information on a daily basis
  – EMBL / European Molecular Biology Laboratory
  – DDBJ / DNA Data Bank of Japan
  – GenBank
All three adhere to the DDBJ/EMBL/GenBank Feature Table Definition, i.e. the content of each record is the same for these databases
  – http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html

9 EMBL – European Molecular Biology Laboratory
Europe’s primary nucleotide sequence resource
Established 1980 in Heidelberg by EMBL, now maintained by EBI (European Bioinformatics Institute) in Cambridge, UK
Main sources of sequences are direct submissions from individual researchers, genome projects, and patent applications
Contains two main parts
  – A release section (embl_rel) that is issued every three months
  – A new section (embl_new) where new sequences are added daily
Also split into divisions depending on the origin of the sequence
http://www.ebi.ac.uk/embl/Access/index.html
Entries have a format that differs from GenBank and DDBJ

10 EMBL divisions

11 DDBJ – DNA Data Bank of Japan
Mainly collects data from Japanese activities (but accepts submissions from any researcher in any country)
Began DNA repository activities in 1986, endorsed by the Ministry of Education, Science, Sports, and Culture
http://www.ddbj.nig.ac.jp/
Entries have the same format as GenBank

12 GenBank
US primary nucleotide sequence resource
Established in 1982; maintained by the National Center for Biotechnology Information (NCBI, itself established in 1988), Bethesda, MD
Like EMBL, contains a release section and a new section
http://www.ncbi.nlm.nih.gov/
Entries have a format that is different from EMBL

13 EST databases
Expressed Sequence Tags (ESTs) are short sequences from mRNA
ESTs are useful to get a handle on expressed genes
dbEST
  – Is a division of GenBank containing ESTs from a number of organisms
UniGene
  – A non-redundant set of gene-oriented clusters
  – Contains numerous novel ESTs, but also “proper” sequences
  – Presently Homo, Rattus, and Mus have been processed

14 Protein databases
SWISS-PROT and TREMBL
  – Developed by Swiss Institute of Bioinformatics (SIB) and European Bioinformatics Institute (EBI)
PIR-PSD
  – A collaboration between National Biomedical Research Foundation (NBRF), Munich Information Center for Protein Sequences (MIPS), and Japan International Protein Information Database (JIPID)

15 Protein databases I
SWISS-PROT (86 000 entries June 2000)
  – Is a curated protein sequence database
  – Aims to provide a high level of annotations (e.g., function, domain structure, post-translational modifications)
  – Divided into Swissprot_rel and Swissprot_new
  – Not divided into sections based on species
TREMBL (ca 300 000 entries June 2000)
  – Contains translated sequences from the EMBL database
  – Divided into
      SP-TREMBL with sequences that are candidates for incorporation into SWISS-PROT
      REM-TREMBL that will not be incorporated into SWISS-PROT

16 Protein databases II
Protein Information Resource - Protein Sequence Database (PIR-PSD) is similar to SWISS-PROT in its aims
PIR’s stated goal is “to provide a comprehensive, non-redundant, classified, well-annotated, and freely available, protein sequence database, in which entries are classified into family groups and alignments of each group are available”
Also produces a computer-generated supplemental database of translations, PATCHX, similar to TREMBL, with sequences not yet incorporated
New entries arrive in batches from genome sequencing projects or from selected GenBank/EMBL entries
The PIR database is in constant flux as the level of annotation on entries increases and new entries with minimal annotation are added
The PIR-PSD database is growing at a higher rate than SWISS-PROT, but has a lower level of annotation per entry
The PIR-PSD consists of four sections:
  – PIR1. Fully Classified Entries
  – PIR2. Verified and Classified Entries
  – PIR3. Unverified Entries
  – PIR4. Un-encoded or Un-translated Entries

17 Interfaces to public databases
Several different databases are usually accessible through the same WWW interface.
For example, the databases below are accessible via the National Institutes of Health/National Center for Biotechnology Information (NIH/NCBI) (http://www.ncbi.nlm.nih.gov/Database/)
  – OMIM
  – PubMed
  – Full-text electronic journals
  – 3D structures
  – Taxonomy
  – Protein sequences
  – Nucleotide sequences
  – Maps & genomes

18 Genome databases
Differ from sequence databases by being more heterogeneous and diverse
A genome database organises all information on an organism's genome, such as
  – Genetic mapping
      Maps how genes are located relative to each other, with distance measured as percentage recombination
  – Physical mapping
      Ranges from cytogenetic maps (banding patterns of chromosomes) to the positions of clone contigs
  – Sequence data
      Nucleotide sequences are (usually) deposited at the nucleotide sequence repositories even before the genome sequencing is finished
Entry points to genome databases include
  – Genome Net – http://www.genome.ad.jp/
  – NCBI’s genome section – http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome
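To make "distance measured as percentage recombination" concrete, here is a small sketch that computes the percentage from offspring counts in a cross between two linked loci. The function name and the counts are invented for illustration; the calculation is just recombinant offspring divided by total offspring, times 100.

```python
def recombination_percentage(recombinants, total_offspring):
    """Genetic map distance between two loci as percentage recombination."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    if not 0 <= recombinants <= total_offspring:
        raise ValueError("recombinants must lie between 0 and total_offspring")
    return 100.0 * recombinants / total_offspring


# Hypothetical cross: 18 recombinant offspring out of 200 scored.
distance = recombination_percentage(recombinants=18, total_offspring=200)
print(f"{distance:.1f}% recombination")
```

This is a relative measure between genes, in contrast to the physical maps listed above, which place features at positions on chromosomes or clone contigs.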

19 Protein structures
Primary structure; the order of amino acids, e.g. n-IVTAHAFVMI-c
Secondary structure; conformations, mainly alpha helices and beta sheets
Tertiary structure; the complete three-dimensional folding of the polypeptide
Quaternary structure; exists if the protein is composed of two or more polypeptide chains

20 Structural databases
Contain information on the three-dimensional structure of molecules, chiefly proteins
Data is primarily based on x-ray crystallography (>80%), NMR, or theoretical models (<2%)
Examples of such databases
  – Protein Data Bank (PDB) - http://www.rcsb.org/pdb/
  – Molecular Modelling Database (MMDB) - http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure

21 Metabolic databases
All metabolic databases use EC-numbers, which are a combination of four numbers that classify the type of reaction the enzyme catalyses
Example: EC 1.2.3.4 is an oxidoreductase (1) that acts on aldehyde or oxo groups (1.2) with oxygen as acceptor (1.2.3). The last number, 4, is an ordinal number within the class
Pros and cons
  + EC provides a unique identifier
  + Enables a synonym dictionary
  - Many classes of enzymes are not covered in sufficient detail, especially proteases and nucleases with macromolecules as substrate
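The hierarchical reading of an EC number can be sketched in a few lines of Python. The function names `parse_ec_number` and `ec_hierarchy` are hypothetical, and this sketch accepts only identifiers made of four plain integers.

```python
def parse_ec_number(ec):
    """Split an identifier such as 'EC 1.2.3.4' into a tuple of four integers."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"not a four-part EC number: {ec!r}")
    return tuple(int(p) for p in parts)


def ec_hierarchy(ec):
    """Return the identifier at each level, from broad class to specific entry."""
    numbers = parse_ec_number(ec)
    return [".".join(str(n) for n in numbers[:depth]) for depth in range(1, 5)]


levels = ec_hierarchy("EC 1.2.3.4")
print(levels)  # ['1', '1.2', '1.2.3', '1.2.3.4']
```

Because each prefix names a broader class, two enzymes can be compared by how many leading numbers they share, and the full four-part number can serve as the key that links synonyms across databases.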

22 Metabolic databases
Describe enzymes, substrates, products, and biochemical reactions
Data are specific for different organisms (“type organisms”) as well as general overviews and links to sequence and structure databases
Example
  – Kyoto Encyclopedia of Genes and Genomes – http://www.genome.ad.jp/kegg/

23 Phylogenetic databases
Primary (repositories) and secondary (data analysis and interpretation) databases
Primary databases contain information on the result of phylogenetic analyses (trees, taxonomic names), data, and assumptions on which the analyses are based
Secondary databases contain interpretations and assembled phylogenetic hypotheses for all kinds of taxa
Examples
  – TreeBase – http://www.herbaria.harvard.edu/treebase/index.html (Primary)
  – Tree of Life – http://phylogeny.arizona.edu/tree/ (Secondary)

24 Expression databases
Functional genomics
  – DNA arrays (cDNA probes on a chip) are used to assess the RNA levels of different genes (several hundreds at a time)
  – Measurements are taken at intervals after some treatment is initiated
  – Genes are grouped in clusters according to expression profile
  – Reverse engineering of the expression levels of these groups is used to propose regulatory genetic networks
No unified format for DNA-chip data yet, although work is in progress
Examples of gene expression databases are
  – EBI ArrayExpress database – http://www.ebi.ac.uk/arrayexpress/
  – KEGG Expression Database – http://www.genome.ad.jp/kegg/expression/
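Grouping genes by expression profile can be sketched as follows. The slide does not name a clustering method, so this example assumes one simple option: Pearson correlation between time-course profiles, with a gene joining the first cluster whose first member it correlates with above a threshold. Gene names, measurements, and the threshold are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (neither may be flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def group_by_profile(profiles, threshold=0.9):
    """Greedy grouping: compare each gene with the first member of each cluster."""
    clusters = []
    for gene, profile in profiles.items():
        for cluster in clusters:
            representative = profiles[cluster[0]]
            if pearson(profile, representative) >= threshold:
                cluster.append(gene)
                break
        else:
            # No existing cluster is similar enough: start a new one.
            clusters.append([gene])
    return clusters


# Hypothetical RNA levels at four time points after a treatment.
profiles = {
    "geneA": [1.0, 2.0, 4.0, 8.0],   # rising
    "geneB": [0.5, 1.1, 2.0, 4.1],   # rising, lower amplitude
    "geneC": [8.0, 4.0, 2.0, 1.0],   # falling
    "geneD": [7.5, 4.2, 1.9, 1.1],   # falling
}
clusters = group_by_profile(profiles)
print(clusters)
```

Correlation ignores absolute level, so geneA and geneB end up together despite differing amplitudes. Real analyses typically use established clustering algorithms on normalised data rather than this greedy pass.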

