Published by Samson Simpson. Modified over 8 years ago.
1
Databases, archives, search tools.
Bioinformatics: "convergence of two historical trends in biological research - storage of molecular sequences in computer databases - application of computational algorithms to the analysis of DNA and protein sequences." (Brown 2003, BioTechniques).
2
After the database lecture the student should
* Understand the differences between primary and secondary databases.
* Understand the differences between sequence similarity search and structured data search.
* Understand the background for maintaining different versions of databases with nearly the same content.
* Understand the difference between curated and raw databases.
* Understand the difference between databases (GenBank non-redundant protein, Swiss-Prot), servers (NCBI, ExPASy) and search programs (BLAST, FASTA).
Why? Most information developed in bioinformatics is stored in databases. Often the same information exists in different formats in different databases, and different servers present the same data in different, more or less user-friendly ways. The choice of database depends on the problem and personal taste. The choice of server may even depend on the time of day and the load (number of users) at that time.
3
Databases (DB).
1. Primary databases = archives = repositories.
2. Secondary databases = specialized databases.
3. Parallel information: mainly American (U.S.A.) versus European (EU) databases.
All databases are listed in the first issue of Nucleic Acids Research (NAR) each year.
5
Main bioinformatic institutes hosting databases and servers
[Figure: world map]
* NCBI: Bethesda, Maryland, USA
* EBI: Hinxton, England
* DDBJ: Japan
International Nucleotide Sequence Database Collaboration
6
Primary (repository, archive) DB.
Data derived from direct experimental characterization of DNA or protein. Authors submit their own material, which is curated by the database.
International public databases: all known nucleotide and protein sequences.
* GenBank (founded 1982), hosted at NCBI since 1992
* EMBL (founded ?), hosted at EBI since 1994
* DDBJ (DNA Data Bank of Japan) (founded 1986)
Local databases at institutions doing sequencing:
* TIGR (founded 1992), Sanger (founded 1992)
* Other local databases of sequencing projects linked to the 3-5 large primary databases
Commercial DB not accessible to the public.
7
Secondary DB (specialized DB, derived information resources)
Information curated by the DB. No direct submission from scientists. Further analysis by the database.
* Swiss-Prot (hosted at SIB since 1998; founded by Dr. Amos Bairoch in 1985): annotation, minimum redundancy, integration with other DB, documentation.
* PDB (Protein Data Bank) (founded in 1971): DB of experimentally determined three-dimensional structural information.
* PIR (Protein Information Resource) (founded in 1965 by Margaret O. Dayhoff): receives directly sequenced proteins.
* Many, many more (see NAR)
8
Domain and motif specialized DB.
Domain: compact units of proteins behaving independently.
Motifs: conserved regions of proteins which might be part of a domain.
* BLOCKS (USA) (founded by the Henikoffs): multiple alignments of conserved regions
* PRINTS (UK) (parallel to BLOCKS, based on OWL DB): hierarchical gene family fingerprints
* PROSITE (associated with Swiss-Prot): biologically significant protein patterns and profiles
* ProDom (automatically created blocks, France)
* Pfam (manually defined domains): multiple sequence alignments and hidden Markov models of common protein domains
* CDD (Conserved Domain Database): alignment models for conserved protein domains
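To make the idea of a PROSITE-style pattern concrete, the sketch below converts a pattern written in PROSITE syntax into a Python regular expression and scans a sequence with it. It is a simplified illustration, not PROSITE's own scanner: the function name `prosite_to_regex` and the test sequence are made up for this example, and less common syntax (such as anchors inside square brackets) is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Handles residues, [..] classes, {..} exclusions, x wildcards,
    repeat counts such as x(3) or x(2,4), and the < and > anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchored_start = element.startswith("<")
        anchored_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count: x(3) or x(2,4).
        match = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        residues, low, high = match.groups()
        if residues == "x":
            token = "."                            # any residue
        elif residues.startswith("{"):
            token = "[^" + residues[1:-1] + "]"    # any residue except these
        else:
            token = residues                       # one residue or a [..] class
        if low and high:
            token += "{" + low + "," + high + "}"
        elif low:
            token += "{" + low + "}"
        parts.append(("^" if anchored_start else "") + token
                     + ("$" if anchored_end else ""))
    return "".join(parts)


# Illustrative pattern in PROSITE syntax and a made-up test sequence.
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
SEQUENCE = "AACDECKLMFAAAAAAAAHQRSHGG"

regex = prosite_to_regex(PATTERN)
hit = re.search(regex, SEQUENCE)
print(regex)
print(hit.start() + 1 if hit else "no match")  # 1-based start of the hit
```

A pattern of this kind only says whether a motif is present. Profile and hidden Markov model resources such as Pfam score a whole alignment instead, which is why they tolerate more sequence variation.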
9
Domain, motif specialized DB.
Domain: compact units of proteins behaving independently.
Motifs: conserved regions of proteins which might be part of a domain.
Search tools for the domain DBs:
* DART (Domain Architecture Retrieval Tool)
* SMART (Simple Modular Architecture Research Tool)
* InterPro: links information in PRINTS, PROSITE, ProDom and Pfam
10
Database Category: Proteome Resources
* AAindex: Physicochemical and biological properties of amino acids
* GELBANK: 2D gel electrophoresis patterns from completed genomes
* REBASE: Restriction enzymes and associated methylases
* SWISS-2DPAGE: Annotated two-dimensional polyacrylamide gel electrophoresis database
11
Database Category: Varied Biomedical Content
* DBcat: Catalog of databases
* DrugDB: Pharmacologically active compounds; generic and trade names
* GlycoSuiteDB: N- and O-linked glycan structures and biological source information
* NCBI Taxonomy Browser: Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence
* probeBase: rRNA-targeted oligonucleotide probe sequences, DNA microarray layouts, and associated information
* PubMed: MEDLINE and Pre-MEDLINE citations
* RefSeq: Reference sequence standards for genomes, genes, transcripts, and proteins
* Tree of Life: Information on phylogeny and biodiversity
* VirOligo: Virus-specific oligonucleotides for PCR and hybridization
12
International bioinformatic resources (integrated databases, programs and servers)
NCBI (National Center for Biotechnology Information)
Division of NLM on the NIH campus. Web site: www.ncbi.nlm.nih.gov
* Repository: GenBank
* Data retrieval: Entrez, PubMed, LocusLink. Entrez is an integrated database retrieval system giving access to all types of data.
* Data analysis: BLAST, Electronic PCR, ORFfinder, and more
13
International bioinformatic resources (integrated databases, programs and servers)
EBI (European Bioinformatics Institute)
* EMBL: repository. Europe's primary collection of nucleotide sequences
* UniProt Knowledgebase: a complete annotated protein sequence database
* Macromolecular Structure Database: European project for the management and distribution of data on macromolecular structures
* ArrayExpress: for gene expression data
* Ensembl: provides up-to-date completed metazoan genomes and the best possible automatic annotation
* Tools: ClustalW and many more
14
Example of repository: GenBank
* Submission: 35 % by BankIt individual submissions. The rest are bulk submissions from sequencing centres.
* GI (GenInfo Identifier) number: changes with each update of the sequence.
* Accession number: constant, but extended by a version number (e.g. M20730.1).
* DNA sequences: two letters, six digits (old format: one letter, five digits).
* Protein sequences: three letters, five digits (old format: one letter, five digits).
* Non-redundant (nr)?
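The identifier rules on this slide can be written down as a small check. This is a sketch covering only the formats named on the slide: the function names are invented for the example, newer and longer accession formats are deliberately ignored, and the old one-letter format is reported as ambiguous because the slide gives the same layout for DNA and protein.

```python
import re

# Formats as stated on the slide (assumption: newer, longer formats are ignored).
NUCLEOTIDE = re.compile(r"[A-Z]{2}\d{6}")   # two letters, six digits
PROTEIN = re.compile(r"[A-Z]{3}\d{5}")      # three letters, five digits
OLD_STYLE = re.compile(r"[A-Z]\d{5}")       # one letter, five digits


def split_version(identifier: str):
    """Split 'M20730.1' into the constant accession and its version number."""
    accession, _, version = identifier.partition(".")
    return accession, int(version) if version else None


def classify(identifier: str) -> str:
    """Guess the record type from the accession layout alone."""
    accession, _ = split_version(identifier)
    if NUCLEOTIDE.fullmatch(accession):
        return "nucleotide"
    if PROTEIN.fullmatch(accession):
        return "protein"
    if OLD_STYLE.fullmatch(accession):
        return "old-style"   # could be nucleotide or protein
    return "unknown"


for identifier in ("M20730.1", "AAA25528.1", "AAA25529.1"):
    print(identifier, split_version(identifier), classify(identifier))
```

Keeping the accession and the version apart is what allows a citation to stay valid (the accession) while still pointing at one exact sequence (accession plus version).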
15
Example of protein search in DB: Leucotoxin Frey and Kuhnert 2002
16
GenBank DNA-sequence format
LOCUS       PASA1LKT     7801 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  Pasteurella haemolytica A1 leukotoxin gene, encoding LktA, LktB,
            LktC and LktD proteins, complete cds.
ACCESSION   M20730
VERSION     M20730.1  GI:150492
KEYWORDS    LktA protein; LktB protein; LktC protein; LktD protein.
SOURCE      Mannheimia haemolytica
  ORGANISM  Mannheimia haemolytica
            Bacteria; Proteobacteria; Gammaproteobacteria; Pasteurellales;
            Pasteurellaceae; Mannheimia.
REFERENCE   1  (bases 1 to 7801)
  AUTHORS   Lo,R.Y., Strathdee,C.A. and Shewen,P.E.
  TITLE     Nucleotide sequence of the leukotoxin genes of Pasteurella
            haemolytica A1
  JOURNAL   Infect. Immun. 55 (9), 1987-1996 (1987)
  MEDLINE   87306837
   PUBMED   3040588
COMMENT     Original source text: P.haemolytica (serotype 1, biotype A) DNA.
            Submitted in computer readable form by C.Strathdee 21-SEP-1988.
FEATURES             Location/Qualifiers
     source          1..7801
                     /organism="Mannheimia haemolytica"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:75985"
17
GenBank DNA-sequence format
     CDS             470..973
                     /note="LktC protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25528.1"
                     /db_xref="GI:150493"
                     /translation="MNQSYFNLMNSSLHK…..
     CDS             989..3850
                     /note="LktA protein"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA25529.1"
                     /db_xref="GI:150494"
                     /translation="MGTRLTTLSNGLKNTLTATKS…..
ORIGIN      3 bp upstream of EcoRV site.
        1 gatatcttgt gcctgcgcag taaccacaca cccgaataaa agggtcaaaa gtgttttttt
       61 cataaaaagt ccctgtgttt tcattataag gattaccact ttaacgcagt tactttctta
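The feature table is what makes a GenBank record machine readable. As a rough illustration, the sketch below pulls the two CDS features shown on the slide out of a text fragment and reports their lengths. It is a toy parser written for this example (the name `parse_cds` is invented): it only understands simple `start..end` locations and single-line qualifiers, so `complement(...)`, `join(...)` and multi-line qualifiers are out of scope. A real project would normally use an established parser library.

```python
import re

# Fragment of the feature table from the slide (translations omitted).
FEATURE_TABLE = """\
     CDS             470..973
                     /note="LktC protein"
                     /codon_start=1
                     /protein_id="AAA25528.1"
     CDS             989..3850
                     /note="LktA protein"
                     /codon_start=1
                     /protein_id="AAA25529.1"
"""


def parse_cds(text: str):
    """Return one dict per CDS feature with its location and qualifiers."""
    records = []
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        location = re.fullmatch(r"CDS\s+(\d+)\.\.(\d+)", stripped)
        if location:
            current = {"start": int(location.group(1)),
                       "end": int(location.group(2))}
            records.append(current)
            continue
        qualifier = re.fullmatch(r'/(\w+)="?([^"]*)"?', stripped)
        if qualifier and current is not None:
            current[qualifier.group(1)] = qualifier.group(2)
    return records


cds_records = parse_cds(FEATURE_TABLE)
for cds in cds_records:
    length = cds["end"] - cds["start"] + 1   # coordinates are inclusive
    print(cds["note"], cds["protein_id"], length, "bp,", length // 3, "codons")
```

For the two features on the slide this gives 504 bp (168 codons) and 2862 bp (954 codons). Both lengths are multiples of three, as expected for a complete coding sequence.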
18
Genome level comparison
COG (Clusters of Orthologous Groups)
* A: RNA processing and modification
* B: Chromatin structure and dynamics
* C: Energy production and conversion
* D: Cell cycle control and mitosis
* E: Amino acid metabolism and transport
* F: Nucleotide metabolism and transport
* …
* S: Function unknown
21
Example of search in specialized database.
* Selection of DB.
* Multiple versions of the same accession number.
* Different types of identifiers.
* Links (or lack of these) to other specialized databases.
23
Example of protein search in DB: Leucotoxin
* NCBI (http://www.ncbi.nlm.nih.gov/): Protein search with the keywords "Mannheimia haemolytica leukotoxin" gives over 100 hits.
* Swiss-Prot (http://expasy.org/sprot/): listed under the wrong name (Pasteurella); only one sequence (P16535), Swiss-Prot entry name LKA1_PASHA.
* NCBI searched with P16535.
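The same two searches can be expressed as structured queries. The slide used the NCBI web pages; the sketch below instead assumes NCBI's E-utilities `esearch` endpoint and only builds the query URLs, without sending any request. The helper name `esearch_url` is invented for this example.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities ESearch endpoint (not mentioned on the slide).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a structured (text field) query."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_ESEARCH + "?" + query


# Keyword search: broad, returns many hits.
keyword_url = esearch_url("protein",
                          "Mannheimia haemolytica[Organism] AND leukotoxin")
# Identifier search: narrow, targets one curated record.
accession_url = esearch_url("protein", "P16535")

print(keyword_url)
print(accession_url)
```

The contrast mirrors the slide: a keyword query against an archive returns many overlapping records, while an identifier taken from a curated database leads to a single entry.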
24
Example with 16S rRNA based identification of bacteria.
Relevant for food, veterinary and environmental microbiology.
16S rRNA sequence comparison is preferred for classification/identification:
* 16S rRNA genes are universally distributed.
* There is only one type of ribosome.
* No selection and no recombination (in theory).
* 16S rRNA gene sequence derived phylogeny reflects the natural relationship of bacteria.
* Current framework for bacterial taxonomy.
* Huge databases.
25
Example of sequence submission to a primary database. Isolate P876, 16S rRNA gene sequence. Length: 1449 bp TGCAAGTCGA ACGGTAGCAG GAAGAAAGCT TGCTTTCTTT GCTGACGAGT GGCGGACGGG TGAGTAATGC TTGGGAATCT GGCTTATGGA GGGGGATAAC TGTGGGAAAC TGCAGCTAAT ACCGCGTAAT CTCTGAGGAG TAAAGGGTGG GACyTTAGGG CCACCTGCCA TAAGATGAGC CCAAGTGGGA TTAGGTAGTT GGTGGGGTAA AGGCCTACCA AGCCTGCGAT CTCTAGCTGG TCTGAGAGGA TGACCAGCCA CACTGGAACT GAGACACGGT CCAGACTCCT ACGGGAGGCA GCAGTGGGGA ATATTGCGCA ATGGGGGGAA CCCTGACGCA GCCATGCCGC GTGAATGAAG AAGGCCTTCG GGTTGTAAAG TTCTTTCGGT AATGAGGAAG GGGTGTTrTT kAATAGATAG CATCATTGAC GTTAATTACA GAAGAAGCAC CGGCTAACTC CGTGCCAGCA GCCGCGGTAA TACGGAGGGT GCGAGCGTTA ATCGGAATAA CTGGGCGTAA AGGGCACGCA GGCGGACTTT TAAGTGAGAT GTGAAATCCC CGAGCTTAAC TTGGGAATTG CATTTCAGAC TGGGAGTCTA GAGTACTTTA GGGAGGGGTA GAATTCCACG TGTAGCGGTG AAATGCGTAG AGATGTGGAG GAATACCGAA GGCGAAGGCA GCCCCTTGGG AATGTACTGA CGCTCATGTG CGAAAGCGTG GGGAGCAAAC AGGATTAGAT ACCCTGGTAG TCCACGCTGT AAACGCTGTC GATTTGGGGA TTGGGCTTTA AGCTTGGTGC CCGAAGCTAA CGTGATAAAT CGACCGCCTG GGGAGTACGG CCGCAAGGTT AAAACTCAAA TGAATTGACG GGGGCCCGCA CAAGCGGTGG AGCATGTGGT TTAATTCGAT GCAACGCGAA GAACCTTACC TACTCTTGAC ATCCTAAGAA GAGCTCAGAG ATGAGCTTGT GCCTTCGGGA ACTTAGAGAC AGGTGCTGCA TGGCTGTCGT CAGCTCGTGT TGTGAAATGT TGGGTTAAGT CCCGCAACGA GCGCAACCCT TATCCTTTGT TGCCAGCGAT TTGGTCGGGA ACTCAAAGGA GACTGCCAGT GACAAACTGG AGGAAGGTGG GGATGACGTC AAGTCATCAT GGCCCTTACG AGTAGGGCTA CACACGTGCT ACAATGGTGC ATACAGAGGG CAGCGAGAGT GCGAGCTTAA GCGAATCTCA GAAAGTGCAT CTAAGTCCGG ATTGGAGTCT GCAACTCGAC TCCATGAAGT CGGAATCGCT AGTAATCGCA AATCAGAATG TTGCGGTGAA TACGTTCCCG GGCCTTGTAC ACACCGCCCG TCACACCATG GGAGTGGGTT GTACCAGAAG TAGATAGCTT AACCTTCGGG AGGGCGTTTA CCACGGTATG ATTCATGACT GGGGTGAAGT CGTAACAGA Submission to GenBank with BankIt
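Before a sequence like this is submitted it is usually cleaned and checked. The sketch below removes the spacing used on the slide, reports IUPAC ambiguity codes (the lowercase `y`, `r` and `k` in the sequence above are such codes) and writes FASTA text. The helper names and the record name `P876_16S` are chosen for this example, and only a short excerpt of the slide's sequence is used.

```python
# IUPAC codes for positions where the base call is ambiguous.
IUPAC_AMBIGUITY = {
    "R": "A or G", "Y": "C or T", "K": "G or T", "M": "A or C",
    "S": "C or G", "W": "A or T", "N": "any base",
}

# Three consecutive 10-base blocks taken from the slide's sequence.
EXCERPT = "TAAAGGGTGG GACyTTAGGG CCACCTGCCA"


def clean(raw: str) -> str:
    """Remove whitespace and convert to upper case."""
    return "".join(raw.split()).upper()


def ambiguous_positions(sequence: str):
    """Return (1-based position, code) for every base that is not A, C, G or T."""
    return [(index + 1, base) for index, base in enumerate(sequence)
            if base not in "ACGT"]


def to_fasta(name: str, sequence: str, width: int = 60) -> str:
    """Format a sequence as FASTA text with fixed-width lines."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


sequence = clean(EXCERPT)
for position, code in ambiguous_positions(sequence):
    print(position, code, IUPAC_AMBIGUITY.get(code, "not a valid IUPAC code"))
print(to_fasta("P876_16S", sequence))
```

Reporting the ambiguous positions before submission makes it easy to go back to the raw traces and decide whether a base call can be resolved.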
30
Errors are detected during automatic translation of DNA to protein, when the sequence is curated at the database.
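The kind of check meant here can be sketched as follows: translate the submitted coding sequence and flag anything that does not look like a complete open reading frame. This is an illustration of the idea, not the database's actual curation pipeline. The function names and the short test sequences are invented, and the codon table is the standard one (whose amino acid assignments, to my knowledge, are shared by the bacterial table 11 named in the record).

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): amino_acid
               for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate complete codons; unknown codons (e.g. with N) become X."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, usable, 3))


def check_cds(dna: str):
    """Return a list of problems found in a putative coding sequence."""
    problems = []
    if len(dna) % 3:
        problems.append("length is not a multiple of three")
    protein = translate(dna)
    if "*" in protein[:-1]:
        problems.append("internal stop codon")
    if not protein.endswith("*"):
        problems.append("no terminal stop codon")
    return problems


# Hypothetical short sequences, not taken from the GenBank record.
print(translate("ATGAACCAATCATAA"), check_cds("ATGAACCAATCATAA"))
print(translate("ATGTAACAATCATAA"), check_cds("ATGTAACAATCATAA"))
```

A single inserted or deleted base shifts the reading frame and typically shows up as an internal stop codon, which is why a translation check catches many sequencing errors.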
31
BLAST 16S rRNA: the query is the 1449 bp 16S rRNA gene sequence of isolate P876 given in the sequence submission example.
32
Four pieces of DB advice.
* Start with NCBI and/or Swiss-Prot.
* Remember the differences between 1. repositories (archives) and 2. specialized databases. Parallel resources often exist in Europe and the USA.
* Find help in the scientific literature.
* Be aware of errors in the DB.
Cite the databases correctly (see the first issue of NAR each year).