Introduction to Bioinformatics Databases


1 Introduction to Bioinformatics Databases

2 Central dogma of molecular biology: DNA → RNA → protein → phenotype. A main focus of bioinformatics is to study molecular sequence data to gain insight into a broad range of biological problems.

3 With the use of bioinformatics we can learn about the variation that occurs between species, and we can deduce the evolutionary history of life on Earth. After Pace NR (1997) Science 276:734. Page 6

4 Growth of GenBank, December 1982 to June 2006 [chart: base pairs of DNA (billions) and sequences (millions), plotted by year]

5 Growth of the International Nucleotide Sequence Database Collaboration [chart: base pairs of DNA (billions) contributed by GenBank, EMBL, and DDBJ] http://www.ncbi.nlm.nih.gov/Genbank/

6 Central dogma of molecular biology: DNA → RNA → protein. Central dogma of bioinformatics and genomics: genome → transcriptome → proteome.

7 DNA → RNA → protein → phenotype, with the databases for each level: genomic DNA databases (DNA); cDNA, ESTs, UniGene (RNA); protein sequence databases (protein). Fig. 2.2 Page 20

8 There are three major public DNA databases: GenBank, EMBL, and DDBJ. The underlying raw DNA sequences are identical. Page 16

9 There are three major public DNA databases: GenBank, housed at NCBI (National Center for Biotechnology Information); EMBL, housed at EBI (European Bioinformatics Institute); and DDBJ, housed in Japan. Page 16

10 >300,000 species are represented in GenBank Table 2-1

11 Taxonomy nodes at NCBI http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi (8/06)

12

13 The most sequenced organisms in GenBank (updated 8-12-04, GenBank release 142.0; Table 2-2, Page 18)
Homo sapiens: 10.7 billion bases
Mus musculus: 6.5b
Rattus norvegicus: 5.6b
Danio rerio: 1.7b
Zea mays: 1.4b
Oryza sativa: 0.8b
Drosophila melanogaster: 0.7b
Gallus gallus: 0.5b
Arabidopsis thaliana: 0.5b

14 The most sequenced organisms in GenBank (updated 8-29-05, GenBank release 149.0; Table 2-2, Page 18)
Homo sapiens: 11.2 billion bases
Mus musculus: 7.5b
Rattus norvegicus: 5.7b
Danio rerio: 2.1b
Bos taurus: 1.9b
Zea mays: 1.4b
Oryza sativa (japonica): 1.2b
Xenopus tropicalis: 0.9b
Canis familiaris: 0.8b
Drosophila melanogaster: 0.7b

15 The most sequenced organisms in GenBank (updated 7-19-06, GenBank release 154.0; Table 2-2, Page 18)
Homo sapiens: 12.3 billion bases
Mus musculus: 8.0b
Rattus norvegicus: 5.7b
Bos taurus: 3.5b
Danio rerio: 2.5b
Zea mays: 1.8b
Oryza sativa (japonica): 1.5b
Strongylocentrotus purpuratus: 1.2b
Sus scrofa: 1.0b
Xenopus tropicalis: 1.0b

16 National Center for Biotechnology Information (NCBI) www.ncbi.nlm.nih.gov Page 24

17 Types of data in GenBank: DNA level; RNA level (cDNA); protein sequences; …

18 www.ncbi.nlm.nih.gov Fig. 2.5 Page 25

19 Fig. 2.5 Page 25

20 PubMed is… the National Library of Medicine's search service; 16 million citations in MEDLINE; links to participating online journals; PubMed tutorial (via “Education” on the side bar). Page 24

21 Entrez integrates… the scientific literature; DNA and protein sequence databases; 3D protein structure data; population study data sets; assemblies of complete genomes Page 24

22 Entrez is a search and retrieval system that integrates NCBI databases Page 24

23 BLAST is… the Basic Local Alignment Search Tool; NCBI's sequence similarity search tool; supports analysis of DNA and protein databases; 100,000 searches per day. Page 25

24 OMIM is… Online Mendelian Inheritance in Man; a catalog of human genes and genetic disorders; edited by Dr. Victor McKusick and others at JHU. Page 25

25 Books is… a searchable resource of on-line books. Page 26

26 TaxBrowser is… a browser for the major divisions of living organisms (archaea, bacteria, eukaryota, viruses); taxonomy information such as genetic codes; molecular data on extinct organisms. Page 26

27 Structure site includes… the Molecular Modelling Database (MMDB); biopolymer structures obtained from the Protein Data Bank (PDB); Cn3D (a 3D-structure viewer); the vector alignment search tool (VAST). Page 26

28 Accessing information on molecular sequences Page 26

29 Accession numbers are labels for sequences NCBI includes databases (such as GenBank) that contain information on DNA, RNA, or protein sequences. You may want to acquire information beginning with a query such as the name of a protein of interest, or the raw nucleotides comprising a DNA sequence of interest. DNA sequences and other molecular data are tagged with accession numbers that are used to identify a sequence or other record relevant to molecular data. Page 26

30 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
X02775: GenBank genomic DNA sequence
NT_030059: Genomic contig
rs7079946: dbSNP (single nucleotide polymorphism)
RNA
N91759.1: An expressed sequence tag (1 of 170)
NM_006744: RefSeq DNA sequence (from a transcript)
protein
NP_006735: RefSeq protein
AAC02945: GenBank protein
Q28369: SwissProt protein
1KT7: Protein Data Bank structure record
Page 27

31 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 27

32 Four ways to access protein and DNA sequences: [1] Entrez Gene with RefSeq. Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms. RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735). Page 27

33 From the NCBI home page, type “rbp4” and hit “Go” revised Fig. 2.7 Page 29

34 revised Fig. 2.7 Page 29

35

36

37 By applying limits, there are now just two entries

38 revised Fig. 2.8 Page 30 Entrez Gene (top of page) Note that links to many other RBP4 database entries are available

39 Entrez Gene (middle of page)

40 Entrez Gene (bottom of page)

41 Fig. 2.9 Page 32

42 Fig. 2.9 Page 32

43 Fig. 2.9 Page 32

44 FASTA format Fig. 2.10 Page 32

45 FASTA format
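
A FASTA record is a header line beginning with ">" followed by one or more lines of sequence. A minimal sketch of a reader, assuming plain-text input: the function name `parse_fasta` and the example record (header text and residues) are made up for illustration and are not the RBP4 record shown in the figure.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each header (without the leading '>') to its joined sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; everything until the next '>' belongs to it.
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines are concatenated without the line breaks.
            records[header] += line
    return records


# Illustrative record only (hypothetical header and residues).
example = """>demo_record hypothetical protein (illustrative only)
MKWVTFISLL
FLFSSAYS
"""
records = parse_fasta(example.splitlines())
```

The header is kept whole here; splitting it into an identifier and a description is left out because header conventions differ between databases.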

46 What is an accession number? An accession number is a label used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence. Examples (all for retinol-binding protein, RBP4):
DNA
X02775: GenBank genomic DNA sequence
NT_030059: Genomic contig
rs7079946: dbSNP (single nucleotide polymorphism)
RNA
N91759.1: An expressed sequence tag (1 of 170)
NM_006744: RefSeq DNA sequence (from a transcript)
protein
NP_006735: RefSeq protein
AAC02945: GenBank protein
Q28369: SwissProt protein
1KT7: Protein Data Bank structure record
Page 27

47

48 NCBI’s important RefSeq project: best representative sequences. RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon “reference” version of a sequence. RefSeq identifiers include the following formats:
Complete genome: NC_######
Complete chromosome: NC_######
Genomic contig: NT_######
mRNA (DNA format): NM_###### (e.g. NM_006744)
Protein: NP_###### (e.g. NP_006735)
Page 29-30

49 NCBI’s RefSeq project: accession for genomic, mRNA, protein sequences
Accession         Molecule   Note
AP_123456         Protein    Protein products; alternate
NC_123456         Genomic    Complete genomic molecules
NG_123456         Genomic    Incomplete genomic regions
NM_123456         mRNA       Transcript products; mRNA
NM_123456789      mRNA       Transcript products; 9-digit
NP_123456         Protein    Protein products;
NP_123456789      Protein    Protein products; 9-digit
NR_123456         RNA        Non-coding transcripts
NT_123456         Genomic    Genomic assemblies
NW_123456         Genomic    Genomic assemblies
NZ_ABCD12345678   Genomic    Whole genome shotgun data
XM_123456         mRNA       Transcript products
XP_123456         Protein    Protein products
XR_123456         RNA        Transcript products
YP_123456         Protein    Protein products
ZP_12345678       Protein    Protein products
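
The two-letter prefix of a RefSeq accession is enough to tell what kind of molecule it labels. A minimal sketch of that lookup, using only the prefixes listed on this slide: the function name `classify_refseq` and the regular expression are illustrative assumptions, not an official NCBI validator.

```python
import re
from typing import Optional, Tuple

# Prefix -> (molecule, note), copied from the prefixes listed on this slide.
REFSEQ_PREFIXES = {
    "AP": ("Protein", "Protein products; alternate"),
    "NC": ("Genomic", "Complete genomic molecules"),
    "NG": ("Genomic", "Incomplete genomic regions"),
    "NM": ("mRNA", "Transcript products; mRNA"),
    "NP": ("Protein", "Protein products"),
    "NR": ("RNA", "Non-coding transcripts"),
    "NT": ("Genomic", "Genomic assemblies"),
    "NW": ("Genomic", "Genomic assemblies"),
    "NZ": ("Genomic", "Whole genome shotgun data"),
    "XM": ("mRNA", "Transcript products"),
    "XP": ("Protein", "Protein products"),
    "XR": ("RNA", "Transcript products"),
    "YP": ("Protein", "Protein products"),
    "ZP": ("Protein", "Protein products"),
}

# Assumed shape: two letters, underscore, optional letters, digits, optional ".version".
_REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z]*\d+(?:\.\d+)?$")


def classify_refseq(accession: str) -> Optional[Tuple[str, str]]:
    """Return (molecule, note) for a RefSeq-style accession, or None if it is not one."""
    match = _REFSEQ_PATTERN.match(accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))


rbp4_transcript = classify_refseq("NM_006744")
genbank_record = classify_refseq("X02775")  # GenBank accession, not RefSeq
```

Accessions without the underscore pattern, such as the GenBank record X02775, return None rather than a guess.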

50 Ensembl to access protein and DNA sequences. Try Ensembl at www.ensembl.org for a premier human genome web browser. Ensembl is a joint scientific project between the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. Its aim is to provide a centralised resource for geneticists, molecular biologists and other researchers studying the genomes of our own species and other vertebrates. We will encounter Ensembl as we study the human genome, BLAST, and other topics.

51 click human

52 Species in Ensembl [figure: tree of species groups, labeled: fishes, birds, reptiles, mammals, placentals, monotremes, marsupials, other birds, paleognaths, passerines, crocodiles, turtles, lizards, amphibians, teleosts, sharks, rays, Latimeria, bichir/Polypterus, lungfishes, agnathans, non-vertebrates]

53 enter RBP4

54

55 Four ways to access DNA and protein sequences [1] Entrez Gene with RefSeq [2] UniGene [3] European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI) [4] ExPASy Sequence Retrieval System (separate from NCBI) Page 33

56 ExPASy to access protein and DNA sequences ExPASy sequence retrieval system (ExPASy = Expert Protein Analysis System) Visit http://www.expasy.ch/ Page 33

57 Fig. 2.11 Page 33

58

59 Example of how to access sequence data: HIV-1 pol There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol Page 34

60

61 Searching for HIV-1 pol: Following the “genome” link yields a manageable three results

62 Example of how to access sequence data: HIV-1 pol. For the Entrez query “hiv-1 pol” there are about 40,000 nucleotide or protein records (and >100,000 records for a search for “hiv-1”), but these can be reduced in two steps: (1) specify the organism, e.g. hiv-1[organism]; (2) limit the output to RefSeq. Page 34
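
The same two-step narrowing can be written as a single query string. The slides use the NCBI web page and its “limits”; the sketch below instead assumes NCBI's E-utilities `esearch` endpoint and the `srcdb_refseq[prop]` filter as a way to restrict results to RefSeq, neither of which appears in the slides. The function name `build_esearch_url` is hypothetical, and the code only builds the URL without contacting NCBI.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities search endpoint (not mentioned in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(organism: str, refseq_only: bool = True,
                      db: str = "nucleotide") -> str:
    """Build a search URL that restricts by organism and, optionally, to RefSeq."""
    # Step 1: specify the organism, as in hiv-1[organism].
    terms = [f"{organism}[organism]"]
    # Step 2: limit the output to RefSeq (assumed filter syntax).
    if refseq_only:
        terms.append("srcdb_refseq[prop]")
    query = " AND ".join(terms)
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": query})


url = build_esearch_url("hiv-1")
```

Keeping each restriction as a separate term joined by AND mirrors the way limits are added one at a time in the web interface.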

63 Over 100,000 nucleotide entries for HIV-1, but only 1 RefSeq entry

64 Examples of how to access sequence data: histone (8-12-06)
Query for “histone”        # results
protein records            21847
RefSeq entries             7544
RefSeq (limit to human)    1108
NOT deacetylase            697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.

65

66 Access to Biomedical Literature Page 35

67 PubMed at NCBI to find literature information

68 PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >14 million records dating back to 1966. Page 35

69

70

71 PubMed search strategies: try the tutorial (“Education” on the left sidebar); use Boolean queries with capitalized operators (AND, OR, NOT), e.g. lipocalin AND disease; try using “limits”; try “Links” to find Entrez information and external resources; obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/ Page 35

72 Boolean queries, with term 1 = lipocalin and term 2 = disease (8/04): lipocalin AND disease (1 AND 2): 60 results; lipocalin OR disease (1 OR 2): 1,650,000 results; lipocalin NOT disease (1 NOT 2): 530 results. Fig. 2.12 Page 34
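
The three Boolean operators behave like set operations on the citations matching each term. A small sketch with Python sets: the citation IDs below are invented for illustration and are not real PubMed results.

```python
# Hypothetical citation IDs matching each search term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease        # lipocalin AND disease: in both sets
either = lipocalin | disease      # lipocalin OR disease: in at least one set
only_first = lipocalin - disease  # lipocalin NOT disease: in the first set only
```

AND narrows a search, OR widens it, and NOT removes the overlap, which is why the OR query on the slide returns by far the most results.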

