
1 Database Resources of the National Center for Biotechnology Information Baharak Rastegari MEDG 505 presentation February 3, 2005 David.


1 1 Database Resources of the National Center for Biotechnology Information Baharak Rastegari MEDG 505 presentation February 3, 2005 baharak@cs.ubc.ca David L. Wheeler et al. Nucleic Acids Research, Vol. 33, Database issue

2 2 NCBI! What is it? Created in 1988 at the National Institutes of Health To develop information systems for molecular biology Maintains: GenBank(R) nucleic acid sequence database Provides: Data retrieval systems & computational resources


4 4 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions


6 6 Entrez Text searching → using Boolean queries → of a diverse set of over 20 databases Simultaneous searches across all Entrez databases at speeds comparable to a single database search
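A Boolean Entrez query of the kind described on this slide can be sketched in Python. This is only an illustration under stated assumptions: it builds a request URL for NCBI's E-utilities ESearch service, which the slides do not mention, and the helper names `boolean_query` and `esearch_url`, the example field tags, and the chosen database are hypothetical choices made for the example. No network request is sent.

```python
from urllib.parse import urlencode

# Assumption: the public E-utilities ESearch endpoint (not named in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def boolean_query(*terms, operator="AND"):
    """Join search terms with one Boolean operator, parenthesizing each term."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    return f" {operator} ".join(f"({term})" for term in terms)


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch URL for one Entrez database."""
    return EUTILS_ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


# Hypothetical example terms; field tags restrict each term to one index.
query = boolean_query("BRCA1[Gene Name]", "Homo sapiens[Organism]")
url = esearch_url("nucleotide", query)
print(query)
print(url)
```

Keeping query construction separate from URL construction lets the same Boolean expression be reused against several Entrez databases by changing only `db`.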


8 8 Entrez Retrieved records can be displayed in a wide variety of formats → GenBank Flatfile, FASTA, XML, … Graphical display is offered for some types of records Search history → allows users to recall the results of previous searches and combine them using Boolean logic
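The idea of combining earlier searches with Boolean logic can be modelled with plain Python sets. This is a toy sketch, not Entrez's implementation: the class `SearchHistory`, its methods, and the record IDs are all hypothetical, and each stored search is simply a set of IDs numbered in the order it was run.

```python
class SearchHistory:
    """Toy model of a search history: numbered result sets combined with Boolean logic."""

    def __init__(self):
        self._results = {}

    def record(self, ids):
        """Store a result set and return its history number (1, 2, 3, ...)."""
        number = len(self._results) + 1
        self._results[number] = set(ids)
        return number

    def get(self, number):
        return self._results[number]

    def combine(self, left, operator, right):
        """Combine two stored searches and store the outcome as a new entry."""
        a, b = self._results[left], self._results[right]
        if operator == "AND":
            combined = a & b      # records found by both searches
        elif operator == "OR":
            combined = a | b      # records found by either search
        elif operator == "NOT":
            combined = a - b      # records in the first search only
        else:
            raise ValueError("operator must be AND, OR or NOT")
        return self.record(combined)


history = SearchHistory()
first = history.record({101, 102, 103})    # hypothetical record IDs
second = history.record({102, 103, 104})
both = history.combine(first, "AND", second)
print(both, sorted(history.get(both)))
```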

9 9 Entrez PubMed → includes 12.8 million references and abstracts in MEDLINE(R) → with links to the full text of more than 4400 journals available on the web PubMed Central → digital archive of peer-reviewed journals in the life sciences → access to over 300 000 full-text articles → over 160 journals Books database → contains more than 35 online scientific textbooks

10 10 Entrez Extensive links within and between databases to related information Links between a genomic assembly and its components Links between a master sequence and those derived from its annotation LinkOut: expands the range of links from individual database records to related outside services

11 11 PubMed Central PMC: digital archive of peer reviewed journals in life sciences Access to over 300 000 full text articles Over 160 journals

12 12 Taxonomy Indexes over 165 000 named organisms Can be used to view taxonomic position or retrieve data from a database for a particular organism or group Searches can be made on whole, partial or phonetically spelled organism names Links to organisms commonly used in biological research are provided Displays custom taxonomic trees, representing user-defined subsets of the full NCBI taxonomy


16 16 Entrez Gene Successor to LocusLink Provides an interface to curated sequences and descriptive information about genes With links to gene-related resources → NCBI’s Map Viewer, Evidence Viewer, BLAST Link, …

17 17 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions

18 18 BLAST Family BLAST → Local alignment search tool → performs sequence-similarity searches against a variety of sequence databases → returns a set of gapped alignments between the query and database sequences BLAST2Sequences → compares two DNA or protein sequences → produces a dot-plot representation of the alignments
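To make "local alignment" concrete, here is a minimal Python sketch of the Smith-Waterman dynamic program, which computes the best-scoring gapped local alignment exactly. It is not BLAST: BLAST is a faster heuristic that approximates this kind of search, and the scoring values below (match +2, mismatch -1, gap -2) are assumptions chosen for illustration only.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


# The shared core "ACG" scores 3 matches x 2 = 6; the flanking bases do not help.
print(smith_waterman_score("TTACGTTT", "GGACGGG"))
```

The floor at zero is what makes the alignment local: poorly matching flanks are dropped rather than penalized.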


20 20 BLAST Family MegaBLAST → designed to search for nearly exact matches → handles batch nucleotide queries → operates up to 10 times faster than standard nucleotide BLAST BLASTLink (BLink) → displays pre-computed protein BLAST alignments for each protein in the Entrez databases → can display subsets of these alignments by taxonomic criteria, database of origin, …

21 21 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions

22 22 UniGene System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters Each cluster contains sequences that represent a unique gene, and is linked to related information Human UniGene → over 4.5 million human ESTs → reduced 42-fold in number to approximately 107 000 sequence clusters Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression

23 23 ProtEST Analogous to BLASTLink Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences Reports are updated in tandem with UniGene protein similarities

24 24 Trace & Assembly Archives Trace Archive allows for flexible searching and download of sequencing traces Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank

25 25 HomoloGene System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes New HomoloGene build is guided by the taxonomic tree and relies on: → conserved gene order & measures of DNA similarity among closely related species → protein similarity for more distantly related organisms

26 26 …HomoloGene ‘Ancestor’ field → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry → using it, a search can be limited to genes conserved in one of 22 ancestral groups ‘Pairwise Score’ display gives a table of pairwise statistics for members of a HomoloGene group that includes → percent amino acid and nucleotide identities → Jukes-Cantor genetic distance parameter → the ratio of non-synonymous to synonymous substitutions (Ka/Ks)
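Two of the pairwise statistics listed above are straightforward to compute by hand. The Python sketch below uses the standard Jukes-Cantor correction, d = -(3/4) ln(1 - (4/3) p), where p is the fraction of differing sites between two aligned nucleotide sequences. The function names and the example sequences are hypothetical, and this is not a description of how HomoloGene itself computes its table.

```python
import math


def percent_identity(a, b):
    """Percent of aligned positions that are identical (equal-length sequences)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def jukes_cantor_distance(p):
    """Jukes-Cantor distance for an observed fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical aligned sequences differing at 1 of 10 sites (p = 0.1).
seq1 = "ACGTACGTAC"
seq2 = "ACGTACGTAA"
identity = percent_identity(seq1, seq2)
distance = jukes_cantor_distance(1 - identity / 100)
print(identity, round(distance, 4))
```

The corrected distance is slightly larger than the raw difference p because the model accounts for multiple substitutions at the same site.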

27 27 dbMHC Supports clinical applications and research related to the major histocompatibility complex (MHC) Includes Reagent Database and Clinical sections

28 28 Reference Sequences RefSeq provides curated references for → transcripts, proteins and genomic regions → computationally derived nucleotide sequences and proteins Containing 1.3 million sequences → including more than 1 million protein sequences → representing more than 2400 organisms

29 29 ORF Finder and Spidey ORF finder → performs a six-frame translation of a nucleotide sequence → returns the location of each ORF within a specified size range Spidey → alignment tool for eukaryotic genomic sequences → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models → returns exon alignments, protein translations and a summary showing the alignment quality, …
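The six-frame scan that ORF Finder performs can be sketched in a few lines of Python. This is a simplified illustration, not NCBI's implementation: it assumes ATG as the only start codon and the three standard stop codons, and the function names and the `min_len`/`max_len` parameters (standing in for the "specified size range") are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_len=9, max_len=None):
    """Scan all six reading frames for ATG...stop open reading frames.

    Returns tuples (strand, frame, start, end). Coordinates are 0-based,
    end-exclusive, and for the "-" strand refer to the reverse complement.
    Lengths are in nucleotides and include the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # open a candidate ORF
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_len and (max_len is None or length <= max_len):
                        orfs.append((strand, frame, start, i + 3))
                    start = None                   # look for the next ORF
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```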

30 30 Electronic PCR (e-PCR) Forward e-PCR → searches for matches to STS primer pairs in the UniSTS database of over 450 000 markers → to increase sensitivity, allows the size of primer segment to be matched, number of mismatches, number of gaps and the size of the STS to be adjusted Reverse e-PCR → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases
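The core of forward e-PCR, finding where a primer pair would bind a template and checking the product size, can be sketched as follows. This is a deliberately naive illustration under stated assumptions: it allows mismatches but not gaps, scans every position rather than using an index, and the function names and the example template are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def primer_hits(template, primer, max_mismatches):
    """Start positions where primer matches template with few enough mismatches."""
    hits = []
    for i in range(len(template) - len(primer) + 1):
        window = template[i:i + len(primer)]
        mismatches = sum(a != b for a, b in zip(window, primer))
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def forward_epcr(template, fwd, rev, min_size, max_size, max_mismatches=0):
    """Return (start, end, size) for each predicted amplicon in the size range.

    The reverse primer binds the opposite strand, so its reverse complement
    is what appears on the template strand downstream of the forward primer.
    """
    rev_site = reverse_complement(rev)
    amplicons = []
    for f in primer_hits(template, fwd, max_mismatches):
        for r in primer_hits(template, rev_site, max_mismatches):
            size = r + len(rev_site) - f
            if min_size <= size <= max_size:
                amplicons.append((f, r + len(rev_site), size))
    return amplicons


# Hypothetical template: forward site at 2, reverse site ending at 22.
template = "GGACGTACTTTTTTTTGATTCACC"
print(forward_epcr(template, "ACGTAC", "TGAATC", 15, 25))
```

Raising `max_mismatches` trades specificity for sensitivity, which mirrors the adjustable parameters described on the slide.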


33 33 dbSNP Database of single nucleotide polymorphisms Repository for single base nucleotide substitutions and short deletion and insertion polymorphisms Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms

34 34 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions

35 35 Entrez Genomes Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress Includes: → over 180 complete microbial genomes → more than 1600 viral genomes → over 550 reference sequences for eukaryotic organelles → … Complete genomes can be accessed hierarchically starting from either → an alphabetical listing → a phylogenetic tree for each of six principal taxonomic groups

36 36 COGs database Clusters of orthologous groups Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms Eukaryotic version, KOGs, is available for seven eukaryotes

37 37 Map & Evidence Viewer Map Viewer displays → genome assemblies → genetic and physical markers → the results of annotation and other analyses using sets of aligned maps Evidence Viewer displays the alignments to a genomic contig of → RefSeq transcripts → GenBank mRNAs → known or potential transcripts → ESTs supporting a gene model

38 38 Model Maker Used to construct transcript models using combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and NCBI RefSeqs, to the NCBI human genome assembly

39 39 Cancer Chromosomes Consists of → NCI/NCBI SKY, M-FISH and CGH databases → NCI Mitelman Database of Chromosome Aberrations in Cancer → NCI Recurrent Chromosome Aberrations in Cancer database Three search formats are available → conventional Entrez query → Quick/Simple search: set of menus to select a disease site or diagnosis → Advanced search: combination of forms for more complex queries

40 40 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions

41 41 SAGEmap Provides two-way mapping between → regular (10 base) and LongSAGE (17 base) SAGE tags → UniGene clusters SAGEmap repository contains → 381 SAGE experiments from 11 organisms Can also construct a user-configurable table of data comparing one group of SAGE libraries with another Is updated weekly


43 43 Gene Expression Omnibus Data repository and retrieval system for any high-throughput gene expression or molecular abundance data Contains → microarray-based experiments measuring the abundance of mRNA, genomic DNA and protein molecules → non-array-based technologies such as SAGE and mass spectrometry peptide profiling Now contains → high-throughput gene expression data from about 30 000 hybridization experiments → about 1000 array definitions → half a billion individual spot measurements derived from over 100 organisms

44 44 OMIM Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University Contains information on disease phenotypes and genes Contains → about 16 000 entries

45 45 DB Resources Categories Databases retrieval tools The BLAST family of sequence-similarity search programs Resources for Gene-level sequences Resources for Genome-scale analysis Resources for the analysis of patterns of gene expression and phenotypes The molecular modeling database, the conserved domain database search, CDART and Protein interactions

46 46 MMDB Built by processing entries from the Protein Data Bank Structures are linked to sequences in Entrez and to the Conserved Domain Database (CDD) Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI’s 3D molecular structure viewer, Cn3D

47 47 HIV-1/Human Protein Interaction DB Concise summary of documented interactions between HIV-1 proteins and → host cell proteins → other HIV-1 proteins → proteins from disease organisms associated with HIV or AIDS Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, … are presented

48 48 Summary / Conclusion NCBI provides many tools for data retrieval and analysis of data in GenBank and other biological data All of the tools and resources can be found easily on the website http://www.ncbi.nih.gov/ along with documentation and explanatory material The NCBI Handbook and several tutorials are available One can search for tools and information on the NCBI website by choosing NCBI Website as the database


50 50 Thank you!

51 51 Outline Introduction Related work Components of a Pseudoknotted Sec. Str. Parsing algorithm Enumerating loops Akutsu’s structure class Conclusion & Future work

