1 Database Resources of the National Center for Biotechnology Information Baharak Rastegari MEDG 505 presentation February 3, 2005 David.

Presentation transcript:

1 Database Resources of the National Center for Biotechnology Information Baharak Rastegari MEDG 505 presentation February 3, 2005 David L. Wheeler et al. Nucleic Acids Research, Vol. 33, Database issue

2 NCBI! What is it? Created in 1988 at the National Institutes of Health to develop information systems for molecular biology Maintains: the GenBank(R) nucleic acid sequence database Provides: data retrieval systems & computational resources

4 DB Resources Categories → Database retrieval tools → The BLAST family of sequence-similarity search programs → Resources for gene-level sequences → Resources for genome-scale analysis → Resources for the analysis of patterns of gene expression and phenotypes → The Molecular Modeling Database, the Conserved Domain Database search, CDART and protein interactions

6 Entrez Text searching → using Boolean queries → of a diverse set of over 20 databases Simultaneous searches across all Entrez databases at speeds comparable to a single database search
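
A Boolean Entrez query can also be assembled programmatically. The sketch below only builds a request URL and does not contact NCBI. It assumes NCBI's E-utilities ESearch endpoint, which the slides do not mention, and the function name, the chosen database and the field tags are illustrative.

```python
from urllib.parse import urlencode

# Assumption: NCBI's public E-utilities ESearch endpoint (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db, terms, operator="AND", retmax=20):
    """Join search terms with one Boolean operator and return an ESearch URL."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR or NOT")
    query = f" {operator} ".join(terms)
    params = {"db": db, "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Illustrative query: records for the BRCA1 gene restricted to human.
url = build_esearch_url(
    "nucleotide", ["BRCA1[Gene Name]", "Homo sapiens[Organism]"]
)
print(url)
```

Swapping the `db` argument is all it takes to point the same Boolean query at another Entrez database.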

8 Entrez Retrieved records can be displayed in a wide variety of formats → GenBank Flatfile, FASTA, XML, … Graphical display is offered for some types of records Search history → allows users to recall the results of previous searches and combine them using Boolean logic
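
Combining earlier searches with Boolean logic amounts to set operations on the stored result lists. The following is a minimal sketch of that idea, not Entrez's implementation: the `combine` helper and the record IDs are hypothetical, and only expressions of the form `#1 AND #2` are handled.

```python
def combine(history, expression):
    """Evaluate an expression such as '#1 AND #2' against stored result sets."""
    left, op, right = expression.split()
    a = history[int(left.lstrip("#"))]
    b = history[int(right.lstrip("#"))]
    if op == "AND":
        return a & b   # records found by both searches
    if op == "OR":
        return a | b   # records found by either search
    if op == "NOT":
        return a - b   # records found by the first search only
    raise ValueError(f"unsupported operator: {op}")


# Hypothetical history: search number -> set of made-up record IDs.
history = {1: {101, 102, 103}, 2: {102, 103, 104}}
print(combine(history, "#1 AND #2"))
```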

9 Entrez PubMed → includes 12.8 million references and abstracts in MEDLINE(R) → with links to the full text of more than 4400 journals available on the web PubMed Central → digital archive of peer-reviewed journals in the life sciences → access to over full text articles → over 160 journals Books database → contains more than 35 online scientific textbooks

10 Entrez Extensive links within and between databases to related information Links between a genomic assembly and its components Links between a master sequence and those derived from its annotation LinkOut: expands the range of links from individual database records to related outside services

11 PubMed Central PMC: digital archive of peer reviewed journals in life sciences Access to over full text articles Over 160 journals

12 Taxonomy Indexed over named organisms Can be used to view taxonomic position or retrieve data from a database for a particular organism or group Searches can be made on whole, partial or phonetically spelled organism names Links to organisms commonly used in biological research are provided Displays custom taxonomic trees, representing user-defined subsets of the full NCBI taxonomy

16 Entrez Gene Successor to LocusLink Provides an interface to curated sequences and descriptive information about genes With links to gene-related resources → NCBI’s Map Viewer, Evidence Viewer, BLAST Link, …

18 BLAST Family BLAST → local alignment search tool → performs sequence-similarity searches against a variety of sequence databases → returns a set of gapped alignments between the query and database sequences BLAST2Sequences → compares two DNA or protein sequences → produces a dot-plot representation of the alignments
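
To illustrate what a dot-plot shows, here is a minimal sketch that marks every position where a short window of one sequence exactly matches a window of the other. This is a plain window comparison, not the BLAST algorithm, and BLAST2Sequences draws its plot from computed alignments. The function name and the `window` parameter are assumptions made for the example.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return text rows with '*' where a window of seq_a equals a window of seq_b."""
    rows = []
    for i in range(len(seq_a) - window + 1):
        row = ""
        for j in range(len(seq_b) - window + 1):
            same = seq_a[i:i + window] == seq_b[j:j + window]
            row += "*" if same else "."
        rows.append(row)
    return rows


# Identical sequences give an unbroken main diagonal.
for line in dot_plot("ACGTTGCA", "ACGTTGCA"):
    print(line)
```

Similar regions appear as diagonal runs of marks, and a shift between the sequences moves the run off the main diagonal.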

20 BLAST Family MegaBLAST → designed to search for nearly exact matches → handles batch nucleotide queries → operates up to 10 times faster than standard nucleotide BLAST BLASTLink (BLink) → displays pre-computed protein BLAST alignments for each protein in the Entrez databases → can display subsets of these alignments by taxonomic criteria, database of origin, …

22 UniGene System for automatically partitioning GenBank sequences, including ESTs, into a non-redundant set of gene-oriented clusters Each cluster contains sequences that represent a unique gene, and is linked to related information Human UniGene → over 4.5 million human ESTs → reduced 42-fold in number to approximately sequence clusters Has been used as a source of unique sequences for the fabrication of microarrays for the large-scale study of gene expression

23 ProtEST Analogous to BLASTLink Presents pre-computed BLAST alignments between protein sequences from model organisms and six-frame translations of UniGene nucleotide sequences Reports are updated in tandem with UniGene protein similarities

24 Trace & Assembly Archives Trace Archive allows for flexible searching and download of sequencing traces Assembly Archive links the raw sequence information found in the Trace Archive with assembly information found in GenBank

25 HomoloGene System for automated detection of homologs among the annotated genes of several completely sequenced eukaryotic genomes The new HomoloGene build is guided by the taxonomic tree and relies on: → conserved gene order & measures of DNA similarity among closely related species → protein similarity for more distantly related organisms

26 …HomoloGene ‘Ancestor’ field → refers to the taxonomic group of the last common ancestor of the species represented in a HomoloGene entry → using it, it is possible to limit a search to genes conserved in one of 22 ancestral groups ‘Pairwise Score’ display gives a table of pairwise statistics for members of a HomoloGene group that includes → percent amino acid and nucleotide identities → Jukes-Cantor genetic distance parameter → the ratio of non-synonymous to synonymous substitution rates (Ka/Ks)
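
Two of the pairwise statistics are easy to work through by hand. The sketch below computes percent identity and the Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) p), where p is the fraction of differing sites. It assumes two nucleotide sequences that are already aligned, of equal length and without gaps. The function names are illustrative and this is not HomoloGene's own code.

```python
import math


def percent_identity(a, b):
    """Percentage of aligned positions at which the two sequences agree."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def jukes_cantor_distance(a, b):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p) for nucleotides."""
    p = 1.0 - percent_identity(a, b) / 100.0  # fraction of differing sites
    if p >= 0.75:
        raise ValueError("distance is undefined when p >= 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# One difference in eight sites: p = 0.125.
print(percent_identity("ACGTACGT", "ACGTACGA"))
print(jukes_cantor_distance("ACGTACGT", "ACGTACGA"))
```

The distance is slightly larger than the observed fraction of differences because the model corrects for multiple substitutions at the same site.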

27 dbMHC Supports clinical applications and research related to the major histocompatibility complex (MHC) Includes Reagent Database and Clinical sections

28 Reference Sequences RefSeq provides curated references for → transcripts, proteins and genomic regions → computationally derived nucleotide sequences and proteins Containing 1.3 million sequences → including more than 1 million protein sequences → representing more than 2400 organisms

29 ORF Finder and Spidey ORF Finder → performs a six-frame translation of a nucleotide sequence → returns the location of each ORF within a specified size range Spidey → alignment tool for eukaryotic genomic sequences → takes into account predicted splice sites in constructing its alignment, and can use one of four splice-site models → returns exon alignments, protein translations and a summary showing the alignment quality, …
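
The six-frame translation step can be sketched as follows. This is a simplified illustration rather than NCBI's ORF Finder: it assumes the standard genetic code, upper-case DNA input and ORFs that start at ATG and end at the first in-frame stop codon. All function and parameter names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_codons=2, max_codons=10_000):
    """Return (strand, frame, start, end, protein) for ORFs in all six frames.

    Coordinates are 0-based, end-exclusive, and refer to the strand being
    read. The size range counts codons and excludes the stop codon.
    """
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = translate(s[frame:])
            start = None
            for idx, aa in enumerate(protein):
                if aa == "M" and start is None:
                    start = idx
                elif aa == "*" and start is not None:
                    if min_codons <= idx - start <= max_codons:
                        orfs.append((strand, frame, frame + 3 * start,
                                     frame + 3 * (idx + 1), protein[start:idx]))
                    start = None
    return orfs


print(find_orfs("CCATGAAATAGCC"))
```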

30 Electronic PCR (e-PCR) Forward e-PCR → searches for matches to STS primer pairs in the UniSTS database of over markers → to increase sensitivity, allows the size of primer segment to be matched, number of mismatches, number of gaps and the size of the STS to be adjusted Reverse e-PCR → used to estimate the genomic binding site, amplicon size and specificity for sets of primer pairs by searching against the genomic and transcript databases
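
The primer-pair search can be illustrated with a small sketch. It assumes upper-case DNA, allows mismatches but not gaps, and scans only the given strand of the template, so it is far simpler than e-PCR itself. The function names, parameters and example sequences are made up for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def mismatches(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_sites(template, primer, max_mismatches):
    """Start positions where the primer matches within the mismatch limit."""
    n = len(primer)
    return [i for i in range(len(template) - n + 1)
            if mismatches(template[i:i + n], primer) <= max_mismatches]


def epcr(template, forward, reverse, max_mismatches=0, size_range=(1, 1000)):
    """Return (start, end, size) for each predicted amplicon on the template."""
    # The reverse primer binds the opposite strand, so search the template
    # for its reverse complement.
    reverse_site = reverse_complement(reverse)
    hits = []
    for f in find_sites(template, forward, max_mismatches):
        for r in find_sites(template, reverse_site, max_mismatches):
            size = r + len(reverse_site) - f
            if r >= f and size_range[0] <= size <= size_range[1]:
                hits.append((f, r + len(reverse_site), size))
    return hits


# Made-up template and primers.
template = "TTACGTAAAAGGCCTT"
print(epcr(template, "ACGT", "AGGCC"))
```

Raising `max_mismatches` or widening `size_range` increases sensitivity at the cost of more spurious hits, which mirrors the adjustable parameters listed on the slide.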

33 dbSNP Database of single nucleotide polymorphisms Repository for single base nucleotide substitutions and short deletion and insertion polymorphisms Contains 9.8 million human SNPs as well as about 5 million from a variety of other organisms

35 Entrez Genomes Provides access to genomic data contributed by the scientific community for species whose sequencing and mapping is complete or in progress Includes: → over 180 complete microbial genomes → more than 1600 viral genomes → over 550 reference sequences for eukaryotic organelles → … Complete genomes can be accessed hierarchically starting from either → an alphabetical listing → a phylogenetic tree for each of six principal taxonomic groups

36 COGs database Clusters of orthologous groups Presents a compilation of orthologous groups of proteins from 66 completely sequenced organisms Eukaryotic version, KOGs, is available for seven eukaryotes

37 Map & Evidence Viewer Map Viewer displays → genome assemblies → genetic and physical markers → the results of annotation and other analyses, using sets of aligned maps Evidence Viewer displays the alignments to a genomic contig of → RefSeq transcripts → GenBank mRNAs → known or potential transcripts → ESTs supporting a gene model

38 Model Maker Used to construct transcript models using combinations of putative exons derived from ab initio predictions or from the alignment of GenBank transcripts, including ESTs and NCBI RefSeqs, to the NCBI human genome assembly

39 Cancer Chromosomes Consists of → NCI/NCBI SKY, M-FISH and CGH databases → NCI Mitelman Database of Chromosome Aberrations in Cancer → NCI Recurrent Chromosome Aberrations in Cancer database Three search formats are available → conventional Entrez query → Quick/Simple search: set of menus to select a disease site or diagnosis → Advanced search: combination of forms for more complex queries

41 SAGEmap Provides two-way mapping between → regular (10 base) and LongSAGE (17 base) SAGE tags → UniGene clusters SAGEmap repository contains → 381 SAGE experiments from 11 organisms Can also construct a user-configurable table of data comparing one group of SAGE libraries with another Is updated weekly

43 Gene Expression Omnibus Data repository and retrieval system for any high-throughput gene expression or molecular abundance data Contains → microarray-based experiments measuring the abundance of mRNA, genomic DNA and protein molecules → non-array-based technologies such as SAGE → mass spectrometry peptide profiling Now contains → high-throughput gene expression data from about hybridization experiments → about 1000 array definitions → half a billion individual spot measurements derived from over 100 organisms

44 OMIM Catalog of human genes and genetic disorders authored and edited by Victor A. McKusick at the Johns Hopkins University Contains information on disease phenotypes and genes Contains → about entries

46 MMDB Built by processing entries from the Protein Data Bank Structures are linked to sequences in Entrez and to the Conserved Domain Database (CDD) Conserved Domain Search can be used to search a protein sequence for conserved domains in CDD Wherever possible, CDD hits are linked to structures, which can be viewed with NCBI’s 3D molecular structure viewer, Cn3D

47 HIV-1/Human Protein Interaction DB Concise summary of documented interactions between HIV-1 proteins and → host cell proteins → other HIV-1 proteins → proteins from disease organisms associated with HIV or AIDS Summaries, including protein RefSeq accession numbers, Entrez Gene ID numbers, … are presented

48 Summary / Conclusion NCBI provides many tools for retrieving and analysing data in GenBank and other biological databases All of the tools and resources can be found easily on the website, along with documentation and explanatory material http:// The NCBI Handbook and several tutorials are available One can search for tools and information on the NCBI website by choosing NCBI Website as the database

50 Thank you!

51 Outline Introduction Related work Components of a Pseudoknotted Sec. Str. Parsing algorithm Enumerating loops Akutsu’s structure class Conclusion & Future work