Created as a part of NLM in 1988 Establish public databases Research in computational biology Develop software tools for sequence analysis Disseminate.

Presentation transcript:

NCBI
Created as a part of NLM in 1988
  Establish public databases
  Research in computational biology
  Develop software tools for sequence analysis
  Disseminate biomedical information
Tools: BLAST (1990), Entrez (1992), GenBank (1992), Free MEDLINE (PubMed, 1997), Human genome (2001)

NCBI Home Page
To learn more, visit the “Site Map” and “About NCBI” web pages.

Entrez: An Integrated Database Search and Retrieval System

The (ever) Expanding Entrez System
  Journals, UniGene, PubMed, Nucleotide, Protein, SNP, Genome, Books, ProbeSet, OMIM, CDD, Taxonomy, 3D Domains, UniSTS, PopSet, Structure

Literature Databases
  PubMed
  Books
  PubMed Central
  Journals
  Online Mendelian Inheritance in Man (OMIM)

Molecular Sequence Databases
Sequence Databases
  Nucleotide (GenBank)
  Taxonomy
  PopSet
  Protein
Marker Databases
  Single Nucleotide Polymorphisms (SNPs, dbSNP)
  Sequence Tagged Sites (STSs, dbSTS)
  Expressed Sequence Tags (ESTs, dbEST)
  UniGene

Molecular Databases
Primary Databases
  Original submissions by experimentalists
  Database staff organize but don’t add additional information
  Example: GenBank
Derivative Databases
  Human curated: compilation and correction of data
    Example: SWISS-PROT, NCBI RefSeq mRNA
  Computationally derived
    Example: UniGene
  Combinations
    Example: NCBI Genome Assembly

[Figure: sequence fragments passing between Labs, GenBank, Curators, RefSeq, Algorithms, UniGene, and Genome Assembly]

The International Nucleotide Sequence Database Collaboration
  NCBI (NIH): GenBank, searched through Entrez
  CIB (NIG): DDBJ, searched through Get Entry
  EBI (EMBL): EMBL, searched through SRS

Entrez Nucleotide
  GenBank  71%
  DDBJ     19%
  EMBL      9%
  RefSeq    1%
  PDB       0.01%

What is GenBank? NCBI’s Primary Sequence Database
  Nucleotide only sequence database
  Archival in nature
GenBank Data
  Direct submissions: individual records (BankIt, Sequin)
  Batch submissions via (EST, GSS, STS)
  ftp accounts established for sequencing centers
Data shared amongst three collaborating databases:
  GenBank
  DNA Database of Japan (DDBJ)
  European Molecular Biology Laboratory Database (EMBL)

The Old Way From Fran Lewitter, Whitehead Institute

GenBank: NCBI’s Primary Sequence Database
  Full release every two months
  Incremental and cumulative updates daily
  Available only through internet
    ftp://ftp.ncbi.nih.gov/genbank/
    ftp://genbank.sdsc.edu/pub
    ftp://bio-mirror.net/biomirror/genbank/
Release 136 June
  ,592,865 Records (18,197,119 in June 2002)
  32,528,249,295 Nucleotides (22,616,937,182 in June 2002)
  110,000+ Species
  121 Gigabytes of data

GenBank Divisions
Traditional Divisions
  BCT   Bacterial/Archaeal
  INV   Invertebrate
  MAM   Mammalian (ex. ROD/PRI)
  PHG   Phage
  PLN   Plant/Fungal
  PRI   Primate
  ROD   Rodent
  SYN   Synthetic (cloning vectors)
  VRL   Viral
  VRT   Other Vertebrate
Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  GSS   Genome Survey Sequence
  HTGS  High Throughput Genomic Sequence
  HTC   High Throughput cDNA

A Traditional GenBank Record
[Figure labels: Locus Field, Molecule Type, GenBank Division, Modification Date, Definition Line, Taxonomy, GI (GenInfo), Keywords, Submission Field]

Feature Table GenPept Record Genomic DNA Sequence

Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  HTGS  High Throughput Genomic Sequence
Batch Submission, , or ftp
Inaccurate
Poorly Characterized

EST Division: Expressed Sequence Tags
  nucleus: 30,000 genes
  RNA gene products
  make cDNA library: ,000 unique cDNA clones in library
  isolate unique clones
  sequence once from each end (5’ and 3’)

>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC
>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
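ESTs are single-pass reads, which is why the sequences on this slide contain N wherever a base could not be called. As a rough illustration only, the sketch below parses FASTA text and reports the fraction of ambiguous positions per read. The function names and the two short example reads are hypothetical and are not taken from the slide or from any NCBI tool.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def ambiguous_fraction(seq):
    """Fraction of positions that are not A, C, G, or T (for example N)."""
    if not seq:
        return 0.0
    unambiguous = sum(seq.upper().count(base) for base in "ACGT")
    return 1.0 - unambiguous / len(seq)


# Hypothetical reads, made up for this example.
example = """>read_5prime hypothetical
ACGTNACGTA
>read_3prime hypothetical
TTGACNNACA
ACGT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq), round(ambiguous_fraction(seq), 3))
```

A high ambiguous fraction is one simple signal of the "inaccurate, poorly characterized" nature of bulk-division records.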

What is UniGene?
  A gene-oriented view of sequence entries
  MegaBlast-based automated sequence clustering
  Nonredundant set of gene-oriented clusters
  Each cluster represents a unique gene
  Provides information on tissue-specific expression and map locations
  Includes well-characterized genes and novel ESTs
  Useful for gene discovery and selection of mapping reagents

EST hits to Homo sapiens muscle creatine kinase mRNA Query Sequence (muscle creatine kinase mRNA) 5’ EST Hits 3’ EST Hits

UniGene Entry for H. sapiens Muscle Creatine Kinase

STS Division: Sequence Tagged Sites
  Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
  PCR with STS primers gives one product per genome
  Basis of Radiation Hybrid Mapping
    UniGene
    Genome Assembly
  Related resource: Electronic PCR
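The "one product per genome" property can be checked in silico, which is the idea behind electronic PCR. The sketch below is a simplified stand-in and not NCBI's e-PCR program: it uses exact string matching, looks at one strand only, and all names, the template, and the primers are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_all(text, pattern):
    """Return every start index of pattern in text, overlaps included."""
    hits = []
    start = text.find(pattern)
    while start != -1:
        hits.append(start)
        start = text.find(pattern, start + 1)
    return hits


def predicted_products(template, forward, reverse, max_size=1000):
    """List (start, end) spans that the primer pair would amplify.

    Assumption: primers match exactly, and only the given strand of the
    template is searched. The reverse primer anneals to the opposite
    strand, so its binding site appears here as its reverse complement.
    """
    template = template.upper()
    forward = forward.upper()
    reverse_site = reverse_complement(reverse)
    products = []
    for f_start in find_all(template, forward):
        for r_start in find_all(template, reverse_site):
            end = r_start + len(reverse_site)
            # The reverse site must lie downstream of the forward primer,
            # and the product must not exceed the size limit.
            if r_start >= f_start + len(forward) and end - f_start <= max_size:
                products.append((f_start, end))
    return products


# Hypothetical template and primer pair, made up for this example.
template = "GGATCCAAACTGACTGACCCTTTGGGAAATTTCCC"
forward_primer = "ATCCAAAC"
reverse_primer = "AATTTCCC"

products = predicted_products(template, forward_primer, reverse_primer)
print(products, "unique STS" if len(products) == 1 else "not unique")
```

A primer pair that yields more than one span, or none, would not qualify as a usable STS under this simplified test.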

UniSTS: Database of Mapped Markers

HTG Division: High Throughput Genome
  40,000 to > 50,000 bp
  phase 1 (HTG, Acc = AC): unfinished, may be unordered, with gaps
  phase 2 (HTG, Acc = AC): unfinished, oriented, ordered, may have gaps
  phase 3 (ROD, Acc = AC): finished, no gaps
  Same accession numbers, different versions
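Because an HTG record keeps its accession number while its version changes from phase to phase, identifiers of the form accession.version can be grouped by accession to find the newest version. The sketch below shows one way to do that. The function names and the example identifiers are hypothetical placeholders, not real records.

```python
import re


def split_accession(identifier):
    """Split 'ACCESSION.VERSION' into (accession, version as int)."""
    match = re.fullmatch(r"([A-Z]+_?\d+)\.(\d+)", identifier.strip())
    if match is None:
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return match.group(1), int(match.group(2))


def latest_versions(identifiers):
    """Map each accession to the highest version number seen for it."""
    newest = {}
    for identifier in identifiers:
        accession, version = split_accession(identifier)
        newest[accession] = max(version, newest.get(accession, 0))
    return newest


# Hypothetical identifiers: one clone seen at versions 1 and 3, another at 2.
seen = ["AC000000.1", "AC000000.3", "AC000001.2"]
print(latest_versions(seen))
```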

HTG Division: High Throughput Genome

RefSeq: NCBI’s Derivative Sequence Database
  Curated transcripts and proteins
    reviewed human, mouse, rat, fruit fly, zebrafish, arabidopsis
  Human model transcripts and proteins
  Assembled Genomic Regions (contigs)
    draft human genome
    mouse genome
  Chromosome records
    Microbial
    viral
    organelle

Reference Sequences
Curated
  Chromosome: NC_
  Gene: NG_
  mRNA: NM_
  RNA: NR_
  protein: NP_
Automated
  Contig: NT_, NW_
  Model mRNA: XM_
  Model RNA: XR_
  Model protein: XP_
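The prefix of a RefSeq accession tells you what kind of record it is, so a lookup table is enough to classify one. The sketch below encodes the prefixes listed on this slide. Which prefixes count as curated and which as automated is an assumption about how the slide groups them, and the function name and example accessions are hypothetical.

```python
# Prefix -> (molecule type, how the record is produced).
REFSEQ_PREFIXES = {
    "NC_": ("chromosome", "curated"),
    "NG_": ("gene", "curated"),
    "NM_": ("mRNA", "curated"),
    "NR_": ("RNA", "curated"),
    "NP_": ("protein", "curated"),
    "NT_": ("contig", "automated"),
    "NW_": ("contig", "automated"),
    "XM_": ("model mRNA", "automated"),
    "XR_": ("model RNA", "automated"),
    "XP_": ("model protein", "automated"),
}


def classify_refseq(accession):
    """Return (molecule type, origin) for a RefSeq accession, else None."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Placeholder accessions, made up for this example.
for acc in ["NM_000000", "XP_000000.1", "AC000000"]:
    print(acc, classify_refseq(acc))
```

An accession without one of these prefixes, such as a GenBank AC record, returns None, which marks it as a primary rather than a reference sequence.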

RefSeq Chromosomes: NC_

LOCUS       NC_ bp DNA circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_
VERSION     NC_ GI:
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H.,
            Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T.,
            Honda,T., Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai
            carrying the verotoxin 2 genes of the enterohemorrhagic
            Escherichia coli O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), (1999)
  MEDLINE
   PUBMED

RefSeq Contig: NT_, NW_

Curated RefSeq Records: NM_, NP_

Alignment Generated Transcripts: XM_,XP_

RefSeq: Summary

BLAST: a starting point for most bioinformatics-related problems…

BLAST

One BLAST, many flavors

BLAST databases

Example: BLASTing protein sequence

BLAST output

BLAST output formatting

BLAST output

BLAST output low complexity filter

BLAST
Scores we get from BLAST have an underlying distribution.
E-value: the number of alignments with a particular score, or a better score, that are expected to occur by chance when comparing two random sequences.
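The slide gives no formula, but under the standard Karlin-Altschul statistics the expected number of chance alignments with bit score S' or better is E = m * n * 2^(-S'), where m is the query length and n the total database length. The sketch below computes this and the matching probability of seeing at least one such hit. The function names are hypothetical, and real BLAST uses effective lengths that correct for edge effects, so its reported values will differ somewhat.

```python
import math


def expected_hits(bit_score, query_length, database_length):
    """E-value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


def probability_of_at_least_one(e_value):
    """P-value under a Poisson model: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Each extra bit of score halves the E-value for a fixed search space.
for score in (10, 11, 20):
    e = expected_hits(score, query_length=32, database_length=32)
    print(score, e, probability_of_at_least_one(e))
```

For small E the probability is nearly equal to E, which is why the two are often used interchangeably when judging weak hits.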

BLAST