NCBI NCBI Molecular Biology Resources A Field Guide Nov. 6, 2001.



NCBI Resources
- About NCBI
- NCBI Sequence Databases
    - Primary Database: GenBank
    - Derivative Databases: RefSeq
- Entrez Databases and Text Searching
- BLAST Services
- Genomic Resources

The National Center for Biotechnology Information (NCBI)
- Created as a part of the National Library of Medicine in 1988
    - Establish public databases
    - Research in computational biology
    - Develop software tools for sequence analysis
    - Disseminate biomedical information
- Tools: BLAST (1990), Entrez (1992)
- GenBank (1992)
- Free MEDLINE (PubMed, 1997)
- Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

NCBI Molecular Databases
- Primary Databases
    - Original submissions by experimentalists
    - Database staff organize but don't add additional information
    - Example: GenBank
- Derivative Databases
    - Human curated: compilation and correction of data
        - Example: SWISS-PROT, NCBI RefSeq mRNA
    - Computationally derived
        - Example: UniGene
    - Combinations
        - Example: NCBI Genome Assembly

What is GenBank? NCBI's Primary Sequence Database
- Nucleotide only sequence database
- Archival in nature
- GenBank data
    - Direct submissions: individual records (BankIt, Sequin)
    - Batch submissions (EST, GSS, STS)
    - ftp accounts: sequencing centers
- Data shared nightly among three collaborating databases
    - GenBank
    - DNA Data Bank of Japan (DDBJ)
    - European Molecular Biology Laboratory Database (EMBL) at EBI

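[Figure: the three collaborating databases exchange data. GenBank (NCBI, NIH; retrieval through Entrez), EMBL (EBI; retrieval through SRS), and DDBJ (CIB, NIG; retrieval through getentry) each receive submissions and updates.]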


GenBank
- Full release every two months
- Incremental and cumulative updates daily
- Available only through the Internet: ftp://ncbi.nlm.nih.gov/genbank/ or ftp://genbank.sdsc.edu/pub/

Release 126, October
- ,602,262 records
- 14,396,883,064 nucleotides
- 80,000+ species

GenBank on FTP site
ftp> open ftp.ncbi.nlm.nih.gov
ftp> cd genbank
Release 125: 243 files; Gigabytes uncompressed

GenBank Divisions

Bulk Sequence Divisions
- PAT: Patent
- EST: Expressed Sequence Tags (133 files)
- STS: Sequence Tagged Site
- GSS: Genome Survey Sequence (41 files)
- HTG: High Throughput Genome (25 files)
- HTC: High Throughput cDNA
- CON: Contig

Traditional Divisions
- BCT, INV, MAM, PHG, PLN, PRI, ROD, SYN, UNA, VRL, VRT

EST Division: Expressed Sequence Tags

[Figure: nucleus with 30,000 genes; ,000 RNA gene products; make cDNA library; ,000 unique cDNA clones in library; isolate unique clones; sequence once from each end (5' and 3').]

>IMAGE: ', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

>IMAGE: ' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

STS Division: Sequence Tagged Sites
- Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
- PCR with STS primers gives unique product (one per genome)
- Basis of Radiation Hybrid Mapping
    - UniGene
    - Genome Assembly
- Related resource: Electronic PCR

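RH mapping using STSs
[Figure: STS markers A, B, C, D along a human chromosome; hybrid cells each retain a subset of the markers (for example AB, D, CD, ABCD); PCR results show which markers each hybrid cell carries.]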

ePCR Results

Hexokinase 1 EST SHGC
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation

Query sequence:
  1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
 61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG
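The ePCR result on this slide can be reproduced by hand: locate Primer1 on the query, locate the reverse complement of Primer2 downstream, and measure the span between them. The following is a minimal sketch of that idea using exact string matching only; the function names are illustrative and are not part of NCBI's e-PCR program, which also tolerates mismatches and checks a size window. Only the first 180 bases of the query are needed to contain both primer sites.

```python
# Exact-match sketch of an electronic PCR check (illustrative names, not NCBI's tool).
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def amplicon_size(template: str, forward_primer: str, reverse_primer: str):
    """Length of the product bounded by both primers, or None if either is absent."""
    start = template.find(forward_primer)
    if start < 0:
        return None
    # The reverse primer anneals to the opposite strand, so search for its
    # reverse complement on the given strand, at or after the forward primer.
    rc = reverse_complement(reverse_primer)
    rc_start = template.find(rc, start)
    if rc_start < 0:
        return None
    return rc_start + len(rc) - start


# First 180 bases of the query sequence shown on the slide.
QUERY = (
    "TTTTTGAATTGGTACAAAGTTTACTAGGTCATACGACACGGCTCACAAAGCGGTGGGAAA"
    "TTCCAGTGATGGCATTGTTTGTTGGTTGGTTCCTTTTATCCAAATGGAGACAAGACACAT"
    "TTCCGCAGACGTGTCCACCTCCCCCCACGAGACAAACAGAATGCAAGACTGTCACACGCG"
)
PRIMER1 = "CATACGACACGGCTCACAAA"
PRIMER2 = "CTGTTTGTCTCGTGGGGG"

print(amplicon_size(QUERY, PRIMER1, PRIMER2))  # 130
```

Primer1 begins at base 30 and the reverse complement of Primer2 ends at base 159, which gives the observed amplicon size of 130 reported on the slide.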

Genome Sequencing
[Figure: whole BAC insert (or genome); sonication; cloning; isolating; sequencing (GSS division); assembly; draft sequence (HTG division).]

GSS Division: Genome Survey Sequences
- Genomic equivalent of ESTs
- BAC and other first pass surveys
    - BAC end sequences
    - Whole Genome Shotgun (some)
- RAPIDS and other anonymous loci

[Figure: genomic clone (BAC) with its T7 end and SP6 end.]

HTG Division: High Throughput Genome
Records: 40,000 to > 350,000 bp

[Figure: a record passing through phase 1, phase 2, and phase 3, each labeled "Acc = AC" and "gi ="; phases 1 and 2 are in the HTG division, phase 3 is in PRI.]

NCBI The GenBank Record

A Simple GenBank Record

LOCUS       AF bp mRNA INV 02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF
VERSION     AF GI:
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:

GenBank Record, cont.

FEATURES             Location/Qualifiers
     source
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC "
                     /db_xref="GI: "
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Sequence and Database Identifiers: Locus, accession, gi, version

LOCUS       AF bp mRNA INV 02-MAR-2000
            [Locus Name; Sequence length; mol-type: mRNA (= cDNA), rRNA, snRNA, DNA; GB Division; Modification Date]
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
            [DEF line (Title)]
ACCESSION   AF
            [Accession Number]
VERSION     AF GI:
            [Accession.version; gi number]
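The callouts on the LOCUS line can be read as a fixed order of fields: locus name, sequence length, molecule type, GenBank division, modification date. A minimal sketch of a parser for that simple layout follows. The function name `parse_locus` is illustrative, the locus name `AF000000` is a hypothetical placeholder (the real value is missing from the transcript), and the length 3808 is taken from the "bases 1 to 3808" reference lines of the same record. Records that carry extra tokens, such as a `circular` topology flag, are deliberately rejected rather than guessed at.

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str        # Locus Name
    length: int      # Sequence length in bp
    mol_type: str    # mRNA, rRNA, snRNA, DNA, ...
    division: str    # GenBank division, e.g. INV
    date: str        # Modification Date


def parse_locus(line: str) -> Locus:
    """Parse a whitespace-separated LOCUS line with exactly the five callout fields."""
    tokens = line.split()
    if len(tokens) != 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unsupported LOCUS layout: {line!r}")
    return Locus(tokens[1], int(tokens[2]), tokens[4], tokens[5], tokens[6])


# AF000000 is a hypothetical locus name used only for illustration.
record = parse_locus("LOCUS       AF000000     3808 bp    mRNA    INV       02-MAR-2000")
print(record.division, record.length)
```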

Keywords, Source-organism

KEYWORDS    .
            [Legacy field; exceptions: EST, GSS, HTG]
SOURCE      Atlantic horseshoe crab.
            [Accepted common name]
  ORGANISM  Limulus polyphemus
            [Scientific name]
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
            [Taxonomic lineage according to GenBank]

Citation

REFERENCE   1 (bases 1 to 3808)
            [Article]
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2 (bases 1 to 3808)
            [Submitter Block]
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3 (bases 1 to 3808)
            [Update history]
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:
            [Previous version]

Feature Table

FEATURES             Location/Qualifiers
     source          [Biosource]
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             [Coding Sequence]
                     /note="N-terminal protein kinase domain; C-terminal
                     myosin heavy chain head; substrate for PKA"
                     /codon_start=1                [Reading Frame]
                     /product="myosin III"
                     /protein_id="AAC "            [Protein Identifiers]
                     /db_xref="GI: "
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL "
                     [GenPept]

Sequence

BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN         [Indicates beginning of sequence data]
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata
     3781 aagatacagt aactagggaa aaaaaaaa
//             [End of record]
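A quick consistency check on a record like this one is to confirm that the BASE COUNT line adds up to the sequence length given elsewhere in the record ("bases 1 to 3808" in the reference lines). The sketch below is one way to do that; `parse_base_count` is an illustrative helper name, not part of any NCBI toolkit.

```python
import re


def parse_base_count(line: str) -> dict:
    """Extract per-base counts from a GenBank BASE COUNT line."""
    pairs = re.findall(r"(\d+)\s+([acgt])\b", line.lower())
    return {base: int(count) for count, base in pairs}


counts = parse_base_count("BASE COUNT 1201 a 689 c 782 g 1136 t")
total = sum(counts.values())
print(counts, total)  # total is 3808, matching "bases 1 to 3808"
```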

NCBI Derivative Sequence Databases: RefSeq
NCBI Reference Sequences

mRNAs and Proteins
- NM_: Curated mRNA
- NP_: Curated Protein
- XM_: Predicted Transcript
- XP_: Predicted Protein

Gene Records
- NG_: Reference Genomic Sequence

Assemblies
- NT_: Contig (Mouse and Human Genomes)
- NC_: Chromosome (Microbial Genomes)
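Because the record type is encoded in the accession prefix, a lookup table is enough to tell the RefSeq classes on this slide apart. The following sketch assumes only the seven prefixes listed here; `classify_refseq` is an illustrative name and the accessions in the example are hypothetical placeholders.

```python
# Prefix table copied from the slide; later RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM_": "Curated mRNA",
    "NP_": "Curated Protein",
    "XM_": "Predicted Transcript",
    "XP_": "Predicted Protein",
    "NG_": "Reference Genomic Sequence",
    "NT_": "Contig (Mouse and Human Genomes)",
    "NC_": "Chromosome (Microbial Genomes)",
}


def classify_refseq(accession: str):
    """Return the record type for a RefSeq accession, or None for anything else."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# Hypothetical accessions, for illustration only.
for acc in ("NM_000000", "XP_000000", "AF000000"):
    print(acc, "->", classify_refseq(acc))
```

An accession without one of these prefixes, such as an ordinary GenBank accession, returns None.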

Curated RefSeq Records: NM_, NP_

RefSeq Nucleotide

LOCUS       NM_ bp mRNA PRI 26-JUL-1999
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator (CFTR) mRNA.
ACCESSION   NM_
COMMENT     REFSEQ: This reference sequence was derived from M
            PROVISIONAL RefSeq: This is a provisional reference sequence
            record that has not yet been subject to human review. The final
            curated reference sequence record may be somewhat different
            from this one.

RefSeq Protein

LOCUS       NP_ aa PRI 26-JUL-1999
DEFINITION  cystic fibrosis transmembrane conductance regulator.
ACCESSION   NP_
PID         g
VERSION     NP_ GI:
DBSOURCE    REFSEQ: accession NM_

Reviewed

COMMENT     REFSEQ: This reference sequence was derived from M , M
            On Feb 17, 2000 this sequence version replaced gi:
            Summary: Cystic fibrosis transmembrane conductance regulator is
            member 7 of the ATP-binding cassette sub-family C. The protein
            functions as a chloride channel and controls the regulation of
            other transport pathways. Mutations in this gene cause the
            autosomal recessive disorder, cystic fibrosis (CF) and
            congenital bilateral aplasia of the vas deferens (CBAVD).
            Alternative splice variants have been described, many of which
            result from mutations in the CFTR gene.
            COMPLETENESS: full length.

Alignment Generated Transcripts: XM_, XP_

[Figure: transcript aligned to genomic sequence, with a mismatch highlighted.]

LOCUS       XM_ bp mRNA PRI 16-NOV-2000
DEFINITION  Homo sapiens cystic fibrosis transmembrane conductance
            regulator, ATP-binding cassette (sub-family C, member 7)
            (CFTR), mRNA.
ACCESSION   XM_
VERSION     XM_ GI:

RefSeq Human Contig: NT_

LOCUS       NT_ bp DNA CON 16-NOV-2000
DEFINITION  Homo sapiens chromosome 7 working draft sequence segment,
            complete sequence.
ACCESSION   NT_
VERSION     NT_ GI:
KEYWORDS    HTG.
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
            Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
            Hominidae; Homo.
REFERENCE   1 (bases 1 to )
  AUTHORS   International Human Genome Project collaborators.
  TITLE     Toward the complete sequence of the human genome
  JOURNAL   Unpublished
COMMENT     GENOME ANNOTATION REFSEQ: NCBI contigs are derived from
            assembled genomic sequence data. They may include both draft
            and finished sequence.
            COMPLETENESS: not full length.
     mRNA            complement(join( , , , , , , , , , , , , , , , , , , , , , , , , , , ))
                     /partial
                     /gene="CFTR"
                     /product="cystic fibrosis transmembrane conductance
                     regulator, ATP-binding cassette (sub-family C, member 7)"
                     /transcript_id="XM_ "
                     /db_xref="LocusID:1080"
                     /db_xref="MIM:602421"
                     /note="derived by automated computational analysis using
                     gene prediction method: Acembly. Supporting evidence
                     includes similarity to: 9 proteins, 1 mRNAs See details
                     in AceView"
     gene            complement( )
                     /gene="CFTR"
                     /note="CF; MRP7; ABC35; ABCC7"
                     /db_xref="LocusID:1080"
CONTIG      join(AC : ,gap(100),AC : ,
            gap(100),AC : ,gap(100),
            complement(AC : ),gap(100),
            AC : ,AC : ,
            AC : ,AC : ,gap(100),
            AC : ,gap(100),AC : ,gap(100),
            complement(AC : ),gap(100),
            AC : ,gap(100),AC : ,
            gap(100),complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            AC : ,gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            complement(AC : ),gap(100),
            [Reordering draft sequence]

NCBI Map View of RefSeqs NM_ XM_ NT_

NCBI RefSeq Genome Records: NG_

RefSeq Chromosomes: NC_

LOCUS       NC_ bp DNA circular BCT 02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_
VERSION     NC_ GI:
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1 (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H.,
            Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T.,
            Honda,T., Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai
            carrying the verotoxin 2 genes of the enterohemorrhagic
            Escherichia coli O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), (1999)
  MEDLINE
   PUBMED

Other NCBI Derivative Databases
- UniGene: gene oriented expressed sequence clusters
- LocusLink: central resource and interface for known genes

NCBI NCBI Homepage

NCBI Entrez Similarity Searching Mendelian Inheritance in Man NCBI Homepage

NCBI Using Entrez An integrated database search and retrieval system

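Entrez: Neighboring and Hard Links
[Figure: the Entrez databases (PubMed abstracts, nucleotide sequences, protein sequences, 3-D structure (MMDB), genomes, taxonomy) joined by hard links. Neighbors within a database are computed by word weight (PubMed abstracts), BLAST (sequences), VAST (3-D structure), and phylogeny (taxonomy).]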

WWW Entrez
- PubMed: all of MEDLINE plus others; abstracts; links to online journals
- Nucleotide: GenBank, EMBL, DDBJ, RefSeq, PDB
- Protein: GenBank, DDBJ, EMBL translations; PDB, PIR, SWISS-PROT, PRF, RefSeq
- Structure: NCBI's MMDB, derived from PDB
- Genome: reference genomes; graphical views, assembled sequence and mapping data

Database Searching with Entrez
- Using limits and field restriction to find mouse GAPD
- Linking and neighboring with mouse GAPD

NCBI Entrez Nucleotides Mouse

NCBI Document Summaries: Mouse[All Fields] Chicken not mouse !? 3 million records

NCBI Entrez Nucleotides: Limits: Preview/Index Mouse

Entrez Nucleotides: Limits (search term: Mouse)

Field Restriction
- Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

Only From
- RefSeq, GenBank, EMBL, DDBJ

Exclude unwanted categories of sequences

Molecule
- Genomic DNA/RNA, mRNA, rRNA

Gene Location
- Genomic DNA/RNA, Mitochondrion, Chloroplast

NCBI Entrez Nucleotides: Limits: Organism Mouse

Document Summaries: Mouse[Organism]
2,976,070 [All Fields] - 2,921,009 [Organism] = 55,061

NCBI Exclude Bulk Sequences, mRNA

Adding Terms: Preview/Index
Term: glyceraldehyde 3 phosphate dehydrogenase

Fields
- Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word, Title Word, Uid, Volume

Search History
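The searches on these slides follow one pattern: each term is restricted to a field with the `term[Field]` notation (as in `Mouse[Organism]`), and restricted terms are combined with a Boolean operator. The sketch below only assembles such a query string; it does not contact Entrez. The helper names `field_term` and `build_query` are illustrative, and whether a given field name is accepted by the live service is an assumption to verify against the current Entrez help.

```python
def field_term(term: str, field: str) -> str:
    """Restrict a search term to one Entrez field, e.g. mouse[Organism]."""
    return f"{term}[{field}]"


def build_query(*terms: str, operator: str = "AND") -> str:
    """Join field-restricted terms with a Boolean operator."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError(f"unknown operator: {operator}")
    return f" {operator} ".join(terms)


query = build_query(
    field_term("mouse", "Organism"),
    field_term("glyceraldehyde 3 phosphate dehydrogenase", "All Fields"),
)
print(query)
# mouse[Organism] AND glyceraldehyde 3 phosphate dehydrogenase[All Fields]
```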

NCBI Mouse GAPD Records

Displaying Mouse GAPD Records

Formats
- Summary, Brief, GenBank, ASN.1, FASTA, GI list

Links and neighbors (related records)
- LinkOut, PubMed Links, Protein Links, Nucleotide Neighbors, PopSet Links, Structure Links, Genome Links, Taxonomy Links, OMIM Links

NCBI Entrez GenBank / GenPept GenPept

FASTA Format

>gi|193425|gb|M |MUSGAPDS Mus musculus testis-specific isoform of glycerald
GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC
AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC
ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT
CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC
CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT
GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT
AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA
CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT
ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA
CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT
GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC
TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC
CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC
CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC
CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG
CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC
GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC
GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA
TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG
GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC

FASTA Definition Line
>gi|193425|gb|M |MUSGAPDS
Fields, in order: gi number | database identifier | accession number | locus name

Database Identifiers
- gb: GenBank
- emb: EMBL
- dbj: DDBJ
- sp: SWISS-PROT
- pdb: Protein Databank
- pir: PIR
- prf: PRF
- ref: RefSeq
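The definition line on this slide is pipe-delimited: `gi`, the gi number, a database identifier, the accession, and the locus name, followed by a free-text description. A minimal sketch of splitting it apart follows. `parse_defline` is an illustrative name, the accession `M00000` is a hypothetical placeholder (the real one is truncated in the transcript), and only this five-field `gi|...|db|acc|locus` layout is handled.

```python
# Database identifier codes as listed on the slide.
DB_NAMES = {
    "gb": "GenBank", "emb": "EMBL", "dbj": "DDBJ", "sp": "SWISS-PROT",
    "pdb": "Protein Databank", "pir": "PIR", "prf": "PRF", "ref": "RefSeq",
}


def parse_defline(defline: str) -> dict:
    """Split a '>gi|number|db|accession|locus description' line into its parts."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    identifiers, _, description = defline[1:].partition(" ")
    fields = identifiers.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unsupported identifier layout: {identifiers!r}")
    _, gi, db, accession, locus = fields
    return {
        "gi": int(gi),
        "database": DB_NAMES.get(db, db),
        "accession": accession,
        "locus": locus,
        "description": description,
    }


# M00000 is a hypothetical accession used only for illustration.
info = parse_defline(">gi|193425|gb|M00000|MUSGAPDS Mus musculus testis-specific isoform")
print(info["gi"], info["database"], info["locus"])
```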

Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  level 1,
  class nuc-prot,
  descr {
    title "Mus musculus testis-specific isoform of glyceraldehyde
      3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products",
    update-date std { year 1994, month 11, day 9 },
    source {
      org {
        taxname "Mus musculus",
        common "house mouse",
        db { { db "taxon", tag id } },

Formats shown: FASTA Nucleotide, FASTA Protein, GenPept, GenBank, ASN.1

NCBI Toolbox

/*****************************************************************************
*
*   asn2ff.c
*   convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs.
*
*****************************************************************************/
#include
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include
#ifdef ENABLE_ID1
#include
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Toolbox Sources
ftp> open ncbi.nlm.nih.gov
ftp> cd toolbox
ftp> cd ncbi_tools
ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools

NCBI Related Proteins Protein Neighbors-Structure Links Structure Links Cn3D GAPD Structure

NCBI Advanced Neighbors: BLink

NCBI BLink

NCBI PubMed Link

NCBI Online Books

NCBI Entrez Structures Molecular Modeling Database (MMDB) and Cn3D

MMDB: Molecular Modeling Data Base
- Derived from experimentally determined PDB records
- Value added to PDB records including:
    - Addition of explicit chemical graph information
    - Validation
    - Inclusion of Taxonomy, Citation, and other information
    - Conversion to parseable ASN.1 data description language
- Structure neighbors determined by Vector Alignment Search Tool (VAST)

NCBI Searching MMDB 1CET

NCBI Structure Summary Cn3D viewer VAST neighbors BLAST neighbors

NCBI Cn3D : Displaying Structures Chloroquine

NCBI Structure Neighbors

NCBI Structural Alignments Chloroquine NADH

Why do we need similarity searching?
- Identification and annotation
    - Incomplete or no annotations (GenBank)
    - Incorrectly annotated sequences
- Evolutionary relationships
    - Homologous molecules may have similar functions, but it ain't necessarily so!

Basic Local Alignment Search Tool
- Widely used similarity search tool
- Heuristic approach based on Smith Waterman algorithm
- Finds best local alignments
- Provides statistical significance
- All combinations (DNA/Protein) of query and database
    - DNA vs DNA
    - DNA translation vs Protein
    - Protein vs Protein
    - Protein vs DNA translation
    - DNA translation vs DNA translation
- www, server, standalone, and network clients

Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.

For ungapped alignments, the expected number with score S or greater is

$E = K m n e^{-\lambda S}$ or $E = m n 2^{-S'}$

- $K$ = scale for search space
- $\lambda$ = scale for scoring system
- $S'$ = bitscore = $(\lambda S - \ln K)/\ln 2$
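The two forms of the expectation are the same quantity: substituting the bit score $S' = (\lambda S - \ln K)/\ln 2$ into $m n 2^{-S'}$ gives back $K m n e^{-\lambda S}$, where $m$ and $n$ are the lengths of the query and of the database. The sketch below checks that numerically. The function names are illustrative, and the values of lambda, K, m, n and S in the example are assumed for demonstration, not taken from the slides.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalize a raw score S to a bit score S' = (lambda*S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def evalue_from_raw(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def evalue_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative parameters (assumed): query length, database length, raw score.
LAM, K = 0.318, 0.134
m, n, S = 250, 1_000_000, 80

bits = bit_score(S, LAM, K)
print(evalue_from_raw(S, m, n, LAM, K), evalue_from_bits(bits, m, n))
```

Both printed values agree up to floating-point rounding, which is why bit scores can be compared across searches that use different scoring systems.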

Scoring Systems
- Nucleic acids: identity matrix
- Proteins
    - Position Independent Matrices
        - PAM Matrices (Percent Accepted Mutation)
            - Implicit model of evolution
            - Higher PAM numbers are all calculated from PAM1
            - PAM250 widely used
        - BLOSUM Matrices (BLOcks SUbstitution Matrices)
            - Empirically determined from alignment of conserved blocks
            - Each includes information up to a certain level of identity
            - BLOSUM62 widely used
    - Position Specific Score Matrices (PSSM)
        - PSI and RPS BLAST

BLOSUM62
[Table: lower-triangular BLOSUM62 substitution matrix over the residues A R N D C Q E G H I L K M F P S T W Y V X; for example A/A = 4, R/A = -1, R/R = 5.]
- Common amino acids have low weights
- Rare amino acids have high weights
- Negative for less likely substitutions
- Positive for more likely substitutions
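A position-independent matrix such as BLOSUM62 scores an ungapped alignment by summing one table entry per aligned pair. The sketch below uses a tiny excerpt of the matrix: the A/A, R/A and R/R entries appear on the slide, while the three tryptophan entries are taken from the standard published BLOSUM62 table and are included to show a rare residue scoring high. The function names are illustrative, and residues outside the excerpt raise a `KeyError`.

```python
# Excerpt of BLOSUM62 (symmetric, so each pair is stored once).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("A", "R"): -1, ("R", "R"): 5,   # shown on the slide
    ("W", "W"): 11, ("A", "W"): -3, ("R", "W"): -3,  # standard BLOSUM62 values
}


def pair_score(x: str, y: str) -> int:
    """Look up a substitution score regardless of the order of the pair."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def ungapped_score(seq1: str, seq2: str) -> int:
    """Sum substitution scores over an ungapped alignment of equal-length sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped alignment needs sequences of equal length")
    return sum(pair_score(x, y) for x, y in zip(seq1, seq2))


print(ungapped_score("ARW", "ARW"))  # 4 + 5 + 11 = 20
print(ungapped_score("ARW", "RAW"))  # -1 + -1 + 11 = 9
```

The identical tryptophan pair contributes more than the alanine and arginine identities together, which is the "rare amino acids have high weights" point in practice.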

Position Specific Substitution Rates
[Figure: an active site serine compared with a typical serine.]

Position Specific Score Matrix (PSSM)
[Table: one row per sequence position, starting at position 206 (D G V I S S C N G D S G G P L N C Q A), and one column per amino acid (A R N D C Q E G H I L K M F P S T W Y V).]
- Serine scored differently in these two positions
- Active site nucleophile

Gapped Alignments
- Gapping provides more biologically realistic alignments
- Statistical behavior not completely understood for gapped alignments
- Gapped BLAST parameters must be found by simulations for each matrix
- Affine gap costs = -(a + bk)
    - a = gap open penalty
    - b = gap extend penalty
    - k = gap length
    - A gap of length 1 receives the score -(a + b)
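The affine gap formula can be worked through directly: opening a gap costs a once, and every gapped position, including the first, costs b. The sketch below evaluates -(a + bk) for a few gap lengths; `affine_gap_score` is an illustrative name, and the penalties a = 11 and b = 1 are assumed example values rather than figures from the slide.

```python
def affine_gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Score of one gap of the given length under affine costs: -(a + b*k)."""
    if length < 1:
        raise ValueError("a gap has length 1 or more")
    return -(gap_open + gap_extend * length)


# Assumed example penalties: a = 11 (open), b = 1 (extend).
for k in (1, 2, 5):
    print(k, affine_gap_score(k, gap_open=11, gap_extend=1))
```

With these values one gap of length 5 scores -16, while five separate gaps of length 1 score -60 in total, so the scheme favors a few long gaps over many short ones.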

NCBI Intermission