Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University.

Slides:

Advertisements

Similar presentations

Databases (“knowledge bases”) used in genome analysis

Advertisements

Beyond PubMed and BLAST: Exploring NCBI tools and databases Kate Bronstad David Flynn Alumni Medical Library.

Created as a part of NLM in 1988 Establish public databases Research in computational biology Develop software tools for sequence analysis Disseminate.

Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.

CS 177 Hands-on lab with databases Quiz #1 Summary: Nucleotide and protein databases Sequence formats Lab exercises Quiz #1 Summary: Nucleotide and protein.

NCBI Molecular Biology Resources

NCBI web resources I: databases and Entrez Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.

On line (DNA and amino acid) Sequence Information Lecture 7.

The National Center for Biotechnology Information (NCBI) a primary resource for molecular biology information Database Resources.

NCBI FieldGuide National Center for Biotechnology Information A Field Guide to GenBank and NCBI’s Molecular Biology Resources August 30, 2005 University.

NCBI Field Guide NCBI Molecular Biology Resources March 2007 NCBI Databases.

NCBI Molecular Biology Resources

Archives and Information Retrieval

Biological databases.

Sequence Analysis MUPGRET June workshops. Today What can you do with the sequence? What can you do with the ESTs? The case of SNP and Indel.

Phage? New Sequence Horizontal Transfer Molecular Evolution.

Copyright OpenHelix. No use or reproduction without express written consent1 Organization of genomic data… Genome backbone: base position number sequence.

Lecture 2.21 Retrieving Information: Using Entrez.

Biological Databases Chi-Cheng Lin, Ph.D. Associate Professor Department of Computer Science Winona State University – Rochester Center

Sequence Analysis. Today How to retrieve a DNA sequence? How to search for other related DNA sequences? How to search for its protein sequence? How to.

NCBI resources III: GEO and expression data analysis Yanbin Yin Fall

Bioinformatics for your classroom Seth Bordenstein Discover the Microbes Within! March 12, 2006 NCBI BLAST 1. No programming skills needed 2.Familiarity.

An Introduction to Bioinformatics Molecular Biology Databases.

Introductory Overview

Gene Expression Omnibus (GEO)

NCBI FieldGuide A Minimal Guide to NCBI Nucleotide Resources.

NCBI’s Bioinformatics Resources Michele R. Tennant, Ph.D., M.L.I.S. Health Science Center Libraries U.F. Genetics Institute January 2015.

Introduction to Bioinformatics CPSC 265. Interface of biology and computer science Analysis of proteins, genes and genomes using computer algorithms and.

NCBI FieldGuide NCBI Molecular Biology Resources July 8, 2004 University of São Paulo, Brazil “ Third Latin American Course on Bioinformatics for Tropical.

Genomics and Personalized Health Care Databases Bailee Ludwig Quality Management.

NCBI FieldGuide NCBI Molecular Biology Resources January 2008 Using Entrez.

Biological Databases By : Lim Yun Ping E mail :

Doug Raiford Lesson 3.  More and more sequence data is being generated every day  Useless if not made available to other researchers.

UCSC Genome Browser 1. The Progress 2 Database and Tool Explosion : 230 databases and tools 1996 : first annual compilation of databases and tools.

1 Review of Biological Database Utilization. 2 Biological Databases We will discuss: Usefulness to the bioinformaticist Database types Search methods.

Bioinformatics Overview, NCBI & GenBank JanPlan 2012.

Sequence Analysis (I) Yuh-Shan Jou ( 周玉山 ) Institute of Biomedical Sciences, Academia Sinica.

NCBI FieldGuide NCBI Molecular Biology Resources January 12, 2007 A Field Guide Part 1.

جلسه اول بیو انفورماتیک گردآوری:مسعود رسول آبادی

NCBI resources II: web-based tools and ftp resources Yanbin Yin Fall 2014 Most materials are downloaded from ftp://ftp.ncbi.nih.gov/pub/education/ 1.

Opportunities & Challenges in Applying IR Techniques to Bioinformatics ChengXiang (“Cheng”) Zhai Department of Computer Science Institute for Genomic Biology.

NCBI FieldGuide NCBI Molecular Biology Resources March 2007 Using Entrez.

NCBI Literature Databases: PubMed

Sequencing the World of Possibilities for Energy & Environment MGM workshop. 19 Oct 2010 Information Sources for Genomics Konstantinos Mavrommatis Genome.

Gene Expression Omnibus (GEO)

Computer Storage of Sequences

EBI is an Outstation of the European Molecular Biology Laboratory. EBI patent related services Jennifer McDowall Senior Scientist, EMBL-EBI 3 rd Annual.

Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.

A Field Guide to GenBank and NCBI Molecular Biology Resources

Applied Bioinformatics Week 9 Jens Allmer. Theory I Gene Expression Microarray.

Copyright OpenHelix. No use or reproduction without express written consent1.

Tools in Bioinformatics Genome Browsers. Retrieving genomic information Previous lesson(s): annotation-based perspective of search/data Today: genomic-based.

Copyright OpenHelix. No use or reproduction without express written consent1.

NCBI FieldGuide September 29, 2004 ICGEB NCBI Molecular Biology Resources A Field Guide part 1.

GENBANK FILE FORMAT LOCUS –LOCUS NAME Is usually the first letter of the genus and species name, followed by the accession number –SEQUENCE LENGTH Number.

Welcome to the combined BLAST and Genome Browser Tutorial.

NCBI FieldGuide NCBI Molecular Biology Resources A Field Guide part 2 (post intermission) September 30, 2004 ICGEB.

NCBI Molecular Biology Resources February 2007 Part 1.

Database resources of the National Center for Biotechnology The National Center for Biotechnology Information (NCBI) at the National Institutes of Health.

NCBI PubMed NCBI Literature Databases: PubMed Session #1, April 28, 2005 Session #2, April 29, 2005 Ho Chi Minh City, VietNam.

Keeping Current: Genetics Resources. This workshop will provide an overview of NCBI resources for finding-- Background information & journal articles.

Introduction to Genes and Genomes with Ensembl

Wolbachia Bioinformatics

Retrieving Information: Using Entrez

NCBI Molecular Biology Resources

Bioinformatics for your classroom

Archives and Information Retrieval

Gene Expression Omnibus (GEO)

SUBMITTED BY: DEEPTI SHARMA BIOLOGICAL DATABASE AND SEQUENCE ANALYSIS.

Presentation transcript:

Overview of Biological Databases (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 6, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Most slides are taken from NCBI field guide at the web site

The Central Dogma & Biological Data Protein structures -Experiments -Models (homologues) Literature information Original DNA Sequences (Genomes) Protein Sequences -Inferred -Direct sequencing Expressed DNA sequences ( = mRNA Sequences = cDNA sequences) Expressed Sequence Tags (ESTs)

Entrez Integrates Most of Them! Entrez Nucleotide PubMed Protein Taxonomy Structur e Domains3D Domains Journal s PMC OMIM Books PopSet SNP UniGene UniST S Genome Gene GEO GEO Datasets MeSH CancerChromosomes Homologen e

Outline NCBI & Entrez Major Biological Databases Using Entrez

Some background about Entrez…

The National Center for Biotechnology Information Created in 1988 as a part of the National Library of Medicine at NIH –Establish public databases –Research in computational biology –Develop software tools for sequence analysis –Disseminate biomedical information Bethesda,MD

Web Access:

Number of Users and Hits Per Day Christmas & New Year’s Days Currently averaging 10,000,000 to 50,000,000 hits per day!

Major Biological Databases

Entrez: Database Integration Hard Link Neighbors Related Structures 3 -D Structure VAST Neighbors Related Sequences NucleotideSequences BLAST Neighbors Related Sequences BLink Domains ProteinSequences BLAST Taxonomy Phylogeny PubMedAbstracts Word weight Related Articles Genome Gene OMIM Cancer Chromosome CDD 3D domain PubChem Books PMC OMIM SNP Genome Project HomoloGene UniGene GEO

Types of Databases Primary Databases –Original submissions by experimentalists –Content controlled by the submitter Examples: GenBank, SNP, GEO Derivative Databases –Built from primary data –Content controlled by third party (NCBI) Examples: Refseq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain

Primary vs. Derivative Sequence DatabasesGenBank SequencingCenters GA ATT C C GA ATT C C AT GA ATT C C GA ATT C C TTGACA ATTGACTA ACGTGC TTGACA CGTGA ATTGACTA TATAGCCG ACGTGC TTGACA CGTGA ATTGACTA TATAGCCG C ATT GA ATT C C GA ATT C C Labs Algorithms UniGene Curators RefSeq Genome Assembly TATAGCCG AGCTCCGATA CCGATGACAA Updated continually by NCBI Updated ONLY by submitters

Entrez Nucleotides Primary GenBank / EMBL / DDBJ 57,172,944 Derivative RefSeq 1,278,742 Third Party Annotation 4,653 PDB 5,973 Total 58,462,312

Entrez Protein: Derivative Databases GenPept 3,515,141 RefSeq 1,802,523 Third Party Annotation 4,217 Swiss Prot 189,324 PIR 222,232 PRF 12,079 PDB 68,621 Total 5,814,137 BLAST nr total 2,726,372

Database 1: GenBank NCBI’s Primary Sequence Database

What is GenBank? Nucleotide only sequence database Archival in nature –Historical –Reflective of submitter point of view (subjective) –Redundant GenBank Data –Direct submissions (traditional records) –Batch submissions (EST, GSS, STS) –ftp accounts (genome data) Three collaborating databases –GenBank –DNA Database of Japan (DDBJ) –European Molecular Biology Laboratory (EMBL) Database

EBI GenBank DDBJ EMBL EMBL Entrez SRS getentry NIG CIB NCBI NIH Submissions Updates Submissions Updates Submissions Updates International Sequence Database Collaboration

GenBank Divisions “Organismal” PRI (28) Primate ROD (15) Rodent PLN (13) Plant and Fungal BCT (11) Bacterial/Archeal INV (7) Invertebrate VRT (7) Other Vertebrate VRL (4) Viral MAM (2) Mammalian PHG (1) Phage SYN (1) Synthetic UNA (1) Unannotated “Functional” EST (377) Expressed Sequence Tag GSS (138) Genome Survey Sequence HTG (63) High Throughput Genomic PAT (17) Patent STS (9) Sequence Tagged Site CON (1) Contigs, virtual Organized by taxonomy (sort of) Direct submissions (Sequin/Bankit) Accurate (~1 error per 10,000 bp) Well characterized Organized by sequence type Batch submissions (ftp/ ) Inaccurate Poorly characterized

GenBank Functional (Bulk) Divisions GenBank EST STS GSS HTG Expressed Sequence Tag –1st pass single read cDNA Genome Survey Sequence –1st pass single read gDNA High Throughput Genomic –incomplete sequences of genomic clones Sequence Tagged Site –PCR-based mapping reagents Whole Genome Shotgun

EST Division: Expressed Sequence Tags RNA gene products nucleus 30,000 genes ,000 unique cDNA clones in library - isolate unique clones - sequence once from each end make cDNA library 5’ 3’ >IMAGE: ', mRNA sequence NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC >IMAGE: ' mRNA sequence GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

ESTs in Entrez Total 28 million records Human 6.0 million Mouse 4.3 million Rat 0.7 million Zebrafish 0.6 million Wheat0.6 million Barley0.3 million Maize0.4 million Total 28 million records Human 6.0 million Mouse 4.3 million Rat 0.7 million Zebrafish 0.6 million Wheat0.6 million Barley0.3 million Maize0.4 million

GSS, WGS, HTG shred Whole BAC insert (or genome) isolate clonessequence GSS division or trace archive Draft sequence ( HTG division ) assembly whole genome shotgun assemblies (traditional division)

HTG Example: Honeybee Draft Sequences Unfinished sequences of BACs Gaps and unordered pieces Finished sequences (Phase 3) move to traditional GenBank division Unfinished sequences of BACs Gaps and unordered pieces Finished sequences (Phase 3) move to traditional GenBank division LOCUS AC bp DNA linear HTG 19-MAR-2004 DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces. ACCESSION AC VERSION AC GI: KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT. LOCUS AC bp DNA linear HTG 19-MAR-2004 DEFINITION Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14 unordered pieces. ACCESSION AC VERSION AC GI: KEYWORDS HTG; HTGS_PHASE1; HTGS_DRAFT.

Sequence Records (millions) Total Base Pairs (billions) Sequence records Total base pairs Release 148: 45.2 million records 49.4 billion nucleotides Average doubling time ≈ 14 months ’83 ’84 ’85 ’86 ’87 ’88 ’89 ’90 ’91 ’92 ’93 ’94 ’95 ’96 ’97 ’98 ’99 ’00 ’01 ’02 ’03 ’04 ’05 ’

File Formats of the Sequence Databases Each sequence is represented by a text record called a flat file. GenBank/GenPept (useful for scientists) FASTA (the simplest format) ASN.1 & XML (useful for programmers)

A Traditional GenBank Record LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY VERSION AY GI: KEYWORDS. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: FEATURES Location/Qualifiers source /organism="Malus x domestica" /mol_type="mRNA" /cultivar="'Law Rome'" /db_xref="taxon:3750" /tissue_type="peel" gene /gene="AFS1" CDS /gene="AFS1" /note="terpene synthase" /codon_start=1 /product="(E,E)-alpha-farnesene synthase" /protein_id="AAO " /db_xref="GI: " /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI LSLLFQPLVN" ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga 241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt // Header Feature Table Sequence The Flatfile Format

LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY VERSION AY GI: KEYWORDS. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: The Header

LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY VERSION AY GI: KEYWORDS. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: Header: Locus Line LOCUS AY bp mRNA linear PLN 04-MAY-2004 Molecule type Division Modification Date Locus name Length

LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY VERSION AY GI: KEYWORDS. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: Header: Database Identifiers ACCESSION AY VERSION AY GI: ACCESSION AY VERSION AY GI: Accession Stable Reportable Universal Accession Stable Reportable Universal Version Tracks changes in sequence Version Tracks changes in sequence GI number NCBI internal use GI number NCBI internal use

LOCUS AY bp mRNA linear PLN 04-MAY-2004 DEFINITION Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds. ACCESSION AY VERSION AY GI: KEYWORDS. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. REFERENCE 1 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Cloning and functional expression of an (E,E)-alpha-farnesene synthase cDNA from peel tissue of apple fruit JOURNAL Planta 219, (2004) REFERENCE 2 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REFERENCE 3 (bases 1 to 1931) AUTHORS Pechous,S.W. and Whitaker,B.D. TITLE Direct Submission JOURNAL Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab, USDA-ARS, Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD 20705, USA REMARK Sequence update by submitter COMMENT On Jun 26, 2003 this sequence version replaced gi: Header: Organism SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. SOURCE Malus x domestica (cultivated apple) ORGANISM Malus x domestica Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus. NCBI-controlled taxonomy

FEATURES Location/Qualifiers source /organism="Malus x domestica" /mol_type="mRNA" /cultivar="'Law Rome'" /db_xref="taxon:3750" /tissue_type="peel" gene /gene="AFS1" CDS /gene="AFS1" /note="terpene synthase" /codon_start=1 /product="(E,E)-alpha-farnesene synthase" /protein_id="AAO " /db_xref="GI: " /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI LSLLFQPLVN" The Feature Table Coding sequence start (atg) stop (tag) Implied protein Implied protein GenPept Identifiers

The Sequence: 99.99% Accurate ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga ORIGIN 1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat 61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg 121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt 181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga 1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga 1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt 1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa 1921 aaaaaaaaaa a // 1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga 1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt 1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa 1921 aaaaaaaaaa a //

GenPept: FASTA format >gi| |gb|AAO | (E,E)-alpha-farnesene synthase [Malus x domestica] MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDGDEYRKLSEKLIE EVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQH GYKVSQDIFGRFMDEKGTLENHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSN LSRDVVHSLELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWWANLG IADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGSEEELKHFTNAVDRWDS RETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLTKVWADFCKALLVEAEWYNKSHIPTLEEY LRNGCISSSVSVLLVHSFFSITHEGTKEMADFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIV CYMREVNASEETARKNIKGMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEK GPRTHILSLLFQPLVN >gi| |gb|AAP | putative doublecortin domain-containing protein MAKTGAEDHREALSQSSLSLLTEAMEVLQQSSPEGTLDGNTVNPIYKYILNDLPREFMSSQAKAVIKTTD DYLQSQFGPNRLVHSAAVSEGSGLQDCSTHQTASDHSHDEISDLDSYKSNSKNNSCSISASKRNRPVSAP VGQLRVAEFSSLKFQSARNWQKLSQRHKLQPRVIKVTAYKNGSRTVFARVTAPTITLLLEECTEKLNLNM AARRVFLADGKEALEPEDIPHEADVYVSTGEPFLNPFKKIKDHLLLIKKVTWTMNGLMLPTDIKRRKTKP VLSIRMKKLTERTSVRILFFKNGMGQDGHEITVGKETMKKVLDTCTIRMNLNLPARYFYDLYGRKIEDIS KGKH

Seq-entry ::= set { class nuc-prot, descr { title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA, complete cds.", source { org { taxname "Malus x domestica", common "cultivated apple", db { { db "taxon", tag id 3750 } }, orgname { name binomial { genus "Malus", species "x domestica" }, mod { { subtype cultivar, subname "'Law Rome'" }, { subtype old-name, subname "Malus domestica", attrib "(10)cultivar='Law Rome'" } }, lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta; Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots; rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus", gcode 1,, Abstract Syntax Notation: ASN.1 FASTA Nucleotide FASTA Nucleotide FASTA Protein FASTA Protein GenPept GenBank ASN.1

Database 2: RefSeq NCBI’s Derivative Sequence Database

What is RefSeq? Curated transcripts and proteins (NM_, NP_) –reviewed –human, mouse, rat, fruit fly, zebrafish, arabidopsis microbial genomes (proteins), and more Model transcripts and proteins (XM_, XP_) Assembled Genomic Regions (contigs) (NT_, NW_) –human genome –mouse genome –rat genome Chromosome records (NC_) –Human genome –microbial –organelle ftp://ftp.ncbi.nih.gov/refseq/release / srcdb_refseq[Properties]

RefSeq Benefits non-redundancy explicitly linked nucleotide and protein sequences updates to reflect current sequence data and biology data validation format consistency distinct accession series stewardship by NCBI staff and collaborators

Curated genomic DNA (NC, NT, NW) Curated Model mRNA (XM) (XR) Curated mRNA (NM) (NR) Model protein (XP) RefSeq Curation Processes Protein (NP) Scanning....

RefSeq Accession Numbers mRNAs and Proteins NM_ Curated mRNA NP_ Curated Protein NR_ Curated non-coding RNA XM_ Predicted mRNA XP_ Predicted Protein XR_ Predicted non-coding RNA Gene Records NG_ Reference Genomic Sequence Chromosome NC_ Microbial replicons, organelle, viral genomes, human chromosomes Assemblies NT_ Contig NW_ WGS Supercontig

From GenBank to RefSeq

NM_000121: Sequence Revision History

Database 3: UniGene NCBI’s Derivative EST Database

UniGene Records are clusters of mRNAs and ESTs that ideally represent single genes Records are created automatically by a modified BLAST algorithm UniGene provides a means to identify an EST or unannotated mRNA Clustering Expressed Sequences

Gene-oriented clusters of expressed sequences Automatic clustering using MegaBlast Each cluster represents a unique gene Informed by genome hits Information on tissue types and map locations Useful for gene discovery and selection of mapping reagents UniGene unique gene

A Cluster of ESTs query 5’ EST hits 3’ EST hits

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177 (Now at Build #186)

UniGene Cluster Hs SELECTED PROTEIN SIMILARITES

UniGene Cluster Hs GENE EXPRESSION

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs

Download sequences web page ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Database 4: MMDB NCBI’s derivative protein structure database

Indexing into MMDB Structure id 1, name "helix 1", type helix, location subgraph residues interval { { molecule-id 1, from 49, to 61 } } }, Add secondary structure inter-residue-bonds { { atom-id-1 { molecule-id 1, residue-id 1, atom-id 1 }, atom-id-2 { molecule-id 1, residue-id 2, atom-id 9 } }, Add chemical bonds Import only experimentally determined structures Convert to ASN.1 Verify sequences Create “backbone” model (Cα, P only) Create single-conformer model MMDB Molecular Modeling Data Base

Structure Summary Cn3D viewer Conserved Domains 3D Domain Neighbors Structure Neighbors

Cn3D 4.1: C-Src

Cn3D 4.1: Structural Alignment Casein kinase S. pombe Src Kinase H. sapiens Conserved ATP binding site

Cn3D: Simple Homology Modeling human swordtail

NCBI CD: Tyrosine Kinase

Using Cn3D to model domains

Submitting a PDB File to VAST Choose the file format Remove all lines except ATOM This is the best way to convert PDB files to MMDB format for viewing with Cn3D!

Database 5: GEO NCBI’s Gene Expression Omnibus

GPL Platform descriptions GSM Raw/processed spot intensities from a single slide/chip GSE Grouping of slide/chip data “a single experiment” GDS Grouping of experiments Curated by NCBI Submitted by Experimentalists Submitted by Manufacturer* Entrez GEO Entrez GEO Datasets G EO S a M ple : experimental conditions G EO SE ries : set of related samples

What’s a DataSet? Platform (GPL) array definition Sample (GSM) hyb. measurements Series (GSE) related Samples Supplied by submitter DataSet (GDS) A collection of experimentally-related samples processed using the same platform. Samples within DataSets are organized into subgroups based on experimental variables. Form the basis of GEO’s query, analysis and data display tools. Assembled by GEO staff

Gene Expression Omnibus Dataset browser

GEO Dataset Browser

GEO Dataset Report

GEO Profiles … of 12625

Database 6: CDD NCBI’s Derivative Conserved Domain Database

Entrez CDD

Conserved Domain Database Multiple sequence alignments Position-specific scoring matrices (PSSM) Sources SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments) Multiple sequence alignments Position-specific scoring matrices (PSSM) Sources SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

CDD >gi| |gb|AAS | ATP7A [Solenodon paradoxus] IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDP LLTSPEILRE

CDD CD Pfam COG Click on a colored bar to align your sequence to the CD

Conserved Domain Database: cd , HMA

CDD

CDART: Conserved Domain Architecture Retrieval Tool

Database 7: NCBI Genome Map

Viewing Complex Genomes Map Viewer Home Page Shows all supported organisms Provides links to genomic BLAST –Genome Overview Page Provides links to individual chromosomes Shows hits on a genome graphically –Chromosome Viewing Page Allows interactive views of annotation details Provides numerous maps unique to each genome NCBI Map Viewer

The Map Viewer Genome BLAST

Map Viewer: Human MLH1 Customizable NCBI Assembly EST Hits Gene Annotations Models Transcripts

Maps and Options

Mapped Variations

MLH1 Synteny: Mammalian Genomes

Many Other NCBI Databases…

Other Specialized Databases Gene Symbol Database ( HUGO Gene Nomenclature )HUGO Gene Nomenclature KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway KEGG EPD (Eukaryote Promoter Database) EPD Transcription Factor Database ( TRANSFAC )TRANSFAC Many organism-specific databases (e.g., Flybase, Beebase)Flybase Beebase …

Access Databases through Entrez

Accessing the Data in Entrez Web Tools –Batch Entrez Upload a file of GI or accession numbers to retrieve sequences –Batch Citation Matcher Send citation information to Entrez and retrieve PubMed IDs for linking, citation display or other applications –Advanced Entrez Searching Advanced searching techniques for Web Entrez –My NCBI Includes automatic ing of search updates and filters for search results Requires a username and password to access stored searches Programming Tools –E-Utilities Run Entrez queries and download data from your own scripts over the Web –Linking to Entrez Link to specific Entrez pages from your own web pages or applications –Entrez Client/Server C language library for embedding Entrez calls into your programs

Entrez: Web Access Default search: Against all databases in Entrez Interface: Global Entrez Target database: Adjustable using the pull-down menu Default search: Against all databases in Entrez Interface: Global Entrez Target database: Adjustable using the pull-down menu

/************************************************************************ * * asn2ff.c * convert an ASN.1 entry to flat file format, using the FFPrintArray. * **************************************************************************/ #include #include "asn2ff.h" #include "asn2ffp.h" #include "ffprint.h" #include #ifdef ENABLE_ID1 #include #endif FILE *fpl; Args myargs[] = { {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL}, {"Input is a Seq-entry","F", NULL,NULL,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL}, {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL}, {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL}, {"Show Sequence?","T", NULL,NULL,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL}, Toolbox Sources ftp> open ftp.ncbi.nih.gov. ftp> cd toolbox ftp> cd ncbi_tools ftp://ftp.ncbi.nlm.gov/toolbox/ncbi_tools NCBI Toolbox

Challenges in Bioinformatics Hard Link Neighbors Related Structures 3 -D Structure VAST Neighbors Related Sequences NucleotideSequences BLAST Neighbors Related Sequences BLink Domains ProteinSequences BLAST Taxonomy Phylogeny PubMedAbstracts Word weight Related Articles Genome Gene OMIM Cancer Chromosome CDD 3D domain PubChem Books PMC OMIM SNP Genome Project HomoloGene UniGene GEO How can we help biologists manage and exploit all such rapid growing, heterogeneous, and inaccurate information both efficiently and effectively?