Sequence Databases – 21 June 2007


Learning objectives
- Be able to describe how information is stored in GenBank.
- Be able to read a GenBank flat file.
- Be able to search GenBank for information.
- Be able to explain the difference in content between the header, features, and sequence sections.
- Be able to state what distinguishes a primary database from a secondary database.
- Be able to access and navigate the Entrez platform for biological data analysis.

BIOSEQs: the entry common to all sequence databases
BIOSEQ = biological sequence
- Central element in the NCBI database model.
- Found in both the nucleotide and protein databases.
- Comprises the sequence of a single continuous molecule of nucleic acid or protein.
An entry must have:
- At least one sequence identifier (Seq-id)
- Information on the physical type of molecule (DNA, RNA, or protein)
- Descriptors, which describe the entire Bioseq
- Annotations, which provide information regarding specific locations within the Bioseq

What is GenBank?
- The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.
- Each record represents a single contiguous stretch of DNA or RNA.
- DNA stretches may have more than one coding region (gene).
- RNA sequences are presented with T, not U.
- Records are generated from direct submissions to the DNA sequence databases by the investigators (authors).
- GenBank is part of the International Nucleotide Sequence Database Collaboration.

General Comments on the GenBank Flat File (GBFF)
Three sections:
1) Header: information about the whole record.
2) Features: description of annotations, each represented by a key.
3) Nucleotide sequence: the record ends with // on its last line.
A nucleic acid (DNA or RNA (cDNA)) sequence translated to an amino acid sequence is a "feature".
GenBank Flat File (MyoD1 as an example)
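The three-section layout can be illustrated with a minimal Python sketch. It assumes the header runs until a line starting with FEATURES, the sequence section starts at BASE COUNT or ORIGIN, and the record ends at //. The function name and the toy record are hypothetical and do not come from a real GenBank entry.

```python
def split_genbank_record(text):
    """Split one flat-file record into header, features, and sequence lines."""
    sections = {"header": [], "features": [], "sequence": []}
    current = "header"
    for line in text.splitlines():
        if line.startswith("//"):            # end-of-record marker
            break
        if line.startswith("FEATURES"):
            current = "features"
        elif line.startswith(("BASE COUNT", "ORIGIN")):
            current = "sequence"
        sections[current].append(line)
    return sections


# Hypothetical toy record, abbreviated for illustration (not a real entry).
toy_record = """LOCUS       TOY0001   12 bp    DNA
DEFINITION  Toy record for illustration.
FEATURES             Location/Qualifiers
     CDS             1..12
ORIGIN
        1 atggcatgct aa
//"""

parts = split_genbank_record(toy_record)
for name, lines in parts.items():
    print(name, len(lines))
```

This only separates the sections. Real records contain further keywords and line-continuation rules that a full parser would have to handle.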

Feature Keys
Purpose:
1) Indicates the biological nature of the sequence
2) Supplies information about changes to sequences

Feature key     Description
conflict        Separate determinations of the same sequence differ
rep_origin      Origin of replication
protein_bind    Protein binding site on DNA
CDS             Protein coding sequence

Feature Keys: Terminology
Feature key     Location/Qualifiers
CDS             23..400
                /product="alcohol dehydro."
                /gene="adhI"
The feature CDS is a coding sequence beginning at base 23 and ending at base 400. It has a product called "alcohol dehydrogenase" and corresponds to the gene called "adhI".
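As a rough illustration of the key / location / qualifier structure, the sketch below parses the alcohol dehydrogenase example into its three parts. The helper names `parse_feature` and `simple_span` are hypothetical. The code assumes the location contains no spaces and that each qualifier sits on its own line.

```python
import re


def parse_feature(lines):
    """Parse one feature block into (key, location, qualifiers)."""
    key, location = lines[0].split()          # assumes no spaces in the location
    qualifiers = {}
    for line in lines[1:]:
        match = re.match(r'\s*/(\w+)(?:=(.*))?$', line)
        if match:
            name, value = match.groups()
            # Qualifiers without a value (such as /partial) are stored as True.
            qualifiers[name] = value.strip('"') if value is not None else True
    return key, location, qualifiers


def simple_span(location):
    """Return (start, end) for a plain 'start..end' location, else None."""
    match = re.fullmatch(r'(\d+)\.\.(\d+)', location)
    return (int(match.group(1)), int(match.group(2))) if match else None


feature_lines = [
    'CDS             23..400',
    '                /product="alcohol dehydrogenase"',
    '                /gene="adhI"',
]
key, location, qualifiers = parse_feature(feature_lines)
print(key, simple_span(location), qualifiers)
```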

Feature Keys: Terminology (cont.)
Feature key     Location/Qualifiers
CDS             join ( , )
                /product="T-cell recep. B-ch."
                /partial
The feature CDS is a partial coding sequence formed by joining the indicated elements to form one contiguous sequence encoding a product called T-cell receptor beta-chain.
(For MyoD1: accession number X61655)
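A joined location can be read as a list of spans that are concatenated in order. The sketch below is one way to do that under simplifying assumptions: only plain `start..end` segments inside an optional `join(...)`, with the `<` and `>` symbols that the feature table notation uses for ends lying beyond the sequence. The function name and the coordinates in the example are hypothetical and are not the ones from the slide.

```python
import re


def parse_location(location):
    """Parse 'a..b' or 'join(a..b,c..d)' into spans plus partial-end flags."""
    inner = location
    if location.startswith("join(") and location.endswith(")"):
        inner = location[len("join("):-1]
    spans = []
    for part in inner.split(","):
        match = re.fullmatch(r'\s*<?(\d+)\.\.>?(\d+)\s*', part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        spans.append((int(match.group(1)), int(match.group(2))))
    return {
        "spans": spans,
        "partial_start": "<" in location,   # feature starts before the first listed base
        "partial_end": ">" in location,     # feature continues past the last listed base
    }


# Hypothetical coordinates, chosen only to exercise the parser.
parsed = parse_location("join(<10..50,80..>120)")
joined_length = sum(end - start + 1 for start, end in parsed["spans"])
print(parsed, joined_length)
```

Complement strands, nested operators, and references to other records are deliberately left out of this sketch.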

Record from GenBank
LOCUS       SCU bp DNA PLN 21-JUN-1999
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds, and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U GI:
KEYWORDS    .
SOURCE      baker's yeast.
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Ascomycota; Hemiascomycetes; Saccharomycetales;
            Saccharomycetaceae; Saccharomyces.

Field annotations:
- LOCUS: locus name, GenBank division (PLN: plant, fungal and algal), and modification date.
- DEFINITION: "cds" marks a coding region.
- ACCESSION: unique identifier (never changes).
- VERSION: nucleotide sequence identifier in accession.version form (changes when there is a change in sequence).
- GI: GeneInfo identifier (changes whenever there is a change).
- KEYWORDS: word or phrase describing the sequence (not based on controlled vocabulary). Not used in newer records.
- SOURCE: common name for the organism.
- ORGANISM: formal scientific name for the source organism and its lineage, based on the NCBI Taxonomy Database.

Record from GenBank (cont. 1)
REFERENCE   1 (bases 1 to 5028)
  AUTHORS   Torpey,L.E., Gibbs,P.E., Nelson,J. and Lawrence,C.W.
  TITLE     Cloning and sequence of REV7, a gene whose function is required for
            DNA damage-induced mutagenesis in Saccharomyces cerevisiae
  JOURNAL   Yeast 10 (11), (1994)
  MEDLINE
REFERENCE   2 (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), (1996)
  MEDLINE
REFERENCE   3 (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Terry Roemer, Biology, Yale University, New
            Haven, CT, USA

Field annotations:
- MEDLINE: Medline UID.
- REFERENCE 3: submitter of the sequence (always the last reference).

Record from GenBank (cont. 2)
FEATURES             Location/Qualifiers
     source
                     /organism="Saccharomyces cerevisiae"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
                     /map="9"
     CDS             <
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"

The 5' end of the coding sequence begins upstream of the first nucleotide of the sequence. The 3' end is complete.
Each feature has three parts: a key (a keyword that indicates the functional group), a location (instructions for finding the feature), and qualifiers (auxiliary information about the feature).

Field annotations:
- The columns are Keys, Location, and Qualifiers; each qualifier has a value.
- Descriptive free text must be in quotation marks.
- "<" in the location: only a partial sequence.
- /codon_start: start of the open reading frame.
- /db_xref: database cross-references.
- /protein_id: protein sequence ID number.

Record from GenBank (cont. 3)
     gene
                     /gene="AXL2"
     CDS
                     /gene="AXL2"
                     /note="plasma membrane glycoprotein"
                     /codon_start=1
                     /function="required for axial budding pattern of S.
                     cerevisiae"
                     /product="Axl2p"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="MTQLQISLLLTATISLLHLVVATPYEAYPIGKQYPPVARVN..."
     gene            complement( )
                     /gene="REV7"
     CDS             complement( )
                     /gene="REV7"
                     /codon_start=1
                     /product="Rev7p"
                     /protein_id="AAA "
                     /db_xref="GI: "
                     /translation="MNRWVEKWLRVYLKCYINLILFYRNVYPPQSFDYTTYQSFNLPQ..."

Field annotations:
- "..." in /translation: the sequence is cut off on the slide.
- complement( ): another kind of location.

Record from GenBank (cont. 4)
BASE COUNT     1510 a   1074 c    835 g   1609 t
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
...
//
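The sequence lines carry a position number followed by groups of ten bases, so recovering the raw sequence and tallying its bases is straightforward. The sketch below does this for the two lines shown on the slide. Because the slide shows only the first 120 bases, the tallies will not match the BASE COUNT line of the full record. The function name `read_origin` is hypothetical.

```python
from collections import Counter


def read_origin(lines):
    """Join ORIGIN sequence lines into one string, dropping positions and spaces."""
    chunks = []
    for line in lines:
        fields = line.split()
        # The first field is the position of the first base on the line;
        # the remaining fields are the groups of bases.
        if fields and fields[0].isdigit():
            chunks.extend(fields[1:])
    return "".join(chunks)


origin_lines = [
    "        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg",
    "       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct",
]
sequence = read_origin(origin_lines)
base_count = Counter(sequence)
print(len(sequence), dict(base_count))
```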

Primary databases vs. secondary databases
A primary database:
- comprises information submitted directly by the experimenter.
- is called an archival database.
A secondary database:
- comprises information derived from a primary database.
- is a curated database.

Types of primary databases carrying biological information
- GenBank/EMBL/DDBJ
- PDB: three-dimensional structure coordinates of biological molecules
- PROSITE: database of protein domain/function relationships

Types of secondary databases carrying biological information
- dbSTS: non-redundant database of sequence-tagged sites (useful for physical mapping)
- Genome databases (there are over 20 genome databases that can be searched)
- EPD: eukaryotic promoter database
- NR: non-redundant GenBank+EMBL+DDBJ+PDB. Entries with 100% sequence identity are merged as one.
- ProDom
- PRINTS
- BLOCKS
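The merging rule for NR (entries with 100% sequence identity become one entry) can be illustrated by grouping identifiers under their exact sequence. This is only a toy sketch of the idea and not a description of how NCBI actually builds the NR database. The function name, identifiers, and sequences are hypothetical.

```python
from collections import defaultdict


def merge_identical(entries):
    """Group (identifier, sequence) pairs so identical sequences share one entry."""
    merged = defaultdict(list)
    for identifier, sequence in entries:
        # Case is normalized so the same residues always compare equal.
        merged[sequence.upper()].append(identifier)
    return dict(merged)


# Hypothetical identifiers and sequences, for illustration only.
entries = [
    ("dbA|0001", "MKTAYIAK"),
    ("dbB|X9", "mktayiak"),     # same sequence, different source database
    ("dbC|77", "MKTAYIAR"),     # differs in the last residue, so kept separate
]
nr = merge_identical(entries)
for sequence, identifiers in nr.items():
    print(sequence, identifiers)
```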

Secondary Databases
(Diagram labels: DNA, RNA, cDNA, protein)
DNA databases derived from GenBank containing data for a single gene:
- Non-redundant (nr)
- dbGSS (genome survey sequences)
- dbHTGS (high throughput)
- dbSTS (sequence tagged site)
- LocusLink
RNA (cDNA) databases derived from GenBank containing data for a single gene:
- dbEST (expressed sequence tag)
- UniGene
- LocusLink
Protein databases derived from GenBank containing data for a single gene:
- Non-redundant (nr)
- Swissprot
- PIR (Int'l. protein sequence)
- LocusLink

References for understanding the NCBI sequence database model
Here is the website for NCBI developer tools. KDOCS/INDEX.HTML

(Diagram: DNA → RNA → PROTEIN, with the labels "RNA processing", "Mature mRNA", and "RNA, but NOT mRNA")