Chapter 3. THE GENBANK SEQUENCE DATABASE

Introduction
GenBank, the National Institutes of Health (NIH) genetic sequence database, is an annotated collection of all publicly available nucleotide and protein sequences. GenBank is built by the National Center for Biotechnology Information (NCBI) and is part of the International Nucleotide Sequence Database Collaboration, together with the DNA Data Bank of Japan (DDBJ) and the European Molecular Biology Laboratory (EMBL).
Historically, the protein databases preceded the nucleotide databases.
1960s: Dayhoff, Atlas of Protein Sequence and Structure (1965)
1982: DNA sequence databases (EMBL, GenBank, DDBJ)
1988: International Nucleotide Sequence Database Collaboration

Introduction
MAGEST: ESTs and Gene Expression Pattern Database for Halocynthia roretzi Maternal cDNA

Introduction
NGIC: National Genome Information Center

Primary and secondary databases
There is an important distinction between primary (archival) and secondary (curated) databases.
The primary databases represent experimental results but are not a curated review; curated reviews are found in the secondary databases.
GenBank nucleotide sequence records are derived from the sequencing of a biological molecule that exists in a test tube.
Secondary databases are derived from the results held in the primary databases.

Format vs. Content: Computers vs. Humans
A DNA sequence record can be represented as a string of nucleotides with some tag or identifier.
FASTA (Pearson format)
>NM_007348
ttttgtccgc ctgccgccgc cgtcccagat attaatcacg gagttccagg gagaaggaac ttgtgaaatg ggggagccgg ctggggttgc cggcaccatg gagtcacctt ttagcccggg actctttcac aggctggatg aagattggga ttctgctctc tttgctgaac tcggttattt cacagacact gatgagctgc aattggaagc agcaaatgag acgtatgaaa acaattttga taatcttgat tttgatttgg atttgatgcc ttgggagtca gacatttggg acatcaacaa ccaaatctgt acagttaaag atattaaggc agaacctcag ccactttctc cagcctcctc aagttattca gtctcgtctc ctcggtcagt ggactcttat tcttcaactc agcatgttcc
>NP_031374
MGEPAGVAGTMESPFSPGLFHRLDEDWDSALFAELGYFTDTDELQLEAANETYENNFDNLDFDLDLMPWESDIWDINNQICTVKDIKAEPQPLSPASSSYSVSSPRSVDSYSSTQHVPEELDLSSSSQMSPLSLYGENSNSLSSAEPLKEDKPVTGPRNKTENGLTPKKKIQVNSKPSIQPKPLLLPAAPKTQTNSSVPAKTIIIQTVPTLMPLAKQQPIISLQPAPTKGQTVLLSQPTVVQLQAPGVLPSAQPVLAVAGGVTQLPNHVVNVVPAPSANSPVNGKLSVTKPVLQSTMRNVGSDIAVLRRQQRMIKNRESACQSRKKKKEYMLGLEARLKAALSENEQLKKENGTLKRQLDEVVSENQRLKVPSPKRRVVCVMIVLAFIILNYGPMSMLEQDSRRMNPSVSPANQRRHLLGFSAKEAQDTSDGIIQKNSYRYDHSVSNDKALMVLTEEPLLYIPPPPCQPLINTTESLRLNHELRGWVHRHEVERTKSRRMTNNQQKTRILQGALEQGSNSQLMAVQYTETTSSISRNSGSELQVYYASPRSYQDFFEAIRRRGDTFYVVSFRRDHLLLPATTHNKTTRPKMSIVLPAININENVINGQDYEVMMQIDCQVMDTR
ILHIKSSSVPPYLRDQQRNQTNTFFGSPPAATEATHVVSTIPESLQ
>gi|56786156| Homo sapiens activating transcription factor 6 (ATF6) gene, complete cds.
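The FASTA layout shown on this slide (a ">" line carrying the identifier, followed by sequence lines) is simple enough to read with a few lines of code. The sketch below is one possible minimal reader, not an official parser; the function names `parse_fasta` and `_finish` are made up for illustration, and the example input is a shortened excerpt of the two records above.

```python
def _finish(header, chunks):
    # The identifier is everything up to the first space; the rest is free text.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append(_finish(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Spaces inside sequence lines are layout only, so drop them.
            chunks.append("".join(line.split()))
    if header is not None:
        records.append(_finish(header, chunks))
    return records


# Shortened excerpt of the records shown on the slide.
example = """>NM_007348
ttttgtccgc ctgccgccgc cgtcccagat
attaatcacg gagttccagg
>NP_031374
MGEPAGVAGTMESPFSPGLFHRLDEDWDSALFAELGYFTD
"""

records = parse_fasta(example)
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

The tag is all a program needs to find a record; everything a human reader would want to know about the biology has to come from a richer format such as the GenBank flatfile.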

The database
There are three important consequences of not having the correct or proper information on the nucleotide record.
1. If a coding sequence is not indicated on a nucleic acid record, it will not be represented in the protein databases.
2. The set of features usable on the nucleotide feature table that are specific to protein sequences themselves is limited.
3. If a coding feature on a nucleotide record contains incorrect information about the protein, this could be propagated to other records in both the nucleotide and protein databases on the basis of sequence similarity.

The GenBank flatfile: a dissection
The GenBank flatfile (GBFF) is the elementary unit of information in the GenBank database. It is one of the most commonly used formats in the representation of biological sequences.
The GBFF can be separated into three parts: the header, the features, and the nucleotide sequence.
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=56786156
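The three-part structure can be made concrete with a small sketch. It assumes the usual flatfile layout in which the feature table begins at a line starting with FEATURES, the sequence begins at a line starting with ORIGIN, and the record ends with "//". The record below is a made-up miniature (the locus name EXAMPLE01 and accession X00000 are hypothetical), and the helper names are illustrative only.

```python
def split_flatfile(record):
    """Split one flatfile record into (header, features, sequence) line lists."""
    header, features, sequence = [], [], []
    current = header
    for line in record.splitlines():
        if line.startswith("FEATURES"):
            current = features      # the feature table starts here
        elif line.startswith("ORIGIN"):
            current = sequence      # the sequence block starts here
        elif line.startswith("//"):
            break                   # end-of-record marker
        current.append(line)
    return header, features, sequence


def sequence_string(sequence_lines):
    # Skip the ORIGIN line itself, then keep letters only,
    # which drops the position numbers and the spacing.
    return "".join(ch for line in sequence_lines[1:] for ch in line if ch.isalpha())


# Hypothetical miniature record, for illustration only.
EXAMPLE_RECORD = """\
LOCUS       EXAMPLE01                 24 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   X00000
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="synthetic construct"
ORIGIN
        1 atggcctaaa ccgggtttat gtaa
//
"""

header, features, sequence = split_flatfile(EXAMPLE_RECORD)
print(len(header), len(features), sequence_string(sequence))
```

Keeping the keyword lines (FEATURES, ORIGIN) inside their own part makes each part readable on its own; a real parser would also need to handle multi-record files.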

The Header

The Header (locus)
▣ Locus name
1. This element was historically used to represent the locus that was the subject of the record.
2. All letters are uppercase.
3. Most DNA sequence records represented only one genetic locus, e.g. HUMHBB: human β-globin locus; SV40: simian virus 40.
4. An accession number is now used as the locus name to ensure uniqueness; the name cannot exceed 10 characters.
▣ Sequence length
1. Sequences can range from 1 to 350,000 base pairs.
2. Sequences shorter than 50 bp are seldom accepted; submission of primer sequences is discouraged.
3. Records of greater than 350 kb are acceptable in the database if the sequence represents a single gene.

The Header (locus)
▣ Molecule type
1. The “mol type” usually is DNA or RNA.
2. The acceptable mol types are DNA, RNA, tRNA, rRNA, mRNA, and uRNA.
3. If the tRNA or rRNA has been sequenced directly or via some cDNA intermediate, then tRNA or rRNA is shown as the mol type.
4. If an rRNA gene sequence was obtained via PCR from genomic DNA, then DNA is the mol type.
▣ GenBank division code
1. The code has three letters and is used for taxonomic inference or other classification purposes.
2. The codes recall the time when the various GenBank divisions were used to break up the database files into what was then a more manageable size.
3. New function-based divisions represent functional and definable sequence types.

The Header (locus)
▣ GenBank division code
EST (Expressed Sequence Tags): contains "single-pass" cDNA sequences from a number of organisms.

The Header (locus)
▣ GenBank division code
GSS (Genome Survey Sequences): similar to the EST division, except that most of the sequences are genomic in origin rather than cDNA (mRNA).
① random "single pass read" genome survey sequences
② cosmid/BAC/YAC end sequences
③ exon trapped genomic sequences
④ transposon-tagged sequences

The Header (locus)
▣ GenBank division code
STS (Sequence Tagged Sites): contains sequence and mapping data on short genomic landmark sequences, or Sequence Tagged Sites.

The Header (locus)
▣ GenBank division code
CON (contigged): In shotgun DNA sequencing projects, a contig (from contiguous) is a set of overlapping DNA segments derived from a single genetic source.
▣ Date
1. The date is the date the record was last made public.
2. It should be noted that none of these dates is legally binding on the promulgating organization.
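The elements of the locus line described on the preceding slides (locus name, sequence length, mol type, division code, date) can be pulled apart with a whitespace split. The sketch below is an illustration under stated assumptions, not a full parser: it assumes the tokens appear in the order name, length, "bp", mol type, an optional topology word, division, date, and the example line is hypothetical. The dictionary of functional divisions covers only the four codes named in these slides.

```python
from datetime import date

MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Only the function-based divisions mentioned in these slides.
FUNCTIONAL_DIVISIONS = {
    "EST": "expressed sequence tags",
    "GSS": "genome survey sequences",
    "STS": "sequence tagged sites",
    "CON": "contigged records",
}


def parse_locus(line):
    """Return the main elements of a LOCUS line as a dictionary."""
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS":
        raise ValueError("not a LOCUS line: %r" % line)
    day, month, year = tokens[-1].split("-")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),        # tokens[3] is the unit, "bp"
        "mol_type": tokens[4],
        "division": tokens[-2],          # three-letter division code
        "date": date(int(year), MONTHS[month.upper()], int(day)),
    }


# Hypothetical LOCUS line, for illustration only.
info = parse_locus("LOCUS       EXAMPLE01    1200 bp    mRNA    linear   EST 15-MAR-2001")
print(info["name"], info["length"], info["mol_type"], info["division"], info["date"])
print(FUNCTIONAL_DIVISIONS.get(info["division"], "not a function-based division"))
```

Reading the division and date from the end of the line keeps the sketch working whether or not a topology word is present.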

The Header (definition)
The definition line is the line in the GenBank record that attempts to summarize the biology of the record.
▣ mRNA record
Genus species product name (gene symbol) mRNA, complete cds.
▣ Genomic record
Genus species product name (gene symbol) gene, complete cds.
▣ Organelle sequences
DEFINITION Genus species protein X (xxx) gene, complete cds;
DEFINITION Genus species XXS ribosomal RNA gene, complete sequence;
each followed by one of:
nuclear gene(s) for mitochondrial product(s)
nuclear gene(s) for chloroplast product(s)
mitochondrial gene(s) for mitochondrial product(s)
chloroplast gene(s) for chloroplast product(s)

The Header (accession)
1. The accession number is the primary key for referencing a given record in the database.
2. This is the number that is cited in publications and is always associated with the record.
3. If the sequence is updated, the accession number will not change.
4. Format: “1 + 5” is one uppercase letter followed by five digits; “2 + 6” is two letters followed by six digits.
▣ VERSION
1. The version line contains the Accession.version and the gi. These identifiers are associated with a unique nucleotide sequence.
2. If the sequence changes, the version number in the Accession.version will be incremented by one and the gi will change.
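The two accession formats and the Accession.version idea translate directly into a pattern check. The sketch below is illustrative: it assumes the letters are uppercase in both formats and that the version line has the shape "VERSION  accession.version  GI:number"; the accession and gi values in the example are hypothetical.

```python
import re

# "1 + 5": one letter and five digits; "2 + 6": two letters and six digits.
# Uppercase letters are assumed for both formats.
NUCLEOTIDE_ACCESSION = re.compile(r"[A-Z]\d{5}|[A-Z]{2}\d{6}")


def is_nucleotide_accession(text):
    return NUCLEOTIDE_ACCESSION.fullmatch(text) is not None


def parse_version_line(line):
    """Return (accession, version, gi) from a VERSION line; gi may be None."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "VERSION":
        raise ValueError("not a VERSION line: %r" % line)
    accession, _, version = tokens[1].partition(".")
    gi = None
    for token in tokens[2:]:
        if token.upper().startswith("GI:"):
            gi = int(token[3:])
    return accession, int(version), gi


# Hypothetical values, for illustration only.
print(is_nucleotide_accession("U12345"))      # 1 + 5
print(is_nucleotide_accession("AF123456"))    # 2 + 6
print(is_nucleotide_accession("NM_007348"))   # fits neither pattern
print(parse_version_line("VERSION     AF123456.2  GI:1234567"))
```

Splitting the accession from the version makes the rule on the slide easy to see: an update changes the second and third values while the first stays fixed.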

The Header (keywords)
▣ Keyword line
1. The keyword line is another historical relic that is, in many cases, unfortunately misused.
2. NCBI discourages the use of keywords but will include them on request, especially if the words are not present elsewhere in the record or are used in a controlled fashion (EST, STS, GSS, HTG).

The Header (source)
▣ Source line
The source line will have either the common name for the organism or its scientific name.

The Header (references)
Each GenBank record must have at least one reference or citation.
Published paper: the PubMed identifier provides a link to the PubMed database.
Unpublished paper: may also be submitted as a reference.
Direct submission: serves as a placeholder for a publication.

The Header (comment)
This section includes a great variety of notes and comments that refer to the whole record. This section is optional and not found in most records in GenBank. The comment section also contains information about the history of the sequence. If the sequence of a particular record is updated, the comment will contain a pointer to the previous version of the record.

The Feature Table
The feature table is the most important direct representation of the biological information in the record. A full set of annotations within the record facilitates quick extraction of the relevant biological features and allows the submitter to indicate why this record was submitted to the databases.

The Source Feature
The source feature is the only feature that must be present on all GenBank records. All DNA sequence records have some origin, even if synthetic in the extreme case. Care should be taken to avoid adding superfluous information to the record.

The CDS Feature

The CDS Feature
The CDS feature contains instructions to the reader on how to join two sequences together or on how to make an amino acid sequence from the indicated coordinates and the inferred genetic code.
▣ Database cross-reference (db_xref)
This controlled qualifier allows the databases to cross-reference the sequence in question to an external database with an identifier used in that database.
▣ protein_id
Each protein sequence is assigned a protein_id or protein accession number. The format of this accession number is “3 + 5”, or three letters and five digits. Because amino acid sequences represent one of the most important by-products of the nucleotide sequence database, much attention is devoted to making sure they are valid. These sequences are the starting material for the protein databases and offer the most sensitive way of making new gene discoveries.
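To illustrate what "instructions on how to join two sequences together" means in practice, the sketch below reads a location such as join(3..8,15..20), splices the indicated 1-based inclusive ranges out of a nucleotide sequence, and translates the result with the standard genetic code. It is a simplified illustration under several assumptions: complement() locations, partial-end markers, alternative genetic codes and non-default reading frames are all ignored, the optional ".version" suffix on the protein_id is an assumption, and the sequence and protein_id in the example are made up.

```python
import re
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# "3 + 5": three letters and five digits; the version suffix is an assumption.
PROTEIN_ID = re.compile(r"[A-Z]{3}\d{5}(?:\.\d+)?")


def parse_location(location):
    """Turn '3..8' or 'join(3..8,15..20)' into 1-based (start, end) pairs."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(sequence, location):
    # Coordinates are 1-based and inclusive, hence start - 1.
    return "".join(sequence[start - 1:end] for start, end in parse_location(location))


def translate(cds):
    cds = cds.upper().replace("U", "T")
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[i:i + 3], "X")  # X for unknown codons
        if amino_acid == "*":                            # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up sequence: two coding segments separated by a short intron.
genomic = "ccATGGCCgtaagtTTTTAAgg"
cds = extract_cds(genomic, "join(3..8,15..20)")
print(cds, translate(cds))
print(PROTEIN_ID.fullmatch("AAA12345.1") is not None)
```

The same coordinates that tell a human reader where the exons are also let a program rebuild the protein, which is why an error in a CDS feature propagates into the protein databases.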

The Gene Feature
The gene feature represents a segment of DNA that can be identified with a name or some arbitrary number, as is often used in genome sequencing projects. The gene feature allows the user to see the gene area of interest and in some cases to select it.
The RNA Feature
Although RNA sequences are presently not instantiated into separate records as protein sequences are, these sequences are essential to our understanding of how higher genomes are organized. The RNA feature on a genomic record should represent the experimental evidence of the presence of that biological molecule.