

SEQUENCE DATABASES Daniel Svozil

Primary sequence databases
All published genome sequences are available over the internet; depositing them is a requirement of every scientific journal.
Main resources (the primary databases, the "big three"):
NCBI database (GenBank)
European Molecular Biology Laboratory (EMBL) database
DNA Database of Japan (DDBJ)
Together, DDBJ/EMBL/GenBank form the International Nucleotide Sequence Database Collaboration (INSDC). They contain all publicly available nucleotide sequences and their protein translations. They exchange data nightly, so they contain essentially the same data.

GenBank
Select Nucleotide in the drop-down menu.
A local copy of the database can be built from a release (every 2 months in GenBank).
15 February 2012, release ,384,889,783 bases, from 149,819,246 reported sequences, approx. 100,000 organisms.
Exponential growth, doubling every 18 months.
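
The "doubling every 18 months" figure can be turned into a quick growth estimate. The sketch below is only an illustration: it assumes the doubling time stays constant, and the function name `growth_factor` is made up for this example.

```python
def growth_factor(months: float, doubling_time: float = 18.0) -> float:
    """Factor by which the database grows over `months`.

    Assumes steady exponential growth with a fixed doubling time
    (18 months, the figure quoted on the slide).
    """
    return 2.0 ** (months / doubling_time)


if __name__ == "__main__":
    for years in (1, 3, 5, 10):
        print(f"after {years:>2} years: x{growth_factor(12 * years):.1f}")
```

Under this assumption, three years correspond to two doublings, i.e. a fourfold increase.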

Data come from direct submissions from individual laboratories, as well as bulk submissions from large-scale sequencing centers.
Disadvantage: primary databases contain experimental results (with some interpretation, i.e. annotation) but are not curated. There is no guarantee about data quality. Curated reviews are found in secondary databases.
Sequences are identified by an accession number: a unique combination of letters and numbers, reported in the scientific papers describing that sequence, e.g. X01714 (1+5 variety), GL (2+6 variety).
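
The accession number varieties can be checked mechanically. The following is a minimal sketch, assuming "1+5" means one letter followed by five digits and the two-letter variety means two letters followed by six digits; the function name `accession_variety` is hypothetical, and newer or longer accession formats are deliberately not covered.

```python
import re
from typing import Optional

# Assumed patterns: letters first, then digits.
_PATTERNS = {
    "1+5": re.compile(r"[A-Z]\d{5}"),
    "2+6": re.compile(r"[A-Z]{2}\d{6}"),
}


def accession_variety(accession: str) -> Optional[str]:
    """Return "1+5" or "2+6" for a nucleotide accession, else None.

    A trailing ".version" suffix (as in accession.version) is ignored.
    """
    base = accession.strip().upper().split(".")[0]
    for name, pattern in _PATTERNS.items():
        if pattern.fullmatch(base):
            return name
    return None


if __name__ == "__main__":
    for acc in ("X01714", "U90223", "AF018430"):
        print(acc, accession_variety(acc))
```

The accessions used in the demonstration are the ones mentioned elsewhere in these slides.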

GenBank - prokaryotic gene
Prokaryotes:
genome: circular DNA
genome size: Mb
gene density: 1 gene per 1 kb
70% coding for proteins
no overlap between genes
genes transcribed right after the promoter
genes are in a single piece (no splicing)

Low variability in prokaryotic promoters. Typical promoter: the Pribnow box at -10, with consensus T(80) A(95) T(45) A(60) A(50) T(96), where each number is the percentage occurrence of that base at that position.
Protein sequences are derived by translating the longest open reading frame (ORF, from ATG to STOP) spanning the gene-transcript sequence.
The mRNA sequence gets translated into a protein after a special signal, called the Ribosome Binding Site (RBS).
Source: Bioinformatics for Dummies
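
The "longest ORF from ATG to STOP" rule can be sketched in a few lines. This is an illustrative simplification, not the procedure GenBank annotators use: it scans only the three reading frames of the strand it is given, assumes the standard stop codons TAA, TAG and TGA, and the function name `longest_orf` is made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def longest_orf(seq):
    """Find the longest ATG..stop ORF on the given strand.

    Returns (start, end, orf_sequence) with 1-based inclusive coordinates,
    where `end` is the last nucleotide of the stop codon, or None if no
    complete ORF exists.
    """
    seq = seq.upper()
    best = None
    for frame in range(3):
        start = None  # index of the first ATG of the currently open ORF
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                candidate = (start + 1, i + 3, seq[start:i + 3])
                if best is None or len(candidate[2]) > len(best[2]):
                    best = candidate
                start = None
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTTAAGG"))
```

For the toy sequence in the demonstration, the only ORF is ATGAAATTTTAA at positions 3 to 14.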

Transcription runs in the 5’ → 3’ direction of the newly synthesized RNA strand.
It is asymmetric: only one DNA strand is transcribed ([-], template, non-coding, antisense).
The other strand ([+], nontemplate, coding, sense) has a sequence identical to the RNA (with T in place of U).
The terms coding and sense relate to the resulting protein (the mRNA codes for the protein; it makes sense by determining the amino-acid sequence).
The 1st nucleotide of the transcribed RNA corresponds to DNA nucleotide +1.
Sequence before this point (i.e. opposite to the flow of transcription): upstream (-).
Sequence behind this point (i.e. in the direction of transcription): downstream (+).
Figure (Tara Robinson, Genetics for Dummies): antisense strand, sense strand, TATA.
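
The relationship between the two DNA strands and the RNA can be made concrete with a short sketch. All function names here are hypothetical, strands are assumed to be written 5’ to 3’, and the numbering helper assumes the common convention that there is no position 0 (the base just before +1 is -1).

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, written 5'->3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def mrna_from_coding(coding):
    """The RNA has the same sequence as the coding (sense) strand, with U for T."""
    return coding.upper().replace("T", "U")


def mrna_from_template(template):
    """The RNA is complementary to the template (antisense) strand."""
    return mrna_from_coding(reverse_complement(template))


def relative_position(index, tss_index):
    """Convert a 0-based index on the coding strand to +/- numbering.

    `tss_index` is the 0-based index of the transcription start (+1).
    Assumes there is no position 0: upstream bases are -1, -2, ...
    """
    offset = index - tss_index
    return offset + 1 if offset >= 0 else offset


if __name__ == "__main__":
    coding = "ATGGCC"
    template = reverse_complement(coding)
    print(template, mrna_from_template(template), mrna_from_coding(coding))
```

Both routes give the same RNA, which is why sequence databases can store only the coding strand.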

GenBank - prokaryotic gene
Search for X
A GenBank entry is referred to as being in flat-file format. It is called so because you can read it in a linear fashion; it does not involve indexes or pointers (well, actually it contains several hyperlinks, so it is not 100% flat).

The header
LOCUS – the locus name (an arbitrary name), sequence length, molecule type, division code (classification), date of last modification.
DEFINITION – short definition of the gene.
ACCESSION – referred to as the primary accession number.
VERSION – accession.version, GI (geninfo identifier, GenBank specific; accession.version is now preferred).
KEYWORDS – list of terms broadly characterizing the entry; kept for historical reasons, no controlled vocabulary, not used in new records.
SOURCE – common name of the organism.

The header
ORGANISM – formal scientific name for the source organism (genus and species) and its lineage, based on the phylogenetic classification scheme used in the NCBI Taxonomy Database.
SOURCE vs. ORGANISM: baker's yeast vs. Saccharomyces cerevisiae; a search for either of these will yield the same results.
REFERENCE – at least one.
COMMENT – optional, some comment.
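
Because the entry is a flat file, the header fields can be read with plain string handling. The sketch below is a simplified illustration: the record in `SAMPLE_HEADER` is synthetic (not a real GenBank entry), the function name `parse_header` is made up, and the parser assumes that keyword lines are indented by fewer than 12 characters while continuation lines are indented by 12 or more.

```python
SAMPLE_HEADER = """\
LOCUS       TOY0001                  120 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Synthetic example record, not a real GenBank entry.
ACCESSION   TOY0001
VERSION     TOY0001.1
KEYWORDS    .
SOURCE      baker's yeast
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi.
FEATURES             Location/Qualifiers
"""


def parse_header(text):
    """Collect header keywords and their values up to the FEATURES line."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("FEATURES"):
            break  # the features table starts here
        indent = len(line) - len(line.lstrip())
        if indent < 12:
            # A keyword line such as LOCUS or the sub-keyword ORGANISM.
            keyword, _, value = line.strip().partition(" ")
            current = keyword
            fields[current] = value.strip()
        elif current is not None:
            # A continuation line belongs to the previous keyword.
            fields[current] = (fields[current] + " " + line.strip()).strip()
    return fields


if __name__ == "__main__":
    for key, value in parse_header(SAMPLE_HEADER).items():
        print(f"{key:<10} {value}")
```

For real work, an established parser is a safer choice than hand-written code like this.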

The features table

Direct representation of the biological information in the record. Each feature has a feature key (which biological property), location information (where the feature is located), and additional qualifiers.
source – mandatory; origin of specific regions of the sequence, useful when you want to distinguish cloning vectors from host sequences.
promoter – coordinates of a promoter element. In X01714, a –35 region is in , another promoter -10 region
misc_feature – in this case the putative location of the transcription start (mRNA synthesis).
RBS (Ribosome Binding Site) – the location of the last upstream element.

The features table CDS (CoDing Segment) – describes the gene’s open reading frame (ORF): The first line indicates the coordinates of the ORF from its initial ATG to the last nucleotide of the first stop codon TAA (343 to 798). Each of the following lines (indented at the same level) gives the name of a protein product, indicates the reading frame to use (here, 343 is the first base of the first codon), the genetic code to apply (/transl_table), and a number of IDs for the protein sequence. /translation introduces the conceptual amino-acid sequence of the coding segment. This sequence is a computer translation that uses the coordinates, reading frame, and genetic code indicated in the preceding lines.
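
The CDS coordinates translate directly into a protein length. The sketch below works through the 343 to 798 span quoted above; the function names are hypothetical, and only the simple `start..end` location form is handled (no joins, complements or partial markers).

```python
def parse_span(location):
    """Parse a simple 'start..end' location into two 1-based integers."""
    start, end = location.split("..")
    return int(start), int(end)


def cds_summary(location):
    """Length of a CDS in nucleotides, codons and amino acids.

    Assumes the span runs from the first base of the start codon to the
    last base of the stop codon, so the stop codon is not counted as an
    amino acid.
    """
    start, end = parse_span(location)
    length = end - start + 1  # coordinates are inclusive
    if length % 3 != 0:
        raise ValueError("CDS length is not a multiple of 3")
    codons = length // 3
    return {"nucleotides": length, "codons": codons, "amino_acids": codons - 1}


if __name__ == "__main__":
    print(cds_summary("343..798"))
    # {'nucleotides': 456, 'codons': 152, 'amino_acids': 151}
```

The inclusive coordinates are the usual source of off-by-one errors: 798 - 343 is 455, but the span covers 456 nucleotides.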

The sequence section
Starts with ORIGIN, ends with //
Each line contains 60 nucleotides; the 1st nucleotide gets number 1.
Save in FASTA format (Display/FASTA (text)): a single-line description starts with > and should be shorter than 80 characters.
File extensions: *.fasta, *.fa, *.seq, *.fsa
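
The conversion from the sequence section to FASTA can be sketched as follows. This is an illustration under stated assumptions: the record text is a toy example, the function name `origin_to_fasta` is made up, and the 80-character check simply mirrors the guideline above.

```python
def origin_to_fasta(record_text, description, width=60):
    """Extract the sequence between ORIGIN and // and format it as FASTA."""
    letters = []
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):
            break
        if in_origin:
            # Drop the leading position number and the spaces between blocks.
            letters.append("".join(ch for ch in line if ch.isalpha()))
    sequence = "".join(letters).upper()
    header = ">" + description
    if len(header) >= 80:
        raise ValueError("description line should be shorter than 80 characters")
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return header + "\n" + "\n".join(rows) + "\n"


if __name__ == "__main__":
    toy_record = "ORIGIN\n        1 atggctagct agctaggatc\n       21 ttaa\n//\n"
    print(origin_to_fasta(toy_record, "toy sequence (synthetic)", width=10), end="")
```

The `width` parameter is only there to keep the demonstration short; 60 matches the line length of the GenBank sequence section.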

GenBank - eukaryotic gene
Eukaryotes:
genome: multiple pieces (chromosomes)
genome size: 10 Mb – 670 Gbp
gene density: 1 gene per 100 kb in human, much lower than in prokaryotes
genome is not efficient: less than 5% codes for proteins in human
genes on opposite DNA strands might (rarely) overlap
genes are transcribed right after the promoter, but sequence elements located far away can have a strong influence on this process
splicing: exons + introns
alternative splicing (1 gene = more proteins, genes in human result in proteins)

Tara Robinson, Genetics for Dummies

GenBank - eukaryotic gene
Search for U
I deliberately chose an mRNA sequence, not a genomic sequence. Thus this entry is not that complex: no exons etc.
sig_peptide – location of a mitochondrial targeting sequence
mat_peptide – exact boundaries of the mature peptide

GenBank - eukaryotic gene
Search for AF – the gene from which the previous mRNA sequence originated. This sequence is still rather simple, but it already contains eukaryote-specific entries.

SEGMENT: This field relates to the mosaic structure of eukaryotic genes. It indicates that the current GenBank entry is the second segment of a super entry made of four. You need all four entries to reconstruct the complete mRNA sequence used as a template for producing the protein.
The source section contains a /map section. For AF018430, it indicates that the sequence belongs to chromosome 15, and was more precisely mapped on the long arm (q) of this chromosome, within the q21.1 cytogenetic band.

gene – describes precisely the reconstruction of the various mRNAs spread over several separate entries.
order – exon splicing recipe: take nucleotides from positions 1 to 1735 from entry AF018429, add nucleotides from positions 1 to 1177 from the current entry, …
The indicates that the gene might actually continue beyond the indicated position.
mRNA – alternative splicing
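
The splicing recipe amounts to concatenating 1-based, inclusive spans taken from several entries. The sketch below shows the idea with toy data: the entry names and sequences are invented placeholders, and the function name `assemble` is hypothetical.

```python
def assemble(recipe, entries):
    """Concatenate spans from several entries in the order given.

    `recipe` is a list of (entry_name, start, end) with 1-based inclusive
    coordinates; `entries` maps entry names to their sequences.
    """
    parts = []
    for name, start, end in recipe:
        seq = entries[name]
        if not (1 <= start <= end <= len(seq)):
            raise ValueError(f"span {start}..{end} is outside entry {name}")
        parts.append(seq[start - 1:end])  # convert to a 0-based slice
    return "".join(parts)


if __name__ == "__main__":
    # Toy stand-ins for separate GenBank entries (not real sequences).
    toy_entries = {"ENTRY_A": "AAACCCGGG", "ENTRY_B": "TTTAAACCC"}
    toy_recipe = [("ENTRY_A", 1, 6), ("ENTRY_B", 4, 9)]
    print(assemble(toy_recipe, toy_entries))
```

With real data, each entry would first have to be retrieved by its accession number, which is why all segments of the super entry are needed.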

Bioinformatics for Dummies

exon – the position of the sole exon present in this sequence
search AF – multiple exons in a single entry
You get accession numbers by reading articles reporting on the sequence. After you’ve accessed the first GenBank entry relevant to your work, you can retrieve other related genes.
search U90223
Display – Summary – Related sequences
Retrieved: various mRNA forms and partial sequences, partial genomic sequences (around exons), and two large (154 kb and 192 kb) sequences of the 15q21.1 genomic region. There are some monkey sequences as well!

Sample GenBank Record

No accession number
GenBank is not the best database for keyword-based searches (gene-centric databases are). Querying the database by gene or protein keywords is still possible, but not that reliable.
Suppose you want to find the nucleotide sequence encoding the human dUTPase:
search for human [organism] AND dUTPase [Protein name]
find related sequences to AF
How many entries?
dUTPase is a nickname for which protein? “dUTP pyrophosphatase”
search for human [organism] AND “dUTP pyrophosphatase” [Title]
Are the resulting entries the same as in the previous search?
This illustrates the general difficulty in retrieving all entries relevant to a given subject, due to inconsistent usage of synonymous terms.
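
The fielded queries above can also be composed programmatically. The sketch below only builds the query string and a request URL; it does not contact NCBI. The helper names are made up, and the use of the E-utilities `esearch.fcgi` endpoint with the `nuccore` database is an assumption of this example, not something stated in the slides.

```python
from urllib.parse import urlencode

# Assumption: NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(*clauses):
    """Join (value, field) pairs into an Entrez query with AND.

    Values containing spaces are quoted so they are searched as phrases.
    """
    parts = []
    for value, field in clauses:
        if " " in value:
            value = f'"{value}"'
        parts.append(f"{value}[{field}]")
    return " AND ".join(parts)


def esearch_url(term, db="nuccore"):
    """Build (but do not send) a search request URL for the given query."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    nickname = entrez_term(("human", "organism"), ("dUTPase", "Protein name"))
    full_name = entrez_term(("human", "organism"), ("dUTP pyrophosphatase", "Title"))
    for term in (nickname, full_name):
        print(term)
        print(esearch_url(term))
```

Running both queries and comparing the hit lists is one way to see the synonym problem in practice.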

Many of the resulting entries are ESTs.
Limits – Exclude ESTs

RefSeq
Many sequences appear more than once in GenBank (redundancy).
NCBI developed the RefSeq collection, a curated secondary database.
Aim: to provide a comprehensive, integrated, nonredundant set of sequences.
For each model organism, RefSeq provides separate and linked records for the genomic DNA, the gene transcripts, and the proteins arising from those transcripts.
RefSeq is limited to major organisms for which sufficient data is available (more than 16,000 distinct “named” organisms as of Jan 2012), while GenBank includes sequences for any organism submitted.

RefSeq
RefSeq entries have a distinct accession number format, “2 + 6” with an underscore, e.g. NC_

Category   Description
NT         genomic contigs
NC         complete genomic molecules
NG         incomplete genomic region
NM         mRNA
NR         ncRNA
NP         protein

Reference:
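
The prefix table lends itself to a simple lookup. In the sketch below the function name `refseq_category` is hypothetical, the accession `NM_123456` is a made-up placeholder, and only the six prefixes listed above are covered.

```python
# Prefix table as given on the slide.
REFSEQ_CATEGORIES = {
    "NT": "genomic contigs",
    "NC": "complete genomic molecules",
    "NG": "incomplete genomic region",
    "NM": "mRNA",
    "NR": "ncRNA",
    "NP": "protein",
}


def refseq_category(accession):
    """Return the RefSeq category for an accession, or None.

    Accessions without an underscore (e.g. plain GenBank accessions) and
    prefixes not in the table both yield None.
    """
    prefix, underscore, _ = accession.strip().partition("_")
    if not underscore:
        return None
    return REFSEQ_CATEGORIES.get(prefix.upper())


if __name__ == "__main__":
    for acc in ("NM_123456", "X01714"):
        print(acc, "->", refseq_category(acc))
```

The underscore check is what separates RefSeq accessions from the plain GenBank formats described earlier.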

Practise
Search the nucleotide domain of Entrez for breast cancer. Provide the following information:
1. The number of Core nucleotide sequence records associated with breast cancer
2. Number of the above sequence records that are from the RefSeq database
3. Number of Core nucleotide sequences associated with breast cancer that are mRNAs
4. Number of the above mRNA sequence records that are in RefSeq and have the words breast cancer in their titles
5. Number of Human gene BRCA1 RefSeq mRNA sequence records with the words breast cancer in their titles

Practise
6. Accession numbers of the mRNA records for human BRCA1a gene in RefSeq
7. Total number of nucleotides reported in the 1st transcript variant
8. Exact chromosomal location of BRCA1a on the human genome
9. Number of times this sequence was updated
10. Total number of amino acids that this mRNA encodes
11. Identify the last amino acid in the encoded protein
12. Total number of exons and introns in this gene variant (BRCA1a)

Practise
13. BRCA1a encodes the full-length BRCA1 protein (isoform 1). How many variants (isoforms) exist for BRCA1? Also provide the accession numbers of their mRNA sequences in RefSeq.
14. How is “isoform 2” different from “isoform 1”?
15. Exact location (nucleotide position) of the BRCA1a start codon
16. Exact location and sequence of the stop codon in BRCA1a
17. What is the sequence of the polyA signal for BRCA1a?

Querying the NCBI database

Gene-centric databases
Sequence databases are great tools when you want to come up with a bibliography for a particular sequence. However, they do not provide easy access to sequence data when your query deals with broader issues related to a gene or function.
The second-generation nucleotide-sequence databases have adopted a more gene-centric perspective: all the sequence information relevant to a given gene is made accessible at once.
Example: NCBI Entrez Gene

Genome-centric databases
Nucleotide sequences are routinely determined at the whole-genome or chromosome scale, at least for microorganisms.
We now have information not only about individual gene sequences, but also, for example, about their relative positions or strand orientation.
To take advantage of this more global information, researchers have had to design state-of-the-art genome-centric sequence-information management systems that can connect specialized sequence collections with browsing tools.