Gene Recognition
Credits for slides: Serafim Batzoglou, Marina Alexandersson, Lior Pachter, Serge Saxonov


The Central Dogma
DNA --(transcription)--> RNA --(translation)--> Protein
  DNA:     CCTGAGCCAACTATTGATGAA
  RNA:     CCUGAGCCAACUAUUGAUGAA
  Protein: PEPTIDE
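The example on this slide can be reproduced with a few lines of Python. This is a minimal sketch: the names `CODON_TABLE`, `transcribe`, and `translate` are illustrative, and the codon table is deliberately restricted to the seven codons that occur in the slide's sequence (a full table has 64 entries).

```python
# Minimal sketch of transcription and translation on the slide's example.
# Assumption: CODON_TABLE lists only the codons needed for this sequence.
CODON_TABLE = {
    "CCU": "P", "GAG": "E", "CCA": "P", "ACU": "T",
    "AUU": "I", "GAU": "D", "GAA": "E",
}

def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: replace T with U."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Read the mRNA three bases at a time and map each codon to an amino acid."""
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "".join(CODON_TABLE[c] for c in codons)

dna = "CCTGAGCCAACTATTGATGAA"
mrna = transcribe(dna)
protein = translate(mrna)
print(mrna)     # CCUGAGCCAACUAUUGAUGAA
print(protein)  # PEPTIDE
```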

Gene structure
Gene layout: exon1, intron1, exon2, intron2, exon3
Steps shown: transcription, splicing, translation
- exon = protein-coding
- intron = non-coding
- Codon: a triplet of nucleotides that is converted to one amino acid

Locating Genes
We have a genome sequence, maybe with related genomes aligned to it. Where are the genes?
- Yeast genome is about 70% protein coding, with about 6,000 genes
- Human genome is about 1.5% protein coding, with about 22,000 genes

Finding Genes in Yeast
[Diagram, 5' to 3': Intergenic | Coding, from start codon ATG to stop codon TAG/TGA/TAA | Intergenic; the transcript is marked along the sequence]
- Mean coding length is about 1500 bp (500 codons)

Finding Genes in Yeast: ORF Scanning
- Look for long open reading frames (ORFs)
- ORFs start with ATG and contain no in-frame stop codons
- Long ORFs are unlikely to occur by chance (i.e., they are probably genes)
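ORF scanning as described here can be sketched as follows. The function name `find_orfs`, the `min_codons` threshold, and the toy sequence are illustrative choices, and the sketch scans only the forward strand (a real scan would also cover the reverse complement). The last lines estimate how unlikely a 500-codon stop-free stretch is, under the simplifying assumption that all 64 codons are equally likely.

```python
STOP_CODONS = {"TAG", "TGA", "TAA"}

def find_orfs(seq, min_codons=100):
    """Return (start, end, n_codons) for forward-strand ORFs.

    An ORF runs from the first ATG in a frame to the next in-frame stop codon.
    `end` is exclusive and includes the stop codon; `n_codons` counts the ATG
    but not the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, n_codons))
                start = None  # look for the next ORF in this frame
    return orfs

toy = "CCATGAAATTTGGGTAACC"  # made-up sequence: ATG AAA TTT GGG TAA in frame 2
orfs = find_orfs(toy, min_codons=3)
print(orfs)  # [(2, 17, 4)]

# Chance of 500 consecutive non-stop codons if codons were uniform (assumption):
p_no_stop_500 = (61 / 64) ** 500
print(p_no_stop_500)
```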

Finding Genes in Yeast
[Figure: yeast ORF distribution]

Introns: The Bane of ORF Scanning
[Diagram, 5' to 3': intergenic sequence, then exons alternating with introns at splice sites, then intergenic sequence; the start codon ATG, the stop codon TAG/TGA/TAA, and the transcript are marked]

Introns: The Bane of ORF Scanning
- Drosophila: 3.4 introns per gene on average; mean intron length 475 bp, mean exon length 397 bp
- Human: 8.8 introns per gene on average; mean intron length 4400 bp, mean exon length 165 bp
- ORF scanning is defeated

Where are the genes?

Needles in a Haystack

Now What?
We need to use more information to help recognize genes:
- Regular structure
- Exon/intron lengths
- Nucleotide composition
- Biological signals: start codon, stop codon, splice sites
- Patterns of conservation

Regular Gene Structure
- Protein coding region starts with ATG and ends with TAA/TAG/TGA
- Exons alternate with introns
- Introns start with GT/GC and end with AG
- Each exon has a reading frame determined by the codon position at the end of the preceding exon

Next Exon: Frame 0 Next Exon: Frame 1
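How the frame carries over from one exon to the next can be sketched as below. The convention used here is an assumption made for illustration: the "frame" of an exon is the number of bases of an unfinished codon inherited from the preceding exon (0, 1, or 2). The function name and exon lengths are made up.

```python
def exon_frames(exon_lengths):
    """Return, for each exon, how many bases of an unfinished codon it inherits.

    Assumed convention: frame 0 means the exon begins at a codon boundary.
    """
    frames = []
    carried = 0  # bases of a partial codon left over from earlier exons
    for length in exon_lengths:
        frames.append(carried)
        carried = (carried + length) % 3
    return frames

lengths = [100, 50, 63]          # hypothetical coding exon lengths in bp
frames = exon_frames(lengths)
print(frames)                    # [0, 1, 0]
print(sum(lengths) % 3 == 0)     # True: the spliced coding sequence is whole codons
```

An intron that interrupts a codon does not change the protein; it only changes the frame in which the next exon has to be read, which is why a gene model must track this value across introns.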

Exon/Intron Lengths

Nucleotide Composition
Base composition in exons is characteristic due to the genetic code

Amino acid      SLC   DNA codons
Isoleucine      I     ATT, ATC, ATA
Leucine         L     CTT, CTC, CTA, CTG, TTA, TTG
Valine          V     GTT, GTC, GTA, GTG
Phenylalanine   F     TTT, TTC
Methionine      M     ATG
Cysteine        C     TGT, TGC
Alanine         A     GCT, GCC, GCA, GCG
Glycine         G     GGT, GGC, GGA, GGG
Proline         P     CCT, CCC, CCA, CCG
Threonine       T     ACT, ACC, ACA, ACG
Serine          S     TCT, TCC, TCA, TCG, AGT, AGC
Tyrosine        Y     TAT, TAC
Tryptophan      W     TGG
Glutamine       Q     CAA, CAG
Asparagine      N     AAT, AAC
Histidine       H     CAT, CAC
Glutamic acid   E     GAA, GAG
Aspartic acid   D     GAT, GAC
Lysine          K     AAA, AAG
Arginine        R     CGT, CGC, CGA, CGG, AGA, AGG

Biological Signals
How does the cell recognize start/stop codons and splice sites?
- In part, from characteristic base composition
- Donor site (start of intron) is recognized by a section of U1 snRNA
    U1 snRNA:             GUCCAUUCA
    Donor site consensus: MAGGTRAGT
    (M means "A or C", R means "A or G")
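The relationship between the two strings on this slide can be checked directly. The sketch below assumes the U1 snRNA segment is written so that it pairs position by position with the donor site as shown; the helper names are illustrative, and the `IUPAC` table covers only the symbols used in the consensus.

```python
# Ambiguity codes used in the donor consensus (subset of the IUPAC alphabet).
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "M": "AC", "R": "AG"}
DONOR_CONSENSUS = "MAGGTRAGT"
U1_SNRNA = "GUCCAUUCA"  # as written on the slide

def matches_consensus(site, consensus=DONOR_CONSENSUS):
    """True if every base of `site` is allowed by the consensus symbol at that position."""
    site = site.upper()
    return len(site) == len(consensus) and all(
        base in IUPAC[symbol] for base, symbol in zip(site, consensus)
    )

def pairing_partner(rna):
    """Watson-Crick partner of each RNA base, written with DNA letters."""
    pair = {"G": "C", "C": "G", "A": "T", "U": "A"}
    return "".join(pair[base] for base in rna)

partner = pairing_partner(U1_SNRNA)
print(partner)                     # CAGGTAAGT
print(matches_consensus(partner))  # True
```

The sequence that pairs perfectly with the U1 segment is one instance of the consensus (M = C, R = A), which is consistent with the consensus reflecting complementarity to U1.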

[Figure: sequence with signal-like substrings marked: atg, tga, ggtgag, caggtg, cagatg, cagttg, caggcc, ggtgag]

Splice Sites
Donor site, 5' to 3': base frequencies (%) by position

Position   -8   …   17
A          26   …   21
C          26   …   27
G          25   …   27
T          23   …   25


Splice Sites
- WMM: weight matrix model = PSSM (Staden 1984)
- WAM: weight array model = 1st-order Markov (Zhang & Marr 1993)
- MDD: maximal dependence decomposition (Burge & Karlin 1997)
  - Decision-tree algorithm to take pairwise dependencies into account
    - For each position i, calculate S_i = \sum_{j \ne i} \chi^2(C_i, X_j)
    - Choose i* such that S_{i*} is maximal and partition into two subsets, until
      - no significant dependencies are left, or
      - there are not enough sequences in the subset
  - Train separate WMM models for each subset
- MDD decision tree for donor sites (each row splits the left-hand subset of the row above):
    All donor splice sites
    G5            | not G5
    G5 G-1        | G5 not G-1
    G5 G-1 A2     | G5 G-1 not A2
    G5 G-1 A2 U6  | G5 G-1 A2 not U6
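The simplest of these models, the WMM, can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure from any of the cited papers: the training sites are made up, the pseudocount and the uniform background of 0.25 are arbitrary choices, and the function names are hypothetical.

```python
import math
from collections import Counter

BASES = "ACGT"

def train_wmm(sites, pseudocount=1.0):
    """Estimate one independent base distribution per position from aligned sites."""
    length = len(sites[0])
    total = len(sites) + pseudocount * len(BASES)
    model = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        model.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return model

def log_odds(model, site, background=0.25):
    """Sum of per-position log2 ratios: site model versus uniform background."""
    return sum(math.log2(model[i][b] / background) for i, b in enumerate(site))

# Made-up donor-like training sites (9 bases each), for illustration only.
toy_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "AAGGTAAGT"]
wmm = train_wmm(toy_sites)

good = log_odds(wmm, "CAGGTAAGT")
bad = log_odds(wmm, "CATTCAAGT")
print(good > bad)  # True
```

A WMM treats positions as independent. A WAM would condition each position on the previous one, and MDD would first split the training set on the most informative position before fitting a WMM to each subset.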

Patterns of Conservation
- Functional sequences are much more conserved than nonfunctional sequences
- Signal sequences show compensatory mutations
  - If one position mutates away from consensus, often a different one will mutate to consensus
- Coding sequence shows a three-periodic pattern of conservation

Three Periodicity
- Most amino acids can be coded for by more than one DNA triplet
- Usually, the degeneracy is in the last position

Species   Sequence   Amino acids
Human     CCT GTT    Proline, Valine
Mouse     CCA GTC    Proline, Valine
Rat       CCA GTC    Proline, Valine
Dog       CCG GTA    Proline, Valine
Chicken   CCC GTG    Proline, Valine
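The three-periodic pattern in this example can be made explicit by counting how many different bases appear in each alignment column. The sketch below uses the five sequences from the slide; the function name is illustrative.

```python
aligned = {
    "Human":   "CCTGTT",
    "Mouse":   "CCAGTC",
    "Rat":     "CCAGTC",
    "Dog":     "CCGGTA",
    "Chicken": "CCCGTG",
}

def distinct_bases_per_column(seqs):
    """Number of different bases observed in each column of an ungapped alignment."""
    return [len(set(column)) for column in zip(*seqs)]

variability = distinct_bases_per_column(aligned.values())
print(variability)  # [1, 1, 4, 1, 1, 4]

# Total variability grouped by codon position (1st, 2nd, 3rd base of each codon).
by_codon_position = [sum(variability[p::3]) for p in range(3)]
print(by_codon_position)  # [2, 2, 8]
```

All variation falls on the third base of each codon, while the encoded amino acids stay the same in every species.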

Hidden Markov Models for Gene Finding
Sequence: GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA
Region labels: Exon, Intron, Intergenic
HMM states: Intergene State, First Exon State, Intron State
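Labeling a sequence with the most likely state path is done with the Viterbi algorithm. The sketch below runs it on the slide's sequence with a three-state toy HMM. All start, transition, and emission probabilities are invented for illustration; they are not taken from the slides or from any gene finder, and a real model has many more states (frames, signals, strands).

```python
import math

STATES = ("intergenic", "exon", "intron")

# All numbers below are made-up illustrative parameters.
START = {"intergenic": 1.0, "exon": 0.0, "intron": 0.0}
TRANS = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon":       {"intergenic": 0.05, "exon": 0.90, "intron": 0.05},
    "intron":     {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMIT = {
    "intergenic": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    "exon":       {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron":     {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
}

def log(p):
    """Natural log that maps probability 0 to -infinity instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(seq):
    """Return the most probable state path for `seq` under the toy HMM."""
    scores = [{s: log(START[s]) + log(EMIT[s][seq[0]]) for s in STATES}]
    backpointers = []
    for base in seq[1:]:
        prev = scores[-1]
        column, pointer = {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + log(TRANS[r][s]))
            column[s] = prev[best] + log(TRANS[best][s]) + log(EMIT[s][base])
            pointer[s] = best
        scores.append(column)
        backpointers.append(pointer)
    # Trace back from the best final state.
    path = [max(STATES, key=lambda s: scores[-1][s])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    return path[::-1]

seq = "GTCAGATGAGCAAAGTAGACACTCCAGTAACGCGGTGAGTACATTAA"
path = viterbi(seq)
print("".join(state[0] if state != "intron" else "n" for state in path))
```

The printed string has one character per base (`i` for intergenic, `e` for exon, `n` for intron). Working in log space avoids numerical underflow on long sequences.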


GENSCAN

Burge and Karlin, Stanford, 1997
- Before the Human Genome Project
  - No alignments available
  - Estimated human gene count was 100,000
- Explicit state duration HMM (with tricks)
  - Intergenic and intronic regions have geometric length distribution
  - Exons are only possible when correct flanking sequences are present
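A geometric length distribution is what an ordinary HMM state with a self-loop produces: if the state stays with probability p at each step, a run has length k with probability p^(k-1) (1 - p) and mean length 1 / (1 - p). The sketch below illustrates this; the function names are hypothetical, and the stay probability is chosen, purely as an example, to give the mean human intron length of 4400 bp quoted earlier.

```python
def geometric_length_pmf(k, stay):
    """P(run length == k) for a state that loops on itself with probability `stay`."""
    return stay ** (k - 1) * (1 - stay)

def mean_length(stay):
    """Expected run length of a self-looping state."""
    return 1 / (1 - stay)

stay = 1 - 1 / 4400  # illustrative: gives a mean length of 4400
mean = mean_length(stay)
total = sum(geometric_length_pmf(k, stay) for k in range(1, 200001))

print(round(mean))  # 4400
print(geometric_length_pmf(1, stay) > geometric_length_pmf(2, stay))  # True
```

Under a geometric distribution the shortest length is always the most probable one. That is a tolerable approximation for introns and intergenic regions, but it is why a model that wants a peaked exon length distribution needs explicit state durations.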

GENSCAN
- Output probabilities for NC (noncoding) and CDS (coding sequence) depend on the previous 5 bases (5th-order)
  - P(X_i | X_{i-1}, X_{i-2}, X_{i-3}, X_{i-4}, X_{i-5})
- Each CDS frame has its own model
- WAM models for start/stop codons and acceptor sites
- MDD model for donor sites
- Separate parameters for regions of different GC content
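A 5th-order Markov model of this kind can be trained by counting which base follows each 5-base context. The sketch below is a simplified illustration, not GENSCAN's implementation: it uses one model per class rather than one per CDS frame, the training strings are tiny made-up repeats, and unseen contexts fall back to a uniform 0.25 (an assumption).

```python
import math
from collections import defaultdict

BASES = "ACGT"

def train_markov(seqs, order=5, pseudocount=1.0):
    """Count next-base occurrences for every `order`-base context."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, pseudocount))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    return dict(counts)

def cond_prob(model, context, base):
    """P(base | context); uniform if the context was never seen in training."""
    row = model.get(context)
    if row is None:
        return 0.25
    return row[base] / sum(row.values())

def log_likelihood(model, seq, order=5):
    """Log-probability of seq[order:] given its preceding contexts."""
    return sum(
        math.log(cond_prob(model, seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )

# Made-up training data standing in for coding and noncoding sequence.
coding_model = train_markov(["ATGGCC" * 10])
noncoding_model = train_markov(["ATATTA" * 10])

query = "GCCATGGCCATGGCC"
coding_score = log_likelihood(coding_model, query)
noncoding_score = log_likelihood(noncoding_model, query)
print(coding_score > noncoding_score)  # True
```

Comparing the two log-likelihoods is what lets the HMM prefer a coding or noncoding state for a stretch of sequence.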

GENSCAN Performance
- First program to do well on realistic sequences
  - Long, multiple genes in both orientations
- Pretty good sensitivity, poor specificity
  - 70% exon Sn, 40% exon Sp
- Not enough exons per gene
- Was the best gene predictor for about 4 years

TWINSCAN
Korf, Flicek, Duan, Brent, Washington University in St. Louis, 2001
- Uses an informant sequence to help predict genes
  - For human, informant is normally mouse
- Informant sequence consists of three characters
  - Match: |
  - Mismatch: :
  - Unaligned: .
- Informant sequence assumed independent of target sequence
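The three-character encoding can be sketched as a function of a target sequence and an informant row aligned to it. This is a simplified illustration, not TWINSCAN's actual preprocessing: it assumes the informant is already laid out base by base under the target, with `_` (or `-`) wherever no informant base is aligned, and it maps all such positions to "unaligned". The function name and example strings are illustrative.

```python
def conservation_sequence(target, informant, gap_chars="_-"):
    """Encode each target base as '|' (match), ':' (mismatch) or '.' (unaligned).

    Assumption: `informant` has the same length as `target`, with a gap
    character at positions where nothing is aligned.
    """
    if len(target) != len(informant):
        raise ValueError("target and informant rows must have equal length")
    symbols = []
    for t, q in zip(target.upper(), informant.upper()):
        if q in gap_chars:
            symbols.append(".")
        elif t == q:
            symbols.append("|")
        else:
            symbols.append(":")
    return "".join(symbols)

target    = "GGTGAGGTGACC"
informant = "GGTCAGC___CC"
cons = conservation_sequence(target, informant)
print(cons)  # |||:||:...||
```

The gene finder then sees two parallel strings of the same length, the target bases and this conservation string, and scores both under each state.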

The TWINSCAN Model
- Just like GENSCAN, except it adds models for the conservation sequence
- 5th-order models for CDS and NC, 2nd-order models for start and stop codons and splice sites
  - One CDS model for all frames
- Many informants tried, but mouse seems to be at the "sweet spot"

TWINSCAN Performance
- Slightly more sensitive than GENSCAN, much more specific
  - Exon sensitivity/specificity about 75%
- Much better at the gene level
  - Most genes are mostly right, about 25% exactly right
- Was the best gene predictor for about 4 years

N-SCAN
Gross and Brent, Washington University in St. Louis, 2005
- If one informant sequence is good, let's try more!
- Also several other improvements on TWINSCAN

N-SCAN Improvements
- Multiple informants
- Richer models of sequence evolution
- Frame-specific CDS conservation model
- Conserved noncoding sequence model
- 5' UTR structure model

HMM Outputs: GENSCAN, TWINSCAN, N-SCAN

  Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Conservation sequence  |||:||:||:|||||:||||||||

  Target                 GGTGAGGTGACCAAGAACGTGTTGACAGTA
  Informant1             GGTCAGC___CCAAGAACGTGTAG
  Informant2             GATCAGC___CCAAGAACGTGTAG
  Informant3             GGTGAGCTGACCAAGATCGTGTTGACACAA
  ...

Phylogenetic Bayesian Network Models

Homology-Based Gene Prediction
- Idea: try to predict a gene in one organism using a known orthologous gene or protein from another organism
- Genewise
  - Protein homology
- Projector
  - Gene structure homology
- Very accurate if (and only if??) homology is high

Evaluating Performance
- Three main levels of performance: gene, exon, nucleotide
- Two measures of performance:
  - Sensitivity: what fraction of the true features did we predict correctly?
  - Specificity: what fraction of our predicted features were correct?
- Testing standard is whole-genome prediction
  - Predicting on single-gene sequences is easier and less interesting
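At the exon level, the two measures defined here can be computed by comparing sets of exact exon coordinates. The sketch below follows the slide's definitions (so "specificity" is the fraction of predictions that are correct); the function name and the coordinates are made up for illustration.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) for exact exon matches.

    Exons are (start, end) tuples; a prediction counts only if both
    boundaries are exactly right.
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    correct = len(true_set & pred_set)
    sensitivity = correct / len(true_set) if true_set else 0.0
    specificity = correct / len(pred_set) if pred_set else 0.0
    return sensitivity, specificity

# Hypothetical annotation and prediction on one sequence.
annotated = [(100, 200), (300, 400), (500, 650), (800, 900)]
predicted = [(100, 200), (300, 410), (500, 650), (700, 750), (800, 900)]

sn, sp = exon_accuracy(annotated, predicted)
print(sn, sp)  # 0.75 0.6
```

Here one exon has a wrong boundary and one prediction is spurious, so three of four true exons are found and three of five predictions are right. Gene-level accuracy is stricter still, since every exon of a gene must be exact.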

Exact Exon Accuracy

Exact Gene Accuracy

Intron Sensitivity By Length

Human Informant Effectiveness

Drosophila Informant Effectiveness

The Future
- Many new genomes are being sequenced, and they will need annotations!
  - Current experimental "shotgun" methods are not enough
  - However, cheap targeted experiments are available to verify predicted genes
- Promising directions in gene prediction:
  - Conditional random fields
  - Multiple informants: can we actually get them to work???