Gene Prediction: Statistical Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 20, 2005 ChengXiang Zhai Department of Computer Science.


Gene Prediction: Statistical Methods (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 20, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from and Ying Xu’s lecturehttp://

Approaches to Gene Prediction
Similarity-based approaches:
–Exploit the fact that many genes are conserved across species
–Can be highly reliable
–Only good for finding genes similar to already known genes
Statistical approaches:
–Exploit statistical characteristics of coding and non-coding regions, plus other knowledge about genes
–Can potentially detect new genes
–May not be reliable
They can and should be combined:
–Currently there are no principled approaches for doing this
–Given a new genome, identify “known genes” first
–Learn from the “known genes” to identify new genes

Gene Prediction Analogy
A newspaper written in an unknown language
–Certain pages contain an encoded message, say 99 letters on page 7, 30 on page 12 and 63 on page 15.
How do you recognize the message? You could probably distinguish between the ads and the story (ads often contain the “$” sign)
The statistics-based approach to gene prediction tries to make similar distinctions between exons and introns.

Statistical Approach: Metaphor in Unknown Language
Noting the differing frequencies of symbols (e.g. ‘%’, ‘.’, ‘-’) and numerical symbols, could you distinguish between a story and the stock report in a foreign newspaper?

A Few Basic Questions
What exactly is a gene for the purpose of prediction?
–In Prokaryotes, gene = mRNA → Protein
–In Eukaryotes, gene = Exon (coding region)
What does a gene look like?
–Where does it start?
–Where does it end?
–What is the codon usage inside a gene (exon)?
–What is the codon usage outside a gene (intron)?
–…
How do we exploit such knowledge to identify genes?

Statistical Characteristics of a Gene Gene starts with a start codon Gene ends at a stop codon Splicing signals Codon usage distributions …

Gene Structure

Genetic Code and Stop Codons
UAA, UAG and UGA are the 3 stop codons that (together with the start codon AUG, written ATG in DNA) delineate Open Reading Frames

Six Frames in a DNA Sequence
Strand: GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG
Complement: CTGCAGACGAAACCTCTTGATGTAGTTGGCCTGACACCGACAATAATGAAGACTACCGTCTTACTAACAC
(Figure: both strands with stop codons and start codons marked in the six reading frames)
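
The six reading frames can be enumerated with a few lines of Python. This is only a minimal sketch: the helper names `reverse_complement` and `six_frames` are made up for illustration, and the input is assumed to be upper-case A/C/G/T with no ambiguity codes.

```python
# Hypothetical helpers for illustration; not taken from the lecture.
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOPS = {"TAA", "TAG", "TGA"}

def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna: str) -> dict:
    """Map frame labels (+1..+3, -1..-3) to lists of complete codons."""
    frames = {}
    for label, strand in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            codons = [strand[i:i + 3] for i in range(offset, len(strand) - 2, 3)]
            frames[f"{label}{offset + 1}"] = codons
    return frames

# Sequence copied from the slide above.
seq = "GACGTCTGCTTTGGAGAACTACATCAACCGGACTGTGGCTGTTATTACTTCTGATGGCAGAATGATTGTG"
for label, codons in six_frames(seq).items():
    stops = [i for i, c in enumerate(codons) if c in STOPS]
    starts = [i for i, c in enumerate(codons) if c == "ATG"]
    print(label, "stop codon indices:", stops, "start codon indices:", starts)
```

The minus frames are read from the reverse complement, so codon indices in those frames count from the 3' end of the strand shown on the slide.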

The “Sly Fox” & Effect of Base Deletion
In the following string:
THE SLY FOX AND THE SHY DOG
Delete 1, 2, and 3 nucleotides after the first ‘S’:
THE SYF OXA NDT HES HYD OG
THE SFO XAN DTH ESH YDO G
THE SOX AND THE SHY DOG
Which of the above makes the most sense?
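
The three deletions can be reproduced mechanically, which makes the frame-shift effect easy to see. The function name `delete_and_reframe` is a hypothetical helper, and the sketch treats the sentence as a string of letters read in triplets.

```python
def delete_and_reframe(text: str, n: int) -> str:
    """Delete n letters after the first 'S', then regroup into triplets."""
    letters = text.replace(" ", "")
    cut = letters.index("S") + 1          # position just after the first 'S'
    shifted = letters[:cut] + letters[cut + n:]
    return " ".join(shifted[i:i + 3] for i in range(0, len(shifted), 3))

sentence = "THE SLY FOX AND THE SHY DOG"
for n in (1, 2, 3):
    print(n, delete_and_reframe(sentence, n))
```

Only the deletion of three letters keeps the downstream "codons" readable, because it removes a whole triplet and leaves the reading frame intact.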

Splicing Signals and Exon Boundaries
Exons are interspersed with introns; the introns typically begin with GT and end with AG

Splicing mechanism (

Donor and Acceptor Sites: GT and AG dinucleotides
The end of one exon and the beginning of the next are signaled by donor and acceptor sites, which usually have GT and AG dinucleotides
Detecting these sites is difficult, because GT and AG appear very often
exon 1 | GT (donor site) … AG (acceptor site) | exon 2

Donor and Acceptor Sites: Motif Logos ( Donor: 7.9 bits Acceptor: 9.4 bits (Stephens & Schneider, 1996)

Codon Usage in Human Genome Biased codon usage in exons allows us to distinguish exons from introns

Codon Frequencies
Coding sequences are translated into protein sequences
We found the following: the dimer frequency in protein sequences is NOT evenly distributed
–The average frequency is 5%
–Some amino acids prefer to be next to each other
–Some other amino acids prefer not to be next to each other
(Figure label: Shewanella)

Dicodon Frequencies
The biased (uneven) dimer frequencies are the foundation of many gene finding programs!
Basic idea of gene finding: if a dimer has a lower than average frequency, proteins prefer not to have that dimer in their sequences; otherwise proteins prefer to have it
Hence if we see a dicodon encoding such a low-frequency dimer, we may want to bet against this dicodon being in a coding region!

General Steps for Gene Prediction
Identify candidate exons in Open Reading Frames (ORFs)
–Determine ORFs: an ORF starts with a start codon and ends at a stop codon
–Determine sites for acceptors/donors
Evaluate the potential of a candidate exon for coding (exploit codon usage)

Step 1: Identify ORFs

Long vs. Short ORFs
A long open reading frame may be a gene
–At random, we should expect one stop codon every (64/3) ~= 21 codons
–However, genes are usually much longer than this
A basic approach is to scan for ORFs whose length exceeds a certain threshold
–This is naïve because some genes (e.g. some neural and immune system genes) are relatively short
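
The threshold scan described above can be sketched as follows. The function name `find_orfs` and the default threshold of 100 codons are assumptions chosen for illustration. Only the forward strand is scanned here; the reverse strand would be handled by running the same function on the reverse complement.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna: str, min_codons: int = 100):
    """Yield (start, end) for ATG...stop ORFs in the three forward frames.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first ATG opens the ORF
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield (start, i + 3)
                start = None                   # look for the next ATG

# 3 of the 64 codons are stops, so random sequence gives about one stop per 64/3 codons.
print("expected codons between random stops:", round(64 / 3, 1))
print(list(find_orfs("CCATGAAACCCGGGTAGCC", min_codons=3)))
```

Raising `min_codons` trades sensitivity for specificity, which is exactly why short genes are missed by this naive approach.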

Prediction of Translation Starts
Translation start: ATG
How to predict a translation start:
Collect a set of experimentally validated translation starts with flanking regions and align them
…. ATG ……
GCCATGGCGA
ACGATGCTGT
GACATGGTAC
AGGATGGGCT
GCGATGTGGC

Prediction of Translation Starts
Certain nucleotides prefer to be in certain positions around the start “ATG” and other nucleotides prefer not to be there
The “biased” nucleotide distribution is information! It is a basis for translation start prediction
Question: which one is more probable to be a translation start?
CACC ATG GC
TCGA ATG TT
(Table: frequencies of A, C, T, G at each position flanking ATG)

Prediction of Translation Starts
Mathematical model: $F_i(X)$ is the frequency of X (A, C, G, T) in position i
Score a string by $\sum_i \log\left(F_i(X)/0.25\right)$
CACC ATG GC: log (58/25) + log (49/25) + log (40/25) + log (50/25) + log (43/25) + log (39/25) = 1.60
TCGA ATG TT: log (6/25) + log (6/25) + log (15/25) + log (7/25) + log (13/25) + log (14/25) < 0 (every term is negative)
The model captures our intuition!
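
The two sums on this slide can be checked numerically. The sketch below assumes base-10 logarithms (which matches the 1.60 reported for the first candidate up to rounding) and takes the per-position frequencies directly from the fractions written on the slide; the function name `log_odds_score` is hypothetical.

```python
import math

BACKGROUND = 0.25  # uniform background frequency used in the slide's formula

def log_odds_score(frequencies, base=10):
    """Sum of log(F_i(X) / 0.25) over the flanking positions."""
    return sum(math.log(f / BACKGROUND, base) for f in frequencies)

# Frequencies of the observed nucleotide at each of the six flanking positions,
# read off the slide's worked example (58/25 means F = 0.58, and so on).
cacc_atg_gc = [0.58, 0.49, 0.40, 0.50, 0.43, 0.39]
tcga_atg_tt = [0.06, 0.06, 0.15, 0.07, 0.13, 0.14]

print("CACC ATG GC:", round(log_odds_score(cacc_atg_gc), 2))
print("TCGA ATG TT:", round(log_odds_score(tcga_atg_tt), 2))
```

A positive total means the flanking nucleotides are, on balance, more common around known starts than the 0.25 background; a negative total means the opposite.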

Prediction of Translation Starts
Build a mathematical model, based on collected translation start sequences
For each candidate translation start sequence, apply the model and get a score
If the score is larger than zero, predict it is a “translation start”; the higher the score, the higher the probability the prediction is true

Step 2: Identify Exon Boundaries

Prediction of Splice Junction Sites
A start exon starts with a translation start and ends with a donor site
An internal exon starts with an acceptor site and ends with a donor site
A terminal exon starts with an acceptor site and ends with a stop codon
Accurate prediction of exons/genes requires accurate prediction of splice junctions
(Figure: { translation start, acceptor site } | exon | { translation stop, donor site })

Prediction of Splice Junction Sites
Splice junctions:
–donor site: coding region | GT
–acceptor: AG | coding region
Like translation starts, the flanks of splice junctions (acceptors and donors) show “biased” distributions of nucleotides in certain positions
These biased distributions of nucleotides are the basis for prediction of splice junctions

Prediction of Acceptor Sites
Nucleotide distribution in the flanks of acceptors
Multiple positions have high “information content”
Information content of a position: $\sum_X F(X)\log\left(F(X)/0.25\right)$, summed over X in {A, C, G, T}
If every nucleotide has 0.25 frequency in a position, then the position’s information content is ZERO.
Use “information content” as a criterion for determining the length of flanks
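
Per-position information content can be computed as below. This sketch assumes the standard relative-entropy form (each log ratio weighted by its frequency) against a uniform 0.25 background, measured in bits; that reading of the slide's formula is an assumption, and the function name is hypothetical.

```python
import math

def information_content(freqs, background=0.25):
    """Relative entropy (in bits) of one position against a uniform background.

    freqs maps each nucleotide to its frequency at that position.
    Zero frequencies contribute nothing (the limit of f*log(f) as f -> 0).
    """
    return sum(f * math.log2(f / background) for f in freqs.values() if f > 0)

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
invariant = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}   # e.g. the A of an AG site

print(information_content(uniform))    # no preference at this position
print(information_content(invariant))  # fully conserved position
```

Flank positions whose information content stays near zero add little to the model, which is one way to decide where the flanks should stop.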

Prediction of Acceptor Sites
Mathematical model: $F_i(X)$ is the frequency of X (A, C, G, T) in position i
Score a segment as a candidate acceptor site by $\sum_i \log\left(F_i(X)/0.25\right)$
For each candidate acceptor sequence, apply the model and get a score
If the score is larger than zero, predict it is an “acceptor”; the higher the score, the higher the probability the prediction is true

Prediction of Donor Sites
Nucleotide distribution in the flanks of donors
Mathematical model: $F_i(X)$ is the frequency of X (A, C, G, T) in position i
Score a segment as a possible donor site by $\sum_i \log\left(F_i(X)/0.25\right)$

Prediction of Donor Sites
For each candidate donor sequence, apply the model and get a score
If the score is larger than zero, predict it is a “donor”; the higher the score, the higher the probability the prediction is true

Prediction of Donors/Acceptors
Position specific weight matrix model
Build a “position specific weight matrix model”
–Collect known {donor, acceptor} sequences and align them so that the GT or AG are aligned
–Calculate the percentage of each type of nucleotide at each position
There are more sophisticated models for capturing higher order relationships between positions
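
Building such a matrix from aligned sequences is a column-by-column count. The sketch below uses the five aligned translation-start sequences shown earlier in the deck purely as sample input (the same procedure applies to aligned donors or acceptors); `build_pwm` is a hypothetical name.

```python
from collections import Counter

def build_pwm(aligned):
    """Return one {nucleotide: frequency} dict per aligned column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must all have the same aligned length")
    pwm = []
    for column in zip(*aligned):          # iterate over alignment columns
        counts = Counter(column)
        pwm.append({nt: counts[nt] / len(aligned) for nt in "ACGT"})
    return pwm

# Sample alignment taken from the translation-start slide (ATG at columns 3-5).
starts = ["GCCATGGCGA", "ACGATGCTGT", "GACATGGTAC", "AGGATGGGCT", "GCGATGTGGC"]
pwm = build_pwm(starts)
for position, freqs in enumerate(pwm):
    print(position, freqs)
```

With so few sequences many entries are exactly zero, and log(0) is undefined in the scoring formula, so a real training set would need many more examples or some form of pseudocount (an assumption on my part; the slides do not discuss smoothing).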

Prediction of Exons
For each ORF, find all donor and acceptor candidates by finding GT and YAG motifs
Score each donor and acceptor candidate using our position-specific weight matrix models
Find all pairs of (acceptor, donor) above some thresholds
Score the coding potential of the segment [acceptor, donor], using the hexamer model
(Figure: CAG | exon | GT)
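
The pairing step can be sketched as a simple enumeration of (acceptor, donor) pairs. This simplified version matches plain AG rather than YAG, applies no score thresholds, and does not check for in-frame stop codons; `candidate_exons` is a hypothetical name.

```python
def candidate_exons(dna: str, min_len: int = 1):
    """Yield (start, end) slices for segments preceded by AG and followed by GT.

    dna[start:end] is the candidate internal exon: it begins right after an
    AG (acceptor) and ends right before a GT (donor).
    """
    acceptors = [i + 2 for i in range(len(dna) - 1) if dna[i:i + 2] == "AG"]
    donors = [i for i in range(len(dna) - 1) if dna[i:i + 2] == "GT"]
    for a in acceptors:
        for d in donors:
            if d - a >= min_len:
                yield (a, d)

toy = "TTCAGGCTGAAGTAA"
for start, end in candidate_exons(toy):
    print(start, end, toy[start:end])
```

Because GT and AG are so common, the number of pairs grows quickly with sequence length, which is why the weight matrix scores and thresholds are applied before the coding potential is evaluated.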

Step 3: Classify Candidate Exons

Testing Exons: Codon Usage
Create a 64-element hash table and count the frequencies of codons in a candidate exon
Amino acids typically have more than one codon, but in nature certain codons are used more often
Uneven use of the codons may characterize a real gene
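
The 64-element table can be built with a dictionary keyed by codon. This is a minimal sketch; the function name `codon_frequencies` and the `frame` parameter are assumptions for illustration.

```python
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # 64 entries

def codon_frequencies(exon: str, frame: int = 0) -> dict:
    """Return the relative frequency of each of the 64 codons in one frame."""
    counts = Counter(exon[i:i + 3] for i in range(frame, len(exon) - 2, 3))
    total = sum(counts[c] for c in ALL_CODONS)   # ignores codons with non-ACGT letters
    return {c: (counts[c] / total if total else 0.0) for c in ALL_CODONS}

freqs = codon_frequencies("ATGGCCGCCTAA")
print(len(freqs), freqs["GCC"], freqs["ATG"])
```

Comparing such a table against the known codon usage of the organism is what turns raw counts into evidence for or against a candidate exon.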

Codon Usage and Likelihood Ratio
An ORF is more “believable” than another if it has more “likely” codons
Do sliding window calculations to find ORFs that have the “likely” codon usage
Allows for higher precision in identifying true ORFs; much better than merely testing for length
However, the average vertebrate exon length is 130 nucleotides, which is often too short to produce reliable peaks in the likelihood ratio
Further improvement: in-frame hexamer count (frequencies of pairs of consecutive codons)
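
A sliding-window likelihood ratio can be sketched as follows. The codon tables here are toy numbers invented for the example (two "favored" codons in coding sequence, a uniform background), not real codon usage, and the function name, window size and step are all assumptions. The window is assumed to be a multiple of 3 and the sequence to contain only A, C, G, T.

```python
import math
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]

def window_log_ratios(dna, coding_freq, background_freq, window=30, step=3):
    """Return (start, log-likelihood ratio) for each in-frame sliding window."""
    scores = []
    for start in range(0, len(dna) - window + 1, step):
        total = 0.0
        for i in range(start, start + window, 3):
            codon = dna[i:i + 3]
            total += math.log(coding_freq[codon] / background_freq[codon])
        scores.append((start, total))
    return scores

# Toy tables (hypothetical): GCC and CTG are four times as likely in coding sequence.
favored = {"GCC", "CTG"}
weights = {c: (4.0 if c in favored else 1.0) for c in CODONS}
norm = sum(weights.values())
coding_freq = {c: w / norm for c, w in weights.items()}
background_freq = {c: 1 / 64 for c in CODONS}

scores = window_log_ratios("GCC" * 10 + "TTT" * 10, coding_freq, background_freq)
for start, score in scores:
    print(start, round(score, 2))
```

The score falls steadily as the window slides from the GCC-rich half into the TTT-rich half, which is the kind of peak-and-trough profile the method looks for. With a 130-nucleotide exon only about 43 codons contribute, so such profiles are noisy in practice.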

Codon Usage in Mouse Genome
(Table columns: AA, codon, /1000, frac)
Ser: TCG, TCA, TCT, TCC, AGT, AGC
Pro: CCG, CCA, CCT, CCC
Leu: CTG, CTA, CTT, CTC
Ala: GCG, GCA, GCT, GCC
Gln: CAG, CAA

Exon Prediction Method 1: TestCode

TestCode
Statistical test described by James Fickett in 1982: tendency for nucleotides in coding regions to be repeated with a periodicity of 3
–Judges randomness instead of codon frequency
–Finds “putative” coding regions, not introns, exons, or splice sites
TestCode finds ORFs/exons based on compositional bias with a periodicity of three

TestCode Statistics
Define a window size of no less than 200 bp and slide the window down the sequence 3 bases at a time. In each window:
–Calculate for each base {A, T, G, C}: $\max(n_{3k+1}, n_{3k+2}, n_{3k}) / \min(n_{3k+1}, n_{3k+2}, n_{3k})$, where $n_{3k+j}$ is the number of occurrences of that base at positions of the form 3k+j
–Use these values to obtain a probability from a lookup table (which was previously defined and determined experimentally with known coding and noncoding sequences)
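
The per-window position statistic can be sketched as below. Two things are assumptions here: the `+ 1` in the denominator is added to avoid division by zero (the slide writes plain max/min), and the lookup table that converts these ratios into a probability is not reproduced, since its values are not given in the slides. The function name is hypothetical.

```python
def position_asymmetry(window: str) -> dict:
    """For each base, compare its counts at the three codon positions.

    Returns max(counts) / (min(counts) + 1) per base; the +1 is a guard
    against division by zero and is an assumption of this sketch.
    """
    ratios = {}
    for base in "ACGT":
        counts = [0, 0, 0]
        for i, nt in enumerate(window):
            if nt == base:
                counts[i % 3] += 1         # tally by position modulo 3
        ratios[base] = max(counts) / (min(counts) + 1)
    return ratios

print(position_asymmetry("ATG" * 10))   # strongly periodic: every base sits in one position
print(position_asymmetry("ACGT" * 3))   # each base is spread evenly over the three positions
```

Large ratios indicate that a base favors one codon position, which is the period-3 compositional bias the test relies on.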

TestCode Statistics (cont’d)
Probabilities can be classified as indicative of “coding” or “noncoding” regions, or “no opinion” when it is unclear what level of randomization tolerance a sequence carries
The resulting sequence of probabilities can be plotted

TestCode Sample Output
(Figure: plot with regions labeled Coding, No opinion, Non-coding)

Exon Prediction Method 2: Likelihood Ratio/Supervised Learning

Prediction of Coding Regions
Build a coding-region predictor
–Collect coding and non-coding sequences from GenBank (NCBI)
–Calculate hexamer frequencies in coding and noncoding sequences (hexamer = 6-mer, which encodes a dimer of amino acids)
Application: coding-region prediction
–Consider both strands: forward and reverse
–For each strand, identify all the ORFs
–For a prokaryotic genome, for each possible translation start ATG, evaluate the coding potential of [ATG, STOP]
–For a eukaryotic genome, for the segment defined by each pair of possible start/acceptor & donor/stop, evaluate the coding potential
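
Training and applying a hexamer model might look like the sketch below. The training "sequences" are tiny toy strings, the add-one pseudocount is an assumption introduced to avoid zero frequencies, and both function names are hypothetical.

```python
import math
from collections import Counter

def hexamer_counts(seqs, step=3):
    """Count hexamers sampled every `step` bases (step=3 keeps them in frame)."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    return counts

def coding_potential(segment, coding, noncoding, pseudo=1.0):
    """Sum of log(P_coding / P_noncoding) over in-frame hexamers of a segment."""
    n_coding = sum(coding.values()) + pseudo * 4096      # 4**6 possible hexamers
    n_noncoding = sum(noncoding.values()) + pseudo * 4096
    score = 0.0
    for i in range(0, len(segment) - 5, 3):
        h = segment[i:i + 6]
        p_c = (coding[h] + pseudo) / n_coding
        p_n = (noncoding[h] + pseudo) / n_noncoding
        score += math.log(p_c / p_n)
    return score

# Toy training sets (hypothetical); real ones would come from annotated sequences.
coding = hexamer_counts(["GCCGCCGCCGCC"])
noncoding = hexamer_counts(["ATATATATATAT"])
print(coding_potential("GCCGCC", coding, noncoding))
print(coding_potential("ATATAT", coding, noncoding))
```

A positive score means the segment's hexamers look more like the coding training set than the noncoding one; the same function can be applied to every [ATG, STOP] or [acceptor, donor] segment.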

Prediction of Exons
For each segment [acceptor, donor], we get three scores (coding potential, donor score, acceptor score)
Various possibilities:
–all three scores are high: probably a true exon
–all three scores are low: probably not a real exon
–all in the middle -- ??
–some scores are high and some are low -- ??
What are the rules for exon prediction?

Prediction of Exons
Learning to classify exons from non-exons
–Collect a set of exons and non-exons
–Score them using our scoring schemes
–Plot them (figure legend: coding, noncoding)
–“Draw” a separating line between exons and non-exons
Make a prediction based on which side of the separating line a new point falls
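
One simple way to "draw" such a separating line is a perceptron. The slides do not say which learning algorithm is used, so the perceptron is a stand-in chosen for brevity, and the score triples below are invented toy data in the order (coding potential, donor score, acceptor score).

```python
def train_perceptron(points, labels, epochs=100, lr=0.1):
    """Learn weights w and bias b so that sign(w.x + b) matches labels (+1/-1)."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:                      # misclassified: nudge the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 (exon) or -1 (non-exon) depending on the side of the boundary."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical training data: three exons followed by three non-exons.
points = [(2.0, 1.5, 1.0), (1.5, 2.0, 1.8), (1.0, 1.2, 2.2),
          (-1.0, -0.5, -1.5), (-2.0, -1.0, -0.3), (-0.5, -1.8, -1.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_perceptron(points, labels)
print("weights:", w, "bias:", b)
print("new candidate:", predict(w, b, (1.8, 1.6, 1.4)))
```

The learned weights also address the "some high, some low" cases from the previous slide: the weighted sum decides how much each score is allowed to compensate for the others.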

Prediction of Exons
A “classifier” can be trained to separate exons from non-exons, based on the three scores
Closer to reality: other factors could also help to distinguish exons from non-exons
–exon length distribution
–coding density is different in regions with different G+C contents
(Figure labels: 150 bp; 50% G+C)
Practical gene finding software may use many features to distinguish exons from non-exons

Prediction of Exons
Each box represents a predicted exon
A true exon typically has more than one predicted candidate, and these candidates overlap with each other

Gene Prediction in a New Genome
Dicodon (hexamer) frequencies differ from genome to genome, so a gene finder trained for one genome cannot be directly applied to another genome
(Figure: Shewanella vs. bovine)

Popular Gene Prediction Algorithms
GENSCAN: uses Hidden Markov Models (HMMs)
TWINSCAN
–Uses both HMMs and similarity (e.g., between human and mouse genomes)
HMMs will be covered later in the course

What You Should Know
What an open reading frame (ORF) is
What a splicing signal is
How the likelihood ratio method works