Computational Gene Finding

Computational Gene Finding
Sanja Rogic CS Department UBC Jan 23, 2003 Computational Gene Finding

What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find out: Which region codes for a protein? Which DNA strand is used to encode the gene? Which reading frame is used in that strand? Where does the gene starts and ends? Where are the exon-intron boundaries in eukaryotes? (optionally) Where are the regulatory sequences for that gene? Jan 23, 2003 Computational Gene Finding

Prokaryotic Vs. Eukaryotic Gene Finding
Prokaryotes: small genomes 0.5 – 10·106 bp high coding density (>90%) no introns Gene identification relatively easy, with success rate ~ 99% Problems: overlapping ORFs short genes finding TSS and promoters Eukaryotes: large genomes 107 – 1010 bp low coding density (<50%) intron/exon structure Gene identification a complex problem, gene level accuracy ~50% Problems: many Jan 23, 2003 Computational Gene Finding

Gene Structure Jan 23, 2003 Computational Gene Finding

Gene Finding: Different Approaches
Similarity-based methods (extrinsic) - use similarity to annotated sequences: proteins cDNAs ESTs Comparative genomics - Aligning genomic sequences from different species Ab initio gene-finding (intrinsic) Integrated approaches Jan 23, 2003 Computational Gene Finding

Similarity-based methods
Based on sequence conservation due to functional constraints Use local alignment tools (Smith-Waterman algo, BLAST, FASTA) to search protein, cDNA, and EST databases Will not identify genes that code for proteins not already in databases (can identify ~50% new genes) Limits of the regions of similarity not well defined -protein: databases SwissProt or PIR///cannot delimit UTRs////not all domains present cDNA///most relevant for determining gene structure/// best if derived from the same organism//// -EST:poor sequence quality///chimeric///contamination with primers///wrong orientation - Jan 23, 2003 Computational Gene Finding

Comparative Genomics Based on the assumption that coding sequences are more conserved than non-coding Two approaches: intra-genomic (gene families) inter-genomic (cross-species) Alignment of homologous regions Difficult to define limits of higher similarity Difficult to find optimal evolutionary distance (pattern of conservation differ between loci) Jan 23, 2003 Computational Gene Finding

Jan 23, 2003 Computational Gene Finding

Summary for Extrinsic Approaches
Strengths: Rely on accumulated pre-existing biological data, thus should produce biologically relevant predictions Weaknesses: Limited to pre-existing biological data Errors in databases Difficult to find limits of similarity Jan 23, 2003 Computational Gene Finding

Ab initio Gene Finding, Part 1
Input: A DNA string over the alphabet {A,C,G,T} Output: An annotation of the string showing for every nucleotide whether it is coding or non-coding AAAGCATGCATTTAACGAGTGCATCAGGACTCCATACGTAATGCCG Gene finder AAAGC ATG CAT TTA ACG A GT GCATC AG GA CTC CAT ACG TAA TGCCG Jan 23, 2003 Computational Gene Finding

Ab initio Gene Finding, Part 2
Using only sequence information Identifying only coding exons of protein-coding genes (transcription start site, 5’ and 3’ UTRs are ignored) Integrates coding statistics with signal detection Jan 23, 2003 Computational Gene Finding

Coding Statistics, Part 1
Unequal usage of codons in the coding regions is a universal feature of the genomes uneven usage of amino acids in existing proteins uneven usage of synonymous codons (correlates with the abundance of corresponding tRNAs) We can use this feature to differentiate between coding and non-coding regions of the genome Coding statistics - a function that for a given DNA sequence computes a likelihood that the sequence is coding for a protein Jan 23, 2003 Computational Gene Finding

Coding Statistics, Part 2
Many different ones codon usage hexamer usage GC content compositional bias between codon positions nucleotide periodicity … Hexamer usage shown to be most discriminative and majority of current algos are using it Jan 23, 2003 Computational Gene Finding

An Example of Coding Statistics, Part 1
For each codon the table displays the frequency of usage of each codon (per thousand) in human (first column) Relative frequency of each codon among synonymous codons (second column) Jan 23, 2003 Computational Gene Finding

Let F(c) be the frequency (probability) of codon c in the genes of the species under consideration Given the sequence of codons C=c1c2…cm and assuming independence between adjacent codons: P(C)=F(c1)F(c2)…F(cm) is probability of finding C, knowing that C codes for protein Example: S=AGGACC c1=AGG c2= ACC P(S) = F(AGG)·F(ACC) = · 0.038= Jan 23, 2003 Computational Gene Finding

Let F0(c) be the frequency of codon c in a non-coding sequence. P0 (C)=F0(c1)F0(c2)…F0(cm) is the probability of finding C, knowing that C is non-coding Assuming the random model of non-coding DNA, F0(c) = 1/64= for all codons P0 (S) = · = The log-likelihood (LP) ratio for S is: LP(S) = log( / ) = log(3.43) = 0.53 LP(S) > S is coding Jan 23, 2003 Computational Gene Finding

Computing Coding Statistics in Practice
Usually, the value of coding statistics is computed using sliding windows coding profile of the sequence Larger windows are required to detect a clear signal (50 – 200 bp) Sliding window = successive overlapping windows Small exons might be missed Jan 23, 2003 Computational Gene Finding

Coding Profile of ß-globin gene
Window size 120 Distance between overlapping windows 10 LP computed for all three reading frame Jan 23, 2003 Computational Gene Finding

Signal Sensors, Part 1 Signal – a string of DNA recognized by the cellular machinery Jan 23, 2003 Computational Gene Finding

Signal Sensors, Part 2 Various pattern recognition method are used for identification of these signals: consensus sequences weight matrices weight arrays decision trees Hidden Markov Models (HMMs) neural networks … Jan 23, 2003 Computational Gene Finding

Example of Consensus Sequence
obtained by choosing the most frequent base at each position of the multiple alignment of subsequences of interest TACGAT TATAAT GATACT TATGAT TATGTT consensus sequence consensus (IUPAC) Leads to loss of information and can produce many false positive or false negative predictions TATAAT MELON MANGO HONEY SWEET COOKY IUPAC –set of symbols encoding each subset of four nucleotides R – purine N- any TATRNT MONEY Jan 23, 2003 Computational Gene Finding

Example of (Positional) Weight Matrix
Computed by measuring the frequency of every element of every position of the site (weight) Score for any putative site is the sum of the matrix values (converted in probabilities) for that sequence (log-likelihood score) Disadvantages: cut-off value required assumes independence between adjacent bases TACGAT TATAAT GATACT TATGAT TATGTT 1 2 3 4 5 6 A C G T Jan 23, 2003 Computational Gene Finding

Example of Decision Tree
- model that captures most significant dependencies between the nucleotides H is A, C or U Jan 23, 2003 Computational Gene Finding

Markov Models Collection of states that have transitional probabilities between them The future state depends only on the present state and not on the past ones Jan 23, 2003 Computational Gene Finding

Ingredients of a Markov Model
Collection of states {S1, S2, …,SN} State transition probabilities (transition matrix) Aij = P(qt+1 = Si | qt = Sj) Initial state distribution i = P(q1 = Si) Jan 23, 2003 Computational Gene Finding

Ingredients of Our Markov Model
Collection of states {Ssunny, Srainy, Ssnowy} State transition probabilities (transition matrix) A = Initial state distribution i = ( ) .08 .15 .05 .38 .6 .02 .75 .2 Jan 23, 2003 Computational Gene Finding

Probability of a Sequence of Events
P(Ssunny) x P(Srainy | Ssunny) x P(Srainy | Srainy) x P(Srainy | Srainy) x P(Ssnowy | Srainy) x P(Ssnowy | Ssnowy) = 0.7 x 0.15 x 0.6 x 0.6 x 0.02 x 0.2 = This is a classification problem Jan 23, 2003 Computational Gene Finding

Hidden Markov Models Jan 23, 2003 Computational Gene Finding

Ingredients of a HMM Collection of states: {S1, S2,…,SN} State transition probabilities (transition matrix) Aij = P(qt+1 = Si | qt = Sj) Initial state distribution i = P(q1 = Si) Observations: {O1, O2,…,OM} Observation probabilities: Bj(k) = P(vt = Ok | qt = Sj) Jan 23, 2003 Computational Gene Finding

Ingredients of Our HMM States: {Ssunny, Srainy, Ssnowy} State transition probabilities (transition matrix) A = Initial state distribution i = ( ) Observations: {O1, O2,…,OM} Observation probabilities (emission matrix): B = .08 .15 .05 .38 .6 .02 .75 .2 .08 .15 .05 .38 .6 .02 .75 .2 Jan 23, 2003 Computational Gene Finding

Probability of a Sequence of Events
P(O) = P(Ogloves, Ogloves, Oumbrella,…, Oumbrella) =  P(O | Q)P(Q) =  P(O | q1,…,q7) = 0.7 x 0.86 x 0.32 x 0.14 x … all Q q1,…q7 Jan 23, 2003 Computational Gene Finding

Typical HMM Problems Annotation Given a model M and an observed string S, what is the most probable path through M generating S Classification Given a model M and an observed string S, what is the total probability of S under M Consensus Given a model M, what is the string having the highest probability under M Training Given a set of strings and a model structure, find transition and emission probabilities assigning high probabilities to the strings Annotation – gene finding Classification – signal sensor, content sensor Annotation and classification problems can be solved efficiently using dynamic programming (Viterbi) Jan 23, 2003 Computational Gene Finding

HMMs and Gene Structure
Nucleotides {A,C,G,T} are the observables Different states generates generate nucleotides at different frequencies A simple HMM for unspliced genes: AAAGC ATG CAT TTA ACG AGA GCA CAA GGG CTC TAA TGCCG The sequence of states is an annotation of the generated string – each nucleotide is generated in intergenic, start/stop, coding state This HMM has 4 states: x- non-coding, c- coding, start and stop Jan 23, 2003 Computational Gene Finding

State of the Art in Ab initio Gene Finding
Coding statistics and signal sensors are integrated in overall gene model using machine learning techniques (HMMs, decision trees, neural networks) discriminant functions (linear, quadratic) Use dynamic programming (Viterbi) to find the highest scoring path through the model Capable of predicting: genes on both strands simultaneously partial and multiple genes in a sequence suboptimal exons Number of potential gene models grows exponentially with the number of predicted exons Jan 23, 2003 Computational Gene Finding

Examples of Gene Finders
FGENES – linear DF for content and signal sensors and DP for finding optimal combination of exons GeneMark – HMMs enhanced with ribosomal binding site recognition Genie – neural networks for splicing, HMMs for coding sensors, overall structure modeled by HMM Genscan – WM, WA and decision trees as signal sensors, HMMs for content sensors, overall HMM HMMgene – HMM trained using conditional maximum likelihood Morgan – decision trees for exon classification, also Markov Models MZEF – quadratic DF, predict only internal exons Jan 23, 2003 Computational Gene Finding

Genscan Example Developed by Chris Burge 1997 One of the most accurate ab initio programs Uses explicit state duration HMM to model gene structure (different length distributions for exons) Different model parameters for regions with different GC content Also generalized Jan 23, 2003 Computational Gene Finding

Einit Eterm 3’ UTR 5’ UTR Esngl polyA P forward strand N backward strand polyA P 5’ UTR Esngl 3’ UTR Einit Eterm I0 I1 I2 E0 E1 E2 Jan 23, 2003 Computational Gene Finding

Genscan’s Architecture
HMM’s states for exons and introns in three different phases, single exon, 5’ and 3’ UTRs, promoter region and polyA site and intergenic region Explicit length modeling HMMs for exons, introns and intergenic regions WM and WA for acceptor site, branch point, polyA site and promoter region Decision tree (maximal dependence decomposition) for donor sites Jan 23, 2003 Computational Gene Finding

Ab initio Gene Finding is Difficult
Genes are separated by large intergenic regions Genes are not continuous, but split in a number of (small) coding exons, separated by (larger) non-coding introns in humans coding sequence comprise only a few percents of the genome and an average of 5% of each gene Sequence signals that are essential for elucidation of a gene structure are degenerate and highly unspecific Alternative splicing Repeat elements (>50% in humans) – some contain coding regions It is almost impossible to distinguish between signals that are truly processed by the cell from those that are apparently non-functional Jan 23, 2003 Computational Gene Finding

Problems with Ab initio Gene Finding
No biological evidence In long genomic sequences many false positive predictions Prediction accuracy high, but not sufficient Jan 23, 2003 Computational Gene Finding

Evaluation of Gene Finding Programs
Calculating accuracy of programs’ predictions Several evaluation studies: Burset and Guigó, 1996 (vertebrate sequences) Pavy et al., 1999 (Arabidopsis thaliana) Rogic et al., 2001 (mammalian sequences) Jan 23, 2003 Computational Gene Finding

Measures of Prediction Accuracy, Part 1
Nucleotide level accuracy Sensitivity = Specificity = TN FP FN TP REALITY PREDICTION number of correct exons number of actual exons number of correct exons number of predicted exons Jan 23, 2003 Computational Gene Finding

Measures of Prediction Accuracy, Part 2
Exon level accuracy WRONG EXON CORRECT EXON MISSING EXON REALITY PREDICTION Jan 23, 2003 Computational Gene Finding

Evaluation in Rogic et al., 2001
We developed a new dataset for this purposes – HMR195 Characteristics of the sequences: human – mouse – rat origin Relatively short DNA sequences from GenBank one gene per sequence sequences used for the training of the programs were excluded Jan 23, 2003 Computational Gene Finding

More About HMR195 Filtering: Canonical start and stop codon No in-frame stop codons Canonical splice site dinucleotides AG – GT Redundancy filtering: similar sequences excluded Confirming exon locations using mRNA alignment Jan 23, 2003 Computational Gene Finding

Evaluation Results Jan 23, 2003 Computational Gene Finding

Additional Testing Accuracy as a function of sequence and prediction characteristics: GC content exon length exon type exon type and signal exon probabilities and scores phylogenetic specificity Jan 23, 2003 Computational Gene Finding

Integrated Approaches for Gene Finding
Programs that integrate results of similarity searches with ab initio techniques (GenomeScan, FGENESH+, Procrustes) Programs that use synteny between organisms (ROSETTA, SLAM) Integration of programs predicting different elements of a gene (EuGène) Combining predictions from several gene finding programs (combination of experts) Jan 23, 2003 Computational Gene Finding

Combining Programs’ Predictions
Set of methods used and they way they are integrated differs between individual programs Different programs often predict different elements of an actual gene they could complement each other yielding better prediction Jan 23, 2003 Computational Gene Finding

Related Work This approach was suggested by several authors Burset and Guigó (1996) Investigated correlation between 9 gene-finding programs 99% of exons predicted by all programs were correct 1% of exons completely missed by all programs Murakami and Tagaki (1998) Five methods for combining the prediction by 4 gene-finding programs Nucleotide level accuracy measures improved by 3-5% in comparison with the best single Jan 23, 2003 Computational Gene Finding

AND and OR Methods exon 1 exon 2 union intersection Jan 23, 2003 Computational Gene Finding

Combining Genscan and HMMgene
High prediction accuracy as well as reliability of their exon probability made them the best candidates for our study Genscan predicted 77% of exons correctly, HMMgene 75%, both 87% Genscan 111 624 91 HMMgene Jan 23, 2003 Computational Gene Finding

Control Datasets Burset/Guigó dataset – 570 vertebrate genomic sequences containing exactly one multi-exon gene Multi-gene dataset – 22 human/murine sequences containing more than one gene Jan 23, 2003 Computational Gene Finding

EUI Method (exon union – intersection)
Union of exons with p  0.75 Intersection of exons with p < 0.75 Rule for initial exon Jan 23, 2003 Computational Gene Finding

GI Method (gene intersection)
Intersection of genes Apply EUI method to exons completely belonging to GI genes Jan 23, 2003 Computational Gene Finding

EUI_frame Method (EUI with reading frame consistency)
Assign probabilities to GI genes. Determine position of acceptor and donor site in a reading frame. GI gene with higher probability imposes the reading frame. Choose only EUI exons contained in GI genes that are in a chosen reading frame. Jan 23, 2003 Computational Gene Finding

Results – HMR dataset Sp increased 3.2%, ESn increased 2.6%, ESp increased 11.7% Number of wrong exons significantly decreased Jan 23, 2003 Computational Gene Finding

Results – Burset/Guigó dataset
Similar to HMR195 Jan 23, 2003 Computational Gene Finding

Results – Drosophila Adh region
Sp increased 21%, ESn increased 6.8%, ESp increased 32.5% Number of wrong exons decreased several-fold Jan 23, 2003 Computational Gene Finding

Why Does It Work? Jan 23, 2003 Computational Gene Finding

Thank you Jan 23, 2003 Computational Gene Finding

Computational Gene Finding

Similar presentations

Presentation on theme: "Computational Gene Finding"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Computational Gene Finding

Similar presentations

Presentation on theme: "Computational Gene Finding"— Presentation transcript:

Similar presentations

About project

Feedback