Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation
Outline: Gene finding using HMMs; Adding trees to HMMs (phyloHMM, N-SCAN); BLAST + Gene Finding (SGP-2); Examples
Markov Sequence Models: Key: distinguish coding/non-coding statistics. Popular models: 6-mers (5th-order Markov model); homogeneous/non-homogeneous (reading-frame specific). Not sensitive enough for eukaryote genes: exons too short, poor detection of splice junctions.
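The coding/non-coding distinction above can be sketched as a log-odds score between two Markov models of the same order. This is a minimal illustration of the homogeneous case (not reading-frame specific); the function names `train_markov` and `log_odds` and the add-one pseudocount are assumptions made for the example, not part of the recitation.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=5, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) with pseudocount smoothing."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def prob(context, base):
        c = counts.get(context, {})  # unseen context falls back to uniform
        total = sum(c.values()) + 4 * pseudocount
        return (c.get(base, 0.0) + pseudocount) / total

    return prob

def log_odds(seq, coding, noncoding, order=5):
    """Sum of log(P_coding / P_noncoding) over positions with a full context."""
    return sum(
        math.log(coding(seq[i - order:i], seq[i]) / noncoding(seq[i - order:i], seq[i]))
        for i in range(order, len(seq))
    )
```

A positive score means the window looks more like the coding training set. A non-homogeneous variant would keep three such models, one per codon position.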
CG © Ron Shamir, Length Distribution: Simple HMMs can only encode geometric length distributions: the length of each exon (intron) is geometrically distributed. [Figure: two-state HMM with states exon and intron; transition labels p, q, 1-p, 1-q]
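To see why a simple HMM forces a geometric length distribution, here is a small sketch. It assumes that p in the figure is the probability of staying in the exon state (the self-loop) and 1-p the probability of leaving it; the figure itself did not survive, so this labelling is an assumption, as are the function names.

```python
import random

def geometric_pmf(k, p_stay):
    """P(state is occupied for exactly k positions), self-loop probability p_stay."""
    return p_stay ** (k - 1) * (1 - p_stay)

def sample_duration(p_stay, rng):
    """Simulate the self-loop: stay with probability p_stay, otherwise leave."""
    k = 1
    while rng.random() < p_stay:
        k += 1
    return k

def mean_duration(p_stay):
    """Expected number of positions spent in the state."""
    return 1.0 / (1.0 - p_stay)
```

The probability mass is always highest at length 1 and decays monotonically, so no choice of p can produce a distribution that peaks at a typical exon length.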
Exon Length Distribution: The length distribution of introns is ≈ geometric. For exons, it isn’t: it is also affected by splicing itself. Too short (under 50 bp): the spliceosomes have no room. Too long (over 300 bp): the ends have problems finding each other. But as usual there are exceptions. A different model is needed for exons.
Generalized HMM (Burge & Karlin, J. Mol. Bio ): Instead of a single char, each state emits a sequence with some length distribution.
Generalized HMM (Burge & Karlin, J. Mol. Bio ) Overview: hidden Markov states q_1, …, q_n; state q_i has output length distribution f_i; the output of each state can have a separate probabilistic model (weight matrix model, HMM, …); initial state probability distribution; state transition probabilities T_ij.
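The generative view of a generalized HMM can be sketched as follows: pick a state, draw a length from that state's length distribution, emit a segment of that length from the state's own output model, then move on. The toy exon/intron parameters below (uniform exon lengths between 50 and 300, exponential-like intron lengths, i.i.d. base frequencies) are illustrative assumptions, not GenScan's actual models.

```python
import random

def sample_ghmm(init, trans, length_dist, emit, n_segments, rng):
    """Generate (state, segment) pairs from a generalized HMM.

    init: {state: prob}; trans: {state: {next_state: prob}}
    length_dist: {state: callable(rng) -> int}       (the f_i)
    emit: {state: callable(length, rng) -> str}      (per-state output model)
    """
    def draw(dist):
        states, probs = zip(*dist.items())
        return rng.choices(states, weights=probs)[0]

    state = draw(init)
    out = []
    for _ in range(n_segments):
        d = length_dist[state](rng)            # segment length first
        out.append((state, emit[state](d, rng)))  # then the whole segment
        state = draw(trans[state])
    return out

def iid(freqs):
    """Hypothetical output model: independent bases with fixed frequencies."""
    bases, w = zip(*freqs.items())
    return lambda d, rng: "".join(rng.choices(bases, weights=w, k=d))

toy = dict(
    init={"exon": 0.5, "intron": 0.5},
    trans={"exon": {"intron": 1.0}, "intron": {"exon": 1.0}},
    length_dist={"exon": lambda rng: rng.randint(50, 300),
                 "intron": lambda rng: 1 + int(rng.expovariate(1 / 500))},
    emit={"exon": iid({"A": .2, "C": .3, "G": .3, "T": .2}),
          "intron": iid({"A": .3, "C": .2, "G": .2, "T": .3})},
)
segments = sample_ghmm(n_segments=4, rng=random.Random(1), **toy)
```

Because the length is drawn explicitly, the exon length distribution no longer has to be geometric.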
Burge & Karlin JMB 97 GenScan Model
GenScan model: states = functional units on a gene. The allowed transitions ensure the order is biologically consistent. As an intron may cut a codon, one must keep track of the reading frame, hence the three intron phases: I_0: introns that fall between codons; I_1: introns that start after the 1st base of a codon; I_2: introns that start after the 2nd base of a codon.
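The phase bookkeeping can be illustrated with a short sketch. It assumes the exon lengths count coding bases only, starting at the first base of the start codon; the function name `intron_phases` is made up for the example.

```python
def intron_phases(exon_coding_lengths):
    """Phase of the intron following each exon (the last exon has no intron).

    Phase 0: intron falls between codons.
    Phase 1: intron starts after the 1st base of a codon.
    Phase 2: intron starts after the 2nd base of a codon.
    """
    phases, total = [], 0
    for length in exon_coding_lengths[:-1]:
        total += length            # coding bases seen so far
        phases.append(total % 3)   # leftover bases of the interrupted codon
    return phases
```

For example, with exons carrying 100, 50 and 30 coding bases, the first intron is in phase 1 (100 mod 3) and the second in phase 0 (150 mod 3).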
Phylogenetic HMMs (due to Siepel and Haussler): A simple gene-finding HMM looks at a single Markov process, along the sequence: each position is dependent on the previous position. If we incorporate sequences from multiple organisms, we can look at another process, along the tree: each position is dependent on its ancestor.
Phylogenetic HMMs: A simple HMM can be thought of as a machine that generates a sequence: every state emits a single character; multinomial distribution at every state. A phyloHMM generates an MSA: every state emits a single MSA column; phylogenetic model at every state.
Phylogenetic HMMs
Phylogenetic models in phyloHMM: Each defines a stochastic process of substitution. Every position is independent. The following process occurs: (1) a character is assigned to the root; (2) character substitutions occur along the branches, based on a substitution matrix and on the branch lengths; (3) the characters at the leaves of the tree correspond to the MSA column.
Phylogenetic models in phyloHMM: Different models for different states. Different substitution rates, e.g., in exons we’ll see fewer substitutions. Different patterns of substitutions, e.g., third-position bias in coding sequences. Different tree topologies, e.g., following recombination.
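The substitution process described above can be simulated in a few lines. This sketch swaps in the Jukes-Cantor model with a uniform root distribution as the simplest possible substitution model; the recitation does not commit to a particular model, and the tree encoding `(name, branch_length, children)` is an assumption made for the example.

```python
import math
import random

BASES = "ACGT"

def jc_transition(parent, t):
    """Jukes-Cantor P(child base | parent base) over a branch of length t."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    diff = (1.0 - same) / 3.0
    return [same if b == parent else diff for b in BASES]

def simulate_column(tree, rng, parent_base=None):
    """tree = (name, branch_length, children). Returns {leaf_name: base}."""
    name, t, children = tree
    if parent_base is None:
        base = rng.choice(BASES)  # root: drawn from a uniform background
    else:
        base = rng.choices(BASES, weights=jc_transition(parent_base, t))[0]
    if not children:
        return {name: base}       # leaves make up the MSA column
    column = {}
    for child in children:
        column.update(simulate_column(child, rng, base))
    return column
```

Calling `simulate_column` repeatedly with short branches gives mostly conserved columns; longer branches give more diverged ones.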
Formally: S – set of states; Ψ – phylogenetic models (instead of E in a standard HMM); A – state transitions; b – initial probabilities.
Formally, each phylogenetic model consists of: Q – substitution rate matrix (e.g., derived from PAM); Π – background frequencies; τ – the phylogenetic tree; β – branch lengths.
Formally: The probability of a column X_i being emitted by the model ψ_i can be computed efficiently by Felsenstein’s “pruning algorithm” (recitation 6). Joint probability of a path in the HMM and an alignment X. Viterbi, forward-backward etc. – as usual.
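As a reminder of how the column probability is obtained, here is a compact sketch of the pruning recursion. It assumes a Jukes-Cantor substitution model and a uniform background by default, treats gaps as missing data, and encodes the tree as `(name, branch_length, children)`; all of these are choices made for the example rather than anything fixed by the recitation.

```python
import math

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a changing to b over branch length t."""
    same = 0.25 + 0.75 * math.exp(-4.0 * t / 3.0)
    return same if a == b else (1.0 - same) / 3.0

def partial_likelihood(tree, column):
    """like[a] = P(observed leaves below this node | node has base a)."""
    name, _, children = tree
    if not children:
        obs = column[name]
        # A gap or unknown symbol is treated as missing data: any base fits.
        return {a: 1.0 if obs not in BASES or obs == a else 0.0 for a in BASES}
    like = {a: 1.0 for a in BASES}
    for child in children:
        child_like = partial_likelihood(child, column)
        t = child[1]  # length of the branch leading to this child
        for a in BASES:
            like[a] *= sum(jc_prob(a, b, t) * child_like[b] for b in BASES)
    return like

def column_probability(tree, column, background=None):
    """P(column | tree, model): sum over root bases weighted by background."""
    background = background or {a: 0.25 for a in BASES}
    root = partial_likelihood(tree, column)
    return sum(background[a] * root[a] for a in BASES)
```

Each node is visited once, so the cost is linear in the number of species per column.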
Simple phylo-gene-finder: [Figure: HMM whose states include Non-coding and 3rd position] If the parameters are known, Viterbi can be used to find the most probable path, i.e., a segmentation into coding regions.
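The segmentation step is ordinary Viterbi decoding in which the emission term for column i is the phylogenetic column probability under each state's model. The sketch below works in log space and assumes those per-column log probabilities have already been computed; the argument names are hypothetical.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Most probable state path.

    log_init[s], log_trans[s][t]: log start and transition probabilities.
    log_emit[i][s]: log P(column i | phylogenetic model of state s).
    """
    states = list(log_init)
    score = {s: log_init[s] + log_emit[0][s] for s in states}
    back = []
    for col in log_emit[1:]:
        prev, score, ptr = score, {}, {}
        for t in states:
            # Best predecessor of state t at this column.
            best = max(states, key=lambda s: prev[s] + log_trans[s][t])
            score[t] = prev[best] + log_trans[best][t] + col[t]
            ptr[t] = best
        back.append(ptr)
    last = max(states, key=score.get)
    path = [last]
    for ptr in reversed(back):  # trace the back-pointers from the end
        path.append(ptr[path[-1]])
    return path[::-1]
```

Runs of a coding state in the returned path are the predicted coding regions.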
Phylo-gene-finder is a good idea: Use of phylogeny is important: it imposes structure on the substitutions and weights different pairs differently based on the evolutionary distance.
N-SCAN: Another phylogeny-HMM gene finder. A GHMM that emits MSA columns. Annotates one sequence at a time: the target sequence. Distinguishes between a target sequence (T) and other informant sequences (Is) that may contain gaps. States correspond to sequence types in the target sequence.
N-SCAN: Bayesian network instead of a simple evolutionary model. Accounts for: 5’ UTRs; conserved non-coding regions (highly conserved, no “coding” features).
Transforming the phylogenetic tree into a BN
SGP-2: Drawback of the described approaches: they require a meaningful alignment. This is impossible if one of the genomes is not yet finished, and an alignment is not necessarily “correct”.
SGP-2: A framework working on two genomes. Idea: use BLAST to identify which positions are more/less conserved, and feed the BLAST scores into the gene-finding HMM. The BLAST results serve to modify the scores of the exons.
SGP-2
SGP-2 is an extension of GENEID. GENEID (single genome): 5th-order HMM; exons scored based on HMM scores + some additional features; a spliced alignment used for assembly of genes from exons.
Scoring exons: Ideally, we would like to score exons based on their alignment score. In this case a BLAST variant is used: the extent of the highest-scoring HSP overlapping the exon is used as a proxy for its “conservation”. Problem: if the genome is not assembled, different exon parts will have HSPs with different genome parts. Solution: consider HSPs covering different parts of the exon and unify them into a single exon score. Final exon score:
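The idea of unifying several HSPs into one exon score can be illustrated with a simple projection scheme. This is a hypothetical scheme for intuition only, not necessarily the formula SGP-2 uses: each exon position takes the per-base score of the best HSP covering it, and the exon's conservation score is the sum over its positions. The function name and the coordinate convention (inclusive) are assumptions.

```python
def exon_conservation(exon, hsps):
    """exon = (start, end); hsps = [(start, end, score)]; coordinates inclusive.

    Each exon position takes the per-base score of the best HSP covering it;
    positions covered by no HSP contribute nothing.
    """
    ex_start, ex_end = exon
    total = 0.0
    for pos in range(ex_start, ex_end + 1):
        covering = [score / (end - start + 1)
                    for start, end, score in hsps if start <= pos <= end]
        if covering:
            total += max(covering)
    return total
```

With this scheme an exon whose two halves hit different contigs still receives credit for both HSPs, which is the situation the slide describes for unassembled genomes.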
BACH1
OLIG2
PPM1A
Summary: Different approaches for gene finding. Adding phylogeny generally helps. But what about genes/exons which are specific to humans? Ape genomes are (almost) not available and are too similar. Phylogenetic help is almost essential in more difficult problems: motif finding (promoter analysis); ultraconserved regions with no evident function.