Published by Shauna Nelson. Modified over 9 years ago.
1
Comp. Genomics Recitation 9 11/3/06 Gene finding using HMMs & Conservation
2
Outline: Gene finding using HMMs. Adding trees to HMMs: phyloHMM, N-SCAN. BLAST + gene finding: SGP2. Examples.
3
Markov Sequence Models. Key: distinguish coding from non-coding statistics. Popular models: 6-mers (5th-order Markov model), homogeneous or non-homogeneous (reading-frame specific). Not sensitive enough for eukaryotic genes: exons are too short, and splice junctions are poorly detected.
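As a rough illustration of how a k-th order Markov model separates coding from non-coding statistics, here is a minimal sketch. The function names, the pseudocount, and the toy training strings are assumptions made for this example and do not come from the slides; a real gene finder would use k = 5 and large training sets.

```python
import math
from collections import defaultdict

def train_markov(seqs, k, alphabet="ACGT", pseudo=1.0):
    """Estimate P(base | previous k bases) with add-one style pseudocounts."""
    counts = defaultdict(dict)
    for s in seqs:
        for i in range(k, len(s)):
            ctx, base = s[i - k:i], s[i]
            counts[ctx][base] = counts[ctx].get(base, 0.0) + 1.0

    def prob(ctx, base):
        seen = counts.get(ctx, {})
        total = sum(seen.values()) + pseudo * len(alphabet)
        return (seen.get(base, 0.0) + pseudo) / total

    return prob

def log_odds(seq, coding, noncoding, k):
    """Sum of log P_coding / P_noncoding over all positions with a full context."""
    score = 0.0
    for i in range(k, len(seq)):
        ctx, base = seq[i - k:i], seq[i]
        score += math.log(coding(ctx, base) / noncoding(ctx, base))
    return score

# Toy data (hypothetical): k = 2 keeps the example small.
K = 2
coding_model = train_markov(["ATGGCCATGGCC" * 5], K)
noncoding_model = train_markov(["ATATATATTTAA" * 5], K)
coding_like = log_odds("GGCCATGGCCATGG", coding_model, noncoding_model, K)
noncoding_like = log_odds("ATATATATAT", coding_model, noncoding_model, K)
```

A positive score favors the coding model and a negative score favors the non-coding model. A non-homogeneous (reading-frame specific) variant would keep three such tables, one per codon position.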
4
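Length Distribution (CG © Ron Shamir, 2008). Simple HMMs can only encode geometric length distributions. [Figure: two-state HMM with an exon state (self-transition p, exit 1-p) and an intron state (self-transition q, exit 1-q).] The length of each exon (intron) is therefore geometric: P(length = k) = p^(k-1) (1-p) for exons and q^(k-1) (1-q) for introns.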
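The geometric shape follows directly from the self-transition: staying k-1 times and then leaving has probability p^(k-1)(1-p). A small sketch (function names are chosen for this example only):

```python
def geometric_length_pmf(p_stay, max_len):
    """P(length = k) for k = 1..max_len when the state loops with probability p_stay."""
    return [(p_stay ** (k - 1)) * (1.0 - p_stay) for k in range(1, max_len + 1)]

def mean_length(p_stay):
    """Expected number of positions spent in the state before leaving."""
    return 1.0 / (1.0 - p_stay)

pmf = geometric_length_pmf(0.9, 2000)
```

Whatever value of p is chosen, the most likely length is always 1 and the probabilities only decrease from there, which is why a single self-looping state cannot model a length distribution that peaks at some intermediate value.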
5
Exon Length Distribution. The length distribution of introns is ≈ geometric. For exons it isn’t, because exon length is also affected by splicing itself: too short (under 50 bp) and the spliceosomes have no room; too long (over 300 bp) and the ends have problems finding each other. As usual, there are exceptions. A different model is needed for exons.
6
Generalized HMM (Burge & Karlin, J. Mol. Biol. 97, 268:78-94). Instead of a single character, each state emits a sequence whose length follows some length distribution.
7
Generalized HMM (Burge & Karlin, J. Mol. Biol. 97, 268:78-94). Overview: hidden Markov states q_1, …, q_n. State q_i has output length distribution f_i. The output of each state can have a separate probabilistic model (weight matrix model, HMM, …). Initial state probability distribution. State transition probabilities T_ij.
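To make the generalized-HMM overview concrete, the sketch below samples from a toy two-state model in which each state first draws a segment length from its own distribution f_i and then emits a whole segment. The state set, the length ranges, and the base weights are all hypothetical choices for illustration and are not GenScan's parameters.

```python
import random

def geometric_length(p_stay, rng):
    """Length >= 1 drawn from a geometric distribution."""
    length = 1
    while rng.random() < p_stay:
        length += 1
    return length

def random_bases(length, weights, rng):
    """Independent bases with the given A, C, G, T weights (a stand-in emission model)."""
    return "".join(rng.choices("ACGT", weights=weights, k=length))

# Each state carries its own length distribution and its own emission model.
STATES = {
    "exon": {
        "length": lambda rng: rng.randint(50, 300),          # non-geometric
        "emit": lambda n, rng: random_bases(n, [2, 3, 3, 2], rng),
    },
    "intron": {
        "length": lambda rng: geometric_length(0.99, rng),   # geometric
        "emit": lambda n, rng: random_bases(n, [3, 2, 2, 3], rng),
    },
}
INIT = {"exon": 1.0}
TRANS = {"exon": {"intron": 1.0}, "intron": {"exon": 1.0}}

def sample_ghmm(states, trans, init, n_segments, rng):
    """Return a list of (state, emitted segment) pairs."""
    def draw(dist):
        names, weights = zip(*dist.items())
        return rng.choices(names, weights=weights, k=1)[0]

    path = []
    state = draw(init)
    for _ in range(n_segments):
        length = states[state]["length"](rng)
        path.append((state, states[state]["emit"](length, rng)))
        state = draw(trans[state])
    return path

segments = sample_ghmm(STATES, TRANS, INIT, 5, random.Random(0))
```

Decoding such a model needs a Viterbi variant that maximizes over segment end points as well as states, which is more expensive than the standard per-position recursion.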
8
GenScan Model (Burge & Karlin, JMB 97)
9
GenScan model. States = functional units on a gene. The allowed transitions ensure the order is biologically consistent. As an intron may cut a codon, one must keep track of the reading frame, hence the three intron phases: phase I_0: introns between codons; phase I_1: introns that start after the 1st base of a codon; phase I_2: introns that start after the 2nd base of a codon.
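The phase bookkeeping amounts to counting coding bases modulo 3. A minimal sketch, assuming the exon lengths count coding bases only and that the first exon starts at the first base of the start codon (the function name is made up for this example):

```python
def intron_phases(coding_exon_lengths):
    """Phase of the intron following each exon except the last.

    Phase 0: intron lies between two codons.
    Phase 1: intron starts after the 1st base of a codon.
    Phase 2: intron starts after the 2nd base of a codon.
    """
    phases = []
    coding_bases = 0
    for length in coding_exon_lengths[:-1]:
        coding_bases += length
        phases.append(coding_bases % 3)
    return phases

phases = intron_phases([100, 50, 30])  # -> [1, 0]
```

Here the first intron interrupts a codon after its first base (100 mod 3 = 1), and the second falls between codons (150 mod 3 = 0).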
10
Phylogenetic HMMs. Due to Siepel and Haussler. A simple gene-finding HMM looks at a single Markov process, along the sequence: each position is dependent on the previous position. If we incorporate sequences from multiple organisms, we can look at another process, along the tree: each position is dependent on its ancestor.
11
Phylogenetic HMMs. A simple HMM can be thought of as a machine that generates a sequence: every state emits a single character, with a multinomial distribution at every state. A phyloHMM generates an MSA: every state emits a single MSA column, with a phylogenetic model at every state.
12
Phylogenetic HMMs
13
Phylogenetic models in phyloHMM. Each model defines a stochastic process of substitution. Every position is independent. The following process occurs: a character is assigned to the root; character substitutions then occur along each branch, based on a substitution matrix and on the branch lengths; the characters at the leaves of the tree correspond to the MSA column.
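The generative process described on this slide can be sketched as a recursive walk down the tree. The example assumes a Jukes-Cantor substitution model with a uniform root distribution, which is a simplification chosen for brevity and not the model used in the slides; the tree encoding `(name, branch_length, children)` and the species names are likewise hypothetical.

```python
import math
import random

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def simulate_column(tree, rng, parent_base=None):
    """Sample one MSA column; returns {leaf name: base}."""
    name, t, children = tree
    if parent_base is None:
        base = rng.choice(BASES)  # root character from a uniform background
    else:
        weights = [jc_prob(parent_base, b, t) for b in BASES]
        base = rng.choices(BASES, weights=weights, k=1)[0]
    if not children:
        return {name: base}
    column = {}
    for child in children:
        column.update(simulate_column(child, rng, base))
    return column

TREE = ("root", 0.0, [
    ("human", 0.1, []),
    ("rodent", 0.2, [("mouse", 0.05, []), ("rat", 0.05, [])]),
])
column = simulate_column(TREE, random.Random(1))
```

Only the leaf characters are returned, since the ancestral characters are unobserved in a real alignment.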
14
Phylogenetic models in phyloHMM. Different models for different states. Different substitution rates: e.g., in exons we’ll see fewer substitutions. Different patterns of substitutions: e.g., third-position bias in coding sequences. Different tree topologies: e.g., following recombination.
15
Formally, a phyloHMM consists of: S – set of states; Ψ – phylogenetic models (instead of E in a standard HMM); A – state transitions; b – initial probabilities.
16
Formally, each phylogenetic model ψ consists of: Q – substitution rate matrix (e.g., derived from PAM); π – background frequencies; τ – the phylogenetic tree; β – branch lengths.
17
Formally. P(X_i | ψ_j): the probability of a column X_i being emitted by the model ψ_j. It can be computed efficiently by Felsenstein’s “pruning algorithm” (recitation 6). Joint probability of a path φ in the HMM and an alignment X of L columns: P(φ, X) = b_{φ_1} P(X_1 | ψ_{φ_1}) ∏_{i=2..L} a_{φ_{i-1} φ_i} P(X_i | ψ_{φ_i}). Viterbi, forward-backward etc. – as usual.
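A compact sketch of the pruning computation for a single column follows. It again assumes a Jukes-Cantor model and the hypothetical `(name, branch_length, children)` tree encoding; treating a gap character as missing data is also an assumption of this example.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of base a becoming base b over branch length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def column_likelihood(tree, column, background=None):
    """P(column | tree) by Felsenstein's pruning (post-order) recursion."""
    if background is None:
        background = {b: 0.25 for b in BASES}

    def partial(node):
        # partial(node)[a] = P(observed leaves below node | node has base a)
        name, _, children = node
        if not children:
            observed = column[name]
            if observed == "-":  # assumption: a gap is treated as missing data
                return {b: 1.0 for b in BASES}
            return {b: 1.0 if b == observed else 0.0 for b in BASES}
        result = {a: 1.0 for a in BASES}
        for child in children:
            below = partial(child)
            t = child[1]
            for a in BASES:
                result[a] *= sum(jc_prob(a, b, t) * below[b] for b in BASES)
        return result

    at_root = partial(tree)
    return sum(background[a] * at_root[a] for a in BASES)

TREE = ("root", 0.0, [("human", 0.1, []), ("mouse", 0.3, [])])
total = sum(
    column_likelihood(TREE, {"human": h, "mouse": m})
    for h, m in product(BASES, repeat=2)
)
```

Each internal node is visited once and does work proportional to the square of the alphabet size per child, so the cost is linear in the number of species, compared with summing explicitly over all assignments of ancestral characters.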
18
Simple phylo-gene-finder. [Figure: HMM state diagram; surviving state labels: Non-coding, 3rd position.] If the parameters are known, Viterbi can be used to find the most probable path, i.e. a segmentation into coding and non-coding regions.
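Once per-column likelihoods under each state's phylogenetic model are available, decoding is ordinary Viterbi in log space. The sketch below takes those log-likelihoods as input; the two-state layout and every number in the toy example are invented for illustration, and a real model would have separate states per codon position.

```python
import math

def viterbi(log_emit, log_trans, log_init):
    """Most probable state path.

    log_emit[i][s] is log P(column i | phylogenetic model of state s).
    """
    states = list(log_init)
    score = {s: log_init[s] + log_emit[0][s] for s in states}
    back = []
    for i in range(1, len(log_emit)):
        new_score, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda r: score[r] + log_trans[r][s])
            new_score[s] = score[prev] + log_trans[prev][s] + log_emit[i][s]
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: score[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Toy input: columns 3..6 look conserved in a coding-like way.
favored = ["N"] * 3 + ["C"] * 4 + ["N"] * 3
log_emit = [
    {s: math.log(0.9 if s == f else 0.1) for s in ("C", "N")} for f in favored
]
log_trans = {
    "C": {"C": math.log(0.9), "N": math.log(0.1)},
    "N": {"C": math.log(0.1), "N": math.log(0.9)},
}
log_init = {"C": math.log(0.5), "N": math.log(0.5)}
segmentation = viterbi(log_emit, log_trans, log_init)
```

With these numbers the decoded path switches to the coding state exactly over the four middle columns, because four unfavorable emissions cost more than two state switches.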
19
Phylo-gene-finder is a good idea. Use of phylogeny is important: it imposes structure on the substitutions, and it weights different pairs differently based on their evolutionary distance.
20
N-SCAN. Another phylogeny-HMM gene finder. A GHMM that emits MSA columns. Annotates one sequence at a time: the target sequence. Distinguishes between a target sequence (T) and other informative sequences (Is) that may contain gaps. States correspond to sequence types in the target sequence.
21
N-SCAN. Uses a Bayesian network instead of a simple evolutionary model. Accounts for: 5’ UTRs; conserved non-coding sequence (highly conserved, but with no “coding” features).
22
Transforming the phylogenetic tree into a BN
23
SGP-2. Drawback of the approaches described so far: they require a meaningful alignment. This is impossible if one of the genomes is not yet finished, and an alignment is not necessarily “correct”.
24
SGP-2. A framework working on two genomes. Idea: use BLAST to identify which positions are more or less conserved, then feed the BLAST scores into the gene-finding HMM. The BLAST results serve to modify the scores of the exons.
25
SGP-2
26
SGP-2 is an extension of GENEID. GENEID (single genome): 5th-order HMM. Exons are scored based on HMM scores plus some additional features. A spliced alignment is used for assembly of genes from exons.
27
Scoring exons. Ideally, we would like to score exons based on their alignment score. In this case a BLAST variant is used: the extent of the highest-scoring HSP overlapping the exon is used as a proxy for its “conservation”. Problem: if the genome is not assembled, different exon parts will have HSPs with different genome parts. Solution: consider HSPs covering different parts of the exon, and unify them into a single exon score. Final exon score:
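The exact scoring formula is not preserved in this transcript, so the following is only one plausible way to unify several HSPs into a single exon score, not SGP-2's actual definition. It assumes half-open coordinates, credits each HSP in proportion to the fraction of it that overlaps the exon, and adds the result to a single-genome score with a hypothetical weight; overlapping HSPs are simply summed here.

```python
def hsp_support(exon, hsps):
    """Conservation support for an exon from a list of HSPs.

    exon: (start, end), half-open coordinates on the query genome.
    hsps: iterable of (start, end, score) on the same coordinates.
    """
    start, end = exon
    total = 0.0
    for h_start, h_end, h_score in hsps:
        overlap = min(end, h_end) - max(start, h_start)
        if overlap > 0:
            # Credit the HSP in proportion to the part lying inside the exon.
            total += h_score * overlap / (h_end - h_start)
    return total

def combined_exon_score(base_score, exon, hsps, weight=1.0):
    """Single-genome exon score plus weighted HSP support (weight is hypothetical)."""
    return base_score + weight * hsp_support(exon, hsps)

hsps = [(50, 150, 80.0), (150, 250, 40.0), (300, 400, 99.0)]
support = hsp_support((100, 200), hsps)
score = combined_exon_score(5.0, (100, 200), hsps, weight=0.5)
```

In the toy input the exon is covered half by one HSP and half by another, as would happen when its two parts hit different contigs of an unassembled genome; the third HSP does not overlap and contributes nothing.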
28
BACH1
29
OLIG2
30
PPM1A
31
Summary. There are different approaches for gene finding. Adding phylogeny generally helps. But: what about genes/exons that are specific to humans? Ape genomes are (almost) not available, and are too similar. Phylogenetic help is almost essential in more difficult problems: motif finding (promoter analysis); ultraconserved regions with no evident function.