Published byBathsheba MargaretMargaret Patrick Modified over 9 years ago
1
A Markovian approach for the analysis of the gene structure
C. MelodeLima (1), L. Guéguen (1), C. Gautier (1) and D. Piau (2)
(1) Biométrie et Biologie Evolutive, UMR CNRS 5558, Université Claude Bernard Lyon 1, France
(2) Institut Camille Jordan, UMR CNRS 5208, Université Claude Bernard Lyon 1, France
PRABI
2
Contents
* Introduction
* HMM for the genomic structure of DNA sequences
* Discrimination method based on HMM
* Conclusion
* Direction of research
3
Introduction
* Intensive sequencing
* Genes represent only 3% of the human genome
* Markovian models are widely used for the identification of genes
* We propose an analysis of the structural properties of genes, using a discrimination method based on HMMs
4
Introduction: hidden Markov model
Advantages:
* Each state represents a different type of region in the sequence
* The complexity of the algorithm is linear in the length of the sequence
Drawback:
* The distribution of the sojourn time in a given state is geometric, whereas the empirical distribution of exon lengths is not geometric
5
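HMM for the genomic structure of DNA sequences
Structure of the HMM
[Diagram: two states, CDS (coding sequence) and No CDS, with self-transition probabilities 1 - t1 and 1 - t2 and switching probabilities t1 and t2]
Base probabilities in one state: A: pA, C: pC, G: pG, T: pT
Base probabilities in the other state: A: qA, C: qC, G: qG, T: qT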
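The two-state structure can be decoded in time linear in the sequence length, which is the advantage claimed in the introduction. The following is a minimal Viterbi sketch, not the authors' implementation. The numeric values of t1, t2 and of the base probabilities are hypothetical, the assignment of t1 to the CDS state is an assumption, and emissions are of order 0 for brevity (the slides use order 5).

```python
import math

STATES = ("CDS", "NoCDS")

def viterbi(seq, trans, emit, start):
    """Most probable state path for seq, computed in log space in O(len(seq))."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in STATES}
    back = []  # back[i][s] = best predecessor of state s at position i + 1
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda r: score[r] + math.log(trans[r][s]))
            new_score[s] = (score[prev] + math.log(trans[prev][s])
                            + math.log(emit[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical parameters (assumption: t1 leaves CDS, t2 leaves NoCDS)
t1, t2 = 0.05, 0.05
TRANS = {"CDS": {"CDS": 1 - t1, "NoCDS": t1},
         "NoCDS": {"CDS": t2, "NoCDS": 1 - t2}}
EMIT = {"CDS": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
        "NoCDS": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
START = {"CDS": 0.5, "NoCDS": 0.5}

path = viterbi("GC" * 10 + "AT" * 10, TRANS, EMIT, START)
```

With these toy parameters the GC-rich half is labelled CDS and the AT-rich half NoCDS; the work per base is constant, so the total cost grows linearly with the sequence.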
6
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account:
Model of order 5
[Diagram: hidden states S(t-6), S(t-5), ..., S(t-1), S(t) and observations X(t-6), X(t-5), ..., X(t-1), X(t)]
12
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account:
Length distributions of exons and introns according to their position in genes
[Diagram of states: intergenic region, single exon, initial exon, initial intron, internal exon, internal intron, terminal intron, terminal exon]
16
HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account:
Length distributions of exons and introns according to their position in genes
Direct and reverse strands
[Diagram of states: intergenic region, single exon, initial exon, initial intron, internal exon, internal intron, terminal intron, terminal exon]
17
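HMM for the genomic structure of DNA sequences
Several biological properties of DNA sequences were taken into account:
Codons
[Diagram: an exon state with transition labels p and 1 - p, expanded into three states frame 0, frame 1 and frame 2 with transition labels p, p, p and 1 - p]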
18
HMM for the genomic structure of DNA sequences
Length of a hidden state
The sojourn time in an HMM state must follow a geometric law.
[Diagram: state CDS with self-transition probability p and exit probability 1 - p]
T: sojourn time in a given state. T follows a geometric law.
Time of stay in state CDS | Probability
1 | 1 - p
2 | p (1 - p)
3 | p^2 (1 - p)
… | …
n | p^(n-1) (1 - p)
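The table can be checked numerically. This short sketch uses a hypothetical self-transition probability p = 0.9 (not a value from the slides) and shows two consequences of the geometric law: the most probable sojourn time is always 1, and the mean sojourn time is 1 / (1 - p).

```python
def sojourn_pmf(n, p):
    """P(T = n) when the state loops on itself with probability p."""
    return p ** (n - 1) * (1 - p)

p = 0.9  # hypothetical self-transition probability
lengths = range(1, 2000)
probs = [sojourn_pmf(n, p) for n in lengths]

total = sum(probs)                                  # close to 1
mean = sum(n * q for n, q in zip(lengths, probs))   # close to 1 / (1 - p)
mode = lengths[probs.index(max(probs))]             # most probable length
```

Because the probability decreases from n = 1 onward, a single state cannot reproduce a length distribution that peaks away from the shortest lengths.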
19
HMM for the genomic structure of DNA sequences
Method: estimation of the length of a region
[Plot: probability against length of the internal exons]
Geometric laws do not fit the empirical distribution of exon lengths.
We suggest replacing a single state by a series of states.
[Diagram: State 1, State 2, State …]
Good fit with sums of 5 geometric random variables.
[Plot: probability against length of the internal exons]
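Chaining several identical states makes the total sojourn time a sum of geometric variables. The sketch below propagates probabilities through such a chain and compares the result with the closed-form negative binomial probability. It reads the value 1/26 reported later for internal exons as the per-base probability of leaving each sub-state; that reading is an assumption, since the earlier table uses p for the self-transition probability.

```python
import math

def chain_length_pmf(k, q, n_max):
    """P(total length = n) for a chain of k states, each left with probability q."""
    # occ[i] = probability of sitting in sub-state i while emitting the current base
    occ = [1.0] + [0.0] * (k - 1)
    pmf = [0.0] * (n_max + 1)
    for n in range(1, n_max + 1):
        pmf[n] = occ[-1] * q                 # leave the last sub-state after base n
        nxt = [0.0] * k
        for i in range(k):
            nxt[i] += occ[i] * (1 - q)       # stay in the same sub-state
            if i + 1 < k:
                nxt[i + 1] += occ[i] * q     # move to the next sub-state
        occ = nxt
    return pmf

def closed_form(n, k, q):
    """Negative binomial probability that k geometric variables sum to n."""
    return math.comb(n - 1, k - 1) * q ** k * (1 - q) ** (n - k) if n >= k else 0.0

k, q = 5, 1 / 26          # assumed reading of "sum of 5 geometric laws, p = 1/26"
pmf = chain_length_pmf(k, q, 3000)
mean = sum(n * prob for n, prob in enumerate(pmf))
mode = pmf.index(max(pmf))
```

Under this reading the mean length is k / q = 130 bases and the distribution peaks near 100 bases rather than at 1, which is the qualitative shape a single geometric state cannot produce.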
22
HMM for the genomic structure of DNA sequences
Method: estimation of the length of a region
Data:
* Human genome, extracted from HOVERGEN
Different length distributions:
* Sum of geometric laws of equal parameter, with the number of laws ranging from 1 to 7
* Sum of 2 or 3 geometric laws of different parameters
For each region:
* We choose the parameters that minimize the Kolmogorov-Smirnov distance
* We do not use the maximum likelihood method
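The selection step can be sketched as a grid search that minimizes the Kolmogorov-Smirnov distance between the empirical and model cumulative distributions. This is an illustration only: the search grid, the function names and the synthetic sample are assumptions, the slides do not say how the minimization was carried out, and only the equal-parameter family is covered.

```python
import math
import random
from collections import Counter

def sum_geom_pmf(n, k, q):
    """P(T = n) for a sum of k geometric variables with exit probability q."""
    if n < k:
        return 0.0
    return math.comb(n - 1, k - 1) * q ** k * (1 - q) ** (n - k)

def ks_distance(lengths, k, q):
    """Largest gap between the empirical and model CDFs over 1..max(lengths)."""
    counts, total = Counter(lengths), len(lengths)
    emp = model = worst = 0.0
    for n in range(1, max(lengths) + 1):
        emp += counts.get(n, 0) / total
        model += sum_geom_pmf(n, k, q)
        worst = max(worst, abs(emp - model))
    return worst

def fit(lengths, ks=range(1, 8), means=range(5, 200)):
    """Return (distance, k, q) minimizing the KS distance over a grid q = 1/m."""
    return min((ks_distance(lengths, k, 1 / m), k, 1 / m)
               for k in ks for m in means)

def sample_sum_geom(rng, k, q):
    """Draw one sum of k geometric variables by inverse-transform sampling."""
    return sum(int(math.log(1 - rng.random()) / math.log(1 - q)) + 1
               for _ in range(k))

# Synthetic lengths standing in for real exon data (not HOVERGEN data)
rng = random.Random(0)
lengths = [sample_sum_geom(rng, 5, 1 / 26) for _ in range(2000)]
best_ks, best_k, best_q = fit(lengths)
single_state_ks = ks_distance(lengths, 1, 1 / 130)  # one geometric law, same mean
```

On such data a single geometric law with the same mean stays far from the empirical distribution, while the best sum of geometric laws tracks it closely.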
23
HMM for the genomic structure of DNA sequences
Results: Estimation of the length of a region
[Plot: probability against length of the initial exon, comparing maximum likelihood estimation and Kolmogorov-Smirnov estimation]
24
HMM for the genomic structure of DNA sequences
Results: Estimation of the length distribution of internal exons
[Plot: probability against length of the internal exons; sum of 5 geometric laws, p = 1/26]
The model fits the empirical distribution very well.
25
HMM for the genomic structure of DNA sequences
Results: Estimation of the length distribution of intronless genes
[Plot: sum of 2 geometric laws, p = 1/440]
Many small genes with single exons are pseudogenes.
26
Contents
* Introduction
* HMM for the genomic structure of DNA sequences
* Discrimination method based on HMM
* Conclusion
* Direction of research
27
Discrimination method based on HMM
Method: a model for initial, internal and terminal exons
Emission probabilities for each state are estimated from the frequencies of 6-letter words (model of order 5).
Discrimination method to test the homogeneity between regions, for example HMM 1: initial exon and HMM 2: internal exon:
$D = \frac{\log P(S \mid \mathrm{HMM}_1) - \log P(S \mid \mathrm{HMM}_2)}{|S|}$ (Eq. 1)
where S is the test sequence, |S| is its length and P(S | HMM) is the likelihood of the sequence under the model.
The sequence is assigned to the HMM with the best likelihood.
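Eq. 1 can be illustrated with a simplified stand-in for the exon HMMs: a single-state Markov chain of order k whose emission probabilities come from (k+1)-letter word counts. The pseudocount, the function names and the toy training sequences are assumptions; the models in the slides may have more states.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, order=5, pseudo=1.0):
    """Estimate P(base | previous `order` bases) from word counts."""
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][s[i]] += 1

    def log_prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudo * len(ALPHABET)
        return math.log((ctx.get(base, 0.0) + pseudo) / total)

    return log_prob

def log_likelihood(seq, log_prob, order=5):
    """Log-likelihood of seq, conditioned on its first `order` bases."""
    return sum(log_prob(seq[i - order:i], seq[i]) for i in range(order, len(seq)))

def discrimination(seq, model1, model2, order=5):
    """D of Eq. 1: length-normalized log-likelihood difference."""
    return (log_likelihood(seq, model1, order)
            - log_likelihood(seq, model2, order)) / len(seq)

def recognition_rate(seqs, model1, model2, order=5):
    """Fraction of sequences assigned to model1 (D > 0), i.e. N1 / N."""
    return sum(discrimination(s, model1, model2, order) > 0 for s in seqs) / len(seqs)

# Toy example with order 1 so that tiny training sets are enough
m1 = train_markov(["ACACACACACACACAC"], order=1)
m2 = train_markov(["GTGTGTGTGTGTGTGT"], order=1)
d_pos = discrimination("ACACACACAC", m1, m2, order=1)
d_neg = discrimination("GTGTGTGTGT", m1, m2, order=1)
rate = recognition_rate(["ACACAC", "CACACA", "GTGTGT"], m1, m2, order=1)
```

A positive D assigns the sequence to the first model and a negative D to the second; dividing by |S| makes scores comparable across sequences of different lengths.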
29
Discrimination method based on HMM
Quality of the decision: we want to know whether the models are well adapted to their regions (HMMs are compared pairwise).
[Diagram: a set of N initial exon sequences goes through the decision, which labels N1 of them as initial exons and N - N1 as internal exons]
Each model is characterized by its frequency of sequence recognition.
30
Discrimination method based on HMM
Results: Comparison of different HMMs on different test sequences
* Internal exon ≈ Terminal exon
* Initial exon ≠ Internal exon
* Initial exon ≠ Terminal exon
34
Discrimination method based on HMM
Results: Break in the homogeneity of the first coding exon
To determine the break point in first exon sequences, we consider different HMMs.
The HMM representing the initial exon was split into 2 HMMs around the k-th base:
* A "Start" HMM is trained on the first k bases
* An "End" HMM is trained on the remaining bases
[Diagram: initial exon HMM split at position k into HMM Start and HMM End]
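Preparing the training sets for the two models amounts to cutting every initial exon at position k. The helper below is a hypothetical sketch of that step; the function name and the handling of exons no longer than k bases are assumptions.

```python
def split_at(seqs, k):
    """Split each sequence into its first k bases and the remaining bases."""
    starts = [s[:k] for s in seqs]               # training set for the "Start" model
    ends = [s[k:] for s in seqs if len(s) > k]   # training set for the "End" model
    return starts, ends

# Toy sequences; real initial exons would be cut at a candidate k such as 80
starts, ends = split_at(["ATGAAACCC", "ATG"], 3)
```

Repeating the split for several values of k and comparing the resulting models is one way to look for the position where the composition changes.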
35
Discrimination method based on HMM
Results: Break in the homogeneity of the first coding exon
[Plot legend: M_EI 80, other models]
39
Discrimination method based on HMM
Results: Initial exons
[Diagram: initial exons assigned to HMM Start or HMM End; labels in order: 25%, 75%, with signal peptide (SignalP), 90%, 10%, without signal peptide]
HMM Start characterizes the signal peptide well.
41
Conclusion
Modelling of the exon length distribution:
* Sums of geometric laws fit the distribution of exon lengths well
* Sum of 5 geometric laws of the same parameter (internal exons)
* Sum of 3 geometric laws of different parameters (terminal exons)
* The model has relatively few parameters
Discrimination method based on HMM:
* Bad annotation of intronless genes in the database
* Homogeneity between internal and terminal exons
* Break of homogeneity of the initial exon around the 80th base: signal peptide
43
Contents
* Introduction
* HMM for the genomic structure of DNA sequences
* Discrimination method based on HMM
* Conclusion
* Direction of research
44
Direction of research
Markovian models for the analysis of the organization of genomes
[Figure: chromosome 9 (Versteeg 2003), showing GC content, gene density, size of introns, repeated elements and gene expression]
49
Direction of research
Structure superposition in genomes
[Diagram: a chromosome viewed at the isochore level, the gene level, the exon-intron level (intron, exon) and the codon level (acc gcc agt tac ccc aga)]
50
Direction of research
Scan the genome
– Build 3 HMMs adapted to the organization structure of each of the 3 isochore classes H, L, M: H = [72%, 100%], M = ]56%, 72%[, L = [0%, 56%]
– Human chromosomes are divided into overlapping 100 kb segments. Two successive segments overlap by half of their length.
– Bayesian approach: for each segment and for each model (H, L and M), we compute the probability P[Model | Segment]. Each segment is assigned to the model with the highest probability.
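The scanning procedure can be sketched in two parts: cutting a chromosome into half-overlapping 100 kb windows, and turning per-model log-likelihoods into posterior probabilities with Bayes' rule. The log-likelihood values and the uniform prior below are hypothetical, and the slides do not state which prior was used.

```python
import math

def half_overlapping_windows(length, size=100_000):
    """Yield (start, end) for windows of `size` bases that overlap by half.

    Trailing bases that do not fill a complete window are ignored.
    """
    step = size // 2
    for start in range(0, max(length - size, 0) + 1, step):
        yield start, min(start + size, length)

def posterior(log_likelihoods, priors):
    """P[Model | Segment] from log P[Segment | Model] and prior P[Model]."""
    joint = {m: log_likelihoods[m] + math.log(priors[m]) for m in log_likelihoods}
    top = max(joint.values())
    # log-sum-exp keeps the normalization stable for very negative log-likelihoods
    norm = top + math.log(sum(math.exp(v - top) for v in joint.values()))
    return {m: math.exp(v - norm) for m, v in joint.items()}

windows = list(half_overlapping_windows(250_000))

# Hypothetical log-likelihoods of one segment under the three isochore models
post = posterior({"H": -1000.0, "M": -1002.0, "L": -1010.0},
                 {"H": 1 / 3, "M": 1 / 3, "L": 1 / 3})
best_model = max(post, key=post.get)
```

With a uniform prior the posterior ranking equals the likelihood ranking; a non-uniform prior would shift the decision toward the more common isochore class.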
51
Direction of research
Results: Human chromosome 1
[Plot: distribution of isochores (models H, M and L) along the chromosome, with gene density and G+C content]
52
Comparative Genomic Analysis
Comparing the human genome with genomes of different organisms can be useful to:
* better understand the structure and function of human genes
* study evolutionary changes among organisms
* help to identify the genes that are conserved among species
53
Direction of research
[Figure: Human, Chimpanzee, Mouse, Chicken, Tetraodon]