1
Evaluation of the Haplotype Motif Model using the Principle of Minimum Description
Srinath Sridhar, Kedar Dhamdhere, Guy E. Blelloch, R. Ravi and Russell Schwartz
Computer Science Dept, Tepper School of Business and Dept of Biological Sciences, Carnegie Mellon University
2
Motivation
Individual characteristics and genetic factors
Model the structure of correlated genetic variation (haplotypes) in the DNA
Extract haplotype patterns (motifs) from the model
Perform association studies comparing motifs to disease susceptibility
3
Single Nucleotide Polymorphism (SNP)
Rows: individual samples. Columns: nucleotides.
ACCTGTATACGTA    0000000000000
ACATGTAGACGGA    0010000100010
ACCTGTAGACGGA    0000000100010
ACATGTATACGTA    0010000000000
ACCTGTAGACGGA    0000000100010
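The binary strings on this slide can be reproduced from the nucleotide rows. The following is a minimal Python sketch, assuming that the first sample acts as the reference (0 = same base as the first row, 1 = different base). The name `encode_snps` is illustrative and does not come from the presentation.

```python
def encode_snps(sequences):
    """Encode each sequence as a 0/1 string relative to the first sequence."""
    reference = sequences[0]
    return [
        "".join("0" if base == ref else "1" for base, ref in zip(seq, reference))
        for seq in sequences
    ]


samples = [
    "ACCTGTATACGTA",
    "ACATGTAGACGGA",
    "ACCTGTAGACGGA",
    "ACATGTATACGTA",
    "ACCTGTAGACGGA",
]
encoded = encode_snps(samples)

# Columns (0-based) in which at least one sample differs from the reference.
snp_columns = [
    j for j in range(len(samples[0])) if any(row[j] == "1" for row in encoded)
]
```

Only the columns listed in `snp_columns` carry information; the remaining columns are identical across all samples.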
4
Related Article
5
Evolution
Two types of events: mutation and recombination
Mutation (one strand of one chromosome shown):
ACGTACCGTATATA  ->  ACGTACTGTATATA
Recombination (one strand of two homologous chromosomes shown):
ACGTACCGTATATA  ->  ACGTACCGTACGTA
GTACTACGTACGTA  ->  GTACTACGTATATA
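The two event types can be sketched as string operations. This is a minimal Python illustration using the sequences from the slide; the function names `mutate` and `recombine`, and the choice of a single crossover point, are assumptions made for the example.

```python
def mutate(seq, pos, base):
    """Return a copy of seq with the nucleotide at 0-based pos replaced by base."""
    return seq[:pos] + base + seq[pos + 1:]


def recombine(a, b, crossover):
    """Single crossover: swap the suffixes of a and b from 0-based index crossover."""
    return a[:crossover] + b[crossover:], b[:crossover] + a[crossover:]


# Mutation: the C at index 6 becomes a T.
mutated = mutate("ACGTACCGTATATA", 6, "T")

# Recombination: the two homologous strands exchange everything from index 9 on.
child1, child2 = recombine("ACGTACCGTATATA", "GTACTACGTACGTA", 9)
```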
6
Recombination
[Figure: ancestral sequences and a sequence from the current population, drawn as bars]
7
Comparison of blocks and motifs
[Figure panels: Blocks [Daly et al. 2000]; Motifs [Schwartz 2003]]
8
Minimum Description Length (MDL)
Let:
  M represent the parameters of the model
  I represent the input matrix
  E be the explanation of I using M
  L be the length of an encoding
Objective: minimize L(M) + L(E(I)|M)
Complicated models are penalized, which prevents over-fitting
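The trade-off between L(M) and L(E(I)|M) can be illustrated with a toy two-part code. The sketch below is not the encoding used in the presentation. It assumes a simplified scheme in which listing each distinct pattern of a block costs one bit per column, and each row pays -log2(frequency) bits per block to name its pattern. The function name `description_length` is hypothetical.

```python
import math
from collections import Counter


def description_length(rows, boundaries):
    """Return (model_bits, data_bits) for a partition of the columns into blocks.

    boundaries: sorted 0-based start columns of the blocks, beginning with 0.
    Toy encoding (an assumption, not the presentation's cost function):
    each distinct pattern in a block costs one bit per column to list, and
    each row pays -log2(pattern frequency) bits per block.
    """
    n_cols = len(rows[0])
    edges = list(boundaries) + [n_cols]
    model_bits = 0.0
    data_bits = 0.0
    for start, end in zip(edges, edges[1:]):
        counts = Counter(row[start:end] for row in rows)
        model_bits += len(counts) * (end - start)
        for count in counts.values():
            data_bits += -count * math.log2(count / len(rows))
    return model_bits, data_bits


rows = ["0000", "0011", "0011", "0000"]
one_block = description_length(rows, [0])            # one block over all columns
two_blocks = description_length(rows, [0, 2])        # blocks at columns 0-1 and 2-3
per_column = description_length(rows, [0, 1, 2, 3])  # every column its own block
```

Under this toy encoding the two-block partition has the smallest total, since neither the coarsest nor the finest partition balances model size against explanation length as well.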
9
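Dynamic Program - Blocks
Dynamic program [Koivisto et al. 2003]:
  $\mathrm{best}(i) = \min_{0 \le j < i} \left[ \mathrm{best}(j) + C(j+1, i) \right]$
where C(j+1, i) is the cost of creating a single block from j+1 to i.
Running time: $O(n^2)$
Work space: $O(n)$
[Figure: columns up to i, split into the best partition of a prefix followed by a single block]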
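A dynamic program of this shape can be sketched in Python as follows. The sketch assumes the usual prefix recurrence, in which the best partition of columns 1..i is the best partition of some prefix 1..j followed by a single block from j+1 to i. The name `best_partition` and the toy cost function in the usage example are hypothetical. The double loop makes O(n^2) calls to the cost function and keeps two tables of size O(n).

```python
def best_partition(n, cost):
    """Minimum-cost partition of columns 1..n into consecutive blocks.

    cost(a, b) is the cost of one block covering columns a..b
    (1-based, inclusive). Returns (total_cost, list_of_blocks).
    """
    best = [0.0] + [float("inf")] * n  # best[i]: optimal cost for columns 1..i
    prev = [0] * (n + 1)               # prev[i]: end of the prefix before the last block
    for i in range(1, n + 1):
        for j in range(i):
            candidate = best[j] + cost(j + 1, i)
            if candidate < best[i]:
                best[i], prev[i] = candidate, j
    # Walk the back-pointers to recover the block boundaries.
    blocks, i = [], n
    while i > 0:
        blocks.append((prev[i] + 1, i))
        i = prev[i]
    return best[n], blocks[::-1]


# Hypothetical cost: a fixed overhead of 3 per block plus the squared block width.
total, blocks = best_partition(6, lambda a, b: 3 + (b - a + 1) ** 2)
```

With this toy cost, blocks of width 2 are cheapest per column, so the optimum splits the six columns into three blocks of two.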
10
Expectation-Maximization Algorithm - Motifs
1. Create a DAG of all possible motifs with a ‘start’ vertex
2. Initialize probabilities
3. For each EM iteration
   i. For each row r in R
      a. In the sub-graph corresponding to r, find the maximum-likelihood (ML) path from start
   ii. Re-normalize probabilities based on the number of times the vertices were used in ML paths
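The loop above can be sketched in Python under several simplifying assumptions that are not taken from the presentation: a motif is a pair `(start, pattern)`, a row is explained by a chain of motifs that tile it from left to right, probabilities are normalized among motifs sharing a start column, and a small pseudo-count keeps every probability positive. The names `ml_path` and `hard_em` are hypothetical.

```python
import math
from collections import defaultdict


def ml_path(row, motifs, prob):
    """Most likely chain of motifs that tiles `row`, found by a shortest-path DP."""
    n = len(row)
    best = [math.inf] * (n + 1)  # best[i]: minimum -log2 probability for row[:i]
    back = [None] * (n + 1)      # back[i]: last motif on the best chain ending at i
    best[0] = 0.0
    for i in range(n):
        if math.isinf(best[i]):
            continue
        for motif in motifs:
            start, pattern = motif
            end = start + len(pattern)
            if start != i or row[start:end] != pattern:
                continue
            cost = best[i] - math.log2(prob[motif])
            if cost < best[end]:
                best[end], back[end] = cost, motif
    if back[n] is None:
        raise ValueError("row cannot be explained by the motif set")
    path, i = [], n
    while i > 0:
        path.append(back[i])
        i = back[i][0]
    return path[::-1]


def hard_em(rows, motifs, iterations=5, pseudo=1e-3):
    """Alternate between ML paths and re-normalizing motif usage counts."""
    by_start = defaultdict(list)
    for motif in motifs:
        by_start[motif[0]].append(motif)
    # Initialization: uniform among motifs that share a start column.
    prob = {m: 1.0 / len(by_start[m[0]]) for m in motifs}
    for _ in range(iterations):
        counts = defaultdict(float)
        for row in rows:
            for motif in ml_path(row, motifs, prob):
                counts[motif] += 1
        for group in by_start.values():
            total = sum(counts[m] + pseudo for m in group)
            for m in group:
                prob[m] = (counts[m] + pseudo) / total
    return prob


rows = ["0000", "0011", "0011"]
motifs = [(0, "00"), (0, "0000"), (0, "0011"), (2, "00"), (2, "11")]
prob = hard_em(rows, motifs)
```

In this toy input the full-length motif `(0, "0011")` explains two of the three rows, so it ends with the highest probability among the motifs that start at column 0.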
11
Example 0 0 00 0 0 0 0 0 0 0 0 0 0 0 0 0 0
12
Example
[Figure: DAG with counts 0 0 0 0 1 0 0 0 1 0 1 0 0 1 0 0 0 0]
13
Example
[Figure: DAG with counts 0 0 0 0 1 0 0 0 1 1 1 0 0 2 0 0 0 1]
14
Example
[Figure: DAG with counts 5 2 9 0 0 4 5 4 2 0 1 3 3 5 1 0 6 6]
15
Example - Re-normalize
[Figure: DAG with counts 5 2 9 0 0 4 5 4 2 0 1 3 3 5 1 0 6 6; re-normalized probabilities shown: 0.33, 0.17, 1.0, 0.0]
16
Heuristics
EM finds $P_1$ and $P_2$, but $\mathrm{cost}(P_1) + \mathrm{cost}(P_2) \ge \mathrm{cost}(P'_1 \cup P'_2)$
Use knowledge from previous EM iteration
Multiple shortest paths with weight $(1+\varepsilon)^{-\mathrm{cost}(P)}$
Addition of small constants to prevent zero probability in first few iterations
Initialize the probabilities to favor smaller motifs
Restrict maximum length of motifs
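The path-weighting heuristic can be illustrated with a short Python sketch. It assumes that the weights of the candidate paths are normalized to sum to 1 before being used as fractional counts; that normalization, the function name `path_weights`, and the default value of `epsilon` are illustrative choices and are not stated on the slide.

```python
def path_weights(costs, epsilon=0.1):
    """Weight each path by (1 + epsilon) ** (-cost), then normalize to sum to 1."""
    raw = [(1 + epsilon) ** (-cost) for cost in costs]
    total = sum(raw)
    return [w / total for w in raw]


# Three candidate paths for one row: two tied for shortest, one slightly longer.
weights = path_weights([10, 10, 12])
```

Paths of equal cost receive equal weight, and each extra unit of cost divides the weight by (1 + epsilon), so near-optimal paths still contribute to the counts.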
17
Experimental Results
Simulated data using the ms program [Hudson, 2002]

Num seqs / recomb rate   Num SNPs   Desc length, motifs (bits)   Desc length, blocks (bits)
100/low                  229.00     4528.26                      5029.53
200/low                  241.00     6600.21                      8528.48
300/low                  248.33     9944.45                      13341.84
400/low                  219.00     12735.52                     18270.20
100/high                 202.33     5787.31                      6025.29
200/high                 250.33     8852.31                      10511.31
300/high                 209.67     11921.38                     15418.65
400/high                 211.00     19626.85                     26233.36
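The relative advantage of motifs over blocks in these results can be checked with a few lines of arithmetic. The Python sketch below uses the description lengths from the results above; the variable and function names are illustrative.

```python
# (number of sequences, motif description length, block description length), in bits
results = {
    "low": [
        (100, 4528.26, 5029.53),
        (200, 6600.21, 8528.48),
        (300, 9944.45, 13341.84),
        (400, 12735.52, 18270.20),
    ],
    "high": [
        (100, 5787.31, 6025.29),
        (200, 8852.31, 10511.31),
        (300, 11921.38, 15418.65),
        (400, 19626.85, 26233.36),
    ],
}


def motif_to_block_ratios(rows):
    """Ratio of motif to block description length for each input size."""
    return [motif_bits / block_bits for _, motif_bits, block_bits in rows]


ratios = {rate: motif_to_block_ratios(rows) for rate, rows in results.items()}
```

A ratio below 1 means the motif model yields the shorter description. In both recombination regimes the ratio falls as the number of sequences grows.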
18
Experimental Results
19
[Figure panels: Motifs, high recombination; Blocks, high recombination; Motifs, low recombination; Blocks, low recombination]
20
Conclusion
Characterized the problem of inferring haplotype structure as an optimization problem that is robust against over-fitting
The haplotype motif model captures the structure better than haplotype blocks
Furthermore, the motif method performs progressively better with larger input size
21
Discussion & Future Work
Extensions:
  Polynomial time algorithm/NP-hardness
  Clustering and error models
  Real data: recombination hot-spots
Future directions
[Diagram nodes: Genotype data; Haplotype data; Motifs/Blocks/?; Disease analysis/Drug design. Diagram labels: phasing; direct optimization; htSNP, association tests, ?; current work]
22
Encoding Motifs
Let $s_i$ be the start locations of motifs
Let $t_{i,j}$ be the number of motifs that start at i and end at j
Let $E_i = \{e_{i,1}, \ldots, e_{i,k}\}$ be the ordered set of end locations for motifs that start at i
Cost for encoding model:
Additional cost for encoding motif probabilities
23
Explanation
Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row
Cost for explanation of row r:
Cost for explanation:
24
Human Genetic Structure
Chromosomes are in the nucleus of cells
23 pairs of chromosomes
Double helix structure of chromosomes
Chromosomes: genes and inter-genic regions
Genes: encode proteins
25
Single Nucleotide Polymorphism (SNP)
Human genomes are very similar
SNP: single base with high probability of variation
Bi-allelic: two out of four possible nucleotides
In humans: reduction in size by a factor of ~300
26
Encoding Blocks
Let $s_i$ represent the start columns of blocks
Let $t_i$ represent the number of blocks starting at $s_i$
Cost of encoding model:
Additionally, encoding of probabilities for block haplotypes
27
Encoding Blocks
Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row
Cost for explanation of row r:
Cost for explanation:
28
DNA
Building blocks (nucleotides): Adenine (A), Cytosine (C), Guanine (G) and Thymine (T)
Adenine (A) pairs with Thymine (T)
Cytosine (C) pairs with Guanine (G)
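The pairing rules amount to a four-entry lookup table. The following minimal Python sketch computes the paired strand position by position, without reversing it; the names `PAIRS` and `complement` are illustrative.

```python
# Base-pairing rules: A pairs with T, C pairs with G.
PAIRS = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base that pairs with each nucleotide, position by position."""
    return "".join(PAIRS[base] for base in strand)


paired = complement("ACGTACCGTATATA")
```

Because the pairing is symmetric, applying `complement` twice returns the original strand.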
29
Haplotypes
Contiguous regions of correlated genetic variation
Two models: Blocks and Motifs
Blocks: popular and widely assumed [Daly et al. 2000]; boundary-aligned ‘block haplotypes’
Motifs: recently introduced [Schwartz 2003]; overlapping ‘haplotype motifs’
30
Comparison of Blocks and Motifs
Two models: haplotype blocks [Daly et al. 2000] and haplotype motifs [Schwartz 2003]
31
Recent Article – dogs helping humans