Evaluation of the Haplotype Motif Model using the Principle of Minimum Description Srinath Sridhar, Kedar Dhamdhere, Guy E. Blelloch, R. Ravi and Russell Schwartz Computer Science Dept, Tepper School of Business and Dept of Biological Sciences. Carnegie Mellon University.
Motivation Individual characteristics are influenced by genetic factors. Model the structure of correlated genetic variation (haplotypes) in the DNA. Extract haplotype patterns (motifs) from the model. Perform association studies comparing motifs to susceptibility to diseases.
Single Nucleotide Polymorphism (SNP) Rows: individual samples. Columns: nucleotides.
ACCTGTATACGTA
ACATGTAGACGGA
ACCTGTAGACGGA
ACATGTATACGTA
ACCTGTAGACGGA
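As a small illustration of how such a sample matrix reduces to SNP data, the sketch below finds the columns of the five example rows where more than one nucleotide occurs and encodes them as a binary matrix. The helper names `snp_columns` and `to_binary` are hypothetical, and using the first row as the reference allele is an assumption made only for this example.

```python
rows = [
    "ACCTGTATACGTA",
    "ACATGTAGACGGA",
    "ACCTGTAGACGGA",
    "ACATGTATACGTA",
    "ACCTGTAGACGGA",
]

def snp_columns(rows):
    """Return 0-based indices of columns where more than one nucleotide occurs."""
    return [j for j in range(len(rows[0])) if len({r[j] for r in rows}) > 1]

def to_binary(rows, cols):
    """Encode each SNP column as 0 (allele of the first row) or 1 (other allele).

    Assumption for this sketch: every SNP is bi-allelic and the first row
    serves as the reference.
    """
    ref = rows[0]
    return [[0 if r[j] == ref[j] else 1 for j in cols] for r in rows]

cols = snp_columns(rows)        # columns that vary across the samples
matrix = to_binary(rows, cols)  # one binary row per individual
print(cols)
print(matrix)
```

Only three of the thirteen columns vary here, so the binary matrix is much smaller than the raw sequences.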
Related Article
Evolution Two types of events: mutation and recombination. Mutation (one strand of one chromosome shown): ACGTACCGTATATA → ACGTACTGTATATA. Recombination (one strand of two homologous chromosomes shown): ACGTACCGTATATA + GTACTACGTACGTA → ACGTACCGTACGTA + GTACTACGTATATA.
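The two event types can be reproduced on the slide's own sequences with a few lines of code. This is a minimal sketch: the function names `mutate` and `recombine` are hypothetical, and the mutation position (6) and crossover point (9) are inferred from comparing the sequences, not stated in the source.

```python
def mutate(seq, pos, base):
    """Point mutation: replace the nucleotide at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]

def recombine(a, b, cut):
    """Single crossover: the two homologous strands swap suffixes at cut."""
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

# Mutation example from the slide (C -> T at position 6, inferred).
mutated = mutate("ACGTACCGTATATA", 6, "T")

# Recombination example from the slide (crossover after 9 bases, inferred).
child1, child2 = recombine("ACGTACCGTATATA", "GTACTACGTACGTA", 9)

print(mutated)
print(child1, child2)
```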
Recombination [Figure: Ancestral Sequences; Current Population]
Comparison of blocks and motifs [Figure panels: Blocks [Daly et al. 2000]; Motifs [Schwartz 2003]]
Minimum Description Length (MDL) Let: M represent the parameters of the model; I represent the input matrix; E be the explanation of I using M; L be the length of encoding. Objective: Minimize L(M) + L(E(I)|M). Complicated models are penalized, which prevents over-fitting.
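To make the two-part objective concrete, the sketch below computes L(M) + L(E(I)|M) for a binary matrix under a block partition. The encoding is a toy assumption and not the authors' scheme: the model pays one bit per column for every distinct haplotype in a block, and the explanation pays -log2(frequency) bits for each row's haplotype in each block. The function name `description_length` is hypothetical.

```python
from collections import Counter
from math import log2

def description_length(rows, boundaries):
    """Total bits L(M) + L(E(I)|M) when columns are split at `boundaries`.

    Toy encoding (an assumption of this sketch):
      model       = distinct haplotypes per block, spelled out bit by bit
      explanation = -log2(empirical frequency) per row per block
    """
    model_bits = 0.0
    explain_bits = 0.0
    edges = [0] + list(boundaries) + [len(rows[0])]
    for lo, hi in zip(edges, edges[1:]):
        counts = Counter(r[lo:hi] for r in rows)
        model_bits += len(counts) * (hi - lo)
        explain_bits += sum(-c * log2(c / len(rows)) for c in counts.values())
    return model_bits + explain_bits

rows = ["0000", "0000", "1111", "1111"]
one_block = description_length(rows, [])           # all columns in one block
four_blocks = description_length(rows, [1, 2, 3])  # every column its own block
print(one_block, four_blocks)
```

On this input the single block is cheaper, because treating perfectly correlated columns as independent forces the explanation to repeat the same choice four times.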
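Dynamic Program - Blocks Dynamic Program [Koivisto et al. 2003]: best(i) = min over 0 ≤ j < i of [ best(j) + C(j+1, i) ], with best(0) = 0, where C(j+1, i) is the cost of creating a single block from j+1 to i. Running time: O(n^2). Work space: O(n). [Figure: the best solution for columns 1..j followed by a single block ending at column i]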
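The dynamic program over block boundaries (best partition of a prefix plus one final block) can be sketched as follows. The cost function `block_cost` is a toy stand-in chosen for this example, not the encoding cost used by the authors, and the code uses 0-based half-open column ranges where the slide writes C(j+1, i). The O(n^2) bound counts evaluations of the cost function; this naive `block_cost` is not constant time.

```python
from collections import Counter
from math import log2

def block_cost(rows, lo, hi):
    """Toy cost in bits of one block over columns lo..hi-1 (assumed encoding)."""
    counts = Counter(r[lo:hi] for r in rows)
    model = len(counts) * (hi - lo)
    explain = sum(-c * log2(c / len(rows)) for c in counts.values())
    return model + explain

def best_blocks(rows):
    """Return (minimum total cost, list of (start, end) blocks)."""
    n = len(rows[0])
    best = [0.0] + [float("inf")] * n  # best[i]: min cost of columns 0..i-1
    back = [0] * (n + 1)               # back[i]: start of the last block
    for i in range(1, n + 1):
        for j in range(i):
            cost = best[j] + block_cost(rows, j, i)
            if cost < best[i]:
                best[i], back[i] = cost, j
    # Walk the back-pointers to recover the block boundaries.
    blocks, i = [], n
    while i > 0:
        blocks.append((back[i], i))
        i = back[i]
    return best[n], blocks[::-1]

cost, blocks = best_blocks(["0000", "0011", "1100", "1111"])
print(cost, blocks)
```

In this input the first two columns are perfectly correlated, as are the last two, while the two pairs vary independently, so the program places a boundary in the middle.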
Expectation-Maximization Algorithm - Motifs
1. Create a DAG of all possible motifs with a ‘start’ vertex
2. Initialize probabilities
3. For each EM iteration
   i. For each row r in R
      a. In the sub-graph corresponding to r, find the ML path from start
   ii. Re-normalize probabilities based on the number of times the vertices were used in ML paths
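The steps above can be sketched as a hard-EM loop. This is an illustrative reading under stated assumptions, not the authors' implementation: a motif is taken to be a triple (start, end, pattern) observed in some row, probabilities are normalized among motifs that share a start column, a small constant `eps` keeps every probability positive, and `max_len` caps the motif length. All function names are hypothetical.

```python
from collections import defaultdict
from math import log

def candidate_motifs(rows, max_len):
    """Step 1: every substring of length <= max_len seen in some row."""
    motifs = set()
    for r in rows:
        for s in range(len(r)):
            for e in range(s + 1, min(len(r), s + max_len) + 1):
                motifs.add((s, e, r[s:e]))
    return motifs

def ml_path(row, prob, max_len):
    """Step 3.i.a: most likely chain of motifs covering row from column 0."""
    n = len(row)
    best = [0.0] + [float("-inf")] * n  # best log-probability up to column i
    back = [None] * (n + 1)
    for e in range(1, n + 1):
        for s in range(max(0, e - max_len), e):
            m = (s, e, row[s:e])
            score = best[s] + log(prob[m])
            if score > best[e]:
                best[e], back[e] = score, m
    path, e = [], n
    while e > 0:
        path.append(back[e])
        e = back[e][0]
    return path[::-1]

def hard_em(rows, max_len=3, iterations=10, eps=1e-3):
    motifs = candidate_motifs(rows, max_len)
    by_start = defaultdict(list)
    for m in motifs:
        by_start[m[0]].append(m)
    # Step 2: uniform initialization among motifs sharing a start column.
    prob = {m: 1.0 / len(by_start[m[0]]) for m in motifs}
    for _ in range(iterations):
        counts = defaultdict(float)
        for r in rows:
            for m in ml_path(r, prob, max_len):
                counts[m] += 1
        # Step 3.ii: re-normalize from usage counts (eps avoids zeros).
        for group in by_start.values():
            total = sum(counts[m] + eps for m in group)
            for m in group:
                prob[m] = (counts[m] + eps) / total
    return prob

prob = hard_em(["0000", "0000", "1111", "1111"], max_len=4)
print(ml_path("0000", prob, 4))
```

Because each row's ML path is found independently, the motifs chosen for different rows need not be jointly optimal.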
Example
Example - Re-normalize
Heuristics EM finds P_1 and P_2, but cost(P_1) + cost(P_2) ≥ cost(P'_1 ∪ P'_2). Use knowledge from the previous EM iteration. Multiple shortest paths with weight (1+ε)^(-cost(P)). Addition of small constants to prevent zero probability in the first few iterations. Initialize the probabilities to favor smaller motifs. Restrict the maximum length of motifs.
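One way to read the multiple-shortest-paths heuristic is that several low-cost paths share a row's count in proportion to their weights, so that near-optimal alternatives are not discarded. The sketch below only illustrates that weighting; the function name `path_weights`, the normalization to sum to one, and the example costs are assumptions of this sketch.

```python
def path_weights(costs, eps=0.1):
    """Weight each path by (1 + eps) ** (-cost) and normalize to sum to 1."""
    raw = [(1 + eps) ** (-c) for c in costs]
    total = sum(raw)
    return [w / total for w in raw]

# Three candidate paths for one row: two tied for cheapest, one 2 bits dearer.
weights = path_weights([10, 10, 12], eps=0.1)
print(weights)
```

A smaller ε flattens the weights toward uniform, while a larger ε concentrates them on the cheapest paths.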
Experimental Results Table columns: Num Seqs / Recomb Rate; Num SNPs; Desc Length, Motifs (in bits); Desc Length, Blocks (in bits). Four rows at low and four at high recombination rate, beginning with 100 sequences / low (numeric entries not preserved). Simulated data using the ms program [Hudson, 2002].
Experimental Results [Figure panels: Motifs, high recombination; Blocks, high recombination; Motifs, low recombination; Blocks, low recombination]
Conclusion Characterized the problem of inferring haplotype structure as an optimization problem that is robust against over-fitting. The haplotype motif model captures the structure better than haplotype blocks. Furthermore, the motif method performs progressively better with larger input size.
Discussion & Future Work Extensions: polynomial time algorithm/NP-hardness; clustering and error models; real data – recombination hot-spots. Future directions [Diagram labels: Genotype data; Haplotype data; Motifs/Blocks/?; Disease analysis/Drug design; phasing; direct optimization; htSNP, association tests, ?; current work]
Encoding Motifs Let s_i be the start locations of motifs. Let t_{i,j} be the number of motifs that start at i and end at j. Let E_i = {e_{i,1}, …, e_{i,k}} be the ordered set of end locations for motifs that start at i. Cost for encoding model: Additional cost for encoding motif probabilities
Explanation Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row Cost for explanation of row r : Cost for explanation:
Human Genetic Structure Chromosomes in the nucleus of cells 23 pairs of chromosomes Double helix structure of chromosomes Chromosomes: Genes and inter-genic regions Genes: Encode for proteins
Single Nucleotide Polymorphism (SNP) Human genomes are very similar. SNP: a single base with a high probability of variation. Bi-allelic: two out of the four possible nucleotides occur. In humans, restricting attention to SNPs reduces the data size by a factor of ~300.
Encoding Blocks Let s_i represent the start columns of blocks and t_i represent the number of blocks starting at s_i. Cost of encoding model: Additionally, encoding for probabilities for block haplotypes
Encoding Blocks Explanation of a row: specify the ordered set of block haplotypes that produce the bits of the row Cost for explanation of row r : Cost for explanation:
DNA Building blocks (nucleotides): Adenine(A), Cytosine(C), Guanine(G) and Thymine(T) Adenine(A) pairs with Thymine(T) Cytosine(C) pairs with Guanine(G)
Haplotypes Contiguous regions of correlated genetic variation Two models: Blocks and Motifs Blocks: Popular and widely assumed [Daly et al. 2000] Boundary aligned ‘block haplotypes’ Motifs: Recently introduced [Schwartz 2003] Overlapping ‘haplotype motifs’
Comparison of Blocks and Motifs Two models: Haplotype blocks[Daly et al. 2000] and haplotype motifs [Schwartz 2003]
Recent Article – dogs helping humans