1
Finding Bit Patterns
Applying haplotype models to association study design
Natalie Castellana, Kedar Dhamdhere, Russell Schwartz
August 16, 2005
2
Problem: Applying haplotype models
Input:
10000010100010010
00010100101101001
01101101001000010
10101011111000010
Output: a set of recurring patterns of the form (start column, end column, pattern), for example (14,17,“0010”)
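The input and output on this slide can be illustrated with a small brute-force sketch. It is not the method developed later in the talk; it simply counts, for every column range, how many rows carry the same bits there. The function name `recurring_patterns` and the thresholds `min_rows` and `min_len` are assumptions made for this illustration.

```python
from collections import Counter

def recurring_patterns(rows, min_rows=2, min_len=1):
    """Return {(start, end, pattern): count} for patterns seen in >= min_rows rows.

    Columns are 1-indexed and inclusive, matching the slide's
    (start column, end column, pattern) convention.
    """
    n = len(rows[0])
    counts = Counter()
    for row in rows:
        for start in range(n):
            for end in range(start + min_len, n + 1):
                counts[(start + 1, end, row[start:end])] += 1
    return {key: c for key, c in counts.items() if c >= min_rows}

# The four input rows shown on the slide.
rows = [
    "10000010100010010",
    "00010100101101001",
    "01101101001000010",
    "10101011111000010",
]
# min_rows and min_len are illustrative thresholds, not values from the talk.
found = recurring_patterns(rows, min_rows=3, min_len=4)
print(found.get((14, 17, "0010")))  # → 3
```

On the slide's data, the pattern (14,17,“0010”) appears in three of the four rows; the second row has “1001” in those columns.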
3
Background
[Figure: the sequence 1000011010110100000010 annotated with the labels SNP, Haplotype, Major Allele, and Minor allele]
Association Test: Given that this sample has haplotype 1101, does it have the disease?
4
Genetic Variation
…1110101… …1000011…
Mutation: …1000001…
Recombination: …1110011… …1000101… …1001001…
Because of recombination, similar genetic variation can be found within closely linked regions.
5
Data Sets
Download from HapMap.org or generate using MS.
Input:
1001001010110
1001001110100
0110010110100
1000111010010
Steps: Apply Disease Model, Apply Haplotype Model, Perform Association Tests.
Cases and Controls:
10010011101
10010010101
10001110100
01100101101
6
Testing individual SNPs
Go through each SNP and determine which SNPs accurately predict which samples have the disease and which do not.
Case: 0 0 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 1 1 0 0 0
Control: 0 0 0 0 1 0 1 0 0 0 1 0 0 1 1 0 1 1 1 0 0 0 0 1
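One common way to score how well a single SNP separates cases from controls is a Pearson chi-square statistic on the 2x2 table of allele counts. The slides do not say which statistic was used, so the choice here is an assumption, and the toy data below is hypothetical rather than taken from the slide.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant column carries no information
    return n * (a * d - b * c) ** 2 / denom

def snp_scores(cases, controls):
    """Score each SNP column by how strongly its allele separates cases from controls."""
    scores = []
    for j in range(len(cases[0])):
        case_minor = sum(row[j] for row in cases)
        ctrl_minor = sum(row[j] for row in controls)
        scores.append(chi_square_2x2(
            case_minor, len(cases) - case_minor,
            ctrl_minor, len(controls) - ctrl_minor,
        ))
    return scores

# Hypothetical toy data (not from the slides): rows are samples, columns are SNPs.
cases = [[1, 0, 1], [1, 1, 0], [1, 0, 0]]
controls = [[0, 0, 1], [0, 1, 0], [0, 0, 0]]
scores = snp_scores(cases, controls)
print(scores)  # → [6.0, 0.0, 0.0]
```

In this toy data only the first SNP differs between the two groups, so it is the only column with a non-zero score.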
7
Haplotype block method
Instead of looking at each individual SNP, we can look at groups of contiguous SNPs.
1101000000…11…
1101100100…01…
0111000000…10…
1101100100…00…
8
Haplotype motif method
Each sequence is treated as a concatenation of segments, as in the block method, but the segment boundaries need not be conserved across sequences.
1101000000…
1100100100…
0111000000…
1101100111…
9
Approximation Algorithm
General idea:
10000100…………………………………
00011100…………………………………
11011110…………………………………
01010110…………………………………
cccccccc
Pick the best partition, minimizing the number of motifs needed to explain all the data.
10
Finding Motifs
[Figure: binary tree over patterns of length C, branching on 0 and 1 at each level, with leaves running from 000…000 and 000…100 through 111…111]
11
Problems
Really, really, really slow.
Took over a week to partition our biggest data set.
Added a ‘max leaves explored’ feature.
Useless for larger c.
12
Real Data
13
Simulated Data
14
False Positives
15
General Linear Program
Objective function: minimize x + y + z
Constraints:
  x + y  <= 2
  x + 2z <= 5
In matrix form:
  [1 1 0]   [x]    [2]
  [1 0 2] * [y] <= [5]
            [z]
Bounds:
  0 <= x <= 3
  0 <= y <= Inf
  -Inf <= z <= 0
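The matrix form can be read as data: a coefficient matrix, a right-hand side, bound vectors, and a cost vector. The sketch below encodes the slide's example that way and checks candidate points by hand; it does not call an LP solver. The helper names `is_feasible` and `objective` are assumptions made for this illustration.

```python
import math

# The example from the slide, written as plain lists.
A = [[1, 1, 0],          # x + y  <= 2
     [1, 0, 2]]          # x + 2z <= 5
b = [2, 5]
lower = [0, 0, -math.inf]
upper = [3, math.inf, 0]
c = [1, 1, 1]            # minimize x + y + z

def is_feasible(v):
    """Check the bounds and every row of A * v <= b for the point v = [x, y, z]."""
    in_bounds = all(lo <= vi <= hi for lo, vi, hi in zip(lower, v, upper))
    rows_ok = all(sum(a * vi for a, vi in zip(row, v)) <= bi
                  for row, bi in zip(A, b))
    return in_bounds and rows_ok

def objective(v):
    """Value of the cost function c . v."""
    return sum(ci * vi for ci, vi in zip(c, v))

print(is_feasible([0, 0, 0]), objective([0, 0, 0]))          # → True 0
print(is_feasible([3, 0, 0]))                                # → False
print(is_feasible([0, 0, -1000]), objective([0, 0, -1000]))  # → True -1000
```

The last line shows that, as transcribed, this example has no finite minimum: z may decrease without bound while every constraint stays satisfied. The example still serves to show the general form.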
16
A Linear Program
Input: A matrix with M rows and N columns
Output: The minimum number of motifs.
17
Variables
X’s: each x corresponds to a motif.
Define a motif by a tuple: (start column, end column, string pattern)
Y’s: each y corresponds to a row partition.
Define a row partition by a set of motifs: {(1,e1,“…”),(e1+1,e2,“…”),...,(en+1,N,“…”)}
18
Constraints
Exactly one partition must be chosen per row.
If a motif used in a row partition is not chosen, then the row partition may not be chosen.
Objective: minimize the sum of all X’s.
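For very small inputs, the effect of these constraints can be checked by exhaustive search: pick one partition per row, collect every motif those partitions use, and keep the choice that uses the fewest. This is a brute-force sketch of the integer version of the problem, not the linear program itself, and the names `partitions` and `min_motifs` are assumptions made for this illustration.

```python
from itertools import product

def partitions(row):
    """Yield every split of row into contiguous motifs (start, end, pattern), 1-indexed."""
    n = len(row)
    for cuts in product([False, True], repeat=n - 1):
        bounds = [0] + [i + 1 for i, cut in enumerate(cuts) if cut] + [n]
        yield tuple((s + 1, e, row[s:e]) for s, e in zip(bounds, bounds[1:]))

def min_motifs(rows):
    """Brute force: choose one partition per row so the union of motifs is smallest."""
    best = None
    for choice in product(*(list(partitions(r)) for r in rows)):
        motifs = set().union(*choice)
        if best is None or len(motifs) < len(best):
            best = motifs
    return best

# Hypothetical input: all eight 3-bit rows.
rows = ["".join(bits) for bits in product("01", repeat=3)]
print(len(min_motifs(rows)))  # → 6
```

Treating each of the eight rows as a single motif would need eight motifs; sharing shorter motifs across rows brings the count down to six. The search grows exponentially, so it is only practical for toy inputs.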
19
Example
Row: 10001101
X’s: (1,1,“1”),(1,2,“10”),(1,3,“100”), etc.
Y’s:
(1,1,“1”),(2,8,“0001101”)
(1,2,“10”),(3,3,“0”),(4,8,“01101”)
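A row partition is only meaningful if its motifs tile the row from the first column to the last with no gaps or overlaps, and each pattern matches the row's bits in its column range. The following sketch checks that for the example row; `is_valid_partition` is a hypothetical helper name. The pattern “0001101” has seven bits, so in this sketch it occupies columns 2 through 8.

```python
def is_valid_partition(row, partition):
    """True if the motifs tile columns 1..len(row) in order and match the row's bits."""
    next_col = 1
    for start, end, pattern in partition:
        if start != next_col or end < start:
            return False          # gap, overlap, or empty range
        if row[start - 1:end] != pattern:
            return False          # pattern disagrees with the row
        next_col = end + 1
    return next_col == len(row) + 1

row = "10001101"
y1 = [(1, 1, "1"), (2, 8, "0001101")]
y2 = [(1, 2, "10"), (3, 3, "0"), (4, 8, "01101")]
print(is_valid_partition(row, y1), is_valid_partition(row, y2))  # → True True
```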
20
Constraint Matrix (1)
Exactly one row partition must be chosen per row.

         all X’s                               all Y’s
         (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)     Y_1  Y_2  …
Row 1    0          0          …  0              1    1    …    = 1
Row 2    0          0          …  0              0    0    …    = 1
Row 3    0          0          …  0              1    1    …    = 1
…
Row M    0          0             0                             = 1

Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)
21
Constraint Matrix (2)
If a motif used in a row partition is not chosen, then the row partition may not be chosen.

Row i:        all X’s                               all Y’s
              (1,1,“1”)  (1,1,“0”)  …  (1,2,“10”)     Y_1  Y_2  …
(1,1,“1”)     1          0          …  0              -1   0    …    >= 0
(1,2,“10”)    0          0          …  1              0    -1   …    >= 0
(1,3,“100”)   0          0          …  0              0    0    …    >= 0
…             …          …          …  …              …    …
(8,8,“1”)     0          0          …  0              0    0         >= 0

Y_1 := (1,1,“1”),(2,8,“0001101”)
Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)
22
Constraint Matrix

                x’s (columns 1 to K)      y’s (columns K+1 to K+P)
** Constraint 1 **
1               0 0 0 0 0 … 0 0 0 0       1 1 1 0 0 0 0 … 0           == 1
2               0 0 0 0 0 … 0 0 0 0       1 0 0 1 1 1 0 … 0           == 1
…
M               0 0 0 0 0 … 0 0 0 0       0 0 1 0 0 0 1 … 1           == 1
** Constraint 2 **
1   1           1 0 0 0 0 … 0 0 0 0       -1 0 0 0 … 0 0              >= 0
    2           0 1 0 0 0 … 0 0 0 0       -1 -1 0 0 … -1 0            >= 0
    …
    K_1         0 0 1 0 0 … 0 0 0 0       0 0 0 0 … 0 0               >= 0
…
M

Where K is the number of unique motifs, K_i is the number of motifs appearing in row i, and P is the number of unique partitions.
23
Problems
Each row has N(N+1)/2 motifs. So there will be a polynomial number of X’s. Good!
Each row can be partitioned in 2^(N-1) ways. So there will be an exponential number of Y’s. Bad!
Solution: column generation
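Both counts are easy to confirm on a short row. The sketch below enumerates the motifs of a row directly and counts partitions with a small recurrence; the function names are assumptions made for this illustration.

```python
def motifs_of(row):
    """All (start, end, pattern) motifs of one row, with 1-indexed inclusive columns."""
    n = len(row)
    return [(s + 1, e, row[s:e]) for s in range(n) for e in range(s + 1, n + 1)]

def count_partitions(n):
    """Number of ways to cut n columns into contiguous segments."""
    ways = [1] + [0] * n            # ways[k]: partitions of the first k columns
    for k in range(1, n + 1):
        # The last segment starts right after some earlier cut point j.
        ways[k] = sum(ways[j] for j in range(k))
    return ways[n]

row = "10001101"
n = len(row)
print(len(motifs_of(row)), n * (n + 1) // 2)    # → 36 36
print(count_partitions(n), 2 ** (n - 1))        # → 128 128
```

For an 8-column row that is 36 motifs against 128 partitions; at 40 columns the same formulas give 820 motifs against 2^39 partitions.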
24
Column generation
We first find the optimal solution to a restricted problem that contains all X’s and only some of the Y’s. Then we check whether adding any of the remaining Y’s would improve the solution.
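The slides do not spell out how the check for improving Y’s is done. Under standard LP duality, a partition column for row i has reduced cost equal to the sum of the dual values of its motif constraints minus the dual value of the row's "exactly one partition" constraint, and it can improve the solution only if that quantity is negative. Finding the partition with the smallest sum does not require listing all 2^(N-1) candidates; a dynamic program over column positions suffices. The sketch below assumes that setup. The names `cheapest_partition`, `mu`, and `pi`, and all dual values, are hypothetical.

```python
def cheapest_partition(row, mu):
    """Minimise the summed motif weights over all partitions of one row.

    mu maps (start, end, pattern) -> non-negative weight; missing motifs weigh 0.
    """
    n = len(row)
    best = [0.0] + [float("inf")] * n   # best[k]: cheapest tiling of columns 1..k
    back = [None] * (n + 1)             # back[k]: last motif of that tiling
    for end in range(1, n + 1):
        for start in range(1, end + 1):
            motif = (start, end, row[start - 1:end])
            cost = best[start - 1] + mu.get(motif, 0.0)
            if cost < best[end]:
                best[end], back[end] = cost, motif
    # Walk the back-pointers to recover the partition, left to right.
    partition, k = [], n
    while k > 0:
        partition.append(back[k])
        k = back[k][0] - 1
    return best[n], partition[::-1]

row = "1011"
# Hypothetical dual values: every motif weighs 1.0 except two cheap ones.
mu = {(s, e, row[s - 1:e]): 1.0 for s in range(1, 5) for e in range(s, 5)}
mu[(1, 2, "10")] = 0.25
mu[(3, 4, "11")] = 0.25
pi = 1.0  # hypothetical dual of this row's "exactly one partition" constraint

cost, partition = cheapest_partition(row, mu)
reduced_cost = cost - pi
print(partition, reduced_cost)  # → [(1, 2, '10'), (3, 4, '11')] -0.5
```

With these made-up duals the reduced cost is negative, so this partition would be added as a new column and the restricted problem solved again. When no row yields a negative reduced cost, the restricted solution is optimal for the full LP.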
25
Where are we now? Where are we going?