Finding Bit Patterns: Applying haplotype models to association study design. Natalie Castellana, Kedar Dhamdhere, Russell Schwartz. August 16, 2005.



Problem: Applying haplotype models. Input: a matrix of bits (shown as a figure on the slide). Output: a set of recurring patterns of the form (start column, end column, pattern), e.g. (14,17,“0010”).
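A pattern of this form can be checked against the rows directly. The sketch below is illustrative only: the name rows_matching and the toy rows are made up, and columns are assumed to be 1-indexed and inclusive, which is what the (14,17,“0010”) example suggests (four columns, four bits).

```python
from typing import List, Tuple

# (start column, end column, bit string); columns assumed 1-indexed, inclusive.
Pattern = Tuple[int, int, str]


def rows_matching(rows: List[str], pattern: Pattern) -> List[int]:
    """Return the 0-based indices of rows that carry `pattern` at its columns."""
    start, end, bits = pattern
    if end - start + 1 != len(bits):
        raise ValueError("pattern length must equal end - start + 1")
    return [i for i, row in enumerate(rows) if row[start - 1:end] == bits]


# Made-up toy input: three haplotypes over 17 SNP columns.
rows = [
    "10110100101100010",
    "00011101010010010",
    "11100010110101101",
]
matches = rows_matching(rows, (14, 17, "0010"))
print(matches)
```

A pattern that recurs in many rows is a candidate for the output set.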

Background. Terms: major allele, minor allele, SNP, haplotype. Association test: given that this sample has haplotype 1101, does it have the disease?

Genetic Variation. Sources: mutation and recombination (each illustrated with a sequence figure on the slide). Because of recombination, similar genetic variation can be found within closely linked regions.

Data Sets. Input: download from HapMap.org or generate using MS. Steps: apply disease model (giving controls and cases), apply haplotype model, perform association tests.

Testing individual SNPs: Go through each SNP and determine which SNPs accurately predict which samples have the disease and which do not.
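The slides do not say which test statistic was used. As one common choice, the sketch below scores each SNP column with a Pearson chi-square statistic on the 2x2 table of (case, control) against (allele 1, allele 0). The function name snp_chi_square and the toy cases and controls are invented for illustration.

```python
from typing import List


def snp_chi_square(cases: List[str], controls: List[str], col: int) -> float:
    """Pearson chi-square statistic for one SNP column (0-based index)."""
    a = sum(h[col] == "1" for h in cases)     # cases carrying allele 1
    b = len(cases) - a                        # cases carrying allele 0
    c = sum(h[col] == "1" for h in controls)  # controls carrying allele 1
    d = len(controls) - c                     # controls carrying allele 0
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant column carries no information
    return n * (a * d - b * c) ** 2 / denom


# Made-up haplotypes over four SNPs.
cases = ["1101", "1100", "1001", "1111"]
controls = ["0101", "0100", "0001", "1110"]

scores = [snp_chi_square(cases, controls, j) for j in range(4)]
best = max(range(4), key=scores.__getitem__)
print(scores, best)
```

With these toy rows the first SNP separates cases from controls best; a real study would also need a significance threshold and a multiple-testing correction.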

Haplotype block method: Instead of looking at each individual SNP, we can look at groups of contiguous SNPs, e.g. a two-SNP block with haplotypes …11…, …01…, …10…, …00….
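As a rough illustration of the block idea, the sketch below tallies the haplotypes seen in one block of contiguous columns, separately for cases and controls. The helper block_counts, the block boundaries, and the data are hypothetical; how blocks are actually chosen is not covered here.

```python
from collections import Counter
from typing import List


def block_counts(haplotypes: List[str], start: int, end: int) -> Counter:
    """Tally the haplotypes seen in columns start..end (1-indexed, inclusive)."""
    return Counter(h[start - 1:end] for h in haplotypes)


# Made-up haplotypes over four SNPs; the block is assumed to span columns 1-2.
cases = ["1101", "1100", "1001", "1111"]
controls = ["0101", "0100", "0001", "1110"]

case_counts = block_counts(cases, 1, 2)
control_counts = block_counts(controls, 1, 2)
print(case_counts, control_counts)
```

The two tallies form a contingency table with one column per block haplotype, which can then be tested for association in the same way as a single SNP.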

Haplotype motif method: A sequence is treated as a concatenation of segments, as in the block method, but the segment boundaries need not be conserved across sequences.

Approximation Algorithm. General idea: pick the best partition, minimizing the number of motifs needed to explain all the data. (The slide's figure, labeled with the parameter c, is not preserved.)

Finding Motifs (figure not preserved; surviving labels: C, 111…111)

Problems: Really, really, really slow. Took over a week to partition our biggest data set. Added a ‘max leaves explored’ feature. Useless for larger c.

Real Data

Simulated Data

False Positives

General Linear Program. Objective function: minimize x + y + z. Constraints (coefficients partly lost in the transcript): x + y <= x 2 x +2z <= * y <= 5 z. Bounds: 0 <= x <= 3; 0 <= y <= Inf; -Inf <= z <= 0.

A Linear Program. Input: a matrix with M rows and N columns. Output: the minimum number of motifs.

Variables. X’s: each x corresponds to a motif. Define a motif by a tuple: (start column, end column, string pattern). Y’s: each y corresponds to a row partition. Define a row partition by a set of motifs that tile the row: {(1,e1,“…”),(e1+1,e2,“…”),...,(en+1,N,“…”)}
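To make the two kinds of variables concrete, the sketch below enumerates every motif and every row partition of a single row. The function names are made up, columns are taken as 1-indexed and inclusive, and the row "10001101" is an assumption, inferred by concatenating the motifs of the partition (1,2,“10”),(3,3,“0”),(4,8,“01101”) that the slides use as an example.

```python
from itertools import combinations
from typing import Iterator, List, Tuple

Motif = Tuple[int, int, str]  # (start column, end column, string pattern)


def motifs_of_row(row: str) -> List[Motif]:
    """All contiguous substrings of `row`, tagged with their columns."""
    n = len(row)
    return [(s, e, row[s - 1:e]) for s in range(1, n + 1) for e in range(s, n + 1)]


def partitions_of_row(row: str) -> Iterator[Tuple[Motif, ...]]:
    """All ways to cut `row` into consecutive motifs."""
    n = len(row)
    for k in range(n):  # k = number of cut points, 0 .. n-1
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield tuple((a + 1, b, row[a:b]) for a, b in zip(bounds, bounds[1:]))


row = "10001101"  # assumed example row
motifs = motifs_of_row(row)
parts = list(partitions_of_row(row))
print(len(motifs), len(parts))
```

Each element of motifs would get an x variable and each element of parts a y variable for this row.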

Constraints: Exactly one partition must be chosen per row. If a motif used in a row partition is not chosen, then the row partition may not be chosen. Objective: minimize the sum of all X’s.
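Written out, these constraints and the objective might look as follows. This is one reading of the slides, not a formulation taken from them: the sets \mathcal{P}_i (partitions of row i) and \mathcal{K}_i (motifs appearing in row i, so |\mathcal{K}_i| = K_i) are notation introduced here, the y variables are indexed per row, and the 0 to 1 bounds assume the linear relaxation (requiring integer values would give the exact problem).

```latex
\begin{aligned}
\text{minimize}\quad & \sum_{k=1}^{K} x_k \\
\text{subject to}\quad
  & \sum_{p \in \mathcal{P}_i} y_{i,p} = 1
    && i = 1,\dots,M \\
  & x_k - \sum_{p \in \mathcal{P}_i \,:\, k \in p} y_{i,p} \ge 0
    && i = 1,\dots,M,\ k \in \mathcal{K}_i \\
  & 0 \le x_k \le 1, \quad 0 \le y_{i,p} \le 1
\end{aligned}
```

The first family of constraints picks exactly one partition per row. The second forces x_k up to 1 whenever a chosen partition of row i uses motif k.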

Example. X’s: (1,1,“1”), (1,2,“10”), (1,3,“100”), etc. Y’s: {(1,1,“1”),(1,8,“ ”)}; {(1,2,“10”),(3,3,“0”),(4,8,“01101”)}

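Constraint Matrix (1): Exactly one row partition must be chosen per row. Columns: all X’s, i.e. (1,1,“1”), (1,1,“0”), …, (1,2,“10”), …, followed by all Y’s, i.e. Y_1, Y_2, …. There is one constraint per data row, Row 1 through Row M, each with right-hand side = 1. Y_1 := (1,1,“1”),(1,8,“ ”); Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”). (The matrix entries are not preserved.)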

Constraint Matrix (2): If a motif used in a row partition is not chosen, then the row partition may not be chosen. Columns: all X’s, i.e. (1,1,“1”), (1,1,“0”), …, (1,2,“10”), …, followed by all Y’s, i.e. Y_1, Y_2, …. For Row i there is one constraint per motif, each with right-hand side >= 0. Only the entries under the first two X columns survive:
(1,1,“1”): 1 0 …
(1,2,“10”): 0 0 …
(1,3,“100”): 0 0 …
…
(8,8,“1”): 0 0 …
Y_1 := (1,1,“1”),(1,8,“ ”); Y_2 := (1,2,“10”),(3,3,“0”),(4,8,“01101”)

Constraint Matrix: Columns 1 to K are the x’s; columns K+1 to K+P are the y’s. Constraint 1 (rows 1 to M): each row == 1. Constraint 2 (K_i rows for each row i, i = 1 to M): each row >= 0, with -1 entries in the y columns. Here K is the number of unique motifs, K_i is the number of motifs appearing in row i, and P is the number of unique partitions.

Problems: Each row has N(N+1)/2 motifs, so there will be a polynomial number of X’s. Good! Each row can be partitioned in 2^(N-1) ways, so there will be an exponential number of Y’s. Bad! Solution: column generation.
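Both counts follow from short arguments. A motif is fixed by its start column s and end column e with s <= e, and a partition is fixed by choosing, for each of the N-1 gaps between adjacent columns, whether to cut there. The N = 8 line is a worked instance for an eight-column row.

```latex
\begin{aligned}
\#\text{motifs per row} &= \sum_{s=1}^{N} (N - s + 1) = \frac{N(N+1)}{2}, \\
\#\text{partitions per row} &= 2^{N-1}, \\
N = 8:\quad & \frac{8 \cdot 9}{2} = 36 \text{ motifs}, \qquad 2^{7} = 128 \text{ partitions}.
\end{aligned}
```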

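Column generation: We find the optimal solution to the restricted problem, which contains all X’s and only some of the Y’s. Then we check whether adding any further Y’s would improve the solution. If so, we add them and re-solve; if not, the current solution is also optimal for the full linear program.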
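The slides do not describe how the check for an improving Y is done. One standard approach, sketched here under stated assumptions, uses the dual values of the restricted LP. If sigma is the dual of a row's "exactly one partition" constraint and price[m] >= 0 is the dual of that row's constraint for motif m, then a partition p has reduced cost (sum of price[m] over m in p) minus sigma, and it can improve the solution only if that value is negative. Finding the partition with the smallest price sum is a shortest-path problem over column positions. All names and numbers below are hypothetical, and obtaining the duals from an LP solver is outside the sketch.

```python
from typing import Dict, List, Tuple

Motif = Tuple[int, int, str]  # (start column, end column, string pattern)


def cheapest_partition(row: str, price: Dict[Motif, float]) -> Tuple[float, List[Motif]]:
    """Find the partition of `row` whose motifs have the smallest total price.

    best[e] is the cheapest way to cover columns 1..e. Motifs missing from
    `price` are treated as costing 0.0.
    """
    n = len(row)
    best = [0.0] + [float("inf")] * n
    back = [0] * (n + 1)  # back[e] = start column of the last motif ending at e
    for e in range(1, n + 1):
        for s in range(1, e + 1):
            cost = best[s - 1] + price.get((s, e, row[s - 1:e]), 0.0)
            if cost < best[e]:
                best[e], back[e] = cost, s
    parts: List[Motif] = []
    e = n
    while e > 0:  # walk the back-pointers from the last column to the first
        s = back[e]
        parts.append((s, e, row[s - 1:e]))
        e = s - 1
    return best[n], parts[::-1]


# Made-up dual values for a three-column row.
row = "101"
price = {
    (1, 1, "1"): 0.2, (2, 2, "0"): 0.5, (3, 3, "1"): 0.2,
    (1, 2, "10"): 0.4, (2, 3, "01"): 0.9, (1, 3, "101"): 1.0,
}
sigma = 0.8  # hypothetical dual of this row's "exactly one partition" constraint

cost, parts = cheapest_partition(row, price)
improves = cost - sigma < 0  # negative reduced cost: worth adding this Y
print(cost, parts, improves)
```

The dynamic program examines N(N+1)/2 motifs per row instead of all 2^(N-1) partitions, which is what makes the check practical.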

Where are we now? Where are we going?