Risk Factor Searching Heuristics for SNP Case-Control Studies Dumitru Brinza February 21, 2008 Department of Computer Science & Engineering University.

Presentation transcript:

Risk Factor Searching Heuristics for SNP Case-Control Studies Dumitru Brinza February 21, 2008 Department of Computer Science & Engineering University of California at San Diego

11/28/2007 Outline
- SNPs, Genotypes, Common Complex Diseases
- Disease Association Search in Case-Control Studies
- Computational challenges
- Significance and Reproducibility of RF
- Genetic model / Atomic Risk Factor
- Maximum Odds Ratio Risk Factors
- Exhaustive Search
- Complimentary Greedy Search Algorithm
- K-Relaxed and Weighted Atomic Risk Factor
- WCGS Algorithm for finding K-ARF and W-ARF
- Dataset
- Results
- Conclusions

SNP, Haplotypes, Genotypes
- Human genome: all the genetic material in the chromosomes, 3×10^9 base pairs long
- Differences between any two people occur in 0.1% of the genome
- SNP (single nucleotide polymorphism): a site where two or more different nucleotides occur in a large percentage of the population
- Diploid: two different copies of each chromosome
- Haplotype: description of a single copy (expensive to obtain); alleles are coded 0 for major, 1 for minor
- Genotype: description of the mixed two copies; sites are coded 0 = 00, 1 = 11, 2 = 01
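The genotype coding on this slide (0 = 00, 1 = 11, 2 = 01) can be illustrated with a minimal sketch. The function name `genotype_from_haplotypes` and the sample haplotypes are hypothetical and not taken from the presentation.

```python
def genotype_from_haplotypes(hap_a, hap_b):
    """Combine two haplotypes (0 = major, 1 = minor allele) into one genotype.

    Coding from the slide: 0 = both major (00), 1 = both minor (11),
    2 = heterozygous (01 or 10).
    """
    if len(hap_a) != len(hap_b):
        raise ValueError("haplotypes must cover the same SNP sites")
    genotype = []
    for a, b in zip(hap_a, hap_b):
        # Equal alleles keep their value; different alleles become 2.
        genotype.append(a if a == b else 2)
    return genotype


# Hypothetical haplotype pair over four SNP sites.
print(genotype_from_haplotypes([0, 1, 1, 0], [0, 1, 0, 1]))  # [0, 1, 2, 2]
```

The mapping loses phase information: a genotype value of 2 does not say which copy carries the minor allele, which is why haplotypes are the more expensive description.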

Heritable Common Complex Diseases
Monogenic disease
- A mutated gene is entirely responsible for the disease
- The pathway is broken and there is no compensatory pathway
- Typically rare in the population: < 0.1%
Complex disease
- Interaction of multiple genes
  - One mutation does not cause the disease
  - Breakage of all compensatory pathways causes the disease
  - In cancer, breakage of several cell functions causes the disease, e.g., the cell-growing and cell-checking systems
  - Hard to analyze: a 2-gene interaction analysis for a genome-wide scan with 1 million SNPs needs a test for every pair of SNPs
- Multiple independent causes
  - There are different causes, and each can result from the interaction of several genes
  - Each cause explains a certain percentage of cases
- Common diseases are complex: > 0.1%. In New York City, 12% of the population has Type 2 diabetes
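To make the scale of the 2-gene interaction analysis concrete, the number of unordered SNP pairs can be counted directly. This is only an arithmetic sketch; the helper name `interaction_tests` is hypothetical.

```python
import math


def interaction_tests(num_snps, order):
    """Number of distinct SNP subsets of the given size (one test per subset)."""
    return math.comb(num_snps, order)


# Pairwise tests for a genome-wide scan with 1 million SNPs.
print(interaction_tests(1_000_000, 2))  # 499999500000
```

Roughly 5×10^11 pairwise tests are needed, and the count grows by about another factor of the number of SNPs for each additional interacting gene.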

DA Search in Case/Control Study
- Given: a population of n genotypes, each containing values of m SNPs and a disease status (case or control)
- Find: risk factors (RF) with significantly high odds ratio, i.e., a pattern/dihaplotype significantly more frequent among cases than among controls

Challenges in Disease Association
Computational: scalability
- Interaction of multiple genes/SNPs
  - Too many possibilities, obviously intractable
- Multiple independent causes
  - Each RF may explain only a small portion of the case-control study
Statistical: reproducibility
- Search space / number of possible RFs
  - Adjust for multiple testing
- Search engine complexity
  - Adjust for multiple methods / search complexity

Addressing Challenges in DA
Computational: scalability
- Constrain the model / reduce the search space
  - Negative effect: may miss "true" RFs
- Heuristic search
  - Look for "easy to find" RFs
  - May miss only "maliciously hidden" true RFs
Statistical: reproducibility
- Validate on a different case-control study
  - Obvious but expensive
- Cross-validate within the same study
  - The usual method for prediction validation

Significance of Risk Factors
- Relative risk (RR): cohort study
- Odds ratio (OR): case-control study
- P-value: binomial distribution
- Multiple-testing adjustment of the p-value: more searching means more findings by chance

                   Case                   Control
Have RF            True Positive (TP)     False Positive (FP)
Do not have RF     False Negative (FN)    True Negative (TN)

OR = (TP/FP) × (TN/FN)
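The odds ratio from the 2×2 table can be computed as in the following sketch. The counts in the example are hypothetical, and returning infinity when FP or FN is zero is a convention assumed here, not something stated in the slides.

```python
def odds_ratio(tp, fp, fn, tn):
    """Odds ratio of a risk factor: (TP/FP) * (TN/FN) = TP*TN / (FP*FN).

    tp: cases with the RF, fp: controls with the RF,
    fn: cases without the RF, tn: controls without the RF.
    """
    if fp == 0 or fn == 0:
        # Assumed convention: an empty denominator cell gives an unbounded OR.
        return float("inf")
    return (tp * tn) / (fp * fn)


# Hypothetical study: 100 cases (30 with the RF), 100 controls (10 with the RF).
print(round(odds_ratio(tp=30, fp=10, fn=70, tn=90), 3))  # 3.857
```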

Reproducibility Control
Multiple-testing adjustment
- Bonferroni: adjusted p = (# possibilities) × (unadjusted p)
  - Easy to compute but overly conservative
  - SNPs are linked, which is difficult to take into account
- Randomization, repeated a fixed number of times:
  - Randomly permute the disease status
  - Find the best RF using the same method
  - Adjusted p = fraction of repetitions in which the best RF has a higher OR than the one found
  - Computationally expensive but ideally accurate
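Both adjustments can be sketched in a few lines. Everything named here is an assumption for illustration: `best_single_snp_or` is a toy stand-in for "the same method" (it only scans single SNP values and adds 0.5 to each cell to avoid division by zero), the data are made up, and ties are counted with `>=` rather than a strict "higher".

```python
import random


def bonferroni(p_unadjusted, n_possibilities):
    """Bonferroni adjustment: multiply by the number of possibilities, cap at 1."""
    return min(1.0, p_unadjusted * n_possibilities)


def best_single_snp_or(genotypes, status):
    """Toy search method: best OR over all (SNP, value) pairs."""
    best = 0.0
    for j in range(len(genotypes[0])):
        for v in (0, 1, 2):
            tp = fp = fn = tn = 0
            for g, s in zip(genotypes, status):
                if g[j] == v and s == 1:
                    tp += 1
                elif g[j] == v:
                    fp += 1
                elif s == 1:
                    fn += 1
                else:
                    tn += 1
            # 0.5 is added to every cell so the ratio is always defined.
            best = max(best, ((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))
    return best


def permutation_adjusted_p(genotypes, status, search, n_perm=1000, seed=0):
    """Fraction of status permutations whose best OR reaches the observed one."""
    rng = random.Random(seed)
    observed = search(genotypes, status)
    permuted = list(status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # randomly permute the disease status
        if search(genotypes, permuted) >= observed:
            hits += 1
    return hits / n_perm


# Hypothetical data: 6 genotypes over 2 SNPs, 1 = case, 0 = control.
genotypes = [[1, 0], [1, 2], [1, 0], [0, 2], [0, 0], [0, 2]]
status = [1, 1, 1, 0, 0, 0]
p = permutation_adjusted_p(genotypes, status, best_single_snp_or, n_perm=200)
print(p)
```

Because the search is rerun on every permutation, the cost is the number of repetitions times the cost of one search, which is the "computationally expensive" part noted on the slide.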

Risk/Resistance Factors
- Model used in previous work: a risk/resistance factor is one SNP with a fixed allele value
- Example (figure: genotypes of two controls and five cases): the third SNP with fixed allele value 1 is a risk factor, because its frequency among case individuals is higher than among control individuals; it is present in 5 cases : 1 control

Genetic Model
(Figure: cellular pathway with numbered links leading to an end product)
- Breaking links 1 & 2 does not imply disease, because of compensatory link 3
- Breaking 1 & 2 & 3 implies disease: an "atomic" risk factor (ARF)
- Breaking (1 & 2 & 3) or (4 & 5) implies disease: a "complex" RF
- Several causes of disease (ARFs): 1 & 2 & 3, or 4 & 5
- ARF ↔ multi-SNP combination (MSC)

Multi-SNP Combination and Cluster
- Multi-SNP combination (MSC): a subset of the SNP columns of S (the set of SNPs), with fixed values of these SNPs (0, 1, or 2)
- Cluster: the subset of genotypes with the same MSC
- Example (figure: genotypes of two controls and five cases): the MSC "x x 1 x x 2 x x x" is present in 4 cases : 1 control
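A minimal sketch of the MSC and cluster definitions follows. Representing an MSC as a dictionary from SNP column index to fixed value is a choice made here for illustration, and the genotype data are hypothetical.

```python
def cluster(genotypes, msc):
    """Indices of the genotypes that carry every fixed SNP value of the MSC.

    msc maps a SNP column index to its fixed value (0, 1, or 2);
    columns not listed are the "x" (don't care) positions.
    """
    return [
        i for i, g in enumerate(genotypes)
        if all(g[j] == v for j, v in msc.items())
    ]


# Hypothetical study with 4 SNPs; status 1 = case, 0 = control.
genotypes = [
    [0, 1, 2, 0],  # control
    [1, 1, 2, 0],  # case
    [0, 1, 2, 1],  # case
    [2, 0, 2, 0],  # case
    [0, 1, 0, 0],  # control
]
status = [0, 1, 1, 1, 0]

msc = {1: 1, 2: 2}  # pattern "x 1 2 x"
members = cluster(genotypes, msc)
cases = sum(status[i] for i in members)
print(members, f"{cases} cases : {len(members) - cases} controls")
# [0, 1, 2] 2 cases : 1 controls
```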

MORARF Formulation
Maximum Odds Ratio Atomic Risk Factor (MORARF)
- Given: a genotype case-control study
- Find: the ARF with the maximum odds ratio
- The number of possible RFs is enormous, so the search is constrained to atomic risk factors

Exhaustive Searching Approaches
Exhaustive search (ES)
- For n genotypes with m SNPs there are O(3^k m^k) k-SNP MSCs
Exhaustive Combinatorial Search (CS)
- Drop small (insignificant) clusters
- Search only plausible/maximal MSCs
- Case-closure of an MSC: the MSC extended with the SNP values common to all cases in its cluster; it gives the minimum cluster with the same set of cases
- Example (figure): the MSC "x x 1 x x 2 x x x" is present in 2 cases : 2 controls; its case-closure "x x 1 x x 2 x 0 x" is present in 2 cases : 1 control
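The case-closure operation can be sketched as follows, under the reading that the closure fixes every SNP on which all cases of the cluster agree. The dictionary representation of an MSC and the data are hypothetical; the example is built so that it reproduces the "2 cases : 2 controls" to "2 cases : 1 control" effect described on the slide.

```python
def cluster(genotypes, msc):
    """Indices of genotypes matching every fixed SNP value of the MSC."""
    return [i for i, g in enumerate(genotypes)
            if all(g[j] == v for j, v in msc.items())]


def case_closure(genotypes, status, msc):
    """Extend the MSC with every SNP value shared by all cases in its cluster."""
    case_rows = [genotypes[i] for i in cluster(genotypes, msc) if status[i] == 1]
    closed = dict(msc)
    if not case_rows:
        return closed  # no cases in the cluster, nothing to close over
    for j in range(len(case_rows[0])):
        values = {g[j] for g in case_rows}
        if len(values) == 1:
            closed[j] = values.pop()
    return closed


# Hypothetical study with 4 SNPs; status 1 = case, 0 = control.
genotypes = [
    [0, 1, 2, 0],  # control
    [1, 1, 2, 0],  # case
    [0, 1, 2, 0],  # case
    [2, 0, 2, 1],  # case
    [0, 1, 2, 1],  # control
]
status = [0, 1, 1, 1, 0]

msc = {1: 1, 2: 2}                      # present in 2 cases : 2 controls
closed = case_closure(genotypes, status, msc)
print(closed)                           # {1: 1, 2: 2, 3: 0}
print(cluster(genotypes, closed))       # [0, 1, 2], i.e. 2 cases : 1 control
```

The closure never drops a case, so it can only keep or raise the odds ratio of the cluster, which is why the search can be restricted to case-closed MSCs.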

Exhaustive Combinatorial Search
Exhaustive Combinatorial Search method (CS):
- Searches only among case-closed MSCs
- Avoids checking clusters with a small number of cases
Alternating Combinatorial Search method (ACS):
- Finds significant MSCs faster than ES
- Still too slow for large data
- Further speedup by reducing the number of SNPs
Indexing: compress S by extracting the most informative SNPs
- Uses the multiple regression method

Heuristics for MORARF
- Clusters with fewer controls have higher OR => MORARF includes finding the maximum control-free cluster
- The maximum control-free cluster problem contains the maximum independent set problem => NP-hard
- The maximum control-free cluster problem can be transformed to the Red-Blue Set Cover problem
  - It cannot be reasonably approximated in polynomial time for an arbitrary S
- The Red-Blue Set Cover problem includes the weighted set-cover problem
  - The best known approximation algorithm for the weighted set-cover problem is the greedy heuristic

Complimentary Greedy Search (CGS)
Intuition:
- Greedy algorithm for finding a maximum independent set by removing highest-degree vertices
- Fixing an SNP value
  - removes controls -> profit
  - removes cases -> expense
- Maximize profit/expense!
Algorithm:
- Starting with the empty MSC, repeatedly add the SNP value that removes the maximum number of controls per case from the current cluster
- The result is a maximum control-free cluster -> MORARF
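The greedy loop described on this slide can be sketched as below. This is an illustrative reading, not the authors' implementation: the tie-breaking rule (prefer more removed controls, then the first candidate found), the treatment of a step that removes no cases as infinitely profitable, and the rule that a step must keep at least one case are all assumptions made here. The data are hypothetical.

```python
def cgs(genotypes, status):
    """Greedy sketch: grow an MSC until the current cluster has no controls.

    Returns (msc, cluster), where msc maps SNP index -> fixed value and
    cluster lists the indices of the genotypes that still match.
    """
    current = list(range(len(genotypes)))
    msc = {}
    num_snps = len(genotypes[0])
    while any(status[i] == 0 for i in current):
        cases_now = sum(status[i] for i in current)
        controls_now = len(current) - cases_now
        best = None  # ((ratio, removed_controls), snp, value, kept_indices)
        for j in range(num_snps):
            if j in msc:
                continue
            for v in (0, 1, 2):
                kept = [i for i in current if genotypes[i][j] == v]
                kept_cases = sum(status[i] for i in kept)
                removed_controls = controls_now - (len(kept) - kept_cases)
                removed_cases = cases_now - kept_cases
                # Assumed rule: a step must keep a case and remove a control.
                if kept_cases == 0 or removed_controls == 0:
                    continue
                # Profit/expense: controls removed per case removed.
                ratio = (removed_controls / removed_cases
                         if removed_cases else float("inf"))
                key = (ratio, removed_controls)
                if best is None or key > best[0]:
                    best = (key, j, v, kept)
        if best is None:
            break  # no SNP value separates the remaining controls
        _, j, v, current = best
        msc[j] = v
    return msc, current


# Hypothetical study with 3 SNPs; status 1 = case, 0 = control.
genotypes = [
    [1, 2, 0], [1, 2, 1], [1, 2, 0], [0, 2, 0],  # cases
    [1, 0, 0], [0, 2, 1], [0, 0, 0],             # controls
]
status = [1, 1, 1, 1, 0, 0, 0]
msc, members = cgs(genotypes, status)
print(msc, members)  # {1: 2, 0: 1} [0, 1, 2]
```

On this toy input the first step fixes SNP 1 to value 2, which removes two controls at no cost in cases, and the second step fixes SNP 0 to value 1, which removes the last control while losing one case.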

OR after each iteration of CGS
(Figure: the value of the OR of the ARF, with 95% CI, at the i-th iteration of CGS on the lung-cancer dataset)

Complimentary Greedy Search (CGS): comparison with optimum
- For the small tick-borne encephalitis dataset we were able to find an optimal solution for MORARF using ILP.
- CGS finds the same solution.
- We can therefore assume that CGS finds an optimal or close-to-optimal solution.

Randomized CGS
(Figure: search over cases and controls starting from the empty MSC, with branches labeled 1/4, 1/2, and 1)
- Repeat 100 times and choose the best MSC

Data Sets
- Crohn's disease (Daly et al.): inflammatory bowel disease (IBD). Location: 5q31. Number of SNPs: 103. Population size: 387 (144 cases, 243 controls)
- Autoimmune disorders (Ueda et al.): Location: region containing genes CD28, CTLA4 and ICOS. Number of SNPs: 108. Population size: 1024 (378 cases, 646 controls)
- Tick-borne encephalitis (Barkash et al.): Location: region containing genes TLR3, PKR, OAS1, OAS2, and OAS3. Number of SNPs: 41. Population size: 75 (21 cases, 54 controls)
- Lung cancer (Dragani et al.): Number of SNPs: 141. Population size: 500 (260 cases, 240 controls)
- Rheumatoid arthritis (GAW15): Number of SNPs: 2300. Population size: 920 (460 cases, 460 controls)

Search Results
(Table: comparison of 5 methods searching for ARFs on 5 real datasets)

Validation Results
(Table: one row per method, ES, CS, CGS and RCGS, for each of Crohn's disease, tick-borne encephalitis and lung cancer; columns: 2-fold cross-validation, random-validation, significance, double significance)
- 2-fold cross-validation = % of best MSCs on the training half that are validated on the testing half (p < 5%)
- Random-validation = the same, but the testing set is allowed to overlap with the training set
- Significance = % of best MSCs on the training half that are significant after MT-adjustment
- Double significance = % of best MSCs on the training half significant after MT-adjustment that are also significant on the testing half

Generalization of ARF
(Figure: wild-type and mutation patterns for (a) atomic risk factor, (b) 1-relaxed atomic risk factor, (c) weighted relaxed atomic risk factor)

k-Relaxed Atomic Risk Factor
k-MSC
- An MSC with n SNPs: a subset of the SNP columns of S (the set of SNPs), with fixed values of these SNPs (0, 1, or 2)
- A threshold k
- k-neighborhood of the MSC = genotypes with at most k mismatches
k-Cluster = subset of genotypes satisfying the k-MSC
Example (figure: genotypes of two controls and five cases): the 1-MSC "x x 1 x x 2 x x 2" is present in 5 cases : 1 control
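The k-cluster can be sketched by counting mismatches against the fixed SNP values, assuming a mismatch means any fixed position whose genotype value differs from the MSC. The dictionary representation and the data are hypothetical.

```python
def k_cluster(genotypes, msc, k):
    """Indices of genotypes with at most k mismatches against the MSC.

    msc maps SNP index -> fixed value; k = 0 gives the ordinary cluster.
    """
    members = []
    for i, g in enumerate(genotypes):
        mismatches = sum(1 for j, v in msc.items() if g[j] != v)
        if mismatches <= k:
            members.append(i)
    return members


# Hypothetical study with 3 SNPs; status 1 = case, 0 = control.
genotypes = [
    [1, 2, 0], [1, 2, 1], [1, 2, 0], [0, 2, 0],  # cases
    [1, 0, 0], [0, 2, 1], [0, 0, 0],             # controls
]
status = [1, 1, 1, 1, 0, 0, 0]
msc = {0: 1, 1: 2, 2: 0}

print(k_cluster(genotypes, msc, 0))  # [0, 2]
print(k_cluster(genotypes, msc, 1))  # [0, 1, 2, 3, 4]
```

Relaxing to k = 1 raises the coverage from 2 cases : 0 controls to 4 cases : 1 control in this toy example, which shows the trade-off a k-relaxed risk factor makes between covering more cases and admitting controls.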

Example of 1-MSC
(Figure: sick individuals covered by MSC 1 and MSC 2, the k-MSC, and the resulting k-cluster)

MORRARF Formulation
Maximum Odds Ratio k-RARF
- Given: a genotype case-control study and a constant k
- Find: the k-RARF with the maximum odds ratio
- MORRARF includes MORARF => harder
k-CGS algorithm:
- CGS with the objective computed for the k-cluster instead of the cluster

Weighted k-Relaxed ARF
- Weighted k-MSC: a k-MSC with a weight on each SNP
- Weighted k-cluster = subset of genotypes within weighted distance k from the weighted k-MSC
- Example (figure: genotypes of two controls and five cases, labeled with weighted distances w(0) to w(3)): with k = 2 weights, the MSC "x x 1 x x 2 x x 2" is present in 3 cases : 1 control

MORWRARF Formulation
Maximum Odds Ratio WRARF
- Given: a genotype case-control study
- Find: the weighted k-RARF with the maximum odds ratio
- MORWRARF includes MORARF => harder
WCGS algorithm:
- Two-move CGS with the objective computed for the k-cluster instead of the cluster

One Iteration of Greedy Methods
(Figure: two panels plotting H = number of controls against D = number of cases. (a) CGS/k-CGS: one step forward that maximizes ∆H/∆D. (b) WCGS: a step forward that maximizes ∆H/∆D and a step backward that maximizes ∆D/∆H.)

Cluster Content
(Table for tick-borne encephalitis: for each of CGS, k-CGS and WCGS, the number of sick individuals with the RF, the MT-adjusted p-value, and the number of SNPs; H = # healthy in k-cluster, S = # sick in k-cluster)

Behavior of Greedy Heuristics
(Figure panels: (a) lung cancer, (b) rheumatoid arthritis, (c) tick-borne encephalitis, (d) Crohn's disease)

Search Results for 3 Greedy Methods

Validation Results
(Table: one row per method, ES, CS, CGS, RCGS and WCGS, for each of Crohn's disease, tick-borne encephalitis and lung cancer; columns: 2-fold cross-validation, random-validation, significance, double significance)
- Cross-validation = % of best MSCs on the training half that are validated on the testing half (p < 5%)
- Random-validation = the same, but the testing set is allowed to overlap with the training set
- Significance = % of best MSCs on the training half that are significant after MT-adjustment
- Double significance = % of best MSCs on the training half significant after MT-adjustment that are also significant on the testing half

Conclusions
- Approximate search methods find more significant RFs
- RFs found by approximate searches have a higher cross-validation rate
  - Significant MSCs are better cross-validated
- WCGS finds significant MSCs when no other method could find anything