Genotype Susceptibility And Integrated Risk Factors for Complex Diseases Weidong Mao Dumitru Brinza Nisar Hundewale Stefan Gremalshi Alexander Zelikovsky.

Presentation transcript:

Genotype Susceptibility And Integrated Risk Factors for Complex Diseases Weidong Mao Dumitru Brinza Nisar Hundewale Stefan Gremalshi Alexander Zelikovsky Department of Computer Science Georgia State University

2 Outline: Human genetics basics; SNPs, haplotypes, and genotypes; Genetic epidemiology; Prediction methods; Genetic susceptibility to complex diseases; Conclusions and future plans.

3 Human Genetics Basics. Genetics: DNA, gene, chromosome, and genome. DNA = two complementary strands of nucleotides (A-T, G-C). The length of DNA is measured in base pairs (bp). Human Genome Project ( ): 3 billion bp in the human genome; 15,000 genes. Over 99% of the genome is identical between individuals; 1% are SNPs. 3.7 million SNPs.

4 Single Nucleotide Polymorphisms (SNPs). A SNP is an altered single nucleotide in the genome sequence. Found in at least 1% of the population. Occurs every 100 to 300 bp. Bi-allelic: wild type and mutation. Example sequences: AAGGCATGGCTA, AACGCGTGGTTA, AACGCGTGGCTA. SNPs: genetic risk factors for diseases.

5 Genotypes, Haplotypes, and 0/1/2 Notation. Diploid organisms have two different “copies” of each chromosome, which are recombined copies of the parents’ chromosomes. It is too expensive to examine the two versions of a chromosome separately; it is much cheaper to obtain genotype (mixed) data than haplotype (separated) data. Haplotype = description of a single copy (0 = wild type, 1 = minor allele). Genotype = description of the two copies mixed (0 = 00, 1 = 11, 2 = 01). [Figure: two haplotypes per individual and the genotype for the individual, with homozygous and heterozygous SNP sites marked.]
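The 0/1/2 notation above can be illustrated with a minimal Python sketch. The function name `genotype_from_haplotypes` and the example strings are illustrative assumptions, not part of the original slides.

```python
def genotype_from_haplotypes(h1, h2):
    """Combine two 0/1 haplotype strings into one 0/1/2 genotype string.

    Hypothetical helper: 0 = both copies wild type (00),
    1 = both copies minor allele (11), 2 = the copies differ (01).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    genotype = []
    for a, b in zip(h1, h2):
        if a != b:
            genotype.append("2")  # heterozygous site: the two copies differ
        else:
            genotype.append(a)    # homozygous site: 00 -> 0, 11 -> 1
    return "".join(genotype)


print(genotype_from_haplotypes("0110", "0101"))  # 0122
```

The mapping loses information: a genotype with several 2s does not say which allele sits on which copy, which is why recovering haplotypes from genotypes is a separate inference problem.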

6 Genetic Epidemiology. Genetic epidemiology searches for genetic risk factors for diseases. Monogenic disease: a single mutated gene is entirely responsible for the disease; typically rare in the population (< 0.1%). Complex disease: affected by the interaction of multiple genes; common (> 0.1%). In New York City, 12% of the population has type 2 diabetes. The significance of a risk factor is measured by risk rate or odds ratio.

7 Genetic Susceptibility to Complex Diseases. Given: genotypes of sick and healthy persons, and the genotype g_t of a test person. Find: the disease status s(g_t), i.e., whether the test person has the disease. [Table: genotypes with disease status (healthy or sick); the test genotype g_t has unknown status s(g_t).]

8 Prediction Methods. Universal prediction methods: statistical methods (closest neighbor, genotype statistics), support vector machine (SVM), and random forest. Ad hoc prediction methods: pseudo-haplotype statistics, linear-programming-based prediction, and adjacent SNP pairs.

9 Statistical Methods. Closest genotype neighbor: for the test genotype g_t, find the closest training genotype g_i under Hamming distance and set s(g_t) = s(g_i). Example: g_i = ATTCTGACCGCATC, g_t = ATTGTGATCGCCTC, H(g_i, g_t) = 3. Genotype statistics: a standard statistical method based on allele frequencies. For each SNP j = 1, …, m, an LRR score of the risk rate (RR) is computed. For genotype g_t, if the cumulative LRR score over all SNPs is greater than 0, the output disease status is s(g_t) = 1 (g_t is predicted to be in the control population), and -1 otherwise.
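A minimal Python sketch of the closest-genotype-neighbor rule follows. The two sequences are the ones from the slide; the training set, the status labels, and the tie-breaking rule (first closest genotype wins) are assumptions made for illustration.

```python
def hamming(g1, g2):
    """Number of positions at which two equal-length genotypes differ."""
    if len(g1) != len(g2):
        raise ValueError("genotypes must have the same length")
    return sum(a != b for a, b in zip(g1, g2))


def closest_neighbor_status(training, test_genotype):
    """Return the status of the training genotype closest to test_genotype.

    training: list of (genotype, status) pairs with status in {-1, 1}.
    Ties go to the earliest training genotype (an assumption; the slides
    do not specify a tie-breaking rule).
    """
    _, status = min(training, key=lambda pair: hamming(pair[0], test_genotype))
    return status


g_i = "ATTCTGACCGCATC"
g_t = "ATTGTGATCGCCTC"
print(hamming(g_i, g_t))  # 3

# Hypothetical training set: the slide's g_i plus one made-up genotype.
training = [(g_i, 1), ("GGGGGGGGGGGGGG", -1)]
print(closest_neighbor_status(training, g_t))  # 1
```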

10 Support Vector Machine (SVM) Algorithm. Learning task. Given: genotypes of patients and healthy persons. Compute: a model that distinguishes whether a person has the disease. Classification task. Given: the genotype of a new patient and a learned model. Determine: whether the patient has the disease. [Figure: linear SVM and non-linear SVM.]

11 Random Forest Algorithm. A random forest grows many classification trees. To classify a new object from an input vector, pass the input vector down each tree in the forest. Each tree gives a classification, and we say the tree “votes” for that class. The forest chooses the classification having the most votes (over all the trees in the forest). Steps: growing trees, split selection, and prediction. Randomness comes from a random sub-sample of the training data and random splitter selection.

Random Forest Algorithm. [Figure: data set, bootstrapped sample, tree splits labeled homozygous/heterozygous, and test genotype.]

13 Ad Hoc Classification Methods. Pseudo-haplotype statistics. [Figure: genotypes and their corresponding pseudo-haplotypes.]

14 LP-based Prediction Algorithm. Certain haplotypes are susceptible to the disease while others are resistant to it. The susceptibility of a genotype is assumed to be the sum of the susceptibilities of its two haplotypes. Assign a positive weight to susceptible haplotypes and a negative weight to resistant haplotypes such that, for any control genotype, the sum of the weights of its haplotypes is negative and, for any case genotype, it is positive. For each vertex h_i (corresponding to a haplotype) of the graph G_X, we wish to assign a weight p_i such that, for any genotype-edge e_{i,j} = (h_i, h_j), the sum p_i + p_j has the same sign as s(e_{i,j}), where s(e_{i,j}) ∈ {-1, 1} is the disease status of the genotype represented by edge e_{i,j}. The total sum of absolute values of genotype weights is maximized.

15 Most Reliable 2 SNPs Prediction. Chooses a pair of adjacent SNPs and predicts the disease status of the test genotype by voting among the training-set genotypes that have the same SNP values at the chosen sites. The most reliable pair of adjacent SNPs is the one with the highest prediction rate on the training set. [Figure: training set and test genotype.]
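The adjacent-pair voting rule can be sketched in Python as below. This is one plausible reading, not the authors' implementation. It assumes that the training-set prediction rate counts each genotype's own vote, that voting ties go to status 1, and that the function returns None when no training genotype matches the test genotype at the chosen sites. All names and data are hypothetical.

```python
from collections import Counter, defaultdict


def vote_tables(training, j):
    """Map each value pair at sites (j, j+1) to a Counter of statuses."""
    tables = defaultdict(Counter)
    for genotype, status in training:
        tables[genotype[j:j + 2]][status] += 1
    return tables


def majority(votes):
    """Majority status; ties go to 1 (an assumption)."""
    return 1 if votes[1] >= votes[-1] else -1


def training_rate(training, j):
    """Fraction of training genotypes predicted correctly by pair (j, j+1)."""
    tables = vote_tables(training, j)
    correct = sum(majority(tables[g[j:j + 2]]) == s for g, s in training)
    return correct / len(training)


def predict_most_reliable_pair(training, test_genotype):
    """Return (index of the chosen pair, predicted status or None)."""
    sites = range(len(test_genotype) - 1)
    best_j = max(sites, key=lambda j: training_rate(training, j))
    tables = vote_tables(training, best_j)
    key = test_genotype[best_j:best_j + 2]
    if key not in tables:
        return best_j, None  # no matching training genotype to vote
    return best_j, majority(tables[key])


# Hypothetical training set in 0/1/2 notation with statuses in {-1, 1}.
training = [("0102", 1), ("0112", 1), ("0100", -1), ("1100", -1)]
print(predict_most_reliable_pair(training, "1102"))  # (2, 1)
```

In this toy data the pair at sites 2 and 3 classifies every training genotype correctly, while the other two pairs reach only 75%, so it is the pair used for the test genotype.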

16 Disease Tagging. Motivation: genotype and analyze only a limited number of suspicious SNPs. Tag SNPs: the subset of SNPs that are probably responsible for the disease.

17 Minimal Disease Tagging Problem. Given: genotypes partitioned into groups (e.g., case/control). Find: the minimal number of SNPs distinguishing any case from any control. Greedy algorithm: drop a SNP if doing so does not collapse a case and a control; stop when no further SNP can be dropped.
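A minimal Python sketch of the greedy rule follows, assuming “collapse” means that some case and some control become identical on the remaining SNPs. The function name, the left-to-right scan order, and the example data are assumptions.

```python
def greedy_disease_tagging(cases, controls):
    """Greedily drop SNPs while every case stays distinct from every control.

    cases, controls: lists of equal-length genotype strings.
    Returns the indices of the SNPs that are kept. The scan order
    (left to right, a single pass) is an assumption; the result cannot be
    shrunk by dropping one more SNP, but it is not guaranteed to be the
    smallest possible set.
    """
    m = len(cases[0])

    def separates(sites):
        case_views = {tuple(g[i] for i in sites) for g in cases}
        control_views = {tuple(g[i] for i in sites) for g in controls}
        return case_views.isdisjoint(control_views)

    kept = list(range(m))
    if not separates(kept):
        raise ValueError("a case and a control are identical on all SNPs")
    for snp in range(m):
        candidate = [i for i in kept if i != snp]
        if separates(candidate):  # dropping snp collapses no case/control pair
            kept = candidate
    return kept


# Hypothetical data: only the SNP at index 1 is needed to tell the groups apart.
cases = ["0110", "0111"]
controls = ["0010", "0001"]
print(greedy_disease_tagging(cases, controls))  # [1]
```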

[Figure labels: “Decided by other methods”, 0.75.]

19 Quality Measures of Prediction. Sensitivity: the ability to correctly detect disease; sensitivity = TP/(TP+FN). Specificity: the ability to avoid calling normal as disease; specificity = TN/(FP+TN). Accuracy = (TP+TN)/(TP+FP+FN+TN). Risk rate: a measurement for risk factors. Confusion table (rows: test result; columns: disease status): test +, disease + = true positive (TP); test +, disease - = false positive (FP); test -, disease + = false negative (FN); test -, disease - = true negative (TN).
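The three formulas above translate directly into code. The sketch below assumes statuses are encoded as 1 (disease) and -1 (healthy) and that both classes are present, so no denominator is zero. The function name and data are illustrative.

```python
def quality_measures(actual, predicted):
    """Sensitivity, specificity, and accuracy from paired status lists.

    Statuses are assumed to be 1 (disease) or -1 (healthy), and both
    classes are assumed to occur in `actual`.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(a == 1 and p == 1 for a, p in pairs)
    fn = sum(a == 1 and p == -1 for a, p in pairs)
    tn = sum(a == -1 and p == -1 for a, p in pairs)
    fp = sum(a == -1 and p == 1 for a, p in pairs)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (fp + tn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Hypothetical outcome: TP = 2, FN = 1, TN = 1, FP = 1.
actual = [1, 1, 1, -1, -1]
predicted = [1, 1, -1, -1, 1]
print(quality_measures(actual, predicted))
```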

20 Cross-validation Method. Leave-one-out test: the disease status of each genotype in the data set is predicted while the rest of the data is used as the training set. Leave-many-out test: repeatedly pick a random 2/3 of the population as the training set and predict the other 1/3. [Figure: genotypes with real and predicted disease status; accuracy = 80%.]
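The leave-one-out test can be sketched as follows. The nearest-neighbor predictor is only a stand-in so the example runs on its own; any function taking a training set and a genotype could be passed instead. The data set is hypothetical.

```python
def nearest_neighbor_predict(training, genotype):
    """Stand-in predictor: status of the closest training genotype (Hamming)."""
    def distance(other):
        return sum(a != b for a, b in zip(other, genotype))
    _, status = min(training, key=lambda pair: distance(pair[0]))
    return status


def leave_one_out_accuracy(data, predict):
    """Predict each genotype from all the others; return the fraction correct.

    data: list of (genotype, status) pairs.
    predict: function (training, genotype) -> status.
    """
    correct = 0
    for i, (genotype, status) in enumerate(data):
        training = data[:i] + data[i + 1:]  # everything except sample i
        if predict(training, genotype) == status:
            correct += 1
    return correct / len(data)


# Hypothetical data set of five genotypes with statuses in {-1, 1}.
data = [("0000", -1), ("0001", -1), ("1111", 1), ("1110", 1), ("0011", 1)]
print(leave_one_out_accuracy(data, nearest_neighbor_predict))  # 0.8
```

Here four of the five held-out genotypes are predicted correctly; the last one, "0011", is closest to a healthy genotype and is misclassified.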

21 Algorithms Evaluation. P-value: a measure of how much evidence we have against the null hypothesis. Null hypothesis: the observed prediction accuracy is obtained by chance. To reject the null hypothesis, we require p-value < 0.05. Computing the p-value by randomization: randomly permute the disease status of the population to generate 1000 instances; apply the prediction method to each instance to get its prediction accuracy; compute the fraction of instances that have a higher prediction accuracy than the observed accuracy. Confidence intervals: bootstrapping is used to compute a 95% CI for each measure.
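The randomization procedure can be sketched in Python as below. The `evaluate` argument stands for any function that maps a labeled data set to a prediction accuracy. The simple rule used here, along with the names, seed, and data, is an assumption for illustration.

```python
import random


def permutation_p_value(data, evaluate, n_permutations=1000, seed=0):
    """Fraction of label-permuted instances whose accuracy beats the observed one.

    data: list of (genotype, status) pairs.
    evaluate: function(data) -> accuracy (hypothetical stand-in for a
    full train-and-predict pipeline such as cross-validation).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    genotypes = [g for g, _ in data]
    statuses = [s for _, s in data]
    observed = evaluate(data)
    higher = 0
    for _ in range(n_permutations):
        shuffled = statuses[:]
        rng.shuffle(shuffled)  # break any genotype/status link
        if evaluate(list(zip(genotypes, shuffled))) > observed:
            higher += 1
    return higher / n_permutations


def first_snp_rule_accuracy(data):
    """Toy evaluator: predict status 1 exactly when the first SNP is '1'."""
    hits = sum((1 if g[0] == "1" else -1) == s for g, s in data)
    return hits / len(data)


# Hypothetical data in which the first SNP separates cases from controls.
data = [("10", 1), ("11", 1), ("00", -1), ("01", -1)]
print(permutation_p_value(data, first_snp_rule_accuracy))  # 0.0
```

Because the observed accuracy in this toy example is already 100%, no permuted instance can score strictly higher, so the estimated p-value is 0.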

22 Data Sets. Crohn's disease (Daly et al.): inflammatory bowel disease (IBD). Location: 5q31. Number of SNPs: 103. Population size: 387 (144 cases, 243 controls). Autoimmune disorders (Ueda et al.): Location: region containing the genes CD28, CTLA4, and ICOS. Number of SNPs: 108. Population size: 1036 (384 cases, 652 controls).

23 Experiment Results (IEEE International Conference on Granular Computing, W. Mao, et al)

24 Conclusions. SNPs are genetic risk factors for complex diseases. Most known methods focus on single markers and are not applicable to complex diseases. We propose several ad hoc algorithms to predict genetic susceptibility and integrated risk factors for complex diseases. Our algorithms are shown to have higher statistical significance and a higher prediction rate than universal methods.

25 Thank You ! Questions ?