Data Mining Techniques For Correlating Phenotypic Expressions With Genomic and Medical Characteristics This work has been supported by DTC, IBM and NSF.

Presentation transcript:

Rohit Gupta, Blayne Field, Michael Steinbach, Vipin Kumar, Rich Mushlin*, Fred Kulack+
Department of Computer Science and Engineering, University of Minnesota (200 Union Street SE, Minneapolis MN 55455 USA)
*IBM T. J. Watson Research Center, +IBM Rochester
Contact: rohit@cs.umn.edu, steinbac@cs.umn.edu
Website: http://www-users.cs.umn.edu/~kumar/dmbio/index.html

INTRODUCTION

Project Motivation
- Obtaining genomic information is increasingly affordable.
  - Single Nucleotide Polymorphisms (SNPs) offer the potential to test for disease or susceptibility to disease.
- Electronic medical records (EMRs) are becoming increasingly common.
  - Automated analysis of patient information is now possible.
- This revolution in genetic and medical information potentially leads to personalized medicine, i.e., using detailed genomic and medical information about a person for the detection, treatment, or prevention of disease.

Problem Formulation
Given: A patient data set that records
- Phenotypic expression (disease)
- Genetic characteristics
- Medical characteristics
Objective: Find patterns combining medical and genetic characteristics that best define the phenotypic expression under study.
Challenges:
- High dimensionality and low sample size
- Combinatorial explosion
- Noise
- Non-linear interactions

METHODS

Data Set
Genetic data (SNPs):
- Simulated SNP data generated from known models has been used for this study. Approximately 2000 case records and 6000 control records have been generated.
- Real SNP data for Parkinson's disease and myeloma.

Association Analysis
- Patients' genetic information (SNPs) is represented as a binary matrix, with disease (yes/no) as the class label.
- Data mining-based association analysis is applied to find patterns that capture the connections between SNPs and disease.
  - Frequent closed itemsets capture SNP patterns where all SNPs must be present.
  - Error-tolerant itemsets (ETIs) capture more general SNP patterns, where not all SNPs need to occur in all patients defining the pattern.
- Existing techniques include statistical association analysis, Logistic Regression, Multifactor Dimensionality Reduction, CART, Random Forests, and others.
- Based on the disease variable, patients are categorized as cases or controls. First, we find patterns (closed itemsets or ETIs) in cases and then check for their presence in control patients. Odds Ratio (OR) and P-value metrics are used to evaluate the identified patterns:
  - Find strong patterns in cases.
  - Evaluate the strength of the patterns in controls.
  - Rank all the patterns using OR and p-value to obtain the final results.

ETI example: the error tolerance is 1/4. In other words, each transaction needs to have 3/4 (75%) of the items. {i1, i2, i3, i4} and {i5, i6, i7, i8} are both ETIs with a support of 4.

Figures of Merit for 2 x 2 table

                   Cases     Controls    Row Margins
With Pattern       a         b           Nwith
Without Pattern    c         d           Nwithout
Column Margins     Ncases    Ncontrols   Ntotal

a, b, c, and d are the number of cases with the pattern, controls with the pattern, cases without the pattern, and controls without the pattern, respectively.

Evaluation Measures
- There are many different figures of merit (FOM), i.e., functions of a, b, c, and d, that can be used to characterize the table.
- We use the odds ratio (OR) and the P-value (P).
  - OR quantifies how different cases and controls are for a specific pattern.
  - P quantifies the significance of the difference reflected by OR.
- Odds Ratio: OR = (a*d) / (b*c)
- P is the probability that a table with the same fixed margins as the one shown above has a higher (or the same) OR.

Figure: Probability distribution, p, as a function of odds ratio, OR, for Ntotal = 1000 and several sets of margins (full range of points is shown). The margins in the legend are in the order Ncases, Ncontrols, Nwith, Nwithout.

RESULTS AND DISCUSSIONS

Itemset                     Odds Ratio    -log10(pvalue)
aa1 aa2 aa3 aa4             5.442         5.452
Aa1 aa2 aa4 Aa8             1.661         3.935
aa1 Aa2 aa3 AA5 AA6         3.002         3.770
Aa1 aa2 AA5 AA6 AA7 Aa8     3.845         3.739
aa1 aa2 AA7 Aa8             1.934         3.661
aa1 aa2 aa3 AA5             2.844         3.541
aa1 aa3 AA5 AA6             1.965         3.503
aa2 aa3 AA5 Aa7 Aa8         2.177         3.448
aa2 aa3 AA5 Aa7             1.682         3.421
aa1 aa3                     2.486         3.414

Conclusions
- Various association analysis algorithms have been applied to find connections between genetic characteristics (SNPs) and disease.
- Techniques for finding closed itemsets have proven effective for finding SNP patterns in synthetic data.
- Algorithms for finding ETIs have shown promise, but their evaluation is not complete.
- The computational demands of the algorithms are high.
- Odds Ratio and P-value are found to be the best indicators of real patterns for synthetic SNP data. They are also found to be highly correlated with other similarity measures.

Acknowledgements
This work has been supported by DTC, IBM, and an NSF grant. Computational resources for this work were provided by the Minnesota Supercomputing Institute.
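The error-tolerant itemset example from the poster (tolerance 1/4, four items, support 4) can be illustrated with a small sketch. This is a simplified row-wise check and not the authors' mining algorithm: the function name `eti_support` and the four toy transactions are hypothetical, chosen so that every transaction is missing exactly one of the items i1 to i4.

```python
from fractions import Fraction
from math import ceil


def eti_support(transactions, itemset, tolerance):
    """Count transactions that contain at least a (1 - tolerance)
    fraction of the items in `itemset` (simplified row-wise check)."""
    items = set(itemset)
    # Exact arithmetic avoids rounding surprises in the threshold.
    needed = ceil((1 - Fraction(tolerance)) * len(items))
    return sum(1 for t in transactions if len(items & set(t)) >= needed)


# Hypothetical transactions: each one lacks exactly one of i1..i4.
transactions = [
    {"i2", "i3", "i4"},
    {"i1", "i3", "i4"},
    {"i1", "i2", "i4"},
    {"i1", "i2", "i3"},
]
pattern = ["i1", "i2", "i3", "i4"]

tolerant = eti_support(transactions, pattern, Fraction(1, 4))  # 3 of 4 items suffice
exact = eti_support(transactions, pattern, 0)                  # all 4 items required
print(tolerant, exact)
```

With a tolerance of 1/4 all four toy transactions support the pattern, while an exact itemset match (tolerance 0) finds none, which is the kind of pattern a closed-itemset search would miss.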
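The evaluation step (build the 2 x 2 table for each pattern, compute OR and P, then rank) can be sketched as follows. The poster does not state which null distribution is used for P; this sketch assumes the hypergeometric distribution with all margins fixed, as in a one-sided Fisher exact test. The helper names, the toy patients, and the SNP labels `s1` to `s3` are hypothetical, and patterns are treated as exact itemsets (every SNP must be present).

```python
from math import comb, inf, log10


def contingency(pattern, cases, controls):
    """Return (a, b, c, d) for a pattern that requires every listed SNP."""
    items = set(pattern)
    a = sum(1 for row in cases if items <= set(row))
    b = sum(1 for row in controls if items <= set(row))
    return a, b, len(cases) - a, len(controls) - b


def odds_ratio(a, b, c, d):
    """OR = (a*d) / (b*c); infinite when b*c is zero and a*d is positive."""
    num, den = a * d, b * c
    if den == 0:
        return inf if num > 0 else float("nan")
    return num / den


def p_value(a, b, c, d):
    """Probability of a table with the same margins and an OR at least as
    high as the observed one (assumed hypergeometric null)."""
    n_with, n_cases, n_total = a + b, a + c, a + b + c + d
    # With fixed margins the table is determined by a, and OR grows with a,
    # so "OR at least as high" means "a at least as large".
    tail = sum(
        comb(n_cases, k) * comb(n_total - n_cases, n_with - k)
        for k in range(a, min(n_with, n_cases) + 1)
    )
    return tail / comb(n_total, n_with)


def rank_patterns(patterns, cases, controls):
    """Score each pattern and sort by -log10(p), strongest first."""
    scored = []
    for pattern in patterns:
        table = contingency(pattern, cases, controls)
        scored.append((pattern, odds_ratio(*table), -log10(p_value(*table))))
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Hypothetical toy data: each patient is the set of SNP variants present.
cases = [{"s1", "s2"}, {"s1", "s2"}, {"s1", "s2", "s3"}, {"s3"}]
controls = [{"s1", "s2"}, {"s3"}, {"s3"}, {"s1"}]
patterns = [("s3",), ("s1", "s2")]

for pattern, oratio, neg_log_p in rank_patterns(patterns, cases, controls):
    print(pattern, round(oratio, 3), round(neg_log_p, 3))
```

Sorting by -log10(p) mirrors the ordering of the results table on the poster; ranking by OR, or by a combination of both measures, would be an equally simple change to the sort key.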