Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics Vipin Kumar William Norris Professor and Head, Department of Computer.

Presentation transcript:

Data Mining for Finding Connections of Disease and Medical and Genomic Characteristics
Vipin Kumar, William Norris Professor and Head, Department of Computer Science
kumar@cs.umn.edu
www.cs.umn.edu/~kumar
Michael Steinbach, Department of Computer Science

Increasing Amounts of Medical and Genomic Data
- Electronic medical records are becoming increasingly common
- Automated analysis of patient information is now possible
- Obtaining genomic information is increasingly affordable
- SNPs offer the potential of tests for disease or susceptibility to disease

Various Techniques Currently Used to Analyze the Relationship Between SNPs and Disease
- Statistical Association Analysis
  - Chi-square and odds ratio
- Logistic Regression
- Multifactor Dimensionality Reduction
  - Attribute selection and construction followed by classification
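As a minimal sketch of the single-SNP statistical association analysis listed above, the following computes an odds ratio and a Pearson chi-square statistic from a 2x2 case/control table. The counts and the function names are hypothetical, chosen only for illustration; no continuity correction is applied.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: cases with the variant      b: cases without it
    c: controls with the variant   d: controls without it
    """
    return (a * d) / (b * c)


def chi_square(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table (1 degree of freedom)."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Hypothetical counts: 100 cases and 100 controls.
a, b, c, d = 30, 70, 15, 85
print("odds ratio:", round(odds_ratio(a, b, c, d), 3))
print("chi-square:", round(chi_square(a, b, c, d), 3))
```

Running this test once per SNP is cheap, but screening every pair or triple of SNPs multiplies the number of tables, which is one way to read the combinatorial concern raised about these techniques.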

Challenges Facing Current Techniques
- Statistical Association Analysis
  - Often lacks power, even for single-SNP association
  - Combinatorial explosion when used for screening pairs, triples, etc.
- Logistic Regression
  - Does not work well when there are many attributes
- Multifactor Dimensionality Reduction
  - Impractical for many attributes due to its combinatorial nature
- General Challenges
  - Disease or susceptibility to disease often results from interactions among many genetic, phenotypic, and environmental factors
  - Noise and nonlinear interactions
- Reference: The Challenges of Whole-Genome Approaches to Common Diseases, Moore and Ritchie, JAMA, 2004

General Approach Using Data Mining Techniques
- Create a data set that records the presence and absence of
  - Phenotypic characteristics
  - Genetic characteristics (SNPs)
  - Disease
- Apply association analysis to find groups of phenotypic and genetic characteristics that are highly associated with disease
  - Use characteristics of the patterns to prune the search space
- Clustering and classification can also be applied

Traditional Association Analysis
- Association analysis: analyzes relationships among items (attributes) in binary transaction data
  - Example data: market basket data
  - Data can be represented as a binary matrix
  - Applications in business and science
- Two types of patterns
  - Itemsets: collection of items. Example: {Milk, Diaper}
  - Association rules: X → Y, where X and Y are itemsets. Example: Milk → Diaper
[Figures: set-based representation of data; binary matrix representation of data]
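The two representations and the two pattern types can be illustrated with a short sketch. The five market-basket transactions below are illustrative data, not taken from the slides, and the helper names are hypothetical.

```python
# Set-based representation: each transaction is a set of items.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

# Binary matrix representation: one row per transaction, one column per item.
items = sorted(set().union(*transactions))
matrix = [[int(item in t) for item in items] for t in transactions]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def confidence(lhs, rhs, transactions):
    """Confidence of the rule lhs -> rhs: support(lhs and rhs) / support(lhs)."""
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)


print(items)
for row in matrix:
    print(row)
print("support {Milk, Diaper}:", support_count({"Milk", "Diaper"}, transactions))
print("confidence Milk -> Diaper:", confidence({"Milk"}, {"Diaper"}, transactions))
```

In this toy data, {Milk, Diaper} appears in three of the five transactions, and three of the four transactions containing Milk also contain Diaper.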

The Need for Error-Tolerant Itemsets
- An error-tolerant itemset (ETI) can have a fraction ε of the items missing in each transaction.
- Example: see the data in the table
  - Let ε = 1/4. In other words, each transaction needs to have 3/4 (75%) of the items.
  - X = {i1, i2, i3, i4} and Y = {i5, i6, i7, i8} are both ETIs with a support of 4.
- Algorithms to find ETIs are still in development
- You can think of these ETIs as blocks in the data matrix
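The ETI definition above can be checked with a small sketch. The original table is not available in this transcript, so the eight transactions below are hypothetical data constructed to reproduce the stated outcome (X and Y each have ETI support 4 at ε = 1/4); the function name is also an assumption.

```python
from fractions import Fraction


def eti_support(itemset, transactions, eps):
    """Count transactions that contain at least a (1 - eps) fraction of the itemset."""
    required = (1 - eps) * len(itemset)
    return sum(1 for t in transactions if len(itemset & t) >= required)


# Hypothetical data: every transaction is missing exactly one item of X or of Y.
transactions = [
    {"i1", "i2", "i3"},
    {"i1", "i2", "i4"},
    {"i1", "i3", "i4"},
    {"i2", "i3", "i4"},
    {"i5", "i6", "i7"},
    {"i5", "i6", "i8"},
    {"i5", "i7", "i8"},
    {"i6", "i7", "i8"},
]

X = {"i1", "i2", "i3", "i4"}
Y = {"i5", "i6", "i7", "i8"}
eps = Fraction(1, 4)

print("ETI support of X:", eti_support(X, transactions, eps))
print("ETI support of Y:", eti_support(Y, transactions, eps))
print("exact support of X:", eti_support(X, transactions, Fraction(0)))
```

With ε = 0 the function reduces to ordinary itemset support, which is 0 for X in this data: no transaction contains all four items, so a traditional frequent itemset search would miss the block entirely.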

ETIs for Finding Patterns in Phenotypic and Genomic Data
- ETIs consist of
  - A set of patients and
  - A set of attributes such that
  - The block is relatively dense
- These blocks identify sets of patients that are highly associated with certain sets of attributes and vice versa
- If most of these patients share a disease, then these attributes (genetic and/or phenotypic) are candidate markers for the disease
- X: set of patients
- Y: set of attributes, i.e., SNPs, medical characteristics

Example: Using ETIs to Find Rules
- Can use ETIs to find better association rules
- Example: Mushroom data set
  - Classifies 8124 mushrooms as poisonous (3916) or not (4208)
  - 117 other attributes, such as color, odor, etc.
- Comparison of rules based on frequent itemsets or ETIs
  - One rule is {29, 48, 90} → p
  - Individual correlations are 0.62, 0.54, 0.18
  - With traditional association analysis, this rule has a confidence of 1 and a support of 576. This corresponds to a correlation of 0.18
  - If we require only two of the three items, we still have a confidence of 1, but the support is 3312. This corresponds to a correlation of 0.86
  - Maximum single-item correlation is 0.78
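The effect of relaxing a rule from "all items" to "at least k of the items" can be sketched as follows. The records are a tiny hypothetical data set, not the Mushroom data, and the correlation is computed as the phi coefficient between "rule fires" and "record has the label", which is an assumption about the measure used on the slide.

```python
from math import sqrt


def relaxed_rule_stats(items, label, records, k):
    """Support, confidence and phi correlation of the rule
    'at least k of items present -> label'.

    records is a list of (set_of_items, class_label) pairs.
    Assumes the rule fires at least once and both classes occur.
    """
    n = len(records)
    fires = [len(items & present) >= k for present, _ in records]
    has_label = [cls == label for _, cls in records]
    n_fire = sum(fires)
    n_label = sum(has_label)
    n_both = sum(f and l for f, l in zip(fires, has_label))
    conf = n_both / n_fire
    phi = (n * n_both - n_fire * n_label) / sqrt(
        n_fire * (n - n_fire) * n_label * (n - n_label)
    )
    return n_both, conf, phi


# Hypothetical records: "p" = poisonous, "e" = edible.
records = [
    ({"a", "b", "c"}, "p"),
    ({"a", "b"}, "p"),
    ({"b", "c"}, "p"),
    ({"a", "c"}, "p"),
    ({"a"}, "e"),
    ({"d"}, "e"),
    ({"c"}, "e"),
    ({"d"}, "e"),
]

rule_items = {"a", "b", "c"}
for k in (3, 2):
    support, conf, phi = relaxed_rule_stats(rule_items, "p", records, k)
    print(f"k={k}: support={support}, confidence={conf:.2f}, correlation={phi:.2f}")
```

In this toy data the confidence stays at 1 for both settings, while the support and the correlation rise when only two of the three items are required, mirroring the pattern reported for the Mushroom rule.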

Generalizing ETIs to Blocks in a Data Matrix
- Dense blocks consist of
  - A set of objects and
  - A set of attributes such that
  - The block is relatively dense
- These blocks identify sets of objects that are highly associated with certain sets of attributes and vice versa
- For gene expression data, this identifies transcription modules
  - Groups of genes that are coexpressed under a set of conditions
- X: genes
- Y: conditions under which genes are expressed
- Entries of the matrix are the level of a gene's expression under a condition

Techniques for Finding Relatively Dense Blocks in a Data Matrix
- Algorithms for finding Error-Tolerant Itemsets
  - More work needed to develop algorithms
  - We are currently using support envelopes
  - Support Envelopes: A Technique for Exploring the Structure of Association Patterns, Steinbach, Tan, Kumar, KDD, 2004
- Subspace clustering
  - Similar in spirit to association analysis
  - Subspace clustering for high dimensional data: A review, Parsons, Haque, Liu, SIGKDD Explorations, 2004
- Co-clustering
  - Information-Theoretic Co-Clustering, Dhillon, Mallela, Modha, KDD 2003
  - We are currently exploring approaches to extend co-clustering for ETIs
- Variety of other approaches
  - Matrix factorization
  - Graph partitioning

Support Envelopes
- A support envelope contains all association patterns involving m or more transactions and n or more items
- By association patterns we mean
  - Itemsets and variants (frequent, maximal, closed)
  - Error-Tolerant Itemsets (ETIs)
- An example of a support envelope involving characteristics of mushrooms: one of the attributes is 'gill-color:buff', which occurs in 1728 records, every one of which occurs with 13 other items (one of which is the attribute 'poisonous')
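One plausible way to compute the region described above is iterative pruning: repeatedly drop transactions with fewer than n remaining items and items that occur in fewer than m remaining transactions, until nothing changes. The slide states only what the envelope contains, so this procedure, the function name, and the toy data are all assumptions made for illustration.

```python
def support_envelope(transactions, m, n):
    """Return (kept transaction indices, kept items) after iterative pruning.

    Every kept transaction has at least n kept items, and every kept item
    occurs in at least m kept transactions.
    """
    rows = {i: set(t) for i, t in enumerate(transactions)}
    changed = True
    while changed:
        changed = False
        # Drop transactions that no longer have enough items.
        for i in [i for i, t in rows.items() if len(t) < n]:
            del rows[i]
            changed = True
        # Count each item over the remaining transactions.
        counts = {}
        for t in rows.values():
            for item in t:
                counts[item] = counts.get(item, 0) + 1
        # Drop items that no longer occur in enough transactions.
        weak = {item for item, c in counts.items() if c < m}
        if weak:
            for t in rows.values():
                t.difference_update(weak)
            changed = True
    kept_items = set().union(*rows.values()) if rows else set()
    return sorted(rows), kept_items


# Hypothetical toy data.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"e"},
]
print(support_envelope(transactions, m=2, n=2))
print(support_envelope(transactions, m=3, n=3))
```

Under this reading, any pattern that involves at least m transactions and at least n items must survive the pruning, so the remaining block bounds where such patterns can be found; an empty result means no such pattern exists.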

Visualizing Support Envelopes for Mushroom One of the support envelopes (576, 23) is denser than its surrounding neighbors.

Using Weak Associations to Find Patterns
- Apply the following principle: if most pairs of attributes or objects in a set have pairwise connections, then there is likely to be a strong association among them even if the pairwise associations are weak.
- The hyperclique pattern uses this principle
  - Hyperclique Pattern Discovery, Xiong, Tan, and Kumar, to appear in DMKD, 2006
  - Good for removing noise
  - Enhancing Data Analysis with Noise Removal, Hui Xiong, Gaurav Pandey, Michael Steinbach, Vipin Kumar, TKDE, 2006
- We have recently developed more general pairwise patterns
  - Custom Itemset Patterns, Steinbach and Kumar, submitted to KDD 2006
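Hyperclique patterns are usually defined through the h-confidence measure: the support of the whole itemset divided by the largest support of any single item in it. The slide does not spell this out, so the sketch below rests on that assumption, and its data and function name are hypothetical.

```python
def h_confidence(itemset, transactions):
    """Support of the itemset divided by the largest single-item support.

    Assumes every item of the itemset occurs in at least one transaction.
    """
    support_all = sum(1 for t in transactions if itemset <= t)
    max_single = max(sum(1 for t in transactions if item in t) for item in itemset)
    return support_all / max_single


# Hypothetical toy data.
transactions = [
    {"a", "b", "c"},
    {"a", "b", "c"},
    {"a", "b"},
    {"d"},
    {"c", "d"},
]

print(round(h_confidence({"a", "b", "c"}, transactions), 3))
print(round(h_confidence({"a", "d"}, transactions), 3))
```

A high h-confidence means that the presence of any one item makes the rest of the set likely, so every pair in the set is tied together; sets containing an unrelated item, such as {a, d} here, score low, which is what makes the measure usable as a noise filter.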

Application of Classification Techniques
- Techniques must work with noisy, sparse, high-dimensional data
- Success of multifactor dimensionality reduction indicates the usefulness of attribute selection and attribute creation
- ETIs and related patterns offer an alternative for feature extraction
- Classification based on association analysis
- SVM with the proper kernel