Challenges and Techniques for Mining Clinical Data
Wesley W. Chu and Laura Yu Chen

Outline
- Introduction to SmartRule association rule mining
- Case I: mining pregnancy data to discover drug-exposure side effects
- Case II: mining urology clinical data for operation decision making

SmartRule Features
- Generates MFIs (maximal frequent itemsets) directly from tabular data
- Reduces the search space and the support-counting time by taking advantage of column structures
- Lets the user select a subset of MFIs, including certain attributes as targets, for rule generation
- Derives rules from the targeted MFIs
- Counts supports efficiently by building inverted indices over the collection of itemsets
- Hierarchically organizes rules into trees and presents the rule trees in a spreadsheet

System Overview of SmartRule
[System diagram] Tabular data, together with a configuration from domain experts, flows through three components: TMaxMiner computes the MFIs from the tabular data; InvertCount expands the MFIs into frequent itemsets (FIs) and counts their supports; RuleTree generates and organizes the rules, which are presented in an Excel workbook.

Computational Complexity
- Efficient MFI mining: requires no superset checking; gathers past tail information to determine the next node to explore during the mining process
- Efficient rule generation: reduces the computation for support counting by building inverted indices
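To make the inverted-index idea concrete, here is a minimal Python sketch of support counting over per-item transaction-id sets. This is an illustration, not the InvertCount implementation; all names and data are invented.

```python
from collections import defaultdict

def build_inverted_index(transactions):
    """Map each item to the set of transaction ids that contain it."""
    index = defaultdict(set)
    for tid, items in enumerate(transactions):
        for item in items:
            index[item].add(tid)
    return index

def support(itemset, index, n_transactions):
    """Support of an itemset: intersect its items' tid-sets and normalize."""
    tid_sets = [index[item] for item in itemset]
    common = set.intersection(*tid_sets) if tid_sets else set()
    return len(common) / n_transactions

# Count the support of an itemset derived from an MFI without rescanning the data
transactions = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
index = build_inverted_index(transactions)
print(support(("a", "c"), index, len(transactions)))  # 0.75
```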

Scalability
- Limitation: a Microsoft Excel worksheet holds at most 65,536 rows
- When the dataset exceeds this limit: partition the dataset into multiple groups of at most the maximum worksheet size and derive the MFIs for each group, then join these MFIs to generate the association rules (a sketch follows)
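A minimal sketch of the partition-and-join scheme, under the assumption that "joining" the per-partition MFIs means unioning them and keeping only the sets that remain maximal (the deck does not spell out the exact join); mine_mfi stands in for any MFI miner such as TMaxMiner.

```python
EXCEL_ROW_LIMIT = 65536  # rows per worksheet

def chunk(rows, size=EXCEL_ROW_LIMIT):
    """Split the dataset into worksheet-sized partitions."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

def join_mfis(mfi_groups):
    """Union the per-partition MFIs, keeping only the sets that stay maximal."""
    candidates = {frozenset(m) for group in mfi_groups for m in group}
    return [set(m) for m in candidates
            if not any(m < other for other in candidates)]

# mine_mfi is a placeholder for any MFI miner (e.g., TMaxMiner):
# mfi_groups = [mine_mfi(part, min_sup) for part in chunk(all_rows)]
# mfis = join_mfis(mfi_groups)
print(join_mfis([[{"a", "b"}], [{"a", "b", "c"}, {"d"}]]))  # maximal sets only
```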

Case I: Mining Pregnancy Data
- Data set: Danish National Birth Cohort (DNBC)
- Dimension: 4455 patients x 20 attributes
- Each patient record contains:
  - Exposure status: drug type, timing, and sequence of different drugs
  - Possible confounders: vitamin intake, smoking, alcohol consumption, socio-economic status, and psycho-social stress
  - Endpoints: preterm birth, malformations, and prenatal complications

Sample Pregnancy Data

Challenges
Problem: discover side effects of drug exposure during pregnancy, e.g., study how antidepressants and confounders influence preterm birth of the newborn.
Difficulties in finding side effects:
- Only a small number of patients suffer a side effect
- Effects are sensitive to the drug exposure time
- Patients may be exposed to sequences of multiple drugs

Derive Drug Side Effects via SmartRule (1): Low-Support, Low-Confidence Rules
Rules with low support or low confidence can still be significant because of their contrast with the rules for unexposed pregnant women. For example:
- If a patient is exposed to cita (citalopram) in the 3rd trimester, then preterm birth, with support=0.0011, confidence=0.1786
- If a patient is not exposed to cita, then preterm birth, with support=0.0433, confidence=0.0444

Derive Drug Side Effects via SmartRule (2): Temporally Sensitive Rules
Divide the pregnancy period into time slots (e.g., trimesters) and combine drug exposure by time:
- If exposed to cita in the 1st trimester and drinking alcohol, then preterm birth, with support=0.0011 and confidence=0.132
- If exposed to cita in the 2nd trimester and drinking alcohol, then preterm birth, with support=0.0011 and confidence=0.417
- If exposed to cita in the 3rd trimester and drinking alcohol, then preterm birth, with support=0.0009 and confidence=0.364
The time-slot division is flexible; the domain user can control the granularity. (A sketch of how such rules are scored follows.)
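To illustrate how such temporally qualified rules are scored, a small sketch over invented records; the field names (cita_trimester, alcohol, preterm) are assumptions, not the DNBC schema.

```python
def rule_stats(records, antecedent, consequent):
    """support = P(antecedent AND consequent); confidence = P(consequent | antecedent)."""
    n = len(records)
    ante = [r for r in records if antecedent(r)]
    both = [r for r in ante if consequent(r)]
    sup = len(both) / n
    conf = len(both) / len(ante) if ante else 0.0
    return sup, conf

# Hypothetical record fields: 'cita_trimester' (0 = no exposure), 'alcohol', 'preterm'
records = [
    {"cita_trimester": 2, "alcohol": True,  "preterm": True},
    {"cita_trimester": 2, "alcohol": True,  "preterm": False},
    {"cita_trimester": 0, "alcohol": False, "preterm": False},
    {"cita_trimester": 1, "alcohol": False, "preterm": False},
]
sup, conf = rule_stats(
    records,
    antecedent=lambda r: r["cita_trimester"] == 2 and r["alcohol"],
    consequent=lambda r: r["preterm"],
)
print(f"sup={sup:.2f}, conf={conf:.2f}")  # sup=0.25, conf=0.50
```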

Rule Presentation
- Hierarchically organize rules into trees: view general rules first, then extend to more specific rules (see the sketch after the hierarchy below)
- Use a spreadsheet to present the rule trees: easy to sort, filter, or extend the trees when searching for interesting rules

Part of the rule hierarchy for exposure to the antidepressant citalopram (cita) and alcohol at different periods of pregnancy, with preterm birth as the outcome:
1) In general, patients have preterm birth (sup=0.0454, conf=0.0454)
2) If exposed to cita in the 1st trimester, then preterm birth (sup=0.0016, conf=0.0761)
3) If exposed to cita in the 2nd trimester, then preterm birth (sup=0.0013, conf=0.1714)
4) If exposed to cita in the 3rd trimester, then preterm birth (sup=0.0011, conf=0.1786)
5) If no exposure to cita, then preterm birth (sup=0.0433, conf=0.0444)
6) If exposed to cita in the 1st trimester and drinking alcohol, then preterm birth (sup=0.0011, conf=0.132)
7) If exposed to cita in the 2nd trimester and drinking alcohol, then preterm birth (sup=0.0011, conf=0.417)
8) If exposed to cita in the 3rd trimester and drinking alcohol, then preterm birth (sup=0.0009, conf=0.364)
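One plausible way to organize mined rules into such a general-to-specific tree (not the RuleTree implementation; the rule encoding is assumed) is to attach each rule to its most specific generalization:

```python
def build_rule_tree(rules):
    """Attach each rule to its most specific generalization: the rule whose
    antecedent is the largest proper subset of this rule's antecedent."""
    nodes = {r["id"]: {"rule": r, "children": []} for r in rules}
    roots = []
    for r in rules:
        parents = [p for p in rules
                   if set(p["antecedent"]) < set(r["antecedent"])]
        if parents:
            best = max(parents, key=lambda p: len(p["antecedent"]))
            nodes[best["id"]]["children"].append(nodes[r["id"]])
        else:
            roots.append(nodes[r["id"]])
    return roots

# Rules 1, 2, and 6 from the citalopram hierarchy, in a hypothetical encoding
rules = [
    {"id": 1, "antecedent": set(),                   "conf": 0.0454},
    {"id": 2, "antecedent": {"cita_1st"},            "conf": 0.0761},
    {"id": 6, "antecedent": {"cita_1st", "alcohol"}, "conf": 0.132},
]
for root in build_rule_tree(rules):
    print(root["rule"]["id"], [c["rule"]["id"] for c in root["children"]])  # 1 [2]
```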

Knowledge Discovery from Data Mining Results
Challenges:
- Examining the vast number of rules manually is too labor-intensive
- Exploring knowledge (rules) without a specific goal

Existing Approach: Top-Down Search in the Rule Hierarchy
- Association rules are represented as general rules, summaries, and exception rules (GSE patterns) [Liu, Hu, and Hsu, KDD 2000; AAAI 2000]
- The GSE pattern presents the discovered rules in a hierarchical fashion; users can browse the hierarchy top-down to find interesting exception rules
- Because drug side effects occur rarely, the interesting rules are exception rules that reside at the lower levels of the hierarchy
- Without user guidance, locating these interesting exception rules requires exploring the entire GSE hierarchy

A New, Effective Bottom-Up Technique to Find Exception Rules
Derive a set of seed attributes from high-confidence rules. For example, given the high-confidence rule:
- If exposed to Anxio in the pre-, in-, and post-pregnancy periods, and using tobacco, and having symptoms of depression, then preterm birth, with confidence = 0.6
the list of seed attributes is: Anxio_pre, Anxio_in, Anxio_post, tobacco, and symptoms of depression.

Using Seed Attributes to Explore Exception Rules via the Rule Hierarchy
- Explore more rules based on these seed attributes in the rule hierarchies
- First look for rules that represent the effect of each single seed attribute on preterm birth
- Then explore combinations of multiple seed attributes (see the sketch below)
High-confidence rules yield seed attributes, which then guide the search through the rule hierarchy.
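A sketch of this seed-attribute workflow, with hypothetical rule records; the confidence threshold and the subset-based filter are assumptions consistent with the slides, not the authors' exact procedure.

```python
def seed_attributes(rules, min_conf=0.5):
    """Collect attributes appearing in the antecedents of high-confidence rules."""
    seeds = set()
    for rule in rules:
        if rule["conf"] >= min_conf:
            seeds.update(rule["antecedent"])
    return seeds

def explore(rules, seeds):
    """Return rules whose antecedents consist only of seed attributes,
    ordered from single-seed rules to multi-seed combinations."""
    hits = [r for r in rules if set(r["antecedent"]) <= seeds]
    return sorted(hits, key=lambda r: len(r["antecedent"]))

# Hypothetical rule records mined from the pregnancy data
rules = [
    {"antecedent": {"Anxio_pre", "tobacco"}, "consequent": "preterm", "conf": 0.60},
    {"antecedent": {"tobacco"},              "consequent": "preterm", "conf": 0.10},
    {"antecedent": {"vitamin"},              "consequent": "preterm", "conf": 0.04},
]
seeds = seed_attributes(rules, min_conf=0.5)   # {'Anxio_pre', 'tobacco'}
for r in explore(rules, seeds):                # single-seed rule first, then pairs
    print(r["antecedent"], r["conf"])
```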

New Findings from Data Mining
- Finding: combined exposure to citalopram and alcohol during pregnancy is associated with an increased risk of preterm birth (rules 6-8 in the rule hierarchy shown earlier)
- This association was not initially discovered by the epidemiology study, owing to the large number of combinations among all the attributes and their values

Statistical Analysis vs. Data Mining
Statistical analysis:
- Infeasible to test all potential hypotheses when the number of attributes is large
- Testing hypotheses on small sample sizes has limited statistical power
Data mining:
- Requires no hypothesis; mines associations in a large dataset with multiple temporal attributes
- Can generate association rules independent of the sample size
- Derives rules with temporal information about drug exposure

Case II: Mining Urology Clinical Data
- Data set: urology surgeries performed from 1995 to 2002 at the UCLA Pediatric Urology Clinic
- Dimension: 130 patients x 28 attributes

Bladder Body & Bladder Neck

Training Data Attributes
Each patient record contains:
- Pre-operative conditions:
  - Demographic data: age, gender, etc.
  - Patient ambulatory status (A)
  - Catheterizing skills (CS)
  - Amount of creatinine in the blood (SerumCrPre)
  - Leak point pressure (LPP)
  - Urodynamics, such as the minimum volume of saline infused into the bladder when its pressure reaches 20 cm of water (20%min)
- Type of surgery performed:
  - Op-1: Bladder Neck Reconstruction with Augmentation
  - Op-2: Bladder Neck Reconstruction without Augmentation
  - Op-3: Bladder Neck Closure without Augmentation
  - Op-4: Bladder Neck Closure with Augmentation
- Post-op complications: infection, etc.
- Final outcome of the surgery: urine continence (wet or dry)

Sample of Urology Clinical Data

Goals and Challenges
Goals:
- Derive a set of rules from the clinical data set (training set) that summarizes outcomes based on patients' pre-op data
- Predict the operation outcome from a given patient's pre-op data (test set) and recommend the best operation to perform
Challenges:
- Small sample size and a large number of attributes
- Continuous-valued attributes, such as urodynamics measurements

Data Mining Steps
1. Separate the patients into four groups based on the type of surgery performed.
2. Within each group, partition the continuous-valued attributes into discrete intervals (cells). Since the sample size is very small, a hybrid technique is used to determine the optimal number of cells and the cell sizes.
3. Generate association rules for each patient group based on the partitioned continuous-valued attributes.
4. For a given patient with a specific set of pre-op conditions, use the rules generated from the training set to predict the success or failure rate of a specific operation.

Partitioning Continuous-Valued Attributes
Current approaches to partitioning continuous attributes:
- Domain expert guidance can be biased and inconsistent
- Statistical clustering techniques fail when the training set is small and the number of attributes is large
New hybrid approach:
- Use a data mining technique to select a small set of key attributes
- Use a statistical classification technique to perform the optimal partition (determining the cell sizes and the number of cells) over that small set of key attributes

Hybrid Clustering Technique
Select a small set of key attributes (via data mining):
- Use the domain expert's partition to mine the training set
- Select the attributes that contribute to high-confidence, high-support rules
Optimal partition (via statistical classification):
- Use a statistical classification technique (e.g., CART) to determine the optimal number of cells and the corresponding cell sizes for each attribute (see the sketch below)
Mining the optimally partitioned attribute data yields better-quality rules.
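One way to realize the CART step, sketched with scikit-learn's DecisionTreeClassifier; this is an assumption, since the slides name CART but no particular library, and the LPP data below is invented for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def cart_partition(values, labels, max_cells=4):
    """Discretize one continuous attribute into cells using CART split points.
    max_leaf_nodes bounds the number of cells; the learned thresholds
    become the cell boundaries."""
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    clf = DecisionTreeClassifier(max_leaf_nodes=max_cells).fit(x, labels)
    tree = clf.tree_
    # Thresholds of internal nodes only (leaves have children_left == -1)
    cuts = sorted(tree.threshold[i] for i in range(tree.node_count)
                  if tree.children_left[i] != -1)
    return cuts  # k cuts define k+1 cells, e.g. [0, c1], (c1, c2], ...

# Hypothetical example: LPP values vs. surgery outcome (1 = success)
lpp = [12, 15, 18, 22, 30, 34, 38, 40]
outcome = [1, 1, 1, 0, 0, 1, 1, 1]
print(cart_partition(lpp, outcome, max_cells=3))  # two cut points -> three cells
```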

Partition of Continuous Variables for the Four Operations
Partition of continuous variables into the optimal number of discrete intervals (cells) and cell sizes for the four types of operations.

Operation Type 1:
| Cell# | LPP        | SerumCrPre  |
|-------|------------|-------------|
| 1     | [0, 19]    | [0, 0.75]   |
| 2     | (19, 33.5] | [0.75, 2.2] |
| 3     | (33.5, 40] | n/a         |
| 4     | normal     | n/a         |

Operation Type 4:
| Cell# | LPP      | 20%mean       |
|-------|----------|---------------|
| 1     | [0, 19]  | [0, 33.37]    |
| 2     | (19, 69] | (33.37, 37.5] |
| 3     | normal   | (37.5, 52]    |
| 4     | n/a      | (52, 110]     |

Operation Type 2:
| Cell# | 20%min     | 20%mean    | 30%min     | 30%mean    | LPP      | SerumCrPre |
|-------|------------|------------|------------|------------|----------|------------|
| 1     | [80, 118]  | [50, 77]   | [100, 170] | [51, 51]   | [12, 20] | [0, 0.5]   |
| 2     | [145, 178] | [88, 104]  | [206, 241] | [94, 113]  | [24, 36] | [0.7, 1.4] |
| 3     | [221, 264] | [135, 135] | n/a        | [135, 135] | normal   | n/a        |

Operation Type 3:
| Cell# | 20%min     | 20%mean   | 30%min     | 30%mean    | LPP     | SerumCrPre |
|-------|------------|-----------|------------|------------|---------|------------|
| 1     | [103, 130] | [57, 75]  | [129, 157] | [86, 93]   | [6, 29] | [0.3, 0.7] |
| 2     | [156, 225] | [92, 105] | [188, 223] | [100, 121] | [30, 40]| [1.0, 1.5] |

Recommending an Operation Based on Rules Derived from the Training Set
1. Transform the patient's pre-op data for the continuous-valued attributes using the optimal partitions for each operation
2. Find the set of rules (from the training set) that matches the patient's pre-op data
3. Compare the matched rules across operations and recommend the type of surgery that provides the best match (see the sketch below)
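A sketch of the matching-and-recommendation step; the rule encoding and the "most matched attributes" criterion are assumptions drawn from the Matt example on the following slides, not the authors' exact scoring.

```python
def matching_rules(rules, patient):
    """Rules whose conditions are all satisfied by the discretized pre-op profile."""
    return [r for r in rules
            if all(patient.get(attr) == val for attr, val in r["conditions"].items())]

def recommend(rules_by_op, patient):
    """Pick the operation whose matched success rule covers the most attributes."""
    best_op, best_depth = None, -1
    for op, rules in rules_by_op.items():
        for r in matching_rules(rules, patient):
            if r["outcome"] == "Success" and len(r["conditions"]) > best_depth:
                best_op, best_depth = op, len(r["conditions"])
    return best_op

# Hypothetical rules and a discretized profile like patient Matt's
rules_by_op = {
    "Op-3": [{"conditions": {"CS": 1, "SerumCrPre": 1}, "outcome": "Success"}],
    "Op-4": [{"conditions": {"A": 4, "CS": 1, "M": 1, "LPP": 2}, "outcome": "Success"}],
}
patient = {"A": 4, "CS": 1, "M": 1, "LPP": 2, "SerumCrPre": 1}
print(recommend(rules_by_op, patient))  # Op-4: matches more attributes than Op-3
```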

Example: Prediction for Matt
Patient Matt's pre-operative conditions cover the attributes Ambulatory Status (A), Cath Skills (CS), SerumCrPre, 20%min, 20%mean (M), 30%min, 30%mean, LPP, and UPP (unknown). [Table of raw values not recoverable from the transcript.]

Discretized pre-operative conditions of patient Matt under each operation's partition; attributes not used in rule generation are denoted n/a:

| Surgery | A | CS | SerumCrPre | Urodynamics (20%min / M) | LPP |
|---------|---|----|------------|--------------------------|-----|
| Op-1    | 4 | 1  | 1          | n/a                      | 2   |
| Op-2    | 4 | 1  | 1          | <1                       | 2   |
| Op-3    | 4 | 1  | 1          | <1                       | 1   |
| Op-4    | 4 | 1  | n/a        | 1                        | 2   |

Rule trees selected from the knowledge base that match patient Matt's pre-op profile. [Support and confidence values for these rules were lost in transcription and are omitted.]

| Surgery | Conditions                   | Outcome |
|---------|------------------------------|---------|
| Op-1    | CS=1                         | Success |
| Op-1    | CS=1 and LPP=2               | Success |
| Op-2    | CS=1 and LPP=2               | Fail    |
| Op-2    | 20%min=1 and LPP=2           | Fail    |
| Op-3    | CS=1 and SerumCrPre=1        | Success |
| Op-3    | CS=1, SerumCrPre=1 and LPP=1 | Success |
| Op-4    | A=4                          | Success |
| Op-4    | A=4 and CS=1                 | Success |
| Op-4    | A=4, CS=1 and LPP=2          | Success |
| Op-4    | A=4, CS=1 and M=1            | Success |
| Op-4    | A=4, CS=1, M=1 and LPP=2     | Success |

Based on the rule trees, Operations 3 and 4 both match patient Matt's pre-op conditions. However, Operation 4 matches more attributes in Matt's pre-op conditions than Operation 3, so Operation 4 is more desirable for patient Matt.

Representing Rules in a Hierarchical Structure
Rule tree for Op-4, represented in a spreadsheet:
- A=4 → Success (sup=32.55%, conf=0.78)
  - A=4, CS=1 → Success (sup=25.58%, conf=0.79)
    - A=4, CS=1, LPP=2 → Success (sup=18.6%, conf=0.8)
    - A=4, CS=1, M=1 → Success (sup=13.95%, conf=1)
      - A=4, CS=1, M=1, LPP=2 → Success (sup=13.95%, conf=1)
User feedback on the spreadsheet interface was favorable because of its ease of rule searching and sorting.

Lessons Learned from Mining Data with Small Sample Sizes
- For small sample sizes, hybrid clustering yields better results than conventional unsupervised clustering techniques
- Hybrid clustering enables the generation of useful rules from small samples, which could not be achieved with data mining or statistical classification methods alone

Conclusion
Mining pregnancy data:
- Discovered drug-exposure side effects (associations)
- Advantages over traditional statistical approaches: independent of hypotheses, independent of the sample size, and able to derive rules with temporal information
- The seed-attribute approach effectively discovers exception rules via the rule hierarchy
Mining urology clinical data:
- Derived association rules relating patients' pre-op conditions to operation outcomes for the different types of operations
- A hybrid clustering technique derives optimal partitions for continuous-valued attributes; this technique is critical for deriving high-quality rules from small samples with large numbers of attributes

Reference
- Qinghua Zou, Yu Chen, Wesley W. Chu, and Xinchun Lu. "Mining association rules from tabular data guided by maximal frequent itemsets." Book chapter in Foundations and Advances in Data Mining, edited by Wesley W. Chu and T.Y. Lin, Springer, 2005.
- Yu Chen, Lars Henning Pedersen, Wesley W. Chu, and Jorn Olsen. "Drug Exposure Side Effects from Mining Pregnancy Data." SIGKDD Explorations 9(1), June 2007, Special Issue on Data Mining for Health Informatics, guest editors Raymond Ng and Jian Pei.
- Q. Zou, W.W. Chu, and B. Lu. "SmartMiner: A depth-first search algorithm guided by tail information for mining maximal frequent itemsets." In Proc. of the IEEE Intl. Conf. on Data Mining (ICDM), 2002.
- R. Agrawal and R. Srikant. "Fast algorithms for mining association rules." In Proc. of the 20th VLDB Conference, Santiago, Chile, 1994.
- D. Burdick, M. Calimlim, and J. Gehrke. "MAFIA: A maximal frequent itemset algorithm for transactional databases." In Intl. Conf. on Data Engineering (ICDE), April 2001.
- K. Gouda and M.J. Zaki. "Efficiently mining maximal frequent itemsets." In Proc. of the IEEE Intl. Conf. on Data Mining, San Jose, 2001.

Reference (continued)
- B. Liu, M. Hu, and W. Hsu. "Multi-level organization and summarization of the discovered rules." In Proc. of the ACM SIGKDD Intl. Conf. on Knowledge Discovery & Data Mining, Aug. 2000, Boston, USA.
- B. Liu, M. Hu, and W. Hsu. "Intuitive representation of decision trees using general rules and exceptions." In Proc. of the Seventeenth National Conference on Artificial Intelligence (AAAI-2000), July 30 - Aug. 3, 2000, Austin, Texas, USA.
- Frequent Itemset Mining Implementations Repository.