
Scalable Rule-Based Gene Expression Data Classification
Mark A. Iwen (Department of Mathematics), Willis Lang, Jignesh M. Patel (EECS Department), University of Michigan, USA (ICDE 2008)

Outline: Introduction; Preliminaries; Method – Boolean Structure Table (BST); Boolean Structure Table Classifier (BSTC); Experiments; Conclusions

Introduction
Association rule-based classifiers for gene expression data operate in two phases: (i) association rule mining from the training data, followed by (ii) classification of query data using the mined rules. The mining phase can require an exponential search over the training data set's samples, which motivates a heuristic rule-based gene expression data classifier.

Introduction (example rules): g1, g2 → cancer; g5, g6 → healthy

Preliminaries
A finite set G of genes and N collections of subsets of G: C_1 = {s_{1,1}, ..., s_{1,m_1}}, ..., C_N = {s_{N,1}, ..., s_{N,m_N}}. Each C_i is a class type (class label), each s_{i,j} ⊂ G is a sample, and every element g ∈ G is a gene. If g ∈ s_{i,j}, we say that sample s_{i,j} expresses gene g.
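A minimal Python sketch of this data model makes the notation concrete. The gene names and samples below are illustrative assumptions (chosen so that the later slides' example numbers work out), not the paper's actual table:

```python
# Genes are strings; a sample is the set of genes it expresses.
# classes maps each class label C_i to its list of samples s_{i,j}.
G = {"g1", "g2", "g3", "g4", "g5", "g6"}

classes = {
    "Cancer":  [{"g1", "g2", "g3"}, {"g1", "g3", "g5"}, {"g2", "g4", "g5"}],
    "Healthy": [{"g2", "g5", "g6"}, {"g4", "g5", "g6"}],
}

def expresses(sample, gene):
    """Sample s_{i,j} expresses gene g iff g is an element of s_{i,j}."""
    return gene in sample
```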

Preliminaries
Conjunctive association rule (CAR): g_{j_1}, ..., g_{j_r} ⇒ C_n — if a query sample s contains all of the genes g_{j_1}, ..., g_{j_r}, then it should be grouped with class type C_n.
Support: the number of samples of class C_n that express all of g_{j_1}, ..., g_{j_r}.
Confidence: the support divided by the number of samples of any class that express all of g_{j_1}, ..., g_{j_r}.

Preliminaries
Example CAR: g1, g3 ⇒ Cancer, with support = 2 and confidence = 1 (two Cancer samples express both g1 and g3, and no sample of another class does).
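Continuing the sketch above, the CAR support and confidence definitions translate directly into code; on the assumed samples, g1, g3 ⇒ Cancer comes out with support 2 and confidence 1, matching the slide:

```python
def car_support(genes, label, classes):
    """Support: number of samples of class `label` expressing all `genes`."""
    return sum(all(g in s for g in genes) for s in classes[label])

def car_confidence(genes, label, classes):
    """Confidence: support divided by the number of samples of any class
    that express all `genes`."""
    matching = sum(all(g in s for g in genes)
                   for samples in classes.values() for s in samples)
    return car_support(genes, label, classes) / matching

print(car_support({"g1", "g3"}, "Cancer", classes))     # 2
print(car_confidence({"g1", "g3"}, "Cancer", classes))  # 1.0
```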

Preliminaries
Boolean association rule (BAR): B ⇒ C_i, where B is a Boolean expression over gene-presence indicators — if B(s[g_1], ..., s[g_n]) evaluates to true for a given sample s, then s should belong to class C_i. Support and confidence are defined as for CARs, with "expresses all of the rule's genes" replaced by "satisfies B". Example: B(x_1, x_2, x_3, x_4, x_5, x_6) = (x_1 ∧ x_3) ∨ (x_2 ∧ x_4), with support 3 and confidence 1.
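The same measures extend to BARs by testing the Boolean expression instead of gene containment. A sketch over the assumed data, on which the slide's example expression also gets support 3 and confidence 1 for Cancer:

```python
def bar_support_confidence(B, label, classes):
    """Support: number of `label` samples on which B is true; confidence:
    support over the number of samples of any class on which B is true."""
    support = sum(B(s) for s in classes[label])
    total = sum(B(s) for samples in classes.values() for s in samples)
    return support, support / total

# B(x1, ..., x6) = (x1 AND x3) OR (x2 AND x4), over a sample's gene set.
B = lambda s: ("g1" in s and "g3" in s) or ("g2" in s and "g4" in s)
print(bar_support_confidence(B, "Cancer", classes))  # (3, 1.0)
```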

Boolean Structure Table (BST)
The BST is built from the training samples' expression table; recall that if g ∈ s_{i,j}, we say that sample s_{i,j} expresses gene g.

Boolean association rule (BAR)
Gene-row BARs with 100% confidence values:
Gene g1: (g1 expressed) ⇒ Cancer.
Gene g2: (g2 expressed AND [EITHER (g1 expressed) OR (either g5 or g3 not expressed)]) ⇒ Cancer.
Gene g3: (g3 expressed AND [EITHER {(g1 expressed) AND (either g4 or g6 not expressed)} OR {(either g2 or g5 not expressed) AND (either g4 or g5 not expressed)}]) ⇒ Cancer.
...
The bracketed conditions are each gene row's exclusion list: clauses that rule out the samples of other classes which also express the gene, as sketched below.
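A hedged sketch of the exclusion-list idea, continuing the data above. This is a simplified reading of the slides — each clause here merely requires the query to differ from every other-class sample that expresses the gene — not the paper's exact BST construction:

```python
def gene_row_bar(gene, label, classes):
    """Build the gene-row BAR for `gene` in class `label`: the gene must be
    expressed, and exclusion clauses must rule out every sample of another
    class that also expresses it (keeping training confidence at 100%)."""
    excluded = [s for other, samples in classes.items() if other != label
                for s in samples if gene in s]

    def bar(query):
        if gene not in query:
            return False
        # Simplified exclusion clause: the query must differ from each
        # excluded sample on at least one gene.
        return all(query != s for s in excluded)

    return bar

# No Healthy sample expresses g1, so the rule reduces to (g1 expressed).
rule = gene_row_bar("g1", "Cancer", classes)
print(rule({"g1", "g3"}))  # True
```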

Why BARs with 100% Confidence?
If we remove all exclusion-list clauses related to sample row s5, the g3 rule weakens to (g3 expressed AND [EITHER (g1 expressed) OR (either g2 or g5 not expressed)]) ⇒ Cancer, and its confidence falls below 100%: the weakened rule now also matches the non-Cancer sample s5.
Theorem: for every 100% confident CAR there exists a 100% confident BST-generated BAR B ⇒ C with supp(B ⇒ C) ≥ supp(CAR); the exclusion lists guarantee that B matches no non-C samples.

Boolean Structure Table Classifier (BSTC)
BSTCE(T(i), Q) evaluates class i's BST T(i) against a query Q. For example, Q = {g1 expressed, g2 not expressed, g3 not expressed, g4 expressed, g5 expressed, g6 not expressed} yields a Healthy classification value of 3/8.
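Classification then reduces to computing BSTCE(T(i), Q) for every class and taking the best. A sketch of that outer loop — the per-class score used here (the fraction of the class's gene-row BARs the query satisfies) is an assumed stand-in for the paper's actual BSTCE procedure, so it will not reproduce the slide's 3/8 value exactly:

```python
def bstce(label, query, classes):
    """Assumed stand-in for BSTCE(T(i), Q): the fraction of the class's
    gene-row BARs satisfied by the query."""
    genes = sorted({g for s in classes[label] for g in s})
    bars = [gene_row_bar(g, label, classes) for g in genes]
    return sum(bar(query) for bar in bars) / len(bars)

def bstc_classify(query, classes):
    """Assign the query to the class with the highest BSTCE value."""
    return max(classes, key=lambda label: bstce(label, query, classes))

# The slide's query, represented by its expressed genes only.
Q = {"g1", "g4", "g5"}
print(bstc_classify(Q, classes))
```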

Runtime Complexity for BST Creation
Given classes C_1, ..., C_N over sample set S, BSTs can be constructed for all C_i in O(|S|^2 · |G|) time.

Runtime Complexity for BSTC
Constructing all of the BSTs T(1), ..., T(N) means BSTC requires O(|S|^2 · |G|) time and space. During classification, BSTC must calculate BSTCE(T(i), Q) for 1 ≤ i ≤ N. BSTCE runs in O((|S| − |C_i|) · |G| · |C_i|) time per query sample, so the worst-case evaluation time is also O(|S|^2 · |G|) per query sample.
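For a sense of scale: with |S| = 100 training samples and |G| = 10,000 genes, O(|S|^2 · |G|) is on the order of 100^2 × 10,000 = 10^8 cell operations for BST construction — polynomial in the input size, in contrast to the exponential search over the training samples that exact rule mining can require.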

Experiments

Conclusions
BSTC retains the classification accuracy of current association rule-based methods while being orders of magnitude faster than the leading classifier, RCBT, on large datasets. Compared with other rule-based classifiers: (i) BSTC is easy to use (it requires no parameter tuning), and (ii) BSTC can easily handle datasets with any number of class types.