Discriminative Frequent Pattern Analysis for Effective Classification
By Hong Cheng, Xifeng Yan, Jiawei Han, Chih-Wei Hsu
Presented by Mary Biddle


Introduction: Pattern Example
Example transactions: ABCD, ABCF, BCD, BCEF
Single-item frequencies: A = 2, B = 4, C = 4, D = 2, E = 1, F = 2
Pair frequencies: AB = 2, BC = 4, CD = 2, CE = 1, CF = 2
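To make the counts above concrete, here is a minimal Python sketch (not from the paper) that recomputes the item and pair frequencies over the four example transactions:

```python
# The four example transactions from the slide above
transactions = [set("ABCD"), set("ABCF"), set("BCD"), set("BCEF")]

def support(pattern, transactions):
    """Count how many transactions contain every item of the pattern."""
    items = set(pattern)
    return sum(items <= t for t in transactions)

for item in "ABCDEF":
    print(item, support(item, transactions))   # A=2, B=4, C=4, D=2, E=1, F=2

for pair in ["AB", "BC", "CD", "CE", "CF"]:
    print(pair, support(pair, transactions))   # AB=2, BC=4, CD=2, CE=1, CF=2
```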

Motivation
–Why are frequent patterns useful for classification?
–Why do frequent patterns provide a good substitute for the complete pattern set?
–How does frequent pattern-based classification achieve both high scalability and accuracy on large datasets?
–What is the strategy for setting the minimum support threshold?
–Given a set of frequent patterns, how should we select high-quality ones for effective classification?

Introduction: Fisher Score Definition
In statistics and information theory, the Fisher information is the variance of the score. It measures the amount of information that an observable random variable X carries about an unknown parameter θ on which the likelihood function of θ, L(θ) = f(X; θ), depends. The likelihood function is the joint probability of the data (the Xs), conditional on the value of θ, viewed as a function of θ.

Introduction: Information Gain Definition
In probability theory and information theory, information gain is a measure of the difference between two probability distributions: from a "true" probability distribution P to an arbitrary probability distribution Q. The expected information gain is the change in information entropy from a prior state to a state that takes some information as given. Usually, an attribute with high information gain is preferred over other attributes.

Model: Combined Feature Definition
Each (attribute, value) pair is mapped to a distinct item in I = {o_1, …, o_d}. A combined feature α = {o_{α1}, …, o_{αk}} is a subset of I, where each o_{αi} ∈ {o_1, …, o_d}, 1 ≤ i ≤ k, and each o_i ∈ I is a single feature. Given a dataset D = {x_i}, the set of data containing α is denoted D_α = {x_i | x_{i,αj} = 1, ∀ o_{αj} ∈ α}.

Model: Frequent Combined Feature Definition
For a dataset D, a combined feature α is frequent if θ = |D_α| / |D| ≥ θ_0, where θ is the relative support of α and θ_0 is the min_sup threshold, 0 ≤ θ_0 ≤ 1. The set of frequent combined features is denoted F.

Model: Information Gain
For a pattern α represented by a random variable X, the information gain is
IG(C|X) = H(C) − H(C|X)
where H(C) is the entropy and H(C|X) is the conditional entropy. Given a dataset with a fixed class distribution, H(C) is a constant.
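A minimal Python sketch of this formula, assuming the class labels and a binary indicator of the pattern's presence are available as lists (the function names and toy example are illustrative, not from the paper):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C): entropy of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def information_gain(labels, pattern_present):
    """IG(C|X) = H(C) - H(C|X), where X is a binary pattern indicator."""
    n = len(labels)
    h_cond = 0.0
    for value in (True, False):
        subset = [c for c, x in zip(labels, pattern_present) if x == value]
        if subset:
            h_cond += len(subset) / n * entropy(subset)
    return entropy(labels) - h_cond

# Toy example: the pattern appears only in the positive instances
print(information_gain(["+", "+", "-", "-"], [True, True, False, False]))  # 1.0
```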

Model: Information Gain Upper Bound
The information gain upper bound IG_ub is
IG_ub(C|X) = H(C) − H_lb(C|X)
where H_lb(C|X) is the lower bound of the conditional entropy H(C|X).

Model: Fisher Score
The Fisher score is defined as
Fr = (Σ_{i=1}^{c} n_i (μ_i − μ)^2) / (Σ_{i=1}^{c} n_i σ_i^2)
where n_i is the number of data samples in class i, μ_i is the average feature value in class i, σ_i is the standard deviation of the feature value in class i, and μ is the average feature value in the whole dataset.
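The same score in a short Python sketch, assuming NumPy and a per-instance feature value such as a 0/1 pattern indicator (names and the toy example are illustrative):

```python
import numpy as np

def fisher_score(feature_values, labels):
    """Fr = sum_i n_i (mu_i - mu)^2 / sum_i n_i sigma_i^2 (formula above)."""
    x = np.asarray(feature_values, dtype=float)
    y = np.asarray(labels)
    mu = x.mean()
    num = den = 0.0
    for c in np.unique(y):
        vals = x[y == c]
        num += len(vals) * (vals.mean() - mu) ** 2
        den += len(vals) * vals.var()          # within-class variance sigma_i^2
    return num / den if den > 0 else float("inf")

# Toy example: binary pattern indicator over four instances
print(fisher_score([1, 1, 0, 1], ["+", "+", "-", "-"]))  # 0.5
```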

Model: Relevance Measure S
A relevance measure S is a function mapping a pattern α to a real value such that S(α) is the relevance of α w.r.t. the class label. Measures such as information gain and Fisher score can be used as the relevance measure.

Model: Redundancy Measure R
A redundancy measure R is a function mapping two patterns α and β to a real value such that R(α, β) is the redundancy between them:
R(α, β) = (P(α, β) / (P(α) + P(β) − P(α, β))) × min(S(α), S(β))
where P denotes the relative support (probability); the first factor is the Jaccard coefficient between the sets of instances containing α and β.

Model: Gain of a Pattern
The gain of a pattern α, given the set of already selected patterns F_s, is
g(α) = S(α) − max_{β ∈ F_s} R(α, β).
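A minimal Python sketch of R and g, assuming each pattern's covered instance set D_α and relevance S(α) have already been computed (the dictionary names are illustrative). The first factor of R is treated as the Jaccard overlap of the two patterns' instance sets, which is what P(α, β) / (P(α) + P(β) − P(α, β)) reduces to when P is the relative support:

```python
def jaccard(d_alpha, d_beta):
    """Overlap between the instance sets covered by two patterns."""
    union = d_alpha | d_beta
    return len(d_alpha & d_beta) / len(union) if union else 0.0

def redundancy(a, b, covered, relevance):
    """R(a, b) = Jaccard(D_a, D_b) * min(S(a), S(b))."""
    return jaccard(covered[a], covered[b]) * min(relevance[a], relevance[b])

def gain(a, selected, covered, relevance):
    """g(a) = S(a) - max over b in F_s of R(a, b)."""
    if not selected:
        return relevance[a]
    return relevance[a] - max(redundancy(a, b, covered, relevance) for b in selected)
```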

Algorithm: Framework of Frequent Pattern-Based Classification
1. Feature generation
2. Feature selection
3. Model learning

Algorithm 1. Feature Generation
1. Compute the information gain (or Fisher score) upper bound as a function of support θ.
2. Choose an information gain threshold IG_0 for feature filtering.
3. Find θ* = arg max_θ {IG_ub(θ) ≤ IG_0}, i.e., the largest support whose upper bound stays within the threshold.
4. Mine frequent patterns with min_sup = θ*.
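A hedged sketch of step 3, assuming some function ig_upper_bound(θ) is available (it is a placeholder here; the paper derives the bound from H_lb(C|X)); the candidate supports and the toy bound in the usage line are made up for illustration:

```python
def choose_min_sup(ig_upper_bound, ig_threshold, candidate_supports):
    """Return the largest support theta with IG_ub(theta) <= IG_0."""
    feasible = [t for t in candidate_supports if ig_upper_bound(t) <= ig_threshold]
    return max(feasible) if feasible else min(candidate_supports)

# Usage with a made-up bound, only to show the call shape
theta_star = choose_min_sup(lambda t: 2 * t, ig_threshold=0.05,
                            candidate_supports=[0.01, 0.02, 0.05, 0.10])
print(theta_star)  # 0.02 under this toy bound
```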

Algorithm 2. Feature Selection
Feature selection uses the MMRFS algorithm, a maximal-marginal-relevance-style procedure: patterns are selected greedily by the gain g(α), with a coverage requirement on the training instances (see the sketch below).
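Below is a minimal sketch of an MMR-style greedy selection consistent with the gain g(α) defined earlier; it reuses the gain(), covered, and relevance objects from the previous sketch and assumes covered maps each pattern to a set of instance indices. The coverage rule (keep adding patterns until every training instance is covered at least `coverage` times) follows the general idea of MMRFS, but the exact stopping condition and default value here are assumptions for illustration:

```python
def mmrfs(patterns, covered, relevance, num_instances, coverage=1):
    """Greedily pick the pattern with maximal gain until coverage is reached."""
    selected = []
    remaining = set(patterns)
    times_covered = [0] * num_instances
    while remaining and any(c < coverage for c in times_covered):
        best = max(remaining, key=lambda a: gain(a, selected, covered, relevance))
        remaining.discard(best)
        # Keep the pattern only if it covers an instance that still needs coverage
        if any(times_covered[i] < coverage for i in covered[best]):
            selected.append(best)
            for i in covered[best]:
                times_covered[i] += 1
    return selected
```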

Algorithm 3. Model Learning
Use the resulting features as input to the learning model of your choice.
–The authors experimented with SVM and C4.5.
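A small sketch of this step, using scikit-learn's LinearSVC and DecisionTreeClassifier as stand-ins for the paper's SVM and C4.5 (the toy instances, labels, and selected patterns are made up):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

def to_feature_matrix(instances, selected_patterns):
    """Binary matrix: entry (i, j) is 1 if instance i contains pattern j."""
    return np.array([[1 if set(p) <= inst else 0 for p in selected_patterns]
                     for inst in instances])

instances = [set("ABCD"), set("ABCF"), set("BCD"), set("BCEF")]
labels = ["+", "+", "-", "-"]
selected_patterns = ["AB", "CD", "CF"]

X = to_feature_matrix(instances, selected_patterns)
svm = LinearSVC().fit(X, labels)                 # stand-in for SVM
tree = DecisionTreeClassifier().fit(X, labels)   # stand-in for C4.5
print(svm.predict(X), tree.predict(X))
```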

Contributions
–Propose a framework of frequent pattern-based classification by analyzing the relationship between pattern frequency and its predictive power.
–Frequent pattern-based classification can exploit state-of-the-art frequent pattern mining algorithms for feature generation, with much better scalability.
–Suggest a strategy for setting the minimum support threshold.
–Propose an effective and efficient feature selection algorithm to select a set of frequent and discriminative patterns for classification.

Experiments: Accuracy with SVM and C4.5

Experiments: Accuracy and Time Measures

Related Work
–Associative classification: the association between frequent patterns and class labels is used for prediction. A classifier is built from high-confidence, high-support association rules.
–Top-k rule mining: a recent work discovers the top-k covering rule groups for each row of gene expression profiles. Prediction is performed with a classification score that combines the support and confidence measures of the rules.
–HARMONY (mines classification rules): it uses an instance-centric rule-generation approach and ensures that, for each training instance, one of the highest-confidence rules covering that instance is included in the rule set. It is more efficient and scalable than previous rule-based classifiers, and on several datasets its accuracy was significantly higher (e.g., on Waveform, and by 3.4% on Letter Recognition).
–Other work that uses frequent patterns as features: string kernels, word combinations (NLP), and structural features in graph classification.

Differences Between Associative Classification and Discriminative Frequent Pattern Analysis
–In discriminative frequent pattern analysis, frequent patterns are used to represent the data in a different feature space; associative classification builds a classifier using rules only.
–In associative classification, prediction means finding one or several top-ranked rule(s) for the instance; in discriminative frequent pattern analysis, the prediction is made by the learned classification model.
–Information gain is used to discriminate among patterns: it determines min_sup and guides the selection of the frequent patterns.

Pros and Cons
Pros
–Reduces time
–More accurate
Cons
–Space concerns on large datasets, because the entire frequent pattern set is generated initially.