Summarizing Itemset Patterns: A Profile-Based Approach

Presentation transcript:

Summarizing Itemset Patterns: A Profile-Based Approach
Xifeng Yan, Hong Cheng, Jiawei Han, Dong Xin
ACM KDD '05
Advisor: Jia-Ling Koh
Speaker: Yu-Jiun Liu
2006/01/06

Introduction Ⅰ
Closed frequent pattern: a frequent pattern that has no super-pattern with the same support.
Maximal frequent pattern: a frequent pattern that has no frequent super-pattern.
Top-K patterns vs. K representatives

Introduction Ⅱ
What format should these representatives take?
How can these representatives be found?
How should their quality be measured?

Definition
Bernoulli Distribution Vector
Pattern Profile

Equations
The relative frequency of item o_i in D'.
Estimated support.
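The two formulas on this slide did not survive in the transcript. A plausible reconstruction, consistent with the arithmetic in the pattern profile example (p(a) is a relative frequency, and 0.87 ≈ 1 × 0.91 × 0.96), is sketched here. The indicator t_ij (1 if transaction t_j contains item o_i, else 0) and the symbol ŝ for the estimated support of a pattern α under a profile with support ρ are assumed notation, not taken from the slide.

```latex
p(x_i = 1) = \frac{\sum_{t_j \in D'} t_{ij}}{|D'|},
\qquad
\hat{s}(\alpha) = \rho \cdot \prod_{o_i \in \alpha} p(x_i = 1)
```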

Pattern Profile Example
Both of the above datasets can be summarized by <abcd>, but the quality is better for D1.
p(a) = (50 + 1000) / (50 + 100 + 1000) = 0.91
Mabc = <[0.91, 0.96, 1], abcd, 0.87>
M = <[0.91, 0.96, 1, 1], abcd, 1>
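The arithmetic above can be reproduced with a short sketch. The dataset is an assumption inferred from the numbers on the slide (50 transactions {a,c,d}, 100 transactions {b,c,d}, 1000 transactions {a,b,c,d}); the original table is not in the transcript. The helper names `build_profile` and `estimated_support` are hypothetical. The master pattern is taken here as the union of the items in the supporting transactions, which coincides with <abcd> in this example, and the estimated support is assumed to be the profile support times the product of the item probabilities.

```python
from collections import Counter

# Assumed dataset, inferred from the slide's arithmetic (not given in the text).
D1 = {frozenset("acd"): 50, frozenset("bcd"): 100, frozenset("abcd"): 1000}


def build_profile(transactions, total_size):
    """Return (p, master, rho) for a multiset of supporting transactions.

    transactions: dict mapping itemset -> multiplicity (the set D').
    total_size:   number of transactions in the full database D.
    """
    n = sum(transactions.values())
    counts = Counter()
    for itemset, mult in transactions.items():
        for item in itemset:
            counts[item] += mult
    p = {item: c / n for item, c in counts.items()}  # relative frequencies
    master = frozenset(counts)                       # union of observed items
    rho = n / total_size                             # assumed: |D'| / |D|
    return p, master, rho


def estimated_support(pattern, profile):
    """Assumed estimate: rho times the product of item probabilities."""
    p, master, rho = profile
    if not set(pattern) <= master:
        return 0.0
    est = rho
    for item in pattern:
        est *= p[item]
    return est


profile = build_profile(D1, total_size=1150)
p, master, rho = profile
print({k: round(v, 2) for k, v in sorted(p.items())})
print(round(estimated_support("abcd", profile), 2))
```

Under these assumptions the estimated support of abcd (about 0.87) is close to its actual relative support of 1000/1150, which is what a good-quality profile should give.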

Pattern Summarization
First, construct an initial profile for each pattern that contains only that pattern.
Then merge similar patterns, using the Kullback-Leibler divergence between their profiles to measure similarity.
KL-divergence
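The KL-divergence formula itself is missing from the transcript. As a minimal sketch, assuming each profile's distribution vector is treated as a set of independent Bernoulli variables, the divergence can be computed dimension by dimension. The function name `kl_bernoulli` and the smoothing constant `eps` are assumptions introduced for illustration; the smoothing only keeps the logarithm finite when a probability is exactly 0 or 1.

```python
import math


def kl_bernoulli(p, q, eps=1e-9):
    """KL(p || q) summed over independent Bernoulli dimensions.

    p, q: sequences of probabilities p(x_i = 1), of the same length.
    eps:  assumed smoothing constant that clamps values away from 0 and 1.
    """
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        total += pi * math.log(pi / qi)
        total += (1 - pi) * math.log((1 - pi) / (1 - qi))
    return total


# Identical vectors have zero divergence; different ones are positive.
same = kl_bernoulli([0.91, 0.96, 1.0], [0.91, 0.96, 1.0])
diff = kl_bernoulli([0.9], [0.5])
print(same, round(diff, 4))
```

Note that KL divergence is not symmetric, so a clustering procedure has to decide which direction to use or whether to sum both directions.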

Hierarchical Agglomerative Clustering
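The algorithm shown on this slide is not in the transcript. The following is only a rough sketch of agglomerative merging under stated assumptions: each pattern starts as its own cluster, the distance between two clusters is the symmetric sum of the two KL divergences between their item-frequency vectors (an assumed choice), and merging two clusters unions their master patterns and their supporting transactions. The names `summarize`, `profile_vector`, and `kl` are hypothetical.

```python
import math
from itertools import combinations


def profile_vector(tids, db, items):
    """Relative frequency of each item among the transactions in tids."""
    n = len(tids)
    return [sum(item in db[t] for t in tids) / n for item in items]


def kl(p, q, eps=1e-6):
    """Smoothed KL divergence over independent Bernoulli dimensions."""
    total = 0.0
    for pi, qi in zip(p, q):
        pi = min(max(pi, eps), 1 - eps)
        qi = min(max(qi, eps), 1 - eps)
        total += pi * math.log(pi / qi)
        total += (1 - pi) * math.log((1 - pi) / (1 - qi))
    return total


def summarize(patterns, db, k):
    """Greedily merge the two closest clusters until only k remain.

    patterns: iterable of itemsets, each supported by at least one transaction.
    db:       list of transactions, each a set of items.
    """
    items = sorted(set().union(*db))
    # One cluster per pattern: (master pattern, supporting transaction ids).
    clusters = [
        (frozenset(pat), {t for t, tx in enumerate(db) if set(pat) <= tx})
        for pat in patterns
    ]
    while len(clusters) > k:
        def dist(pair):
            pa = profile_vector(clusters[pair[0]][1], db, items)
            pb = profile_vector(clusters[pair[1]][1], db, items)
            return kl(pa, pb) + kl(pb, pa)  # assumed symmetric distance

        i, j = min(combinations(range(len(clusters)), 2), key=dist)
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] | clusters[j][1])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters


# Tiny made-up database with two obvious groups of patterns.
db = [set("abc")] * 5 + [set("xyz")] * 5
result = summarize(["ab", "abc", "xy", "xyz"], db, k=2)
print(sorted("".join(sorted(master)) for master, _ in result))
```

This version recomputes every pairwise distance in each round, which is simple but quadratic per merge; a practical implementation would cache distances.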

K-means Clustering

Optimization Heuristics
Closed itemsets vs. frequent itemsets: given patterns α and β, if […] and their supports are equal, then […].
Approximate profiles: use the following two equations instead of the original profile update.
[…] for Algorithm 1
[…] for Algorithm 2

Quality Evaluation
Definition (Restoration Error)
T is a testing pattern set.
T' is the collection of the itemsets generated by the master patterns in the profiles.

Quality Evaluation
J tests "frequent patterns", some of which may be estimated as "infrequent".
Jc tests "estimated frequent patterns", some of which are actually "infrequent".
Therefore J and Jc are complementary to each other.
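The formulas for J and Jc are not preserved in the transcript. A minimal sketch, assuming both are average relative errors between true and estimated supports: J is taken over the testing set T and normalized by the true support, while Jc is taken over T' and normalized by the estimated support. Both normalizations are assumptions, the helper name `mean_relative_error` is hypothetical, and the support values are made up purely for illustration.

```python
def mean_relative_error(patterns, reference, other):
    """Average of |reference - other| / reference over the given patterns."""
    errors = [
        abs(reference[pat] - other[pat]) / reference[pat] for pat in patterns
    ]
    return sum(errors) / len(errors)


# Illustrative (made-up) supports, not results from the presentation.
true_support = {"ab": 0.50, "abc": 0.40, "bd": 0.02}
est_support = {"ab": 0.45, "abc": 0.44, "bd": 0.10}

T = ["ab", "abc"]              # testing set of truly frequent patterns
T_prime = ["ab", "abc", "bd"]  # itemsets generated from the master patterns

J = mean_relative_error(T, true_support, est_support)
Jc = mean_relative_error(T_prime, est_support, true_support)
print(round(J, 3), round(Jc, 3))
```

In this toy data the itemset bd is estimated far above its true support, which J never sees because bd is not in T; only Jc penalizes it.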

Quality Evaluation
Lemma: For any frequent itemset π, there must exist a profile Mk such that π ⊆ ψk, where ψk is the master itemset of Mk.

Optimal Number of Profiles
How to determine K?
M = (p, ψ, ρ)
Ex: require […] for any i such that […]
p ~ q, α ~ β ⇒ Dα ~ Dβ ~ Dα ∪ Dβ
Check the derivative of the quality over K: if J increases suddenly from K* to K* - 1, K* is likely to be a good choice.
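The rule of thumb above can be sketched as a search for the largest jump in restoration error when one profile is removed. The function name `pick_k` is hypothetical, and the error curve is made up for illustration only.

```python
def pick_k(errors):
    """Return the K for which going from K to K - 1 raises the error the most.

    errors: dict mapping K -> restoration error J(K).
    Only values of K whose predecessor K - 1 is also present are considered.
    """
    best_k, best_jump = None, float("-inf")
    for k in sorted(errors):
        if k - 1 not in errors:
            continue
        jump = errors[k - 1] - errors[k]
        if jump > best_jump:
            best_k, best_jump = k, jump
    return best_k


# Illustrative (made-up) error curve: J rises sharply below K = 4.
errors = {2: 0.60, 3: 0.55, 4: 0.20, 5: 0.18, 6: 0.17}
print(pick_k(errors))
```

Taking the single largest jump is the simplest reading of the heuristic; in practice the curve may have several comparable jumps, and inspecting it by eye is a reasonable complement.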

Optimal Number of Profiles

Experiment
Three real datasets and a series of synthetic datasets.
Language: Visual C++
CPU: Intel 3.2GHz
Memory: 1GB
OS: Windows XP

Mushroom
※ 688 closed patterns

BMS-Webview1 & Replace
※ threshold = 0.1%; 4195 closed patterns; many small frequent itemsets
※ threshold = 3%; 4315 closed patterns; many small frequent itemsets

Synthetic Datasets
Provided by IBM.
7 datasets, each with 10,000 transactions.
Choose top-500.
K = 50 and 100.