Efficient and Effective Itemset Pattern Summarization: Regression-based Approaches Ruoming Jin Kent State University Joint work with Muad Abu-Ata, Yang Xiang, and Ning Ruan (KSU)

Problem Definition
Given a large collection of frequent itemsets and their supports, how can we concisely represent them?
–Coverage criterion: The Spanning Set Approach [F. Afrati, A. Gionis, H. Mannila, Approximating a collection of frequent sets, KDD'04]
–Frequency criterion: The Profile-based Approach [X. Yan, H. Cheng, J. Han, and D. Xin, Summarizing itemset patterns: a profile-based approach, KDD'05]; The Markov Random Field Approach [C. Wang and S. Parthasarathy, Summarizing itemset patterns using probabilistic models, KDD'06]

Frequency Criterion
The restoration function of a set of itemsets S is a function f that maps each itemset in S to an estimate of its support. The restoration error E(S) measures how far the estimates are from the true supports; we use the 2-norm in this study:
E(S) = ( sum over alpha in S of (f(alpha) − supp(alpha))^2 )^(1/2)

Probabilistic Restoration Function
Applying the independence probabilistic model to a set of itemsets S, the support of an itemset alpha is estimated as a global scaling factor p(S) times the product of per-item probabilities:
f(alpha) = p(S) * product over items a in alpha of p(a)
An example: f({a,c}) = p(S) * p(a) * p(c)
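As a concrete sketch of the independence model above (the function name and the parameter values are our own, chosen purely for illustration):

```python
def restore(itemset, p_S, p_item):
    """Estimate an itemset's support under the independence model:
    f(alpha) = p(S) * product of p(a) for each item a in alpha."""
    est = p_S
    for item in itemset:
        est *= p_item[item]
    return est

# Hypothetical toy parameters, for illustration only.
p_S = 0.9
p_item = {"a": 0.8, "c": 0.5, "d": 0.4}

# f({a,c}) = p(S) * p(a) * p(c) = 0.9 * 0.8 * 0.5
print(restore({"a", "c"}, p_S, p_item))
```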

Problem 1: Optimal Parameters
What are the optimal parameters p(S), p(a), p(c), p(d) that minimize the restoration error E(S)?

Non-Linear Regression
For each item, introduce an independent 0/1 variable indicating whether the item appears in an itemset; each itemset in S then contributes one observation, so we have |S| data points. Fitting f directly is a non-linear regression problem, since the model is a product of its parameters.

Linear Regression Approximation
Taking logarithms turns the product model into one that is linear in log p(S) and the log p(a)'s, so ordinary least squares applies. Using a Taylor expansion, we show that the restoration error from linear regression is very close to the error of the non-linear regression!
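A minimal sketch of the log-space least-squares fit. The toy itemsets and supports are invented for illustration, and this is our reading of the approach rather than the paper's code:

```python
import numpy as np

# Hypothetical toy data: itemsets with their observed supports.
items = ["a", "c", "d"]
itemsets = [{"a"}, {"c"}, {"d"}, {"a", "c"}, {"a", "d"}, {"a", "c", "d"}]
supports = np.array([0.72, 0.45, 0.36, 0.36, 0.29, 0.14])

# Design matrix: column 0 carries the intercept log p(S); column j+1
# indicates whether item j appears in the itemset.
X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items]
              for s in itemsets])
y = np.log(supports)  # log f(alpha) = log p(S) + sum of log p(a)

# Ordinary least squares in log space.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
p_S = np.exp(coef[0])
p_item = dict(zip(items, np.exp(coef[1:])))

# Restoration error (2-norm) of the fitted model.
est = np.exp(X @ coef)
error = float(np.sqrt(np.sum((est - supports) ** 2)))
```

Each row of `X` is one data point, matching the |S| observations described above.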

Problem 2: Optimal Partition
To reduce the restoration error, we adopt a partition strategy:
–Partition the entire collection of frequent itemsets into K disjoint subsets, and build a restoration function for each subset
How can we optimally partition a set of itemsets into K disjoint subsets so that the total restoration error is minimized?

Our Approaches NP-hard problem Two heuristic algorithms –K-Regression –Tree Regression

K-Regression
A k-means-type clustering procedure:
1. Randomly partition the set of itemsets S into K partitions
2. [Regression Step] Apply regression to find the optimal parameters for each partition
3. [Re-assignment Step] Assign each itemset to the partition that minimizes its restoration error, based on the optimal parameters discovered in Step 2
4. Repeat 2 and 3 until the total restoration error no longer decreases or the improvement is small
Just like k-means, K-regression is guaranteed to converge!
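The steps above can be sketched as follows. This is a hypothetical minimal implementation that uses log-space least squares as the per-partition regression; the names and toy data are ours:

```python
import numpy as np

def k_regression(X, supports, K, iters=20, seed=0):
    """K-means-style alternation: fit a log-space least-squares model
    per partition, then reassign each itemset to the partition whose
    model restores its support best."""
    rng = np.random.default_rng(seed)
    y = np.log(supports)
    labels = rng.integers(0, K, size=len(supports))  # random initial partition
    coefs = [np.zeros(X.shape[1])] * K
    for _ in range(iters):
        coefs = []
        for k in range(K):
            mask = labels == k
            if mask.any():  # regression step on partition k
                c, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
            else:
                c = np.zeros(X.shape[1])  # empty partition: dummy model
            coefs.append(c)
        # squared restoration error of each itemset under each model
        errors = np.stack([(np.exp(X @ c) - supports) ** 2 for c in coefs])
        new_labels = errors.argmin(axis=0)  # re-assignment step
        if np.array_equal(new_labels, labels):
            break  # converged, just like k-means
        labels = new_labels
    return labels, coefs

# Hypothetical toy data: 3 items, 6 itemsets with their supports.
items = ["a", "c", "d"]
itemsets = [{"a"}, {"c"}, {"d"}, {"a", "c"}, {"a", "d"}, {"a", "c", "d"}]
X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items]
              for s in itemsets])
supports = np.array([0.72, 0.45, 0.36, 0.36, 0.29, 0.14])
labels, models = k_regression(X, supports, K=2)
```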

Tree Regression
Use regression to find the optimal parameters for each subset of itemsets. Example: S = {{a},{b},{c},{d},{a,b},{a,c},{b,c},{a,d},{c,d},{a,b,c},{a,b,d},{a,c,d}}

Tree Regression Construction
A decision-tree-style construction algorithm:
–Question 1: How to find the K subsets of itemsets?
–Question 2: How to find the optimal split?
Answer to Q1:
–Maintain a queue of the current leaf nodes, and always pick the leaf node with the maximal average restoration error to split
Answer to Q2:
–Choose the split that maximally reduces the total restoration error, i.e., maximize E(S) − E(S_1) − E(S_2) (equivalently, minimize E(S_1) + E(S_2))
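A sketch of the split-selection step in the answer to Q2. Splitting on an item's presence is our assumed candidate family (the paper may consider different candidates), and the toy data is invented:

```python
import numpy as np

def partition_error(X, supports):
    """Total squared restoration error of one partition after fitting
    a log-space least-squares model to it."""
    if len(supports) == 0:
        return 0.0
    coef, *_ = np.linalg.lstsq(X, np.log(supports), rcond=None)
    return float(np.sum((np.exp(X @ coef) - supports) ** 2))

def best_split(X, supports, items):
    """Try splitting on each item's presence and keep the split with
    the largest error reduction E(S) - E(S_1) - E(S_2)."""
    base = partition_error(X, supports)
    best_item, best_gain = None, 0.0
    for j, item in enumerate(items):
        mask = X[:, j + 1] == 1.0  # column 0 is the intercept
        if mask.all() or not mask.any():
            continue  # degenerate split, skip
        gain = base - (partition_error(X[mask], supports[mask])
                       + partition_error(X[~mask], supports[~mask]))
        if gain > best_gain:
            best_item, best_gain = item, gain
    return best_item, best_gain

# Hypothetical toy data (intercept column plus one indicator per item).
items = ["a", "c", "d"]
itemsets = [{"a"}, {"c"}, {"d"}, {"a", "c"}, {"a", "d"}, {"a", "c", "d"}]
X = np.array([[1.0] + [1.0 if it in s else 0.0 for it in items]
              for s in itemsets])
supports = np.array([0.72, 0.45, 0.36, 0.36, 0.29, 0.14])
split_item, split_gain = best_split(X, supports, items)
```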

An Interesting Connection
Jerome H. Friedman's 1977 paper, "A Tree-Structured Approach to Nonparametric Multiple Regression". Unfortunately, this work never seems to have received much attention. However, it appears to have been part of the inspiration for CART (classification and regression trees) and MARS (multivariate adaptive regression splines).

Experimental Results

Chess Restoration Error

BMS-POS Restoration Error

BMS-POS Running Time

Conclusion
Using linear regression to identify the optimal parameters of the probabilistic restoration function (based on the independence assumption) for a set of itemsets
Two heuristic algorithms to partition the set of itemsets into K parts:
–K-regression
–Tree regression

Thanks!!