Investigation of sub-patterns discovery and its applications

Presentation transcript:

Investigation of sub-patterns discovery and its applications Presenter: Xun Lu Supervisor: Jiuyong Li

Content Overview: Brief Introduction; Basic Definitions; STUCCO Algorithm; MORE Algorithm.

DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE) Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

Overview of this research study: To examine and differentiate various kinds of contrast patterns. The scope of this research is to understand the principles and algorithms involved in sub-pattern (i.e. contrast set) discovery. Ultimately, this thesis tries to adopt the techniques applied in STUCCO to improve the efficiency of the MORE algorithm.

What is contrast data mining? Contrast: “To compare or appraise in respect to differences” (Merriam-Webster Dictionary). Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions.

Why Contrast data mining? “Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more” Darby Conley, Get Fuzzy, 2001

Some definitions for STUCCO
Contrast set (cset): a conjunction of attribute-value pairs defined on groups, with no attribute occurring more than once.
Support of a cset: the ratio of the number of records containing the cset to the number of all records in the data set. supp(cset) ≈ prob(cset)
Group: csets with the same prefix are placed in one group.
Upper bound: the support of an itemset consisting of the head of the group and one item.
Lower bound: the support of an itemset consisting of all the items in the group.
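The support definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: records are modelled as dictionaries of attribute-value pairs, "groups" here means the data groups being contrasted (not the prefix groups used during search), and the function names, attribute names and toy data are all hypothetical.

```python
from typing import Dict, List

Record = Dict[str, str]  # one record = attribute -> value (assumed layout)


def support(cset: Record, records: List[Record]) -> float:
    """Fraction of records containing every attribute-value pair of cset."""
    if not records:
        return 0.0
    hits = sum(all(r.get(k) == v for k, v in cset.items()) for r in records)
    return hits / len(records)


def max_support_difference(cset: Record, groups: Dict[str, List[Record]]) -> float:
    """Largest gap between the supports of cset in any two groups."""
    supports = [support(cset, recs) for recs in groups.values()]
    return max(supports) - min(supports)


# Made-up toy data: two groups of records.
groups = {
    "phd": [
        {"sex": "male", "job": "research"},
        {"sex": "female", "job": "research"},
    ],
    "bachelor": [
        {"sex": "male", "job": "sales"},
        {"sex": "male", "job": "research"},
        {"sex": "female", "job": "sales"},
        {"sex": "female", "job": "sales"},
    ],
}
cset = {"job": "research"}

for name, recs in groups.items():
    print(name, support(cset, recs))
print("max difference:", max_support_difference(cset, groups))
```

A cset whose support differs strongly between groups is the kind of pattern contrast set mining looks for.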

Some definitions for MORE: Contingency table; Relative Risk. Present and Absent can be treated as class labels (head/prefix), whereas Smoking and Non-Smoking can be seen as the remaining elements of a contrast set (here we only have two attributes).

Risk          Disease Status
              Present   Absent
Smoking       a         b
Non-Smoking   c         d
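The slide names relative risk without giving its formula. A minimal sketch, assuming the standard epidemiological definition RR = (a/(a+b)) / (c/(c+d)) for the table above; the function name and the counts are made up for illustration.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Relative risk from a 2x2 table.

    a, b: exposed (Smoking) records with disease Present / Absent
    c, d: unexposed (Non-Smoking) records with disease Present / Absent
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("relative risk is undefined for this table")
    exposed_risk = a / (a + b)      # P(Present | Smoking)
    unexposed_risk = c / (c + d)    # P(Present | Non-Smoking)
    return exposed_risk / unexposed_risk


# Hypothetical counts, not taken from the slides.
rr = relative_risk(a=40, b=60, c=20, d=80)
print(rr)
```

With these hypothetical counts the exposed group has twice the risk of the unexposed group.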

STUCCO: Search and Testing for Understandable Consistent Contrasts. Developed by Bay and Pazzani. It aims to efficiently mine all contrast sets that are both significant and large, without a predefined support threshold. It defines the support by finding the maximum difference between the upper bound and the lower bound within a group.

STUCCO pruning strategies
Effect size pruning: prunes any cset whose upper bound is below the threshold δ.
Statistical significance pruning: chi-square test. Alternative techniques: leverage, lift, relative risk, odds ratio.
Interest-based pruning: contrast sets are not interesting when they represent no new information, e.g. marital_status=husband ∧ sex=male.
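As an illustration of the statistical significance step, the sketch below computes the Pearson chi-square statistic for a 2x2 table by hand and compares it with 3.841, the critical value for one degree of freedom at α = 0.05. The table layout (rows: cset present/absent, columns: two groups), the function name and the counts are assumptions of this example, not details from the slides.

```python
from typing import Sequence


def chi_square_statistic(table: Sequence[Sequence[float]]) -> float:
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


CRITICAL_DF1_ALPHA_05 = 3.841  # chi-square critical value, df = 1, alpha = 0.05

# Hypothetical counts: rows = cset present / absent, columns = group 1 / group 2.
observed = [[40, 20],
            [60, 80]]
stat = chi_square_statistic(observed)
print(round(stat, 4), stat > CRITICAL_DF1_ALPHA_05)
```

A cset whose statistic stays below the critical value would be dropped as not significantly different across the groups.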

Interest-based pruning cont’ STUCCO prunes the csets that do not satisfy either one of the following conditions: (1) (2) is normally set to a very small number, say δ/2. If A and B are itemsets where A⊂B, we also prune the following: if A is infrequent, prune A and B. E.g. A={1,4,6}, B={1,3,4,6}: supp(B) cannot exceed supp(A).
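The subset rule above rests on the fact that a superset can never be more frequent than its subset. A small sketch with made-up transactions; the `min_support` threshold and the function names are hypothetical and serve only to show why B can be discarded once A is infrequent.

```python
from typing import FrozenSet, List, Set


def supp(itemset: Set[int], transactions: List[FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def prune_with_supersets(a: Set[int], transactions: List[FrozenSet[int]],
                         min_support: float) -> bool:
    """True if a is infrequent, in which case every superset of a is too."""
    return supp(a, transactions) < min_support


# Made-up transactions for illustration.
transactions = [
    frozenset({1, 3, 4, 6}),
    frozenset({1, 4, 6}),
    frozenset({1, 4}),
    frozenset({2, 3}),
]
A = {1, 4, 6}
B = {1, 3, 4, 6}  # A is a proper subset of B

print(supp(A, transactions), supp(B, transactions))
print(prune_with_supersets(A, transactions, min_support=0.6))
```

Every transaction containing B also contains A, so checking A first saves the cost of counting B.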

Filtering Algorithm

MORE Algorithm: Mining Optimal Risk pattErn sets
Input: data set, minimum support and minimum relative risk threshold.
Output: optimal risk pattern set.

MORE cont’
Advantage: it makes use of the anti-monotone property to efficiently prune the search space. Anti-monotone: if supp(Px|¬a) = supp(P|¬a), then pattern Px and all its super patterns do not occur in the optimal risk pattern set.
Deficiencies: MORE requires a predefined minimum support; the relative risk results fail to show statistical error and residuals (details on the next slide); it needs to apply more techniques from STUCCO to determine superfluous patterns.
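The pruning condition above can be checked directly on labelled records. This is a sketch of the condition only, not of the full MORE algorithm; the record layout (item set plus class label), the function names and the data are assumptions made for the example. Counts are compared instead of ratios to avoid floating-point equality tests.

```python
from typing import FrozenSet, List, Tuple

LabelledRecord = Tuple[FrozenSet[str], str]  # (items, class label), assumed layout


def count_outside_target(pattern: FrozenSet[str], records: List[LabelledRecord],
                         target: str) -> int:
    """Number of records containing pattern whose class is not the target (¬a)."""
    return sum(1 for items, label in records
               if label != target and pattern <= items)


def can_prune_extension(p: FrozenSet[str], x: str,
                        records: List[LabelledRecord], target: str) -> bool:
    """True when supp(Px|¬a) equals supp(P|¬a), the condition stated above."""
    px = p | {x}
    return (count_outside_target(px, records, target)
            == count_outside_target(p, records, target))


# Made-up records for illustration.
records = [
    (frozenset({"smoking", "male"}), "disease"),
    (frozenset({"smoking", "male"}), "healthy"),
    (frozenset({"smoking"}), "disease"),
    (frozenset({"male"}), "healthy"),
]

print(can_prune_extension(frozenset({"smoking"}), "male", records, "disease"))
print(can_prune_extension(frozenset({"male"}), "smoking", records, "disease"))
```

When the check succeeds, the extended pattern and all of its super patterns can be skipped without counting them further.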

Statistical error and residuals: Given this risk pattern result generated by MORE: RR=2.00. This value is calculated from a sample, so it may not represent the true value for the unobservable population. Hence, we need an acceptable range, say [1.84, 2.47], instead of a single value for RR.
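One common way to obtain such a range is the log-based (Katz) confidence interval for relative risk. The slides do not say which method they have in mind, so this choice is an assumption; the counts below are made up and do not reproduce the interval quoted above.

```python
import math
from typing import Tuple


def relative_risk_ci(a: int, b: int, c: int, d: int,
                     z: float = 1.96) -> Tuple[float, float, float]:
    """Relative risk with a log-based confidence interval (z = 1.96 for ~95%).

    a, b: exposed records with outcome present / absent
    c, d: unexposed records with outcome present / absent
    Requires a > 0 and c > 0 so that the logarithm is defined.
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(RR).
    se = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se)
    upper = math.exp(math.log(rr) + z * se)
    return rr, lower, upper


# Hypothetical counts, chosen only so that RR comes out at 2.
rr, lower, upper = relative_risk_ci(a=40, b=60, c=20, d=80)
print(f"RR = {rr:.2f}, interval = [{lower:.2f}, {upper:.2f}]")
```

An interval whose lower end stays above 1 suggests the elevated risk is unlikely to be a sampling artefact; larger samples give narrower intervals.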

Questions?