Investigation of sub-patterns discovery and its applications


1 Investigation of sub-patterns discovery and its applications
Presenter: Xun Lu Supervisor: Jiuyong Li

2 Content Overview
Brief Introduction, Basic Definitions, STUCCO Algorithm, MORE Algorithm
DO NOT REMOVE THIS NOTICE. Reproduced and communicated on behalf of the University of South Australia pursuant to Part VB of the Copyright Act 1968 (the Act) or with permission of the copyright owner on (DATE). Any further reproduction or communication of this material by you may be the subject of copyright protection under the Act. DO NOT REMOVE THIS NOTICE.

3 Overview of this research study
To examine and differentiate various kinds of contrast patterns.
The scope of this research is to understand the principles and algorithms involved in sub-pattern discovery, i.e. contrast set discovery.
Ultimately, this thesis tries to adopt the techniques applied in STUCCO to improve the efficiency of the MORE algorithm.

4 Content Overview
Brief Introduction, Basic Definitions, STUCCO Algorithm, MORE Algorithm

5 What is Contrast data mining?
Contrast: “To compare or appraise in respect to differences” (Merriam-Webster Dictionary).
Contrast data mining: the mining of patterns and models contrasting two or more classes/conditions.

6 Why Contrast data mining?
“Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more” Darby Conley, Get Fuzzy, 2001

7 Content Overview
Brief Introduction, Basic Definitions, STUCCO Algorithm, MORE Algorithm

8 Some definitions for STUCCO
Contrast set (cset): a conjunction of attribute-value pairs defined on groups, with no attribute occurring more than once.
Support of a cset: the ratio of the number of records containing the cset to the number of all records in the data set; supp(cset) ≈ prob(cset).
Group: csets with the same prefix are placed in one group.
Upper bound: the support of an itemset consisting of the head of the group and one item.
Lower bound: the support of an itemset consisting of all the items in the group.
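The support definition above can be sketched in a few lines of Python. This is only an illustration: the record layout (one dict per record), the function name `support`, and the toy data are assumptions, not part of the original slides. Representing a cset as a dict conveniently enforces the rule that no attribute occurs more than once.

```python
from typing import Dict, List

Record = Dict[str, str]  # hypothetical layout: attribute name -> value


def support(cset: Dict[str, str], records: List[Record]) -> float:
    """Fraction of records that contain every attribute-value pair in cset."""
    if not records:
        return 0.0
    matches = sum(
        all(rec.get(attr) == val for attr, val in cset.items())
        for rec in records
    )
    return matches / len(records)


# Hypothetical toy data set, for illustration only.
records = [
    {"sex": "male", "degree": "PhD"},
    {"sex": "male", "degree": "BSc"},
    {"sex": "female", "degree": "PhD"},
    {"sex": "female", "degree": "BSc"},
]

cset = {"sex": "male", "degree": "PhD"}
print(support(cset, records))  # 0.25: one of the four records matches
```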

9 Some definitions for MORE
Contingency table and relative risk.
Present and Absent can be treated as class labels (head/prefix), whereas Smoking and Non-Smoking can be seen as the remaining elements of a contrast set (here we have only two attributes).
Risk         | Disease Status
             | Present | Absent
Smoking      | a       | b
Non-Smoking  | c       | d
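As a worked example for the table above, the usual epidemiological definition of relative risk is RR = [a / (a + b)] / [c / (c + d)], i.e. the disease rate among smokers divided by the rate among non-smokers. The slide's own formula is not preserved in the transcript, so this standard form is an assumption, and the counts below are made up for illustration.

```python
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """Relative risk from a 2x2 contingency table.

    a: exposed with disease      b: exposed without disease
    c: unexposed with disease    d: unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed


# Hypothetical counts: 30 of 100 smokers and 15 of 100 non-smokers have the disease.
rr = relative_risk(a=30, b=70, c=15, d=85)
print(f"RR = {rr:.2f}")  # RR = 2.00
```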

10 Content Overview
Brief Introduction, Basic Definitions, STUCCO Algorithm, MORE Algorithm

11 STUCCO: Search and Testing for Understandable Consistent Contrasts
Developed by Bay and Pazzani.
It aims to efficiently mine all contrast sets that are significant and large, without a predefined support threshold.
It defines the support by finding the maximum difference between the upper bound and the lower bound within a group.

12 STUCCO pruning strategies
Effect size pruning: a cset is pruned when its upper bound falls below the threshold given by the slide's equation (not preserved in this transcript).
Statistical significance pruning: chi-square test. Alternative techniques: leverage, lift, relative risk, odds ratio.
Interest-based pruning: contrast sets are not interesting when they represent no new information, e.g. marital_status=husband ∧ sex=male.
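The statistical significance step can be illustrated with a plain Pearson chi-square test on a 2x2 table (cset present/absent by group). This is a simplified sketch, not STUCCO's actual procedure: it uses a fixed α = 0.05 with one degree of freedom (critical value 3.841), ignores any correction for multiple tests, assumes no row or column total is zero, and the counts are hypothetical.

```python
from typing import List

# Critical value of the chi-square distribution, 1 degree of freedom, alpha = 0.05.
CHI2_CRITICAL_DF1_ALPHA_005 = 3.841


def chi_square(table: List[List[float]]) -> float:
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of row and column.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Rows: cset present / cset absent. Columns: group 1 / group 2 (hypothetical counts).
table = [[30, 15], [70, 85]]
stat = chi_square(table)
is_significant = stat > CHI2_CRITICAL_DF1_ALPHA_005
print(f"chi2 = {stat:.3f}, significant = {is_significant}")
```

A cset that fails this test would be dropped as not significant; one that passes is kept for the remaining pruning checks.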

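13 Interest-based pruning (cont'd)
STUCCO prunes csets that do not satisfy either of the following conditions: (1), (2) (the conditions are not preserved in this transcript).
The parameter in these conditions (symbol not preserved) is normally set to a very small number, say δ/2.
If A and B are itemsets where A ⊂ B, we also prune as follows: if A is infrequent, prune both A and B. For example, with A = {1,4,6} and B = {1,3,4,6}, supp(B) can be at most supp(A), so B is infrequent whenever A is.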
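The subset argument can be checked on a small example. The sketch below reuses A = {1,4,6} and B = {1,3,4,6} from the slide; the transactions, the `min_support` value, and the function names are hypothetical and only serve to show that every transaction containing B also contains A.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[int], transactions: List[FrozenSet[int]]) -> float:
    """Fraction of transactions that contain every item of itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


# Hypothetical transactions, for illustration only.
transactions = [
    frozenset({1, 3, 4, 6}),
    frozenset({1, 4, 6}),
    frozenset({1, 4, 6, 7}),
    frozenset({2, 3}),
    frozenset({1, 3, 4, 6, 8}),
]

A = frozenset({1, 4, 6})
B = frozenset({1, 3, 4, 6})  # A is a proper subset of B

supp_a = support(A, transactions)
supp_b = support(B, transactions)
print(supp_a, supp_b)  # 0.8 0.4

# Any transaction containing B also contains A, so supp(B) <= supp(A).
assert supp_b <= supp_a

min_support = 0.9  # hypothetical threshold
if supp_a < min_support:
    # A is infrequent, so B cannot be frequent either: prune both.
    pruned = [A, B]
else:
    pruned = []
print(len(pruned))  # 2
```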

14 Filtering Algorithm

15 Content Overview
Brief Introduction, Basic Definitions, STUCCO Algorithm, MORE Algorithm

16 MORE Algorithm: Mining Optimal Risk pattErn sets
Input: data set, minimum support, and minimum relative risk threshold.
Output: optimal risk pattern set.

17 MORE (cont'd)
Advantage: it makes use of the anti-monotone property to efficiently prune the search space.
Anti-monotone property: if supp(Px|¬a) = supp(P|¬a), then pattern Px and all its super-patterns do not occur in the optimal risk pattern set.
Deficiencies: MORE requires a predefined minimum support; the relative risk (RR) results do not show statistical error and residuals (details on the next slide); more techniques from STUCCO need to be applied to identify superfluous patterns.
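The pruning condition can be sketched as a simple count comparison. This is one reading of the slide's notation, not MORE's actual implementation: supp(P|¬a) is taken here to mean the number of records without outcome a that contain pattern P, and the record layout, function names, and data are all hypothetical.

```python
from typing import FrozenSet, List, Tuple

# Hypothetical record layout: (set of items, whether the record has outcome a).
Record = Tuple[FrozenSet[str], bool]


def supp_not_a(pattern: FrozenSet[str], records: List[Record]) -> int:
    """Count records WITHOUT outcome a that contain every item of pattern."""
    return sum(1 for items, has_a in records if not has_a and pattern <= items)


def can_prune_extension(p: FrozenSet[str], x: str, records: List[Record]) -> bool:
    """True if adding item x to p leaves the non-a count unchanged,
    i.e. the condition under which Px and its super-patterns are skipped."""
    return supp_not_a(p | {x}, records) == supp_not_a(p, records)


records = [
    (frozenset({"smoking", "male", "adult"}), True),
    (frozenset({"smoking", "male", "adult"}), False),
    (frozenset({"smoking", "male", "obese", "adult"}), True),
    (frozenset({"smoking", "female", "adult"}), False),
    (frozenset({"nonsmoking", "male", "adult"}), False),
]

p = frozenset({"smoking"})
print(can_prune_extension(p, "adult", records))  # True: "adult" excludes no non-a record
print(can_prune_extension(p, "male", records))   # False: "male" drops one non-a record
```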

18 Statistical error and residuals
Given this risk pattern result generated by MORE: RR = 2.00.
This value is calculated from the sample, which may not represent the true value in the unobservable population.
Hence, we need an acceptable range, say [1.84, 2.47], instead of a single value for RR.
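One common way to obtain such a range is a confidence interval on the log scale: SE(ln RR) = sqrt(1/a − 1/(a+b) + 1/c − 1/(c+d)), and the interval is exp(ln RR ± z · SE). The sketch below applies that standard formula with z = 1.96 (an approximate 95% interval). It is an assumption that this is the kind of range the slide has in mind; the counts are hypothetical and do not reproduce the slide's [1.84, 2.47].

```python
import math
from typing import Tuple


def relative_risk_ci(a: int, b: int, c: int, d: int, z: float = 1.96) -> Tuple[float, float, float]:
    """Relative risk with an approximate confidence interval (log method).

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    z:    normal quantile, 1.96 for roughly 95% coverage
    """
    rr = (a / (a + b)) / (c / (c + d))
    # Standard error of ln(RR).
    se_log_rr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    lower = math.exp(math.log(rr) - z * se_log_rr)
    upper = math.exp(math.log(rr) + z * se_log_rr)
    return rr, lower, upper


# Hypothetical counts giving RR = 2.00.
rr, lower, upper = relative_risk_ci(a=30, b=70, c=15, d=85)
print(f"RR = {rr:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With larger samples the interval narrows around the same point estimate, which is exactly the information a single RR value hides.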

19 Questions?

