Efficient Data Mining for Path Traversal Patterns CS401 Paper Presentation Chaoqiang chen Guang Xu.

Slides:



Advertisements
Similar presentations
The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa.
Advertisements

Frequent Itemset Mining Methods. The Apriori algorithm Finding frequent itemsets using candidate generation Seminal algorithm proposed by R. Agrawal and.
CSE 634 Data Mining Techniques
Frequent Closed Pattern Search By Row and Feature Enumeration
Chapter 4: Trees Part II - AVL Tree
10 -1 Lecture 10 Association Rules Mining Topics –Basics –Mining Frequent Patterns –Mining Frequent Sequential Patterns –Applications.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
FP-Growth algorithm Vasiljevic Vladica,
FP (FREQUENT PATTERN)-GROWTH ALGORITHM ERTAN LJAJIĆ, 3392/2013 Elektrotehnički fakultet Univerziteta u Beogradu.
Sampling Large Databases for Association Rules ( Toivenon’s Approach, 1996) Farzaneh Mirzazadeh Fall 2007.
1 of 25 1 of 45 Association Rule Mining CIT366: Data Mining & Data Warehousing Instructor: Bajuna Salehe The Institute of Finance Management: Computing.
Data Mining Association Analysis: Basic Concepts and Algorithms
Association Analysis. Association Rule Mining: Definition Given a set of records each of which contain some number of items from a given collection; –Produce.
Data Mining Techniques So Far: Cluster analysis K-means Classification Decision Trees J48 (C4.5) Rule-based classification JRIP (RIPPER) Logistic Regression.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Aki Hecht Seminar in Databases (236826) January 2009
Indexes. Primary Indexes Dense Indexes Pointer to every record of a sequential file, (ordered by search key). Can make sense because records may be much.
1 Mining Frequent Patterns Without Candidate Generation Apriori-like algorithm suffers from long patterns or quite low minimum support thresholds. Two.
4/3/01CS632 - Data Mining1 Data Mining Presented By: Kevin Seng.
Pattern Lattice Traversal by Selective Jumps Osmar R. Zaïane and Mohammad El-Hajj Department of Computing Science, University of Alberta Edmonton, AB,
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
B + -Trees (Part 1) Lecture 20 COMP171 Fall 2006.
B + -Trees (Part 1). Motivation AVL tree with N nodes is an excellent data structure for searching, indexing, etc. –The Big-Oh analysis shows most operations.
B + -Trees (Part 1) COMP171. Slide 2 Main and secondary memories  Secondary storage device is much, much slower than the main RAM  Pages and blocks.
Fast Algorithms for Association Rule Mining
B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.
Mining Association Rules in Large Databases. What Is Association Rule Mining?  Association rule mining: Finding frequent patterns, associations, correlations,
1 Efficiently Mining Frequent Trees in a Forest Mohammed J. Zaki.
Introduction to Directed Data Mining: Decision Trees
USpan: An Efficient Algorithm for Mining High Utility Sequential Patterns Authors: Junfu Yin, Zhigang Zheng, Longbing Cao In: Proceedings of the 18th ACM.
B-Tree. B-Trees a specialized multi-way tree designed especially for use on disk In a B-tree each node may contain a large number of keys. The number.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
Ch5 Mining Frequent Patterns, Associations, and Correlations
Sequential PAttern Mining using A Bitmap Representation
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining By Tan, Steinbach, Kumar Lecture.
Modul 7: Association Analysis. 2 Association Rule Mining  Given a set of transactions, find rules that will predict the occurrence of an item based on.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
Efficient Data Mining for Calling Path Patterns in GSM Networks Information Systems, accepted 5 December 2002 SPEAKER: YAO-TE WANG ( 王耀德 )
Mining Sequential Patterns Rakesh Agrawal Ramakrishnan Srikant Proc. of the Int ’ l Conference on Data Engineering (ICDE) March 1995 Presenter: Sam Brown.
Expert Systems with Applications 34 (2008) 459–468 Multi-level fuzzy mining with multiple minimum supports Yeong-Chyi Lee, Tzung-Pei Hong, Tien-Chin Wang.
B + -Trees. Motivation An AVL tree with N nodes is an excellent data structure for searching, indexing, etc. The Big-Oh analysis shows that most operations.
CS 8751 ML & KDDSupport Vector Machines1 Mining Association Rules KDD from a DBMS point of view –The importance of efficiency Market basket analysis Association.
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
1 Efficient Algorithms for Incremental Update of Frequent Sequences Minghua ZHANG Dec. 7, 2001.
Data Mining Find information from data data ? information.
Association Rule Mining
Marwan Al-Namari Hassan Al-Mathami. Indexing What is Indexing? Indexing is a mechanisms. Why we need to use Indexing? We used indexing to speed up access.
B-TREE. Motivation for B-Trees So far we have assumed that we can store an entire data structure in main memory What if we have so much data that it won’t.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 B+-Tree Index Chapter 10 Modified by Donghui Zhang Nov 9, 2005.
Association Analysis (3)
Advanced Topics in Data Mining: Web Mining. Web Mining.
BINARY TREES Objectives Define trees as data structures Define the terms associated with trees Discuss tree traversal algorithms Discuss a binary.
Data Mining Association Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 6 Introduction to Data Mining by Tan, Steinbach, Kumar © Tan,Steinbach,
A Scalable Association Rules Mining Algorithm Based on Sorting, Indexing and Trimming Chuang-Kai Chiou, Judy C. R Tseng Proceedings of the Sixth International.
Adversarial Search 2 (Game Playing)
Indexing and Mining Free Trees Yun Chi, Yirong Yang, Richard R. Muntz Department of Computer Science University of California, Los Angeles, CA {
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Mining Complex Data COMP Seminar Spring 2011.
1 Query Processing Part 3: B+Trees. 2 Dense and Sparse Indexes Advantage: - Simple - Index is sequential file good for scans Disadvantage: - Insertions.
Chapter 3 Data Mining: Classification & Association Chapter 4 in the text box Section: 4.3 (4.3.1),
DATA MINING: ASSOCIATION ANALYSIS (2) Instructor: Dr. Chun Yu School of Statistics Jiangxi University of Finance and Economics Fall 2015.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
Data Mining Association Analysis: Basic Concepts and Algorithms
TT-Join: Efficient Set Containment Join
Data Mining Association Analysis: Basic Concepts and Algorithms
Farzaneh Mirzazadeh Fall 2007
COMP5331 FP-Tree Prepared by Raymond Wong Presented by Raymond Wong
Mining Sequential Patterns
Mining Path Traversal Patterns with User Interaction for Query Recommendation 龚赛赛
CS 261 – Data Structures Trees.
Presentation transcript:

Efficient Data Mining for Path Traversal Patterns CS401 Paper Presentation Chaoqiang chen Guang Xu

Overview Introduction Problem Formulation Algorithm for Traversal Pattern 1. Identifying Maximal Forward References 2. Determining Large Reference Sequences Performance Results 1. Generation of Synthetic Traversal Paths 2. Performance Comparison Between FS and SS 3. Sensitivity Analysis Advantages and Disadvantages Conclusions

Introduction Analysis of past transaction data can provide very valuable information on customer buying behavior, and thus improve the quality of business decisions. It is essential to collect a sufficient amount of sales data before any meaningful conclusion can be drawn. Efficient algorithms are needed to conduct mining on these huge data. One of the most important data mining problems is mining association rules. The presence of some items in a transaction will imply the presence of other items in the same transaction.

Introduction Mining access patterns in a distributed information providing environment where objects are linked together to facilitate interactive access. 1. Improve the system design: provide efficient access between highly correlated objects etc. 2. Better marketing decisions: putting advertisements in proper places etc. Algorithms for mining traversal patterns: 1. Algorithm MF (standing for maximal forward references) to convert the original sequence of log data into a set of maximal forward references. 2. Algorithms FS (full-scan) and SS (selective-scan) for determining large reference sequences.

Problem Formulation Traversal path for a user: {A,B,C,D,C,B,E,G,H,G, W,A,O,U,O,V} Maximal forward references: {ABCD, ABEGH, ABEGW, AOU, AOV}

Problem Formulation Some nodes might be revisited because of its location, rather than its content. Assume that a backward reference is mainly for ease of traveling but not for browsing When a backward references occur, a forward reference path terminates, this resulting forward reference path is termed as maximal forward reference. A large reference sequence is a reference sequence that appeared in a sufficient number of times in a set of maximal forward references The number of times a reference sequence has to appear in order to be qualified as a large reference sequence is called the minimal support.

Problem Formulation A large k-reference is a large reference sequence with k elements. Set of large k-references is denoted as L K, its candidate set as C K. A maximal reference sequence corresponds to a “hot” access pattern in an information providing service A maximal reference sequences is a large reference sequence that is not contained in any other maximal reference sequence. Suppose that {AB, BE, AD, CG, GH, BG} is the set of large 2-references (i.e. L 2 ) and {ABE, CGH} is the set of large 3-references (i.e. L 3 ), then, the resulting maximal reference sequences are {AD, BG, ABE, CGH}

Problem Formulation Procedure for mining traversal patterns: Step 1: Determine maximal forward references from the original log data. Step 2: Determine large reference sequences (i.e., L K, K >= 1) from the set of maximal forward references. Step 3: Determine maximal reference sequences from large reference sequences. Extraction of maximal reference sequences from large reference sequences (i.e. Step 3) is straightforward Focus on Steps 1 and 2

Problem Formulation L i : set of large reference sequences Support = 2 D: set of maximal forward references C i : candidate set of large reference sequences

Algorithm for Traversal Pattern --- Identifying Maximal Forward References A traversal log database contains, for each link traversed, a pair of (source, destination) The traversal log database is sorted by user id’s, resulting in a traversal path, {(s 1,d 1 ),(s 2,d 2 ), …,(s n,d n )}, for each user, where pairs of (s i,d i ) are ordered by time. Then algorithm MF is applied to determine the maximal forward references.

Algorithm for Traversal Pattern --- Identifying Maximal Forward References

Algorithm for Traversal Pattern --- Determining large Reference Sequences (FS) Algorithm FS utilizes key ideas of the DHP (i.e., hashing and pruning) technique DHP: efficient generation for large itemsets, effective reduction on transaction database size after each scan. C k can be generated from joining L k-1 with itself Different from DHP, in FS, for any two distinct reference sequences in L k-1, say r 1,…,r k-1 and s 1,…,s k-1, we join them to form a k-reference sequence only if either r 1,…,r k- 1 contains s 1,…,s k-2 or s 1,…,s k-1 contains r 1,…,r k-2

Algorithm for Traversal Pattern --- Determining large Reference Sequences (FS) D: set of maximal forward references C i : candidate set of large reference sequences L i : set of large reference sequences Support = 2

Algorithm for Traversal Pattern --- Determining large Reference Sequences (SS) Utilizing the information in candidate reference in prior passes to avoid database scans in some passes, thus further reducing the disk I/O cost. Generate a C 3 ’ from C 2 * C 2, instead of from L 2 * L 2, and both C 2 and C 3 ’ can stored in the main memory, we can find L 2 and L 3 together when the next scan of the database is performed. If | C k+1 ’|>| C k ’ | for some k >= 2, it is usually beneficial to have a database scan to obtain L k+1 before the set of candidate references becomes too big.

Performance Results --- Generation of Synthetic Traversal Paths

The browsing scenario in a World Wide Web (WWW) environment is simulated. A traversal tree is constructed to mimic WWW structure whose starting position is a root node of the tree. The traversal tree consists of internal nodes and leaf nodes The number of child nodes at each internal node, referred to as fanout, is determined from a uniform distribution with a given range The height of a subtree whose subroot is a child node of the root node is determined from a Poisson distribution A traversal path consists of nodes accessed by a user. The size of each traversal path is picked from a Possion distribution.

Performance Results --- Generation of Synthetic Traversal Paths With the first node being the root node, a traversal path is generated probabilistically within the traversal tree. 1. For internal nodes: p 0 : probability go back to parent node p 1, p 2, p 3, p 4 …: probability go to the child nodes p j : probability jump to another internal node 2. For leaf node: 25% to parent, 75% jump to internal node The number of internal nodes with internal jumps is denoted by N J, which is set to 3% of all the internal nodes in general cases. Sensitivity of varying N J will be analyzed.

Performance Results --- Generation of Synthetic Traversal Paths

Performance Results --- Performance Comparison between FS and SS |D| = 200,000, N J = 3%, and p j = 0.1 The fanout at each internal node is between 4 and 7. The root node consists of 7 child nodes The number of internal nodes is 16,200, the number of leaf nodes is 73,006 Algorithm SS in general outperforms FS, and their performance difference becomes prominent when the I/O cost is taken into account

Performance Results --- Performance Comparison between FS and SS

Algorithm SS consistently outperforms FS as the database size increases

Performance Results --- Sensitivity Analysis |D| = 200,000, |P| = 10, the minimum support is 0.75% As the probability to backward at an internal node, p 0, increases, the number of large reference sequences decreases because the possibility of having forward traveling becomes smaller.

Performance Results --- Sensitivity Analysis The number of large reference sequences decreases as the number of child nodes of internal nodes (fanout) increases Because with a larger fanout the traversal paths are more likely to be dispersed to several branches, thus resulting in fewer large reference sequences.

Performance Results --- Sensitivity Analysis The probability of traveling to each child node from an internal node is determined from a Zipf-like distribution The number of large reference sequences increases when the corresponding probabilities are more skewed..

Performance Results --- Sensitivity Analysis

Advantages and Disadvantages Advantages 1. Make use of the concept of DHP (i.e., hashing and pruning) technique, thus the proposed algorithms are very efficient in generating set of large reference sequences L k and in reducing size of the database of maximal forward references after each scan. By this way, both CPU and I/O costs are reduced. 2. By utilizing the information in candidate references in prior passes, algorithm SS avoids database scans in some passes, thus further reducing the disk I/O cost.

Advantages and Disadvantages Disadvantages 1. Since user’s backward browsing is treated as for ease of traveling, there exist a possibility that some information about the user’s browsing behavior get lost. It is better that the backward traversal pattern can also be mined to reflect the user’s real behavior. 2. When traversal log record only contains destination references instead of a pair of references. The MF algorithm can not identify the breakpoint where the user picks a new URL to begin a new traversal path. This could increase the computational complexity because the paths considered become longer, especially in an environment where users jump from sites to sites frequently. Thus some complement method should be employed to deal with this problem.

Conclusions A new data mining capability is explored. It involves mining traversal patterns in an information providing environment where documents or objects are linked together to facilitate interactive access. Algorithm MF is employed to convert the original sequence of log data into a set of maximal forward references. Algorithms FS and SS are developed to determine large reference sequences from the maximal forward references obtained. FS is based on some hashing and pruning techniques, and SS is a further improvement of FS. Performance of FS and SS has been comparatively analyzed. Algorithm SS in general outperforms algorithm FS. Sensitivity analysis on various parameters was also conducted.