Mining Access Patterns Efficiently from Web Logs
Jian Pei, Jiawei Han, Behzad Mortazavi-asl, and Hua Zhu
May 26, 2000, DE Lab. 윤지영

1. Introduction
Web mining
- Web content mining
- Web usage mining (Web log mining)
Useful knowledge from Web logs
- Improving the design of Web sites
- Analyzing system performance
- Understanding user reaction and motivation
- Building adaptive Web sites
2018-11-28 DE Lab. 윤지영

Sequential Pattern Mining
A Web access pattern is a sequential pattern in a large set of pieces of Web logs, so it can be found with sequential pattern mining.
Sequential pattern mining
- Given a sequence database, each sequence is a list of transactions ordered by transaction time, and each transaction is a set of items
- Find all sequential patterns with a user-specified minimum support
- The support of a pattern is the number of data sequences that contain it

Contents
- Problem definition
- WAP-tree structure (WAP: Web Access Pattern)
- WAP-mine algorithm
- Performance study: WAP-mine vs. Apriori-based mining (GSP)

2. Problem Statement
Web log: a sequence of pairs of user id and event
Goal: find the sequential patterns whose support reaches a support threshold
Let E be a set of events
Web access sequence S = e1e2…en (ei ∈ E for 1 ≤ i ≤ n)
- n: length of the access sequence (S is called an n-sequence)
- ei ≠ ej is not required for i ≠ j, so repetition is allowed ('aab' and 'ab' are two different access sequences)

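Subsequence & super-sequence
S = e1e2…en, S' = e'1e'2…e'l
- S' is a subsequence of S, and S is a super-sequence of S' (S' ⊆ S), if there exist indices 1 ≤ i1 < i2 < … < il ≤ n such that e'j = e_ij (the ij-th event of S) for 1 ≤ j ≤ l
- S' is a proper subsequence of S (S' ⊂ S) if S' ⊆ S and S' ≠ S
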
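The subsequence test can be sketched in a few lines of Python. This is only an illustration, assuming each event is a single character so that an access sequence can be written as a string; the function name is_subsequence is hypothetical and not part of the original slides.

```python
def is_subsequence(sub, seq):
    """Return True if `sub` can be obtained from `seq` by deleting events."""
    it = iter(seq)
    # `event in it` advances the iterator until it finds `event`,
    # so the matched positions are forced to be strictly increasing.
    return all(event in it for event in sub)


print(is_subsequence("ab", "aab"))   # 'ab' is contained in 'aab'
print(is_subsequence("ba", "aab"))   # order matters, so 'ba' is not
```

Because the iterator is shared across all events of sub, a later event can only match after the position where the earlier one matched, which is the i1 < i2 < … < il condition.
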
Suffix & prefix
S = e1e2…ekek+1…en, pattern P = e'1e'2…e'l
If ek+1…en is a super-sequence of P and ek+1 = e'1, then
- Ssuffix = ek+1…en is the suffix of S with respect to pattern P
- Sprefix = e1e2…ek is the prefix of S with respect to pattern P

ξ-pattern
Web access sequence database (WAS)
- WAS = {S1, S2, …, Sm}, where each Si (1 ≤ i ≤ m) is an access sequence
Support of an access sequence S in WAS
- supWAS(S) = sup(S) = |{Si | S ⊆ Si}| / m
S is a ξ-pattern if supWAS(S) ≥ ξ

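The support definition translates directly into code. The sketch below is an illustration under stated assumptions: events are single characters, and the four sequences in WAS are a hypothetical stand-in for Table 1, which is not reproduced in these slides. They were chosen to agree with the statements made elsewhere in the deck (fc is a 50%-pattern, f is not a 75%-pattern, and {a, b, c} are the frequent events at 75%).

```python
def is_subsequence(sub, seq):
    """True if `sub` is contained in `seq` with order preserved."""
    it = iter(seq)
    return all(event in it for event in sub)


def support(pattern, was):
    """Fraction of access sequences in `was` that contain `pattern`."""
    return sum(is_subsequence(pattern, s) for s in was) / len(was)


def is_xi_pattern(pattern, was, xi):
    """A pattern is a xi-pattern when its support reaches the threshold xi."""
    return support(pattern, was) >= xi


# Assumed example data, NOT taken from the slides (Table 1 is missing there).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]

print(support("fc", WAS))
print(is_xi_pattern("f", WAS, 0.75))
```

With this assumed data, fc occurs in two of the four sequences, so its support is 0.5, and f fails the 75% threshold.
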
2. Problem Statement (continued)
Given a Web access sequence database WAS and a support threshold ξ, mine the complete set of ξ-patterns of WAS
<Example 1> Table 1
(ex) fc is a 50%-pattern; an access sequence of length 5 is a 5-sequence

3. WAP-mine: Property 1
<Property 1> (Sequential pattern Apriori)
Let SEQ be a sequence database. If a sequence G is not a ξ-pattern of SEQ, no super-sequence of G can be a ξ-pattern of SEQ.
<Example>
"f" is not a 75%-pattern of WAS in Example 1, so no access sequence containing "f" can be a 75%-pattern.

3. WAP-mine: Property 2
<Property 2> (Suffix heuristic)
If e is a frequent event in the set of prefixes of the sequences in WAS with respect to pattern P, then the sequence eP is an access pattern of WAS.
<Example>
- b is a frequent event within the set of prefixes with respect to the frequent sequence 'ac' in Example 1, so we can claim that bac is an access pattern.

WAP-tree
- The WAP-tree is a data structure that stores access sequences and their corresponding counts
- Tedious support counting can be avoided: no large candidate sets are generated
- It keeps only what is needed to find the patterns with enough support, and is much smaller than the original database
- The WAP-tree is built by simply scanning the database twice
- The mining philosophy is conditional search instead of level-wise search

Conditional Search
- Look for patterns with the same suffix
- Count the frequent events in the set of prefixes that share that suffix condition
- A partition-based divide-and-conquer method instead of bottom-up generation of combinations
- Avoids generating large candidate sets

Algorithm 1: WAP-mine (mining access patterns in a Web access sequence database)
Input: access sequence database WAS and support threshold ξ (0 ≤ ξ ≤ 1)
Output: the complete set of ξ-patterns in WAS
Method:
1. Scan WAS once and find all frequent events
2. Scan WAS a second time and construct a WAP-tree over the frequent events
3. Recursively mine the WAP-tree using conditional search

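Step 1 and the filtering that prepares step 2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: events are assumed to be single characters, WAS is a hypothetical stand-in for Table 1 (which is missing from the slides), and the function names are invented for the example.

```python
from collections import Counter


def frequent_events(was, xi):
    """First scan: events contained in at least a fraction xi of the sequences."""
    counts = Counter()
    for seq in was:
        counts.update(set(seq))  # count each event at most once per sequence
    return {e for e, c in counts.items() if c / len(was) >= xi}


def frequent_subsequence(seq, fe):
    """Drop nonfrequent events, keeping the remaining events in order."""
    return "".join(e for e in seq if e in fe)


# Assumed example data, NOT taken from the slides.
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]

FE = frequent_events(WAS, 0.75)
FREQUENT = [frequent_subsequence(s, FE) for s in WAS]
print(sorted(FE))
print(FREQUENT)
```

Counting each event once per sequence matters: support is defined over sequences, so a page visited three times in one session still contributes only one to its count.
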
4. Construction of the WAP-tree
<Observations on the WAP-tree>
<1> If an event e is not in the set of frequent 1-sequences, there is no need to include e in the construction of a Web access pattern tree
<2> If two access sequences share a common prefix P, the prefix P can be shared in the WAP-tree

WAP-tree Structure
Label and count in each node
- Each node is labeled by an event and holds a count: the number of occurrences of the corresponding prefix (the path from the root that ends at that node)
Construction of the WAP-tree
- Filter out any nonfrequent events from each access sequence
- Insert the resulting frequent subsequence into the WAP-tree
Auxiliary node-link structures are constructed to assist node traversal
- All nodes with the same label ei are linked into an event-node queue (the ei-queue)
- The header table H holds the head of each event-node queue

Example 2 (see Table 1)
- The support threshold is set to 75%
- The set of frequent 1-events: {a, b, c}
<fig 1> The WAP-tree and conditional WAP-tree for the frequent subsequences in Table 1

Algorithm 2 (WAP-tree construction)
Input: database WAS and the set of frequent events FE
Output: a WAP-tree T
Method:
1. Create a root node for T
2. For each access sequence S in WAS do
- Extract the frequent subsequence S' from S by removing all events that are not in FE. Let S' = s1s2…sn, and let current_node point to the root of T
- For i = 1 to n:
  - if current_node has a child labeled si, increase the count of that child by 1 and make it the current_node
  - else create a new child node (si:1), make it the current_node, and insert it into the si-queue
3. Return(T)

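A compact Python sketch of the construction may help. It is an illustration under assumptions, not the original implementation: events are single characters, the header table is modelled as a dictionary of Python lists instead of linked queues, and the Node class, build_wap_tree, and the sequences in WAS are hypothetical.

```python
from collections import defaultdict


class Node:
    def __init__(self, label, parent=None):
        self.label = label      # event, or None for the root
        self.count = 0          # occurrences of the prefix ending at this node
        self.parent = parent
        self.children = {}      # event -> child Node


def build_wap_tree(was, fe):
    """Insert the frequent subsequence of every access sequence into a tree."""
    root = Node(None)
    header = defaultdict(list)  # header table H: event -> its node queue
    for seq in was:
        node = root
        for event in (e for e in seq if e in fe):  # frequent subsequence S'
            child = node.children.get(event)
            if child is None:
                child = Node(event, node)
                node.children[event] = child
                header[event].append(child)  # link the new node into its queue
            child.count += 1
            node = child
    return root, header


# Assumed example data, NOT taken from the slides (Table 1 is missing there).
WAS = ["abdac", "eaebcac", "babfaec", "afbacfc"]
root, header = build_wap_tree(WAS, {"a", "b", "c"})
print({event: len(queue) for event, queue in header.items()})
```

Filtering and insertion happen in the same pass here, so the second database scan of Algorithm 1 is the only scan this function needs. Sequences that share a prefix reuse the same nodes and only increase the counts along that path.
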
Analysis of the WAP-tree
Height of the WAP-tree
- 1 + the maximum length of the frequent subsequences
Width of the WAP-tree
- the number of leaves of the tree, which is at most the number of access sequences
=> the WAP-tree is usually much smaller than WAS

Lemma 1
For any access sequence in WAS, there exists a unique path in the WAP-tree, starting from the root, whose node labels in order are exactly the frequent events of that access sequence
The number of distinct leaf nodes, and of paths, in a WAP-tree
- cannot exceed the number of distinct frequent subsequences, and the height is short
=> the WAP-tree can be implemented with a B+-tree, or even in pure SQL

5. Mining Web Access Patterns from the WAP-tree
<Property 3> (Node-link property)
All the frequent subsequences that contain ei can be visited by following the ei-queue
=> the nodes labeled ei, which act as the suffix, can be looked up directly
- The nodes on the path from the root down to a node labeled ei (that node excluded) form a prefix sequence of ei
- A prefix sequence of ei may contain another prefix sequence of ei
<Example>
- In abab, the prefix sequences of b are aba and a

Concept of Unsubsumed Count
- Let G and H be two prefix sequences of ei. If G is formed by a sub-path, starting at the root, of the path that forms H, then G is a sub-prefix sequence of H and H is a super-prefix sequence of G
- For a prefix sequence of ei without any super-prefix sequence, the unsubsumed count is the count of that sequence
- For a prefix sequence of ei with some super-prefix sequences, the unsubsumed count is the count of that sequence minus the unsubsumed counts of all its super-prefix sequences

What Is Conditional Search
<Property 4> (Prefix sequence unsubsumed count property)
The count of a sequence G ending with ei is the sum of the unsubsumed counts of all prefix sequences of ei that are super-sequences of G
- This makes it possible to count all frequent events in the set of sequences that share the same suffix
- Instead of searching for all Web access patterns at once, the search is restricted to patterns with the same suffix as the condition

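The unsubsumed count and Property 4 can be checked on a small tree. The sketch below rests on several assumptions that are not stated in the slides: events are single characters, H counts as a super-prefix sequence of G exactly when the ei node of H lies below the ei node of G in the tree, G is compared with each prefix sequence without its final ei, and the four frequent subsequences are hypothetical example data. All names are invented for the illustration.

```python
class Node:
    def __init__(self, label):
        self.label, self.count, self.children = label, 0, {}


def build_tree(sequences):
    """Prefix tree with counts, built from already filtered sequences."""
    root = Node(None)
    for seq in sequences:
        node = root
        for event in seq:
            node = node.children.setdefault(event, Node(event))
            node.count += 1
    return root


def unsubsumed_counts(root, event):
    """List (prefix sequence of `event`, unsubsumed count) pairs."""
    result = []

    def visit(node, prefix):
        # `prefix` holds the labels from the root down to, but excluding, `node`.
        path = prefix if node.label is None else prefix + node.label
        below = sum(visit(child, path) for child in node.children.values())
        if node.label != event:
            return below
        # `below` is the total unsubsumed count of all super-prefix sequences.
        result.append((prefix, node.count - below))
        return node.count

    visit(root, "")
    return result


def is_subsequence(sub, seq):
    it = iter(seq)
    return all(e in it for e in sub)


def count_ending_with(g, event, root):
    """Property 4: count of the sequence g + event, from unsubsumed counts."""
    return sum(c for prefix, c in unsubsumed_counts(root, event)
               if is_subsequence(g, prefix))


# Assumed frequent subsequences, NOT taken from the slides.
FREQUENT = ["abac", "abcac", "babac", "abacc"]
root = build_tree(FREQUENT)
base_c = dict(unsubsumed_counts(root, "c"))
print(base_c)
print(count_ending_with("c", "c", root))  # number of sequences containing cc
```

In this data the prefix aba of c has count 2, but one of those occurrences continues to a second c (prefix abac), so its unsubsumed count is 1. Summing unsubsumed counts therefore counts each access sequence once, even when ei occurs in it several times.
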
Conditional Search Paradigm
1. The conditional sequence base PS|ei contains all prefix sequences of ei, each with its count
2. When a prefix sequence of ei with count c is inserted into PS|ei, all of its sub-prefix sequences of ei are also inserted into PS|ei with count -c
3. As a result, each prefix sequence in PS|ei holds its unsubsumed count
4. The patterns with suffix ei are mined by concatenating ei to the patterns found in PS|ei

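The effect of the negative entries can be seen by counting events over a conditional sequence base. The sketch below is illustrative only: events are assumed to be single characters, and the entries of PS_C were worked out by hand for the event c from four assumed frequent subsequences (abac, abcac, babac, abacc), not copied from the slides. The function name and the absolute threshold of 3 (75% of four sequences) are likewise assumptions.

```python
from collections import Counter


def conditional_frequent_events(base, min_count):
    """Count each event once per entry of a conditional sequence base."""
    counts = Counter()
    for prefix, count in base:
        for event in set(prefix):
            counts[event] += count  # negative entries cancel subsumed counts
    return {e: c for e, c in counts.items() if c >= min_count}


# Hypothetical PS|c: each prefix sequence of c with its count, followed by
# its sub-prefix sequence with the negated count.
PS_C = [("aba", 2), ("abac", 1), ("aba", -1),
        ("ab", 1), ("abca", 1), ("ab", -1),
        ("baba", 1)]

result = conditional_frequent_events(PS_C, 3)
print(result)
```

Here a and b each reach a count of 4, while c reaches only 2 and is dropped, so under these assumptions the conditional WAP-tree for c would be built over a and b only.
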
Algorithm 3 (Mining all Web access patterns in a WAP-tree)
Input: a WAP-tree T and support threshold ξ
Output: the complete set of ξ-patterns
Method:
1. If the WAP-tree T has only one branch, return all the unique combinations of nodes in that branch
2. Initialize the Web access pattern set WAP = ∅. Every event in T is itself a Web access pattern; insert these patterns into WAP

Algorithm 3 (Mining all Web access patterns in a WAP-tree), continued
3. For each event ei in the WAP-tree T:
- Construct the conditional sequence base PS|ei by following the ei-queue, counting the conditional frequent events at the same time
- If the set of conditional frequent events is not empty, build a conditional WAP-tree for ei over PS|ei and mine it recursively
- For each Web access pattern returned from the conditional WAP-tree, concatenate ei to it and insert the result into WAP
4. Return WAP

6. Performance Evaluation and Conclusions
<Theorem 1> WAP-mine returns the complete set of access patterns without redundancy
<fig 2> GSP vs. WAP-mine