Discovery of Significant Usage Patterns from Clusters of Clickstream Data
Lin Lu, Margaret Dunham, and Yu Meng
Department of Computer Science and Engineering
Southern Methodist University
Dallas, Texas 75275-0122
llu(mhd,ymeng)@engr.smu.edu
WebKDD’05
Introduction
Significant Usage Patterns (SUP)
- SUP is extracted from clusters of abstracted user sessions
- Use a unique two-phase abstraction technique
- With desired beginning and/or ending Web pages
- With normalized probability
Comparison table: Sequential Pattern, Maximal Frequent Sequence, Maximal Frequent Forward Sequence, User Preferred Navigational Trail [1,2], and Significant Usage Pattern, compared on Clustering, Abstraction, Beginning/ending Web page(s), and Normalized (entries: N, Y*, -, Y).
Model
- Sessionized Web Log + Abstraction Hierarchy -> Sub-abstract URLs -> Sub-Abstracted Sessions
- Sub-Abstracted Sessions -> Apply Needleman-Wunsch global alignment algorithm -> Similarity Matrix
- Similarity Matrix -> Apply Nearest neighbor clustering algorithm -> Clusters of User Sessions
- Clusters of User Sessions + Abstraction Hierarchy -> Concept-based Abstracted URLs -> Concept-based Abstracted Sessions per Cluster
- Build Markov model for each cluster -> Transition Matrix per Cluster
- Pattern Discovery -> SUPs per Cluster
Alignment of Web sessions
Create sub-abstracted Web sessions
URL -> {<Concept hierarchy keyword> <Unique ID> <|>}
Fig 1. Hierarchy of J.C. Penney Web site: JCPenney Homepage -> Department level (D1, D2, … Dn) -> Category level (C1 … Cn) -> Item level (I1 … In)
Example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015
Alignment of Web sessions
Computing the similarity between any two Web pages
The higher a level sits in the hierarchy, the more important it is in determining the similarity of two Web pages, so it is given more weight.
Scoring scheme
- step 1: determine the longer of the two Web page representation strings.
- step 2: assign a weight to each level in the hierarchy: the lowest level in the longer page representation string is given weight 2, the second-lowest level is given weight 4, and so on. The ID attached to a level is always given weight 1.
Alignment of Web sessions
Computing the similarity between any two Web pages
- step 1: compare the two Web page representation strings from left to right and stop at the first pair of components that differ.
- step 2: compute the ratio of the sum of the weights of the matching parts to the total weight of the longer page representation string.
Example:
Page 1: D0|C875|I, weight = 6+1+4+1+2 = 14
Page 2: D0|C875, weight = 6+1+4+1 = 12
Similarity = 12/14 = 0.857
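The two-step scoring scheme can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, each `|`-separated part is assumed to be a one-letter level keyword followed by an optional ID, and "longer" is taken to mean the representation with more levels.

```python
def parse_page(page):
    """Split 'D0|C875|I' into [('D', '0'), ('C', '875'), ('I', '')]."""
    return [(part[0], part[1:]) for part in page.split("|")]


def weighted_components(levels, depth):
    """Flatten levels into (component, weight) pairs.

    Assumption: with `depth` levels in the longer page, the lowest level
    weighs 2, the next one up 4, and so on; every non-empty ID weighs 1.
    """
    out = []
    for i, (letter, ident) in enumerate(levels):
        out.append((("level", letter), 2 * (depth - i)))
        if ident:
            out.append((("id", ident), 1))
    return out


def page_similarity(page_a, page_b):
    """Weight of the matching left-hand prefix over the weight of the longer page."""
    a, b = parse_page(page_a), parse_page(page_b)
    depth = max(len(a), len(b))
    comps_a = weighted_components(a, depth)
    comps_b = weighted_components(b, depth)
    total = max(sum(w for _, w in comps_a), sum(w for _, w in comps_b))
    matched = 0
    for (tok_a, weight), (tok_b, _) in zip(comps_a, comps_b):
        if tok_a != tok_b:
            break  # stop at the first pair that differs
        matched += weight
    return matched / total


print(round(page_similarity("D0|C875|I", "D0|C875"), 3))
```

For the slide's example pair, the matching prefix D, 0, C, 875 carries weight 6+1+4+1 = 12 against a total of 14 for the longer page, which is the 12/14 ratio shown above.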
Alignment of Web sessions
Computing optimal alignment of two sequences using Needleman-Wunsch algorithm
Alignment matrix A for sequences X = X1 … Xm (rows) and Y = Y1 … Yn (columns), with first column $A(i, 0) = -id$, first row $A(0, j) = -jd$, and the optimal alignment score in $A(m, n)$.
$$A(i, j) = \max\left[A(i-1, j-1) + s(X_i, Y_j);\; A(i-1, j) - d;\; A(i, j-1) - d\right]$$
where $s(X_i, Y_j)$ is the similarity between $X_i$ and $Y_j$, and $d$ is the penalty for aligning $X_i$ (or $Y_j$) with a gap.
Alignment of Web sessions
Apply Needleman-Wunsch global alignment algorithm
Scoring scheme [3]
if (matching) score = 20; // a pair of Web pages with similarity 1
else if (mis-matching) score = -10; // a pair of Web pages with similarity 0
else if (gap) score = -10; // a Web page aligns with a gap
else score = -10 ~ 20; // a pair of Web pages with similarity between 0 and 1
Example:
Session 1: P47104 D0|C0|I D469|C469 D2652|C2652
Session 2: D469|C16758|I D0|C0|I D469|C469
Alignment matrix (columns: session 1, rows: session 2; only the cells shown on the slide):
                 P47104   D0|C0|I   D469|C469   D2652|C2652
                 -10      -20       -30         -40
D469|C16758|I             5.7       -4.3        -14.3
D0|C0|I                   10        17.1        7.1
D469|C469                           30          32.1
Thus, session similarity = 32.1/4 = 8.025
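The alignment step can be sketched in Python as below. The function names are hypothetical, and two points are assumptions rather than statements from the slides: page similarities strictly between 0 and 1 are mapped linearly onto the -10 to 20 range, and the final score is divided by the length of the longer session (4 in the example above). The page-level similarity is passed in as a function, so any measure returning a value in [0, 1] can be plugged in.

```python
def pair_score(similarity, match=20.0, mismatch=-10.0):
    """Map a page similarity in [0, 1] onto [mismatch, match] (linear map assumed)."""
    return mismatch + (match - mismatch) * similarity


def align_score(xs, ys, similarity, gap=10.0):
    """Needleman-Wunsch global alignment score of two sessions."""
    m, n = len(xs), len(ys)
    A = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = -i * gap          # X1..Xi aligned against gaps only
    for j in range(1, n + 1):
        A[0][j] = -j * gap          # Y1..Yj aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = pair_score(similarity(xs[i - 1], ys[j - 1]))
            A[i][j] = max(A[i - 1][j - 1] + s,   # align Xi with Yj
                          A[i - 1][j] - gap,     # align Xi with a gap
                          A[i][j - 1] - gap)     # align Yj with a gap
    return A[m][n]


def session_similarity(xs, ys, similarity):
    """Alignment score divided by the longer session's length (assumed normalizer).

    Both sessions are assumed to be non-empty.
    """
    return align_score(xs, ys, similarity) / max(len(xs), len(ys))


def exact(a, b):
    """Toy page similarity for illustration: 1 for identical pages, else 0."""
    return 1.0 if a == b else 0.0


print(session_similarity(["a", "c"], ["a", "b", "c"], exact))
```

In the toy call, the best alignment pairs a with a and c with c and leaves b against a gap, so the raw score is 20 - 10 + 20 = 30 before normalization.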
Create Concept-based Abstracted Sessions
Represent the abstracted page accesses in a session as a sequence like: P1 D1 C1 I1 P2 D2 C2 I2 …
Within a session, the same Pi, Di, Ci, or Ii (i = 1, 2, …) always represents the same page. Across different sessions, however, the same page may be represented by different elements.
Example:
Original session: D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711 -505884861
Abstracted session: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7 -505884861
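The abstraction in this example can be sketched in Python as follows. The function name is hypothetical, and the labelling rule is inferred from the example rather than stated on the slide: each page is assumed to take the keyword letter of its deepest level, numbered in order of first appearance within the session. The trailing session identifier is left out of the input.

```python
def abstract_session(pages):
    """Replace each page by its concept letter plus a per-session index."""
    labels = {}    # page -> label already assigned in this session
    counters = {}  # concept letter -> number of distinct pages seen so far
    abstracted = []
    for page in pages:
        if page not in labels:
            concept = page.split("|")[-1][0]  # letter of the deepest level
            counters[concept] = counters.get(concept, 0) + 1
            labels[page] = f"{concept}{counters[concept]}"
        abstracted.append(labels[page])
    return abstracted


original = ("D7107|C7121 D7107|C7126|I076bdf3 D7107|C7131|I084fc96 "
            "D7107|C7131 P55730 P96 P27 P14 P27592 P28 P33711").split()
print(" ".join(abstract_session(original)))
# prints: C1 I1 I2 C2 P1 P2 P3 P4 P5 P6 P7
```

Because the label table is rebuilt for every session, a page revisited within a session keeps its label, while the same page may receive a different label in another session.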
Generating Significant Usage Patterns
Use Markov model to represent sessions in each cluster
Example sessions:
(1) 1, 2, 3, 5, 4
(2) 2, 4, 3, 5
(3) 3, 2, 4, 5
(4) 1, 3, 4, 3
(5) 4, 2, 3, 4, 5
Markov model diagram (state labels S, 2, 5, 3, 4, E; transition probabilities 0.4, 0.17, 0.2, 0.5, 0.33, 0.25, 0.75, 1)
The probability of a path is normalized, where Pti is the transition probability between two adjacent states.
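Estimating the transition probabilities for a cluster can be sketched in Python as below, using the five example sessions. The function names are hypothetical, and probabilities are assumed to be plain relative frequencies of observed transitions, with an artificial start state S and end state E added to every session. The sketch computes only the raw product of transition probabilities along a path; the normalization mentioned above is not applied, since its formula is not given in the text.

```python
from collections import Counter, defaultdict


def transition_probabilities(sessions, start="S", end="E"):
    """Relative-frequency estimate of the transition probabilities."""
    counts = defaultdict(Counter)
    for session in sessions:
        path = [start, *session, end]
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    return {
        src: {dst: c / sum(row.values()) for dst, c in row.items()}
        for src, row in counts.items()
    }


def path_probability(probs, path):
    """Unnormalized probability of a path: product of its transition probabilities."""
    p = 1.0
    for src, dst in zip(path, path[1:]):
        p *= probs.get(src, {}).get(dst, 0.0)
    return p


sessions = [
    [1, 2, 3, 5, 4],
    [2, 4, 3, 5],
    [3, 2, 4, 5],
    [1, 3, 4, 3],
    [4, 2, 3, 4, 5],
]
probs = transition_probabilities(sessions)
print(probs["S"][1], probs[5]["E"], round(probs[3][5], 2))
```

With these sessions, two of the five start at state 1 and three of the four visits to state 5 end the session, which gives the 0.4 and 0.75 that appear among the diagram's edge labels.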
Generating Significant Usage Patterns
Example:
- Left columns: threshold > 0.4, end state is 4
- Right columns: threshold > 0.4, beginning state is 1, end state is 4
SUP       Probability   SUP      Probability
S1234     0.45          1234     0.46
S12354    0.53          12354    0.56
S124                    124      0.5
S134      0.43          134
S1354                   1354     0.58
S2354
S354
Experimental Result
Chart: average session length of purchase sessions and sessions without purchase
On average, purchase sessions are longer than sessions without purchase:
- review the information, compare the price, the quality, etc.
- fill out the billing and shipping information to commit the purchase
Experimental Result
SUPs in non-purchase cluster
Cluster 1: 1746 sessions, threshold 0.3, average session length 9.6, 98 states
Interested in gathering information of products in different categories.
1. S-C1-C1-C2-C3-C4-C5-C6-C7-E
2. S-C1-C1-C2-C3-C4-C5-E
3. S-C1-C1-C2-C3-E
4. S-C1-C2-C3-C3-C4-C5-C6-C7-E
5. S-C1-C2-C3-C4-C4-C5-C6-C7-E
…
S-C1-C1-C2-C3-C4-C5-C5-I1-E
S-C1-C1-I1-C1-C2-C3-C4-C5-E
S-I1-C1-C2-C3-C4-C5-C6-C7-E
Cluster 2: 241 sessions, threshold 0.37, average session length 6.6, 38 states
Interested in reviewing general pages (to gather general information).
1. S-P1-P2-P3-P3-E
2. S-P1-P2-P3-P4-P4-P5-E
3. S-P1-P2-P3-P4-P4-E
4. S-P1-P2-P3-P4-P5-P4-E
5. S-P1-P2-P3-P4-P5-P5-E
Cluster 3: 13 sessions, average session length 3.0, 6 states
Not serious visitors (the average session length is 3).
1. S-C1-P1-P2-E
2. S-C1-P1-E
3. S-I1-P1-P1-P2-E
4. S-I1-P1-P1-E
5. S-I1-P1-E
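Experimental Result
SUPs in BNF notation, by cluster, threshold, and beginning Web page
Non-purchase clusters
- Cluster 1: 1746 sessions, average length 9.6, 98 states
  - threshold 0.3, beginning Web page S: S-{C}-E
  - threshold 0.25, beginning Web page P86806: P86806-{C}-E
- Cluster 2: 241 sessions, average length 6.6, 38 states
  - threshold 0.37: S-{P}-[C]-E
  - threshold 0.34: P86806-[I]-{P}-E
- Cluster 3: 13 sessions, average length 3.0, 6 states
  - S-<C | I>-{P}-E
  - threshold 0.2: P86806-[{P}-[P86806]]-E
Purchase clusters
- 1858 sessions, average length 14.9, 55 states
  - threshold 0.47: S-[C]-[I]-{P}-E
  - threshold 0.51
- 132 sessions, average length 39.1, 100 states
  - threshold 0.457: S-[{{C}|{I}}]-{P}-E
  - threshold 0.434: P86806-[{C}]-{P}-E
- 10 sessions, average length 31.6, 47 states
  - threshold 0.52: S-{P}-[{I}]-[{P}]-{C}-E
  - threshold 0.43: P86806-[I]-[{P}]-{C}-E
Observations
- Purchase sessions: review the information, compare among products, and fill out the payment and shipping information
- The average length of SUPs is longer in the purchase cluster than in the non-purchase cluster
- SUPs in the purchase cluster have higher probability than those in the non-purchase cluster: users have a purchase in mind vs. random browsing behavior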
Conclusion and Future Work
Summary
- Clustering abstracted user sessions makes it more likely to find groups of users with similar motivations for visiting a specific website.
- Letting users specify the beginning and/or ending Web page(s) gives them more control over generating the patterns that interest them.
Future
- Scalability
- Clustering to identify different user groups
- Online assignment of a user to a predefined cluster
References
[1] J. Borges and M. Levene, “Data Mining of User Navigation Patterns”, in Proc. of the Workshop on Web Usage Analysis and User Profiling (WEBKDD'99), pp. 31-36, San Diego, August 15, 1999.
[2] J. Borges and M. Levene, “An average linear time algorithm for web data mining”, International Journal of Information Technology and Decision Making, 3 (2004), pp. 307-320.
[3] W. Wang and O. R. Zaïane, “Clustering Web Sessions by Sequence Alignment”, Third International Workshop on Management of Information on the Web, in conjunction with the 13th International Conference on Database and Expert Systems Applications (DEXA'2002), pp. 394-398, Aix en Provence, France, September 2-6, 2002.
Thank you
Questions?