Discovery of Significant Usage Patterns from Clickstream Data
Margaret H. Dunham, Lin Lu
CSE Department, Southern Methodist University, Dallas, Texas 75275
mhd@engr.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0208741.
05/04/05, Travelocity

OUTLINE
- Web Usage Mining Overview
- Our Work: Significant Usage Patterns
- Ongoing/Future Research

Web Usage Mining Applications
- Personalization
- Improve structure of a site’s Web pages
- Aid in caching and prediction of future page references
- Improve design of individual pages
- Improve effectiveness of e-commerce (sales and advertising)

Web Usage Mining Activities
- Preprocessing Web log
  - Cleanse: remove extraneous information
  - Sessionize (session: the sequence of pages referenced by one user at a sitting)
- Pattern Discovery
  - Count patterns that occur in sessions
  - A pattern is a sequence of pages referenced in a session
- Pattern Analysis

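The sessionizing step could be sketched as follows. The slide defines a session only as the pages one user references at a sitting, so the 30-minute inactivity cutoff, the `(user_id, timestamp, page)` record layout, and the function name `sessionize` are assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict


def sessionize(records, timeout_seconds=1800):
    """Group (user_id, timestamp_seconds, page) records into sessions.

    A new session starts whenever a user is idle for longer than
    timeout_seconds (a hypothetical cutoff; the slides do not give one).
    Returns a list of (user_id, [page, ...]) tuples.
    """
    hits_by_user = defaultdict(list)
    for user_id, timestamp, page in records:
        hits_by_user[user_id].append((timestamp, page))

    sessions = []
    for user_id in sorted(hits_by_user):
        # Order each user's hits by time before splitting them.
        hits = sorted(hits_by_user[user_id], key=lambda hit: hit[0])
        current = [hits[0][1]]
        last_timestamp = hits[0][0]
        for timestamp, page in hits[1:]:
            if timestamp - last_timestamp > timeout_seconds:
                sessions.append((user_id, current))
                current = []
            current.append(page)
            last_timestamp = timestamp
        sessions.append((user_id, current))
    return sessions


# Made-up records: user u1 pauses for more than 30 minutes before P3.
example_records = [
    ("u1", 0, "P1"),
    ("u1", 60, "P2"),
    ("u1", 4000, "P3"),
    ("u2", 10, "C1"),
]
example_sessions = sessionize(example_records)
```

With these made-up records, user `u1` ends up with two sessions because of the long pause before `P3`.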
Pattern Types
- Association Rules: none of the properties hold
- Episodes: only ordering holds
- Sequential Patterns: ordered and maximal
- Forward Sequences: ordered, consecutive, and maximal
- Maximal Frequent Sequences: all properties hold
- User Preferred Navigation Trail: not a true pattern, but representative of many

Web Usage Mining Issues
- Identification of the exact user is not possible.
- The exact sequence of pages referenced by a user cannot be recovered because of caching.
- A session is not well defined.
- Security, privacy, and legal issues

CAN’T SEE THE FOREST FOR THE TREES
Web log records:
2003-10-0515:49:20050721435700000026210000000000 0265202652 000000000
2003-10-0516:40:49050832595900000872710001142380 0710707107 000000000
2003-10-0504:55:10050767799900000191300000670518 0000000000 000000000
2003-10-0509:43:10050781766100000603030000000000 0365700469 000000000
2003-10-0514:49:360508182420000007066200000000000811a39 0914207107 000000000
2003-10-0521:23:57050759031600000465050002794335 1199207107 000000000
2003-10-0511:30:16050730512600000465050000195747 1684600597corduroy+coats
The BIG PICTURE:
S-P1-P2-P3-P4-P5-P6-C1-C2-E
S-P1-P2-P3-P4-P5-C4-I6-I7-I8-E

SIGNIFICANT USAGE PATTERNS
Solution:
- Clustering
- Abstraction
- User Preferred Navigation Trails

[Figure: overview of the approach. Labels: Interests…, Motivations…; Web Server; Web Log; Preprocess Web Data (Cleanse, Sessionize, …); URL Abstraction; Cluster Web Sessions; Markov Model per Cluster; User defined beginning/ending Web pages; Normalized Probability; Significant Usage Pattern; Markov Model; User Preferred Navigation Trail]

Significant Usage Pattern (SUP)
A SUP is a path extracted from a Markov model, with user-defined starting and ending states, whose normalized product of probabilities along the path satisfies a given threshold.
Differences from previous research:
- SUPs are extracted from clusters of user sessions
- User sessions are abstracted sessions
- Patterns start and end with specific Web pages of interest to the user
A SUP need not be an exact pattern found in any session; rather, it is representative of the patterns found.

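The SUP definition can be illustrated with a small path search over a Markov model. This is a sketch under stated assumptions, not the authors' implementation: the slides do not say how the product of probabilities is normalized, so the geometric mean (the product raised to one over the number of transitions) is assumed here, paths are restricted to visit each state at most once, and the model, the state names, and the function name `find_sups` are made up for illustration.

```python
def find_sups(transitions, start, end, threshold):
    """Return (path, score) pairs for paths from start to end.

    transitions maps state -> {next_state: probability}.
    The score is the geometric mean of the transition probabilities
    along the path (an assumed normalization); a path is kept when
    its score exceeds the threshold.
    """
    results = []

    def walk(state, path, product):
        if state == end and len(path) > 1:
            steps = len(path) - 1
            score = product ** (1.0 / steps)
            if score > threshold:
                results.append(("-".join(path), score))
            return
        for next_state, probability in transitions.get(state, {}).items():
            # Visit each state at most once so the search terminates.
            if next_state not in path and probability > 0:
                walk(next_state, path + [next_state], product * probability)

    walk(start, [start], 1.0)
    return sorted(results, key=lambda item: -item[1])


# Hypothetical three-state model with start state S and end state E.
toy_model = {
    "S": {"A": 1.0},
    "A": {"B": 0.5, "E": 0.5},
    "B": {"E": 1.0},
}
toy_sups = find_sups(toy_model, "S", "E", threshold=0.75)
```

In the toy model the path `S-A-E` scores about 0.71 and `S-A-B-E` about 0.79, so only the latter passes a threshold of 0.75. Under this normalization a longer path is not automatically penalized for having more transitions.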
Model
- Sessionized Web Log + Abstraction Hierarchy -> sub-abstract URLs -> Sub-Abstracted Sessions
- Sub-Abstracted Sessions -> apply Needleman-Wunsch global alignment algorithm -> Similarity Matrix
- Similarity Matrix -> apply nearest neighbor clustering algorithm -> Clusters of User Sessions
- Clusters of User Sessions -> concept-based abstracted URLs -> Concept-based Abstracted Sessions per Cluster
- Concept-based Abstracted Sessions per Cluster -> build Markov model for each cluster -> Transition Matrix per Cluster
- Transition Matrix per Cluster -> pattern discovery -> Patterns per Cluster

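The clustering step named on this slide could look roughly like the following one-pass nearest neighbor procedure over a session similarity matrix. The similarity threshold, the processing order, and the function name `nearest_neighbor_clusters` are assumptions; the slides name the algorithm but give no parameters.

```python
def nearest_neighbor_clusters(similarity, threshold):
    """Assign a cluster label to each session.

    similarity is a square matrix (list of lists) of pairwise session
    similarities. Each session joins the cluster of its most similar
    already-processed session when that similarity reaches the
    threshold (an assumed parameter); otherwise it starts a new cluster.
    """
    count = len(similarity)
    if count == 0:
        return []
    labels = [0]  # the first session seeds cluster 0
    next_label = 1
    for i in range(1, count):
        # Most similar session among those already clustered.
        nearest = max(range(i), key=lambda k: similarity[i][k])
        if similarity[i][nearest] >= threshold:
            labels.append(labels[nearest])
        else:
            labels.append(next_label)
            next_label += 1
    return labels


# Made-up similarity matrix for three sessions.
toy_similarity = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
toy_labels = nearest_neighbor_clusters(toy_similarity, threshold=0.5)
```

Here the first two sessions share a cluster and the third starts its own. The result of a one-pass scheme like this depends on the order in which sessions are processed.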
Abstract Web session data
[Fig 2. Hierarchy of JCPenney Web site: JCPenney Homepage; Department level (D1, D2, …, Dn); Category level (C1, …, Cn); Item level (I1, …, In)]
Web session example: D0|C875|I D0|C875|I P27593 P27592 P28 -507169015

Alignment of Web Sessions
Compute the similarity between any two Web pages. The higher a level is in the hierarchy, the more important it is in determining the similarity of two Web pages, so it is given more weight.
- Step 1: compare the two Web page representation strings from left to right and stop at the first pair of parts that differ.
- Step 2: compute the ratio of the sum of the weights of the matching parts to the sum of the total weights.
Example:
- Page 1: D0|C875|I, weight = 6+1+4+1+2 = 14
- Page 2: D0|C875, weight = 6+1+4+1 = 12
- Similarity = 12/14 = 0.857

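The two steps and the worked example above can be sketched in code, together with a basic Needleman-Wunsch global alignment score that uses the page similarity as its match score. Several details are assumptions: each page is pre-split into `(part, weight)` pairs following one reading of the slide's weights (D=6, department id=1, C=4, category id=1, I=2), the denominator is the larger of the two pages' total weights (which reproduces 12/14), and the gap penalty of -1 and the direct use of similarity as the alignment score are illustrative choices, not values from the talk.

```python
def page_similarity(page_a, page_b):
    """Weighted prefix similarity of two pages given as (part, weight) lists."""
    matched = 0
    for (part_a, weight_a), (part_b, _) in zip(page_a, page_b):
        if part_a != part_b:
            break  # step 1: stop at the first differing pair
        matched += weight_a
    total = max(sum(w for _, w in page_a), sum(w for _, w in page_b))
    # Step 2: matching weight over total weight (assumed: the larger total).
    return matched / total if total else 0.0


def needleman_wunsch(session_a, session_b, score, gap=-1.0):
    """Global alignment score of two sessions (sequences of pages)."""
    rows, cols = len(session_a) + 1, len(session_b) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score(session_a[i - 1], session_b[j - 1]),
                table[i - 1][j] + gap,  # gap in session_b
                table[i][j - 1] + gap,  # gap in session_a
            )
    return table[-1][-1]


# The slide's example pages, split into weighted parts (assumed split).
page1 = [("D", 6), ("0", 1), ("C", 4), ("875", 1), ("I", 2)]  # D0|C875|I
page2 = [("D", 6), ("0", 1), ("C", 4), ("875", 1)]            # D0|C875

print(round(page_similarity(page1, page2), 3))  # 0.857
alignment_score = needleman_wunsch([page1, page2], [page1, page2], page_similarity)
```

Aligning a two-page session with itself gives a score of 2.0 here, since each page matches itself with similarity 1.0. Scores like this, computed for every pair of sessions, would fill the similarity matrix used for clustering.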
Generating Significant Usage Patterns
[Figure: example Markov model with states 1 to 5 and end state E; transition probabilities shown include 0.4, 0.2, 0.5, and 0.6]

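A per-cluster Markov model of the kind used for generating SUPs could be estimated from the sessions in a cluster as sketched below. The start state `S` and end state `E` follow the notation used elsewhere in these slides; estimating each transition probability as a simple relative frequency, and the function name `build_transition_probabilities`, are assumptions for illustration.

```python
from collections import defaultdict


def build_transition_probabilities(sessions, start="S", end="E"):
    """Estimate first-order transition probabilities from sessions.

    Each session is a list of (abstracted) pages. Every session is
    wrapped with a start and an end state, transitions are counted,
    and counts are divided by the total leaving each state.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        path = [start] + list(session) + [end]
        for current, following in zip(path, path[1:]):
            counts[current][following] += 1

    model = {}
    for state, following_counts in counts.items():
        total = sum(following_counts.values())
        model[state] = {
            following: count / total
            for following, count in following_counts.items()
        }
    return model


# Two made-up abstracted sessions from one cluster.
toy_sessions = [["P1", "C1"], ["P1", "P2"]]
toy_markov = build_transition_probabilities(toy_sessions)
```

In this toy cluster both sessions start at `P1`, so `S` leads to `P1` with probability 1.0, and `P1` leads to `C1` or `P2` with probability 0.5 each. The resulting dictionary plays the role of the transition matrix per cluster.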
Examples
Threshold > 0.4, end state is 4 (SUP and value where shown):
- S1234: 0.45
- S12354: 0.53
- S124
- S134: 0.43
- S1354
- S2354
- S354
Threshold > 0.4, beginning state is 1, end state is 4:
- 1234: 0.46
- 12354: 0.56
- 124: 0.5
- 134
- 1354: 0.58

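Experimental Result
For each cluster: number of sessions, average session length, and number of states, followed by the threshold, beginning Web page, and SUPs in BNF notation.
Non-Purchase
- Cluster 1: 1746 sessions, average session length 9.6, 98 states
  - Threshold 0.3, beginning Web page S: S-{C}-E
  - Threshold 0.25, beginning Web page P86806: P86806-{C}-E
- Cluster 2: 241 sessions, average session length 6.6, 38 states
  - Threshold 0.37, beginning Web page S: S-{P}-[C]-E
  - Threshold 0.34, beginning Web page P86806: P86806-[I]-{P}-E
- Cluster 3: 13 sessions, average session length 3.0, 6 states
  - Beginning Web page S: S-<C | I>-{P}-E
  - Threshold 0.2, beginning Web page P86806: P86806-[{P}-[P86806]]-E
Purchase
- 1858 sessions, average session length 14.9, 55 states
  - Threshold 0.47, beginning Web page S: S-[C]-[I]-{P}-E
  - Threshold 0.51
- 132 sessions, average session length 39.1, 100 states
  - Threshold 0.457, beginning Web page S: S-[{{C}|{I}}]-{P}-E
  - Threshold 0.434, beginning Web page P86806: P86806-[{C}]-{P}-E
- 10 sessions, average session length 31.6, 47 states
  - Threshold 0.52, beginning Web page S: S-{P}-[{I}]-[{P}]-{C}-E
  - Threshold 0.43, beginning Web page P86806: P86806-[I]-[{P}]-{C}-E
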
Future/Ongoing Research
- Scalability
  - Fewer patterns
  - Smaller patterns
  - Markov model (MM) takes less space than a table
- Clusters to identify behaviors
  - Business vs. leisure
  - Cloaked crawler
- Online identification of cluster