Nov, 2002, Banerjee and Ghosh
Characterizing Visitors to a Website Across Multiple Sessions
NGDM Workshop, Nov 2002
Arindam Banerjee, Joydeep Ghosh

Motivation
Why characterize or predict web user behavior?
Site-centric view: personalization, sticky websites
User-centric view: personal agents for information acquisition
Universalist approaches: PageRank, web metrics,…

Clustering Users from Web Logs
Web behavior varies widely: segment users based on surfing behavior as a first step to further analysis.
User: set of sessions
Session: sequence of (page ID, time spent on that page) tuples
 – How to cluster sets of sequences?

The Approach
Cluster sessions
 – Session similarity measure
 – Session similarity graph; outlier detection
 – Graph partitioning
Create a cluster space
Cluster users in this space

A Similarity Measure for Sessions
1. Overlap between two sessions is represented by their longest common subsequence (LCS) of pages
2. Session similarity is obtained from the LCS and the time information
   session similarity = (time similarity in LCS) x (importance of LCS)
The similarity component:
 – Average, over the pages in the LCS, of the min/max ratio of the times the two sessions spent on that page
The importance component:
 – Average, over the two sessions, of the fraction of overall session time spent in the LCS

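The measure on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Session`, `lcs_indices` and `session_similarity` are hypothetical, the LCS is found with the textbook dynamic program, and two choices are assumptions of the sketch. When several longest common subsequences exist, the backtracking rule picks one of them. A page on which both sessions spent zero time is treated as perfectly similar.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page ID, time spent on that page)


def lcs_indices(a: Session, b: Session) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) of one longest common subsequence of pages."""
    n, m = len(a), len(b)
    # length[i][j] = LCS length of the page sequences a[:i] and b[:j]
    length = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1][0] == b[j - 1][0]:
                length[i][j] = length[i - 1][j - 1] + 1
            else:
                length[i][j] = max(length[i - 1][j], length[i][j - 1])
    # Backtrack. Ties are broken by stepping back in `a` first (an assumption;
    # the LCS is not unique in general and the slides do not say which is used).
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if a[i - 1][0] == b[j - 1][0]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif length[i - 1][j] >= length[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]


def session_similarity(a: Session, b: Session) -> float:
    """(mean min/max time ratio over LCS pages) x (mean fraction of time in LCS)."""
    pairs = lcs_indices(a, b)
    if not pairs:
        return 0.0
    times_a = [a[i][1] for i, _ in pairs]
    times_b = [b[j][1] for _, j in pairs]
    # Similarity component: min/max ratio per LCS page, averaged.
    ratios = [
        min(s, t) / max(s, t) if max(s, t) > 0 else 1.0  # 0/0 case: assumption
        for s, t in zip(times_a, times_b)
    ]
    similarity = sum(ratios) / len(ratios)
    # Importance component: share of each session's total time spent in the LCS.
    total_a = sum(t for _, t in a)
    total_b = sum(t for _, t in b)
    frac_a = sum(times_a) / total_a if total_a > 0 else 0.0
    frac_b = sum(times_b) / total_b if total_b > 0 else 0.0
    importance = (frac_a + frac_b) / 2
    return similarity * importance
```

On the two sessions of the worked example in the backup slides, this sketch selects the LCS [(b)(d)(c)] and returns a similarity of about 0.43.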
Session Clustering
Find the pairwise similarity values between all pairs of sessions; record only similarities above a threshold
Incrementally construct similarity graph G
 – the vertices are the sessions, the edge weights are the session similarity values
 – no isolated vertices (discard “outliers”)
Balanced graph partitioning
 – we used Metis [Karypis, Kumar]

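The graph construction step could look roughly like the sketch below. It is an assumption-laden illustration: `build_similarity_graph` is a hypothetical helper, it computes all pairs in one batch rather than incrementally, and it stops before partitioning (the slides use Metis for that step, which is not invoked here). Items are referred to by their position in the input list.

```python
from itertools import combinations
from typing import Callable, Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def build_similarity_graph(
    items: Sequence[T],
    similarity: Callable[[T, T], float],
    threshold: float,
) -> Tuple[Dict[Tuple[int, int], float], List[int]]:
    """Return (edges, outliers) for a thresholded similarity graph.

    edges maps an index pair (i, j) with i < j to its similarity weight;
    outliers lists the indices of items left without any edge.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i, j in combinations(range(len(items)), 2):
        weight = similarity(items[i], items[j])
        if weight > threshold:  # record only similarities above the threshold
            edges[(i, j)] = weight
    # Vertices that appear in no edge are isolated and treated as outliers.
    connected = {vertex for edge in edges for vertex in edge}
    outliers = [v for v in range(len(items)) if v not in connected]
    return edges, outliers
```

The all-pairs loop is quadratic in the number of sessions, which is consistent with the remark in the conclusions that forming the similarity graph is slow.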
The Cluster Space
Given: each session assigned to one of k clusters (sets)
Sessions of a user are distributed among the k sets
 – vector u = [u_1 u_2 … u_k]^T, where u_i = number of sessions of the user belonging to cluster i
Stage II: User Clustering
 – find pairwise similarity values using the extended Jaccard measure
 – partition similarity graph
Gives l user clusters and a set of outlier users

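A small sketch of the cluster-space mapping and the Stage II similarity follows. The helper names `user_vectors` and `extended_jaccard` are hypothetical. The slides do not spell out the extended Jaccard formula, so the sketch assumes the usual definition x·y / (|x|² + |y|² − x·y). It also assumes that outlier sessions were removed beforehand, so every session passed in carries a cluster label in 0..k-1.

```python
from typing import Dict, Hashable, List, Sequence


def user_vectors(
    session_user: Sequence[Hashable],
    session_cluster: Sequence[int],
    k: int,
) -> Dict[Hashable, List[int]]:
    """Map each user to u = [u_1 ... u_k], the session counts per cluster.

    session_user[s] is the user of session s and session_cluster[s] is the
    cluster (0..k-1) that session s was assigned to.
    """
    vectors: Dict[Hashable, List[int]] = {}
    for user, cluster in zip(session_user, session_cluster):
        vectors.setdefault(user, [0] * k)[cluster] += 1
    return vectors


def extended_jaccard(x: Sequence[float], y: Sequence[float]) -> float:
    """Extended Jaccard similarity: x.y / (|x|^2 + |y|^2 - x.y)."""
    dot = sum(a * b for a, b in zip(x, y))
    denom = sum(a * a for a in x) + sum(b * b for b in y) - dot
    # The denominator is zero only when both vectors are all zeros.
    return dot / denom if denom > 0 else 0.0
```

For instance, a user with one session each in clusters 0 and 2 gets the vector [1, 0, 1], and its extended Jaccard similarity to [2, 1, 0] is 2 / (2 + 5 − 2) = 0.4.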
The Dataset: Sulekha.com

Dataset details
Logs over a one-month period
Raw log size: 184 MB
453,953 files accessed
37,753 sessions in all
23,310 sessions after some preprocessing/filtering
2,493 users

Results: Session Clusters
Cluster 1 – interest in coffeehouse, contests
 - (/,12)(/movies,6)(/contests,178)
 - (/contests,142)
 - (/coffeehouse,5)(/contests,183)
 - (/contests,172)
 - (/,10)(/contests,143)
Cluster 2 – glance through home, articles
 - (/,22)(/articles,22)
 - (/,20)(/articles,20)
 - (/,21)(/articles,21)
 - (/,19)(/articles,19)
 - (/,20)(/articles,19)
Cluster 3 – interest in author, articles
 - (/,148)(/authors,6)(/articles,77)
 - (/authors,290)(/articles,290)
 - (/authors,295)(/articles,295)
 - (/,33)(/authors,90)(/articles,475)
 - (/,32)(/authors,91)(/articles,425)
Cluster 4 – read articles
 - (/,39)(/articles,98)(/misc,17)(/articles,2649)
 - (/,9)(/articles,2666)
 - (/authors,26)(/articles,2561)
 - (/misc,20)(/articles,77)(/misc,32)(/articles,43)(/authors,16)(/articles,2373.1)

Results: User Clusters
user: [(xxx.xxx)]
 – (/authors,3)(/articles,129)
 – (/authors,8)(/articles,8)
 – (/authors,80)(/articles,2141)
user: [(xxx.xxx)]
 – (/home,77)(/articles,111)(/authors,93)(/articles,629)(/misc,58)(/coffeehouse,75)(/wo-men,967)
 – (/articles,2627)
user: [(xxx.xxx)]
 – (/home,323)(/articles,24)(/authors,45)(/articles,1290)
A user cluster: people who read the articles

Results: User Clusters
user: [(xxx.xxx)]
 – (/home,21)(/wo-men,1075)(/philosophy,52)
user: [(xxx.xxx)]
 – (/home,5)(/coffeehouse,94)(/wo-men,75)(/movies,75)(/wo-men,31)
 – (/home,52)(/philosophy,67)(/wo-men,955)(/philosophy,26)(/coffeehouse,382)(/biztech,298)(/philosophy,290)
 – (/home,17)(/coffeehouse,12)(/wo-men,15)(/personal,6)(/biztech,94)(/coffeehouse,2)(/philosophy,1093)
A user cluster: people interested in wo-men, philosophy, coffeehouse

Results: User Clusters
user: [(xxx.xxx)]
 – (/coffeehouse,12)(/biztech,25)(/books,48)
 – (/coffeehouse,13)(/biztech,26)(/books,19)
user: [(xxx.xxx)]
 – (/coffeehouse,162)
 – (/coffeehouse,40)
user: [(xxx.xxx)]
 – (/coffeehouse,12)(/contests,12)
 – (/coffeehouse,43)(/contests,44)
A user cluster: people interested in coffeehouse (they bookmarked it!)

Result Visualization using CLUSION [Strehl & Ghosh 01]
[Figure panels: Sessions, Users]

Conclusions
Segmentation: a basic pre-processing step for Web Mining
Similarity measure + Cluster Space concept: applicable to clustering of sets of any data structure
For certain websites, time spent on the pages matters
 – not handled by current commercial tools
Outlier detection before clustering is important
Results QA-ed by human subjects
 – Results for clusters & outliers at both levels were subjectively good
No good way to find cluster quality analytically
Formation of similarity graph is a slow process

Future Work
Improve the present method by:
 – using cluster seeds for cluster growing
 – using alternative clustering algorithms for each stage
 – studying the effect of thresholds and number of clusters on performance
 – studying the importance of order of page visits
 – studying the importance of balanced clustering

Backup

Issues: Choice of Parameters
Number of session clusters, k, should be chosen appropriately
Thresholds for forming session & user similarity graphs:
 – threshold value should be chosen after looking at the distribution of edge weights

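One simple way to look at the edge-weight distribution before fixing a threshold is to count how many edges would survive each candidate value. The helper below is a hypothetical sketch, not a procedure described in the slides. It assumes all pairwise similarities are available as a flat list and that an edge is kept when its weight is strictly above the threshold.

```python
import bisect
from typing import Dict, Sequence


def edges_kept(weights: Sequence[float], thresholds: Sequence[float]) -> Dict[float, int]:
    """For each candidate threshold, count the edges with weight strictly above it."""
    ordered = sorted(weights)
    # bisect_right gives the number of weights <= t, so the rest are > t.
    return {t: len(ordered) - bisect.bisect_right(ordered, t) for t in thresholds}
```

A sharp drop in the surviving edge count between two neighbouring candidates suggests where the graph starts losing most of its structure, and with it where many vertices would become isolated outliers.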
Related Work
Research in Web Mining:
 – Extraction of navigational patterns: Spiliopoulou, Faulstich
 – Ordering relationships: Mannila, Meek
 – Surfing prediction: Pitkow, Pirolli
 – Clustering web usage sessions: Fu, Sandhu, Shih

Example
Sessions:
 – Session 1 = [(a,8) (b,100) (d,8) (c,5) (e,23) (a,5)]
 – Session 2 = [(b,5) (d,12) (f,1) (a,7) (c,5)]
LCS pages = [(b)(d)(c)]
Corresponding index and time sequences (indices are 0-based):
 – Index 1 = [(1)(2)(3)], Time 1 = [(100) (8) (5)]
 – Index 2 = [(0)(1)(4)], Time 2 = [(5) (12) (5)]
Similarity over each LCS page: ratio of the min to the max of the two times
 – Similarity on page b = 5/100 = 0.05
 – Similarity on page d = 8/12 = 0.67
 – Similarity on page c = 5/5 = 1.00

Example (contd.)
The similarity component = (0.05 + 0.67 + 1.00)/3 = 0.57
The importance component:
 – Fraction of time spent in the LCS by Session 1 = 113/149 = 0.76
 – Fraction of time spent in the LCS by Session 2 = 22/30 = 0.73
 – The mean = (0.76 + 0.73)/2 = 0.75
The overall similarity = 0.57 x 0.75 = 0.43

Issues: Session Resolution
Generate coarse-resolution paths making use of the concept hierarchy of the website
Reduces computations; increases interpretability of results
Example 1
 – Original path: (/authors/ramesh_mahadevan.html,3) (/articles/rm_phattas.html,75) (/articles/rm_desidads.html,39)
 – Concept-level path: (/authors,3) (/articles,114)
Example 2
 – Original path: (/authors/arun_sampath.html,109) (/philosophy/messages/1951.html,102) (/philosophy/messages/1953.html,46) (/,3) (/philosophy/messages/1954.html,69)
 – Concept-level path: (/authors,109) (/philosophy,148) (/,3) (/philosophy,69)

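The roll-up from original paths to concept-level paths on this slide could be sketched as follows. The functions `concept_of` and `to_concept_path` are hypothetical. The sketch assumes that the concept of a page is simply its top-level directory, that a lone file at the site root counts as the home page "/", and that consecutive visits to the same concept are merged by summing their times. The real concept hierarchy of the site may be richer than this.

```python
from typing import List, Tuple

Session = List[Tuple[str, float]]  # (page path, time spent on that page)


def concept_of(url: str) -> str:
    """Map a page path to its top-level concept (assumed: first directory)."""
    parts = [p for p in url.split("/") if p]
    if not parts:
        return "/"
    # A single file directly under the root is treated as the home page.
    if len(parts) == 1 and "." in parts[0]:
        return "/"
    return "/" + parts[0]


def to_concept_path(session: Session) -> Session:
    """Coarsen a session: merge consecutive pages that share a concept."""
    coarse: Session = []
    for url, seconds in session:
        concept = concept_of(url)
        if coarse and coarse[-1][0] == concept:
            # Same concept as the previous step: accumulate the time.
            coarse[-1] = (concept, coarse[-1][1] + seconds)
        else:
            coarse.append((concept, seconds))
    return coarse
```

Only consecutive visits are merged, so a later return to the same concept stays a separate step, as in the second example, where /philosophy appears twice in the concept-level path.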
Comments
Results QA-ed by human subjects
 – Results for clusters & outliers at both levels were subjectively good
 – No good way to find cluster quality analytically
Clustering algorithms for the two stages
 – Stage I: Graph partitioning works well for large sparse graphs, so it is desirable in this stage
 – Stage II: Since the space is not high-dimensional, any reasonable clustering algorithm should be adequate
Cluster space
 – Gives a general framework for mapping any non-vector clustering problem to an equivalent vector clustering problem