1
Mining the Structure of User Activity using Cluster Stability Jeffrey Heer, Ed H. Chi Palo Alto Research Center, Inc. 2002.04.13 – SIAM Web Analytics Workshop
2
Motivation Want to understand the composition of web user traffic. –What are users’ information goals? –Leads to improved site design, content, and performance Strategy: Content, Usage, and Topology
3
User Session Clustering Cluster user sessions into common activities such as product browsing and job seeking. A number of approaches have been proposed ([Shahabi97], [Fu99], [Banerjee01], and [Heer01]). These require specifying the number of clusters in advance or browsing a large cluster hierarchy. Can we automatically infer the structure of user activity?
4
Overview System Description –Clustering Method –Stability Analysis Case Studies Discussion
5
System Description Use web access logs and web site content to generate a user profile for each site visitor. –How: Build a multi-featured vector space model of user activity (multi-modal clustering). Group user profiles into common activities like “product browsing” and “job seeking” –How: Apply clustering algorithms to user profiles
6
System Description
1. Process Access Logs
2. Crawl Web Site
3. Build Document Model
4. Extract User Sessions
5. Build User Profiles
6. Cluster Profiles
(Diagram stages: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles)
7
Document Model The web site is crawled, and relevant pages listed in the web logs are retrieved. Retrieved data is represented as feature vectors:
–Content: TF.IDF weighted keyword vector
–URL: Tokenized and TF.IDF weighted
–Inlinks: Column vectors in topology matrix
–Outlinks: Row vectors in topology matrix
These are concatenated to form a single multi-modal vector $P_d$ for each document.
8
User Sessions Sessions are extracted from web logs and represented by an attribute vector with one entry per document. –For path i = A → B → D, $s_i = (1, 1, 0, 1, 0)$ (for a site with 5 documents, A to E) Experimented with various weightings for s, including viewing times and path position. Viewing times achieved the highest accuracy in empirical studies. –A (10s) → B (20s) → D (15s), $s_i = (10, 20, 0, 15, 0)$
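A minimal sketch of the viewing-time weighting described on this slide, assuming the five documents are named A to E and that a session is given as (document, seconds) pairs. The function name `session_vector` and the choice to sum time over repeated visits are illustrative assumptions, not details from the talk.

```python
from typing import Dict, List, Sequence, Tuple


def session_vector(path: Sequence[Tuple[str, float]],
                   doc_index: Dict[str, int]) -> List[float]:
    """Return a vector with one entry per site document, weighted by viewing time."""
    vec = [0.0] * len(doc_index)
    for doc, seconds in path:
        # Assumption: time spent on repeated visits to a page accumulates.
        vec[doc_index[doc]] += seconds
    return vec


# Hypothetical site with five documents, A to E.
docs = ["A", "B", "C", "D", "E"]
doc_index = {name: pos for pos, name in enumerate(docs)}

# Path A (10s) -> B (20s) -> D (15s)
s_i = session_vector([("A", 10), ("B", 20), ("D", 15)], doc_index)
print(s_i)
```

Replacing each viewing time with 1 gives the unweighted (visited / not visited) form of the same vector.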
9
User Profiles User profiles are created by linearly combining the document and session models: $UP_i = \sum_d s_{i,d} P_d$, that is, the document vectors $P_d$ weighted by the corresponding entries of the session vector $s_i$.
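One way to read "linearly combining" is as a weighted sum of document vectors, with the session vector supplying the weights. The sketch below assumes that reading; `user_profile` and the toy document vectors are hypothetical, and any normalization the authors may have applied is not shown.

```python
from typing import List, Sequence


def user_profile(session: Sequence[float],
                 doc_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of document vectors; session[d] is the weight of document d."""
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for weight, p_d in zip(session, doc_vectors):
        for k in range(dim):
            profile[k] += weight * p_d[k]
    return profile


# Three toy documents described by two features each (illustrative values).
doc_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The user spent 2 units on document 0 and 1 unit on document 2.
profile = user_profile([2.0, 0.0, 1.0], doc_vectors)
print(profile)
```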
10
Clustering Similarity metric is a weighted cosine measure. Clustering is then done by recursive bisection, using K-Means to perform the bisections [Karypis00, Zhao01]. The corresponding criterion function is: [equation not preserved in the source]
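The slide's formula for the weighted cosine measure did not survive, so the following is only one plausible form: compute a cosine per modality (content, URL, inlinks, outlinks) and combine them with per-modality weights. The function names, the dictionary layout, and the weights are all assumptions for illustration.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_cosine(u: Dict[str, Sequence[float]],
                    v: Dict[str, Sequence[float]],
                    weights: Dict[str, float]) -> float:
    """Assumed form: sum over modalities of weight * cosine within that modality."""
    return sum(w * cosine(u[m], v[m]) for m, w in weights.items())


# Two toy profiles split by modality (illustrative values and weights).
u = {"content": [1.0, 0.0], "url": [1.0, 1.0]}
v = {"content": [1.0, 0.0], "url": [1.0, -1.0]}
weights = {"content": 0.7, "url": 0.3}
print(weighted_cosine(u, v, weights))
```

Keeping the modalities separate makes it easy to switch one off (weight 0) when testing which features contribute to clustering accuracy.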
11
User population breakdown; detailed stats; keywords describing user groups; frequent documents accessed by group.
12
Clustering Evaluation Ran user study on www.xerox.com to evaluate effectiveness of method [Heer02]. 15 tasks, 5 task categories (104 user traces) Using certain modalities and weighting schemes we were able to achieve accuracies as high as 99%! Found that page content and page viewing time significantly contribute to clustering accuracy.
13
OK, Great, but… In real-world applications the number of clusters is an undetermined variable. Want a method for automatically choosing the number of clusters. After review of literature, decided to apply a cluster stability technique recently proposed by [BenHur02].
14
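Measuring Clustering Similarity For a given clustering of a data set X, define $C_{ij} = 1$ if $x_i$ and $x_j$ are in the same cluster and $i \neq j$, and $C_{ij} = 0$ otherwise.
Two clusterings can then be compared using a dot product: $\langle C^{(1)}, C^{(2)} \rangle = \sum_{i,j} C^{(1)}_{ij} C^{(2)}_{ij}$
This dot product can be normalized to get a cosine metric: $\cos(C^{(1)}, C^{(2)}) = \frac{\langle C^{(1)}, C^{(2)} \rangle}{\sqrt{\langle C^{(1)}, C^{(1)} \rangle \, \langle C^{(2)}, C^{(2)} \rangle}}$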
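The co-membership comparison on this slide can be sketched without building the full matrix, by representing each clustering as the set of point pairs that share a cluster. The function names are illustrative. Counting each unordered pair once instead of twice leaves the cosine unchanged, because the factor of 2 cancels.

```python
import math
from itertools import combinations
from typing import Sequence, Set, Tuple


def co_membership(labels: Sequence[int]) -> Set[Tuple[int, int]]:
    """Pairs (i, j), i < j, whose points fall in the same cluster (C_ij = 1)."""
    return {(i, j)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def clustering_similarity(labels_a: Sequence[int],
                          labels_b: Sequence[int]) -> float:
    """Cosine between the co-membership matrices of two clusterings."""
    pairs_a = co_membership(labels_a)
    pairs_b = co_membership(labels_b)
    if not pairs_a or not pairs_b:
        return 0.0  # a clustering of all singletons has a zero matrix
    shared = len(pairs_a & pairs_b)  # the dot product
    return shared / math.sqrt(len(pairs_a) * len(pairs_b))


# Same partition under different label names: similarity 1.
print(clustering_similarity([0, 0, 1, 1], [1, 1, 0, 0]))
# Partitions that never agree on a pair: similarity 0.
print(clustering_similarity([0, 0, 1, 1], [0, 1, 0, 1]))
```

Because only co-membership matters, the measure does not depend on how the clusters are numbered.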
15
Cluster Stability
for k = 2 to kmax
–for i = 1 to n
»S_i = subsample of data set X using sampling ratio f
»C_i = cluster(S_i, k)
–Perform pairwise comparisons of all C_i, generating a distribution of similarity values for the current k
Analyze the resulting distributions to determine the most stable clusterings.
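The loop above can be sketched as follows. Several details are assumptions rather than statements from the slides: pairs of subsamples are compared only on the points they share, the clustering routine is passed in as `cluster_fn`, and `gap_cluster` is a toy one-dimensional stand-in (split at the largest gaps), not the recursive-bisection K-Means used in the talk.

```python
import math
import random
from itertools import combinations
from typing import Callable, Dict, List, Sequence


def pair_cosine(a: Dict[int, int], b: Dict[int, int], shared: Sequence[int]) -> float:
    """Co-membership cosine of two labelings, restricted to shared point ids."""
    pa = {(i, j) for i, j in combinations(shared, 2) if a[i] == a[j]}
    pb = {(i, j) for i, j in combinations(shared, 2) if b[i] == b[j]}
    if not pa or not pb:
        return 0.0
    return len(pa & pb) / math.sqrt(len(pa) * len(pb))


def stability_distributions(data: Sequence[float],
                            cluster_fn: Callable[[List[float], int], List[int]],
                            k_max: int, n: int = 15, f: float = 0.8,
                            seed: int = 0) -> Dict[int, List[float]]:
    """For each k in 2..k_max, cluster n subsamples and compare them pairwise."""
    rng = random.Random(seed)
    size = int(f * len(data))
    result = {}
    for k in range(2, k_max + 1):
        runs = []
        for _ in range(n):
            ids = rng.sample(range(len(data)), size)   # subsample S_i
            labels = cluster_fn([data[i] for i in ids], k)  # C_i
            runs.append(dict(zip(ids, labels)))
        result[k] = [pair_cosine(a, b, sorted(set(a) & set(b)))
                     for a, b in combinations(runs, 2)]
    return result


def gap_cluster(values: List[float], k: int) -> List[int]:
    """Toy 1-D clustering (assumption): cut the sorted values at the k-1 largest gaps."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    cuts = set(sorted(range(1, len(order)),
                      key=lambda p: values[order[p]] - values[order[p - 1]],
                      reverse=True)[:k - 1])
    labels, current = [0] * len(values), 0
    for pos, i in enumerate(order):
        if pos in cuts:
            current += 1
        labels[i] = current
    return labels


# Two well-separated groups: k = 2 should be perfectly stable.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 10.0, 10.1, 10.2, 10.3, 10.4]
dists = stability_distributions(data, gap_cluster, k_max=3, n=5, f=0.8, seed=1)
print(min(dists[2]), len(dists[2]))
```

With n subsamples there are n(n-1)/2 pairwise similarity values per k, which form the distribution analyzed in the final step.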
16
Example Stability Analysis Example using 4 Gaussians [BenHur02] Graph on right shows plot of the cumulative similarity distribution
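A cumulative similarity distribution like the one plotted here can be computed from the pairwise similarity values for a given k. The helper below is an illustrative sketch with made-up similarity values. For a stable k, most of the mass sits near 1, so the curve stays low until the right edge.

```python
from typing import List, Sequence


def empirical_cdf(values: Sequence[float], thresholds: Sequence[float]) -> List[float]:
    """Fraction of similarity values less than or equal to each threshold."""
    n = len(values)
    return [sum(1 for v in values if v <= t) / n for t in thresholds]


# Hypothetical similarity values for one choice of k.
similarities = [0.5, 0.9, 1.0, 1.0]
cdf = empirical_cdf(similarities, [0.6, 0.95, 1.0])
print(cdf)
```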
17
Case Study 1 – www.xerox.com User study, 8/2001; 104 sessions. n = 15 subsamples, sampling ratio f = 0.8, k = 2 to 10.
18
Case Study 2 – guir.berkeley.edu Nov. 1-16, 2001; 7700 sessions. n = 30 subsamples, sampling ratio f = 0.8, k = 2 to 15.
19
Case Study 2 – guir.berkeley.edu n = 30 subsamples, sampling ratio f = 0.8, k = 3 to 7.
20
Cluster Contents (guir, k=5)
Cluster 1: DENIM Web Design Tool
Cluster 2: Research projects & publications
Cluster 3: Quiz-Bowl Competition Site
Cluster 4: CSCW (1 project + 1 course)
Cluster 5: Random pubs + project JavaDoc
At higher values of k, more concentrated clusters appear –Personal pages (faculty, students) cluster emerges –JavaDoc separates into its own cluster
21
Discussion Stability method shows some utility, but results are far from conclusive… perhaps web data is not particularly structured? User Goals –Does the user have a specific goal? Web Site Structure –Does the web site support user goals? Task Structure –Level of generality
22
Possible Cases User has task - Site supports task –www.xerox.com study User has task - Site doesn’t support it User w/o singular goals - Well designed site –Possibly guir.berkeley.edu User w/o task - Poorly designed site
23
The Future… More actionable empirical data –Need more users over a range of sites –Larger user study already begun Alternative approaches –Human supervision –Augmented stability metric / criterion function –Other clustering methods »Fuzzy Clustering
24
Questions? Suggestions?