CHI Web Behavior Patterns 1 Separating the Swarm: Categorization Methods for User Sessions on the Web
Jeffrey Heer, Ed H. Chi
Palo Alto Research Center
CHI Web Behavior Patterns 2 Web Analytics: What can you measure?
Areas: Marketing, Infrastructure, Site Design
Measures: content, page traffic, load testing, user intent, usability, user experience
Want to improve site design, content, and performance
CHI Web Behavior Patterns 3 The Change in Web Sites: What should you measure?
[Figure: axes Time and Site Complexity. Page-based websites (Products, Management Team) give way to activity-based websites (“I’d like information on used cars.” “Search for a car dealer in my neighborhood.”). Labels: TRAFFIC, USER EXPERIENCE.]
CHI Web Behavior Patterns 4 Motivation
What are users’ information goals?
Understanding the composition of web user traffic.
Strategy: Use all available data to discover user goals (Content, Usage, Topology).
Outline: System Description, Evaluation, Implications, Conclusion
CHI Web Behavior Patterns 5 System Description
Generate a user profile for each user session.
–How: Use access logs and site content to build a multi-featured model of user activity (multi-modal clustering).
Group user profiles into common activities like “product browsing” and “job seeking”.
–How: Apply clustering algorithms to user profiles.
CHI Web Behavior Patterns 6 System Description
Pipeline stages: Web Crawl, Access Logs, Document Model, User Sessions, User Profiles, Clustered Profiles
Steps:
1. Process Access Logs
2. Crawl Web Site
3. Build Document Model
4. Extract User Sessions
5. Build User Profiles
6. Cluster Profiles
CHI Web Behavior Patterns 7 Document Model
Site is crawled.
–Pay special attention to pages in logs.
Documents described by feature vectors:
–Content: TF.IDF weighted keyword vector
–URL: Tokenized and TF.IDF weighted
–Inlinks: Column vectors in topology matrix
–Outlinks: Row vectors in topology matrix
Vectors are concatenated to form a single multi-modal vector P_d for each document.
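The document model can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy pages and links, whitespace tokenization, and the particular TF.IDF variant (raw term count times log(N/df)) are all hypothetical choices, not details given on the slide.

```python
import math
import re
from collections import Counter

def tfidf_vectors(token_lists):
    """Return one TF.IDF vector per document (assumed variant: tf * log(N/df))."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(token_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(t for tokens in token_lists for t in set(tokens))
    vectors = []
    for tokens in token_lists:
        tf = Counter(tokens)
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocab])
    return vectors

def build_document_vectors(pages, links):
    """pages: {url: text}; links: set of (source_url, target_url) pairs."""
    urls = sorted(pages)
    content = tfidf_vectors([pages[u].lower().split() for u in urls])
    url_vecs = tfidf_vectors([re.findall(r"[a-z0-9]+", u.lower()) for u in urls])
    # Topology matrix: topo[i][j] = 1 if page i links to page j.
    topo = [[1.0 if (a, b) in links else 0.0 for b in urls] for a in urls]
    doc_vectors = {}
    for i, u in enumerate(urls):
        inlinks = [row[i] for row in topo]   # column i of the topology matrix
        outlinks = topo[i]                   # row i of the topology matrix
        # Concatenate the four modalities into one multi-modal vector P_d.
        doc_vectors[u] = content[i] + url_vecs[i] + inlinks + outlinks
    return doc_vectors

# Hypothetical three-page site, for illustration only.
pages = {
    "/index": "products jobs support",
    "/jobs/openings": "jobs careers openings",
    "/products/printers": "printer products color printer",
}
links = {("/index", "/products/printers"), ("/index", "/jobs/openings")}
vecs = build_document_vectors(pages, links)
```

Keeping the modalities as contiguous slices of one vector makes it easy to weight or drop a modality later.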
CHI Web Behavior Patterns 8 User Sessions
Sessions extracted and represented by a vector s:
–For path i = A → B → D, s_i has one entry per document (5 entries for a site with 5 documents).
Different weightings can be employed in creating the session vector s:
–Frequency: number of times each page is accessed.
–TF.IDF: hits / # paths including page.
–Position: use order of pages within surfing path.
–View Time: use time spent viewing pages, e.g. A (10s) → B (20s) → D (15s).
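A minimal sketch of the four session weightings, using the path A, B, D on a five-document site. The function name and arguments are hypothetical, and the Position scheme (weight equals the 1-based step at which the page was last visited) is an assumption, since the slide does not spell out the formula.

```python
def session_vector(path, pages, scheme="frequency", view_times=None, path_counts=None):
    """Build a session vector with one entry per page on the site.

    path        : ordered list of visited pages
    pages       : all pages on the site, in a fixed order
    view_times  : seconds spent at each step of the path (view_time scheme)
    path_counts : {page: number of logged paths that include it} (tfidf scheme)
    """
    index = {p: i for i, p in enumerate(pages)}
    s = [0.0] * len(pages)
    for step, page in enumerate(path, start=1):
        i = index[page]
        if scheme == "frequency":
            s[i] += 1.0                        # count each access
        elif scheme == "tfidf":
            s[i] += 1.0 / path_counts[page]    # hits / # paths including page
        elif scheme == "position":
            s[i] = float(step)                 # assumption: later steps weigh more
        elif scheme == "view_time":
            s[i] += view_times[step - 1]       # accumulate viewing time
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return s

site = ["A", "B", "C", "D", "E"]
path = ["A", "B", "D"]
by_frequency = session_vector(path, site)
by_view_time = session_vector(path, site, "view_time", view_times=[10, 20, 15])
```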
CHI Web Behavior Patterns 9 User Profiles
User profiles are a linear combination of the viewed pages.
–“You are what you see.”
User Profile = Σ_d s_d · P_d (session weights × document vectors)
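The linear combination can be written out directly. This is a sketch with hypothetical names and toy numbers: each document vector is scaled by its session weight and the results are summed.

```python
def user_profile(session_weights, doc_vectors):
    """Weighted sum of document vectors: profile = sum_d s_d * P_d."""
    dim = len(doc_vectors[0])
    profile = [0.0] * dim
    for weight, vec in zip(session_weights, doc_vectors):
        for k in range(dim):
            profile[k] += weight * vec[k]
    return profile

# Toy example: three documents with two features each (illustrative values).
doc_vectors = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
session = [2.0, 0.0, 1.0]   # e.g. frequency weights for one session
profile = user_profile(session, doc_vectors)
```

Pages that were never viewed have weight zero and so contribute nothing to the profile.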
CHI Web Behavior Patterns 10 Clustering
Clustering is a form of statistical analysis that organizes data into clusters.
–Groupings are determined by a shared similarity.
–Similarity is defined by a computable similarity metric; weights w_m specify the contribution of each modality.
Clustering proceeds by recursive bisection, using K-Means to perform the bisections [Zhao01].
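A rough sketch of recursive bisection with a modality-weighted similarity. Several things here are assumptions rather than the method of [Zhao01] or the talk: similarity is taken to be a weighted sum of per-modality cosine similarities, each bisection is a plain 2-means seeded with the least similar pair, and the largest cluster is always the next one split. All names and data are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def similarity(p, q, weights):
    """Assumed form: sum over modalities m of w_m * cosine(p[m], q[m])."""
    return sum(w * cosine(p[m], q[m]) for m, w in weights.items())

def centroid(members):
    """Per-modality mean of a non-empty list of profiles."""
    return {m: [sum(p[m][k] for p in members) / len(members)
                for k in range(len(members[0][m]))]
            for m in members[0]}

def bisect(members, weights, iters=10):
    """Split one cluster in two with 2-means (deterministic seeding)."""
    pairs = [(i, j) for i in range(len(members)) for j in range(i + 1, len(members))]
    i, j = min(pairs, key=lambda ij: similarity(members[ij[0]], members[ij[1]], weights))
    c1, c2 = members[i], members[j]
    left, right = list(members), []
    for _ in range(iters):
        left = [p for p in members if similarity(p, c1, weights) >= similarity(p, c2, weights)]
        right = [p for p in members if similarity(p, c1, weights) < similarity(p, c2, weights)]
        if not left or not right:
            break
        c1, c2 = centroid(left), centroid(right)
    return left, right

def bisecting_kmeans(profiles, weights, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(profiles)]
    while len(clusters) < k:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        left, right = bisect(biggest, weights)
        if not left or not right:      # could not split any further
            clusters.append(biggest)
            break
        clusters += [left, right]
    return clusters

# Hypothetical profiles with two modalities each.
profiles = [
    {"content": [1.0, 0.0], "url": [1.0, 0.0]},
    {"content": [0.9, 0.1], "url": [1.0, 0.0]},
    {"content": [0.0, 1.0], "url": [0.0, 1.0]},
    {"content": [0.1, 0.9], "url": [0.0, 1.0]},
]
weights = {"content": 0.7, "url": 0.3}
clusters = bisecting_kmeans(profiles, weights, k=2)
```

Setting a weight to zero removes that modality from the comparison, which is one way to compare unimodal and multi-modal schemes with the same code.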
CHI Web Behavior Patterns 11 [Screenshot with callouts: user population breakdown; detailed stats; keywords describing user groups; frequent documents accessed by group]
CHI Web Behavior Patterns 12 Clustering Results Users reached end of tutorial, had nowhere to go.
CHI Web Behavior Patterns 13 System Evaluation
Does the system correctly infer user intentions?
[Diagram: Logs → System → User Intent Groupings; compare with User Intent]
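One common way to compare inferred groupings with known intent is to label each cluster with its majority task group and count the sessions that match. The slide does not define the accuracy measure, so the sketch below is an assumption, with hypothetical names and data.

```python
from collections import Counter

def cluster_accuracy(cluster_ids, true_groups):
    """Fraction of sessions whose true group is the majority group of their cluster."""
    by_cluster = {}
    for cluster, group in zip(cluster_ids, true_groups):
        by_cluster.setdefault(cluster, []).append(group)
    # For each cluster, count the members that belong to its most common group.
    correct = sum(Counter(groups).most_common(1)[0][1] for groups in by_cluster.values())
    return correct / len(true_groups)

# Six hypothetical sessions: cluster assignments vs. the task group each user was given.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_groups = ["Products", "Products", "Jobs", "Jobs", "Jobs", "Jobs"]
accuracy = cluster_accuracy(cluster_ids, true_groups)
```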
CHI Web Behavior Patterns 14 User Study
Asked users to carry out specific tasks on the site:
–captured actions using the WebQuilt proxy logger [Hong01]
–done at their leisure.
15 unique tasks:
–Tasks developed after exploring xerox.com and reading user feedback
–5 task groups with 3 tasks per group
–Products, TechSupport, Supplies, Company Info, and Jobs
Participation:
–21 users signed up, 18 went through, 104 usable sessions.
CHI Web Behavior Patterns 15 Results: 340 combinations of clustering schemes
Outlink-based schemes performed poorly (omitted).
CHI Web Behavior Patterns 16 Analysis: Modalities
Linear Contrast shows Content significantly different:
–(unimodal) F(1,105)=32.51, MSE= , p<
–(multimodal) F(1,35)=33.36, MSE= , p<
Content is King! Mean=0.96, StdDev=0.07
CHI Web Behavior Patterns 17 Analysis: Path Weighting
Paired t-test between View Time-based and non-View Time-based weightings:
–n=60, t(59)=4.85, p=4.68e-6
–View Time: mean=89.5%, s.d.=12.7%
–non-View Time: mean=83.2%, s.d.=12.0%
View Time is best!
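For reference, the paired t statistic is the mean of the per-pair differences divided by its standard error, with n - 1 degrees of freedom. The sketch below uses made-up accuracy values purely to show the computation; they are not the study's data.

```python
import math
from statistics import mean, stdev

def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # Standard error of the mean difference uses the sample standard deviation.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative values only: accuracy of matched schemes with and without View Time.
with_view_time = [0.90, 0.80, 0.95, 0.85]
without_view_time = [0.85, 0.78, 0.90, 0.80]
t_stat, dof = paired_t(with_view_time, without_view_time)
```

Pairing matters here because each View Time scheme is compared against its own non-View Time counterpart rather than against the pooled group.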
CHI Web Behavior Patterns 18 Observation: Multi-Modal vs. Unimodal
In practice, Multi-Modal should be more robust:
–Some pages don’t have much content
»Images, Audio, Video
»PDF, PS (if you don’t have necessary software)
–URL Tokens: All pages have URLs.
–Inlinks: don’t depend on any features of a page!
In our experience, Content-based Multi-Modal Clustering retains accuracy.
Linear Contrast shows no significant difference between multi-modal and uni-modal schemes: F(1,77)=1.63, MSE= , p=.21
CHI Web Behavior Patterns 19 Findings
Incorporating View Time improves clustering accuracy.
Though it involves extra work, extracting Content can provide very high accuracy.
Adding other modalities makes clustering more robust.
Modalities should be chosen carefully, and tailored for each specific site.
CHI Web Behavior Patterns 20 Implications for Designers
Good design means understanding your users.
It’s possible to understand trends of user activities accurately.
–Requires well-defined user tasks doable on the site.
Now you can design and tailor user experience.
–Address discovered usability issues.
–Update design to facilitate common tasks.
CHI Web Behavior Patterns 21 Summary: “You are what you see.”
[Diagram labels: User Information Goals, Web site, Page Content, Topology, InfoScent Clustering, Observed Usage]
Users follow the best Information Scent to accomplish their goals.
CHI Web Behavior Patterns 22 Future Work
Determining # of clusters
–Currently done semi-manually
Model unstructured tasks more directly
Directly recommend design changes
Integrate with
–Clustering Visualization
–User Path Visualization
Lots of Commercial Interest, Licensing
CHI Web Behavior Patterns 23 Conclusion Performed first known user study to characterize the analytic space of session clustering techniques. Found that session clustering can be highly accurate with respect to user intentions. Demonstrated our method is scalable and useful in real-world scenarios. This should prove to be a useful tool for web designers and researchers!
CHI Web Behavior Patterns 24 Acknowledgements
Peter Pirolli, Stu Card, Adam Rosien, Pam Schraedley and the UIR and Bloodhound Team at PARC.
George Karypis for CLUTO software
Participants in our user study
Office of Naval Research
Contact: Jeff Heer, Ed H. Chi