Web Analytics Xuejiao Liu INF 385F: WIRED Fall 2004
Outline Introduction What is Web Analytics Why Web Analytics matters Secondary readings Log file analysis Web usage mining Data preparation KDD process Document access in repositories
Log File Lowdown (Michael Calore, 2001) Log file What is in a log file Traffic Audience Browsers/Platforms Errors Referers
Log File Lowdown Sample Log File adsl ilm.bellsouth.net - - [09/May/2001:13:42: ] "GET /about.htm HTTP/1.1" “ "Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)" Log File Analyzers WebTrends, Sawmill, Analog, Webalizer, HTTP-analyze
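As a rough illustration of what a log file analyzer does with entries like the one above, the following Python sketch parses lines and tallies traffic, errors, browsers, and referers. It assumes the server writes the Apache-style Combined Log Format; the host names, timestamps, status codes, and byte counts in SAMPLE_LINES are hypothetical and are not taken from the slide. It is a minimal sketch, not how WebTrends or any of the listed tools work internally.

```python
import re
from collections import Counter

# Assumed layout (Combined Log Format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Hypothetical entries, invented for illustration only.
SAMPLE_LINES = [
    'host1.example.net - - [09/May/2001:13:42:07 -0700] '
    '"GET /about.htm HTTP/1.1" 200 3741 "http://www.example.com/index.htm" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    'host2.example.net - - [09/May/2001:13:43:10 -0700] '
    '"GET /missing.htm HTTP/1.1" 404 512 "-" '
    '"Mozilla/4.0 (compatible; MSIE 5.0; Windows 98)"',
    'not a log line',
]


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def summarize(lines):
    """Tally hits, errors, pages, browsers, and referers from raw log lines."""
    records = [r for r in map(parse_line, lines) if r is not None]
    return {
        "hits": len(records),
        # 4xx and 5xx status codes are counted as errors.
        "errors": sum(1 for r in records if r["status"].startswith(("4", "5"))),
        "pages": Counter(r["path"] for r in records),
        "agents": Counter(r["agent"] for r in records),
        # "-" means the browser sent no referer.
        "referers": Counter(r["referer"] for r in records if r["referer"] != "-"),
    }


if __name__ == "__main__":
    for key, value in summarize(SAMPLE_LINES).items():
        print(key, value)
```

Lines that do not match the assumed format are skipped rather than guessed at, which is why the third sample line does not count as a hit.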
WebTrends log file analyzer Advantages Fast and effective User-friendly interface Feature-rich Support different operating systems Disadvantages Not free
WebTrends
The KDD Process for Extracting Useful Knowledge from Volumes of Data (Fayyad, U., G. Piatetsky-Shapiro, et al. 1996) KDD: Knowledge Discovery in Databases The value of data Definitions KDD Data mining
The KDD Process The KDD process 1.Creating a target dataset 2.Preprocessing and data cleaning 3.Data reduction and projection 4.Data mining Choosing the data mining function Choosing the data mining algorithm 5.Interpretation and evaluation
The KDD Process Data Mining Data mining involves fitting models to or determining patterns from observed data Data mining algorithms The model The preference criterion The search algorithm
The KDD Process Data Mining Model functions Classification Regression Clustering Dependency modeling Link analysis Goals of Data Mining Predictive and descriptive
Data Preparation for Mining World Wide Web Browsing Patterns (Cooley, R. W., B. Mobasher, et al. 1999) Web Usage Mining vs. data mining The WEBMINER process Preprocessing Mining algorithms Pattern Analysis
Data Preparation Preprocessing Data cleaning User identification Session identification Path completion Formatting
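The preprocessing steps listed above can be sketched in Python as follows. This covers only data cleaning, user identification, and session identification; path completion and formatting are left out. The ignored file suffixes, the use of the (host, user agent) pair as a user key, and the 30-minute timeout are assumptions made for illustration, not details confirmed by these slides.

```python
from collections import defaultdict

# Assumption: requests for embedded resources are not page views.
IGNORED_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js")

# Assumption: a gap longer than 30 minutes starts a new session.
SESSION_TIMEOUT = 30 * 60  # seconds


def clean(requests):
    """Data cleaning: drop requests whose path ends in an ignored suffix."""
    return [r for r in requests if not r[3].lower().endswith(IGNORED_SUFFIXES)]


def identify_sessions(requests, timeout=SESSION_TIMEOUT):
    """Group (host, agent, timestamp_seconds, path) tuples into sessions.

    User identification: each distinct (host, agent) pair is one user.
    Session identification: a user's hits are split wherever the gap
    between consecutive hits exceeds the timeout.
    Returns a list of ((host, agent), [path, ...]) tuples.
    """
    by_user = defaultdict(list)
    for host, agent, timestamp, path in requests:
        by_user[(host, agent)].append((timestamp, path))

    sessions = []
    for user, hits in by_user.items():
        hits.sort()  # order each user's hits by timestamp
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > timeout:
                sessions.append((user, [p for _, p in current]))
                current = []
            current.append(hit)
        sessions.append((user, [p for _, p in current]))
    return sessions


if __name__ == "__main__":
    # Hypothetical requests: (host, agent, seconds since start, path)
    demo = [
        ("a", "UA1", 0, "/index.htm"),
        ("a", "UA1", 5, "/logo.gif"),
        ("a", "UA1", 60, "/about.htm"),
        ("a", "UA1", 4000, "/index.htm"),
        ("a", "UA2", 30, "/index.htm"),
    ]
    for user, paths in identify_sessions(clean(demo)):
        print(user, paths)
```

Treating the same host with two different user agents as two users is a heuristic; it can still merge distinct visitors behind one proxy or split one visitor who switches browsers.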
Data Preparation
Tracking the Growth of a Site (Nielsen, Jakob, 1998) Exponential growth of the web and the internet Statistical method Logarithmic transformation of pageviews so that linear regression can be applied Statistical analysis Hypothesis: the site is growing (number of pageviews and date are correlated) R² and significance
Tracking the Growth of a Site R² = 0.96, p < 0.001
Tracking the Growth of a Site Predict growth rate Clean noise Confidence interval
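The regression approach on these slides can be sketched in Python: take the logarithm of the pageview counts, fit a straight line against the date, and read the growth rate off the slope. The function names and the demo figures below are hypothetical; the sketch computes the slope, intercept, and R² only, and does not compute the p-value or a confidence interval.

```python
import math


def fit_log_linear(days, pageviews):
    """Least-squares fit of ln(pageviews) = slope * day + intercept.

    Returns (slope, intercept, r_squared).
    """
    logs = [math.log(v) for v in pageviews]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    syy = sum((y - mean_y) ** 2 for y in logs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a one-variable linear fit, R squared is the squared correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


def predict_pageviews(day, slope, intercept):
    """Undo the log transform to get a pageview estimate for a given day."""
    return math.exp(slope * day + intercept)


def doubling_time(slope):
    """Days needed for traffic to double under the fitted exponential."""
    return math.log(2) / slope


if __name__ == "__main__":
    # Hypothetical measurements: day number and pageviews on that day.
    days = [0, 30, 60, 90, 120]
    views = [1000, 1900, 4200, 7900, 16500]
    slope, intercept, r2 = fit_log_linear(days, views)
    print("R squared:", round(r2, 3))
    print("doubling time (days):", round(doubling_time(slope), 1))
    print("forecast for day 150:", round(predict_pageviews(150, slope, intercept)))
```

Fitting on the log scale is what makes a straight line appropriate here: exponential growth in pageviews becomes linear growth in ln(pageviews).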
Predicting Document Access in Large, Multimedia Repositories (Recker, M. R. and J. E. Pitkow, 1996) Patterns of document requests in network-accessible multimedia databases Main idea Two related domains: Human memory and libraries Borrow models and research results from them
Predicting Document Access The model – human memory (Anderson and Schooler) The relationship of recency and performance is a power function The relationship of frequency and performance is a power function Two parameters for performance Need probability p and Need odds p/(1-p) The linear function: Log(Need odds) = a Log(Frequency) + b
Predicting Document Access Apply Human Memory Analysis in Document Requests Model Dataset: log file of Georgia Tech WWW repository A dynamic information ecology Frequency analysis Regression equation: Log(Need Odds) = 0.99 Log(Frequency) - 1.30 Recency analysis Regression equation: Log(Need Odds) = Log(days) + 0.41 Combining recency and frequency
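To make the need-odds quantities concrete, here is a small Python sketch that converts between need probability and need odds and evaluates the frequency regression with the coefficients from the slide (0.99 and -1.30). The slide does not state the base of the logarithm; natural log is assumed here, so the resulting probabilities are illustrative only. The function names are hypothetical.

```python
import math


def need_odds(p):
    """Need odds p / (1 - p) for a need probability p (0 <= p < 1)."""
    return p / (1 - p)


def probability_from_odds(odds):
    """Invert need_odds: recover p from odds = p / (1 - p)."""
    return odds / (1 + odds)


def predicted_need_probability(frequency, slope=0.99, intercept=-1.30):
    """Evaluate Log(Need Odds) = slope * Log(Frequency) + intercept.

    ASSUMPTION: Log is the natural logarithm; the slide does not say.
    Returns the need probability implied by the predicted odds.
    """
    log_odds = slope * math.log(frequency) + intercept
    return probability_from_odds(math.exp(log_odds))


if __name__ == "__main__":
    # Frequency = number of past requests for a document in the sample window.
    for frequency in (1, 2, 5, 10):
        p = predicted_need_probability(frequency)
        print(frequency, round(p, 3))
```

With a slope close to 1, the predicted need odds grow almost in proportion to past access frequency, which is the power-function relationship the model borrows from the human memory work.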
Predicting Document Access Conclusion Recency and frequency of past document access are strong predictors of future document access Recency proved to be a stronger predictor than frequency Applications for the design of information systems Determine optimal ordering of retrieved items Inform design decisions Design of caching algorithms