Ranking Web Sites with Real User Traffic. Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani. CNLL. Web Search and Data Mining, Stanford, California, February 11, 2008.
Outline: Data collection; Structural properties; Behavioral patterns; PageRank validation; Temporal patterns
Sources for Ranking Data: The Link Graph
Sources for Ranking Data: Dynamic Sources: network flow data; Web server logs; toolbars and plugins
Sources for Ranking Data: Web Server Logs
Sources for Ranking Data: Toolbars and Plugins
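Sources for Ranking Data: Packet Inspection (ISP, ~100K users)
Data Collection: GET requests on HTTP (port 80), from IU only, passed through an anonymizer [diagram label: peak]. Recorded fields: Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t). Data sets: FULL (h/p/r/a/t) and HUMAN (h/p/r/a/t).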
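The recorded fields can be turned into a weighted host graph of clicks. The sketch below is only an illustration under assumptions: the record layout (dicts with `host` and `referer` keys) and the rule "one request with a non-empty referer equals one click from the referer's host to the requested host" are hypothetical, not the authors' pipeline.

```python
from collections import Counter
from urllib.parse import urlsplit

def host_graph(requests):
    """Aggregate requests into weighted directed host-to-host edges.

    `requests` is an iterable of dicts with hypothetical keys 'host' and
    'referer'. Requests without a usable referer carry no link information
    and are only counted.
    """
    edges = Counter()
    no_referer = 0
    for req in requests:
        # hostname is None when the referer is empty or has no network location
        source = urlsplit(req.get("referer") or "").hostname
        target = req["host"]
        if source is None:
            no_referer += 1
        else:
            edges[(source, target)] += 1
    return edges, no_referer

# Toy records, invented for illustration only.
sample = [
    {"host": "b.example", "path": "/x", "referer": "http://a.example/index.html"},
    {"host": "b.example", "path": "/y", "referer": "http://a.example/about"},
    {"host": "c.example", "path": "/", "referer": ""},
]
edges, no_referer = host_graph(sample)
```

Requests with an empty referer are kept apart because they cannot be attributed to a link.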
Structural properties: Degree
Caveat: Sampling Bias
Structural properties: Strength (Site Traffic)
Structural properties: Weights (Link Traffic)
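As a reminder of how the three structural quantities relate, here is a minimal sketch that assumes the host graph is stored as a dict mapping `(source, target)` pairs to click counts (a hypothetical representation). Degree counts distinct neighbors, strength sums the traffic, and the weight is the traffic on a single link.

```python
from collections import defaultdict

def degree_and_strength(edges):
    """Return (in_degree, out_degree, in_strength, out_strength) per node.

    `edges` maps (source, target) -> weight, where the weight is the
    number of clicks observed on that link.
    """
    in_deg, out_deg = defaultdict(int), defaultdict(int)
    in_str, out_str = defaultdict(int), defaultdict(int)
    for (src, dst), weight in edges.items():
        out_deg[src] += 1       # one more distinct outgoing link
        in_deg[dst] += 1        # one more distinct incoming link
        out_str[src] += weight  # traffic leaving src through links
        in_str[dst] += weight   # traffic arriving at dst through links
    return in_deg, out_deg, in_str, out_str

# Toy weights, invented for illustration only.
toy_edges = {("a", "b"): 3, ("a", "c"): 1, ("c", "b"): 2}
in_deg, out_deg, in_str, out_str = degree_and_strength(toy_edges)
```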
Behavioral patterns (HUMAN) (Proportion of total out-strength)
Ratios are stable [plot; axis: Requests (× 10^6)]
Ratios are stable
In-degree ~ PageRank; Page traffic. Googlearchy: search engines amplify the rich-get-richer bias of the Web. Surfing without search engines: popularity reflects the rich-get-richer bias of the Web. Data: search mitigates the rich-get-richer bias of the Web (PNAS 2006).
Does search mitigate the rich-get-richer dynamics?
Validation of PageRank: PR is the stationary distribution of visit frequency for a modified random walk (with jumps) on the Web graph. Compare with actual site traffic (in-strength). From an application perspective, we care about the resulting ranking of sites rather than the actual values.
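The random walk with jumps can be sketched as a power iteration. This is a textbook-style illustration, not the computation used for the talk: the damping factor 0.85, the fixed iteration count, the uniform treatment of dangling nodes, and the toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    `links` maps each node to the list of nodes it links to. With
    probability `damping` the walker follows a uniformly chosen outgoing
    link; otherwise it jumps to a uniformly chosen node. Nodes without
    outgoing links always jump.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank mass sitting on nodes with no outgoing links.
        dangling = sum(rank[u] for u in nodes if not links.get(u))
        new_rank = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u, targets in links.items():
            for v in targets:
                new_rank[v] += damping * rank[u] / len(targets)
        rank = new_rank
    return rank

# Toy graph, invented for illustration only.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
pr = pagerank(links)
```

The resulting scores sum to 1 and can be compared, as a ranking, against measured in-strength.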
Kendall’s Rank Correlation
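Because only the ranking matters, the comparison reduces to counting concordant and discordant pairs. The sketch below computes the simplest variant, tau-a with no tie correction; which variant was used for the talk is not stated, and the numbers are toy values invented for illustration.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists (no tie correction)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1   # both lists order the pair the same way
        elif sign < 0:
            discordant += 1   # the lists disagree on the pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

# Toy data: hypothetical PageRank scores and traffic counts for three sites.
pagerank_scores = [0.40, 0.39, 0.21]
traffic = [120, 300, 45]
tau = kendall_tau(pagerank_scores, traffic)
```

A value of 1 means identical rankings, -1 means reversed rankings, and values near 0 mean the two rankings are essentially unrelated.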
PageRank Assumptions: (1) Equal probability of teleporting to each of the nodes; (2) Equal probability of teleporting from each of the nodes; (3) Equal probability of following each link from any given node
Kendall’s Rank Correlation
Local Link Heterogeneity: HH index of concentration or disparity [plot annotations: perfect concentration, perfect homogeneity]
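Assuming "HH" refers to the Herfindahl-Hirschman index in its standard form (the sum of squared traffic shares over a node's links), the two extremes on the slide can be reproduced with a few lines. The function name and the toy weights are illustrative only.

```python
def hh_index(weights):
    """Herfindahl-Hirschman index of a node's link weights.

    Returns 1.0 when all traffic uses a single link (perfect concentration)
    and 1/k when traffic is spread evenly over k links (perfect homogeneity).
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("node has no traffic on its links")
    return sum((w / total) ** 2 for w in weights)

concentrated = hh_index([10, 0, 0])   # all clicks on one link
homogeneous = hh_index([5, 5, 5, 5])  # clicks spread evenly over four links
```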
Teleportation Target Heterogeneity
Teleportation Source Heterogeneity (“hubness”): s_out < s_in: teleport sources, browsing sinks; s_out > s_in: popular hubs [plot annotation: -2]
Navigation vs. Jumps: Sources of Popularity
How predictable are traffic patterns? -- Cache refreshing (e.g. proxies) -- Capacity allocation (e.g. peering and provisioning for spikes) -- Site design (e.g. expose content based on time of day)
Temporal patterns: Predict the future host graph (clicks) from the current one, as a function of delay, using generalized temporal precision and recall
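The formulas for the generalized measures did not survive on this slide. One plausible weighted generalization, shown here purely as an assumption, treats the current graph as the prediction and scores it by the click overlap on each edge (the minimum of the two weights). The function name and toy graphs are hypothetical.

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a predictor of `future`.

    Both arguments map (source, target) -> click count. The overlap on an
    edge is the smaller of its two weights. Precision normalizes the total
    overlap by the predicted (current) traffic, recall by the actual
    (future) traffic.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    predicted = sum(current.values())
    actual = sum(future.values())
    precision = overlap / predicted if predicted else 0.0
    recall = overlap / actual if actual else 0.0
    return precision, recall

# Toy graphs, invented for illustration only.
current = {("a", "b"): 4, ("a", "c"): 1}
future = {("a", "b"): 2, ("b", "c"): 2}
precision, recall = temporal_precision_recall(current, future)
```

With unit weights this reduces to ordinary set-based precision and recall over edges. Evaluating it for pairs of snapshots separated by increasing delays gives predictability as a function of delay.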
HUMAN host graph (FULL is about 10% more predictable)
Summary: Heterogeneity in incoming and outgoing site traffic and in link traffic. Less than half of traffic is from following links. Only 5% of traffic is directly from search engines. High temporal regularity. PageRank is a poor predictor of traffic: the random walk and random teleportation assumptions are violated.
Next: Sampling bias and search bias. From host graph to page graph. Modeling traffic: beyond random walk?
THANKS! Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Vespignani, Alessandro Flammini. CNLL