Published by Gerald Daniel. Modified over 9 years ago.
1
Ranking Web Sites with Real User Traffic
Mark Meiss, Filippo Menczer, Santo Fortunato, Alessandro Flammini, Alessandro Vespignani
Web Search and Data Mining, Stanford, California, February 11, 2008
2
Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL
3
Outline: Data collection, Structural properties, Behavioral patterns, PageRank validation, Temporal patterns
4
Sources for Ranking Data: The Link Graph
5
Sources for Ranking Data: Dynamic Sources
-- Network flow data
-- Web server logs
-- Toolbars and plugins
6
Sources for Ranking Data: Web Server Logs
7
Sources for Ranking Data: Toolbars and Plugins
8
Sources for Ranking Data: Packet Inspection (diagram labels: ISP, ~100 K users)
9
Data Collection
-- Fields recorded per request: Host, Path, Referer, User-Agent, Timestamp
-- Diagram labels: HTTP (80), 30% @ peak, anonymizer
-- GET requests from IU only
-- Data sets: FULL (h/p/r/a/t), HUMAN (h/p/r/a/t)
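The slide lists the fields kept for each request but not how they become a graph. As a minimal sketch, assuming each request is a (host, path, referer, agent, timestamp) tuple and that a click is an edge from the referer's host to the requested host, the aggregation might look like this. The function name `build_host_graph` and the sample records are hypothetical, not taken from the talk.

```python
from collections import Counter
from urllib.parse import urlsplit

def build_host_graph(requests):
    """Aggregate request records into weighted host-to-host edges.

    Assumption: each record is (host, path, referer, agent, timestamp).
    Requests with an empty or unparsable referer are counted separately,
    since they cannot be attributed to a link on another page.
    """
    edges = Counter()        # (referer host, target host) -> number of clicks
    no_referer = Counter()   # target host -> requests without a usable referer
    for host, path, referer, agent, timestamp in requests:
        source = urlsplit(referer).hostname if referer else None
        if source is None:
            no_referer[host] += 1
        else:
            edges[(source, host)] += 1
    return edges, no_referer

# Hypothetical sample records, for illustration only.
sample = [
    ("b.example", "/", "http://a.example/index.html", "Mozilla/5.0", 1),
    ("b.example", "/x", "http://a.example/links", "Mozilla/5.0", 2),
    ("c.example", "/", "", "Mozilla/5.0", 3),
]
edges, no_referer = build_host_graph(sample)
```

Keeping the referer-less requests in their own counter makes it possible to separate traffic that arrives by following links from traffic that does not.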
11
Outline: Data collection, Structural properties, Behavioral patterns, PageRank validation, Temporal patterns
12
Structural properties: Degree
13
Caveat: Sampling Bias
14
Structural properties: Strength (Site Traffic)
15
Structural properties: Weights (Link Traffic)
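The three structural quantities on these slides (degree, strength, weights) can all be read off a weighted edge list. The sketch below uses the standard weighted-network definitions: degree counts distinct neighbours, strength sums the click weights. The dictionary layout and the name `degree_and_strength` are assumptions made for illustration.

```python
from collections import defaultdict

def degree_and_strength(edges):
    """Compute in/out degree and in/out strength from weighted edges.

    `edges` maps (source, target) -> weight (clicks on that link).
    Degree counts distinct links; strength sums their weights.
    """
    k_in, k_out = defaultdict(int), defaultdict(int)
    s_in, s_out = defaultdict(int), defaultdict(int)
    for (source, target), weight in edges.items():
        k_out[source] += 1
        k_in[target] += 1
        s_out[source] += weight
        s_in[target] += weight
    return k_in, k_out, s_in, s_out

# Hypothetical link traffic between three sites.
link_traffic = {("a", "b"): 3, ("a", "c"): 1, ("b", "c"): 2}
k_in, k_out, s_in, s_out = degree_and_strength(link_traffic)
```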
16
Outline: Data collection, Structural properties, Behavioral patterns, PageRank validation, Temporal patterns
17
Behavioral patterns (HUMAN) (Proportion of total out-strength)
18
Ratios are stable (plot axis: Requests, × 10^6)
19
Ratios are stable
20
Plot labels: In-degree ~ PageRank; Page traffic
-- Googlearchy: search engines amplify the rich-get-richer bias of the Web
-- Surfing without search engines: popularity reflects the rich-get-richer bias of the Web
-- Data: search mitigates the rich-get-richer bias of the Web
PNAS 2006
21
Does search mitigate the rich-get-richer dynamics?
22
Outline: Data collection, Structural properties, Behavioral patterns, PageRank validation, Temporal patterns
23
Validation of PageRank
-- PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph
-- Compare with actual site traffic (in-strength)
-- From an application perspective, we care about the resulting ranking of sites rather than the actual values
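To make the "random walk with jumps" concrete, here is a minimal power-iteration sketch of PageRank. The damping value 0.85, the uniform treatment of dangling nodes, and the toy graph are assumptions for illustration; the slides do not say which parameters the authors used.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration.

    `links` maps each node to the list of nodes it links to.
    With probability `damping` the walker follows a uniformly chosen
    out-link; otherwise it jumps to a uniformly chosen node. Nodes
    without out-links are assumed to jump uniformly as well.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank mass held by nodes with no out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not links.get(v))
        new = {v: (1 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            targets = links.get(v, [])
            for t in targets:
                new[t] += damping * rank[v] / len(targets)
        rank = new
    return rank

# Hypothetical three-site graph: a and c link to b, b links back to a.
toy_links = {"a": ["b"], "b": ["a"], "c": ["b"]}
pr = pagerank(toy_links)
```

The resulting values can then be ranked and compared against a ranking by measured in-strength, which is the comparison the slide describes.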
24
Kendall’s Rank Correlation
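Since only the ranking matters here, a rank correlation is the natural comparison. The sketch below computes the simple tau-a form of Kendall's coefficient, in which tied pairs count as neither concordant nor discordant. Whether the authors used this variant or a tie-corrected one is not stated on the slide, and the sample scores are hypothetical.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long score lists.

    A pair of items is concordant if both lists order it the same way,
    discordant if they order it oppositely. Tied pairs contribute zero.
    """
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / pairs

# Hypothetical PageRank scores and measured traffic for three sites.
pagerank_scores = [0.5, 0.3, 0.2]
site_traffic = [900, 100, 400]
tau = kendall_tau(pagerank_scores, site_traffic)
```

A value of 1 means the two rankings agree on every pair, -1 means they disagree on every pair, and values near 0 mean the rankings are close to unrelated.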
25
PageRank Assumptions
1. Equal probability of teleporting to each of the nodes
2. Equal probability of teleporting from each of the nodes
3. Equal probability of following each link from any given node
26
Kendall’s Rank Correlation
27
Local Link Heterogeneity
HH index of concentration or disparity (plot labels: perfect concentration, perfect homogeneity)
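The formula behind the index is not in this transcript. A common definition of such a concentration (disparity) index, assumed here, is the sum of squared traffic shares over a site's out-links: it equals 1 when all traffic follows a single link and 1/k when traffic is spread evenly over k links. The function name `hh_index` is hypothetical.

```python
def hh_index(weights):
    """Concentration index of a site's out-link traffic.

    Assumed definition: sum over links of (w / s)**2, where w is the
    traffic on one link and s is the site's total out-strength.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("site has no outgoing traffic")
    return sum((w / total) ** 2 for w in weights)

concentrated = hh_index([10])        # all clicks on one link
homogeneous = hh_index([1, 1, 1, 1]) # clicks spread evenly over four links
```

Under PageRank's third assumption every site would sit at the homogeneous end (1/k), so values well above 1/k indicate that users strongly prefer some links over others.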
28
Teleportation Target Heterogeneity
29
Teleportation Source Heterogeneity (“hubness”)
Plot labels: s_out < s_in, s_out > s_in, teleport sources, browsing sinks, popular hubs, -2
30
Navigation vs. Jumps: Sources of Popularity
31
Outline: Data collection, Structural properties, Behavioral patterns, PageRank validation, Temporal patterns
32
How predictable are traffic patterns?
-- Cache refreshing (e.g. proxies)
-- Capacity allocation (e.g. peering and provisioning for spikes)
-- Site design (e.g. expose content based on time of day)
33
Temporal patterns
-- Predict the future host graph (clicks) from the current one, as a function of delay
-- Metric: generalized temporal precision and recall
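The slide's formulas did not survive in this transcript. One plausible way to generalize precision and recall to weighted graphs, assumed here, is to measure the overlap of two snapshots as the sum over edges of the smaller of the two weights, then divide by the total weight of the current graph (precision) or the future graph (recall). The function name and sample snapshots are hypothetical.

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a predictor of `future`.

    Both arguments map (source, target) -> clicks. Assumed definition:
    overlap = sum of min(current weight, future weight) over shared edges;
    precision = overlap / total current weight;
    recall    = overlap / total future weight.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    precision = overlap / sum(current.values())
    recall = overlap / sum(future.values())
    return precision, recall

# Hypothetical host-graph snapshots separated by some delay.
now = {("a", "b"): 4, ("a", "c"): 2}
later = {("a", "b"): 2, ("b", "c"): 2}
precision, recall = temporal_precision_recall(now, later)
```

Evaluating this for increasing delays between the two snapshots gives predictability as a function of delay, which is what the slide describes.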
34
HUMAN host graph (FULL is about 10% more predictable)
35
Summary
-- Heterogeneity: incoming and outgoing site traffic, link traffic
-- Less than half of traffic is from following links
-- Only 5% of traffic is directly from search engines
-- High temporal regularity
-- PageRank is a poor predictor of traffic: random walk and random teleportation assumptions violated
36
Next
-- Sampling bias and search bias
-- From host graph to page graph
-- Modeling traffic: Beyond random walk?
37
THANKS! Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL?