Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining.


1 Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Flammini Alessandro Vespignani Web Search and Data Mining Stanford, California February 11, 2008

2 Ranking Web Sites with Real User Traffic Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL

3 Outline Data collection Structural properties Behavioral patterns PageRank validation Temporal patterns

4 Sources for Ranking Data: The Link Graph

5 Sources for Ranking Data: Dynamic Sources Network flow data Web server logs Toolbars and plugins

6 Sources for Ranking Data: Web Server Logs

7 Sources for Ranking Data: Toolbars and Plugins

8 ISP ~100 K users Sources for Ranking Data: Packet Inspection

9 Data Collection: HTTP (port 80) GET requests from IU only; 30% @ peak; anonymizer. Recorded fields: Host, Path, Referer, User-Agent, Timestamp (h/p/r/a/t). Data sets: FULL (h/p/r/a/t) and HUMAN (h/p/r/a/t).
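
A minimal sketch of how request records like these could be aggregated into a weighted host graph, from which the link traffic (weights) and site traffic (strength) of the later slides follow. The function names and the `(referer_host, target_host)` input shape are assumptions for illustration, not the authors' pipeline; requests with an empty referer are skipped here because they carry no link information.

```python
from collections import Counter


def build_host_graph(clicks):
    """Aggregate (referer_host, target_host) pairs into weighted edges.

    Hypothetical input shape: one pair per GET request. Requests with an
    empty referer are skipped (an assumption of this sketch).
    """
    weights = Counter()
    for referer_host, target_host in clicks:
        if referer_host:
            weights[(referer_host, target_host)] += 1
    return weights


def strengths(weights):
    """Return (s_in, s_out): total incoming and outgoing link traffic per host."""
    s_in, s_out = Counter(), Counter()
    for (src, dst), w in weights.items():
        s_out[src] += w
        s_in[dst] += w
    return s_in, s_out


clicks = [
    ("a.example", "b.example"),
    ("a.example", "b.example"),
    ("", "b.example"),          # no referer: not a link click
    ("b.example", "c.example"),
]
weights = build_host_graph(clicks)
s_in, s_out = strengths(weights)
```

Comparing `s_in` and `s_out` per host is the basis for the in-strength and out-strength quantities used in the structural and PageRank validation slides.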

10

11 Outline Data collection Structural properties Behavioral patterns PageRank validation Temporal patterns

12 Structural properties: Degree

13 Caveat: Sampling Bias

14 Structural properties: Strength (Site Traffic)

15 Structural properties: Weights (Link Traffic)

16 Outline Data collection Structural properties Behavioral patterns PageRank validation Temporal patterns

17 Behavioral patterns (HUMAN) (Proportion of total out-strength)

18 Ratios are stable (axis: Requests, × 10^6)

19 Ratios are stable

20 In-degree ~ PageRank. Page traffic. Googlearchy: search engines amplify the rich-get-richer bias of the Web. Surfing without search engines: popularity reflects the rich-get-richer bias of the Web. Data: search mitigates the rich-get-richer bias of the Web (PNAS 2006).

21 Does search mitigate the rich-get-richer dynamics?

22 Outline Data collection Structural properties Behavioral patterns PageRank validation Temporal patterns

23 Validation of PageRank: PR is a stationary distribution of visit frequency by a modified random walk (with jumps) on the Web graph. Compare it with actual site traffic (in-strength). From an application perspective, we care about the resulting ranking of sites rather than the actual values.

24 Kendall’s Rank Correlation
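
As a rough illustration of the rank comparison, the sketch below computes Kendall's tau in its simplest form (tau-a, which ignores ties in the denominator) with a quadratic pairwise loop. The transcript does not say which variant of tau the authors used, so the variant and the function name are assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a between two equal-length score lists.

    Counts concordant minus discordant pairs over all pairs; tied pairs
    count as neither. O(n^2), so only suitable for small inputs.
    """
    if len(x) != len(y):
        raise ValueError("score lists must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two items to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value near 1 means the two scores (for example PageRank and in-strength) order sites almost identically; a value near 0 means the orderings are essentially unrelated.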

25 PageRank Assumptions: 1. Equal probability of teleporting to each of the nodes. 2. Equal probability of teleporting from each of the nodes. 3. Equal probability of following each link from any given node.
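
To make the three assumptions concrete, here is a minimal power-iteration sketch of standard PageRank with comments marking where each assumption enters. The damping value 0.85, the iteration count, and the uniform handling of nodes without out-links are conventional choices assumed for this sketch; the slides do not specify them.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power iteration for PageRank on a dict {node: [target, ...]}."""
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Assumption 1: a jump lands on every node with equal probability 1/n.
        # Assumption 2: every node has the same jump probability (1 - damping).
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                # Assumption 3: each outgoing link is followed with equal probability.
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Node without out-links: spread its rank uniformly (sketch convention).
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


ranks = pagerank({"a": ["b"], "b": ["a"], "c": ["a"]})
```

Replacing the uniform terms with empirical jump targets, jump sources, or link weights is one way to probe which assumption costs the most agreement with real traffic.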

26 Kendall’s Rank Correlation

27 Local Link Heterogeneity: HH index of concentration or disparity (plot annotations: perfect concentration, perfect homogeneity)
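
The slide's formula did not survive in the transcript. Assuming "HH" refers to the Herfindahl-Hirschman index in its usual form, the sum of squared traffic shares over a site's links, a small sketch looks like this (the function name is hypothetical):

```python
def hh_index(link_weights):
    """Sum of squared shares of a site's traffic across its links.

    Equals 1 when all traffic uses a single link (perfect concentration)
    and 1/k when traffic is split evenly over k links (perfect homogeneity).
    """
    total = sum(link_weights)
    if total <= 0:
        raise ValueError("site must have positive total traffic")
    return sum((w / total) ** 2 for w in link_weights)
```

Under this definition, a site with four equally used links scores 0.25, while a site whose clicks all follow one link scores 1.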

28 Teleportation Target Heterogeneity

29 Teleportation Source Heterogeneity (“hubness”): regions s_out < s_in and s_out > s_in (plot labels: teleport sources, browsing sinks, popular hubs; annotation: -2)

30 Navigation vs. Jumps: Sources of Popularity

31 Outline Data collection Structural properties Behavioral patterns PageRank validation Temporal patterns

32 How predictable are traffic patterns? -- Cache refreshing (e.g. proxies) -- Capacity allocation (e.g. peering and provisioning for spikes) -- Site design (e.g. expose content based on time of day)

33 Temporal patterns: Predict the future host graph (clicks) from the current one, as a function of delay, using generalized temporal precision and recall.
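
The slide's definition of generalized precision and recall is not in the transcript. One plausible weighted generalization, assumed here purely for illustration, treats the current graph as the prediction and scores the click overlap per edge with `min`:

```python
def temporal_precision_recall(current, future):
    """Weighted precision and recall of `current` as a predictor of `future`.

    Both arguments map an edge (source_host, target_host) to a click count.
    Overlap per edge is min(current weight, future weight); this weighting
    is an assumption of the sketch, not the slide's stated formula.
    """
    overlap = sum(min(w, future.get(edge, 0)) for edge, w in current.items())
    total_current = sum(current.values())
    total_future = sum(future.values())
    precision = overlap / total_current if total_current else 0.0
    recall = overlap / total_future if total_future else 0.0
    return precision, recall


current = {("a.example", "b.example"): 4, ("b.example", "c.example"): 2}
future = {("a.example", "b.example"): 2, ("c.example", "d.example"): 2}
precision, recall = temporal_precision_recall(current, future)
```

Evaluating this for pairs of snapshots separated by increasing delay gives a predictability curve of the kind the slide describes.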

34 HUMAN host graph (FULL is about 10% more predictable)

35 Summary: Heterogeneity in incoming and outgoing site traffic and in link traffic. Less than half of traffic is from following links. Only 5% of traffic is directly from search engines. High temporal regularity. PageRank is a poor predictor of traffic: the random walk and random teleportation assumptions are violated.

36 Next: Sampling bias and search bias. From host graph to page graph. Modeling traffic: beyond random walk?

37 THANKS! Mark Meiss Filippo Menczer Santo Fortunato Alessandro Vespignani Alessandro Flammini CNLL?

