Understanding Crowds’ Migration on the Web
Yong Wang, Komal Pal, Aleksandar Kuzmanovic
Northwestern University
Yong Wang, Komal Pal Understanding Crowd’s Migration on the Web

A User-Driven Web Network
Node: number of unique visitors to a website.
Edge: number of common visitors between its two endpoints.
[Fig: Target graph with nodes including MSN and CNN; figure labels: 5.8M, 6.1M, 14.3M, 4.3M, 19M, 2.3M, 1.3M, 4.7M, 2M]

Motivation
Study the Web from the point of view of its users
–Evaluate properties of the network
    Analyze user movement among websites
    Determine properties of the user-driven Web network
    Compare to Online Social Networks and “classical” Web networks
–Mine data to serve
    Online advertisers
    Search engines

Our Contributions
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web

Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web

Information Reconstruction
Fact
–A plethora of information is made publicly available on a daily basis
    E.g., Google Trends, AdPlanner, Analytics, ALEXA, etc.
Problem
–The publicly available information snippets are not comprehensive
Approach
–Combine multiple data sources and develop methods to reconstruct globally meaningful information

Generating a User-Driven Web
[Fig: a parent node and its child/edge nodes]

Crawling
Breadth First Search for 15 days
3 seeds: nytimes.com, sina.com.cn, timesofindia.com
–US-centric network: ~297K nodes and 2M edges
–China-centric network: ~290K nodes and 2.7M edges
–India-centric network: ~297K nodes and 2.8M edges
Captured information:
–Number of unique users: Google AdPlanner
–Shared users: Google Trends
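
The breadth-first crawl from a seed can be sketched as follows. This is a minimal illustration, not the crawler used in the talk: `get_children` is a hypothetical callback standing in for whatever returned a site's related sites and relative weights, and the toy data and the `max_nodes` cap are made up.

```python
from collections import deque

def bfs_crawl(seed, get_children, max_nodes=1000):
    """Breadth-first crawl from one seed site.

    get_children(site) is a hypothetical callback that returns a dict
    {child_site: relative_weight}; the real data source is not modeled here.
    """
    visited = {seed}
    edges = {}                      # (parent, child) -> relative weight
    queue = deque([seed])
    while queue:
        parent = queue.popleft()
        for child, weight in get_children(parent).items():
            if child not in visited:
                if len(visited) >= max_nodes:
                    continue        # node cap reached: skip unseen sites
                visited.add(child)
                queue.append(child)
            edges[(parent, child)] = weight
    return visited, edges

# Toy neighbourhood data (made up for illustration).
toy = {
    "a.com": {"b.com": 1.0, "c.com": 0.4},
    "b.com": {"c.com": 1.0, "d.com": 0.2},
}
nodes, edges = bfs_crawl("a.com", lambda site: toy.get(site, {}))
```

Each sub-graph collected this way carries weights on its own relative scale, which is why a normalization step is needed before sub-graphs can be merged.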

Problems without Normalization
Network without normalization (problems!)
[Figs: Sub-graph A; Sub-graph B; Merged graphs A&B without normalization (nodes A to G)]
Weight to the first child is always set to

Ideal Normalized Network
[Fig: Normalized graph (target scenario), nodes A to G]
Weights scaled w.r.t. weight(AD)

Normalization Process
[Fig: Parent nodes; relationship between Website 2 and the child nodes of Website 1]

Normalization Process
Phase 1: Select a starting point (a node with maximum in-degree, say C)
–Select a parent (A) of C, and a child (D) of A
–Normalize all other parent nodes to the weight of AD (by querying the parent nodes together with A)
Normalized nodes: nodes all of whose edges are normalized
[Fig: nodes A, B, C, D, F, G. Legend: normalized node; child of a normalized node]

Normalization Process
Phase 2: Back link from a child of a normalized node to its parent
–The weight of the forward link must equal the weight of the backward link
[Fig: nodes A, B, C, D, E. Legend: normalized node; child of a normalized node]

Normalization Process
Phase 3: A child of a normalized node (D) shares a child (C) with a normalized node (A)
–We can normalize D (by querying it together with node A)
–Note: the shared child (green) could itself be either a normalized node or a child of a normalized node
[Fig: nodes A, B, C, D, E. Legend: either a normalized node or a child of a normalized node; normalized node; child of a normalized node]

Normalization Process
Phase 4: A node (D) shares a child (C) with a normalized node (A)
–We can normalize D (by querying it together with node A)
–Note: node D (black) is initially neither a normalized node nor a child of a normalized node
[Fig: nodes A, B, C, D, E. Legend: neither a normalized node nor a child of a normalized node; either a normalized node or a child of a normalized node; normalized node; child of a normalized node]
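
Phases 1, 3, and 4 all rely on the same operation: a node's edge weights are moved onto the scale of an already normalized node by querying the two nodes together. A minimal sketch of that rescaling follows. It assumes a joint query returns both nodes' edges to the shared child on one common scale. The function name, its arguments, and the numbers are hypothetical, since the slides do not describe the query mechanics.

```python
def normalize_edges(own_weights, shared_child, joint_ref, joint_new, ref_normalized):
    """Rescale a node's edge weights onto the already normalized scale.

    own_weights    : {child: weight} for the new node, on its own private scale
    shared_child   : child shared with an already normalized reference node
    joint_ref      : weight(reference -> shared_child) in the joint query
    joint_new      : weight(new node -> shared_child) in the same joint query
    ref_normalized : normalized weight(reference -> shared_child)
    """
    # The joint query puts both edges on one scale, so their ratio carries
    # the normalized scale over to the new node's edge to the shared child.
    target = ref_normalized * joint_new / joint_ref
    # Every other edge of the new node is scaled by the same factor.
    factor = target / own_weights[shared_child]
    return {child: w * factor for child, w in own_weights.items()}

# Hypothetical numbers: D's private scale gives D->C = 1.0 and D->E = 0.5.
# Queried together with A, the edge A->C reads 1.0 and D->C reads 0.25,
# while A->C is already normalized to 0.8.
d_edges = normalize_edges({"C": 1.0, "E": 0.5}, "C",
                          joint_ref=1.0, joint_new=0.25, ref_normalized=0.8)
```

In this toy case D's edges become roughly 0.2 (to C) and 0.1 (to E), keeping their original 2:1 ratio while matching A's scale.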

Normalization Process
Validation
–Popularity ranking of our normalized network compared to Google AdPlanner
–The two ranking results match in 91.66% of cases
Adding absolute traffic
–Google AdPlanner for the number of unique users
Unifying the two scale systems
–Top 10 children are sufficient
–Relative weight -> absolute weight

Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web

Weighted Degree Distribution
–The sum of link weights for each node
–Log-normal distribution
    OSN and WWW follow a power-law distribution
–Small-traffic sites filtered by Google Trends
–Seed-free properties with distinctions
    Extreme values
[Fig: weighted degree distributions for the US, India, and China networks. Annotations: minimum degree nodes; maximum degree nodes; high peak => strong connectedness]
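
The weighted degree defined above, the sum of link weights at each node, can be computed as in this minimal sketch. It assumes an undirected edge list of `(u, v, weight)` tuples; the node names and weights are made up.

```python
from collections import defaultdict

def weighted_degrees(edges):
    """Sum of incident link weights per node for an undirected weighted graph.

    edges: iterable of (u, v, weight) tuples.
    """
    degree = defaultdict(float)
    for u, v, w in edges:
        degree[u] += w      # each edge contributes its weight
        degree[v] += w      # to both of its endpoints
    return dict(degree)

# Toy graph (made-up shared-visitor counts, in millions).
toy_edges = [("a", "b", 2.0), ("a", "c", 0.5), ("b", "c", 1.5)]
deg = weighted_degrees(toy_edges)
```

One way to examine a log-normal claim on such data is to look at the distribution of the logarithm of these values, which would be approximately normal.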

Average Path Length and Diameter
–User-Driven Web has properties closer to Online Social Networks than to WWW
    The human component makes the network more connected
–Larger average path length for the Chinese network
    Because high-degree clusters in the core are loosely connected with low-degree clusters at the edges
    For the other 2 networks, high-degree clusters in the core are well connected to the nodes at the edges
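
For reference, both metrics can be computed with repeated breadth-first searches. The sketch below assumes an unweighted, connected graph given as an adjacency dict and measures distance in hops. Whether the talk used hop counts or weighted distances is not stated, so this is only an illustration on a made-up toy graph.

```python
from collections import deque

def hop_distances(adj, source):
    """Hop distance from source to every reachable node (plain BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def path_stats(adj):
    """Return (average shortest path length, diameter) over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in hop_distances(adj, s).items():
            if t != s:
                total += d
                pairs += 1
                diameter = max(diameter, d)
    return total / pairs, diameter

# Toy chain a - b - c - d.
chain = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
avg_len, diam = path_stats(chain)
```

Running one BFS per node costs O(V * (V + E)), so on graphs with hundreds of thousands of nodes these values are often estimated from a sample of source nodes instead.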

Clustering Coefficient
–High clustering coefficients
    4 orders of magnitude higher than the corresponding random graphs
–Clustering coefficients uniform for the three networks
    China:
    –High-degree and low-degree nodes are separately clustered and loosely connected
    US:
    –High-degree nodes are clustered in the core, while low-degree nodes are not well clustered
    India:
    –A smaller difference between high- and low-degree node clusters
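
As a reminder of what is being measured, the sketch below computes the standard local clustering coefficient (the fraction of a node's neighbour pairs that are themselves linked) and its average over all nodes. It assumes an unweighted, undirected graph stored as a dict of neighbour sets. The slides do not say which variant was used, and the toy graph is made up.

```python
from itertools import combinations

def local_clustering(adj, node):
    """Fraction of neighbour pairs of `node` that are directly connected."""
    neighbours = adj[node]
    k = len(neighbours)
    if k < 2:
        return 0.0                  # no neighbour pairs to close a triangle
    links = sum(1 for a, b in combinations(neighbours, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def average_clustering(adj):
    """Mean of the local clustering coefficients over all nodes."""
    return sum(local_clustering(adj, n) for n in adj) / len(adj)

# Toy graph: triangle a-b-c with a pendant node d attached to c.
toy_adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
avg_cc = average_clustering(toy_adj)
```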

Network Properties
User-Driven Web is closer to Online Social Networks than to WWW in all properties
–The human component prevails
Seed-free properties
–Independent of the starting crawling point
Scale-free properties
–Independent of the network scale

Outline
Generate the user-driven Web network
Study the user-driven Web
Apply the user-driven Web

Online Advertising
[Fig: user-driven Web graph with nodes including MSN and CNN; figure labels: 5.8M, 6.1M, 14.3M, 4.3M, 19M, 2.3M, 1.3M, 1700, 2M]

Website Selector
Problem: Find the best selection of websites (ad hosts) that provides maximum visibility at minimum cost
Target users
–Independent advertisers
–Ad commissioners
Alternative approaches:
–Greedy
    Choose the websites in descending order of their popularity
–Sub-optimal
    Linear optimization without shared-user information

Modeling
Inputs
–CPI model: random normal distribution
–User-driven Web
–Budget
Output
–List of potential ad hosts providing maximum visibility within budget constraints

Optimization Problem
Maximize:
$$\sum_i u_i x_i - \sum_j \sum_{k \neq j} s_{jk} x_j x_k$$
subject to the linear constraint:
$$\sum_i c_i x_i \le B$$
where
$x_i$: website (node) $i$
$u_i$: number of unique users on node $x_i$
$s_{jk}$: number of shared users between $x_j$ and $x_k$
$c_i$: CPI for node $x_i$
$B$: budget constraint
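
To make the objective concrete, the sketch below evaluates it on a toy instance and compares an exhaustive search with a popularity-ordered greedy pick. Several things here are assumptions rather than the talk's method: each x_i is read as a 0/1 selection indicator, each shared-user pair is counted once (the double sum on the slide could also be read as counting ordered pairs), brute force stands in for whatever solver was actually used, and all numbers are made up.

```python
from itertools import combinations

def objective(selected, users, shared):
    """Unique users reached minus users shared between selected sites."""
    reach = sum(users[i] for i in selected)
    overlap = sum(shared.get(frozenset(pair), 0)
                  for pair in combinations(selected, 2))
    return reach - overlap

def best_selection(users, shared, cost, budget):
    """Exhaustive search over subsets; only practical for a handful of sites."""
    sites = sorted(users)
    best, best_value = (), 0
    for r in range(1, len(sites) + 1):
        for subset in combinations(sites, r):
            if sum(cost[i] for i in subset) <= budget:
                value = objective(subset, users, shared)
                if value > best_value:
                    best, best_value = subset, value
    return set(best), best_value

def greedy_selection(users, cost, budget):
    """Pick sites in descending popularity while the budget allows."""
    chosen, spent = set(), 0
    for site in sorted(users, key=users.get, reverse=True):
        if spent + cost[site] <= budget:
            chosen.add(site)
            spent += cost[site]
    return chosen

# Toy instance (made-up numbers): sites a and b share most of their users.
users = {"a": 10, "b": 9, "c": 4}
shared = {frozenset({"a", "b"}): 8,
          frozenset({"a", "c"}): 1,
          frozenset({"b", "c"}): 1}
cost = {"a": 5, "b": 5, "c": 3}

picked, value = best_selection(users, shared, cost, budget=10)
greedy = greedy_selection(users, cost, budget=10)
greedy_value = objective(greedy, users, shared)
```

On this toy instance the greedy pick spends the whole budget on the two most popular sites, whose audiences largely overlap, while the search that accounts for shared users prefers one popular site plus a smaller site with little overlap.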

Performance Results
Greedy approach used as a baseline
Sub-optimal approach lacks shared-user information
–And hence does not perform well in improving ad visibility
Website Selector improves performance by %

Eliminating High-Volume Websites
5% of the top 1,000 websites eliminated (volume >= 1M)
Several cases of high-volume nodes being ignored due to a significant number of shared users
[Fig: example with nodes including MSN and CNN; figure labels: 2.9M, 11M, 23M, 1.2M, 0.7M; CPI~$42, CPI~$49, CPI~$53; one node marked ✗]

Conclusions
Generated user-driven Web
–Used publicly available information
–Designed methods to fuse pieces into a global network
Studied user-driven Web and its properties
–Scale- and seed-free network properties
–User-driven Web different from “classical Web” but similar to Online Social Networks
Designed website selector
–Incorporates idea of “shared visitors” between websites
–Increases visibility of ads by 22-25%, increases revenue
–Tailored for ad commissioners
Thank You