The Mobile Web is Structurally Different
  Apoorva Jindal (USC), Chris Crutchfield (MIT), Samir Goel (Google Inc), Ravi Jain (Google Inc), Ravi Kolluri (Google Inc)
The Mobile Web?
  - Web pages designed for consumption on mobile wireless devices
      - CHTML, XHTML, WML
  - All other pages are referred to as the fixed web
  - Becoming more important
      - Better devices
      - Better networks
      - Cheaper plans
  - Different from the fixed web?
      - Smaller pages
      - Fewer hyperlinks
      - Fewer images
Web graph
  - pages ↔ nodes, hyperlinks ↔ edges
  - Properties of this graph
      - In-degree distribution
      - Out-degree distribution
      - Strongly connected component size distribution
      - …
  - Importance
      - Used in basic algorithms to implement search
          - Crawling
          - Ranking the web pages
      - Studied in detail for the fixed web
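The mapping from pages and hyperlinks to a graph, and the two degree distributions listed above, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `degree_distributions` and the toy edge list are hypothetical, and repeated hyperlinks between the same pair of pages are counted as separate edges, which is an assumption rather than something the slides specify.

```python
from collections import Counter

def degree_distributions(edges):
    """Return (in_hist, out_hist), each mapping degree -> number of nodes.

    `edges` is an iterable of (source_page, target_page) hyperlinks.
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for src, dst in edges:
        nodes.update((src, dst))
        out_deg[src] += 1   # one more outgoing hyperlink from src
        in_deg[dst] += 1    # one more incoming hyperlink to dst
    # Nodes with no incoming (or outgoing) links get degree 0.
    in_hist = Counter(in_deg[n] for n in nodes)
    out_hist = Counter(out_deg[n] for n in nodes)
    return in_hist, out_hist

# Hypothetical toy graph: four pages, five hyperlinks.
toy_edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "a"), ("d", "a")]
in_hist, out_hist = degree_distributions(toy_edges)
```

On a real corpus the histograms would be plotted on log-log axes to inspect the tail of each distribution.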
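Bow-tie Structure [Broder et al 2000]
  - Model to describe the structure of the fixed web
  - Components
      - SCC: the largest strongly connected component (the core)
      - IN: pages that can reach the SCC but cannot be reached from it
      - OUT: pages that can be reached from the SCC but cannot reach it
      - Tendrils: pages attached to IN or OUT that are not in SCC, IN or OUT
      - Disconnected: pages with no connection to the rest of the bow-tie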
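A bow-tie decomposition of a small graph can be sketched as follows. This is a simplified illustration, not the COSIN tooling used in the study: the function names are hypothetical, the largest SCC is found by brute force (suitable only for small graphs), and every node that is weakly connected to the core but outside SCC, IN and OUT is lumped into "tendrils" (so tubes are not separated out).

```python
from collections import defaultdict, deque

def reachable(adj, starts):
    """Return all nodes reachable from `starts` (inclusive) via BFS over `adj`."""
    seen = set(starts)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def bow_tie(edges):
    """Split the nodes of a directed graph into bow-tie components."""
    fwd, bwd, und = defaultdict(set), defaultdict(set), defaultdict(set)
    nodes = set()
    for s, d in edges:
        nodes.update((s, d))
        fwd[s].add(d)
        bwd[d].add(s)
        und[s].add(d)
        und[d].add(s)

    # Brute-force largest SCC: the SCC of n is (forward reach) & (backward reach).
    scc, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        comp = reachable(fwd, [n]) & reachable(bwd, [n])
        assigned |= comp
        if len(comp) > len(scc):
            scc = comp

    in_set = reachable(bwd, scc) - scc    # can reach the core
    out_set = reachable(fwd, scc) - scc   # reachable from the core
    weak = reachable(und, scc)            # weakly connected to the core
    return {
        "SCC": scc,
        "IN": in_set,
        "OUT": out_set,
        "Tendrils": weak - scc - in_set - out_set,
        "Disconnected": nodes - weak,
    }

# Hypothetical toy graph: core {a, b}, IN {i}, OUT {o}, tendril {t}, island {x, y}.
toy_edges = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
parts = bow_tie(toy_edges)
```

Dividing the size of each set by the total number of nodes gives percentages of the kind reported for each corpus.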
Methodology
  - Corpora
      - Google’s mobile web index, June 2007
          - CHTML
          - XHTML + WML
      - Webbase 2001
      - Google’s fixed web index, July 2007
  - Page-level analysis
      - In-degree & out-degree distributions
          - Tools based on MapReduce
          - Use [Clauset et al 2006] to infer the power-law coefficient
      - Determine bow-tie structure properties
          - Use COSIN tools [Donato et al 2004]
          - Limitation: cannot handle Google fixed web 2007 at page level
  - Domain-level analysis
      - Collapse all pages in a domain to one node
      - Use tools based on MapReduce
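Inferring a power-law coefficient from a list of node degrees can be illustrated with a standard maximum-likelihood estimator. The sketch below uses the continuous (Hill-type) form, alpha = 1 + n / sum(ln(x_i / x_min)); whether this matches the exact procedure of the cited method is an assumption, and the function name, the default `x_min`, and the sample degrees are hypothetical.

```python
import math

def power_law_alpha(degrees, x_min=1):
    """Continuous maximum-likelihood estimate of alpha for p(x) ~ x^(-alpha), x >= x_min.

    Only degrees >= x_min are used; degree-0 nodes are therefore ignored.
    """
    tail = [x for x in degrees if x >= x_min]
    log_sum = sum(math.log(x / x_min) for x in tail)
    if log_sum == 0:
        raise ValueError("need at least one degree strictly above x_min")
    return 1.0 + len(tail) / log_sum

# Hypothetical in-degree sample.
sample_degrees = [1, 1, 2, 4]
alpha = power_law_alpha(sample_degrees)
```

For integer-valued degrees a discrete correction (for example replacing `x_min` by `x_min - 0.5` inside the logarithm) is often applied, and `x_min` itself is usually chosen by a goodness-of-fit criterion rather than fixed by hand.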
Page-level Graph Properties: Degree Distributions
  - Mobile web is sparser
  - [Table: average node degree and coefficient of the power-law distribution (in-degree, out-degree) for the XHTML+WML, CHTML and Webbase corpora; cell values missing]
  - CHTML lies between XHTML+WML and the fixed web
  - Out-degree distribution falls off faster for the mobile web
Page-level Graph Properties: Bow-tie Structure
  - Mobile web
      - Smaller SCC
      - Larger IN and smaller OUT
      - Bigger Disconnected + Tendrils
  - Connectivity: Fixed Web > CHTML > XHTML/WML

  Corpus       SCC     IN      OUT     Tendrils   Disconnected
  XHTML+WML    10.5%   18%     10.4%   18.3%      42.7%
  CHTML        22%     25.9%   14.2%   22%        15.8%
  Webbase      33%     11%     39%     13%        4%
Language Properties
  - Sub-graphs of pages that share a common trait, such as a keyword or location
      - Called Thematically Unified Clusters (TUCs)
      - In the fixed web, they retain the structural properties of the entire graph
      - Mobile web?

  Corpus   Language   Fraction of Nodes
  XHTML    Chinese    42.6%
           English    22.3%
           Russian    13.4%
           French     3.4%
           German     2.3%
  CHTML    Japanese   92.3%
           English    5.9%

  Corpus       SCC     IN     OUT     Tendrils   Disconnected
  XHTML+WML    10.5%   18%    10.4%   18.3%      42.7%
  Chinese      13%     22%    9%      14%        42%
  English      2%      3%     7%      25%        63%
  Russian      22%     40%    8%      11%        19%

  - Japanese is not studied separately: its properties are the same as those of CHTML
Domain-level Graph Properties
  - Domain-level graph
      - Collapse all nodes for a domain into a single super-node
      - Compare mobile web 2007 and fixed web 2007
  - Advantages
      - Allows us to understand the differences at a much coarser level
      - Allows us to compare the present-day fixed and mobile webs

  Corpus       Avg Node Degree   SCC         IN      OUT     Tendrils + Disconn.
  XHTML+WML    [missing]         [missing]   40.7%   2.73%   15.9%
  CHTML        5.56              83%         16.4%   0.22%   0.36%
  Fixed web    [missing]         [missing]   5.62%   0.4%    0.03%

  - Observations
      - Domain-level graphs are better connected
      - XHTML+WML has a much larger Disconnected component
      - CHTML properties lie between XHTML+WML and the fixed web
      - Structural differences between the domain-level fixed web and mobile web are the same as the differences between the page-level fixed web and mobile web
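Collapsing a page-level graph into a domain-level graph can be sketched as below. The slides do not define "domain" precisely, so treating the URL hostname as the domain is an assumption, as are dropping intra-domain links and de-duplicating parallel edges; the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit

def collapse_to_domains(page_edges):
    """Map page-level hyperlinks to a set of (source_domain, target_domain) edges."""
    domain_edges = set()
    for src_url, dst_url in page_edges:
        src = urlsplit(src_url).hostname
        dst = urlsplit(dst_url).hostname
        if src is None or dst is None:
            continue  # skip URLs without a parseable host
        if src != dst:
            # Links inside one domain vanish once its pages become a single super-node.
            domain_edges.add((src, dst))
    return domain_edges

# Hypothetical page-level edges.
toy_page_edges = [
    ("http://m.example.com/a", "http://m.example.com/b"),
    ("http://m.example.com/a", "http://wap.example.org/"),
    ("http://m.example.com/b", "http://wap.example.org/news"),
]
toy_domain_edges = collapse_to_domains(toy_page_edges)
```

The resulting edge set can be fed to the same degree and bow-tie analyses as the page-level graph, at a much smaller scale.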
Application: Impact on Crawling
  - Crawling is resource-intensive, so efficiency is important
  - Higher level of disconnectedness
      - Need a larger and more diverse seed set
      - Covering the IN component requires special care
      - A depth-first strategy risks spending a disproportionate amount of time in the Tendrils and Disconnected components
  - Different languages have different levels of disconnectedness
      - A larger seed set is required for English pages than for Russian pages
      - Crawl depth can be reduced for the Russian sub-graph
  - Sparseness can also be an advantage
      - The chance of encountering the same page again during a crawl is smaller
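The effect of the seed set on coverage can be illustrated with a breadth-first crawl over a toy graph. This is a sketch under stated assumptions: a real crawler fetches pages rather than walking a known edge list, and the function name `crawl_coverage`, the `max_depth` parameter and the toy graph are hypothetical.

```python
from collections import defaultdict, deque

def crawl_coverage(edges, seeds, max_depth=None):
    """Fraction of nodes reached by a breadth-first crawl started from `seeds`.

    Pages at depth `max_depth` are fetched but their links are not followed.
    """
    adj = defaultdict(list)
    nodes = set(seeds)
    for s, d in edges:
        adj[s].append(d)
        nodes.update((s, d))
    if not nodes:
        return 0.0
    depth = {s: 0 for s in seeds}
    queue = deque(depth)
    while queue:
        u = queue.popleft()
        if max_depth is not None and depth[u] >= max_depth:
            continue
        for v in adj[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    return len(depth) / len(nodes)

# Hypothetical toy graph: core {a, b}, IN {i}, OUT {o}, tendril {t}, island {x, y}.
toy_edges = [("i", "a"), ("a", "b"), ("b", "a"), ("b", "o"), ("i", "t"), ("x", "y")]
core_only = crawl_coverage(toy_edges, ["a"])        # misses IN, tendril and island
with_in = crawl_coverage(toy_edges, ["i"])          # seed placed in IN
diverse = crawl_coverage(toy_edges, ["i", "x"])     # also seeds the island
```

In this toy graph a seed inside the core never reaches the IN page or the disconnected island, which is why a more disconnected graph calls for a larger and more diverse seed set.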
Conclusions
  - The mobile web graph is structurally different
      - Sparser, more disconnected
      - Smaller SCC and OUT
  - CHTML properties lie between XHTML+WML and the fixed web
  - Surprising preponderance of Chinese pages
  - English sub-graph is extremely disconnected
Future Work
  - Only a first step
      - Results motivate the need for a deeper and more extensive analysis
  - Propose alternatives to the bow-tie model for the mobile web
  - Better understanding of language sub-graphs
  - Quantitatively characterize the impact of differences in structure on different search algorithms