Characterizing the Web
CSCI 572: Information Retrieval and Search Engines
Summer 2011
May-18-11 CS572-Summer2011 CAM-2
Outline
The web
  – Scale
  – Complexity
  – Growth
Differences between then and now
Where the web is headed
The Web
Massive-scale directed graph
Driven by the underlying REST architecture
  – The key abstraction of information is a resource, named by a URL.
  – The representation of a resource is a sequence of bytes, plus representation metadata to describe those bytes.
  – All interactions are context-free: each interaction contains all of the information necessary to understand the request.
  – Components perform only a small set of well-defined methods on a resource, producing a representation to capture the current or intended state of that resource and transferring that representation between components.
  – Representation metadata are encouraged in support of caching and representation reuse.
  – The presence of intermediaries is promoted.
Copyright © Richard N. Taylor, Nenad Medvidovic, and Eric M. Dashofy. All rights reserved.
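The REST bullets above can be made concrete with a small sketch. It is not taken from the slides: the `Representation` and `Request` classes, the `RESOURCES` table, the `handle` function, and the example URL are all hypothetical names chosen for illustration. The sketch models a representation as bytes plus metadata, and a context-free interaction as a function that answers using only what the request itself carries.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Representation:
    body: bytes      # the sequence of bytes
    metadata: dict   # representation metadata describing those bytes


@dataclass(frozen=True)
class Request:
    method: str
    url: str
    headers: dict = field(default_factory=dict)


# Hypothetical in-memory resources, each named by a URL (illustrative only).
RESOURCES = {
    "http://example.org/syllabus": Representation(
        body=b"<html>CSCI 572 syllabus</html>",
        metadata={"Content-Type": "text/html", "ETag": '"v1"'},
    ),
}


def handle(request):
    """Answer a request using only the request itself (no session state)."""
    if request.method != "GET":
        return 405, None
    rep = RESOURCES.get(request.url)
    if rep is None:
        return 404, None
    # Representation metadata supports caching: if the client already holds
    # this version, there is no need to transfer the bytes again.
    if request.headers.get("If-None-Match") == rep.metadata["ETag"]:
        return 304, None
    return 200, rep
```

Because `handle` keeps no memory between calls, any intermediary (a cache or proxy) could answer the same request in its place, which is one reason the style promotes intermediaries.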
Scale
GYBA = Sorted on Google, Yahoo!, Bing and Ask
YGBA = Sorted on Yahoo!, Google, Bing and Ask
How is the scale measured?
Number of web pages indexed by search engines?
  – Is this an accurate representation?
Published data from major ISPs?
  – Is this accurate information?
What’s missing?
  – The “deep” web, or dynamic pages
  – Pages behind security firewalls
Why is scale important?
It strongly shapes the ultimate use cases of the web
  – Discovery and retrieval of information via:
      Search engines
      Web services and grid computing
      Targeted communities like social networking and the growing field of analytics
It strongly shapes the way we build software for web-scale systems
  – New programming paradigms, e.g., MapReduce
  – New technologies to handle huge-scale computing, or “Big Data”
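As a rough illustration of the MapReduce paradigm mentioned above, here is a minimal single-process sketch of word counting. The function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions, not part of any real framework; a production system would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map: emit a (word, 1) pair for every word in one document."""
    # doc_id is unused here; it is kept to mirror the usual (key, value) input.
    for word in text.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: collapse the values for one key into a single result."""
    return word, sum(counts)


def word_count(docs):
    pairs = [
        pair
        for doc_id, text in docs.items()
        for pair in map_phase(doc_id, text)
    ]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

The appeal at web scale is that each `map_phase` call depends only on its own document and each `reduce_phase` call only on its own key, so both steps can run in parallel without coordination.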
Complexity
Proliferation of content types available
By some accounts, 16K to 51K content types*
What to do with content types?
  – Parse them
      How? Extract their text and structure
  – Index their metadata
      In an indexing technology like Lucene, Solr, or Compass, or in Google Appliance
  – Identify what language they are written in
      N-grams
*
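To show how n-grams can be used to identify a language, here is a toy sketch based on character trigram profiles compared by cosine similarity. The two training sentences, the `PROFILES` table, and all function names are assumptions made up for this example; a real identifier would train on far larger samples per language and may use a different similarity measure.

```python
import math
from collections import Counter


def trigram_profile(text):
    """Count overlapping character trigrams, padding with spaces at the ends."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(p, q):
    """Cosine similarity between two trigram count vectors."""
    dot = sum(p[g] * q[g] for g in p.keys() & q.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(
        sum(v * v for v in q.values())
    )
    return dot / norm if norm else 0.0


# Tiny illustrative training samples (assumed, far too small for real use).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "es": "el zorro salta sobre el perro perezoso y el gato duerme en la casa",
}
PROFILES = {lang: trigram_profile(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    profile = trigram_profile(text)
    return max(PROFILES, key=lambda lang: cosine(profile, PROFILES[lang]))
```

For example, `guess_language("the dog and the fox")` picks `"en"` here because its trigrams overlap only with the English sample.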
Growth
Steady growth on a logarithmic scale since the mid-90s
Well into the hundreds of millions of websites and tens of billions of web pages (even without the deep web)
What does growth mean to us (you)?
Need for efficient algorithms for all sorts of things
  – Mining the web for information on you to target ads
  – Mining the web for information on you to decide whether to hire you or not
  – Disseminating news effectively (to you)
  – Disseminating media effectively (to you)
  – Providing rich browser experiences to lure you to web sites so that you can be sold products
NOTE: I underlined “you” everywhere above for those who missed it; we’ll get back to this
The Web: Then and Now
Before
  – The purpose of the web was for geeks to exchange , post on bulletin boards regarding their favorite D&D games, and to send files to one another
  – Scope was limited to geeks; broad infection was many years away
  – Search* since 1996: Hotbot, Excite, WebCrawler, AskJeeves, Yahoo!, Google, DogPile, Altavista, Lycos, MSN Search, AOL Search, Infoseek, Netscape, Metacrawler, AllTheWeb
*
The Web: Then and Now
Now
  – The purpose is limitless
      Computation with services, semantic description of content, proliferation of content, rich browsers, clients, interaction, media
      Social web is the next big thing
  – Scope is everyone (I kid you not, from a 2-year-old on up)
  – Search* now: Google, with competitors like Yahoo and Bing bringing up the rear and trying to build out open source computational infrastructures to compete
*
The movement towards the social web
Social networking companies have figured out that mining info about you guys can help build the “semantic” information that was once dreamed about by the likes of Tim Berners-Lee in his 2001 Scientific American article
Why did the semantic web fail to gain acceptance while the social web has succeeded?
  – The realization that machines are poor annotators of information and that they are even worse trust establishers
  – And that you guys are the experts at this!
Social Web and “Big Data”
Many challenges induced by the complexity, scale, and growth of the traditional web are only increased when the social web is taken into account
The development of algorithms to crawl the social graph has led to several Ph.D.s and is a huge money maker for existing businesses
  – Analytics is what they call this nowadays
Search is a HUGE challenge and interesting research problem within the social web
  – Instead of using information retrieval to deduce a “rank” for a page, use the trust value assigned via your social graph
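One way to picture the last point is the toy sketch below. It is only an assumption about what trust-based ranking could look like, not a method given in the slides: the searcher assigns a trust value to each person in their social graph, each page carries a list of people who endorsed it, and a page's score is the summed trust of its endorsers. The names `rank_by_trust`, `trust`, and `endorsements` are hypothetical.

```python
def rank_by_trust(trust, endorsements):
    """Order pages by the total trust the searcher places in their endorsers.

    trust:        person -> trust value assigned by the searcher (assumed input)
    endorsements: page -> list of people who shared or liked that page
    """
    scores = {
        page: sum(trust.get(person, 0.0) for person in people)
        for page, people in endorsements.items()
    }
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda page: (-scores[page], page))
```

People outside the searcher's graph contribute nothing, so the same set of pages can rank differently for every searcher, which is part of what makes search on the social web hard to do at scale.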
Wrap-up
The web has changed dramatically in the last 10 years
Understand the different dimensions of the web and the variation points
  – Scale, complexity and growth are only a selected few
Understand where the web is going and why