Information Retrieval (9) Prof. Dragomir R. Radev
IR Winter 2010 … 14. Webometrics The Bow-tie model …
Brief history of the Web
–FTP/Gopher
–WWW (1989)
–Archie (1990)
–Mosaic (1993)
–WebCrawler (1994)
–Lycos (1994)
–Yahoo! (1994)
–Google (1998)
Size
The Web is the largest repository of data and it grows exponentially.
–320 million Web pages [Lawrence & Giles 1998]
–800 million Web pages, 15 TB [Lawrence & Giles 1999]
–20 billion Web pages indexed [now]
Amount of data
–roughly 200 TB [Lyman et al. 2003]
Zipfian properties
Several Web quantities follow Zipfian (power-law) distributions:
–In-degree
–Out-degree
–Visits to a page
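A minimal sketch of how such a distribution can be checked, assuming a hypothetical edge list of "source target" pairs: compute each page's in-degree and tabulate how many pages have each value (a power law appears as a straight line on a log-log plot of these counts).

#!/usr/bin/perl
# Sketch: in-degree distribution of a small link graph (hypothetical edges).
use strict;
use warnings;

my @edges = map { [split] } <DATA>;     # each line: "source target"

my %indegree;
$indegree{ $_->[1] }++ for @edges;      # count incoming links per page

my %count;                              # in-degree value => number of pages
$count{$_}++ for values %indegree;

for my $deg (sort { $a <=> $b } keys %count) {
    printf "in-degree %d: %d pages\n", $deg, $count{$deg};
}

__DATA__
a b
a c
b c
d c
d b
e c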
Bow-tie model of the Web (Broder et al. WWW 2000; Dill et al. VLDB 2001)
–SCC (strongly connected core): ~56M pages
–IN: ~44M pages
–OUT: ~44M pages
–TENDRILS: ~44M pages
–DISCONNECTED: ~17M pages
–24% of pages reachable from a given page
Measuring the size of the Web
Using extrapolation methods:
–Random queries and their coverage by different search engines
–Overlap between search engines
–HTTP requests to random IP addresses
Bharat and Broder 1998
–Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
–10,000 queries in mid and late 1997
–Estimate is 200M pages
–Only 1.4% are indexed by all of them
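A minimal sketch of the capture-recapture idea behind such overlap estimates, using made-up counts: if a fraction f of pages sampled from engine A is also indexed by engine B, then |A ∩ B| ≈ f·|A|, and under a (strong) independence assumption the total size is roughly |A|·|B| / |A ∩ B|.

#!/usr/bin/perl
# Sketch: capture-recapture estimate of Web size from search-engine overlap.
# All numbers are hypothetical; the real studies sampled pages via queries.
use strict;
use warnings;

my $size_a    = 100_000_000;    # pages indexed by engine A (assumed known)
my $size_b    = 120_000_000;    # pages indexed by engine B (assumed known)
my $sampled   = 1_000;          # pages sampled uniformly from engine A
my $also_in_b = 350;            # of those, how many B also indexes

my $overlap = ($also_in_b / $sampled) * $size_a;     # estimate of |A ∩ B|

# Under the (unrealistic) assumption that A and B index pages independently,
# the total number of indexable pages is roughly |A| * |B| / |A ∩ B|.
my $web_estimate = $size_a * $size_b / $overlap;

printf "Estimated overlap:  %.0f pages\n", $overlap;
printf "Estimated Web size: %.0f pages\n", $web_estimate;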
Example (from Bharat & Broder). A similar approach by Lawrence and Giles yields an estimate of 320M pages (Lawrence & Giles 1998).
What makes Web IR different?
–Much bigger
–No fixed document collection
–Users: non-human users, varied user base, miscellaneous user needs
–Dynamic content, evolving content
–Spam
–Effectively infinite in size: the size is whatever can be indexed!
IR Winter 2010 … 15. Crawling the Web Hypertext retrieval & Web-based IR Document closures Focused crawling …
Web crawling
The HTTP/HTML protocols
Following hyperlinks
Some problems:
–Link extraction
–Link normalization
–Robot exclusion
–Loops
–Spider traps
–Server overload
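A minimal sketch of link extraction and normalization using the HTML::LinkExtor and URI modules; the base URL and HTML snippet are placeholders. Relative links are resolved against the base, fragments are dropped, and the result is canonicalized before de-duplication.

#!/usr/bin/perl
# Sketch: extract links from a fetched page and normalize them.
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

my $base = 'http://www.example.edu/dir/page.html';     # hypothetical base URL
my $html = '<a href="../other.html#sec2">Other</a>'
         . '<a href="HTTP://WWW.Example.EDU:80/a.html">A</a>';

# Passing a base URL makes LinkExtor return absolute links.
my $extor = HTML::LinkExtor->new(undef, $base);
$extor->parse($html);

my %seen;
for my $link ($extor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' && $attr{href};

    my $uri = URI->new($attr{href});
    $uri->fragment(undef);      # drop #fragment: same document, not a new URL
    $uri = $uri->canonical;     # lowercase scheme/host, drop default port, etc.

    next if $seen{$uri}++;      # skip duplicates (helps avoid crawl loops)
    print "$uri\n";
}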
Example U-M’s root robots.txt file: –User-agent: * –Disallow: /~websvcs/projects/ –Disallow: /%7Ewebsvcs/projects/ –Disallow: /~homepage/ –Disallow: /%7Ehomepage/ –Disallow: /~smartgl/ –Disallow: /%7Esmartgl/ –Disallow: /~gateway/ –Disallow: /%7Egateway/
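A minimal sketch of honoring such rules with the WWW::RobotRules module; the crawler name and URLs below are placeholders.

#!/usr/bin/perl
# Sketch: check URLs against a site's robots.txt before fetching them.
use strict;
use warnings;
use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = WWW::RobotRules->new('ExampleBot/0.1');    # hypothetical agent name

# Fetch and parse robots.txt once per host.
my $robots_url = 'http://www.umich.edu/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

# URLs matching a Disallow rule for our agent are skipped.
for my $url ('http://www.umich.edu/index.html',
             'http://www.umich.edu/~homepage/private.html') {
    printf "%s %s\n", ($rules->allowed($url) ? 'FETCH' : 'SKIP '), $url;
}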
Example crawler E.g., poacher – /examples/poacher –Included in clairlib
&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,        # contact e-mail address required by WWW::Robot
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
Focused crawling
Topical locality
–Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04)
The radius-1 hypothesis
–Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random Web page)
Focused crawling
–Keeping a priority queue of the most relevant pages (see the sketch below)
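A minimal sketch of that priority queue, assuming a hypothetical relevance_score() function (in practice, e.g., the similarity between the target topic and the anchor text or the parent page):

#!/usr/bin/perl
# Sketch: a focused-crawl frontier kept as a priority queue by relevance.
use strict;
use warnings;

my %score;      # url => estimated relevance of crawling this url next
my %visited;    # urls already fetched

# Stand-in scoring function; a real crawler might use the similarity
# between the target topic and the anchor text or the parent page.
sub relevance_score { my ($url, $context) = @_; return rand(); }

sub push_frontier {
    my ($url, $context) = @_;
    return if $visited{$url};
    my $s = relevance_score($url, $context);
    $score{$url} = $s if !exists $score{$url} || $s > $score{$url};
}

sub pop_frontier {
    # Pop the highest-scoring URL (linear scan for clarity; use a heap in practice).
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    delete $score{$best} if defined $best;
    return $best;
}

push_frontier('http://www.example.edu/physics/', 'seed');
while (defined(my $url = pop_frontier())) {
    $visited{$url} = 1;
    # fetch $url, extract its links, then push_frontier() each of them,
    # scored by how relevant the surrounding text looks for the topic
    last if keys %visited >= 100;   # stop condition for the sketch
}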
Challenges in indexing the Web
–Page importance varies a lot
–Anchor text
–User modeling
–Detecting duplicates
–Dealing with spam (content-based and link-based)
Duplicate detection
Shingles: e.g., the 3-word shingles of "TO BE OR NOT TO BE" are
–TO BE OR
–BE OR NOT
–OR NOT TO
–NOT TO BE
Then use the Jaccard coefficient (size of intersection / size of union) of the shingle sets to determine similarity
Hashing
Shingling (separate lecture)
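A minimal sketch of word-level shingling and the Jaccard coefficient of two shingle sets (the example texts are made up; real systems hash the shingles):

#!/usr/bin/perl
# Sketch: word-level shingles and the Jaccard coefficient of two documents.
use strict;
use warnings;

# Build the set of k-word shingles of a text.
sub shingles {
    my ($text, $k) = @_;
    my @w = split /\s+/, uc $text;
    my %set;
    $set{ join ' ', @w[$_ .. $_ + $k - 1] } = 1 for 0 .. $#w - $k + 1;
    return \%set;
}

# Jaccard coefficient: |A ∩ B| / |A ∪ B|.
sub jaccard {
    my ($x, $y) = @_;
    my %union = (%$x, %$y);
    my $inter = grep { $y->{$_} } keys %$x;
    return keys(%union) ? $inter / keys(%union) : 0;
}

my $s1 = shingles("to be or not to be", 3);
my $s2 = shingles("to be or not to become", 3);
printf "Jaccard similarity: %.2f\n", jaccard($s1, $s2);   # 3 shared / 5 total = 0.60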
Document closures for Q&A
[diagram: a page P matching the query "capital spain" links to a page LP containing the answer "Madrid"]
Document closures for IR
[diagram: the query "Physics" and a page P linking to a page LP from the Physics Department, University of Michigan]
The link-content hypothesis
Topical locality: a page tends to be similar in content to the pages that point to it.
Davison (TF*IDF similarity, 100K pages):
–0.31 same domain
–0.23 linked pages
–0.19 sibling pages
–0.02 random pages
Menczer (373K pages): non-linear least-squares fit of content similarity as a function of link distance (fitted parameters approximately 1.8 and 0.6)
Chakrabarti (focused crawling): probability of losing the topic
Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001
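A minimal sketch of the kind of similarity measure used in these studies: cosine similarity between term-frequency vectors of two pages (the studies used TF*IDF weights; IDF is omitted here for brevity, and the page texts are placeholders).

#!/usr/bin/perl
# Sketch: cosine similarity between term-frequency vectors of two pages.
use strict;
use warnings;

sub tf_vector {
    my ($text) = @_;
    my %tf;
    $tf{ lc $_ }++ for $text =~ /\w+/g;
    return \%tf;
}

sub cosine {
    my ($x, $y) = @_;
    my ($dot, $nx, $ny) = (0, 0, 0);
    $dot += $x->{$_} * ($y->{$_} || 0) for keys %$x;
    $nx  += $_ ** 2 for values %$x;
    $ny  += $_ ** 2 for values %$y;
    return ($nx && $ny) ? $dot / (sqrt($nx) * sqrt($ny)) : 0;
}

# Hypothetical texts: a page and a page it links to.
my $page        = tf_vector("The physics department offers graduate courses in physics");
my $linked_page = tf_vector("Graduate courses in quantum physics and astronomy");
printf "Cosine similarity: %.2f\n", cosine($page, $linked_page);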