1
Information Retrieval (9) Prof. Dragomir R. Radev radev@umich.edu
2
IR Winter 2010 … 14. Webometrics The Bow-tie model …
3
Brief history of the Web
–FTP/Gopher
–WWW (1989)
–Archie (1990)
–Mosaic (1993)
–WebCrawler (1994)
–Lycos (1994)
–Yahoo! (1994)
–Google (1998)
4
Size
The Web is the largest repository of data and it grows exponentially.
–320 million Web pages [Lawrence & Giles 1998]
–800 million Web pages, 15 TB [Lawrence & Giles 1999]
–20 billion Web pages indexed [now]
Amount of data:
–roughly 200 TB [Lyman et al. 2003]
5
Zipfian properties
Several Web quantities follow Zipf-like (power-law) distributions:
–In-degree of a page
–Out-degree of a page
–Visits to a page
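A minimal sketch of how one might check this, assuming a hypothetical edge-list file edges.txt with one "source target" pair per line: compute each page's in-degree and print a rank-frequency table; on a log-log plot a Zipfian distribution shows up as a roughly straight line.

#!/usr/bin/perl
# Sketch: rank-frequency check for in-degree (hypothetical edge list "edges.txt").
use strict;
use warnings;

my %indegree;
open my $fh, '<', 'edges.txt' or die "cannot open edges.txt: $!";
while (<$fh>) {
    my ($src, $dst) = split;
    next unless defined $dst;
    $indegree{$dst}++;
}
close $fh;

# sort in-degrees from largest to smallest and print rank, frequency, and their logs
my @sorted = sort { $b <=> $a } values %indegree;
for my $rank (1 .. @sorted) {
    printf "%d\t%d\t%.3f\t%.3f\n",
        $rank, $sorted[$rank - 1],
        log($rank), log($sorted[$rank - 1]);
}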
6
Bow-tie model of the Web (Broder et al. WWW 2000, Dill et al. VLDB 2001)
–SCC (strongly connected core): 56 M pages
–IN: 44 M
–OUT: 44 M
–TENDRILS: 44 M
–DISCONNECTED: 17 M
About 24% of pages are reachable from a given page.
7
Measuring the size of the Web
Using extrapolation methods:
–Random queries and their coverage by different search engines
–Overlap between search engines
–HTTP requests to random IP addresses
8
Bharat and Broder 1998
–Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
–10,000 queries in mid and late 1997
–Estimate: 200M pages
–Only 1.4% of pages are indexed by all of them
9
Example (from Bharat & Broder). A similar approach by Lawrence and Giles yields an estimate of 320M pages (Lawrence and Giles 1998).
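A minimal sketch of the overlap ("capture-recapture") estimator behind these studies: if two engines sample the Web roughly independently, the total size is about |A| * |B| / |A intersect B|. The URL lists below are made up for illustration.

#!/usr/bin/perl
# Sketch of the overlap (capture-recapture) estimator for the size of the Web.
use strict;
use warnings;

my @engine_a = qw(u1 u2 u3 u4 u5 u6);   # URLs sampled from engine A (made up)
my @engine_b = qw(u4 u5 u6 u7 u8);      # URLs sampled from engine B (made up)

my %in_a;
$in_a{$_} = 1 for @engine_a;
my $overlap = grep { $in_a{$_} } @engine_b;   # |A intersect B|

die "no overlap: estimator undefined\n" unless $overlap;
my $estimate = @engine_a * @engine_b / $overlap;
printf "|A|=%d  |B|=%d  |A and B|=%d  =>  estimated total N = %.1f\n",
    scalar @engine_a, scalar @engine_b, $overlap, $estimate;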
10
What makes Web IR different?
–Much bigger
–No fixed document collection
–Users: non-human users, a varied user base, miscellaneous user needs
–Dynamic, evolving content
–Spam
–Effectively infinite in size: the size is whatever can be indexed!
11
IR Winter 2010 … 15. Crawling the Web Hypertext retrieval & Web-based IR Document closures Focused crawling …
12
Web crawling
–The HTTP/HTML protocols
–Following hyperlinks
Some problems:
–Link extraction
–Link normalization
–Robot exclusion
–Loops
–Spider traps
–Server overload
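A minimal sketch of link extraction and normalization for a single page, assuming the libwww-perl modules LWP::Simple, HTML::LinkExtor, and URI are available; the start URL is just an example.

#!/usr/bin/perl
# Sketch: extract links from one page and normalize them.
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;
use URI;

my $base = 'http://www.umich.edu/';                   # example start page
my $html = get($base) or die "fetch failed\n";

my $extractor = HTML::LinkExtor->new(undef, $base);   # resolves relative links against $base
$extractor->parse($html);

my %seen;
for my $link ($extractor->links) {
    my ($tag, %attr) = @$link;
    next unless $tag eq 'a' and $attr{href};
    my $url = URI->new($attr{href})->canonical;       # normalization: lowercase host, default port, etc.
    $url->fragment(undef);                            # drop #fragments so the same page is not revisited
    print "$url\n" unless $seen{$url}++;
}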
13
Example: U-M’s root robots.txt file, http://www.umich.edu/robots.txt
–User-agent: *
–Disallow: /~websvcs/projects/
–Disallow: /%7Ewebsvcs/projects/
–Disallow: /~homepage/
–Disallow: /%7Ehomepage/
–Disallow: /~smartgl/
–Disallow: /%7Esmartgl/
–Disallow: /~gateway/
–Disallow: /%7Egateway/
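A minimal sketch of honoring these rules with WWW::RobotRules (part of libwww-perl); the agent name is made up, and the second test URL is just an example path.

#!/usr/bin/perl
# Sketch: robot exclusion via robots.txt with WWW::RobotRules.
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules      = WWW::RobotRules->new('ExampleBot/1.0');   # agent name is made up
my $robots_url = 'http://www.umich.edu/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

for my $url ('http://www.umich.edu/~homepage/index.html',
             'http://www.umich.edu/news/') {
    printf "%-45s %s\n", $url,
        $rules->allowed($url) ? 'allowed' : 'disallowed by robots.txt';
}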
14
Example crawler
E.g., poacher
–http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
–Included in clairlib
15
&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );

    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
16
Focused crawling
Topical locality
–Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04)
The radius-1 hypothesis
–Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
Focused crawling
–Keeping a priority queue of the most relevant pages
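A minimal sketch of a focused-crawl frontier kept as a priority queue, under stated assumptions: the seed URL, the crawl budget, and the crude URL-based score_url() heuristic are all placeholders for a real relevance estimate (e.g., similarity of the parent page to the topic); a real crawler would use a heap rather than re-sorting the frontier.

#!/usr/bin/perl
# Sketch: a focused crawl whose frontier is ordered by estimated relevance.
use strict;
use warnings;
use LWP::Simple qw(get);
use HTML::LinkExtor;

my @topic = qw(information retrieval crawling);   # example topic terms

sub score_url {
    my ($url, $parent_score) = @_;
    my $hits = grep { index(lc $url, $_) >= 0 } @topic;   # crude URL-based guess
    return 0.5 * $parent_score + $hits;                   # radius-1: inherit part of the parent's relevance
}

my @frontier = ( [ 'http://www.umich.edu/', 1.0 ] );      # [url, score]; seed is an example
my %visited;

for (1 .. 20) {                                           # small crawl budget for the sketch
    @frontier = sort { $b->[1] <=> $a->[1] } @frontier;   # highest-scoring page first
    my $next = shift @frontier or last;
    my ($url, $score) = @$next;
    next if $visited{$url}++;

    my $html = get($url) or next;
    print "fetched ($score) $url\n";

    my $extractor = HTML::LinkExtor->new(undef, $url);
    $extractor->parse($html);
    for my $link ($extractor->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' and $attr{href};
        push @frontier, [ "$attr{href}", score_url("$attr{href}", $score) ];
    }
}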
17
Challenges in indexing the Web
–Page importance varies a lot
–Anchor text
–User modeling
–Detecting duplicates
–Dealing with spam (content-based and link-based)
18
Duplicate detection
–Shingles (word n-grams): the 3-word shingles of "TO BE OR NOT TO BE" are TO BE OR, BE OR NOT, OR NOT TO, NOT TO BE
–Then use the Jaccard coefficient (size of intersection / size of union) of the shingle sets to determine similarity
–Hashing
–Shingling (separate lecture)
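A minimal sketch of word shingling and the Jaccard coefficient, using the slide's "TO BE OR NOT TO BE" example; the second sentence is made up to give a non-trivial comparison.

#!/usr/bin/perl
# Sketch: 3-word shingles and the Jaccard coefficient of two shingle sets.
use strict;
use warnings;

sub shingles {
    my ($text, $w) = @_;
    my @tokens = split ' ', uc $text;
    my %set;
    $set{ join ' ', @tokens[$_ .. $_ + $w - 1] } = 1 for 0 .. $#tokens - $w + 1;
    return \%set;
}

sub jaccard {
    my ($x, $y) = @_;
    my $inter = grep { $y->{$_} } keys %$x;   # size of intersection
    my %union = (%$x, %$y);                   # union of the two sets
    return $inter / scalar keys %union;
}

my $s1 = shingles('TO BE OR NOT TO BE', 3);
my $s2 = shingles('TO BE OR NOT TO BE AT ALL', 3);   # second text is made up

print join(', ', sort keys %$s1), "\n";              # the four shingles of the first text
printf "Jaccard similarity = %.2f\n", jaccard($s1, $s2);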
19
Document closures for Q&A
(Figure: a page P and a linked page LP; the terms "capital", "spain", and the answer "Madrid" are spread across the two pages, so only the closure of P matches a question about the capital of Spain.)
20
Document closures for IR
(Figure: for the query "Physics", a Physics Department page P and a linked University of Michigan page LP together form the document closure that matches the query.)
21
The link-content hypothesis
Topical locality: a page is similar in content to the page that points to it.
Davison (TF*IDF similarity, 100K pages):
–0.31 same domain
–0.23 linked pages
–0.19 sibling pages
–0.02 random pages
Menczer (373K pages, non-linear least-squares fit of the similarity decay; fitted parameters 1.8 and 0.6)
Chakrabarti (focused crawling): probability of losing the topic
Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001
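A minimal sketch of the kind of measurement behind Davison's numbers: TF*IDF-weighted cosine similarity between a page and the page linking to it. The two short texts and the tiny two-document "collection" are placeholders; a real study would use full page contents and collection-wide document frequencies.

#!/usr/bin/perl
# Sketch: TF*IDF-weighted cosine similarity between two small "documents".
use strict;
use warnings;

my @docs = (
    'information retrieval on the world wide web',    # stand-in for a page P
    'web search engines and information retrieval',   # stand-in for the page linking to P
);

# term frequencies per document
my @tf = map { my %h; $h{$_}++ for split ' ', lc $_; \%h } @docs;

# document frequencies and IDF over the (tiny) collection
my %df;
for my $d (@tf) { $df{$_}++ for keys %$d }
my $N = @docs;
my %idf;
$idf{$_} = log($N / $df{$_}) + 1 for keys %df;   # +1 so terms shared by all documents still count

sub weight {                                     # TF*IDF vector for one document
    my ($tf, $idf) = @_;
    my %w;
    $w{$_} = $tf->{$_} * $idf->{$_} for keys %$tf;
    return \%w;
}

sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return $nu && $nv ? $dot / sqrt($nu * $nv) : 0;
}

my $w1 = weight($tf[0], \%idf);
my $w2 = weight($tf[1], \%idf);
printf "TF*IDF cosine similarity = %.3f\n", cosine($w1, $w2);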