Information Retrieval (9) Prof. Dragomir R. Radev


1 Information Retrieval (9) Prof. Dragomir R. Radev radev@umich.edu

2 IR Winter 2010 … 14. Webometrics The Bow-tie model …

3 Brief history of the Web –FTP/Gopher –WWW (1989) –Archie (1990) –Mosaic (1993) –WebCrawler (1994) –Lycos (1994) –Yahoo! (1994) –Google (1998)

4 Size The Web is the largest repository of data, and it grows exponentially. –320 Million Web pages [Lawrence & Giles 1998] –800 Million Web pages, 15 TB [Lawrence & Giles 1999] –20 Billion Web pages indexed [as of 2010] Amount of data –roughly 200 TB [Lyman et al. 2003]

5 Zipfian properties Several Web quantities follow Zipfian (power-law) distributions: –In-degree –Out-degree –Visits to a page

6 Bow-tie model of the Web (Broder et al. WWW 2000, Dill et al. VLDB 2001) –SCC (core): 56M pages –IN: 44M –OUT: 44M –TENDRILS: 44M –DISC (disconnected): 17M –24% of pages reachable from a given page

7 Measuring the size of the web Using extrapolation methods: –Random queries and their coverage by different search engines –Overlap between search engines –HTTP requests to random IP addresses

8 Bharat and Broder 1998 –Based on crawls of HotBot, AltaVista, Excite, and InfoSeek –10,000 queries in mid and late 1997 –Estimate is 200M pages –Only 1.4% are indexed by all of them

9 Example (from Bharat & Broder) A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
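As a rough illustration of the overlap-based arithmetic behind these estimates, here is a minimal Perl sketch of the capture-recapture formula N ~ |A| * |B| / |A and B|. The engine sizes and overlap fraction below are made-up numbers, not figures from the papers, and the formula assumes the two engines index pages independently.

#!/usr/bin/perl
# Capture-recapture sketch: estimate total size N from two engine index
# sizes and their overlap, assuming the engines index pages independently.
# All numbers here are made up for illustration.
use strict;
use warnings;

my $size_a      = 100_000_000;  # pages indexed by engine A (hypothetical)
my $size_b      = 120_000_000;  # pages indexed by engine B (hypothetical)
my $frac_a_in_b = 0.35;         # fraction of a random sample of A also found in B

my $overlap  = $frac_a_in_b * $size_a;        # estimated |A and B|
my $estimate = $size_a * $size_b / $overlap;  # N ~ |A| * |B| / |A and B|

printf "Estimated size of the indexable web: %.0f pages\n", $estimate;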

10 What makes Web IR different? –Much bigger –No fixed document collection –Users: non-human users, varied user base, miscellaneous user needs –Dynamic content –Evolving content –Spam –Infinite sized: size is whatever can be indexed!

11 IR Winter 2010 … 15. Crawling the Web Hypertext retrieval & Web-based IR Document closures Focused crawling …

12 Web crawling The HTTP/HTML protocols Following hyperlinks Some problems: –Link extraction –Link normalization –Robot exclusion –Loops –Spider traps –Server overload
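To make the link-extraction and link-normalization problems concrete, here is a minimal Perl sketch using the URI module from libwww-perl. The href regex and the sample base URL and HTML are simplifications for illustration only; a real crawler would use an HTML parser such as HTML::LinkExtor.

#!/usr/bin/perl
# Link extraction and normalization sketch.  The href regex and the
# sample page are toy simplifications; only the URI calls reflect how
# normalization is typically done.
use strict;
use warnings;
use URI;

my $base = 'http://www.umich.edu/a/b/page.html';   # hypothetical base URL
my $html = '<a href="../c/next.html#top">next</a> <a HREF="HTTP://WWW.UMICH.EDU:80/">home</a>';

while ($html =~ /href\s*=\s*"([^"]+)"/gi) {
    my $uri = URI->new_abs($1, $base);   # resolve relative links against the base
    $uri->fragment(undef);               # drop #fragments: they name the same page
    print $uri->canonical, "\n";         # lowercase scheme/host, drop default port
}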

13 Example U-M’s root robots.txt file: http://www.umich.edu/robots.txt –User-agent: * –Disallow: /~websvcs/projects/ –Disallow: /%7Ewebsvcs/projects/ –Disallow: /~homepage/ –Disallow: /%7Ehomepage/ –Disallow: /~smartgl/ –Disallow: /%7Esmartgl/ –Disallow: /~gateway/ –Disallow: /%7Egateway/
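A minimal sketch of how a crawler might honor such a robots.txt file from Perl, using LWP::Simple and WWW::RobotRules (both part of libwww-perl); the user-agent name and the test URLs are hypothetical.

#!/usr/bin/perl
# Robot exclusion sketch: fetch robots.txt and test URLs against it.
use strict;
use warnings;
use LWP::Simple qw(get);
use WWW::RobotRules;

my $rules      = WWW::RobotRules->new('ExampleCrawler/1.0');   # hypothetical bot name
my $robots_url = 'http://www.umich.edu/robots.txt';
my $robots_txt = get($robots_url);
$rules->parse($robots_url, $robots_txt) if defined $robots_txt;

for my $url ('http://www.umich.edu/', 'http://www.umich.edu/~homepage/index.html') {
    printf "%-45s %s\n", $url, $rules->allowed($url) ? 'allowed' : 'disallowed';
}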

14 Example crawler E.g., poacher –http://search.cpan.org/~neilb/Robot-0.011/examples/poacher –Included in clairlib

15 &ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}

16 Focused crawling Topical locality –Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04) The radius-1 hypothesis –Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page) Focused crawling –Keeping a priority queue of the most relevant pages
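A minimal sketch of the priority-queue idea, in the spirit of the slide rather than any particular system: relevance() is a toy keyword-overlap scorer, the seed URL is hypothetical, and link extraction is a bare regex. A real focused crawler would use a trained relevance classifier, an HTML parser, robots.txt handling, and politeness delays.

#!/usr/bin/perl
# Focused crawling sketch: URLs are kept in a priority queue keyed by the
# relevance of the page that linked to them (the radius-1 hypothesis).
use strict;
use warnings;
use LWP::Simple qw(get);
use URI;

my @topic = qw(information retrieval crawling index);   # toy topic description

sub relevance {                     # toy scorer: count topic-word occurrences
    my ($text) = @_;
    my $hits = 0;
    $hits += () = $text =~ /\b\Q$_\E\b/gi for @topic;
    return $hits;
}

my %queue = ('http://www.example.org/' => 1);   # url => priority (hypothetical seed)
my %seen;
my $budget = 20;                                # fetch at most 20 pages

while (%queue && $budget-- > 0) {
    # pop the highest-priority URL
    my ($url) = sort { $queue{$b} <=> $queue{$a} } keys %queue;
    delete $queue{$url};
    next if $seen{$url}++;

    my $html = get($url) or next;
    my $score = relevance($html);
    print "$score\t$url\n";

    # enqueue outlinks, prioritized by the relevance of the page we just fetched
    while ($html =~ /href\s*=\s*"([^"#]+)"/gi) {
        my $link = URI->new_abs($1, $url)->canonical->as_string;
        $queue{$link} = $score unless $seen{$link} || exists $queue{$link};
    }
}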

17 Challenges in indexing the web –Page importance varies a lot –Anchor text –User modeling –Detecting duplicates –Dealing with spam (content-based and link-based)

18 Duplicate detection –Shingles: the word 3-shingles of "TO BE OR NOT TO BE" are TO BE OR / BE OR NOT / OR NOT TO / NOT TO BE –Then use the Jaccard coefficient (size of intersection / size of union) to determine similarity –Hashing –Shingling (separate lecture)
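A minimal Perl sketch of word shingling and the Jaccard coefficient as described on the slide; the second test phrase is made up, and real systems hash the shingles (e.g., minhashing) rather than comparing the raw sets.

#!/usr/bin/perl
# Shingling sketch: build word k-shingle sets and compare with Jaccard.
use strict;
use warnings;

sub shingles {
    my ($text, $k) = @_;
    my @w = split ' ', lc $text;
    my %set;
    $set{ join ' ', @w[$_ .. $_ + $k - 1] } = 1 for 0 .. $#w - $k + 1;
    return \%set;
}

sub jaccard {
    my ($a, $b) = @_;
    my $inter = grep { exists $b->{$_} } keys %$a;   # size of intersection
    my %union = (%$a, %$b);
    my $union_size = keys %union;                    # size of union
    return $union_size ? $inter / $union_size : 0;
}

my $s1 = shingles('to be or not to be', 3);
my $s2 = shingles('to be or not to be that is the question', 3);
print join(', ', sort keys %$s1), "\n";     # the four 3-shingles from the slide
printf "Jaccard similarity: %.2f\n", jaccard($s1, $s2);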

19 Document closures for Q&A [Figure: a page P and its link set L(P); example question about the capital of Spain, answered through a linked page about Madrid]

20 Document closures for IR [Figure: a page P and its link set L(P); example query "Physics" leading to the Physics Department page at the University of Michigan]

21 The link-content hypothesis Topical locality: a page is similar to the page that points to it. Davison (TF*IDF, 100K pages): –0.31 same domain –0.23 linked pages –0.19 sibling –0.02 random Menczer (373K pages, non-linear least squares fit; fitted parameters 1.8 and 0.6) Chakrabarti (focused crawling): probability of losing the topic Van Rijsbergen 1979, Chakrabarti et al. WWW 1999, Davison SIGIR 2000, Menczer 2001
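To make the similarity measurements concrete, here is a minimal Perl sketch of cosine similarity between two pages' term vectors; the two text snippets are made up, and for a self-contained example it uses raw term frequencies rather than the TF*IDF weights (which require document frequencies from a whole collection) used in the studies cited above.

#!/usr/bin/perl
# Cosine similarity between term-frequency vectors of two (toy) pages.
use strict;
use warnings;

sub tf {                         # term-frequency vector of a text
    my %v;
    $v{$_}++ for split ' ', lc shift;
    return \%v;
}

sub cosine {
    my ($a, $b) = @_;
    my ($dot, $na, $nb) = (0, 0, 0);
    $dot += $a->{$_} * ($b->{$_} // 0) for keys %$a;
    $na  += $_ ** 2 for values %$a;
    $nb  += $_ ** 2 for values %$b;
    return ($na && $nb) ? $dot / sqrt($na * $nb) : 0;
}

my $page  = tf('the physics department offers courses in quantum physics');
my $child = tf('quantum physics seminar schedule for the department');
printf "cosine similarity of linked pages: %.2f\n", cosine($page, $child);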

