Information Retrieval (9) Prof. Dragomir R. Radev


IR Winter 2010 … 14. Webometrics The Bow-tie model …

Brief history of the Web
–FTP/Gopher
–WWW (1989)
–Archie (1990)
–Mosaic (1993)
–Webcrawler (1994)
–Lycos (1994)
–Yahoo! (1994)
–Google (1998)

Size
The Web is the largest repository of data and it grows exponentially.
–320 Million Web pages [Lawrence & Giles 1998]
–800 Million Web pages, 15 TB [Lawrence & Giles 1999]
–20 Billion Web pages indexed [now]
Amount of data
–roughly 200 TB [Lyman et al. 2003]

Zipfian properties
–In-degree
–Out-degree
–Visits to a page

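Bow-tie model of the Web
–SCC (strongly connected core): 56 M pages
–IN (pages that can reach the SCC): 44 M
–OUT (pages reachable from the SCC): 44 M
–TEND (tendrils): 44 M
–DISC (disconnected components): 17 M
–24% of pages reachable from a given page
Broder & al. WWW 2000, Dill & al. VLDB 2001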
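The bow-tie regions can be illustrated on a toy link graph. The sketch below is a minimal illustration, not the procedure used in the cited studies: it assumes a seed page known to lie in the core, and the names `reachable`, `bow_tie`, and `toy_web` are made up for this example. Pages reachable both forward and backward from the seed form the SCC; forward-only pages are OUT, backward-only pages are IN, and everything else (tendrils and disconnected pieces) is lumped together as OTHER.

```python
from collections import deque


def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(graph):
    """Return a copy of the graph with every edge direction flipped."""
    rev = {}
    for src, dsts in graph.items():
        rev.setdefault(src, set())
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def bow_tie(graph, seed):
    """Split the nodes into SCC, IN, OUT and OTHER relative to a seed in the core."""
    nodes = set(graph) | {d for dsts in graph.values() for d in dsts}
    forward = reachable(graph, seed)            # pages the seed can reach
    backward = reachable(reverse(graph), seed)  # pages that can reach the seed
    scc = forward & backward
    return {
        "SCC": scc,
        "IN": backward - scc,
        "OUT": forward - scc,
        "OTHER": nodes - forward - backward,    # tendrils and disconnected pages
    }


# Hypothetical six-page web: b and c link to each other and form the core.
toy_web = {
    "a": ["b"],
    "b": ["c"],
    "c": ["b", "d"],
    "d": [],
    "e": ["f"],
    "f": [],
}
parts = bow_tie(toy_web, seed="b")
print(parts)
```

On this toy graph, page a ends up in IN, page d in OUT, and pages e and f in OTHER.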

Measuring the size of the web
–Using extrapolation methods
–Random queries and their coverage by different search engines
–Overlap between search engines
–HTTP requests to random IP addresses

Bharat and Broder 1998
–Based on crawls of HotBot, Altavista, Excite, and InfoSeek
–10,000 queries in mid and late 1997
–Estimate is 200M pages
–Only 1.4% are indexed by all of them

Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
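The idea behind overlap-based size estimation can be sketched with a simple capture-recapture calculation. This is a simplified illustration under the assumption that two engines index independent random samples of the Web; it is not the exact procedure of Bharat and Broder or of Lawrence and Giles, and the index contents below are invented page IDs. If engine A covers a fraction |A ∩ B| / |B| of engine B's pages, and B is a fair sample of the Web, then the total size is roughly |A| · |B| / |A ∩ B|.

```python
def estimate_total(index_a, index_b):
    """Capture-recapture estimate of collection size from two overlapping indexes.

    Assumes the two indexes are independent random samples of the same collection.
    """
    overlap = len(index_a & index_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return len(index_a) * len(index_b) / overlap


# Hypothetical indexes, given as sets of page IDs.
engine_a = set(range(0, 60))     # 60 pages
engine_b = set(range(40, 120))   # 80 pages, 20 of them shared with engine_a
estimate = estimate_total(engine_a, engine_b)
print(estimate)
```

With 60 and 80 indexed pages and 20 in common, the estimate is 60 · 80 / 20 = 240 pages. In practice the independence assumption rarely holds, which is one reason such estimates are treated as lower bounds.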

What makes Web IR different?
–Much bigger
–No fixed document collection
–Users
–Non-human users
–Varied user base
–Miscellaneous user needs
–Dynamic content
–Evolving content
–Spam
–Infinite sized: size is whatever can be indexed!

IR Winter 2010 … 15. Crawling the Web Hypertext retrieval & Web-based IR Document closures Focused crawling …

Web crawling
The HTTP/HTML protocols
Following hyperlinks
Some problems:
–Link extraction
–Link normalization
–Robot exclusion
–Loops
–Spider traps
–Server overload
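Link extraction and link normalization can be sketched with the Python standard library. The example below is a minimal sketch, not a complete crawler: the class `LinkExtractor`, the function `normalize`, the base URL, and the HTML snippet are all hypothetical. Normalization here resolves relative links, drops fragments, lowercases the host, and re-encodes the path so that `%7E` and `~` compare equal; a real crawler would need further rules (default ports, trailing slashes, session parameters).

```python
from html.parser import HTMLParser
from urllib.parse import quote, unquote, urldefrag, urljoin, urlsplit, urlunsplit


class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(base_url, href):
    """Resolve href against base_url and reduce it to a canonical form."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))  # absolute URL, no #fragment
    parts = urlsplit(absolute)
    path = quote(unquote(parts.path)) or "/"  # one encoding for ~ and %7E
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


# Hypothetical page: the first two links point at the same resource.
base = "http://Example.EDU/dir/page.html"
html = (
    '<a href="/%7Ehomepage/#top">home</a> '
    '<a href="/~homepage/">home again</a> '
    '<a href="docs/../index.html">index</a>'
)

extractor = LinkExtractor()
extractor.feed(html)
unique_links = {normalize(base, href) for href in extractor.hrefs}
print(sorted(unique_links))
```

Keeping a set of normalized URLs that have already been seen is also the usual first defence against loops, since a page reached a second time under a different spelling is recognized as visited.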

Example
U-M’s root robots.txt file:
–User-agent: *
–Disallow: /~websvcs/projects/
–Disallow: /%7Ewebsvcs/projects/
–Disallow: /~homepage/
–Disallow: /%7Ehomepage/
–Disallow: /~smartgl/
–Disallow: /%7Esmartgl/
–Disallow: /~gateway/
–Disallow: /%7Egateway/
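A crawler can check rules like these before fetching a page. The sketch below uses Python's standard `urllib.robotparser` and feeds it the rules from the slide directly instead of downloading them; the host `www.example.edu`, the agent name `example-bot`, and the helper `allowed` are placeholders invented for this example.

```python
from urllib.robotparser import RobotFileParser

# The rules from the slide, as they would appear in the robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /~websvcs/projects/
Disallow: /%7Ewebsvcs/projects/
Disallow: /~homepage/
Disallow: /%7Ehomepage/
Disallow: /~smartgl/
Disallow: /%7Esmartgl/
Disallow: /~gateway/
Disallow: /%7Egateway/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())  # parse in-memory lines, no network access


def allowed(url, agent="example-bot"):
    """Return True if the rules permit this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(allowed("http://www.example.edu/~homepage/index.html"))    # blocked by a Disallow rule
print(allowed("http://www.example.edu/%7Ehomepage/index.html"))  # same path, escaped form
print(allowed("http://www.example.edu/admissions/"))             # not covered by any rule
```

The file lists each path twice, once with `~` and once with `%7E`, presumably because not every robot normalizes the two spellings before matching.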

Example crawler
E.g., poacher
–/examples/poacher
–Included in clairlib

&ParseCommandLine();
&Initialise();
$robot->run($siteRoot);

#=======================================================================
# Initialise() - initialise global variables, contents, tables, etc
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
#=======================================================================
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        ' ' => $ ,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );

    # Register the callbacks the robot invokes while crawling
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

#=======================================================================
# follow_url_test() - tell the robot module whether it should follow a link
#=======================================================================
sub follow_url_test {}

#=======================================================================
# process_get_error() - hook function invoked whenever a GET fails
#=======================================================================
sub process_get_error {}

#=======================================================================
# process_contents() - process the contents of a URL we've retrieved
#=======================================================================
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}

Focused crawling
Topical locality
–Pages that are linked are similar in content (and vice versa: Davison 00, Menczer 02, 04, Radev et al. 04)
The radius-1 hypothesis
–Given that page i is relevant to a query and that page i points to page j, page j is also likely to be relevant (at least, more so than a random web page)
Focused crawling
–Keeping a priority queue of the most relevant pages
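The priority-queue idea can be sketched as a best-first crawl over an in-memory toy web. This is an illustrative sketch only: the relevance score (fraction of topic terms found on a page), the pages, and the link structure are all assumptions made for the example. Following the radius-1 hypothesis, each newly discovered link inherits the relevance of the page that points to it as its priority.

```python
import heapq


def relevance(text, topic_terms):
    """Toy relevance score: fraction of the topic terms that occur in the text."""
    words = set(text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def focused_crawl(pages, links, seed, topic_terms, budget):
    """Visit up to `budget` pages, always expanding the most promising link first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    visited = []
    while frontier and len(visited) < budget:
        _neg_priority, url = heapq.heappop(frontier)
        visited.append(url)
        score = relevance(pages[url], topic_terms)
        for target in links.get(url, ()):
            if target not in seen:
                seen.add(target)
                # Radius-1 hypothesis: a link from a relevant page is promising.
                heapq.heappush(frontier, (-score, target))
    return visited


# Hypothetical five-page web.
pages = {
    "seed": "information retrieval course",
    "ir": "web crawling and information retrieval",
    "sports": "football scores",
    "crawl": "focused crawling retrieval",
    "news": "celebrity news",
}
links = {"seed": ["ir", "sports"], "ir": ["crawl"], "sports": ["news"]}
topic = {"information", "retrieval", "crawling"}

order = focused_crawl(pages, links, "seed", topic, budget=4)
print(order)
```

In this run the page "crawl" is visited before "sports" even though it was discovered later, because it was linked from a highly relevant page. Ties between equal priorities are broken by URL string order, which is an artifact of this sketch.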

Challenges in indexing the web
–Page importance varies a lot
–Anchor text
–User modeling
–Detecting duplicates
–Dealing with spam (content-based and link-based)

Duplicate detection
Shingles
–TO BE OR
–BE OR NOT
–OR NOT TO
–NOT TO BE
Then use the Jaccard coefficient (size of intersection/size of union) to determine similarity
Hashing
Shingling (separate lecture)
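The shingle comparison can be sketched in a few lines. This is a minimal illustration using three-word shingles, as in the slide's example; the function names and the second document are made up, and a production system would hash the shingles and sample them instead of comparing full sets.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text."""
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard coefficient: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


doc_a = "TO BE OR NOT TO BE"
doc_b = "TO BE OR NOT TO BE THAT IS THE QUESTION"  # hypothetical second document

shingles_a = shingles(doc_a)
shingles_b = shingles(doc_b)
similarity = jaccard(shingles_a, shingles_b)
print(sorted(shingles_a))
print(similarity)
```

Here the first document yields the four shingles listed on the slide, the second yields eight, and all four of the first document's shingles are shared, so the Jaccard coefficient is 4/8 = 0.5.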

Document closures for Q&A
[Figure labels: capital, P, LP, Madrid, spain, capital]

Document closures for IR
[Figure labels: Physics, P, LP, Physics Department, University of Michigan]

The link-content hypothesis
Topical locality: a page is similar to the page that points to it.
Davison (TF*IDF, 100K pages)
–0.31 same domain
–0.23 linked pages
–0.19 sibling
–0.02 random
Menczer (373K pages, non-linear least squares fit)
Chakrabarti (focused crawling): prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
 1 =1.8,  2 =0.6,