Crawlers: Nutch (CSE 454)

Administrivia
- Groups formed
- Architecture documents under review
- Group meetings

Course Overview
- Info Extraction, Ecommerce, Web Services, P2P, Semantic Web, Security, Datamining
- Case Studies: Nutch, Google, Altavista
- Information Retrieval: precision vs. recall, inverted indices
- Crawler Architecture
- Synchronization & Monitors
- Systems Foundation: Networking & Clusters

Standard Web Search Engine Architecture
[Diagram] Crawl the web → store documents, check for duplicates, extract links → create an inverted index (DocIds) → search engine servers evaluate user queries against the inverted index → show results to user
Slide adapted from Marty Hearst / UC Berkeley

Issues
- Crawling
- Search
- Presentation

Crawling Issues
- Storage efficiency
- Search strategy: where to start, link ordering
- Circularities
- Duplicates
- Checking for changes
- Politeness: forbidden zones (robots.txt), CGI & scripts, load on remote servers, bandwidth (download only what you need)
- Parsing pages for links
- Scalability

Searching Issues
- Scalability (how to measure speed?)
- Ranking
- Boolean queries, phrase search, nearness
- Substrings & stemming
- Stop words
- Multiple languages
- Spam, cloaking, …
- Multiple meanings for search words
- File types: images, audio, …
- Updating the index

Thinking about Efficiency
- Disk access: 1-10 ms, depending on seek distance; published average is 5 ms
  - Thus we can perform ~200 seeks/sec (and we are ignoring rotation and transfer times)
- Clock cycle: 2 GHz
  - Naively ~10 cycles/instruction, but with pipelining & parallel execution a CPU typically completes 2 instructions/cycle
  - Thus ~4 billion instructions/sec
- Disk is ~20 million times slower!
- So: store the index in an Oracle database? Or in files via the Unix filesystem?
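The arithmetic above as a quick sanity check (a tiny Python sketch; nothing here beyond the numbers already on the slide):

```python
# Back-of-the-envelope numbers from the slide: 5 ms average seek,
# 2 GHz clock, ~2 instructions completed per cycle.
seeks_per_sec = 1 / 0.005                    # 200 seeks/sec
instructions_per_sec = 2e9 * 2               # 4 billion instructions/sec
print(instructions_per_sec / seeks_per_sec)  # 2e7: disk is ~20 million times slower
```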

Search Engine Architecture
- Spider: crawls the web to find pages; follows hyperlinks; never stops
- Indexer: produces data structures for fast searching of all words in the pages
- Retriever: query interface; database lookup to find hits (300 million documents; 300 GB RAM, terabytes of disk)
- Front End: ranking, summaries

Search Engine Size over Time
[Chart: number of indexed pages over time, self-reported. Google: 50% of the web?]

Crawlers (Spiders, Bots)
- Retrieve web pages for indexing by search engines
- Start with an initial page P0; find URLs on P0 and add them to a queue
- When done with P0, pass it to an indexing program, get a page P1 from the queue, and repeat
- Can be specialized (e.g., only look for email addresses)
- Issues:
  - Which page to look at next? (keywords, recency, ?)
  - Avoid overloading a site
  - How deep within a site to go (drill-down)?
  - How frequently to visit pages?
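A minimal sketch of this crawl loop in Python; fetch_page, extract_urls, and index_page are hypothetical helpers standing in for the real downloader, parser, and indexer:

```python
from collections import deque

def crawl(p0, fetch_page, extract_urls, index_page):
    """Start at p0, queue discovered URLs, and repeat."""
    queue = deque([p0])
    seen = {p0}
    while queue:                      # a real spider "never stops"
        url = queue.popleft()
        page = fetch_page(url)        # download the page
        index_page(url, page)         # hand it to the indexing program
        for link in extract_urls(page):
            if link not in seen:      # skip URLs already queued or visited
                seen.add(link)
                queue.append(link)
```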

Spiders
- 243 active spiders registered as of 1/01: http://info.webcrawler.com/mak/projects/robots/active/html/index.html
- Inktomi Slurp: standard search engine
- Digimark: downloads just images, looking for watermarks
- Adrelevance: looking for ads

Searches / Day (from SearchEngineWatch, 02/03)
  Google     250 M
  Overture   167 M
  Inktomi     80 M
  LookSmart   45 M
  FindWhat    33 M
  AskJeeves   20 M
  Altavista   18 M
  FAST        12 M

Hitwise: Search Engine Ratings, 5/04
[Chart: search engine ratings from Hitwise, not reproduced in transcript]

Outgoing Links? Parse HTML… looking for… what?
[Figure: two jumbles of page text and markup; buried in each is a link such as "a href = www.cs" that the crawler must pick out of the surrounding tags and text.]

Which tags / attributes hold URLs?
- Anchor tag: <a href="URL" …> … </a>
- Option tag: <option value="URL" …> … </option>
- Map: <area href="URL" …>
- Frame: <frame src="URL" …>
- Link to an image: <img src="URL" …>
- Relative path vs. absolute path: <base href=…>
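One way to pull URLs out of these tags with Python's standard library (a sketch; the example URLs are made up), including <base href=…> handling for relative paths:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs from the slide that can hold URLs.
URL_ATTRS = {'a': 'href', 'area': 'href', 'frame': 'src',
             'img': 'src', 'option': 'value'}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url            # <base href=...> overrides this
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'base' and 'href' in attrs:
            self.base = attrs['href']   # new base for later relative paths
        elif tag in URL_ATTRS and URL_ATTRS[tag] in attrs:
            # urljoin resolves relative paths against the base
            self.links.append(urljoin(self.base, attrs[URL_ATTRS[tag]]))

parser = LinkExtractor('http://www.example.edu/dir/')
parser.feed('<a href="page.html">x</a> <img src="/logo.gif">')
print(parser.links)
# ['http://www.example.edu/dir/page.html', 'http://www.example.edu/logo.gif']
```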

Robot Exclusion
- A site owner may not want certain pages indexed
- Crawlers should obey the Robot Exclusion Protocol, but some don't
- Look for the file robots.txt at the highest directory level: if the domain is www.ecom.cmu.edu, robots.txt goes in www.ecom.cmu.edu/robots.txt
- A specific document can be shielded from a crawler by adding the line: <META NAME="ROBOTS" CONTENT="NOINDEX">

Robots Exclusion Protocol
Format of robots.txt: two fields. User-agent specifies a robot; Disallow tells that agent what to ignore.
To exclude all robots from a server:
    User-agent: *
    Disallow: /
To exclude one robot from two directories:
    User-agent: WebCrawler
    Disallow: /news/
    Disallow: /tmp/
View the robots.txt specification at http://info.webcrawler.com/mak/projects/robots/norobots.html
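Python's standard library can parse and apply robots.txt directly; a small usage sketch using the slide's example domain and robot name:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser('http://www.ecom.cmu.edu/robots.txt')
rp.read()    # fetch and parse the file

# Check the rules before downloading anything.
if rp.can_fetch('WebCrawler', 'http://www.ecom.cmu.edu/news/story.html'):
    print('allowed')
else:
    print('disallowed by robots.txt')
```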

Web Crawling Strategy
- Starting location(s)
- Traversal order: depth-first (LIFO), breadth-first (FIFO), or ??? (see the sketch below)
- Politeness
- Cycles?
- Coverage?
[Figure: small graph of pages (b, c, d, e, f, g, h, i, j) illustrating traversal order]
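The traversal order falls directly out of the frontier data structure; a tiny Python illustration with placeholder page names:

```python
from collections import deque

frontier = deque(['b', 'c', 'd'])   # URLs discovered so far
bfs_next = frontier.popleft()       # FIFO queue -> breadth-first
dfs_next = frontier.pop()           # LIFO stack -> depth-first
```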

Structure of Mercator Spider
1. Remove URL from queue
2. Simulate network protocols & REP (Robots Exclusion Protocol)
3. Read with RewindInputStream (RIS)
4. Has document been seen before? (checksums and document fingerprints)
5. Extract links
6. Download new URL?
7. Has URL been seen before?
8. Add URL to frontier
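The eight steps as a Python skeleton (purely illustrative; every helper here is a hypothetical stand-in for the corresponding Mercator component):

```python
def mercator_step(frontier, fetch, rewindable, fingerprint,
                  seen_docs, extract_links, want_url, seen_urls):
    url = frontier.pop_next()           # 1. remove URL from queue
    stream = rewindable(fetch(url))     # 2-3. protocol module + RewindInputStream
    fp = fingerprint(stream)            # 4. checksum / document fingerprint
    if fp in seen_docs:
        return                          # content seen before: skip it
    seen_docs.add(fp)
    stream.rewind()                     # reread the same bytes for parsing
    for link in extract_links(stream):  # 5. extract links
        if want_url(link) and link not in seen_urls:  # 6-7. filter + URL-seen test
            seen_urls.add(link)
            frontier.add(link)          # 8. add URL to frontier
```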

URL Frontier (priority queue)
- Most crawlers do breadth-first search from the seeds
- Politeness constraint: don't hammer servers!
  - Obvious implementation: a "live host table" of hosts with outstanding requests. Will it fit in memory? Is this efficient?
- Mercator's politeness:
  - One FIFO subqueue per thread
  - Choose the subqueue by hashing the host's name
  - Dequeue the first URL whose host has NO outstanding requests
- Note: a live host table should fit in memory, but locality is the problem. AltaVista used this implementation, and because of URL locality it spent a lot of time repeatedly checking the table, only to find that it couldn't make the request.
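A sketch of Mercator-style per-host subqueues (hashing the host name to pick a subqueue, so each host is served by exactly one thread):

```python
from collections import deque
from urllib.parse import urlparse

class PoliteFrontier:
    """One FIFO subqueue per crawler thread, chosen by hashing the host."""
    def __init__(self, num_threads):
        self.subqueues = [deque() for _ in range(num_threads)]

    def add(self, url):
        host = urlparse(url).hostname or ''
        # All URLs for a host land in the same subqueue, so no host
        # ever sees concurrent requests from different threads.
        self.subqueues[hash(host) % len(self.subqueues)].append(url)

    def next_for_thread(self, thread_id):
        q = self.subqueues[thread_id]
        return q.popleft() if q else None
```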

Fetching Pages
- Need to support http, ftp, gopher, …: extensible!
- Need to fetch multiple pages at once
- Need to cache as much as possible: DNS, robots.txt, the documents themselves (for later processing)
- Need to be defensive!
  - Need to time out http connections
  - Watch for "crawler traps" (e.g., infinite URL names); see section 5 of the Mercator paper
  - Use a URL filter module
- Checkpointing!
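Being defensive mostly means bounding every external interaction; a minimal sketch with a timeout and a size cap (the limits are made-up defaults):

```python
import urllib.request

MAX_BYTES = 1 << 20   # 1 MB cap: defend against pathologically large pages

def fetch(url, timeout=10):
    """Fetch a page, but never let one slow or hostile host stall us."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(MAX_BYTES)
    except Exception:   # DNS failure, timeout, HTTP error, ...
        return None
```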

(A?)synchronous I/O
- Problem: network + host latency; want to GET multiple URLs at once
- Google: single-threaded crawler + asynchronous I/O
- Mercator: multi-threaded crawler + synchronous I/O (easier to code?)
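Mercator's choice in Python terms: plain blocking I/O, parallelized with threads (fetch is the timeout-guarded downloader sketched above; the URLs are placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

urls = ['http://example.com/a', 'http://example.com/b']

# Many threads, each doing simple synchronous I/O.
with ThreadPoolExecutor(max_workers=50) as pool:
    pages = list(pool.map(fetch, urls))
```

The single-threaded asynchronous alternative keeps one event loop juggling all open connections; the threaded version is indeed easier to code, at the cost of per-thread overhead.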

Duplicate Detection
- URL-seen test: has this URL been seen before? To save space, store a hash
- Content-seen test: different URL, same doc
  - Suppress link extraction from mirrored pages
  - What to save for each doc? A 64-bit "document fingerprint"
  - Minimize the number of disk reads upon retrieval
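A sketch of both tests; Mercator used its own fingerprint function, so the truncated SHA-1 here is just an illustrative stand-in for a 64-bit fingerprint:

```python
import hashlib

def fingerprint64(content: bytes) -> int:
    """64-bit document fingerprint (truncated SHA-1, for illustration)."""
    return int.from_bytes(hashlib.sha1(content).digest()[:8], 'big')

seen_urls = set()   # URL-seen test: store hashes, not full URL strings
seen_docs = set()   # content-seen test: 64-bit fingerprints

def is_duplicate(url: str, content: bytes) -> bool:
    h, fp = hash(url), fingerprint64(content)
    dup = h in seen_urls or fp in seen_docs
    seen_urls.add(h)
    seen_docs.add(fp)
    return dup
```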

Mercator Statistics
Histogram of document sizes: exponentially increasing size

  PAGE TYPE    PERCENT
  text/html    69.2%
  image/gif    17.9%
  image/jpeg    8.1%
  text/plain    1.5%
  pdf           0.9%
  audio         0.4%
  zip           0.4%
  postscript    0.3%
  other         1.4%

Advanced Crawling Issues
- Limited resources: fetch the most important pages first
- Topic-specific search engines: only care about pages relevant to the topic ("focused crawling")
- Minimize stale pages: efficient re-fetch to keep the index timely; how to track the rate of change for pages?

Focused Crawling
- Use a priority queue instead of a FIFO
- How to determine priority?
  - Similarity of page to driving query: use traditional IR measures
  - Backlinks: how many links point to this page?
  - PageRank (Google): some links to this page count more than others
  - Forward links of a page
  - Location heuristics: e.g., is the site in .edu? Does the URL contain 'home'?
  - A linear combination of the above (see the sketch below)
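A sketch of the priority frontier with a linear combination of the signals above (the weights are arbitrary placeholders, and each signal is assumed to be scored in [0, 1]):

```python
import heapq

# Hypothetical weights for the slide's signals.
WEIGHTS = {'similarity': 0.4, 'backlinks': 0.3,
           'pagerank': 0.2, 'location': 0.1}

def priority(signals):
    return sum(w * signals.get(k, 0.0) for k, w in WEIGHTS.items())

frontier = []   # heapq is a min-heap, so negate for highest-priority-first

def push(url, signals):
    heapq.heappush(frontier, (-priority(signals), url))

push('http://www.cs.edu/home.html', {'similarity': 0.9, 'location': 1.0})
next_url = heapq.heappop(frontier)[1]   # most promising page next
```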