The Nutch Open-Source Search Engine CSE 454 Slides by Michael J. Cafarella

Meta-details
- Built to encourage public search work
- Open-source, with pluggable modules
- Cheap to run, in both machines and admins
- Goal: search more pages, with better quality, than any other engine
  - Pretty good ranking
  - Currently handles ~200M pages; well short of 8 billion, but not too shabby

Outline
- Nutch design: link database, fetcher, indexer, etc.
- Supporting parts: distributed filesystem, job control

[Architecture diagram: Inject seeds the WebDB; the WebDB emits fetchlists 0..N-1, one per fetcher; fetchers 0..N-1 produce fetched content and updates 0..N-1 that flow back into the WebDB; indexers 0..N-1 build indexes 0..N-1; searchers 0..N-1 serve the index shards behind web servers 0..M-1.]

Moving Parts
- Acquisition cycle: WebDB, fetcher
- Index generation: indexing, link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages and links (see the record sketch below)
  - Per page: URL, last download time, # failures, link score, content hash, ref counting
  - Per link: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/month = ~44 days of disk seeks!
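The per-page and per-link fields listed above suggest two simple record types. A minimal illustrative sketch in Java; these are not Nutch's actual classes, and the field names are assumptions drawn from the list on this slide:

```java
// Illustrative sketch of the two WebDB record types; not Nutch source.
public class PageRecord {
    String url;            // primary key for the page table
    long lastDownload;     // time of last successful fetch, or 0 for "Never"
    int numFailures;       // consecutive fetch failures
    float linkScore;       // score emitted by link analysis
    byte[] contentHash;    // e.g. MD5 of page content, for duplicate detection
    int refCount;          // how many links point at this page
}

class LinkRecord {
    byte[] sourceHash;     // content hash of the page the link appears on
    String targetUrl;      // where the link points
}
```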

Fetcher
- The fetcher is very stupid; it is not a "crawler"
- Divide the "to-fetch list" into k pieces, one per fetcher machine
  - URLs for one domain go to the same list; otherwise random (see the partitioning sketch below)
- "Politeness" without inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS and robots caching
- Easy parallelism
- Two outputs: pages, WebDB edits
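A sketch of the partitioning rule just described: hash the host so all URLs from one domain land on the same fetchlist, which lets a single fetcher enforce politeness locally with no inter-fetcher coordination. This is my own illustration, not Nutch source:

```java
import java.net.URL;

public class FetchlistPartitioner {
    // Assign a URL to one of k fetchlists. All URLs sharing a host hash
    // to the same list, so exactly one fetcher owns each domain and can
    // enforce crawl delays and robots.txt rules on its own.
    static int partition(URL url, int k) {
        int h = url.getHost().hashCode();
        return (h & Integer.MAX_VALUE) % k;  // mask the sign bit to stay non-negative
    }
}
```

URLs with no shared host scatter effectively at random across the k lists, which is all the load balancing this scheme needs.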

WebDB/Fetcher Updates
1. Write down fetcher edits
2. Sort edits (externally, if necessary)
3. Read the edit stream and the existing WebDB in parallel, emitting the new database (see the merge sketch below)
4. Repeat for other tables

[Diagram: DOWNLOAD_CONTENT edits carry a URL plus a new content hash; NEW_LINK edits carry a newly discovered URL with no hash. Merged against the existing records (LastUpdated: Never / ContentHash: None; LastUpdated: 3/22/05; LastUpdated: 4/07/05; etc.), they yield updated records with LastUpdated: Today! and the new content hashes.]
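The update is essentially a sort-merge join: once the edits are sorted by URL, walk the old database and the edit stream together, emitting merged records. A schematic sketch reusing the hypothetical PageRecord above, with a hypothetical Edit type; Nutch's real code streams from disk rather than using in-memory lists:

```java
import java.util.ArrayList;
import java.util.List;

public class WebDbMerge {
    // Hypothetical edit type: knows how to create or update a PageRecord.
    interface Edit {
        String url();
        PageRecord toNewRecord();            // for NEW_LINK-style edits
        PageRecord applyTo(PageRecord old);  // for DOWNLOAD_CONTENT-style edits
    }

    // Merge a URL-sorted edit list into a URL-sorted page table.
    static List<PageRecord> merge(List<PageRecord> oldDb, List<Edit> edits) {
        List<PageRecord> newDb = new ArrayList<>();
        int i = 0, j = 0;
        while (i < oldDb.size() || j < edits.size()) {
            int cmp = (i >= oldDb.size()) ? 1          // only edits remain
                    : (j >= edits.size()) ? -1         // only old records remain
                    : oldDb.get(i).url.compareTo(edits.get(j).url());
            if (cmp < 0)      newDb.add(oldDb.get(i++));                       // untouched record
            else if (cmp > 0) newDb.add(edits.get(j++).toNewRecord());         // brand-new page
            else              newDb.add(edits.get(j++).applyTo(oldDb.get(i++))); // update existing
        }
        return newDb;
    }
}
```

Because both inputs are sorted, the merge is one sequential pass with no random seeks, which is the whole point of the design.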

Indexing
- Iterate through all k page sets in parallel, constructing the inverted index
- Creates a "searchable document" containing:
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end user will want
- Uses the Lucene text indexer (sketched below)
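A minimal sketch of building such a searchable document with Lucene. This uses a modern Lucene API; Nutch's own indexing code and field names differ, and the field names and values here are illustrative:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

public class PageIndexer {
    public static void main(String[] args) throws Exception {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            // One field per searchable text source, as listed on the slide.
            doc.add(new TextField("url", "http://example.com/page", Store.YES));
            doc.add(new TextField("content", "page body text ...", Store.NO));
            doc.add(new TextField("anchor", "incoming anchor text ...", Store.NO));
            writer.addDocument(doc);
        }
    }
}
```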

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- E.g., PageRank, HITS, simple in-degree
- Sexy, but its importance is generally overstated

Link analysis (2)
- Nutch performs the analysis in the WebDB
  - Emits a score for each known page
  - At index time, incorporates the score into the index
- Extremely time-consuming
  - Disk-consuming, too (because we want to use low-memory machines)
- Score used: 0.5 * log(# incoming links)
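The slide's scoring rule as a one-liner. The clamp at one in-link is my assumption to avoid log(0); the slide gives only the bare formula:

```java
public class LinkScore {
    // score = 0.5 * log(# incoming links); clamped at 1 link so pages
    // with no known in-links score 0 rather than negative infinity
    // (the clamp is an assumption, not from the slide).
    static float score(int inLinks) {
        return 0.5f * (float) Math.log(Math.max(1, inLinks));
    }
}
```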

"britney" Query Processing

[Diagram: the query "britney" is broadcast to five index segments (docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each returns its local hits (docs 1 and 29; 1.2M and 1.7M; 2.3M and 2.9M; 3.1M and 3.2M; 4.4M and 4.5M), which are merged into the final result list: 1.2M, 4.4M, 29, ...]
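The diagram is a classic scatter-gather: send the query to every index segment in parallel, then merge the per-segment hits. A sketch under those assumptions; the SegmentSearcher interface is hypothetical, and a real engine would merge by score rather than concatenate:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ScatterGather {
    // Hypothetical per-segment search interface: returns global doc IDs.
    interface SegmentSearcher { List<Long> search(String query, int topK); }

    static List<Long> search(List<SegmentSearcher> segments, String query, int topK)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(segments.size());
        List<Future<List<Long>>> futures = new ArrayList<>();
        for (SegmentSearcher s : segments)              // scatter: one task per segment
            futures.add(pool.submit(() -> s.search(query, topK)));
        List<Long> merged = new ArrayList<>();
        for (Future<List<Long>> f : futures)            // gather: collect per-segment hits
            merged.addAll(f.get());
        pool.shutdown();
        // For brevity this just concatenates; a real merge keeps the global top K by score.
        return merged;
    }
}
```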

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines
  - Google has maybe ~150k
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not! Really! Mike swears!
- Large-file maintenance
  - Nutch Distributed File System
  - À la the Google File System (Ghemawat et al.)
- Job control
  - Map/Reduce (Dean and Ghemawat)
- To be discussed soon

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - A single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data is divided into blocks
  - Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds the metainfo (sketched below)
  - Filename -> block list
  - Block -> datanode location(s)
- Datanodes report to the namenode every few seconds
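The two mappings the namenode keeps amount to two in-memory tables, plus heartbeat bookkeeping. A minimal illustrative sketch; not NDFS source, and all names are assumptions:

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class NamenodeMeta {
    // Filename -> ordered list of block IDs making up the file.
    final Map<String, List<Long>> fileBlocks = new HashMap<>();
    // Block ID -> set of datanodes currently holding a copy.
    final Map<Long, Set<String>> blockLocations = new HashMap<>();
    // Datanode -> time of last heartbeat, used to detect dead nodes.
    final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called when a datanode checks in ("every few seconds" per the slide).
    void heartbeat(String datanode, Collection<Long> heldBlocks) {
        lastHeartbeat.put(datanode, System.currentTimeMillis());
        for (long block : heldBlocks)
            blockLocations.computeIfAbsent(block, b -> new HashSet<>()).add(datanode);
    }
}
```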

NDFS File Read
1. Client asks the namenode for info on a filename
2. Namenode responds with the block list and the location(s) of each block
3. Client fetches each block, in sequence, from a datanode

[Diagram: reading "crawl.txt": block-33 on datanodes 1 and 4; block-95 on datanodes 0 and 2; block-65 on datanodes 1, 4, and 5.]
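In code, the read path is one metadata lookup followed by a sequential loop over blocks. A sketch with hypothetical interfaces standing in for the two RPCs the diagram implies:

```java
import java.io.ByteArrayOutputStream;
import java.util.List;

public class NdfsRead {
    // Hypothetical stubs for the namenode and datanode RPCs.
    interface Namenode { List<BlockInfo> getBlocks(String filename); }
    interface BlockInfo { long id(); List<String> datanodes(); }
    interface DatanodeClient { byte[] fetchBlock(String datanode, long blockId); }

    static byte[] read(Namenode nn, DatanodeClient dn, String filename) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (BlockInfo block : nn.getBlocks(filename)) {   // steps 1-2: ask the namenode
            String replica = block.datanodes().get(0);     // any live replica will do
            out.write(dn.fetchBlock(replica, block.id())); // step 3: fetch blocks in sequence
        }
        return out.toByteArray();
    }
}
```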

NDFS Replication
1. Always keep at least k copies of each block
2. Imagine datanode 4 dies; its copy of block 90 is lost
3. Namenode loses the heartbeat and decrements block 90's reference count; it asks datanode 5 to replicate block 90 to datanode 0
4. Choosing the replication target is tricky

[Diagram: datanode 0 holds blocks 33, 95; datanode 1: 46, 95; datanode 2: 33, 104; datanode 3: 21, 33, 46; datanode 4: 90; datanode 5: 21, 90, 104. After datanode 4 dies, block 90 is copied from datanode 5 to datanode 0.]
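A sketch of that re-replication decision, reusing the NamenodeMeta sketch above. Target selection is simplified to "first live node without a copy"; as the slide itself notes, the real placement choice is tricky:

```java
import java.util.Map;
import java.util.Set;

public class ReplicationMonitor {
    // When a datanode is declared dead, every block it held loses one
    // replica; any block that drops below k copies must be re-replicated.
    static void onDatanodeDeath(NamenodeMeta meta, String deadNode,
                                Set<String> liveNodes, int k) {
        for (Map.Entry<Long, Set<String>> e : meta.blockLocations.entrySet()) {
            Set<String> holders = e.getValue();
            if (!holders.remove(deadNode)) continue;           // dead node held no copy
            if (holders.isEmpty() || holders.size() >= k) continue;
            for (String target : liveNodes) {                  // naive placement policy
                if (!holders.contains(target)) {
                    String source = holders.iterator().next();
                    System.out.printf("ask %s to copy block %d to %s%n",
                                      source, e.getKey(), target);
                    break;
                }
            }
        }
    }
}
```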

Conclusion
- Partial documentation
- Source code
- Developer discussion board
- "Lucene in Action" by Hatcher and Gospodnetić (you can borrow mine)
- Questions?