All About Nutch
Michael J. Cafarella
CSE 454, April 14, 2005

Meta-details
- Built to encourage public search work
- Open-source, with pluggable modules
- Cheap to run, in both machines and admins
- Goal: search more pages, with better quality, than any other engine
- Pretty good ranking
- Currently can do ~200M pages

Outline
- Nutch design: link database, fetcher, indexer, etc.
- Supporting parts: distributed filesystem, job control
- Nutch for your project

[Architecture diagram: Inject feeds the WebDB; the WebDB emits Fetchlists 0..N-1 to Fetchers 0..N-1; fetched Content and Updates 0..N-1 flow back into the WebDB; Indexers 0..N-1 build Indexes 0..N-1, which Searchers 0..N-1 serve behind WebServers 0..M-1.]

Moving Parts
- Acquisition cycle: WebDB, Fetcher
- Index generation: indexing, link analysis (maybe)
- Serving results

WebDB
- Contains info on all pages and links
  - Pages: URL, last download, # failures, link score, content hash, ref counting
  - Links: source hash, target URL
- Must always be consistent
- Designed to minimize disk seeks
  - 19 ms seek time x 200M new pages/month = ~44 days of disk seeks!
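
The arithmetic behind that last figure, spelled out in plain Python (numbers are the ones on the slide):

```python
# One random disk seek per new page would be ruinously slow.
seek_time_s = 0.019           # ~19 ms per seek
new_pages_per_month = 200e6   # ~200M new pages per month

seconds_seeking = seek_time_s * new_pages_per_month
print(seconds_seeking / 86400)  # ~44 days of nothing but seeks
```

Hence the WebDB is laid out for sequential, batched access rather than per-page random I/O.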

Fetcher
- Fetcher is very stupid; it is not a "crawler"
- Divide the "to-fetch list" into k pieces, one for each fetcher machine
- URLs for one domain go to the same list, otherwise random (see the sketch below)
  - "Politeness" without inter-fetcher protocols
  - Can observe robots.txt similarly
  - Better DNS and robots caching
  - Easy parallelism
- Two outputs: pages, WebDB edits
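
A minimal sketch of that host-aware split; the helper name and the CRC-based hash are illustrative choices, not Nutch's actual (Java) code:

```python
import zlib
from urllib.parse import urlparse

def partition_fetchlist(urls, k):
    """Split a to-fetch list into k pieces; all URLs from one host share a piece."""
    fetchlists = [[] for _ in range(k)]
    for url in urls:
        host = urlparse(url).netloc
        fetchlists[zlib.crc32(host.encode()) % k].append(url)
    return fetchlists

lists = partition_fetchlist(
    ["http://example.com/a", "http://example.com/b", "http://other.org/x"], k=3)
# Both example.com URLs land in the same fetchlist, so per-host politeness
# needs no coordination between fetcher machines.
```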

WebDB/Fetcher Updates
- 1. Write down fetcher edits
- 2. Sort edits (externally, if necessary)
- 3. Read the edit stream and the existing WebDB in parallel, emitting a new database (sketched below)
- 4. Repeat for other tables
[Diagram: existing WebDB records (URL, LastUpdated, ContentHash) are merged with sorted fetcher edits (DOWNLOAD_CONTENT, NEW_LINK); downloaded pages get their LastUpdated and ContentHash refreshed, new links add previously unseen URLs.]
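
A rough sketch of the sorted-merge behind step 3 (illustrative Python only, nothing like Nutch's on-disk format): because both the old records and the edits are sorted by URL, one sequential pass produces the new database with no random seeks.

```python
def merge_updates(old_records, edits):
    """old_records and edits are lists of dicts, both sorted by 'url'."""
    out, i, j = [], 0, 0
    while i < len(old_records) or j < len(edits):
        no_edit_next = j >= len(edits) or (
            i < len(old_records) and old_records[i]["url"] < edits[j]["url"])
        if no_edit_next:
            out.append(old_records[i])            # record unchanged this round
            i += 1
            continue
        url = edits[j]["url"]
        if i < len(old_records) and old_records[i]["url"] == url:
            rec = dict(old_records[i])            # existing page being updated
            i += 1
        else:
            rec = {"url": url, "content_hash": None}   # brand-new page
        while j < len(edits) and edits[j]["url"] == url:
            edit = edits[j]
            if edit["op"] == "DOWNLOAD_CONTENT":  # page was fetched this cycle
                rec["content_hash"] = edit["hash"]
                rec["last_updated"] = edit["date"]
            # NEW_LINK edits only need the URL to exist in the database
            j += 1
        out.append(rec)
    return out
```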

Indexing
- Iterate through all k page sets in parallel, constructing the inverted index
- Creates a "searchable document" from (see the sketch below):
  - URL text
  - Content text
  - Incoming anchor text
- Other content types might have different document fields
  - E.g., email has sender/receiver
  - Any searchable field the end user will want
- Uses the Lucene text indexer
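
Nutch does this in Java via Lucene; the toy Python below only illustrates what a "searchable document" with those fields looks like and how an inverted index over them is built:

```python
def make_document(url, content_text, anchor_texts):
    """Bundle the searchable fields for one page (mirrors the slide's list)."""
    return {
        "url": url,                          # URL text
        "content": content_text,             # page body text
        "anchors": " ".join(anchor_texts),   # incoming anchor text
    }

def build_inverted_index(documents):
    """Map each term to the URLs of the documents containing it."""
    index = {}
    for doc in documents:
        terms = set()
        for field in ("url", "content", "anchors"):
            terms.update(doc[field].lower().split())
        for term in terms:
            index.setdefault(term, []).append(doc["url"])
    return index

docs = [make_document("http://cam.example/seattle",
                      "Live webcam of downtown Seattle",
                      ["seattle cam", "space needle view"])]
index = build_inverted_index(docs)
```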

Link analysis
- A page's relevance depends on both intrinsic and extrinsic factors
  - Intrinsic: page title, URL, text
  - Extrinsic: anchor text, link graph
- PageRank is the most famous of many
- Others include HITS and a simple incoming-link count
- Link analysis is sexy, but its importance is generally overstated

Link analysis (2)
- Nutch performs the analysis in the WebDB
  - Emit a score for each known page
  - At index time, incorporate the score into the inverted index
- Extremely time-consuming
  - In our case, disk-consuming, too (because we want to use low-memory machines)
- The score emitted: 0.5 * log(# incoming links)
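
A minimal sketch of that score, assuming the link graph arrives as a map from each URL to its list of incoming source URLs (illustration only; Nutch computes this inside the WebDB):

```python
import math

def link_scores(incoming_links):
    """incoming_links: {url: [source urls]} -> {url: score}."""
    return {
        url: 0.5 * math.log(len(sources)) if sources else 0.0
        for url, sources in incoming_links.items()
    }

scores = link_scores({
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://d.example/": [],
})
```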

Query Processing
[Diagram: the query "britney" is broadcast to every index segment (Docs 0-1M, 1-2M, 2-3M, 3-4M, 4-5M); each segment returns its local hits (docs 1 and 29; 1.2M and 1.7M; 2.3M and 2.9M; 3.1M and 3.2M; 4.4M and 4.5M), and the frontend merges them into one result list (1.2M, 4.4M, 29, ...).]
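
A sketch of that scatter/gather step, where each segment is modeled as a local inverted index of term to (doc id, score) postings; in a real deployment the loop over segments would be parallel RPCs to the searcher machines:

```python
def search_segment(segment_index, term, limit=10):
    """One searcher: look up a term in its local slice of the index."""
    return segment_index.get(term, [])[:limit]

def distributed_search(segments, term, limit=10):
    """Query every segment, then merge the partial hit lists by score."""
    hits = []
    for segment in segments:                   # scatter (sequential here)
        hits.extend(search_segment(segment, term, limit))
    hits.sort(key=lambda posting: posting[1], reverse=True)  # gather + rank
    return hits[:limit]

segments = [{"britney": [(1, 0.4), (29, 0.3)]},
            {"britney": [(1_200_000, 0.9), (1_700_000, 0.2)]}]
top = distributed_search(segments, "britney")
```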

Administering Nutch
- Admin costs are critical
  - It's a hassle when you have 25 machines; Google has maybe >100k
- Files
  - WebDB content, working files
  - Fetchlists, fetched pages
  - Link analysis outputs, working files
  - Inverted indices
- Jobs
  - Emit fetchlists, fetch, update WebDB
  - Run link analysis
  - Build inverted indices

Administering Nutch (2)
- Admin sounds boring, but it's not! Really, I swear
- Large-file maintenance
  - Google File System (Ghemawat, Gobioff, Leung)
  - Nutch Distributed File System
- Job control
  - Map/Reduce (Dean and Ghemawat)

Nutch Distributed File System
- Similar, but not identical, to GFS
- Requirements are fairly strange
  - Extremely large files
  - Most files read once, from start to end
  - Low admin costs per GB
- Equally strange design
  - Write-once, with delete
  - A single file can exist across many machines
  - Wholly automatic failure recovery

NDFS (2)
- Data is divided into blocks
- Blocks can be copied, replicated
- Datanodes hold and serve blocks
- Namenode holds the metainfo (see the sketch below)
  - Filename -> block list
  - Block -> datanode locations
- Datanodes report in to the namenode every few seconds
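
A tiny in-memory model of the namenode's two metadata maps; the class and method names are invented for this sketch, and the real NDFS protocol is of course richer:

```python
class Namenode:
    """Holds only metadata: which blocks form a file, and where each block lives."""

    def __init__(self):
        self.file_to_blocks = {}    # filename -> ordered list of block ids
        self.block_locations = {}   # block id -> set of datanode ids

    def add_file(self, filename, block_ids):
        self.file_to_blocks[filename] = list(block_ids)

    def report_block(self, datanode_id, block_id):
        """Invoked when a datanode reports in with the blocks it holds."""
        self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def lookup(self, filename):
        """Return (block id, replica locations) pairs, in file order."""
        return [(b, sorted(self.block_locations.get(b, set())))
                for b in self.file_to_blocks[filename]]
```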

NDFS File Read
- 1. Client asks the namenode for the file's info
- 2. Namenode responds with the block list and the location(s) of each block
- 3. Client fetches each block, in sequence, from a datanode
[Diagram: reading "crawl.txt": block-33 lives on datanodes 1 and 4, block-95 on datanodes 0 and 2, block-65 on datanodes 1, 4, and 5.]
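
A sketch of the client-side read loop, assuming a lookup callable shaped like the Namenode sketch above and a plain dict standing in for the datanode network (both hypothetical):

```python
def read_file(namenode_lookup, datanodes, filename):
    """namenode_lookup(filename) -> [(block_id, [datanode_id, ...]), ...]
    datanodes: {datanode_id: {block_id: bytes}}"""
    chunks = []
    for block_id, locations in namenode_lookup(filename):
        for node in locations:                     # any replica will do
            block = datanodes.get(node, {}).get(block_id)
            if block is not None:
                chunks.append(block)
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(chunks)
```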

NDFS Replication
- 1. Always keep at least k copies of each block
- 2. Imagine datanode 4 dies; block 90 is lost
- 3. Namenode loses the heartbeat, decrements block 90's reference count, and asks datanode 5 to replicate block 90 to datanode 0 (see the sketch below)
- 4. Choosing the replication target is tricky
[Diagram: datanodes 0-5 hold blocks (33, 95), (46, 95), (33, 104), (21, 33, 46), (90), and (21, 90, 104) respectively; after datanode 4 dies, block 90 is copied from datanode 5 to datanode 0.]
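
A sketch of the namenode's reaction once a heartbeat is lost; picking the target with the fewest blocks is just a stand-in for the trickier placement policy the slide alludes to:

```python
def plan_rereplication(block_locations, dead_node, k=2):
    """Return (block_id, source_node, target_node) copy orders after a failure.

    block_locations: {block_id: set of datanode ids currently holding it}.
    """
    live = {b: nodes - {dead_node} for b, nodes in block_locations.items()}
    all_nodes = {n for nodes in live.values() for n in nodes}
    load = {n: sum(n in nodes for nodes in live.values()) for n in all_nodes}

    orders = []
    for block, nodes in live.items():
        if 0 < len(nodes) < k:                  # under-replicated but recoverable
            source = next(iter(nodes))
            candidates = [n for n in all_nodes if n not in nodes]
            if candidates:
                target = min(candidates, key=lambda n: load[n])
                orders.append((block, source, target))
                load[target] += 1
    return orders

orders = plan_rereplication(
    {33: {0, 2, 3}, 90: {4, 5}, 95: {0, 1}, 104: {2, 5}}, dead_node=4)
# Block 90 now has only one live copy (on datanode 5), so a copy is ordered.
```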

Map/Reduce
- Map/Reduce is a programming model from Lisp (and other places)
- Easy to distribute across nodes
- Nice retry/failure semantics
- map(key, val) is run on each item in the input set
  - Emits key/val pairs
- reduce(key, vals) is run for each unique key emitted by map()
  - Emits the final output
- Many problems can be phrased this way
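
A minimal single-process sketch of the model (no distribution, no retries), just to make the map, group-by-key, and reduce data flow concrete:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """records:   iterable of (key, value) input pairs
    map_fn:    (key, value) -> iterable of (key2, value2) pairs
    reduce_fn: (key2, [value2, ...]) -> iterable of output items"""
    intermediate = defaultdict(list)
    for key, value in records:                   # map phase
        for k2, v2 in map_fn(key, value):
            intermediate[k2].append(v2)          # "shuffle": group by key

    output = []
    for k2 in sorted(intermediate):              # reduce phase, one call per key
        output.extend(reduce_fn(k2, intermediate[k2]))
    return output
```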

Map/Reduce (2)
- Task: count words in docs
- Input consists of (url, contents) pairs
- map(key=url, val=contents):
  - For each word w in contents, emit (w, "1")
- reduce(key=word, values=uniq_counts):
  - Sum all the "1"s in the values list
  - Emit the result (word, sum)
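
That word-count pseudocode, written out as runnable Python; the small driver at the end repeats the map, group, and reduce steps so the snippet stands on its own:

```python
from collections import defaultdict

def wc_map(url, contents):
    for word in contents.split():
        yield word, 1                  # emit (w, "1") for each word

def wc_reduce(word, counts):
    yield word, sum(counts)            # emit (word, sum)

docs = [("http://a.example/", "to be or not to be")]
grouped = defaultdict(list)
for url, contents in docs:
    for k, v in wc_map(url, contents):
        grouped[k].append(v)
result = [out for k in sorted(grouped) for out in wc_reduce(k, grouped[k])]
# result == [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```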

Map/Reduce (3)
- Task: grep
- Input consists of (url+offset, single line) pairs
- map(key=url+offset, val=line):
  - If the line matches the regexp, emit (line, "1")
- reduce(key=line, values=uniq_counts):
  - Don't do anything; just emit the line
- We can also do graph inversion, link analysis, WebDB updates, etc.
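
The grep case in the same style; the regexp below is an arbitrary placeholder, since the slide does not fix one:

```python
import re

PATTERN = re.compile(r"error", re.IGNORECASE)    # placeholder pattern

def grep_map(key, line):
    """key is url+offset, value is one line of input."""
    if PATTERN.search(line):
        yield line, 1

def grep_reduce(line, counts):
    yield line                                   # pass matching lines through

matches = list(grep_map("http://a.example/log:42", "ERROR: disk full"))
```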

Map/Reduce (4)
- How is this distributed?
  - 1. Partition the input key/value pairs into chunks, run map() tasks in parallel
  - 2. After all map()s are complete, consolidate all emitted values for each unique emitted key
  - 3. Now partition the space of output map keys, and run reduce() in parallel
- If any map() or reduce() fails, re-execute it!

Map/Reduce Job Processing
- 1. Client submits a "grep" job, indicating the code and input files
- 2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the TaskTrackers
- 3. After map(), the TaskTrackers exchange map output to build the reduce() keyspace
- 4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns the work
- 5. reduce() output may go to NDFS
[Diagram: one JobTracker coordinating TaskTrackers 0-5 for the "grep" job.]

Searching webcams
- Index size will be small
- Need all the hints you can get
  - Page text, anchor text
  - URL sources like Yahoo or DMOZ entries
- Webcam-only content types
  - Avoid processing images at query time
  - Take a look at Nutch's pluggable content types (current examples include PDF, MS Word, etc.). Might work.

Searching webcams (2)
- Annotate the Lucene document with new fields (see the sketch below)
  - "Image qualities" might contain "indoors" or "daylight" or "flesh tones"
  - Parse the text for city names to fill a "location" field
  - Multiple downloads to compute a "latitude" field
  - Others?
- Will require a new search procedure, too
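
A sketch of what those extra fields could look like on a document dict such as the one in the indexing sketch earlier; the field names follow the slide's suggestions, while the helper itself (and its inputs) are hypothetical:

```python
def annotate_webcam_document(doc, image_qualities, city, latitude):
    """Attach webcam-specific fields before the document is indexed.

    image_qualities: e.g. ["indoors", "daylight", "flesh tones"]
    city:            output of a (hypothetical) city-name parser over the text
    latitude:        computed from multiple downloads, if available
    """
    doc = dict(doc)
    doc["image_qualities"] = " ".join(image_qualities)
    doc["location"] = city or ""
    doc["latitude"] = latitude            # may be None if not yet computed
    return doc

cam = annotate_webcam_document(
    {"url": "http://cam.example/", "content": "Live view of downtown Seattle"},
    image_qualities=["daylight"], city="Seattle", latitude=47.61)
```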

Conclusion
- Partial documentation
- Source code
- Developer discussion board
- "Lucene in Action" by Hatcher and Gospodnetic (you can borrow mine)
- Questions?