Background Knowledge. Peng Bo, School of EECS, Peking University, 6/26/2008. Based on Aaron Kimball's slides.

Background Topics Parallelization & Synchronization Fundamentals of Networking Search Engine Technology –Inverted index –PageRank algorithm

Parallelization & Synchronization

Parallelization Idea Parallelization is “easy” if processing can be cleanly split into n units:

Parallelization Idea (2) In a parallel computation, we would like to have as many threads as we have processors. For example, a four-processor computer would be able to run four threads at the same time.

Parallelization Idea (3)

Parallelization Idea (4)

Parallelization Pitfalls But this model is too simple! How do we assign work units to worker threads? What if we have more work units than threads? How do we aggregate the results at the end? How do we know all the workers have finished? What if the work cannot be divided into completely separate tasks? What is the common theme of all of these problems?

Parallelization Pitfalls (2) Each of these problems represents a point at which multiple threads must communicate with one another, or access a shared resource. Golden rule: Any memory that can be used by multiple threads must have an associated synchronization system!

What is Wrong With This? Thread 1: void foo() { x++; y = x; } Thread 2: void bar() { y++; x++; } If the initial state is y = 0, x = 6, what happens after these threads finish running?

Multithreaded = Unpredictability When we run a multithreaded program, we don't know what order threads run in, nor do we know when they will interrupt one another. Many things that look like "one step" operations actually take several steps under the hood: Thread 1: void foo() { eax = mem[x]; inc eax; mem[x] = eax; ebx = mem[x]; mem[y] = ebx; } Thread 2: void bar() { eax = mem[y]; inc eax; mem[y] = eax; eax = mem[x]; inc eax; mem[x] = eax; }
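A minimal Java sketch (class and variable names are hypothetical) that makes this unpredictability visible: two threads hammer an unguarded counter, and the final value is usually less than expected because the load/increment/store steps interleave.

```java
// Hypothetical demo of a lost-update race on unguarded shared state.
public class RaceDemo {
    static int x = 0;  // shared variable, no synchronization

    public static void main(String[] args) throws InterruptedException {
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                x++;  // actually three steps: load x, increment, store x
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join();  t2.join();
        // Expected 200000, but interleaved load/inc/store usually loses updates.
        System.out.println("x = " + x);
    }
}
```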

Multithreaded = Unpredictability This applies to more than just integers: Pulling work units from a queue Reporting work back to master unit Telling another thread that it can begin the “next phase” of processing … All require synchronization!

Synchronization Primitives Common synchronization primitives: –Semaphore / mutex –Condition variable –Barrier

Semaphores A semaphore is a flag that can be raised or lowered in one step Semaphores were flags that railroad engineers would use when entering a shared track

The “Corrected” Example Thread 1: void foo() { sem.lock(); x++; y = x; sem.unlock(); } Thread 2: void bar() { sem.lock(); y++; x++; sem.unlock(); } Global var “Semaphore sem = new Semaphore();” guards access to x & y

Condition Variables A condition variable notifies threads that a particular condition has been met Inform another thread that a queue now contains elements to pull from (or that it’s empty – request more elements!)

The final example Thread 1: void foo() { sem.lock(); x++; y = x; fooDone = true; sem.unlock(); fooFinishedCV.notify(); } Thread 2: void bar() { sem.lock(); while(!fooDone) fooFinishedCV.wait(sem); y++; x++; sem.unlock(); } Global vars: Semaphore sem = new Semaphore(); ConditionVar fooFinishedCV = new ConditionVar(); boolean fooDone = false;
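The Semaphore/ConditionVar calls above are pseudocode; a rough Java equivalent (a sketch, with a hypothetical class name) uses one intrinsic lock for both mutual exclusion and the condition wait:

```java
// Sketch of the same pattern with Java intrinsic locks, not the slide's exact API.
public class FooBar {
    private final Object lock = new Object();
    private int x = 6, y = 0;
    private boolean fooDone = false;

    void foo() {                              // Thread 1
        synchronized (lock) {
            x++;
            y = x;
            fooDone = true;
            lock.notifyAll();                 // signal that the condition changed
        }
    }

    void bar() throws InterruptedException {  // Thread 2
        synchronized (lock) {
            while (!fooDone) {                // re-check: wakeups can be spurious
                lock.wait();                  // releases the lock while waiting
            }
            y++;
            x++;
        }
    }
}
```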

Barriers A barrier knows in advance how many threads it should wait for. Threads “register” with the barrier when they reach it, and fall asleep. Barrier wakes up all registered threads when total count is correct Pitfall: What happens if a thread takes a long time?
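In Java this primitive is java.util.concurrent.CyclicBarrier; a small sketch (the worker count and print statements are illustrative):

```java
import java.util.concurrent.CyclicBarrier;

public class BarrierDemo {
    public static void main(String[] args) {
        int workers = 4;
        // The barrier knows in advance it must wait for 4 threads.
        CyclicBarrier barrier = new CyclicBarrier(workers,
                () -> System.out.println("phase complete"));
        for (int i = 0; i < workers; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    System.out.println("worker " + id + " finished phase 1");
                    barrier.await();  // register and sleep until all workers arrive
                    System.out.println("worker " + id + " starts phase 2");
                } catch (Exception e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }
}
```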

Too Much Synchronization? Deadlock Synchronization becomes even more complicated when multiple locks can be used Can cause entire system to “get stuck” Thread A: semaphore1.lock(); semaphore2.lock(); /* use data guarded by semaphores */ semaphore1.unlock(); semaphore2.unlock(); Thread B: semaphore2.lock(); semaphore1.lock(); /* use data guarded by semaphores */ semaphore1.unlock(); semaphore2.unlock(); (Image: RPI CSCI.4210 Operating Systems notes)
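A hedged sketch of the usual fix: make every thread acquire the locks in the same global order, so the circular wait shown above cannot form (lock names are illustrative).

```java
import java.util.concurrent.locks.ReentrantLock;

public class LockOrdering {
    static final ReentrantLock lock1 = new ReentrantLock();
    static final ReentrantLock lock2 = new ReentrantLock();

    // Both Thread A and Thread B call this: always lock1 before lock2,
    // so neither can hold one lock while waiting forever for the other.
    static void useSharedData() {
        lock1.lock();
        try {
            lock2.lock();
            try {
                /* use data guarded by both locks */
            } finally {
                lock2.unlock();
            }
        } finally {
            lock1.unlock();
        }
    }
}
```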

And if you thought I was joking…

The Moral: Be Careful! Synchronization is hard –Need to consider all possible shared state –Must keep locks organized and use them consistently and correctly Knowing there are bugs may be tricky; fixing them can be even worse! Keeping shared state to a minimum reduces total system complexity

Fundamentals of Networking

Sockets: The Internet = tubes? A socket is the basic network interface Provides a two-way "pipe" abstraction between two applications The client creates a socket and connects to the server, which receives a socket representing the other side of the connection

Ports Within an IP address, a port is a sub-address identifying a listening program Allows multiple clients to connect to a server at once

Example: Web Server (1/3) The server creates a listener socket attached to a specific port. 80 is the agreed-upon port number for web traffic.

Example: Web Server (2/3) The client-side socket is also bound to a port, but the OS chooses a random unused port number for it When the client requests a URL, its OS uses a system called DNS to find the server's IP address.

Example: Web Server (3/3) The accepted connection gets its own socket dedicated to this particular client (the connection is distinguished by the client's address and port) The listener remains ready for more incoming connections, while we process the current connection in parallel
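A minimal Java sketch of the listener/connection pattern above (the port, request handling, and reply are illustrative placeholders, not a real web server):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class TinyServer {
    public static void main(String[] args) throws IOException {
        // Listener socket bound to a fixed port (80 needs privileges; 8080 is illustrative).
        try (ServerSocket listener = new ServerSocket(8080)) {
            while (true) {
                // accept() hands back a new socket for this client;
                // the listener stays available for further connections.
                try (Socket client = listener.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(client.getInputStream()));
                     PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                    String requestLine = in.readLine();     // read one line from the two-way pipe
                    out.println("echo: " + requestLine);    // write a reply back the other way
                }
            }
        }
    }
}
```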

What makes this work? Underneath the socket layer are several more protocols Most important are TCP and IP (which are used hand-in-hand so often, they’re often spoken of as one protocol: TCP/IP) Even more low-level protocols handle how data is sent over Ethernet wires, or how bits are sent through the air using wireless…

IP: The Internet Protocol Defines the addressing scheme for computers Encapsulates internal data in a “packet” Does not provide reliability Just includes enough information for the data to tell routers where to send it

TCP: Transmission Control Protocol Built on top of IP Introduces concept of “connection” Provides reliability and ordering

Why is This Necessary? Not actually tube-like "underneath the hood" Unlike the phone system (circuit switched), the packet-switched Internet uses many routes at once

Networking Issues If a party to a socket disconnects, how much data did they receive? … Did they crash? Or did a machine in the middle? Can someone in the middle intercept/modify our data? Traffic congestion makes switch/router topology important for efficient throughput

Search Engine Technology This presentation © Michael Cafarella Redistributed under the Creative Commons Attribution 3.0 license.

Doug Cutting, creator of Nutch, the open-source search engine described in the following slides (and of Lucene)

Meta-details Built to encourage public search work –Open-source, w/pluggable modules –Cheap to run, both machines & admins Goal: Search more pages, with better quality, than any other engine –Pretty good ranking –Has done ~ 200M pages, more possible Hadoop is a spinoff

Nutch architecture (figure): Inject seeds the WebDB; the WebDB emits one fetchlist per fetcher (Fetchlist/Fetcher 0, 1, 2, ... of N); the fetchers produce Content segments and Updates (0, 1, 2, ... of N) that flow back into the WebDB; Indexers (0, 1, 2, ... of N) build the corresponding Index partitions, which Searchers (0, 1, 2, ... of N) serve behind WebServers (0, 1, 2, ... of M).

Moving Parts Acquisition cycle –WebDB –Fetcher Index generation –Indexing –Link analysis (maybe) Serving results

WebDB Contains info on all pages and links –Per page: URL, last download, # failures, link score, content hash, ref counting –Per link: source hash, target URL Must always be consistent Designed to minimize disk seeks –19ms seek time x 200M new pages/mo = ~44 days of disk seeks! Single-disk WebDB was a huge headache
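The arithmetic behind that figure: 200,000,000 new pages/month x 0.019 s/seek = 3,800,000 s of seeking, and 3,800,000 / 86,400 s per day is roughly 44 days, more time than there is in the month.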

Fetcher Fetcher is very stupid. Not a "crawler" Pre-MapRed: divide the "to-fetch list" into k pieces, one for each fetcher machine URLs for one domain go to the same list, otherwise random –"Politeness" w/o inter-fetcher protocols –Can observe robots.txt similarly –Better DNS, robots caching –Easy parallelism Two outputs: pages, WebDB edits

WebDB/Fetcher Updates (figure). The WebDB holds records of the form (URL, LastUpdated, ContentHash); some entries are new (LastUpdated: Never, ContentHash: None), others carry dates (4/07/05, 3/22/05) and MD5 content hashes. The fetcher emits edits such as DOWNLOAD_CONTENT (a URL plus its newly fetched content hash) and NEW_LINK (a URL with ContentHash: None). The update runs in four steps: 1. Write down fetcher edits. 2. Sort the edits (externally, if necessary). 3. Read the WebDB and edit streams in parallel, emitting the new database; downloaded records come out with LastUpdated: Today! and the new content hash. 4. Repeat for other tables.

Indexing How do we retrieve information from a large document set efficiently?

Document Collection

User Information Need Search inside this news site for articles that talk about culture between China and Japan and do not talk about overseas students. QUERY: –"中国 日本 文化 - 留学生" (China, Japan, culture, NOT overseas students)

How to do it? Could grep all web pages for "中国", "文化" and "日本", then strip out pages containing "留学生"? –Slow (for large corpora) –NOT "留学生" is non-trivial –Other operations (e.g., find "中国" NEAR "日本") not feasible

Document Representation Bag of words model Document-term incidence matrix: one row per document (D1 to D6), one column per term (中国, 文化, 日本, 留学生, 教育, 北京, ...); a cell is 1 if the page contains the word, 0 otherwise

Incidence Vectors Transpose the document-term incidence matrix, so we have a 0/1 vector for each term, with one bit per document (D1 to D6)

Retrieval Search inside this news site for articles that talk about culture between China and Japan and do not talk about overseas students. To answer the query: take the vectors for 中国, 文化, 日本, and the complemented vector for 留学生, then bitwise AND them; the documents whose bit is 1 in the result satisfy the query.
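A hedged Java sketch of that bitwise AND using java.util.BitSet; the six-document incidence vectors here are made up for illustration (the slide's actual bit patterns are not reproduced), chosen so that only D5 satisfies the query.

```java
import java.util.BitSet;

public class BitVectorQuery {
    public static void main(String[] args) {
        int numDocs = 6;  // D1..D6; bit i represents document D(i+1)
        // Hypothetical term incidence vectors (bit set = document contains the term).
        BitSet zhongguo    = BitSet.valueOf(new long[]{0b011101});  // 中国: D1, D3, D4, D5
        BitSet wenhua      = BitSet.valueOf(new long[]{0b010110});  // 文化: D2, D3, D5
        BitSet riben       = BitSet.valueOf(new long[]{0b110100});  // 日本: D3, D5, D6
        BitSet liuxuesheng = BitSet.valueOf(new long[]{0b000110});  // 留学生: D2, D3

        // 中国 AND 文化 AND 日本 AND NOT 留学生
        BitSet result = (BitSet) zhongguo.clone();
        result.and(wenhua);
        result.and(riben);
        BitSet notLiuxuesheng = (BitSet) liuxuesheng.clone();
        notLiuxuesheng.flip(0, numDocs);            // complement within the document range
        result.and(notLiuxuesheng);

        result.stream().forEach(i -> System.out.println("match: D" + (i + 1)));  // prints D5
    }
}
```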

Answer: document D5 satisfies the query.

Let's build a search system! Consider N = 1 million documents, each with about 1K terms. Avg 6 bytes/term including spaces/punctuation –about 6GB of data in the documents. Say there are M = 500K distinct terms among these.

Can't build the matrix A 500K x 1M matrix has half-a-trillion 0's and 1's. But it has no more than one billion 1's. –the matrix is extremely sparse. What's a better representation? –We only record the 1 positions. Why?
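Checking those numbers: 1M documents x 1K terms x 6 bytes is about 6 x 10^9 bytes, roughly the 6GB quoted above; the full incidence matrix would have 500,000 x 1,000,000 = 5 x 10^11 cells, yet at most 1M x 1K = 10^9 of them can be 1, so at least 99.8% of the matrix is zeros.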

Inverted index For each term T: store a list of all documents that contain T. Do we use an array or a list for this? (Figure: postings lists for 中国, 文化, 留学生.) What happens if the word 中国 is added to document 14?

Inverted index Linked lists generally preferred to arrays –Dynamic space allocation –Insertion of terms into documents is easy –The cost is the space overhead of pointers (Figure: the dictionary holds the terms 中国, 文化, 留学生; each points to a postings list sorted by docID, more later on why.)

Inverted index construction (pipeline): Documents to be indexed ("Friends, Romans, countrymen.") → Tokenizer → token stream (Friends, Romans, Countrymen) → Linguistic modules → modified tokens (friend, roman, countryman) → Indexer → inverted index (friend, roman, countryman, each with its postings list)

Indexer steps The indexer first produces a sequence of (modified token, document ID) pairs. Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me." Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious."

Sort by terms. Core indexing step

Multiple term entries in a single document are merged. Frequency information is added.

The result is split into a Dictionary file and a Postings file. Why split?
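A compact Java sketch of these indexer steps, emit (term, docID) pairs, sort them, then merge duplicates into per-term postings, using the two Caesar documents above (tokenization is deliberately naive; everything here is illustrative, not the Lucene/Nutch code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TinyIndexer {
    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "I did enact Julius Caesar I was killed in the Capitol Brutus killed me",
                2, "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious");

        // Step 1: emit a sequence of (term, docID) pairs.
        List<String[]> pairs = new ArrayList<>();
        docs.forEach((id, text) -> {
            for (String token : text.toLowerCase().split("\\s+")) {
                pairs.add(new String[]{token, String.valueOf(id)});
            }
        });

        // Step 2: sort by term, then by docID (the core indexing step).
        pairs.sort((a, b) -> {
            int c = a[0].compareTo(b[0]);
            return c != 0 ? c : Integer.compare(Integer.parseInt(a[1]), Integer.parseInt(b[1]));
        });

        // Step 3: merge duplicate (term, docID) entries into postings lists;
        // the map keys play the role of the dictionary, the lists are the postings.
        Map<String, List<Integer>> postings = new LinkedHashMap<>();
        for (String[] p : pairs) {
            int docId = Integer.parseInt(p[1]);
            List<Integer> list = postings.computeIfAbsent(p[0], t -> new ArrayList<>());
            if (list.isEmpty() || list.get(list.size() - 1) != docId) {
                list.add(docId);
            }
        }
        postings.forEach((term, list) ->
                System.out.println(term + " (df=" + list.size() + ") -> " + list));
    }
}
```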

The index we just built How do we process a Boolean query?

Query processing Consider processing the query: 中国 AND 文化 –Locate 中国 in the Dictionary; retrieve its postings. –Locate 文化 in the Dictionary; retrieve its postings. –"Merge" (intersect) the two postings lists

The merge Walk through the two postings lists simultaneously, in time linear in the total number of postings entries (in the figure, the lists for 中国 and 文化 intersect at docs 2 and 8). If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.
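A Java sketch of that linear merge: two cursors walk the docID-sorted postings lists and the cursor at the smaller docID advances (the example lists are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PostingsMerge {
    // Intersect two docID-sorted postings lists in O(x + y) time.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int da = a.get(i), db = b.get(j);
            if (da == db) { out.add(da); i++; j++; }   // docID appears in both lists
            else if (da < db) i++;                     // advance the cursor on the smaller docID
            else j++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical postings for two query terms.
        System.out.println(intersect(List.of(1, 2, 4, 11, 31, 45, 173),
                                     List.of(2, 31, 54, 101)));   // prints [2, 31]
    }
}
```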

Indexing in Nutch Iterate through all k page sets in parallel, constructing the inverted index Creates a "searchable document" of: –URL text –Content text –Incoming anchor text Other content types might have different document fields –E.g., email has sender/receiver –Any searchable field the end-user will want Uses the Lucene text indexer

Link analysis A page's relevance depends on both intrinsic and extrinsic factors –Intrinsic: page title, URL, text –Extrinsic: anchor text, link graph PageRank is the most famous of many Others include: –HITS –OPIC –Simple incoming link count

Web Graph

PageRank Algorithm
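The figure for this slide is not reproduced here; one common statement of the PageRank recurrence (not necessarily the exact form shown on the slide; see the Brin and Page paper in the Readings for the original formulation), with damping factor d (often 0.85), N pages in total, and L(q) out-links on page q, is:

PR(p) = \frac{1 - d}{N} + d \sum_{q \to p} \frac{PR(q)}{L(q)}

Intuitively, a page is important if important pages link to it; the sum runs over all pages q that link to p, and the recurrence is solved iteratively until the scores converge.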

Link analysis in Nutch Nutch performs analysis in the WebDB –Emit a score for each known page –At index time, incorporate the score into the inverted index Extremely time-consuming –In our case, disk-consuming, too (because we want to use low-memory machines) Link analysis is sexy, but its importance is generally overstated, so Nutch uses something fast and easy: –0.5 * log(# incoming links)

"britney" Query Processing (figure): the index is partitioned across machines holding Docs 0-1M, 1-2M, 2-3M, 3-4M, and 4-5M. The query "britney" is broadcast to every partition; each returns its matching docs (Ds 1, 29; Ds 1.2M, 1.7M; Ds 2.3M, 2.9M; Ds 3.1M, 3.2M; Ds 4.4M, 4.5M), and the front end merges them into one ranked result list (1.2M, 4.4M, 29, ...).

Administering Nutch Admin costs are critical –It's a hassle when you have 25 machines –Google has >100k, probably more Files –WebDB content, working files –Fetchlists, fetched pages –Link analysis outputs, working files –Inverted indices Jobs –Emit fetchlists, fetch, update WebDB –Run link analysis –Build inverted indices

Administering Nutch (2) Admin sounds boring, but it's not! –Really –I swear Large-file maintenance –Google File System (Ghemawat, Gobioff, Leung) –Nutch Distributed File System Job Control –Map/Reduce (Dean and Ghemawat) –Pig (Yahoo Research) Data Storage (BigTable)

Nutch Distributed File System Similar, but not identical, to GFS Requirements are fairly strange –Extremely large files –Most files read once, from start to end –Low admin costs per GB Equally strange design –Write-once, with delete –Single file can exist across many machines –Wholly automatic failure recovery

NDFS (2) Data divided into blocks Blocks can be copied, replicated Datanodes hold and serve blocks Namenode holds meta info –Filename → block list –Block → datanode location(s) Datanodes report in to the namenode every few seconds

NDFS File Read (figure: a namenode and datanodes 0-5). 1. Client asks the namenode for info on a filename, here "crawl.txt". 2. Namenode responds with the blocklist and the location(s) of each block: block-33 on datanodes 1 and 4; block-95 on datanodes 0 and 2; block-65 on datanodes 1, 4, and 5. 3. Client fetches each block, in sequence, from a datanode.

NDFS Replication (figure: a namenode, with datanode 0 holding blocks 33, 95; datanode 1: 46, 95; datanode 2: 33, 104; datanode 3: 21, 33, 46; datanode 4: 90; datanode 5: 21, 90, 104). 1. Always keep at least k copies of each block. 2. Imagine datanode 4 dies; block 90 is lost. 3. The namenode loses the heartbeat and decrements block 90's reference count, then asks datanode 5 to replicate block 90 to datanode 0. 4. Choosing the replication target is tricky.

Map/Reduce Map/Reduce is a programming model from Lisp (and other places) –Easy to distribute across nodes –Nice retry/failure semantics map(key, val) is run on each item in the input set –emits key/val pairs reduce(key, vals) is run for each unique key emitted by map() –emits the final output Many problems can be phrased this way

Map/Reduce (2) Task: count words in docs –Input consists of (url, contents) pairs –map(key=url, val=contents): For each word w in contents, emit (w, "1") –reduce(key=word, values=uniq_counts): Sum all the "1"s in the values list Emit result "(word, sum)"
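A self-contained Java sketch of that word-count pattern; the map and reduce functions mirror the pseudocode above rather than any particular framework's API, and the "shuffle" is simulated with an in-memory grouping:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountMR {
    // map(key=url, val=contents): for each word w in contents, emit (w, 1).
    static List<Map.Entry<String, Integer>> map(String url, String contents) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String w : contents.toLowerCase().split("\\s+")) {
            out.add(Map.entry(w, 1));
        }
        return out;
    }

    // reduce(key=word, values): sum the 1s and emit (word, sum).
    static Map.Entry<String, Integer> reduce(String word, List<Integer> values) {
        return Map.entry(word, values.stream().mapToInt(Integer::intValue).sum());
    }

    public static void main(String[] args) {
        Map<String, String> input = Map.of(
                "url1", "to be or not to be",
                "url2", "to see or not to see");

        // Map phase, then shuffle: group every emitted value by its key.
        Map<String, List<Integer>> grouped = input.entrySet().stream()
                .flatMap(e -> map(e.getKey(), e.getValue()).stream())
                .collect(Collectors.groupingBy(Map.Entry::getKey,
                        Collectors.mapping(Map.Entry::getValue, Collectors.toList())));

        // Reduce phase: one reduce() call per unique key.
        grouped.forEach((word, counts) -> System.out.println(reduce(word, counts)));
    }
}
```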

Map/Reduce (3) Task: grep –Input consists of (url+offset, single line) –map(key=url+offset, val=line): If the line matches the regexp, emit (line, "1") –reduce(key=line, values=uniq_counts): Don't do anything; just emit the line We can also do graph inversion, link analysis, WebDB updates, etc.

Map/Reduce (4) How is this distributed? 1.Partition input key/value pairs into chunks, run map() tasks in parallel 2.After all map()s are complete, consolidate all emitted values for each unique emitted key 3.Now partition space of output map keys, and run reduce() in parallel If map() or reduce() fails, reexecute!

Map/Reduce Job Processing (figure: a JobTracker and TaskTrackers 0-5). 1. Client submits the "grep" job, indicating code and input files. 2. JobTracker breaks the input file into k chunks (in this case 6) and assigns work to the tasktrackers. 3. After map(), the tasktrackers exchange map output to build the reduce() keyspace. 4. JobTracker breaks the reduce() keyspace into m chunks (in this case 6) and assigns work. 5. reduce() output may go to NDFS.

Nutch & Hadoop NDFS stores the crawl and indexes MapReduce for indexing, parsing, WebDB construction, even fetching –Broke previous 200M/mo limit –Index-serving? Required massive rewrite of almost every Nutch component

Summary Parallelization & Synchronization Fundamentals of Networking Search Engine Technology –Inverted index –PageRank algorithm

Readings [Brin and Page, 1998] S. Brin and L. Page, "The anatomy of a large-scale hypertextual Web search engine," in Proceedings of the 7th International World Wide Web Conference / Computer Networks, Amsterdam, 1998.

Q&A