1 CLOUD COMPUTING ARCHITECTURES & APPLICATIONS
LECTURERS: LAZAR KIRCHEV, PhD; ILIYAN NENOV; KRUM BAKALSKY
14 March 2011
LECTURE #2: PARALLEL DATA PROCESSING. MAPREDUCE – THEORY AND APPLICATIONS.

2 OUTLINE
Parallel Processing
- Types of parallel architectures
- Types of parallelism
- Synchronization between parallel tasks
MapReduce
- Functional programming
- Theory behind MapReduce
- Examples of applications of MapReduce

3 Parallel Processing

4 Flynn's Taxonomy of Computers

                      | Instructions applied: single    | Instructions applied: multiple
  Data: single        | SISD – single-threaded process  | MISD – pipeline architecture (uncommon)
  Data: multiple      | SIMD – vector processing        | MIMD – multi-threaded programming

5 SISD (diagram slide)

6 SIMD (diagram slide)

7 MISD (diagram slide)

8 Data parallelism
- At the micro level, independent algebraic operations commute: they can be processed in any order.
- If commutative operations are applied to different memory addresses, they can also occur at the same time.
- Compilers and CPUs often exploit this automatically.
Example: in x := (a * b) + (y * z), computation A, (a * b), and computation B, (y * z), are independent and can execute simultaneously.

9 Higher-level Parallelism
Commutativity can apply to larger operations. If foo() and bar() do not manipulate the same memory, there is no reason why they cannot occur at the same time:
x := foo(a) + bar(b)
Here computation A, foo(a), and computation B, bar(b), can run in parallel.
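As a minimal illustration (not from the original slides), a Python sketch of running two independent computations concurrently; foo and bar are placeholder functions, and for CPU-bound work a ProcessPoolExecutor would replace the thread pool:

    from concurrent.futures import ThreadPoolExecutor

    def foo(a):
        return a * a        # placeholder for an independent computation

    def bar(b):
        return b + 10       # placeholder for another independent computation

    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(foo, 3)            # computation A
        fut_b = pool.submit(bar, 4)            # computation B
        x = fut_a.result() + fut_b.result()    # join: combine both results
    print(x)  # 23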

10 Task-Level Parallelism
- Dividing work into larger "tasks" identifies logical units to parallelize as threads.
- Intelligent task design eliminates as many synchronization points as possible, but some are inevitable.
- Independent tasks can run on different physical machines, in distributed fashion.
- Good task design requires identifying common data and functionality that should move as a unit.
(Diagram: Task A with synchronization points; the gaps between them represent unexploited parallelism.)

11 Patterns for Parallelism – Master/Workers
One object, called the master, initially owns all the data. It:
- creates several workers to process individual elements;
- waits for the workers to report their results back.
(Diagram: a master dispatching to worker threads.)
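A minimal Python sketch of the pattern (the data and the per-element computation are illustrative assumptions):

    from multiprocessing import Pool

    def process(item):
        # worker: handle one element; a stand-in computation
        return item * item

    if __name__ == "__main__":
        data = list(range(10))                    # the master owns all the data
        with Pool(processes=4) as workers:        # create several workers
            results = workers.map(process, data)  # wait for results to come back
        print(results)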

12 Patterns for Parallelism – Producer/Consumer Flow
- Producer threads create work items.
- Consumer threads process them.
(Diagram: producer threads P feeding consumer threads C.)

13 Patterns for Parallelism – Work Queues
- All ready consumers should be available to process data from any producer.
- A shared work queue removes the 1:1 coupling between producers and consumers.
(Diagram: producers P and consumers C sharing a single queue.)
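A compact Python sketch of a shared work queue (the item values and the doubling step are illustrative):

    import queue
    import threading

    work = queue.Queue()   # the shared queue decoupling producers from consumers
    SENTINEL = None        # signals a consumer to stop

    def producer(items):
        for item in items:
            work.put(item)

    def consumer(results):
        while True:
            item = work.get()
            if item is SENTINEL:
                break
            results.append(item * 2)   # stand-in processing

    results = []
    producers = [threading.Thread(target=producer, args=(range(i, i + 5),)) for i in (0, 5, 10)]
    consumers = [threading.Thread(target=consumer, args=(results,)) for _ in range(2)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()
    for _ in consumers:
        work.put(SENTINEL)   # one stop signal per consumer
    for t in consumers:
        t.join()
    print(sorted(results))   # all 15 items, each processed exactly once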

14 Types of synchronous parallel tasks
- Globally synchronous: on each iteration, every task needs the state of all other tasks.
- Locally synchronous: on each iteration, each task needs the state of several other tasks.
- Asynchronous: tasks proceed without per-iteration synchronization.

15 MapReduce

16 Functional programming review
- Functional operations do not modify data structures: they always create new ones.
- The original data still exists, in unmodified form.
- Data flows are implicit in the program design.
- The order of operations does not matter.
- List processing is inherent to functional programming (LISP).

17 Functional programming – higher-order functions
Functions can be used as arguments:

    (define (doDouble f x)
      (f (f x)))

It does not matter what f does to its argument; doDouble applies it twice.
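For example (an illustrative call, not on the original slide):

    (doDouble (lambda (x) (* 2 x)) 5)   ; => 20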

18 Functional programming – Map
Creates a new list by applying a function to each element of the input list; returns the outputs in order.

    (define (map f l)
      (if (null? l)
          '()
          (cons (f (car l))
                (map f (cdr l)))))

19 Functional programming – Accumulate (reduce)
Moves across a list, applying a function to each element plus an accumulator. The function returns the next accumulator value, which is combined with the next element of the list.

    (define (accumulate comb base l)
      (if (null? l)
          base
          (comb (car l)
                (accumulate comb base (cdr l)))))

20 Functional programming – Accumulate (reduce) examples
Sum of all elements of a list:

    (define (sum l) (accumulate + 0 l))

Product of all elements of a list (note the base must be 1, not 0, or the result is always 0):

    (define (prod l) (accumulate * 1 l))

Length of a list:

    (define (len l) (accumulate (lambda (x y) (+ 1 y)) 0 l))

21 Implicit Parallelism in map
- In a purely functional setting, the elements of a list being computed by map cannot see the effects of the computations on the other elements.
- If the order in which f is applied to the elements does not matter, we can reorder or parallelize execution.
- This is the "secret" that MapReduce exploits.
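A minimal Python sketch of this observation: applying a pure function to list elements with a process pool gives the same result as a sequential map (the function itself is an illustrative assumption):

    from multiprocessing import Pool

    def f(x):
        # pure: no shared state, so the order of application is irrelevant
        return x * x

    if __name__ == "__main__":
        with Pool() as pool:
            print(pool.map(f, range(8)))   # same result as list(map(f, range(8)))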

22 Motivation for MapReduce: Large-Scale Data Processing
- We want to process lots of data (> 1 TB).
- We want to parallelize across hundreds or thousands of CPUs.
- ... and we want to make this easy.

23 MapReduce programming model
Borrows from functional programming. Users implement two functions:

    map    (in_key, in_value)                    -> list of (out_key, intermediate_value)
    reduce (out_key, list of intermediate_value) -> list of out_value

24 Map
- Records from the data source (lines of files, rows of a database, etc.) are fed into the map function as key/value pairs, e.g. (filename, line).
- map() produces one or more intermediate values, along with an output key, from the input.

25 Reduce
- After the map phase is over, all the intermediate values for a given output key are combined into a list.
- reduce() combines those intermediate values into one or more final values for that same output key (in practice, usually one final value per key).
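To make the two phases concrete, here is a minimal single-machine sketch of the model in Python. It illustrates the semantics only, not Google's distributed implementation:

    from collections import defaultdict

    def map_reduce(records, mapper, reducer):
        # Map phase: apply the mapper to every input record.
        intermediate = defaultdict(list)
        for key, value in records:
            for out_key, inter_value in mapper(key, value):
                intermediate[out_key].append(inter_value)   # shuffle: group by key
        # Reduce phase: apply the reducer to each key's list of values.
        return {k: reducer(k, vs) for k, vs in intermediate.items()}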

26 MapReduce diagram (diagram slide)

27 Parallelism in MapReduce
- map() functions run in parallel, creating different intermediate values from different input data sets.
- reduce() functions also run in parallel, each working on a different output key.
- All values are processed independently.
- Bottleneck: the reduce phase cannot start until the map phase is completely finished.

28 Example – word count with MapReduce

    map(String input_key, String input_value):
        // input_key: document name
        // input_value: document contents
        for each word w in input_value:
            EmitIntermediate(w, "1");

    reduce(String output_key, Iterator intermediate_values):
        // output_key: a word
        // intermediate_values: a list of counts
        int result = 0;
        for each v in intermediate_values:
            result += ParseInt(v);
        Emit(AsString(result));
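A runnable Python version of the same example, assuming the map_reduce helper sketched after slide 25 is in scope:

    def wc_map(doc_name, contents):
        for word in contents.split():
            yield word, 1

    def wc_reduce(word, counts):
        return sum(counts)

    docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog the end")]
    print(map_reduce(docs, wc_map, wc_reduce))
    # {'the': 3, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1, 'end': 1}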

29 Example – distributed grep with MapReduce
- The map function emits a line if it matches a supplied pattern.
- The reduce function is an identity function that just copies the supplied intermediate data to the output.
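Sketched in the same style (the pattern and record format are illustrative assumptions):

    import re

    PATTERN = re.compile(r"error")   # the supplied pattern

    def grep_map(filename, line):
        if PATTERN.search(line):
            yield line, ""           # emit matching lines

    def grep_reduce(line, values):
        return line                  # identity: copy matches through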

30 Example – count of URL access frequency
- The map function processes logs of web page requests and outputs a (URL, 1) pair for each request.
- The reduce function adds together all the values for the same URL and emits a (URL, total count) pair.

31 Example – Reverse Web-Link Graph
- The map function outputs a (target, source) pair for each link to a target URL found in a page named source.
- The reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)).

32 Example – inverted index
- The map function parses each document and emits a sequence of (word, document ID) pairs.
- The reduce function accepts all pairs for a given word, sorts the corresponding document IDs, and emits a (word, list(document ID)) pair.
- The set of all output pairs forms a simple inverted index.
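A sketch of these two functions in Python (the document representation is an illustrative assumption; the pair plugs into the map_reduce helper sketched after slide 25):

    def index_map(doc_id, contents):
        for word in set(contents.split()):   # emit each (word, doc) pair once
            yield word, doc_id

    def index_reduce(word, doc_ids):
        return sorted(doc_ids)               # sorted posting list for this word

    docs = [("d1", "cloud map reduce"), ("d2", "map reduce theory")]
    # map_reduce(docs, index_map, index_reduce)
    # -> {'cloud': ['d1'], 'map': ['d1', 'd2'], 'reduce': ['d1', 'd2'], 'theory': ['d2']}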

33 PageRank
- If a user starts at a random web page and surfs by clicking links and randomly entering new URLs, what is the probability that he or she will arrive at a given page?
- The PageRank of a page captures this notion.
- More "popular" or "worthwhile" pages get a higher rank.

34 PageRank
Given a page A, and pages T1 through Tn linking to A, PageRank is defined as:

    PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

- C(P) is the out-degree of page P (the number of links leaving it).
- d is the damping ("random URL") factor; it is tunable (usually 0.85).
- The calculation is iterative: PR_{i+1} is based on PR_i.
- Each page distributes its PR_i to all pages it links to. Linkees add up the rank fragments awarded to them to find their PR_{i+1}.
http://infolab.stanford.edu/~backrub/google.html

35 PageRank
- Create two tables, 'current' and 'next', holding the PageRank for each page. Seed 'current' with initial PR values.
- Iterate over all pages in the graph, distributing PR from 'current' into 'next' for the linkees of each page.
- current := next; next := fresh_table();
- Go back to the iteration step, or end if converged.
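A compact single-machine sketch of this loop in Python (the example graph, seed value, and convergence test are illustrative assumptions):

    def pagerank(links, d=0.85, tol=1e-6):
        # links: dict mapping each page to the list of pages it links to
        pages = list(links)
        current = {p: 1.0 for p in pages}           # seed 'current'
        while True:
            next_ = {p: 1.0 - d for p in pages}     # base (1 - d) term
            for p, outs in links.items():
                share = d * current[p] / len(outs)  # this page's PR fragment
                for q in outs:
                    next_[q] += share               # distribute to linkees
            if max(abs(next_[p] - current[p]) for p in pages) < tol:
                return next_                        # converged
            current = next_                         # current := next

    print(pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]}))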

36 Parallelization of PageRank
- The 'next' table depends on 'current', but not on any other rows of 'next'.
- Individual rows of the adjacency matrix can be processed in parallel.
- Sparse matrix rows are relatively small.
- Therefore: we can map each row of 'current' to a list of PageRank "fragments" to assign to its linkees.
- These fragments can be reduced into a single PageRank value for a page by summing them.

37 Parallelization of PageRank (diagram slide)

38 PageRank
Phase 1: Parse HTML
- The map task takes (URL, page content) pairs and maps them to (URL, (PR_init, list-of-urls)), where PR_init is the "seed" PageRank for URL and list-of-urls contains all pages pointed to by URL.
- The reduce task is just the identity function.
Phase 2: PageRank Distribution
- The map task takes (URL, (cur_rank, url_list)). For each u in url_list, it emits (u, cur_rank/|url_list|). It also emits (URL, url_list) to carry the points-to list along through the iterations.
- The reduce task gets one (URL, url_list) and many (URL, val) values. It sums the vals, fixes the result up with the damping factor d, and emits (URL, (new_rank, url_list)).
Convergence check
- A non-parallelizable component determines whether convergence has been achieved (a fixed number of iterations? comparison of key values?).
- If so, write out the PageRank lists – done! Otherwise, feed the output of Phase 2 into another Phase 2 iteration.
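A Python sketch of the Phase 2 pair (the tagged-value encoding used to mix rank fragments with the points-to list is an illustrative assumption):

    D = 0.85  # damping factor

    def pr_map(url, value):
        cur_rank, url_list = value
        yield url, ("links", url_list)                   # carry the points-to list along
        for u in url_list:
            yield u, ("frag", cur_rank / len(url_list))  # rank fragment for each linkee

    def pr_reduce(url, values):
        url_list, total = [], 0.0
        for tag, v in values:
            if tag == "links":
                url_list = v
            else:
                total += v
        new_rank = (1 - D) + D * total                   # fix up with d
        return (new_rank, url_list)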

39 Breadth-first search
- Breadth-First Search (BFS) is an iterated algorithm over graphs: the frontier advances from the origin by one level with each pass.
- MapReduce implementation: iterated passes through MapReduce – map some nodes; the result includes additional nodes, which are fed into successive MapReduce passes.
- Represent the graph with a sparse matrix: for each row, store the list of column numbers where the value is non-zero, e.g. 1: 3, 18, 200, ...

40 Breadth-first search – shortest path in a graph
Algorithm:
- DistanceTo(startNode) = 0
- For all nodes n directly reachable from startNode, DistanceTo(n) = 1
- For all nodes n reachable from some other set of nodes S, DistanceTo(n) = 1 + min(DistanceTo(m), m ∈ S)
MapReduce implementation:
- A map task receives a node n as its key and (D, points-to) as its value, where D is the distance to the node from the start and points-to is a list of nodes reachable from n.
- For each p in points-to, it emits (p, D+1).
- The reduce task gathers the possible distances to a given p and selects the minimum one.
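One pass of this step sketched in Python (the tagged-value encoding is an illustrative assumption; as slide 41 notes, the mapper must also re-emit the node's own distance and adjacency list so they survive the pass):

    INF = float("inf")   # distance of nodes not yet reached

    def bfs_map(n, value):
        d, points_to = value
        yield n, ("node", d, points_to)   # re-emit the node's own state
        for p in points_to:
            yield p, ("dist", d + 1)      # candidate distance for each neighbor

    def bfs_reduce(n, values):
        best, points_to = INF, []
        for v in values:
            if v[0] == "node":
                best = min(best, v[1])    # current known distance
                points_to = v[2]          # preserve the adjacency list
            else:
                best = min(best, v[1])    # candidate distance from a neighbor
        return (best, points_to)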

41 Breadth-first search – shortest path in a graph
- The MapReduce task advances the known frontier by one hop.
- To perform the whole BFS, a non-MapReduce component feeds the output of this step back into the MapReduce task for another iteration.
- The mapper emits (n, points-to) as well, so that the graph structure is preserved across passes.
Termination:
- This algorithm starts from one node; subsequent iterations include many more nodes of the graph as the frontier advances. Does this ever terminate?
- Eventually, routes between nodes stop being discovered and no better distances are found. When the distances stop changing, we stop.
- The mapper should also emit (n, D), to ensure that the "current distance" is carried into the reducer.

42 END OF LECTURE #2

43 SOURCES
The information in this document is compiled using various public sources, freely available on the Internet. These sources include:
- http://www.scribd.com/doc/17929394/Cloud-Computing-Use-Cases-Whitepaper
- http://www.enisa.europa.eu/act/rm/files/deliverables/cloud-computing-risk-assessment
- http://code.google.com/edu/parallel/index.html
- Google: Cluster Computing and MapReduce: http://code.google.com/edu/submissions/mapreduce-minilecture/listing.html
- Google Course: MapReduce in a Week: http://code.google.com/edu/submissions/mapreduce/listing.html
- Intensive MapReduce course at MIT: http://mr.iap.2008.googlepages.com
- Hadoop Virtual Image Documentation: http://code.google.com/edu/parallel/tools/hadoopvm/index.html
- http://www.umiacs.umd.edu/~jimmylin/cloud-computing
- Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, Christos Kozyrakis, Evaluating MapReduce for Multi-core and Multiprocessor Systems: http://csl.stanford.edu/~christos/publications/2007.cmp_mapreduce.hpca.pdf
- http://www.dbms2.com/2008/08/26/why-mapreduce-matters-to-sql-data-warehousing
- Bingsheng He, Wenbin Fang, Qiong Luo, Mars: A MapReduce Framework on Graphics Processors: http://www.cse.ust.hk/catalac/users/saven/GPGPU/MapReduce/PACT08/171.pdf
- Hung-chih Yang, Ali Dasdan, Map-reduce-merge: Simplified Relational Data Processing on Large Clusters: http://portal.acm.org/citation.cfm?doid=1247480.1247602
- Foto N. Afrati, Jeffrey D. Ullman, A New Computation Model for Rack-Based Computing: http://infolab.stanford.edu/~ullman/pub/mapred.pdf
- Ralf Lammel, Google's MapReduce Programming Model Revisited: http://www.cs.vu.nl/~ralf/MapReduce/paper.pdf
- http://www.baselinemag.com/c/a/Infrastructure/How-Google-Works-1
- Joe Hellerstein, Parallel Programming in the Age of Big Data: http://gigaom.com/2008/11/09/mapreduce-leads-the-way-for-parallel-programming
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters: https://sites.google.com/a/colgate.edu/cloudintro/Home

© 2011 COPYRIGHTS DISCLAIMER
The information in this document is proprietary to Sofia University "Sv. Kliment Ohridski" (called THE UNIVERSITY below), http://uni-sofia.bg. THE UNIVERSITY assumes no responsibility for errors or omissions in this document. THE UNIVERSITY does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement. This document is used only for educational purposes related to the master's programs of THE UNIVERSITY, Faculty of Mathematics and Informatics (http://fmi.uni-sofia.bg). This document is compiled using various public sources freely available on the Internet or offered by SAP AG. This document is not used directly or indirectly for any type of commercial use. THE UNIVERSITY shall have no liability for damages of any kind, including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. This limitation shall not apply in cases of intent or gross negligence. The statutory liability for personal injury and defective products is not affected. THE UNIVERSITY has no control over the information that you may access through the use of hot links contained in these materials and does not endorse your use of third-party Web pages nor provide any warranty whatsoever relating to third-party Web pages.


