COMP9313: Big Data Management
Lecturer: Xin Cao
Course web site: http://www.cse.unsw.edu.au/~cs9313/

Chapter 12: Revision and Exam Preparation

Learning outcomes
After completing this course, you are expected to:
- elaborate the important characteristics of Big Data (Chapter 1)
- develop an appropriate storage structure for a Big Data repository (Chapter 5)
- utilize the map/reduce paradigm and the Spark platform to manipulate Big Data (Chapters 2, 3, 4, and 7)
- use a high-level query language to manipulate Big Data (Chapter 6)
- develop efficient solutions for analytical problems involving Big Data (Chapters 8-11)

Final exam
- Final written exam (100 pts)
- Six questions in total, on six topics
- Three hours
- You can bring the printed lecture notes
- If you are ill on the day of the exam, do not attend it. I will not accept medical special-consideration claims from anyone who has already attempted the exam.

Topic 1: MapReduce (Chapters 2 and 3)

Data Structures in MapReduce
- Key-value pairs are the basic data structure in MapReduce
  - Keys and values can be integers, floats, strings, or raw bytes
  - They can also be arbitrary data structures
- The design of MapReduce algorithms involves imposing the key-value structure on arbitrary datasets
  - E.g., for a collection of Web pages, input keys may be URLs and values may be the HTML content
  - In some algorithms input keys are not used; in others they uniquely identify a record
  - Keys can be combined in complex ways to design various algorithms

Map and Reduce Functions
Programmers specify two functions:
- map (k1, v1) → list [<k2, v2>]
  - Map transforms the input into key-value pairs to process
- reduce (k2, [v2]) → [<k3, v3>]
  - Reduce aggregates the list of values for each key
  - All values with the same key are sent to the same reducer
Optionally, also:
- combine (k2, [v2]) → [<k3, v3>]
  - Mini-reducers that run in memory after the map phase
  - Used as an optimization to reduce network traffic
- partition (k2, number of partitions) → partition for k2
  - Often a simple hash of the key, e.g., hash(k2) mod n
  - Divides up the key space for parallel reduce operations
The execution framework handles everything else...
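
To make the signatures concrete, here is a minimal word-count sketch in plain Scala (illustrative only, not the Hadoop API; groupBy stands in for the framework's sort/shuffle):

    // map(k1, v1) -> list[<k2, v2>]: emit (word, 1) for every word
    val docs = List(("doc1", "big data big"), ("doc2", "data"))
    val mapped = docs.flatMap { case (_, text) => text.split(" ").map(w => (w, 1)) }

    // sort/shuffle: the framework groups all values by key
    val grouped = mapped.groupBy(_._1).map { case (w, ps) => (w, ps.map(_._2)) }

    // reduce(k2, [v2]) -> [<k3, v3>]: sum the counts for each word
    val counts = grouped.map { case (w, vs) => (w, vs.sum) }
    // counts: Map(big -> 2, data -> 2)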

Combiners
- Often a Map task will produce many pairs of the form (k,v1), (k,v2), ... for the same key k
  - E.g., popular words in the word count example
- Combiners are a general mechanism to reduce the amount of intermediate data, thus saving network time
  - They can be thought of as "mini-reducers"
- Warning! The use of combiners must be thought through carefully
  - They are optional in Hadoop: the correctness of the algorithm cannot depend on the computation (or even execution) of the combiners
  - A combiner operates on each map output key and must have the same output key-value types as the Mapper class
  - A combiner can produce summary information from a large dataset because it replaces the original Map output
  - A combiner works only if the reduce function is commutative and associative (see the sketch below)
  - In general, reducer and combiner are not interchangeable
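
A quick illustration (plain Scala, made-up numbers) of why correctness requires a commutative and associative reduce function: averaging per-partition averages is wrong, while combining (sum, count) pairs is safe:

    val partitions = List(List(1.0, 2.0), List(3.0))

    // WRONG as a combiner: mean is not associative
    val meanOfMeans = partitions.map(p => p.sum / p.size).sum / partitions.size // 2.25
    val trueMean = partitions.flatten.sum / partitions.flatten.size             // 2.0

    // Correct: combine partial (sum, count) pairs, then divide once at the reducer
    val (s, n) = partitions.map(p => (p.sum, p.size))
                           .reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    val mean = s / n                                                            // 2.0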

Partitioner
- The partitioner controls the partitioning of the keys of the intermediate map outputs
  - The key (or a subset of the key) is used to derive the partition, typically by a hash function
  - The total number of partitions is the same as the number of reduce tasks for the job; this controls which of the m reduce tasks the intermediate key (and hence the record) is sent to for reduction
- The system uses HashPartitioner by default: hash(key) mod R
- Sometimes it is useful to override the hash function:
  - E.g., hash(hostname(URL)) mod R ensures URLs from the same host end up in the same output file
- The job sets the Partitioner implementation (in Main)
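
A sketch of the idea behind overriding the partitioner (plain Scala; hostname and partitionFor are hypothetical helpers, not the Hadoop Partitioner API):

    def hostname(url: String): String = new java.net.URI(url).getHost

    def partitionFor(url: String, numReduceTasks: Int): Int =
      (hostname(url).hashCode & Integer.MAX_VALUE) % numReduceTasks // non-negative hash mod R

    // URLs from the same host always go to the same reduce task:
    partitionFor("http://example.com/a", 10) == partitionFor("http://example.com/b", 10) // true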

A Brief View of MapReduce

MapReduce Data Flow

Design Pattern 1: Local Aggregation
- Programming control: in-mapper combining provides control over when local aggregation occurs and how exactly it takes place
  - In contrast, Hadoop makes no guarantees on how many times the combiner is applied, or that it is even applied at all
- More efficient than combiners:
  - The mappers generate only those key-value pairs that need to be shuffled across the network to the reducers; combiners do not reduce the number of key-value pairs emitted by the mappers in the first place
  - There is no additional overhead due to the materialization of key-value pairs
- Scalability issue (not suitable for huge data): more memory is required for a mapper to store intermediate results
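
A minimal sketch of in-mapper combining for word count (plain Scala standing in for a Hadoop Mapper; map and close mirror the Mapper's map() and cleanup() hooks):

    import scala.collection.mutable

    val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

    def map(docId: String, text: String): Unit =
      for (word <- text.split(" ")) counts(word) += 1 // aggregate locally instead of emitting (word, 1)

    def close(): Iterator[(String, Int)] = counts.iterator // emit once, when the map task finishes

Note the scalability caveat above: counts must fit in the mapper's memory.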

Design Pattern 2: Pairs vs Stripes
- The pairs approach
  - Keeps track of each term co-occurrence separately
  - Generates a large number of key-value pairs (also intermediate)
  - The benefit from combiners is limited, as it is less likely for a mapper to process multiple occurrences of a word
- The stripes approach
  - Keeps track of all terms that co-occur with the same term
  - Generates fewer and shorter intermediate keys, so the framework has less sorting to do
  - Greatly benefits from combiners, as the key space is the vocabulary
  - More efficient, but may suffer from memory problems
- These two design patterns are broadly useful and frequently observed in a variety of applications: text processing, data mining, and bioinformatics
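
A small sketch contrasting the two patterns for co-occurrence counting (plain Scala; for illustration the "window" is simply all other terms in the list):

    val terms = List("a", "b", "a")

    // Pairs: one key-value pair per co-occurring pair -> many small records
    val pairs = for (i <- terms.indices; j <- terms.indices if i != j)
                yield ((terms(i), terms(j)), 1)

    // Stripes: one associative array per term -> fewer, larger records
    val stripes = pairs.groupBy(_._1._1).map { case (w, ps) =>
      (w, ps.groupBy(_._1._2).map { case (u, xs) => (u, xs.size) })
    }
    // stripes: Map(a -> Map(b -> 2, a -> 2), b -> Map(a -> 2))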

Design Pattern 3: Order Inversion
- Common design pattern: computing relative frequencies requires marginal counts, but the marginal cannot be computed until you have seen all counts, and buffering is a bad idea!
- Trick: get the marginal counts to arrive at the reducer before the joint counts
- Caution: you need to guarantee that all key-value pairs relevant to the same term are sent to the same reducer
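
A toy sketch of the trick (plain Scala; sortBy stands in for the framework's key ordering, and the counts are made up). The marginal is emitted under a special key (w, "*") that sorts before every joint key (w, u), so it reaches the reducer first:

    val emitted = List((("b", "*"), 2), (("b", "c"), 1), (("a", "*"), 1),
                       (("a", "b"), 1), (("b", "a"), 1))

    var marginal = 0
    for (((w, u), c) <- emitted.sortBy { case ((w, u), _) => (w, u) }) // "*" sorts first
      if (u == "*") marginal = c                         // count(w) arrives before any (w, u)
      else println(s"f($u|$w) = ${c.toDouble / marginal}")
    // f(b|a) = 1.0, f(a|b) = 0.5, f(c|b) = 0.5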

Design Pattern 4: Value-to-Key Conversion
- Put the value as part of the key, and make Hadoop do the sorting for us
- Provides a scalable solution for secondary sorting
- Caution: you need to guarantee that all key-value pairs relevant to the same term are sent to the same reducer
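
A toy sketch of value-to-key conversion (plain Scala; sortBy stands in for the shuffle's sorting of the composite key). To obtain each sensor's readings ordered by timestamp, move the timestamp from the value into the key:

    val readings = List(("s1", 3, 20.1), ("s1", 1, 19.8), ("s2", 2, 22.0)) // (sensor, time, value)

    // composite key (sensor, time); the framework sorts by it during the shuffle
    val sorted = readings.sortBy { case (sensor, time, _) => (sensor, time) }
    // a custom partitioner on sensor alone keeps each sensor's readings in one reducer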

Topic 2: Spark (Chapter 7)

Data Sharing in MapReduce
- Slow due to replication, serialization, and disk IO
- Complex apps, streaming, and interactive queries all need one thing that MapReduce lacks: efficient primitives for data sharing

Data Sharing in Spark
- Using RDDs, data sharing is 10-100× faster than going through the network and disk

What is RDD
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, et al. NSDI'12
An RDD is a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.
- Resilient: fault-tolerant; able to recompute missing or damaged partitions due to node failures
- Distributed: data resides on multiple nodes in a cluster
- Dataset: a collection of partitioned elements, e.g., tuples or other objects (that represent records of the data you work with)
The RDD is the primary data abstraction in Apache Spark and the core of Spark. It enables operations on collections of elements in parallel.

RDD Operations
- Transformation: returns a new RDD. Nothing gets evaluated when you call a transformation function; it just takes an RDD and returns a new RDD. Transformation functions include map, filter, flatMap, groupByKey, reduceByKey, aggregateByKey, join, etc.
- Action: evaluates and returns a new value. When an action function is called on an RDD object, all the data processing queries are computed at that time and the result value is returned. Action operations include reduce, collect, count, first, take, countByKey, foreach, saveAsTextFile, etc.

Working with RDDs
- Create an RDD from a data source: by parallelizing existing collections (e.g., Python lists), by transforming existing RDDs, or from files in HDFS or any other storage system
- Apply transformations to an RDD: e.g., map, filter
- Apply actions to an RDD: e.g., collect, count
- Users can control two other aspects: persistence and partitioning
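
A short Spark (Scala) sketch tying these together (the HDFS path is a placeholder):

    val nums = sc.parallelize(List(1, 2, 3, 4))     // create an RDD from a collection
    val lines = sc.textFile("hdfs://.../input.txt") // or from HDFS / other storage

    val doubled = nums.map(_ * 2)                   // transformation: nothing runs yet
    val evens = doubled.filter(_ % 4 == 0)          // still lazy

    evens.persist()                                 // control persistence: cache once computed
    println(evens.count())                          // action: the whole chain executes now
    println(evens.collect().mkString(","))          // reuses the cached partitions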

More Examples on Pair RDDs
- Create a pair RDD from existing RDDs. Output? (shown below)
- reduceByKey(): reduces key-value pairs by key using the given func
- mapValues(): works on the values only
- groupByKey(): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable<V>) pairs

    val pairs = sc.parallelize(List(("This", 2), ("is", 3), ("Spark", 5), ("is", 3)))
    pairs.collect().foreach(println)
    val pairs1 = pairs.reduceByKey((x, y) => x + y)
    pairs1.collect().foreach(println)
    val pairs2 = pairs.mapValues(x => x - 1)
    pairs2.collect().foreach(println)
    pairs.groupByKey().collect().foreach(println)
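
For reference, the expected output of the snippet above (the order of pairs may vary across partitions):

    pairs.collect():           (This,2) (is,3) (Spark,5) (is,3)
    reduceByKey((x,y)=>x+y):   (This,2) (is,6) (Spark,5)
    mapValues(x => x - 1):     (This,1) (is,2) (Spark,4) (is,2)
    groupByKey():              (This,CompactBuffer(2)) (is,CompactBuffer(3, 3)) (Spark,CompactBuffer(5))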

Topic 3: Link Analysis (Chapter 8)

PageRank: The "Flow" Model
- A "vote" from an important page is worth more
- A page is important if it is pointed to by other important pages
- Define a "rank" rj for page j
Example graph (y links to y and a; a links to y and m; m links to a), with the "flow" equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

Google's Solution: Random Teleports
At each step, the random surfer follows an out-link with probability β and teleports to a random page with probability 1 − β:
rj = Σ(i→j) β · ri/di + (1 − β) · 1/N
where di is the out-degree of node i and N is the total number of nodes.

The Google Matrix
A = β M + (1 − β) [1/N]N×N
where M is the column-stochastic link matrix and [1/N]N×N is the N×N matrix with all entries 1/N. A is stochastic, so the PageRank vector r is its principal eigenvector: r = A · r.

Random Teleports ( = 0.8) y a m M [1/N]NxN 1/2 1/2 0 1/2 0 0 0 1/2 1 7/15 y 1/2 1/2 0 1/2 0 0 0 1/2 1 1/3 1/3 1/3 0.8 + 0.2 7/15 1/15 7/15 1/15 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15 13/15 a 7/15 m 1/15 1/15 A y a = m 1/3 0.33 0.20 0.46 0.24 0.20 0.52 0.26 0.18 0.56 7/33 5/33 21/33 . . .

Topic-Specific PageRank
- The random walker has a small probability of teleporting at any step
- The teleport can go to:
  - Standard PageRank: any page with equal probability (to avoid dead-end and spider-trap problems)
  - Topic-specific PageRank: a topic-specific set of "relevant" pages (the teleport set)
- Idea: bias the random walk
  - When the walker teleports, she picks a page from a set S
  - S contains only pages that are relevant to the topic, e.g., Open Directory (DMOZ) pages for a given topic/query
  - For each teleport set S, we get a different vector rS

Matrix Formulation
To make this work, we only change how the teleport part of the Google matrix is built:
Aij = β Mij + (1 − β)/|S| if i ∈ S, and Aij = β Mij otherwise
The topic-specific PageRank vector rS is again the principal eigenvector of A.

Hubs and Authorities
Each page i has two scores:
- Hub score hi: i is a good hub if it points to good authorities: hi = Σ(i→j) aj, summing over the pages j1, j2, j3, j4, ... that i points to
- Authority score ai: i is a good authority if it is pointed to by good hubs: ai = Σ(j→i) hj, summing over the pages j1, j2, j3, j4, ... that point to i

Hubs and Authorities
In matrix form, with adjacency matrix A (Aij = 1 if i links to j):
h = A · a and a = AT · h
Substituting, a = AT A · a and h = A AT · h: the authority and hub vectors are the principal eigenvectors of AT A and A AT.

Hubs and Authorities
The HITS algorithm is repeated matrix powering:
- Initialize all scores, e.g., aj = hj = 1/√N
- Repeat: a = AT h, h = A a, then normalize both vectors
- Convergence criterion: stop when the scores change by less than a threshold ε between iterations

Example of HITS
Three pages: Yahoo links to all three (including itself), Amazon links to Yahoo and M'soft, and M'soft links to Amazon.
A =
  1 1 1
  1 0 1
  0 1 0
AT =
  1 1 0
  1 0 1
  1 1 0
Starting from all scores equal (.58 = 1/√3) and iterating h = A·a, a = AT·h with normalization:
h converges to (.788, .577, .211) and a converges to (.628, .459, .628) for (Yahoo, Amazon, M'soft).
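
The same iteration as a plain-Scala sketch (50 iterations are more than enough here):

    val A = Array(Array(1.0, 1.0, 1.0), // Yahoo -> Yahoo, Amazon, M'soft
                  Array(1.0, 0.0, 1.0), // Amazon -> Yahoo, M'soft
                  Array(0.0, 1.0, 0.0)) // M'soft -> Amazon

    def mul(m: Array[Array[Double]], v: Array[Double]) =
      m.map(row => row.zip(v).map { case (x, y) => x * y }.sum)
    def normalize(v: Array[Double]) =
      { val n = math.sqrt(v.map(x => x * x).sum); v.map(_ / n) }

    var h = Array(1.0, 1.0, 1.0)
    var a = Array(1.0, 1.0, 1.0)
    for (_ <- 1 to 50) {
      a = normalize(mul(A.transpose, h)) // a = A^T h
      h = normalize(mul(A, a))           // h = A a
    }
    // h ~ (.788, .577, .211), a ~ (.628, .459, .628)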

Topic 4: Graph Data Processing

From Intuition to Algorithm
- Data representation:
  - Key: node n
  - Value: d (distance from start), adjacency list (list of nodes reachable from n)
  - Initialization: for all nodes except the start node, d = ∞
- Mapper: for all m in the adjacency list, emit (m, d + 1)
- Sort/Shuffle: groups distances by reachable nodes
- Reducer: selects the minimum distance path for each reachable node
  - Additional bookkeeping is needed to keep track of the actual path

Multiple Iterations Needed
- Each MapReduce iteration advances the "known frontier" by one hop
  - Subsequent iterations include more and more reachable nodes as the frontier expands
  - The input of the mapper is the output of the reducer from the previous iteration
  - Multiple iterations are needed to explore the entire graph
- Preserving graph structure:
  - Problem: where did the adjacency list go?
  - Solution: the mapper emits (n, adjacency list) as well

BFS Pseudo-Code
Equal edge weights (how to deal with weighted edges?); only distances, no paths, are stored (how to obtain the paths?)

class Mapper
    method Map(nid n, node N)
        d ← N.Distance
        Emit(nid n, N)                          // Pass along graph structure
        for all nodeid m ∈ N.AdjacencyList do
            Emit(nid m, d + 1)                  // Emit distances to reachable nodes

class Reducer
    method Reduce(nid m, [d1, d2, . . .])
        dmin ← ∞
        M ← ∅
        for all d ∈ [d1, d2, . . .] do
            if IsNode(d) then
                M ← d                           // Recover graph structure
            else if d < dmin then               // Look for shorter distance
                dmin ← d
        M.Distance ← dmin                       // Update shortest distance
        Emit(nid m, node M)

Stopping Criterion
- How many iterations are needed in parallel BFS (equal edge weight case)?
- Convince yourself: when a node is first "discovered", we've found the shortest path to it
- Answer: at most the diameter of the graph, i.e., the greatest distance between any pair of nodes
- Six degrees of separation? If this is indeed true, then parallel breadth-first search on the global social network would take at most six MapReduce iterations.

BFS Pseudo-Code (Weighted Edges)
- The adjacency lists, which were previously lists of node ids, must now encode the edge distances as well (positive weights!)
- In line 6 of the mapper code, instead of emitting d + 1 as the value, we now emit d + w, where w is the edge distance
- The termination behaviour is very different! How many iterations are needed in parallel BFS (positive edge weight case)?
- Convince yourself: when a node is first "discovered", we've found the shortest path. Not true!

Additional Complexities
Assume that p is the node currently being processed, and in the current iteration we have just "discovered" node r for the very first time. We have already discovered the shortest distance to p, and the shortest distance to r found so far goes through p. Is s→p→r the shortest path from s to r? Not necessarily:
- The shortest path from source s to node r may go outside the current search frontier
- It is possible that p→q→r is shorter than p→r!
- We will not find the shortest distance to r until the search frontier expands to cover q
(Figure: search frontier around s containing p and q, with r just outside it.)

Computing PageRank
- Properties of PageRank:
  - Can be computed iteratively
  - Effects at each iteration are local
- Sketch of the algorithm:
  - Start with seed ri values
  - Each page distributes its ri "credit" to all pages it links to
  - Each target page tj adds up the "credit" from its multiple in-bound links to compute rj
  - Iterate until the values converge

Simplified PageRank
- First, tackle the simple case: no teleport, no dangling nodes (dead ends)
- Then, factor in these complexities:
  - How to deal with the teleport probability?
  - How to deal with dangling nodes?

Sample PageRank Iteration (1)

Sample PageRank Iteration (2)

PageRank Pseudo-Code

Complete PageRank
Two additional complexities:
- What is the proper treatment of dangling nodes?
- How do we factor in the random jump factor?
Solution:
- If a node's adjacency list is empty, distribute its value to all nodes evenly: in the mapper, for such a node i, emit (nid m, ri/N) for each node m in the graph
- Add the teleport value: in the reducer, M.PageRank = β * s + (1 − β) / N, where s is the sum of the incoming contributions (including the redistributed dangling mass) and β is the damping factor
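
A Spark-style sketch of one complete PageRank pass (Scala; assumes links: RDD[(Long, List[Long])], ranks: RDD[(Long, Double)] and the node count N are already defined, and omits nodes with no in-links for brevity):

    val beta = 0.8
    val joined = links.join(ranks).values.cache()

    // each node sends its rank along its out-links
    val contribs = joined.flatMap { case (outs, r) => outs.map(dest => (dest, r / outs.size)) }

    // total rank of dangling nodes (empty adjacency lists), to be spread evenly
    val danglingMass = joined.filter { case (outs, _) => outs.isEmpty }.map(_._2).sum()

    // the reducer-side formula above: beta * s + (1 - beta) / N
    val newRanks = contribs.reduceByKey(_ + _)
      .mapValues(s => beta * (s + danglingMass / N) + (1 - beta) / N)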

Topic 5: Streaming Data Processing
Types of queries one wants to answer over a data stream:
- Sampling data from a stream: construct a random sample
- Queries over sliding windows: number of items of type x in the last k elements of the stream
- Filtering a data stream: select elements with property x from the stream
- Counting distinct elements (not required): number of distinct elements in the last k elements of the stream

Topic 6: Machine Learning
- Recommender systems: user-user collaborative filtering, item-item collaborative filtering
- SVM: given the training dataset, find the optimal solution

CATEI and myExperience Survey

End of Chapter 12