Working with Key/Value Pairs
Creating pair RDDs
Convert a regular RDD into a pair RDD:
  val lines = sc.textFile("README.md")
  val pairs = lines.map(x => (x.split(" ")(0), x))
Create a pair RDD directly from an in-memory collection:
  val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
  x.collect.foreach(println(_))
Transformations
Pair RDDs are still RDDs, so transformations defined on regular RDDs (such as filter) can also be applied to pair RDDs.
Is the example below appropriate for lines of text?
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
Example (any one of the following inputs can be used):
  val lines = sc.textFile("/home/mqhuang/spark-1.6.1/README.md")
  val lines = sc.textFile("/home/mqhuang/pair.txt")
  val lines = sc.textFile("README.md")
  val pairs = lines.map(x => (x.split(" ")(0), x))
  val shortpairs = pairs.filter{ case (key, value) => value.length < 20 }
  shortpairs.collect.foreach(println(_))
Transformations: aggregations
reduceByKey() and foldByKey()
  Special cases of the combineByKey() operation
  They are transformations, not actions
  Run several parallel reduce/fold operations, one for each key in the dataset, where each operation combines values that have the same key
Implementation details in Spark
  Each node first carries out local aggregation
    foldByKey() uses the initial value to initialize the accumulator
    The initialization does not take place for keys that do not appear in a partition
  Shuffle
  Reducer nodes then carry out a second round of aggregation
    The first accumulator received is used for initialization (the initial value is not applied again)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
  val x = sc.parallelize(List((1,2), (3,4), (3,6), (1,10)))
  x.partitions.size
  x.reduceByKey((x,y) => x+y).collect.foreach(println(_))
  x.foldByKey(0)((x,y) => x+y).collect.foreach(println(_))
  x.foldByKey(1)((x,y) => x+y).collect.foreach(println(_))
  val x = sc.parallelize(List((1,2), (3,4), (3,6), (1,10)), 2)
  val x = sc.parallelize(List((1,11), (1,10)))
  x.reduceByKey((x,y) => x+1).collect.foreach(println(_))
  x.foldByKey(1)((x,y) => x+1).collect.foreach(println(_))
  val x = sc.parallelize(List((1,11), (1,10)), 2)
  val x = sc.parallelize(List((1,11), (1,10)), 10)
  val x = sc.parallelize(List((1,11), (1,10), (1,9), (2,5), (2,6)), 5)
  x.foldByKey(0)((x,y) => x+1).collect.foreach(println(_))
  val x = sc.parallelize(List((1,2), (1,4)), 3)
  val x = sc.parallelize(List((1,11), (1,10), (1,9), (2,5), (2,6)), 2)
  x.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 0) println(x)).iterator).collect
  x.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 1) println(x)).iterator).collect
  x.glom.collect
Per-key average
  val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
  val average = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2.toFloat))
  average.collect.foreach(println(_))
If the sum is not converted to a Float, integer division truncates the result:
  val average = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2))
What is the last step to calculate the per-key average? Dividing the per-key sum by the per-key count in mapValues().
Word count in Spark
Another implementation:
  val input = sc.textFile("s3://...")
  val words = input.flatMap(x => x.split(" "))
  val result = words.countByValue()
Warning: countByValue() is an action; it returns all counts to the driver, so it may cause scalability issues on large datasets.
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
  val lines = sc.parallelize(List("pandas", "i like pandas", "i like dogs", "dogs are not pandas"))
  val words = lines.flatMap(x => x.split(" "))
  val result = words.countByValue()
  val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
  val finalresult = result.collect()
combineByKey() transformation
Turns an RDD[(K, V)] into a result RDD[(K, C)], where V is the value type and C is the accumulator type.
Three required component functions:
  createCombiner
    Initializes the accumulator from a value, for each key in a partition
    Called the first time a key is encountered in a partition
  mergeValue
    Merges a value (V) into the key's accumulator (C) within a partition
    Called when a key has already been seen in that partition
  mergeCombiners
    Merges the accumulators (C) produced for the same key by different partitions into a single one
Per-key average using combineByKey()
A better version (the generic pattern, for an input pair RDD of (key, Int)):
  val result = input.combineByKey(
    (v) => (v, 1),
    (acc: (Int, Int), value) => (acc._1 + value, acc._2 + 1),
    (ACC: (Int, Int), acci: (Int, Int)) => (ACC._1 + acci._1, ACC._2 + acci._2))
A concrete example:
  val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
  val result = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
  result.collect.foreach(println(_))
Without the final map, the result is (key, (sum, count)):
  val result2 = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (ACC: (Int, Int), acci: (Int, Int)) => (ACC._1 + acci._1, ACC._2 + acci._2))
Use map:
  val result = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map(x => (x._1, x._2._1 / x._2._2.toFloat))
Use mapValues:
  val result1 = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).mapValues(x => (x._1 / x._2.toFloat))
  result1.collect.foreach(println(_))
Change the createCombiner function (note how starting the count at 2 changes the result):
  val result2 = x.combineByKey((v) => (v, 2), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
  result2.collect.foreach(println(_))
map on pair RDDs
  map{ case (key, value) => (key, value._1 / value._2.toFloat) }
is the same as
  map({ case (key, value) => (key, value._1 / value._2.toFloat) })
  { case argument => body } is a partial function
Can we use the following?
  map( (key, value) => (key, value._1 / value._2.toFloat) )
No: map expects a function of one argument (the key/value tuple), not a function of two arguments, so this does not compile.
How about the following?
  map( x => (x._1, x._2._1 / x._2._2.toFloat) )
Yes: here x is the whole (key, value) tuple, accessed with _1 and _2.
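A minimal sketch that can be pasted into spark-shell (Scala 2) to try the variants above; the sums RDD of (key, (sum, count)) pairs is a made-up stand-in for the combineByKey output:
  // Hypothetical (key, (sum, count)) pairs, e.g. the output of combineByKey above
  val sums = sc.parallelize(List(("a", (3, 2)), ("b", (1, 1))))
  // Partial-function syntax: pattern-matches the key/value tuple
  sums.map{ case (key, value) => (key, value._1 / value._2.toFloat) }.collect
  // Plain function of one tuple argument
  sums.map(x => (x._1, x._2._1 / x._2._2.toFloat)).collect
  // Does not compile in Scala 2: map takes a one-argument function, not a two-argument one
  // sums.map((key, value) => (key, value._1 / value._2.toFloat))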
Actions on pair RDDs
All of the traditional actions available on the base RDD are also available on pair RDDs.
Additional actions:
  val x = sc.parallelize(List((1,2), (3,4), (3,6)))
  x.countByKey
  x.collectAsMap
  x.lookup(3)
  val x = sc.parallelize(List((1,2), (3,6), (3,4)))
Tuning the level of parallelism
Use repartition() and coalesce() to repartition an RDD into a particular number of partitions
  repartition() can increase or decrease the number of partitions
    It does a full shuffle, so it is expensive
  coalesce() only decreases the number of partitions
    It tries to minimize data movement across nodes
  Both repartition() and coalesce() are transformations
    The original RDDs are not affected
partitionBy() can be used on pair RDDs
  Only applies to pair RDDs
  Needs a partitioner
  Pairs with the same key will be sent to the same partition
  val inputrdd = sc.parallelize(List(1,25,8,4,2), 50)
  inputrdd.partitions.size
  val result = inputrdd.fold(0)((x,y) => x+1)
  inputrdd.repartition(10)
  val rdd2 = inputrdd.repartition(10)
  rdd2.partitions.size
  rdd2.partitioner
  val result = rdd2.fold(0)((x,y) => x+1)
  val rdd3 = inputrdd.repartition(60)
  rdd3.partitions.size
  rdd3.partitioner
  val result = rdd3.fold(0)((x,y) => x+1)
  val rdd4 = inputrdd.coalesce(10)
  rdd4.partitions.size
  rdd4.partitioner
  val result = rdd4.fold(0)((x,y) => x+1)
  val rdd5 = inputrdd.coalesce(60)
  rdd5.partitions.size
  rdd5.partitioner
  val result = rdd5.fold(0)((x,y) => x+1)
glom() keeps each partition as an array inside the final array, i.e., Array[Array[type]]
  val inputrdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12), 4)
  inputrdd.glom.collect
  val rdd1 = inputrdd.repartition(2)
  rdd1.glom.collect
  val rdd2 = inputrdd.coalesce(2)
  rdd2.glom.collect
  val rdd3 = inputrdd.coalesce(12)
  rdd3.glom.collect
  val rdd4 = inputrdd.repartition(12)
  rdd4.glom.collect
Need to first import the HashPartitioner:
  import org.apache.spark.HashPartitioner
  val rdd = sc.parallelize(List(("a", 1), ("a", 2), ("b", 1), ("b", 3), ("c", 1), ("ef", 5), ("a", 3), ("b", 4), ("c", 3)))
  val rdd1 = rdd.repartition(4)
  val rdd2 = rdd.partitionBy(new HashPartitioner(4))
  val rdd = sc.parallelize(List("a", "a", "b", "b"))
  val rdd2 = rdd.partitionBy(new HashPartitioner(4)) // error: partitionBy is only defined on pair RDDs
An example to use partitionBy()
Two RDDs:
  A large RDD of (UserID, UserInfo)
    UserInfo contains a list of topics the user is subscribed to
    e.g., (Alice, (music, sport, history))
  A small RDD of (UserID, LinkInfo)
    LinkInfo contains a list of links a user has clicked in the last 5 minutes
Goal: count how many users visited a link that was not to one of their subscribed topics
An example to use partitionBy()
join() transformation
  In order to join, elements with the same key from the two RDDs have to be moved to the same node, as the small illustration below shows.
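A minimal, made-up illustration of join() on two small pair RDDs (the names and links are arbitrary):
  val users = sc.parallelize(List((1, "Alice"), (2, "Bob")))
  val clicks = sc.parallelize(List((1, "http://www.cnn.com/WORLD"), (2, "http://www.espn.com"), (1, "http://www.cnn.com/US")))
  // Pairs that share a key end up together: (1,(Alice,...)), (2,(Bob,...))
  users.join(clicks).collect.foreach(println(_))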
An example to use partitionBy() The initial approach
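The code for this slide did not survive extraction. Below is a hedged sketch of what an initial, unpartitioned approach could look like; the UserInfo and LinkInfo case classes, the sample data, and the processNewLogs helper are all made up for illustration:
  // Hypothetical types standing in for UserInfo and LinkInfo
  case class UserInfo(topics: Seq[String])
  case class LinkInfo(topic: String)

  // Large RDD of (UserID, UserInfo), reused on every call below
  val userData = sc.parallelize(Seq(
    (1, UserInfo(Seq("music", "sport", "history"))),
    (2, UserInfo(Seq("cooking"))))).persist()

  // Called every five minutes with a small RDD of (UserID, LinkInfo)
  def processNewLogs(events: org.apache.spark.rdd.RDD[(Int, LinkInfo)]): Long = {
    val joined = userData.join(events)            // (UserID, (UserInfo, LinkInfo))
    joined.filter { case (_, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)   // visit outside subscribed topics
    }.count()
  }

  val newEvents = sc.parallelize(Seq((1, LinkInfo("cooking")), (2, LinkInfo("cooking"))))
  processNewLogs(newEvents)   // returns 1 here: only user 1 visited an unsubscribed topic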
An example to use partitionBy()
The behavior: join() does not know how either RDD is partitioned, so every call to processNewLogs() hashes and shuffles both the large (UserID, UserInfo) RDD and the small (UserID, LinkInfo) RDD across the network, even though the large RDD never changes.
An example to use partitionBy() The improved implementation
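The improved implementation was likewise lost in extraction; continuing the sketch above, the one change is to hash-partition and persist the large RDD once up front, so repeated joins reuse its partitioning (the partition count of 100 and the function name are illustrative):
  import org.apache.spark.HashPartitioner

  // Partition the large RDD once; persist() keeps the partitioned copy in memory
  val partitionedUserData = userData.partitionBy(new HashPartitioner(100)).persist()

  def processNewLogsFast(events: org.apache.spark.rdd.RDD[(Int, LinkInfo)]): Long = {
    // join() now reuses partitionedUserData's partitioner: only events is shuffled
    partitionedUserData.join(events).filter { case (_, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)
    }.count()
  }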
An example to use partitionBy()
The improved behavior: because the large RDD is now hash-partitioned and persisted, Spark knows which node holds each UserID; on every call only the small (UserID, LinkInfo) RDD is shuffled to the matching partitions, while the large RDD stays in place.
Determining an RDD's Partitioner
The partitioner property tells how an RDD is partitioned:
  rdd.partitioner
Some transformations result in a partitioner being set on the output RDD
  e.g., reduceByKey()
Some transformations produce a result with no partitioner
  e.g., map()
  val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
  pairs.partitioner
  pairs.partitions.size
  pairs.glom.collect
  import org.apache.spark.HashPartitioner
  val partitioned = pairs.partitionBy(new HashPartitioner(2))
  partitioned.partitioner
  partitioned.partitions.size
  partitioned.glom.collect
  val rpartitioned = partitioned.reduceByKey((x,y) => x+y)
  rpartitioned.partitions.size
  rpartitioned.partitioner
  rpartitioned.glom.collect
  val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3), (1, 10)), 2)
  val rpartitioned = pairs.reduceByKey((x,y) => x+y)
  val mpartitioned = partitioned.map(x => x)
  mpartitioned.partitions.size
  mpartitioned.partitioner
  mpartitioned.glom.collect
PageRank
Two RDDs:
  (pageID, link list)
  (pageID, rank)
Algorithm (the details differ from the version discussed for MapReduce):
  Initialize each page's rank to 1.0
  On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its neighbors (the pages it has links to)
  Set each page's rank to 0.15 + 0.85 * contributionsReceived
  The last two steps repeat for multiple iterations
PageRank in Spark
Example: three nodes 1, 2, 3
  links = ((1, (2,3)), (2, (1)), (3, (2)))
  ranks = ((1, 1), (2, 1), (3, 1))
  links.join(ranks) = ((1, ((2,3),1)), (2, ((1),1)), (3, ((2),1)))
Computing contributions:
  case (pageId, (pageLinks, rank)) => pageLinks.map(dest => (dest, rank / pageLinks.size))
  (1, ((2,3),1)) => ((2, 0.5), (3, 0.5))
  (2, ((1),1)) => (1, 1)
  (3, ((2),1)) => (2, 1)
  contributions = ((2, 0.5), (3, 0.5), (1, 1), (2, 1))
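The full Spark code is not shown in these notes, so the following is a hedged sketch of the standard RDD-based PageRank matching the steps traced above; the three-node link structure, the partition count of 2, and the 10 iterations are assumptions made for illustration:
  import org.apache.spark.HashPartitioner

  // (pageID, list of outgoing links); partitioned and cached because it is reused every iteration
  val links = sc.parallelize(List((1, List(2, 3)), (2, List(1)), (3, List(2))))
    .partitionBy(new HashPartitioner(2))
    .persist()

  // (pageID, rank), initialized to 1.0
  var ranks = links.mapValues(v => 1.0)

  for (i <- 0 until 10) {
    // Each page sends rank/numNeighbors to every page it links to
    val contributions = links.join(ranks).flatMap {
      case (pageId, (pageLinks, rank)) => pageLinks.map(dest => (dest, rank / pageLinks.size))
    }
    // New rank = 0.15 + 0.85 * sum of received contributions
    ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
  }

  ranks.collect.foreach(println(_))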
Custom Partitioner
Two built-in partitioners:
  HashPartitioner
  RangePartitioner
Custom partitioner:
  For example, partition a pair RDD based on only part of each key:
    http://www.cnn.com/WORLD
    http://www.cnn.com/US
  Define the custom partitioner by subclassing org.apache.spark.Partitioner and implementing:
    numPartitions: Int
      Returns the number of partitions you will create
    getPartition(key: Any): Int
      Returns the partition ID (0 to numPartitions-1) for a given key
    equals(other: Any): Boolean
      Tests your Partitioner object against other instances of itself; used when Spark decides whether two of your RDDs are partitioned the same way
Custom Partitioner example
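The example code is not included in these notes; below is a hedged sketch of a domain-based partitioner for URL-string keys like the cnn.com examples above (the class name, partition count, and sample data are illustrative):
  import org.apache.spark.Partitioner
  import java.net.URL

  // Partitions pair RDDs whose keys are URL strings by their host name,
  // so that e.g. all www.cnn.com pages land in the same partition
  class DomainNamePartitioner(numParts: Int) extends Partitioner {
    override def numPartitions: Int = numParts

    override def getPartition(key: Any): Int = {
      val domain = new URL(key.toString).getHost()
      val code = domain.hashCode % numPartitions
      if (code < 0) code + numPartitions else code   // hashCode can be negative
    }

    // Lets Spark treat two RDDs partitioned by this class with the same
    // number of partitions as co-partitioned
    override def equals(other: Any): Boolean = other match {
      case dnp: DomainNamePartitioner => dnp.numPartitions == numPartitions
      case _ => false
    }
  }

  // Usage
  val pages = sc.parallelize(List(("http://www.cnn.com/WORLD", 1), ("http://www.cnn.com/US", 2)))
  val partitionedPages = pages.partitionBy(new DomainNamePartitioner(4))
  partitionedPages.partitioner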