Working with Key/Value Pairs
Creating pair RDDs
Convert a regular RDD into a pair RDD:
  val lines = sc.textFile("README.md")
  val pairs = lines.map(x => (x.split(" ")(0), x))
Create a pair RDD directly from an in-memory collection:
  val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
  x.collect.foreach(println(_))
Transformations
Pair RDDs are still RDDs, so transformations defined on regular RDDs (such as filter) can also be applied to pair RDDs.
Is the example below appropriate for lines of text?
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
Example (any one of the following inputs can be used):
  val lines = sc.textFile("/home/mqhuang/spark-1.6.1/README.md")
  val lines = sc.textFile("/home/mqhuang/pair.txt")
  val lines = sc.textFile("README.md")
  val pairs = lines.map(x => (x.split(" ")(0), x))
  val shortpairs = pairs.filter{ case (key, value) => value.length < 20 }
  shortpairs.collect.foreach(println(_))
Transformations: aggregations
reduceByKey() and foldByKey()
  Special cases of the combineByKey() operation
  They are transformations, not actions
  Run several parallel reduce/fold operations, one for each key in the dataset, where each operation combines values that have the same key
Implementation details in Spark
  Each node first carries out local aggregation
    foldByKey() uses the initial value to initialize the accumulator
    The initialization does not take place for keys that do not appear in a partition
  Shuffle
  Reducer nodes then carry out a second round of aggregation
    The first accumulator received is used for initialization (the initial value is not applied again)
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
  val x = sc.parallelize(List((1,2), (3,4), (3,6), (1,10)))
  x.partitions.size
  x.reduceByKey((x,y) => x+y).collect.foreach(println(_))
  x.foldByKey(0)((x,y) => x+y).collect.foreach(println(_))
  x.foldByKey(1)((x,y) => x+y).collect.foreach(println(_))
  val x = sc.parallelize(List((1,2), (3,4), (3,6), (1,10)), 2)
  val x = sc.parallelize(List((1,11), (1,10)))
  x.reduceByKey((x,y) => x+1).collect.foreach(println(_))
  x.foldByKey(1)((x,y) => x+1).collect.foreach(println(_))
  val x = sc.parallelize(List((1,11), (1,10)), 2)
  val x = sc.parallelize(List((1,11), (1,10)), 10)
  val x = sc.parallelize(List((1,11), (1,10), (1,9), (2,5), (2,6)), 5)
  x.foldByKey(0)((x,y) => x+1).collect.foreach(println(_))
  val x = sc.parallelize(List((1,2), (1,4)), 3)
  val x = sc.parallelize(List((1,11), (1,10), (1,9), (2,5), (2,6)), 2)
  x.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 0) println(x)).iterator).collect
  x.mapPartitionsWithIndex((index, it) => it.toList.map(x => if (index == 1) println(x)).iterator).collect
  x.glom.collect
Per-key average
  val rdd = sc.parallelize(List(("panda", 0), ("pink", 3), ("pirate", 3), ("panda", 1), ("pink", 4)))
  val average = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2.toFloat))
  average.collect.foreach(println(_))
If the sum is not converted to a Float, integer division truncates the result:
  val average = rdd.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(x => (x._1 / x._2))
What is the last step to calculate the per-key average? Dividing the per-key sum by the per-key count in mapValues().
Word count in Spark
Another implementation:
  val input = sc.textFile("s3://...")
  val words = input.flatMap(x => x.split(" "))
  val result = words.countByValue()
Warning: countByValue() is an action; it returns all counts to the driver, so it may cause scalability issues on large datasets.
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
  val lines = sc.parallelize(List("pandas", "i like pandas", "i like dogs", "dogs are not pandas"))
  val words = lines.flatMap(x => x.split(" "))
  val result = words.countByValue()
  val result = words.map(x => (x, 1)).reduceByKey((x, y) => x + y)
  val finalresult = result.collect()
combineByKey() transformation
Turns an RDD[(K, V)] into a result RDD[(K, C)], where V is the value type and C is the accumulator type.
Three required component functions:
  createCombiner
    Initializes the accumulator from a value, for each key in a partition
    Called the first time a key is encountered in a partition
  mergeValue
    Merges a value (V) into the key's accumulator (C) within a partition
    Called when a key has already been seen in that partition
  mergeCombiners
    Merges the accumulators (C) produced for the same key by different partitions into a single one
Per-key average using combineByKey()
A better version (the generic pattern, for an input pair RDD of (key, Int)):
  val result = input.combineByKey(
    (v) => (v, 1),
    (acc: (Int, Int), value) => (acc._1 + value, acc._2 + 1),
    (ACC: (Int, Int), acci: (Int, Int)) => (ACC._1 + acci._1, ACC._2 + acci._2))
A concrete example:
  val x = sc.parallelize(List(("a", 1), ("b", 1), ("a", 2)))
  val result = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
  result.collect.foreach(println(_))
Without the final map, the result is (key, (sum, count)):
  val result2 = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (ACC: (Int, Int), acci: (Int, Int)) => (ACC._1 + acci._1, ACC._2 + acci._2))
Use map:
  val result = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map(x => (x._1, x._2._1 / x._2._2.toFloat))
Use mapValues:
  val result1 = x.combineByKey((v) => (v, 1), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).mapValues(x => (x._1 / x._2.toFloat))
  result1.collect.foreach(println(_))
Change the createCombiner function (note how starting the count at 2 changes the result):
  val result2 = x.combineByKey((v) => (v, 2), (acc: (Int, Int), v) => (acc._1 + v, acc._2 + 1), (acc1: (Int, Int), acc2: (Int, Int)) => (acc1._1 + acc2._1, acc1._2 + acc2._2)).map{ case (key, value) => (key, value._1 / value._2.toFloat) }
  result2.collect.foreach(println(_))
map on pair RDDs
  map{ case (key, value) => (key, value._1 / value._2.toFloat) }
is the same as
  map({ case (key, value) => (key, value._1 / value._2.toFloat) })
  { case argument => body } is a partial function
Can we use the following?
  map( (key, value) => (key, value._1 / value._2.toFloat) )
No: map expects a function of one argument (the key/value tuple), not a function of two arguments, so this does not compile.
How about the following?
  map( x => (x._1, x._2._1 / x._2._2.toFloat) )
Yes: here x is the whole (key, value) tuple, accessed with _1 and _2.
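A minimal sketch that can be pasted into spark-shell (Scala 2) to try the variants above; the sums RDD of (key, (sum, count)) pairs is a made-up stand-in for the combineByKey output:
  // Hypothetical (key, (sum, count)) pairs, e.g. the output of combineByKey above
  val sums = sc.parallelize(List(("a", (3, 2)), ("b", (1, 1))))
  // Partial-function syntax: pattern-matches the key/value tuple
  sums.map{ case (key, value) => (key, value._1 / value._2.toFloat) }.collect
  // Plain function of one tuple argument
  sums.map(x => (x._1, x._2._1 / x._2._2.toFloat)).collect
  // Does not compile in Scala 2: map takes a one-argument function, not a two-argument one
  // sums.map((key, value) => (key, value._1 / value._2.toFloat))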
Actions on pair RDDs
All of the traditional actions available on the base RDD are also available on pair RDDs.
Additional actions:
  val x = sc.parallelize(List((1,2), (3,4), (3,6)))
  x.countByKey
  x.collectAsMap
  x.lookup(3)
  val x = sc.parallelize(List((1,2), (3,6), (3,4)))
Tuning the level of parallelism
Use repartition() and coalesce() to repartition an RDD into a particular number of partitions
  repartition() can increase or decrease the number of partitions
    It does a full shuffle, so it is expensive
  coalesce() only decreases the number of partitions
    It tries to minimize data movement across nodes
  Both repartition() and coalesce() are transformations
    The original RDDs are not affected
partitionBy() can be used on pair RDDs
  Only applies to pair RDDs
  Needs a partitioner
  Pairs with the same key will be sent to the same partition
  val inputrdd = sc.parallelize(List(1,25,8,4,2), 50)
  inputrdd.partitions.size
  val result = inputrdd.fold(0)((x,y) => x+1)
  inputrdd.repartition(10)
  val rdd2 = inputrdd.repartition(10)
  rdd2.partitions.size
  rdd2.partitioner
  val result = rdd2.fold(0)((x,y) => x+1)
  val rdd3 = inputrdd.repartition(60)
  rdd3.partitions.size
  rdd3.partitioner
  val result = rdd3.fold(0)((x,y) => x+1)
  val rdd4 = inputrdd.coalesce(10)
  rdd4.partitions.size
  rdd4.partitioner
  val result = rdd4.fold(0)((x,y) => x+1)
  val rdd5 = inputrdd.coalesce(60)
  rdd5.partitions.size
  rdd5.partitioner
  val result = rdd5.fold(0)((x,y) => x+1)
glom() keeps each partition as an array inside the final array, i.e., Array[Array[type]]
  val inputrdd = sc.parallelize(List(1,2,3,4,5,6,7,8,9,10,11,12), 4)
  inputrdd.glom.collect
  val rdd1 = inputrdd.repartition(2)
  rdd1.glom.collect
  val rdd2 = inputrdd.coalesce(2)
  rdd2.glom.collect
  val rdd3 = inputrdd.coalesce(12)
  rdd3.glom.collect
  val rdd4 = inputrdd.repartition(12)
  rdd4.glom.collect
Need to first import the HashPartitioner:
  import org.apache.spark.HashPartitioner
  val rdd = sc.parallelize(List(("a", 1), ("a", 2), ("b", 1), ("b", 3), ("c", 1), ("ef", 5), ("a", 3), ("b", 4), ("c", 3)))
  val rdd1 = rdd.repartition(4)
  val rdd2 = rdd.partitionBy(new HashPartitioner(4))
  val rdd = sc.parallelize(List("a", "a", "b", "b"))
  val rdd2 = rdd.partitionBy(new HashPartitioner(4)) // error: partitionBy is only defined on pair RDDs
An example to use partitionBy()
Two RDDs:
  A large RDD of (UserID, UserInfo)
    UserInfo contains a list of topics the user is subscribed to
    e.g., (Alice, (music, sport, history))
  A small RDD of (UserID, LinkInfo)
    LinkInfo contains a list of links a user has clicked in the last 5 minutes
Goal: count how many users visited a link that was not to one of their subscribed topics
An example to use partitionBy()
join() transformation
  In order to join, elements with the same key from the two RDDs have to be moved to the same node, as the small illustration below shows.
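A minimal, made-up illustration of join() on two small pair RDDs (the names and links are arbitrary):
  val users = sc.parallelize(List((1, "Alice"), (2, "Bob")))
  val clicks = sc.parallelize(List((1, "http://www.cnn.com/WORLD"), (2, "http://www.espn.com"), (1, "http://www.cnn.com/US")))
  // Pairs that share a key end up together: (1,(Alice,...)), (2,(Bob,...))
  users.join(clicks).collect.foreach(println(_))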
An example to use partitionBy() The initial approach
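The code for this slide did not survive extraction. Below is a hedged sketch of what an initial, unpartitioned approach could look like; the UserInfo and LinkInfo case classes, the sample data, and the processNewLogs helper are all made up for illustration:
  // Hypothetical types standing in for UserInfo and LinkInfo
  case class UserInfo(topics: Seq[String])
  case class LinkInfo(topic: String)

  // Large RDD of (UserID, UserInfo), reused on every call below
  val userData = sc.parallelize(Seq(
    (1, UserInfo(Seq("music", "sport", "history"))),
    (2, UserInfo(Seq("cooking"))))).persist()

  // Called every five minutes with a small RDD of (UserID, LinkInfo)
  def processNewLogs(events: org.apache.spark.rdd.RDD[(Int, LinkInfo)]): Long = {
    val joined = userData.join(events)            // (UserID, (UserInfo, LinkInfo))
    joined.filter { case (_, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)   // visit outside subscribed topics
    }.count()
  }

  val newEvents = sc.parallelize(Seq((1, LinkInfo("cooking")), (2, LinkInfo("cooking"))))
  processNewLogs(newEvents)   // returns 1 here: only user 1 visited an unsubscribed topic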
An example to use partitionBy()
The behavior: join() does not know how either RDD is partitioned, so every call to processNewLogs() hashes and shuffles both the large (UserID, UserInfo) RDD and the small (UserID, LinkInfo) RDD across the network, even though the large RDD never changes.
An example to use partitionBy() The improved implementation
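The improved implementation was likewise lost in extraction; continuing the sketch above, the one change is to hash-partition and persist the large RDD once up front, so repeated joins reuse its partitioning (the partition count of 100 and the function name are illustrative):
  import org.apache.spark.HashPartitioner

  // Partition the large RDD once; persist() keeps the partitioned copy in memory
  val partitionedUserData = userData.partitionBy(new HashPartitioner(100)).persist()

  def processNewLogsFast(events: org.apache.spark.rdd.RDD[(Int, LinkInfo)]): Long = {
    // join() now reuses partitionedUserData's partitioner: only events is shuffled
    partitionedUserData.join(events).filter { case (_, (userInfo, linkInfo)) =>
      !userInfo.topics.contains(linkInfo.topic)
    }.count()
  }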
An example to use partitionBy()
The improved behavior: because the large RDD is now hash-partitioned and persisted, Spark knows which node holds each UserID; on every call only the small (UserID, LinkInfo) RDD is shuffled to the matching partitions, while the large RDD stays in place.
Determining an RDD's Partitioner
The partitioner property tells how an RDD is partitioned:
  rdd.partitioner
Some transformations result in a partitioner being set on the output RDD
  e.g., reduceByKey()
Some transformations produce a result with no partitioner
  e.g., map()
  val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3)))
  pairs.partitioner
  pairs.partitions.size
  pairs.glom.collect
  import org.apache.spark.HashPartitioner
  val partitioned = pairs.partitionBy(new HashPartitioner(2))
  partitioned.partitioner
  partitioned.partitions.size
  partitioned.glom.collect
  val rpartitioned = partitioned.reduceByKey((x,y) => x+y)
  rpartitioned.partitions.size
  rpartitioned.partitioner
  rpartitioned.glom.collect
  val pairs = sc.parallelize(List((1, 1), (2, 2), (3, 3), (1, 10)), 2)
  val rpartitioned = pairs.reduceByKey((x,y) => x+y)
  val mpartitioned = partitioned.map(x => x)
  mpartitioned.partitions.size
  mpartitioned.partitioner
  mpartitioned.glom.collect
PageRank
Two RDDs:
  (pageID, link list)
  (pageID, rank)
Algorithm (the details differ from the version discussed for MapReduce):
  Initialize each page's rank to 1.0
  On each iteration, have page p send a contribution of rank(p)/numNeighbors(p) to its neighbors (the pages it has links to)
  Set each page's rank to 0.15 + 0.85 * contributionsReceived
  The last two steps repeat for multiple iterations
PageRank in Spark
Example: three nodes 1, 2, 3
  links = ((1, (2,3)), (2, (1)), (3, (2)))
  ranks = ((1, 1), (2, 1), (3, 1))
  links.join(ranks) = ((1, ((2,3),1)), (2, ((1),1)), (3, ((2),1)))
Computing contributions:
  case (pageId, (pageLinks, rank)) => pageLinks.map(dest => (dest, rank / pageLinks.size))
  (1, ((2,3),1)) => ((2, 0.5), (3, 0.5))
  (2, ((1),1)) => (1, 1)
  (3, ((2),1)) => (2, 1)
  contributions = ((2, 0.5), (3, 0.5), (1, 1), (2, 1))
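The full Spark code is not shown in these notes, so the following is a hedged sketch of the standard RDD-based PageRank matching the steps traced above; the three-node link structure, the partition count of 2, and the 10 iterations are assumptions made for illustration:
  import org.apache.spark.HashPartitioner

  // (pageID, list of outgoing links); partitioned and cached because it is reused every iteration
  val links = sc.parallelize(List((1, List(2, 3)), (2, List(1)), (3, List(2))))
    .partitionBy(new HashPartitioner(2))
    .persist()

  // (pageID, rank), initialized to 1.0
  var ranks = links.mapValues(v => 1.0)

  for (i <- 0 until 10) {
    // Each page sends rank/numNeighbors to every page it links to
    val contributions = links.join(ranks).flatMap {
      case (pageId, (pageLinks, rank)) => pageLinks.map(dest => (dest, rank / pageLinks.size))
    }
    // New rank = 0.15 + 0.85 * sum of received contributions
    ranks = contributions.reduceByKey(_ + _).mapValues(v => 0.15 + 0.85 * v)
  }

  ranks.collect.foreach(println(_))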
Custom Partitioner
Two built-in partitioners:
  HashPartitioner
  RangePartitioner
Custom partitioner:
  For example, partition a pair RDD based on only part of each key:
    http://www.cnn.com/WORLD
    http://www.cnn.com/US
  Define the custom partitioner by subclassing org.apache.spark.Partitioner and implementing:
    numPartitions: Int
      Returns the number of partitions you will create
    getPartition(key: Any): Int
      Returns the partition ID (0 to numPartitions-1) for a given key
    equals(other: Any): Boolean
      Tests your Partitioner object against other instances of itself; used when Spark decides whether two of your RDDs are partitioned the same way
Custom Partitioner example
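The example code is not included in these notes; below is a hedged sketch of a domain-based partitioner for URL-string keys like the cnn.com examples above (the class name, partition count, and sample data are illustrative):
  import org.apache.spark.Partitioner
  import java.net.URL

  // Partitions pair RDDs whose keys are URL strings by their host name,
  // so that e.g. all www.cnn.com pages land in the same partition
  class DomainNamePartitioner(numParts: Int) extends Partitioner {
    override def numPartitions: Int = numParts

    override def getPartition(key: Any): Int = {
      val domain = new URL(key.toString).getHost()
      val code = domain.hashCode % numPartitions
      if (code < 0) code + numPartitions else code   // hashCode can be negative
    }

    // Lets Spark treat two RDDs partitioned by this class with the same
    // number of partitions as co-partitioned
    override def equals(other: Any): Boolean = other match {
      case dnp: DomainNamePartitioner => dnp.numPartitions == numPartitions
      case _ => false
    }
  }

  // Usage
  val pages = sc.parallelize(List(("http://www.cnn.com/WORLD", 1), ("http://www.cnn.com/US", 2)))
  val partitionedPages = pages.partitionBy(new DomainNamePartitioner(4))
  partitionedPages.partitioner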