
1 Spark Programming By J. H. Wang May 9, 2017

2 Outline
Introduction
Spark Programs
RDD Operations

3 Introduction

4 Spark Programs
Running the shell:
In Scala: bin/spark-shell
In Python: bin/pyspark
In R: bin/sparkR
Running standalone applications:
In Java or Scala: bin/spark-submit --class MyClass file.jar
In Python: bin/spark-submit file.py
In R: bin/spark-submit file.R

5 Initializing a SparkContext
In Python:
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("My App")
sc = SparkContext(conf=conf)
In Scala:
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setMaster("local").setAppName("My App")
val sc = new SparkContext(conf)

6 In Java:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
SparkConf conf = new SparkConf().setMaster("local").setAppName("My App");
JavaSparkContext sc = new JavaSparkContext(conf);

7 Example: WordCount
In Java:
SparkConf conf = new SparkConf().setAppName("wordCount");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> input = sc.textFile(inputFile);
JavaRDD<String> words = input.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String x) { return Arrays.asList(x.split(" ")); }
});
JavaPairRDD<String, Integer> counts = words.mapToPair(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String x) { return new Tuple2(x, 1); }
  }).reduceByKey(new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer x, Integer y) { return x + y; }
  });
counts.saveAsTextFile(outputFile);

8 In Scala:
val conf = new SparkConf().setAppName("wordCount")
val sc = new SparkContext(conf)
val input = sc.textFile(inputFile)
val words = input.flatMap(line => line.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey { case (x, y) => x + y }
counts.saveAsTextFile(outputFile)

9 In Python:
conf = SparkConf().setAppName("wordCount")
sc = SparkContext(conf=conf)
lines = sc.textFile(inputFile)
counts = lines.flatMap(lambda x: x.split(" ")) \
              .map(lambda x: (x, 1)) \
              .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile(outputFile)

10 Example: WordCount in Action (in Scala)
sbt configuration file for dependency: simple.sbt
name := "Simple Project"
version := "1.0"
scalaVersion := "2.10.4"
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.1"
Layout of the .sbt and .scala files:
./simple.sbt
./src
./src/main
./src/main/scala
./src/main/scala/SimpleApp.scala
Package a jar: sbt package
Running the application: bin/spark-submit --class "SimpleApp" simple-proj.jar
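The slides do not show the contents of SimpleApp.scala itself; a minimal sketch consistent with the Scala word count on slide 8 could look like the following (reading the input and output paths from args is an assumption, not part of the slides):
// SimpleApp.scala - minimal word count; input path in args(0), output path in args(1) (assumed)
import org.apache.spark.{SparkConf, SparkContext}
object SimpleApp {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("Simple Project")
    val sc = new SparkContext(conf)
    val words = sc.textFile(args(0)).flatMap(line => line.split(" "))  // read lines, split into words
    val counts = words.map(word => (word, 1)).reduceByKey(_ + _)       // sum counts per word
    counts.saveAsTextFile(args(1))                                     // write results
    sc.stop()
  }
}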

11 Example: WordCount in Action (in Java)
Maven pom.xml configuration file for dependency
Canonical Maven directory structure:
./pom.xml
./src
./src/main
./src/main/java
./src/main/java/SimpleApp.java
Package a jar: mvn package
Running the application: bin/spark-submit --class "SimpleApp" simple-proj.jar

12 RDDs: Resilient Distributed Datasets
An RDD is a fault-tolerant collection of elements that can be operated on in parallel.
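For instance, an RDD can be created by parallelizing an existing collection in the driver program or by loading an external dataset; a minimal Scala sketch (the file name data.txt is an assumption):
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)   // RDD from an in-memory collection
val lines = sc.textFile("data.txt")   // RDD from an external text file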

13 Reference: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica, “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing,” Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI 2012). (Best Paper Award)

14 RDD Operations
RDDs support two types of operations:
Transformations create a new dataset from an existing one.
E.g., map: a transformation that passes each dataset element through a function and returns a new RDD representing the results.
Actions return a value to the driver program after running a computation on the dataset.
E.g., reduce: an action that aggregates all the elements of the RDD using some function and returns the final result to the driver program.
All transformations are lazy: they are only computed when an action requires a result to be returned to the driver program.
By default, each transformed RDD may be recomputed each time you run an action on it.
You can also persist an RDD in memory with the persist or cache method (see the sketch below).
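A minimal Scala sketch of this behavior, extending the basics example on the next slide (the file name data.txt is an assumption): nothing is computed until the first action, and persist keeps the intermediate RDD in memory so a second action can reuse it.
val lines = sc.textFile("data.txt")                            // transformation source: nothing is read yet
val lineLengths = lines.map(s => s.length)                     // transformation: still lazy
lineLengths.persist()                                          // keep the RDD in memory once it is computed
val totalLength = lineLengths.reduce((a, b) => a + b)          // action: triggers the computation
val maxLength = lineLengths.reduce((a, b) => math.max(a, b))   // second action reuses the persisted RDD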

15 Basics
Example:
val lines = sc.textFile("data.txt")
val lineLengths = lines.map(s => s.length)
val totalLength = lineLengths.reduce((a, b) => a + b)

16 Working with Key-Value Pairs
A few special operations are only available on RDDs of key-value pairs.
E.g., distributed "shuffle" operations, such as grouping or aggregating the elements by a key.
Example: reduceByKey
val lines = sc.textFile("data.txt")
val pairs = lines.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)

17 Spark Operations

18 Shuffle operations
The shuffle is Spark's mechanism for re-distributing data so that it is grouped differently across partitions.
This typically involves copying data across executors and machines, making the shuffle a complex and costly operation.

19 Example:
The reduceByKey operation generates a new RDD where all values for a single key are combined into a tuple: the key and the result of executing a reduce function against all values associated with that key.
In Spark, data is generally not distributed across partitions to be in the necessary place for a specific operation.
To organize all the data for a single reduceByKey reduce task to execute, Spark needs to perform an all-to-all operation: it must read from all partitions to find all the values for all keys, and then bring together values across partitions to compute the final result for each key. This is called the shuffle.
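One way to see where this shuffle happens is to inspect the lineage of the result with toDebugString; the printed plan contains a ShuffledRDD at the shuffle boundary. A minimal Scala sketch (the file name data.txt is an assumption):
val pairs = sc.textFile("data.txt").map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
println(counts.toDebugString)   // lineage shows a ShuffledRDD, i.e. the shuffle introduced by reduceByKey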

20 Operations which can cause a shuffle
Repartition operations: repartition, coalesce
'ByKey operations (except for counting): groupByKey, reduceByKey
Join operations: cogroup, join
Each of these is illustrated in the sketch below.
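A minimal Scala sketch exercising each group of operations (the sample data is an assumption, not from the slides):
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
val repartitioned = pairs.repartition(4)   // repartition: reshuffles data into 4 partitions
val merged = pairs.coalesce(1)             // coalesce: reduces partitions (shuffles only if shuffle = true)
val grouped = pairs.groupByKey()           // 'ByKey: groups all values for each key across partitions
val summed = pairs.reduceByKey(_ + _)      // 'ByKey: combines values per key
val cogrouped = pairs.cogroup(other)       // cogroup: groups both RDDs' values per key
val joined = pairs.join(other)             // join: matches keys across the two RDDs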

21 Shared Variables
Broadcast variables
Accumulators
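The slides only name the two kinds of shared variables; a minimal Scala sketch of both (the data and variable names are assumptions, using the Spark 1.x accumulator API that matches the spark-core 1.1.1 dependency above):
// Broadcast variable: read-only data shipped to each executor once and reused across tasks
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val mapped = sc.parallelize(Seq("a", "b", "a")).map(s => lookup.value.getOrElse(s, 0))
// Accumulator: tasks only add to it; only the driver reads the final value
val negatives = sc.accumulator(0)
sc.parallelize(Seq(1, -2, 3, -4)).foreach(x => if (x < 0) negatives += 1)
println(negatives.value)   // prints 2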

22 Thanks for Your Attention!

