Writing Standalone Spark Programs
Matei Zaharia, UC Berkeley
www.spark-project.org

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Building Spark
Requires: Java 6+, Scala 2.9.1

git clone git://github.com/mesos/spark
cd spark
sbt/sbt compile

# Build Spark + dependencies into a single JAR
# (gives core/target/spark*assembly*.jar)
sbt/sbt assembly

# Publish Spark to local Maven cache
sbt/sbt publish-local

Adding it to Your Project
Either include the Spark assembly JAR, or add a Maven dependency on:
  groupId: org.spark-project
  artifactId: spark-core_2.9.1
  version: SNAPSHOT
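If your project builds with sbt instead of Maven, the same coordinates can be declared in build.sbt. A minimal sketch, assuming Spark was published locally with sbt/sbt publish-local and that you fill in the version string that step produced (not given on the slide):

libraryDependencies += "org.spark-project" % "spark-core_2.9.1" % "<version>-SNAPSHOT"  // replace <version> with the locally published one
resolvers += Resolver.mavenLocal   // so sbt can see the artifact published by sbt/sbt publish-local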

Creating a SparkContext

import spark.SparkContext
import spark.SparkContext._   // important to get some implicit conversions

val sc = new SparkContext(
  "masterUrl",      // Mesos cluster URL, or local / local[N]
  "name",           // job name
  "sparkHome",      // Spark install path on cluster
  Seq("job.jar"))   // list of JARs with your code (to ship)

Complete Example

import spark.SparkContext
import spark.SparkContext._

object WordCount {   // static singleton object
  def main(args: Array[String]) {
    val sc = new SparkContext(
      "local", "WordCount", args(0), Seq(args(1)))
    val file = sc.textFile(args(2))
    file.flatMap(_.split(" "))
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .saveAsTextFile(args(3))
  }
}

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Why PageRank?
Good example of a more complex algorithm
  » Multiple stages of map & reduce
Benefits from Spark’s in-memory caching
  » Multiple iterations over the same data

Basic Idea
Give pages ranks (scores) based on links to them
  » Links from many pages → high rank
  » Link from a high-rank page → high rank
Image: en.wikipedia.org/wiki/File:PageRank-hi-res-2.png

Algorithm
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs
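To make the update concrete, here is a minimal plain-Scala sketch of one iteration on a hypothetical three-page graph (the example data is assumed, not from the slides); it mirrors the contribute-then-update steps above.

// Toy graph: a links to b and c, b links to c, c links to a
val links = Map("a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a"))
var ranks = Map("a" -> 1.0, "b" -> 1.0, "c" -> 1.0)

// Step 2: each page sends rank_p / |neighbors_p| to each of its neighbors
val contribs = links.toSeq.flatMap { case (url, neighbors) =>
  neighbors.map(dest => (dest, ranks(url) / neighbors.size))
}

// Step 3: new rank = 0.15 + 0.85 × (sum of received contributions)
ranks = contribs.groupBy(_._1).map { case (url, cs) =>
  (url, 0.15 + 0.85 * cs.map(_._2).sum)
}
// After one iteration: a -> 1.0, b -> 0.575, c -> 1.425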

Spark Program

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
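The two placeholder RDDs could be built along these lines; a minimal sketch assuming an input file of whitespace-separated "url neighborUrl" lines (the path and format are assumptions, not from the slides):

val lines = sc.textFile("hdfs://.../links.txt")   // hypothetical input path
val links = lines.map { line =>
  val parts = line.split("\\s+")
  (parts(0), parts(1))                            // (url, one neighbor)
}.groupByKey().cache()                            // (url, neighbors), cached for reuse across iterations
var ranks = links.mapValues(_ => 1.0)             // every page starts at rank 1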

Coding It Up

PageRank Performance

Other Iterative Algorithms: Logistic Regression, K-Means Clustering

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Differences in Java API
Implement functions by extending classes
  » spark.api.java.function.Function, Function2, etc.
Use special Java RDDs in spark.api.java
  » Same methods as Scala RDDs, but take Java Functions
Special PairFunction and JavaPairRDD provide operations on key-value pairs
  » To maintain type safety as in Scala

Examples

import spark.api.java.*;
import spark.api.java.function.*;

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
  new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
      return Arrays.asList(s.split(" "));
    }
  }
);
System.out.println(words.count());

Examples

import spark.api.java.*;
import spark.api.java.function.*;

JavaSparkContext sc = new JavaSparkContext(...);
JavaRDD<String> lines = sc.textFile(args[1], 1);

class Split extends FlatMapFunction<String, String> {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
}

JavaRDD<String> words = lines.flatMap(new Split());
System.out.println(words.count());

Key-Value Pairs

import scala.Tuple2;

JavaPairRDD<String, Integer> ones = words.map(
  new PairFunction<String, String, Integer>() {
    public Tuple2<String, Integer> call(String s) {
      return new Tuple2<String, Integer>(s, 1);
    }
  }
);

JavaPairRDD<String, Integer> counts = ones.reduceByKey(
  new Function2<Integer, Integer, Integer>() {
    public Integer call(Integer i1, Integer i2) {
      return i1 + i2;
    }
  }
);
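For comparison, the same two steps in the Scala API collapse to two lines thanks to the implicit pair-RDD conversions (a sketch, not from the slides):

val ones   = words.map(word => (word, 1))
val counts = ones.reduceByKey(_ + _)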

Java PageRank

Outline Setting up for Spark development Example: PageRank PageRank in Java Testing and debugging

Developing in Local Mode
Just pass local or local[k] as master URL
Still serializes tasks to catch marshaling errors
Debug in any Java/Scala debugger
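A minimal local-mode sketch for quick testing (the path, job name, and data here are assumptions, not from the slides):

import spark.SparkContext
import spark.SparkContext._

// local[4] runs tasks on 4 threads inside the same JVM, so breakpoints in
// closures are hit in-process while serialization errors still surface.
val sc = new SparkContext("local[4]", "LocalTest", "/path/to/spark", Nil)
val squares = sc.parallelize(1 to 10).map(x => x * x)
println(squares.collect().mkString(", "))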

Running on a Cluster
Set up Mesos as per Spark wiki
  » github.com/mesos/spark/wiki/Running-spark-on-mesos
Basically requires building Mesos and creating config files with locations of slaves
Pass master:port as URL (default port is 5050)

Running on EC2
Easiest way to launch a Spark cluster

git clone git://github.com/mesos/spark.git
cd spark/ec2
./spark-ec2 -k keypair -i id_rsa.pem -s slaves \
    [launch|stop|start|destroy] clusterName

Details: tinyurl.com/spark-ec2

Viewing Logs
Click through the web UI at master:8080
Or, look at the stdout and stderr files in the Mesos “work” directories for your program, such as:
  /tmp/mesos/slaves/<SlaveID>/frameworks/<FrameworkID>/executors/0/runs/0/stderr
FrameworkID is printed when Spark connects, SlaveID is printed when a task starts

Common Problems
Exceptions in tasks: will be reported at master

17:57:00 INFO TaskSetManager: Lost TID 1 (task 0.0:1)
17:57:00 INFO TaskSetManager: Loss was due to java.lang.ArithmeticException: / by zero
    at BadJob$$anonfun$1.apply$mcII$sp(BadJob.scala:10)
    at BadJob$$anonfun$1.apply(BadJob.scala:10)
    at ...

Fetch failure: couldn’t communicate with a node
  » Most likely, it crashed earlier
  » Always look at first problem in log

Common Problems
NotSerializableException:
  » Set sun.io.serialization.extendedDebugInfo=true to get a detailed trace (in SPARK_JAVA_OPTS)
  » Beware of closures using fields/methods of outer object (these will reference the whole object)
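A sketch of the closure pitfall (hypothetical class, not from the slides): using a field inside a closure really references this.field, so Spark tries to serialize the whole enclosing object; copying the value into a local val keeps the closure small and serializable.

class Multiplier(factor: Int, sc: spark.SparkContext) {
  // Bad: 'factor' is really 'this.factor', so the closure drags in the whole
  // Multiplier (including sc) and can throw NotSerializableException.
  def scaleBad(nums: spark.RDD[Int]) = nums.map(_ * factor)

  // Good: copy the field into a local val; only the Int is captured.
  def scaleGood(nums: spark.RDD[Int]) = {
    val f = factor
    nums.map(_ * f)
  }
}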

For More Help
Join the Spark Users mailing list: groups.google.com/group/spark-users
Come to the Bay Area meetup: