Parallel Programming With Spark. Matei Zaharia, UC Berkeley (www.spark-project.org)


What is Spark?
Fast and expressive cluster computing system compatible with Apache Hadoop
»Works with any Hadoop-supported storage system and data format (HDFS, S3, SequenceFile, …)
Improves efficiency through:
»In-memory computing primitives (as much as 30× faster)
»General computation graphs
Improves usability through:
»Rich Scala and Java APIs (often 2-10× less code)
»Interactive shell

How to Run It
Local multicore: just a library in your program
EC2: scripts for launching a Spark cluster
Private cluster: Mesos, YARN*, standalone*
*Coming soon in Spark 0.6

Scala vs Java APIs
Spark was originally written in Scala, which allows concise function syntax and interactive use
A Java API for standalone apps was recently added (dev branch on GitHub)
Interactive shell is still in Scala
This course: mostly Scala, with translations to Java

Outline
Introduction to Scala & functional programming
Spark concepts
Tour of Spark operations
Job execution

About Scala
High-level language for the Java VM
»Object-oriented + functional programming
Statically typed
»Comparable in speed to Java
»But often no need to write types due to type inference
Interoperates with Java
»Can use any Java class, inherit from it, etc.; can also call Scala code from Java
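Because interop is just ordinary method calls, a minimal sketch may help; it uses only a JDK class (java.util.HashMap) and can be pasted into the scala shell:
import java.util.{HashMap => JHashMap}   // rename to avoid clashing with Scala's own HashMap
val m = new JHashMap[String, Int]()      // instantiate a Java generic class from Scala
m.put("spark", 1)                        // call its Java methods directly (boxing is automatic)
println(m.get("spark"))                  // prints 1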

Best Way to Learn Scala
Interactive shell: just type scala
Supports importing libraries, tab completion, and all constructs in the language

Quick Tour
Declaring variables:
var x: Int = 7
var x = 7        // type inferred
val y = "hi"     // read-only
Java equivalent:
int x = 7;
final String y = "hi";
Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = { x*x }   // last expression in block is returned
def announce(text: String) { println(text) }
Java equivalent:
int square(int x) { return x*x; }
void announce(String text) { System.out.println(text); }

Quick Tour
Generic types:
var arr = new Array[Int](8)
var lst = List(1, 2, 3)   // type of lst is List[Int]; List(...) is a factory method
Java equivalent:
int[] arr = new int[8];
List<Integer> lst = new ArrayList<Integer>();   // can't hold primitive types
lst.add(...)
Indexing:
arr(5) = 7
println(lst(5))
Java equivalent:
arr[5] = 7;
System.out.println(lst.get(5));

Quick Tour
Processing collections with functional programming:
val list = List(1, 2, 3)
list.foreach(x => println(x))   // prints 1, 2, 3
list.foreach(println)           // same
list.map(x => x + 2)            // => List(3, 4, 5)
list.map(_ + 2)                 // same, with placeholder notation
list.filter(x => x % 2 == 1)    // => List(1, 3)
list.filter(_ % 2 == 1)         // => List(1, 3)
list.reduce((x, y) => x + y)    // => 6
list.reduce(_ + _)              // => 6
The x => ... syntax is a function expression (closure)
All of these leave the list unchanged (List is immutable)

Scala Closure Syntax
(x: Int) => x + 2   // full version
x => x + 2          // type inferred
_ + 2               // when each argument is used exactly once
x => {              // when body is a block of code
  val numberToAdd = 2
  x + numberToAdd
}
// If the closure is too long, you can always pass a function instead
def addTwo(x: Int): Int = x + 2
list.map(addTwo)
Scala also allows defining a “local function” inside another function

Other Collection Methods
Scala collections provide many other functional methods; for example, Google for “Scala Seq”
Method on Seq[T]                       Explanation
map(f: T => U): Seq[U]                 Pass each element through f
flatMap(f: T => Seq[U]): Seq[U]        One-to-many map
filter(f: T => Boolean): Seq[T]        Keep elements passing f
exists(f: T => Boolean): Boolean       True if one element passes f
forall(f: T => Boolean): Boolean       True if all elements pass f
reduce(f: (T, T) => T): T              Merge elements using f
groupBy(f: T => K): Map[K, List[T]]    Group elements by f(element)
sortBy(f: T => K): Seq[T]              Sort elements by f(element)
...
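As a small illustration (plain Scala collections, nothing Spark-specific), a few of the methods above tried in the scala shell:
val words = List("to", "be", "or", "not", "to", "be")
words.flatMap(w => w.toList)    // one-to-many: List(t, o, b, e, o, r, n, o, t, t, o, b, e)
words.exists(_.length > 2)      // true, because "not" passes
words.groupBy(w => w)           // Map(to -> List(to, to), be -> List(be, be), or -> List(or), not -> List(not))
words.sortBy(_.length)          // two-letter words first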

Outline
Introduction to Scala & functional programming
Spark concepts
Tour of Spark operations
Job execution

Spark Overview
Goal: work with distributed collections as you would with local ones
Concept: resilient distributed datasets (RDDs)
»Immutable collections of objects spread across a cluster
»Built through parallel transformations (map, filter, etc.)
»Automatically rebuilt on failure
»Controllable persistence (e.g. caching in RAM) for reuse

Main Primitives
Resilient distributed datasets (RDDs)
»Immutable, partitioned collections of objects
Transformations (e.g. map, filter, groupBy, join)
»Lazy operations that build RDDs from other RDDs
Actions (e.g. count, collect, save)
»Return a result or write it to storage
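The transformation/action split matters because transformations are lazy; a minimal sketch (assuming a SparkContext named sc, as in spark-shell):
val nums = sc.parallelize(1 to 1000)    // RDD defined, nothing computed yet
val evens = nums.filter(_ % 2 == 0)     // transformation: also lazy, just records how to compute it
evens.count()                           // action: a job actually runs now, => 500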

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))
messages.cache()
messages.filter(_.contains("foo")).count           // action
messages.filter(_.contains("bar")).count
...
(Diagram: the driver sends tasks to workers, which read the input blocks, cache the filtered data, and return results)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)

RDD Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) to recompute lost data
E.g.: messages = textFile(...).filter(_.contains("error")).map(_.split('\t')(2))
Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
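The same lineage written out in code, as a sketch (assuming a SparkContext sc; the HDFS path is a placeholder as on the slide):
val messages = sc.textFile("hdfs://...")        // HadoopRDD
  .filter(_.contains("error"))                  // FilteredRDD
  .map(_.split('\t')(2))                        // MappedRDD
messages.cache()
// If a cached partition is later lost (say its worker dies), Spark re-reads only the
// corresponding input block and re-applies filter and map to rebuild that partition.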

Fault Recovery Test
(Chart: per-iteration running times; a failure happens partway through the job)

Behavior with Less RAM

How it Looks in Java
Scala: lines.filter(_.contains("error")).count()
Java:
JavaRDD<String> lines = ...;
lines.filter(new Function<String, Boolean>() {
  Boolean call(String s) { return s.contains("error"); }
}).count();
More examples in the next talk

Outline
Introduction to Scala & functional programming
Spark concepts
Tour of Spark operations
Job execution

Learning Spark
Easiest way: the Spark interpreter (spark-shell)
»Modified version of the Scala interpreter for cluster use
Runs in local mode on 1 thread by default, but you can control this through the MASTER environment variable:
MASTER=local ./spark-shell       # local, 1 thread
MASTER=local[2] ./spark-shell    # local, 2 threads
MASTER=host:port ./spark-shell   # run on Mesos

First Stop: SparkContext
Main entry point to Spark functionality
Created for you in spark-shell as the variable sc
In standalone programs, you'd make your own (see later for details)
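As a hedged sketch of what a standalone program looked like in the Spark 0.5/0.6-era API (the package name spark and the two-argument constructor reflect that era; "MyApp" and the master string are illustrative values):
import spark.SparkContext
import spark.SparkContext._                      // implicit conversions, e.g. for pair RDD operations
val sc = new SparkContext("local[2]", "MyApp")   // master URL, application name
val data = sc.parallelize(1 to 100)
println(data.reduce(_ + _))                      // => 5050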

Creating RDDs
// Turn a Scala collection into an RDD
sc.parallelize(List(1, 2, 3))
// Load text file from local FS, HDFS, or S3
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")
// Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)

Basic Transformations
val nums = sc.parallelize(List(1, 2, 3))
// Pass each element through a function
val squares = nums.map(x => x*x)        // {1, 4, 9}
// Keep elements passing a predicate
val even = squares.filter(_ % 2 == 0)   // {4}
// Map each element to zero or more others
nums.flatMap(x => 1 to x)               // => {1, 1, 2, 1, 2, 3}; 1 to x is a Range (1, 2, …, x)

Basic Actions
val nums = sc.parallelize(List(1, 2, 3))
// Retrieve RDD contents as a local collection
nums.collect()       // => Array(1, 2, 3)
// Return first K elements
nums.take(2)         // => Array(1, 2)
// Count number of elements
nums.count()         // => 3
// Merge elements with an associative function
nums.reduce(_ + _)   // => 6
// Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs
Spark's “distributed reduce” transformations operate on RDDs of key-value pairs
Scala pair syntax:
val pair = (a, b)   // sugar for new Tuple2(a, b)
Accessing pair elements:
pair._1   // => a
pair._2   // => b

Some Key-Value Operations
val pets = sc.parallelize(List(("cat", 1), ("dog", 1), ("cat", 2)))
pets.reduceByKey(_ + _)   // => {(cat, 3), (dog, 1)}
pets.groupByKey()         // => {(cat, Seq(1, 2)), (dog, Seq(1))}
pets.sortByKey()          // => {(cat, 1), (cat, 2), (dog, 1)}
reduceByKey also automatically implements combiners on the map side

Example: Word Count
val lines = sc.textFile("hamlet.txt")
val counts = lines.flatMap(line => line.split(" "))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
Data flow: “to be or” “not to be” → “to” “be” “or” “not” “to” “be” → (to, 1) (be, 1) (or, 1) (not, 1) (to, 1) (be, 1) → (be, 2) (not, 1) (or, 1) (to, 2)

Other Key-Value Operations
val visits = sc.parallelize(List(
  ("index.html", "…"),
  ("about.html", "…"),
  ("index.html", "…")))
val pageNames = sc.parallelize(List(
  ("index.html", "Home"),
  ("about.html", "About")))
visits.join(pageNames)
// ("index.html", ("…", "Home"))
// ("index.html", ("…", "Home"))
// ("about.html", ("…", "About"))
visits.cogroup(pageNames)
// ("index.html", (Seq("…", "…"), Seq("Home")))
// ("about.html", (Seq("…"), Seq("About")))

Controlling The Number of Reduce Tasks
All the pair RDD operations take an optional second parameter for the number of tasks:
words.reduceByKey(_ + _, 5)
words.groupByKey(5)
visits.join(pageNames, 5)
Can also set the spark.default.parallelism property

Using Local Variables
Any external variables you use in a closure will automatically be shipped to the cluster:
val query = Console.readLine()
pages.filter(_.contains(query)).count()
Some caveats:
»Each task gets a new copy (updates aren't sent back)
»The variable must be Serializable
»Don't use fields of an outer object (ships all of it!)

Closure Mishap Example
class MyCoolRddApp {
  val param = 3.14
  val log = new Log(...)
  ...
  def work(rdd: RDD[Int]) {
    rdd.map(x => x + param).reduce(...)
  }
}
// Fails with NotSerializableException: MyCoolRddApp (or Log),
// because the closure references this.param and drags the whole object with it
How to get around it:
class MyCoolRddApp {
  ...
  def work(rdd: RDD[Int]) {
    val param_ = param
    rdd.map(x => x + param_).reduce(...)   // references only a local variable instead of this.param
  }
}

Other RDD Operations
sample(): deterministically sample a subset
union(): merge two RDDs
cartesian(): cross product
pipe(): pass through an external program
See the Programming Guide for more
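A short sketch of these operations (assuming a SparkContext sc; sample's third argument is a random seed in this era's API):
val nums = sc.parallelize(1 to 10)
val more = sc.parallelize(11 to 20)
nums.sample(false, 0.5, 42)            // subset of roughly half the elements, no replacement
nums.union(more)                       // elements of both RDDs: 1 through 20
nums.cartesian(more)                   // all (x, y) pairs, 100 of them
nums.map(_.toString).pipe("grep 1")    // stream each element through an external command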

Outline
Introduction to Scala & functional programming
Spark concepts
Tour of Spark operations
Job execution

Software Components
Spark runs as a library in your program (1 instance per app)
Runs tasks locally or on Mesos
»dev branch also supports YARN and standalone deployment
Accesses storage systems via the Hadoop InputFormat API
»Can use HBase, HDFS, S3, …
(Diagram: your application holds a SparkContext with local threads; it talks to the Mesos master, which launches Spark workers on the slaves; workers read from HDFS or other storage)

Task Scheduler
Runs general task graphs
Pipelines functions where possible
Cache-aware data reuse & locality
Partitioning-aware to avoid shuffles
(Diagram: a graph of RDDs A-F connected by map, groupBy, filter, and join, divided into Stages 1-3; the legend marks cached partitions)

Data Storage
Cached RDDs are normally stored as Java objects
»Fastest access on the JVM, but can be larger than ideal
Can also store in serialized format
»Spark 0.5: spark.cache.class=spark.SerializingCache
Default serialization library is Java serialization
»Very slow for large data!
»Can customize through spark.serializer (see later)
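A hedged sketch of that customization: in this era, Spark settings were Java system properties set before creating the SparkContext; the Kryo class name and registrator property below are assumptions based on that era's docs, and mypackage.MyKryoRegistrator is a hypothetical class:
import spark.SparkContext
System.setProperty("spark.serializer", "spark.KryoSerializer")               // assumed class name for this era
System.setProperty("spark.kryo.registrator", "mypackage.MyKryoRegistrator")  // optional, hypothetical registrator
val sc = new SparkContext("local", "SerializationDemo")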

How to Get Started
git clone git://github.com/mesos/spark
cd spark
sbt/sbt compile
./spark-shell

More Information
Scala resources:
»“First Steps to Scala”
»(free book)
Spark documentation: www.spark-project.org/documentation.html