
Spark: Lightning-Fast Cluster Computing
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Cliff Engle, Michael Franklin, Haoyuan Li, Antonio Lupher, Justin Ma, …
www.spark-project.org
UC BERKELEY

This Meetup
1. Project history and plans
2. Spark tutorial
   » Running locally and on EC2
   » Interactive shell
   » Standalone jobs
3. Spark at Quantifind

Project Goals
AMP Lab: design the next-generation data analytics stack
  » By scaling up Algorithms, Machines and People
Mesos: cluster manager
  » Make it easy to write and deploy distributed apps
Spark: parallel computing system
  » General and efficient computing model supporting in-memory execution
  » High-level API in the Scala language
  » Substrate for even higher-level APIs (SQL, Pregel, …)

Where We’re Going
[Stack diagram: Mesos runs on private clusters, Amazon EC2, and more; Hadoop, MPI, and Spark run on Mesos; on top of Spark sit Bagel (Pregel on Spark), Shark (Hive on Spark), debug tools, and Streaming Spark.]

Some Users

Project Stats
Core classes: 8,700 LOC
Interpreter:  3,300 LOC
Examples:     1,300 LOC
Tests:        1,100 LOC
Total:       15,000 LOC

Getting Spark
Requirements: Java 6+, Scala

git clone git://github.com/mesos/spark.git
cd spark
sbt/sbt compile

Wiki data: tinyurl.com/wikisample
These slides: tinyurl.com/sum-talk

Running Locally
# run one of the example jobs:
./run spark.examples.SparkPi local

# launch the interpreter:
./spark-shell

Running on EC2
git clone git://github.com/apache/mesos.git
cd mesos/ec2
./mesos-ec2 -k keypair -i id_rsa.pem -s slaves \
  [launch|stop|start|destroy] clusterName

Details: tinyurl.com/mesos-ec2

Programming Concepts
SparkContext: entry point to Spark functions
Resilient distributed datasets (RDDs):
  » Immutable, distributed collections of objects
  » Can be cached in memory for fast reuse
Operations on RDDs:
  » Transformations: define a new RDD (map, join, …)
  » Actions: return or output a result (count, save, …)
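To make the transformation/action split concrete, here is a minimal sketch against the SparkContext API shown on the following slides (the file name errors.txt is a hypothetical placeholder, and sc is assumed to exist as on the next slide):

// transformations are lazy: nothing is computed yet
val lines  = sc.textFile("errors.txt")
val errors = lines.filter(_.contains("ERROR"))
errors.cache()                 // ask Spark to keep this RDD in RAM

// actions trigger computation and return a result
val numErrors = errors.count() // runs the whole pipeline
println("error lines: " + numErrors)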

Creating a SparkContext
import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("master", "jobName")

// Master can be:
// local    – run locally with 1 thread
// local[K] – run locally with K threads
// …

Creating RDDs
// turn a Scala collection into an RDD
sc.parallelize(List(1, 2, 3))

// text file from local FS, HDFS, S3, etc.
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

// general Hadoop InputFormat:
sc.hadoopFile(keyCls, valCls, inputFmt, conf)

RDD Operations
Transformations (define a new RDD):
  map, filter, sample, groupByKey, reduceByKey, cogroup,
  flatMap, union, join, cross, mapValues, ...

Actions (output a result):
  collect, reduce, take, fold, count,
  saveAsHadoopFile, saveAsTextFile, ...

Persistence:
  cache (keep RDD in RAM)
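As a combined illustration of transformations, actions, and persistence, here is a minimal word-count sketch (input.txt and the counts output directory are hypothetical placeholders; sc and the imports from the earlier slide are assumed):

// transformations build up a lineage of RDDs lazily
val words  = sc.textFile("input.txt").flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.cache()                   // persistence: keep the pair RDD in RAM

// actions run the computation
println(counts.count())          // number of distinct words
counts.saveAsTextFile("counts")  // write one output file per partition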

Standalone Jobs
Without Maven: package Spark into a jar
  sbt/sbt assembly
  # use core/target/spark-core-assembly-*.jar

With Maven:
  sbt/sbt publish-local
  # add a dependency on org.spark-project / spark-core
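For the Maven/sbt route, the dependency line in your own build would look roughly like this; the group and artifact come from the slide, but the version string is an assumption and should match the release you built:

// build.sbt – version is illustrative, not from the slides
libraryDependencies += "org.spark-project" %% "spark-core" % "0.6.0"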

Standalone Jobs
Configure Spark’s install location and your job’s classpath as environment variables:
  export SPARK_HOME=...
  export SPARK_CLASSPATH=...

Or pass extra args to SparkContext:
  new SparkContext(master, name, sparkHome, jarList)
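Putting these pieces together, a complete standalone job might look like the following sketch (the object name, paths, and jar name are hypothetical placeholders):

import spark.SparkContext
import spark.SparkContext._

object SimpleJob {
  def main(args: Array[String]) {
    // master, job name, Spark install location, and jars to ship to workers
    val sc = new SparkContext("local[2]", "SimpleJob",
      "/path/to/spark", List("target/simple-job.jar"))

    val data = sc.parallelize(1 to 1000)
    println("sum = " + data.reduce(_ + _))
  }
}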

Where to Go From Here
Programming guide: www.spark-project.org/documentation.html
Example programs: examples/src/main/scala
RDD ops: RDD.scala, PairRDDFunctions.scala

Next Meetup
Thursday, Feb 23rd at 6:30 PM
Conviva, Inc
2 Waters Park Drive
San Mateo, CA 94403