Spark: Lightning-Fast Cluster Computing (UC Berkeley)
This Meetup
1. Project history and plans
2. Spark tutorial
» Running locally and on EC2
» Interactive shell
» Standalone jobs
3. Spark at Quantifind
Project Goals
AMP Lab: design the next-generation data analytics stack
» By scaling up Algorithms, Machines and People
Mesos: cluster manager
» Make it easy to write and deploy distributed apps
Spark: parallel computing system
» General and efficient computing model supporting in-memory execution
» High-level API in the Scala language
» Substrate for even higher-level APIs (SQL, Pregel, …)
Where We’re Going (stack diagram)
» Runs on: private cluster, Amazon EC2, …
» Cluster manager: Mesos (also hosting Hadoop, MPI, …)
» On Mesos: Spark
» On top of Spark: Bagel (Pregel on Spark), Shark (Hive on Spark), Streaming Spark, debug tools, …
Some Users
Project Stats
Core classes: 8,700 LOC
Interpreter: 3,300 LOC
Examples: 1,300 LOC
Tests: 1,100 LOC
Total: 15,000 LOC
Getting Spark
Requirements: Java 6+, Scala
git clone git://github.com/mesos/spark.git
cd spark
sbt/sbt compile
Wiki data: tinyurl.com/wikisample
These slides: tinyurl.com/sum-talk
Running Locally
# run one of the example jobs:
./run spark.examples.SparkPi local
# launch the interpreter:
./spark-shell
Running on EC2
git clone git://github.com/apache/mesos.git
cd mesos/ec2
./mesos-ec2 -k keypair -i id_rsa.pem -s slaves \
  [launch|stop|start|destroy] clusterName
Details: tinyurl.com/mesos-ec2
Programming Concepts
SparkContext: entry point to Spark functionality
Resilient distributed datasets (RDDs):
» Immutable, distributed collections of objects
» Can be cached in memory for fast reuse
Operations on RDDs:
» Transformations: define a new RDD (map, join, …)
» Actions: return or output a result (count, save, …)
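For instance, a minimal sketch of these concepts in the interactive shell (where sc is predefined; the names and numbers are arbitrary):

val nums = sc.parallelize(List(1, 2, 3, 4))   // RDD from a Scala collection
val squares = nums.map(x => x * x)            // transformation: defines a new RDD, computes nothing yet
squares.cache()                               // mark the RDD for in-memory reuse
squares.reduce(_ + _)                         // action: runs the computation, returns 30
squares.count()                               // action: reuses the cached data, returns 4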
Creating a SparkContext
import spark.SparkContext
import spark.SparkContext._

val sc = new SparkContext("master", "jobName")

// Master can be:
// local – run locally with 1 thread
// local[K] – run locally with K threads
//
Creating RDDs
// turn a Scala collection into an RDD
sc.parallelize(List(1, 2, 3))

// text file from local FS, HDFS, S3, etc.
sc.textFile("file.txt")
sc.textFile("directory/*.txt")
sc.textFile("hdfs://namenode:9000/path/file")

// general Hadoop InputFormat:
sc.hadoopFile(keyCls, valCls, inputFmt, conf)
RDD Operations
Transformations (define a new RDD):
map, filter, sample, groupByKey, reduceByKey, cogroup, flatMap, union, join, cross, mapValues, …
Actions (return or output a result):
collect, reduce, take, fold, count, saveAsHadoopFile, saveAsTextFile, …
Persistence:
cache (keep RDD in RAM)
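As a concrete sketch combining a few of these operations in the shell (the input path and output directory are placeholders):

import spark.SparkContext._                          // brings pair-RDD ops like reduceByKey into scope

val lines = sc.textFile("input.txt")                 // placeholder path (local FS, HDFS, S3, …)
val counts = lines.flatMap(line => line.split(" "))  // transformation: split lines into words
                  .map(word => (word, 1))            // transformation: pair each word with 1
                  .reduceByKey(_ + _)                // transformation: sum the 1s per word
counts.cache()                                       // persistence: keep the result in RAM
counts.take(5)                                       // action: return the first 5 (word, count) pairs
counts.saveAsTextFile("counts")                      // action: write results to a directory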
Standalone Jobs
Without Maven: package Spark into a jar
sbt/sbt assembly
# use core/target/spark-core-assembly-*.jar
With Maven:
sbt/sbt publish-local
# add a dependency on org.spark-project / spark-core
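If you take the publish-local route, the dependency declaration in your own build might look like the sketch below; only the org.spark-project / spark-core coordinates come from the slide, while the libraryDependencies syntax and the version string are assumptions that depend on your sbt setup:

// placement and exact syntax depend on your sbt version; the version string is a placeholder
libraryDependencies += "org.spark-project" %% "spark-core" % "0.5.0"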
Standalone Jobs
Configure Spark’s install location and your job’s classpath as environment variables:
export SPARK_HOME=...
export SPARK_CLASSPATH=...
Or pass extra args to SparkContext:
new SparkContext(master, name, sparkHome, jarList)
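Putting the pieces together, a minimal standalone job might look like the sketch below; the master string, input path, and jar name are placeholders rather than values from the slides:

import spark.SparkContext

object LineCount {
  def main(args: Array[String]) {
    // master, job name, Spark install dir, and the jar(s) containing this job
    val sc = new SparkContext("local[2]", "LineCount",
                              System.getenv("SPARK_HOME"),
                              Seq("target/linecount.jar"))
    val lines = sc.textFile("input.txt")   // placeholder input path
    println("Line count: " + lines.count())
  }
}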
Where to Go From Here
Programming guide: www.spark-project.org/documentation.html
Example programs: examples/src/main/scala
RDD ops: RDD.scala, PairRDDFunctions.scala
Next Meetup
Thursday, Feb 23rd at 6:30 PM
Conviva, Inc.
2 Waters Park Drive
San Mateo, CA 94403