An Overview of Apache Spark

Presentation on theme: "An Overview of Apache Spark"— Presentation transcript:

1 An Overview of Apache Spark
Spark is really cool… This is NOT a silver bullet… This is another GREAT tool to have… Stop using hammers to drive screws. Jim Scott, Director, Enterprise Strategy and Architecture. July 10, 2014

2 Agenda
MapReduce Refresher
What is Spark?
The Difference with Spark
Preexisting MapReduce
Examples and Resources

3 MapReduce Refresher

4 MapReduce Basics
Foundational model is based on a distributed file system (scalability and fault tolerance).
Map: loading of the data and defining a set of keys.
Reduce: collects the organized key-based data to process and output. Many use cases do not utilize a reduce task.
Performance can be tweaked based on known details of your source files and cluster shape (size, total number of nodes).
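To make the model concrete, here is a toy word-count sketch of the map, shuffle, and reduce phases using plain Scala collections (illustration only, not the Hadoop API):

val docs = Seq("the quick brown fox", "the lazy dog")
// Map: load the data and emit (key, value) pairs
val mapped = docs.flatMap(_.split(" ")).map(word => (word, 1))
// Shuffle: group values by key (the framework does this automatically)
val shuffled = mapped.groupBy(_._1)
// Reduce: collect the organized key-based data to process and output
val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }
reduced.foreach(println) // e.g. (the,2), (fox,1), ...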

5 Languages and Frameworks
Languages: Java, Scala, Clojure, Python, Ruby
Higher-level languages: Hive, Pig
Frameworks: Cascading, Crunch
DSLs: Scalding, Scrunch, Scoobi, Cascalog
When do you use regular MapReduce over higher-level languages? When Hive? When Pig? When anything?

6 MapReduce Processing Model
Define mappers. Shuffling is automatic. Define reducers. For complex work, chain jobs together, or use a higher-level language or DSL that does this for you. You may have use cases you don't feel comfortable moving to Spark (e.g. image processing).

7 What is Spark?

8 Apache Spark Originally developed in 2009 in UC Berkeley’s AMP Lab
Fully open sourced in 2010 – now a Top Level Project at the Apache Software Foundation. You can find project resources on the Apache Spark site, along with information about the mailing list (including archives). spark.apache.org github.com/apache/spark

9 The Spark Community Yahoo and Adobe are in production with Spark.

10 Spark is the Most Active Open Source Project in Big Data
Project contributors in past year

11 Unified Platform
Spark (general execution engine) underneath: Spark SQL (SQL), Spark Streaming (streaming), MLlib (machine learning), GraphX (graph computation). Continued innovation is bringing new functionality, e.g.: Java 8 (closures, lambda expressions), BlinkDB (approximate queries), SparkR (R wrapper for Spark).

12 Machine Learning - MLlib
K-Means, L1- and L2-regularized Linear Regression, L1- and L2-regularized Logistic Regression, Alternating Least Squares, Naive Bayes, Stochastic Gradient Descent. ** Mahout is no longer accepting MapReduce algorithm submissions, in favor of Spark.
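As a quick illustration, a minimal K-Means run with MLlib in Scala (the input path, k = 3, and the iteration count are hypothetical choices, not from the slides):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse whitespace-delimited numeric features into MLlib vectors
val data = sc.textFile("hdfs://.../kmeans_data.txt")
val points = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()

// Train with k = 3 clusters, at most 20 iterations
val model = KMeans.train(points, 3, 20)
println("Within-set sum of squared errors: " + model.computeCost(points))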

13 Data Sources
Local files: file:///opt/httpd/logs/access_log
S3
Hadoop Distributed Filesystem: regular files, sequence files, any other Hadoop InputFormat
HBase
Spark is agnostic about the backends.
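A minimal sketch of pointing Spark at a few of these backends (hostnames and paths are hypothetical; s3n:// is the scheme of this era's Hadoop S3 connector):

val local = sc.textFile("file:///opt/httpd/logs/access_log")                 // local file
val s3    = sc.textFile("s3n://my-bucket/logs/")                             // S3
val hdfs  = sc.textFile("hdfs://namenode:8020/logs/access_log")              // HDFS
val seq   = sc.sequenceFile[String, String]("hdfs://namenode:8020/data.seq") // sequence file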

14 Deploying Spark – Cluster Manager Types
Standalone mode, Mesos, YARN, EC2, GCE. Best use case? Standalone, followed by Mesos… My personal opinion is that Mesos is where the future will take us.

15 Supported Languages: Java, Scala, Python… Hive?

16 The Spark Stack from 100,000 ft
4. Spark ecosystem
3. Spark core engine
2. Execution environment: Mesos, EC2, GCE (you could even try YARN)
1. Data platform: data comes from somewhere… local filesystem, Hadoop, S3…

17 The Difference with Spark

18 Easy and Fast Big Data
Easy to develop: rich APIs in Java, Scala, Python; interactive shell.
Fast to run: general execution graphs, in-memory storage. 2-5× less code; up to 10× faster on disk, 100× in memory.
This sounds a lot like the reason to consider Pig vs. Java MapReduce.

19 Resilient Distributed Datasets (RDD)
Spark revolves around RDDs: fault-tolerant collections of elements that can be operated on in parallel.
Parallelized collection: a Scala collection which is run in parallel.
Hadoop dataset: records of files supported by Hadoop.
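A minimal sketch of both creation paths (the HDFS path is hypothetical):

val parallelized = sc.parallelize(Seq(1, 2, 3, 4, 5))           // parallelized collection
val hadoopData   = sc.textFile("hdfs://namenode:8020/data.txt") // Hadoop dataset
println(parallelized.map(_ * 2).reduce(_ + _))                  // operated on in parallel: prints 30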

20 Directed Acyclic Graph (DAG)
Directed: edges only go in a single direction. Acyclic: no looping. Why does this matter? It supports fault tolerance. It looks kind of like a source control tree: predecessors plus actions and transformations are used to reproduce the data for a failed node.

21 RDD Fault Recovery
RDDs track lineage information that can be used to efficiently recompute lost data:
msgs = textFile.filter(lambda s: s.startswith("ERROR"))
               .map(lambda s: s.split("\t")[2])
HDFS File → filter (func = startswith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD

22 RDD Operations
Transformations: creation of a new dataset from an existing one. map, filter, distinct, union, sample, groupByKey, join, etc.
Actions: return a value after running a computation. collect, count, first, takeSample, foreach, etc.
Check the documentation for a complete list.
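For example (input path hypothetical), note that transformations are lazy and only the actions trigger computation:

val lines  = sc.textFile("hdfs://.../access_log")
val errors = lines.filter(_.contains("ERROR"))       // transformation: lazy, just records lineage
val fields = errors.map(_.split("\t")(2)).distinct() // transformations: still nothing has run
val n      = errors.count()                          // action: triggers the actual computation
fields.take(5).foreach(println)                      // action: pulls a few results to the driver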

23 RDD Persistence / Caching
Variety of storage levels: MEMORY_ONLY (default), MEMORY_AND_DISK, etc.
API calls: persist(StorageLevel); cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
Considerations:
Read from disk vs. recompute (MEMORY_AND_DISK)
Total memory storage size (MEMORY_ONLY_SER)
Replicate to a second node for faster fault recovery (MEMORY_ONLY_2): think about this option if supporting a web application
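A minimal caching sketch (the path is hypothetical):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://.../access_log")
logs.persist(StorageLevel.MEMORY_AND_DISK) // spill to disk rather than recompute
// logs.cache() would be shorthand for persist(StorageLevel.MEMORY_ONLY)
println(logs.count()) // first action materializes the cache
println(logs.count()) // second action reads from the cache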

24 Cache Scaling (chart: performance degrades gracefully as less of the working set fits in cache)

25 Comparison to Storm
Higher throughput than Storm:
Spark Streaming: 670k records/sec/node
Storm: 115k records/sec/node
Commercial systems: ~100k records/sec/node
Why is this faster? Windows and micro-batching… Do you really need sub-half-second streaming?

26 Interactive Shell
Iterative development: cache those RDDs, open the shell, and ask questions. We have all wished we could do this with MapReduce. Compile/save your code for scheduled jobs later.
Scala: spark-shell
Python: pyspark
You can import MLlib to use here in the shell!
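A minimal spark-shell session sketch (sc is provided by the shell; the path is hypothetical):

val logs = sc.textFile("hdfs://.../access_log").cache()
logs.filter(_.contains("ERROR")).count() // first count materializes the cache
logs.filter(_.contains("WARN")).count()  // follow-up questions hit the cached data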

27 Preexisting MapReduce

28 Existing Jobs
Java MapReduce: port them over if you need better performance; be sure to share the results and learnings.
Pig scripts: port them over; try SPORK!
Hive queries: see Spark SQL…

29 Spark SQL
Shark is officially dead; long live Spark SQL.
Hive-compatible (HiveQL, UDFs, metadata): works in existing Hive warehouses without changing queries or data!
Augments Hive with in-memory tables and a columnar memory store.
Fast execution engine: uses Spark as the underlying execution engine; low-latency, interactive queries; scales out and tolerates worker failures.
People I know say nearly everything from Hive runs faster on Spark SQL, but you may still hit an edge case.
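For flavor, a minimal Spark SQL sketch against a Hive table, using the modern SparkSession entry point (the API and the web_logs table name are assumptions, not from the slides; the 2014-era equivalent was HiveContext):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SparkSQLExample")
  .enableHiveSupport() // read existing Hive warehouses without changing queries or data
  .getOrCreate()

// Plain HiveQL runs unchanged
val hits = spark.sql("SELECT status, COUNT(*) AS hits FROM web_logs GROUP BY status")
hits.show()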

30 Examples and Resources

31 Word Count
Java MapReduce: ~15 lines of code. Java Spark: ~7 lines of code. Scala and Python: 4 lines of code (in the interactive shell: skip line 1 and replace the last line with counts.collect()). Java 8: 4 lines of code.

Scala:
val sc = new SparkContext(master, appName, [sparkHome], [jars])
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Java 8:
JavaSparkContext sc = new JavaSparkContext(master, appName, [sparkHome], [jars]);
JavaRDD<String> file = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = file
    .flatMap(line -> Arrays.asList(line.split(" ")))
    .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
    .reduceByKey((x, y) -> x + y);
counts.saveAsTextFile("hdfs://...");

32 Network Word Count – Streaming
// Create the context with a 1 second batch size
val ssc = new StreamingContext(args(0), "NetworkWordCount", Seconds(1),
  System.getenv("SPARK_HOME"), StreamingContext.jarOfClass(this.getClass))
// Create a NetworkInputDStream on the target host:port and count the
// words in the input stream of \n-delimited text (e.g. generated by 'nc')
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY_SER)
val words = lines.flatMap(_.split(" "))
val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
wordCounts.print()
ssc.start()
ssc.awaitTermination() // block until the streaming computation terminates

33 Remember
If you want to use a new technology, you must learn that new technology. For those who have been using Hadoop for a while: at one time you had to learn all about MapReduce and how to manage and tune it.
To get the most out of a new technology you need to learn that technology, and this includes tuning. There are switches you can use to optimize your work.
Don't forget to share your experiences; this is really what the community is about. If you don't have time to contribute to open source, use it and share your experiences!

34 Configuration
http://spark.apache.org/docs/latest/
Most important: Application Configuration, Standalone Cluster Configuration, Tuning Guide
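A minimal application-configuration sketch; the specific values are illustrative, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MyApp")
  .setMaster("spark://master:7077")   // standalone cluster master
  .set("spark.executor.memory", "4g") // one of the switches you can use to optimize
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)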

35 Resources
Pig on Spark
Latest on Spark
This isn't all proven out yet, but some of it should just work already.

36 Q & A Engage with us! @kingmesal maprtech MapR mapr-technologies
maprtech

