High-Speed In-Memory Analytics over Hadoop and Hive Data

Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data. Reynold Xin 辛湜 (shi2), UC Berkeley

Prof. Michael Franklin

My Background: PhD student in the AMP Lab at UC Berkeley, a 50-person lab focusing on big data. Works on the Berkeley Data Analytics Stack (BDAS). Work/intern experience at Google Research, IBM DB2, Altera.

AMPLab @ Berkeley: NSF & DARPA funded. Sponsors: Amazon Web Services / Google / SAP, Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies). Collaboration across databases (Mike Franklin – sitting right here), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).

What is Spark? Not a modified version of Hadoop, but a separate, fast, MapReduce-like engine: in-memory data storage for very fast iterative queries; general execution graphs and powerful optimizations; up to 100x faster than Hadoop MapReduce. Compatible with Hadoop's storage APIs: can read/write to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.

What is Shark? A SQL analytics engine built on top of Spark. Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.). Similar speedups of up to 100x.

Project History: Spark project started in 2009, open sourced in 2010. Shark started summer 2011, open sourced 2012. Spark 0.7 released Feb 2013, including a streaming alpha. GraphLab support coming soon.

Adoption: In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease). 600+ member meetup, 800+ watchers on GitHub. 30+ contributors (including Intel Shanghai). AMPCamp: 150 attendees, 5000 online attendees. AMPCamp on Mar 15 at ECNU, Shanghai.

Hive: A data warehouse initially developed by Facebook. Puts structure onto HDFS data (schema-on-read). Compiles HiveQL queries into MapReduce jobs. Flexible and extensible: supports UDFs, scripts, custom serializers and storage formats. Popular: 90+% of Facebook Hadoop jobs are generated by Hive. Built for OLAP (analytics), not OLTP (serving).

Determinism & Idempotence: Map and reduce functions are deterministic and side-effect free. Tasks are thus idempotent: rerunning them gets you the same result.
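
A minimal Scala sketch of this contract (the function and record format are illustrative): because the function is deterministic and touches no outside state, rerunning the same task over the same input emits identical output.

// Deterministic and side-effect free: output depends only on the input record.
def parseRecord(line: String): (String, Long) = {
  val fields = line.split("\t")
  (fields(0), fields(1).toLong)
}
// Rerunning a task that applies parseRecord to the same partition is idempotent:
// the same inputs always produce the same key-value pairs.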

Determinism & Idempotence: Fault-tolerance: rerun tasks originally scheduled on failed nodes. Stragglers: rerun tasks originally scheduled on slow nodes.

This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

Why go Beyond MapReduce? MapReduce simplified big data analysis by giving a reliable programming model for large clusters. But as soon as it got popular, users wanted more: more complex, multi-stage applications, and more interactive ad-hoc queries.

Why go Beyond MapReduce? Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing. In MapReduce, the only way to share data across jobs is stable storage (e.g. HDFS) -> slow! [Diagram: a multi-stage iterative algorithm (Stages 1-3) and interactive data mining (Queries 1-3), both forced to share data through stable storage]

Examples: [Diagram: an iterative job alternates HDFS read, iter. 1, HDFS write, HDFS read, iter. 2, …; interactive queries 1-3 each re-read the same input from HDFS to produce results 1-3] Each iteration is, for example, a MapReduce job. I/O and serialization can take 90% of the time.

Goal: In-Memory Data Sharing. [Diagram: after one-time processing of the input, iterations (iter. 1, iter. 2, …) and queries 1-3 share data through distributed memory] 10-100× faster than network and disk.

Solution: Resilient Distributed Datasets (RDDs). Distributed collections of objects that can be stored in memory for fast reuse. Automatically recover lost data on failure. Support a wide range of applications. RDDs = a first-class way to manipulate and persist intermediate datasets.

Programming Model: Resilient distributed datasets (RDDs) are immutable, partitioned collections of objects that can be cached in memory for efficient reuse. Transformations (e.g. map, filter, groupBy, join) build RDDs from other RDDs. Actions (e.g. count, collect, save) return a result or write it to storage. You write a single driver program, similar to DryadLINQ. Distributed datasets with parallel operations on them are fairly standard; the new thing is that they can be reused across operations. Variables in the driver program can be used in parallel operations; accumulators are useful for sending information back, and cached variables are an optimization. It is all designed to be easy to distribute in a fault-tolerant fashion.
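
A minimal sketch of the model (assuming sc is the usual SparkContext handle; the data is illustrative): transformations only describe a new RDD and are lazy, while actions trigger the actual computation.

val nums = sc.parallelize(1 to 1000000)        // build an RDD from a local collection
val evens = nums.filter(_ % 2 == 0)            // transformation: lazy, no work yet
val squares = evens.map(n => n.toLong * n)     // transformation: still lazy
println(squares.count())                       // action: runs the whole pipeline
println(squares.take(5).mkString(", "))        // action: returns results to the driver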

Example: Log Mining. Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")           // base RDD
val errors = lines.filter(_.startsWith("ERROR"))   // transformed RDD
val messages = errors.map(_.split('\t')(2))
val cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count         // action
cachedMsgs.filter(_.contains("bar")).count
. . .

[Diagram: the driver ships tasks to workers; each worker reads a block of the log and caches its partition of messages in memory]

Key idea: add "variables" to the "functions" in functional programming.

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

Hadoop MapReduce word count (Java), for comparison:

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Word Count in Spark:

myrdd.flatMap { doc => doc.split("\\s") }
     .map { word => (word, 1) }
     .reduceByKey { case (v1, v2) => v1 + v2 }

or, more concisely:

myrdd.flatMap(_.split("\\s"))
     .map((_, 1))
     .reduceByKey(_ + _)
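
A small follow-on sketch (hypothetical, reusing myrdd from above): swapping keys and values and sorting gives the most frequent words.

val counts = myrdd.flatMap(_.split("\\s")).map((_, 1)).reduceByKey(_ + _)
val topTen = counts
  .map(_.swap)                       // (count, word)
  .sortByKey(ascending = false)      // highest counts first
  .take(10)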

Fault Tolerance: RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data. E.g.:

val messages = textFile(...).filter(_.contains("error"))
                            .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(…)) -> MappedRDD (func = _.split(…))
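
A hedged aside (toDebugString is part of the Spark RDD API, though the exact RDD names it prints vary by version): the lineage chain can be printed directly.

val messages = sc.textFile("hdfs://...")
  .filter(_.contains("error"))
  .map(_.split('\t')(2))
println(messages.toDebugString)      // prints the chain of RDDs this one derives from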

Fault Recovery Results: [Chart: per-iteration running times for K-means; the iteration where the failure happens takes longer while lost partitions are recomputed from lineage, then times return to normal]

Tradeoff Space: [Chart: granularity of updates (fine vs. coarse) against write throughput (low to high), with network bandwidth and memory bandwidth as reference bounds. Fine-grained updates: K-V stores, databases, RAMCloud (transactional workloads). Coarse-grained updates: HDFS and RDDs (batch analytics)]

Behavior with Not Enough RAM

Example: Logistic Regression.

val data = spark.textFile(...).map(readPoint).cache()   // load data in memory once
var w = Vector.random(D)                                // initial parameter vector
for (i <- 1 to ITERATIONS) {                            // repeated MapReduce steps for gradient descent
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}
println("Final w: " + w)

Key idea: add "variables" to the "functions" in functional programming.
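
The slide leaves readPoint, Vector, D, and ITERATIONS undefined. A minimal sketch of plausible definitions (illustrative only; Vector here stands for the small vector class with dot, +, -, and * used in Spark's early examples):

case class DataPoint(x: Vector, y: Double)

val D = 10                                     // number of dimensions
val ITERATIONS = 10

def readPoint(line: String): DataPoint = {
  val nums = line.split(' ').map(_.toDouble)
  DataPoint(new Vector(nums.tail), nums.head)  // first field is the label y
}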

Logistic Regression Performance: [Chart: Hadoop takes 110 s per iteration; Spark takes 80 s for the first iteration and 1 s for further iterations] This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

Supported Operators: map, filter, groupBy, sort, join, leftOuterJoin, rightOuterJoin, reduce, count, reduceByKey, groupByKey, first, union, cross, sample, cogroup, take, partitionBy, pipe, save, ...
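
A small sketch exercising a few of these on pair RDDs (sc and the data are illustrative):

val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (2, "search")))

val visitsPerUser = visits.map { case (id, _) => (id, 1) }.reduceByKey(_ + _)
val withNames = visitsPerUser.join(users)      // (id, (visitCount, name)), matching keys only
println(withNames.collect().mkString(", "))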

Scheduler: Dryad-like task DAG. Pipelines functions within a stage. Cache-aware data locality & reuse. Partitioning-aware to avoid shuffles. [Diagram: a DAG over RDDs A-G with groupBy, map, union, and join edges, cut into Stages 1-3; previously computed partitions are skipped] NOT a modified version of Hadoop.
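
A hedged sketch of the partitioning-aware part (HashPartitioner and partitionBy are standard Spark API; the data is illustrative): two RDDs hashed by the same partitioner can be joined without a further shuffle.

val part   = new HashPartitioner(8)
val pairs1 = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(part).cache()
val pairs2 = sc.parallelize(Seq((1, "x"), (2, "y"))).partitionBy(part).cache()
val joined = pairs1.join(pairs2)               // co-partitioned inputs: no extra shuffle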

Implementation: Uses Mesos / YARN to share resources with Hadoop & other frameworks. Can access any Hadoop input source (HDFS, S3, …). [Diagram: Spark, Hadoop, and MPI running side by side on cluster nodes managed by Mesos or YARN] The core engine is only 20k lines of code.

User Applications In-memory analytics & anomaly detection (Conviva) Interactive queries on data streams (Quantifind) Exploratory log analysis (Foursquare) Traffic estimation w/ GPS data (Mobile Millennium) Twitter spam classification (Monarch) . . .

Conviva GeoReport: [Chart: running time in hours, Hive vs. Spark] Group aggregations on many keys with the same filter. 40× gain over Hive from avoiding repeated reading, deserialization, and filtering.

Mobile Millennium Project Estimate city traffic from crowdsourced GPS data Iterative EM algorithm scaling to 160 nodes Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

Challenges Data volumes expanding. Faults and stragglers complicate parallel database design. Complexity of analysis: machine learning, graph algorithms, etc. Low-latency, interactivity.

MPP Databases: Vertica, SAP HANA, Teradata, Google Dremel... Fast! Generally not fault-tolerant; challenging for long-running queries as clusters scale up. Lack rich analytics such as machine learning and graph algorithms.

MapReduce: Apache Hive, Google Tenzing, Turn Cheetah … Deterministic, idempotent tasks enable fine-grained fault-tolerance and resource sharing. Expressive enough for machine learning algorithms. High latency; dismissed for interactive workloads.

Shark: A data warehouse that builds on Spark; scales out and is fault-tolerant; supports low-latency, interactive queries through in-memory computation; supports both SQL and complex analytics; and is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).

Hive Architecture: [Diagram: Client (CLI, JDBC) talks to a Driver comprising SQL Parser, Query Optimizer, and Physical Plan Execution, backed by a Metastore; plans execute as MapReduce jobs over HDFS]

Shark Architecture: [Diagram: the same Client, Driver, and Metastore structure as Hive, plus a Cache Manager; plans execute on Spark over HDFS] Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators such as hash-based join.

Engine Features Columnar Memory Store Machine Learning Integration Partial DAG Execution Hash-based Shuffle vs Sort-based Shuffle Data Co-partitioning Partition Pruning based on Range Statistics ...

Efficient In-Memory Storage: Simply caching Hive records as Java objects is inefficient due to high per-object overhead. Instead, Shark employs column-oriented storage using arrays of primitive types.

Row storage: (1, john, 4.1), (2, mike, 3.5), (3, sally, 6.4)
Column storage: [1, 2, 3], [john, mike, sally], [4.1, 3.5, 6.4]

Benefit: similarly compact size to serialized data, but >5x faster to access.
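
A minimal Scala sketch of the two layouts (illustrative; not Shark's actual column-store code):

// Row layout: one object per record; every JVM object carries header overhead.
case class Row(id: Int, name: String, score: Double)
val rows = Seq(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

// Column layout: one array per column; primitive columns stay unboxed and compact.
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)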

Machine Learning Integration Unified system for query processing and machine learning Query processing and ML share the same set of workers and caches
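
A hedged sketch of the pattern this enables (sql2rdd is Shark's Scala hook for turning a HiveQL query into an RDD; the table and row accessors here are illustrative and version-dependent):

val rows = sc.sql2rdd("SELECT age, income FROM users")  // query result as an in-memory RDD
val points = rows.map(r => (r.getDouble(0), r.getDouble(1)))
// points can now feed any Spark code (e.g. clustering) on the same workers and caches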

Performance: [Chart: query runtimes on 1.7 TB of real warehouse data on 100 EC2 nodes]

This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

Why are previous MR-based systems slow? Disk-based intermediate outputs. Inferior data format and layout (no control of data co-partitioning). Execution strategies (lack of optimization based on data statistics). Task scheduling and launch overhead!

Task Launch Overhead: Hadoop uses heartbeats to communicate scheduling decisions, so task launch delay is 5-10 seconds. Spark uses an event-driven architecture and can launch tasks in 5 ms, which enables better parallelism, easier straggler mitigation, elasticity, and multi-tenancy resource sharing.

Task Launch Overhead

This Talk Hadoop & MapReduce Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

What's Next? Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps). Another emerging use case that needs fast data sharing is stream processing: track and update state in memory as events arrive. Examples: large-scale reporting, click analysis, spam filtering, etc.

Streaming Spark (alpha released Feb 2013): Extends Spark to perform streaming computations. Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs. Intermixes seamlessly with batch and ad-hoc queries.

tweetStream
  .flatMap(_.toLower.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Diagram: map and reduceByWindow stages over batches at T=1, T=2, …]

Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency. [Zaharia et al, HotCloud 2012]

Conclusion Spark and Shark speed up your interactive and complex analytics on Hadoop data Download and docs: www.spark-project.org Easy to run locally, on EC2, or on Mesos and YARN Email: rxin@cs.berkeley.edu Twitter: @rxin Weibo: @hashjoin