High-Speed In-Memory Analytics over Hadoop and Hive Data

Presentation transcript:

1 Spark and Shark: High-Speed In-Memory Analytics over Hadoop and Hive Data
Reynold Xin (辛湜, shi2), UC Berkeley

2 Prof. Michael Franklin

3 My Background
PhD student in the AMP Lab at UC Berkeley, a 50-person lab focusing on big data. I work on the Berkeley Data Analytics Stack (BDAS). Work/intern experience at Google Research, IBM DB2, and Altera.

4 Berkeley
The AMP Lab is NSF & DARPA funded. Sponsors: Amazon Web Services, Google, SAP, Intel, Huawei, Facebook, Hortonworks, Cloudera … (21 companies). A collaboration across databases (Mike Franklin), systems (Ion Stoica), networking (Scott Shenker), architecture (David Patterson), and machine learning (Michael Jordan).

5 What is Spark?
Not a modified version of Hadoop, but a separate, fast, MapReduce-like engine: in-memory data storage for very fast iterative queries; general execution graphs and powerful optimizations; up to 100x faster than Hadoop MapReduce. Compatible with Hadoop's storage APIs: it can read/write any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
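A minimal sketch of that storage compatibility, assuming a SparkContext named sc and Spark-0.7-era Scala APIs (the paths here are hypothetical):

// Read plain text from HDFS; any Hadoop-supported URI works the same way
val lines = sc.textFile("hdfs://namenode:9000/logs/part-*")

// Read a Hadoop SequenceFile as key/value pairs
val pairs = sc.sequenceFile[String, Int]("hdfs://namenode:9000/counts")

// Write results back through the same Hadoop storage APIs
pairs.saveAsTextFile("hdfs://namenode:9000/counts-out")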

6 What is Shark?
A SQL analytics engine built on top of Spark, compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.), with similar speedups of up to 100x.

7 Project History
The Spark project started in 2009 and was open sourced in 2010. Shark started in summer 2011 and was open sourced in 2012. Spark 0.7 was released in Feb 2013, including a streaming alpha; GraphLab support is coming soon.

8 Adoption
In use at Yahoo!, Foursquare, Berkeley, Princeton & many others (possibly Taobao, Netease). 600+ meetup members, 800+ watchers on GitHub, 30+ contributors (including Intel Shanghai). AMPCamp drew 150 in-person attendees and 5,000 online; an AMPCamp on Mar 15 at ECNU, Shanghai.

11 Hive
A data warehouse initially developed by Facebook. It puts structure onto HDFS data (schema-on-read), compiles HiveQL queries into MapReduce jobs, and is flexible and extensible: it supports UDFs, scripts, and custom serializers and storage formats. Popular: 90+% of Facebook Hadoop jobs are generated by Hive. Hive targets OLAP (analytics), not OLTP (serving).

12 Determinism & Idempotence
Map and reduce functions are deterministic and side-effect free. Tasks are thus idempotent: rerunning them gets you the same result.

14 Determinism & Idempotence
Fault tolerance: rerun tasks originally scheduled on failed nodes. Stragglers: rerun tasks originally scheduled on slow nodes.
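To make the idempotence point concrete, here is a small illustrative sketch in plain Scala (not the Hadoop or Spark API): because the task body is a pure function of its input split, the scheduler can rerun it after a failure, or run a speculative copy on another node for a straggler, and keep whichever copy finishes first.

// Deterministic, side-effect-free "map task": same input => same output
def mapTask(split: Seq[String]): Seq[(String, Int)] =
  split.flatMap(_.split("\\s+")).map(word => (word, 1))

val split = Seq("error at node1", "error at node7")
// Rerunning the task is safe: both runs produce identical results
assert(mapTask(split) == mapTask(split))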

15 This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

16 Why go Beyond MapReduce?
MapReduce simplified big data analysis by providing a reliable programming model for large clusters. But as soon as it got popular, users wanted more: more complex, multi-stage applications, and more interactive ad-hoc queries.

17 Why go Beyond MapReduce?
Complex jobs and interactive queries both need one thing that MapReduce lacks: efficient primitives for data sharing, whether across the stages of an iterative algorithm or across the queries of an interactive data-mining session.

18 Why go Beyond MapReduce?
Complex jobs and interactive queries both need efficient primitives for data sharing, but in MapReduce the only way to share data across jobs is stable storage (e.g. HDFS), which is slow!

19 Examples
Iterative processing: each iteration (for example, one MapReduce job) reads its input from HDFS and writes its result back to HDFS, only for the next iteration to read it again. Interactive mining: queries 1, 2, and 3 each re-read the same input from HDFS to produce results 1, 2, and 3. In both cases, I/O and serialization can take 90% of the time.

20 Goal: In-Memory Data Sharing
Keep data in distributed memory: one-time processing of the input, then iterations (iter. 1, iter. 2, ...) and queries (query 1, 2, 3) share it in memory. Distributed memory is 10-100× faster than network and disk.

21 Solution: Resilient Distributed Datasets (RDDs)
Distributed collections of objects that can be stored in memory for fast reuse. They automatically recover lost data on failure and support a wide range of applications. RDDs are a first-class way to manipulate and persist intermediate datasets.

22 Programming Model
Resilient distributed datasets (RDDs): immutable, partitioned collections of objects that can be cached in memory for efficient reuse. Transformations (e.g. map, filter, groupBy, join) build RDDs from other RDDs; actions (e.g. count, collect, save) return a result or write it to storage. You write a single driver program, similar to DryadLINQ. Distributed datasets with parallel operations on them are pretty standard; the new thing is that they can be reused across operations, and the whole design makes this easy to distribute in a fault-tolerant fashion.
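A small sketch of the model, assuming a SparkContext named sc: transformations only record lineage and run lazily, while actions trigger actual computation.

val nums   = sc.parallelize(1 to 1000)   // base RDD
val evens  = nums.filter(_ % 2 == 0)     // transformation: nothing runs yet
val cached = evens.cache()               // mark for in-memory reuse
val n   = cached.count()                 // action: job runs, result is cached
val all = cached.collect()               // second action reuses the cache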

23 Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")          // base RDD
errors = lines.filter(_.startsWith("ERROR"))  // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count    // action: runs on the workers;
cachedMsgs.filter(_.contains("bar")).count    // later queries hit the cache
. . .

Key idea: add "variables" to the "functions" in functional programming.
Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data); full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data).

24 Word Count in Hadoop MapReduce

// (These are static nested classes inside a driver class; the slide omits
// the imports from java.io, java.util, org.apache.hadoop.io and
// org.apache.hadoop.mapred.)
public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

25 Word Count

myrdd.flatMap { doc => doc.split("\\s") }
     .map { word => (word, 1) }
     .reduceByKey { case (v1, v2) => v1 + v2 }

Or, more tersely:

myrdd.flatMap(_.split("\\s"))
     .map((_, 1))
     .reduceByKey(_ + _)

26 Fault Tolerance
RDDs track the series of transformations used to build them (their lineage) and use it to recompute lost data. E.g.:

messages = textFile(...).filter(_.contains("error"))
                        .map(_.split('\t')(2))

Lineage chain: HadoopRDD (path = hdfs://…) -> FilteredRDD (func = _.contains(...)) -> MappedRDD (func = _.split(…))

27 Fault Recovery Results
(Chart: per-iteration running times for K-means; a failure happens in one iteration and the lost partitions are recomputed from lineage.)

28 Tradeoff Space
The design space trades granularity of updates against write throughput. Fine-grained updates at low throughput: K-V stores, databases, RAMCloud (transactional workloads), bounded by network bandwidth. Coarse-grained updates at high throughput: HDFS and RDDs (batch analytics), bounded by memory bandwidth.

30 Behavior with Not Enough RAM
(Chart: iteration time as the fraction of the working set cached in memory varies; performance degrades gracefully rather than falling off a cliff.)

31 Example: Logistic Regression

// Load data in memory once
val data = spark.textFile(...).map(readPoint).cache()

// Initial parameter vector
var w = Vector.random(D)

// Repeated MapReduce steps to do gradient descent
for (i <- 1 to ITERATIONS) {
  val gradient = data.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= gradient
}

println("Final w: " + w)

Key idea: add "variables" to the "functions" in functional programming.

32 Logistic Regression Performance
Hadoop: 110 s / iteration. Spark: 80 s for the first iteration (loading the data), then 1 s per further iteration. This is for a 29 GB dataset on 20 EC2 m1.xlarge machines (4 cores each).

33 Supported Operators map filter groupBy sort join leftOuterJoin rightOuterJoin reduce count reduceByKey groupByKey first union cross sample cogroup take partitionBy pipe save ...
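A short sketch exercising a few of these operators, assuming a SparkContext named sc (the data is made up):

val users  = sc.parallelize(Seq((1, "john"), (2, "mike"), (3, "sally")))
val visits = sc.parallelize(Seq((1, "home"), (1, "cart"), (3, "home")))

users.join(visits).collect()            // inner join on the Int key
users.leftOuterJoin(visits).collect()   // keeps user 2, with no visits
visits.map { case (_, page) => (page, 1) }
      .reduceByKey(_ + _)
      .collect()                        // visit counts per page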

34 Scheduler
Dryad-like task DAG; pipelines functions within a stage; cache-aware data locality & reuse; partitioning-aware to avoid shuffles. (Diagram: a job over RDDs A-G built from map, union, groupBy, and join, split into three stages; previously computed partitions are shown shaded, so their stages need not rerun.)

35 Implementation
Uses Mesos / YARN to share resources with Hadoop & other frameworks, and can access any Hadoop input source (HDFS, S3, …). (Diagram: Spark, Hadoop, and MPI running side by side on Mesos or YARN across the nodes.) The core engine is only ~20k lines of code.
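A hedged sketch of wiring this up, using the Spark-0.7-era SparkContext constructor; the master URL and paths are illustrative assumptions:

// The master URL selects the cluster manager: Mesos here,
// or e.g. "local[4]" for local testing
val sc = new SparkContext("mesos://master.example.com:5050", "MyApp")

// Any Hadoop input source works, e.g. S3 via the s3n:// scheme
val logs = sc.textFile("s3n://my-bucket/logs/2013-02-*")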

36 User Applications In-memory analytics & anomaly detection (Conviva) Interactive queries on data streams (Quantifind) Exploratory log analysis (Foursquare) Traffic estimation w/ GPS data (Mobile Millennium) Twitter spam classification (Monarch) . . .

37 Conviva GeoReport
Group aggregations on many keys with the same filter: a 40× gain over Hive from avoiding repeated reading, deserialization, and filtering. (Chart: completion time in hours.)

38 Mobile Millennium Project
Estimates city traffic from crowdsourced GPS data, using an iterative EM algorithm scaling to 160 nodes. Credit: Tim Hunter, with support of the Mobile Millennium team; P.I. Alex Bayen; traffic.berkeley.edu

39 This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

40 Challenges
Data volumes keep expanding. Faults and stragglers complicate parallel database design. Analyses are increasingly complex: machine learning, graph algorithms, etc. And users want low latency and interactivity.

41 MPP Databases
Vertica, SAP HANA, Teradata, Google Dremel... Fast, but generally not fault-tolerant, which is challenging for long-running queries as clusters scale up; and they lack rich analytics such as machine learning and graph algorithms.

42 MapReduce
Apache Hive, Google Tenzing, Turn Cheetah... Deterministic, idempotent tasks enable fine-grained fault tolerance and resource sharing, and the model is expressive enough for machine-learning algorithms. But it is high-latency, so it is dismissed for interactive workloads.

44 Shark
A data warehouse that: builds on Spark; scales out and is fault-tolerant; supports low-latency, interactive queries through in-memory computation; supports both SQL and complex analytics; and is compatible with Apache Hive (storage, serdes, UDFs, types, metadata).

45 Hive Architecture
The client (CLI, JDBC) talks to the driver, which runs the SQL parser, query optimizer, and physical plan execution against the metastore; physical execution is MapReduce jobs over HDFS.

46 Shark Architecture
Same front end as Hive: client (CLI, JDBC), metastore, SQL parser, and query optimizer, plus a cache manager; physical execution runs on Spark over HDFS instead of MapReduce. Query planning is also better in Shark due to (1) more optimizations and (2) use of more optimized Spark operators, such as hash-based join.

47 Engine Features Columnar Memory Store Machine Learning Integration Partial DAG Execution Hash-based Shuffle vs Sort-based Shuffle Data Co-partitioning Partition Pruning based on Range Statistics ...

48 Efficient In-Memory Storage
Simply caching Hive records as Java objects is inefficient due to high per-object overhead. Instead, Shark employs column-oriented storage using arrays of primitive types.

Row storage:    [1, john, 4.1]  [2, mike, 3.5]  [3, sally, 6.4]
Column storage: [1, 2, 3]  [john, mike, sally]  [4.1, 3.5, 6.4]

49 Efficient In-Memory Storage
Same layout as above. Benefit: similarly compact size to serialized data, but >5x faster to access.
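A sketch of why the column layout wins, in plain Scala (illustrative, not Shark's actual classes):

// Row layout: one Java object per record, with per-object headers
// and pointer indirection on every access
case class Row(id: Int, name: String, score: Double)
val rows = Array(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

// Column layout: one primitive array per column; no per-record object
// header, no boxing, and a column scan touches contiguous memory
val ids    = Array(1, 2, 3)
val names  = Array("john", "mike", "sally")
val scores = Array(4.1, 3.5, 6.4)

val avgScore = scores.sum / scores.length  // scans only the score column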

50 Machine Learning Integration
A unified system for query processing and machine learning: both share the same set of workers and caches.
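A hedged sketch of that handoff: Shark could return query results as a Spark RDD via a sql2rdd-style call, so ML code runs on the same workers and caches. The exact names here (sharkCtx, sql2rdd, the row accessors) are assumptions for illustration:

// SQL produces an RDD of rows...
val young = sharkCtx.sql2rdd("SELECT age, income FROM users WHERE age < 30")

// ...which ordinary Spark code consumes directly, reusing the same
// cluster, workers, and in-memory cache as the SQL engine
val points = young.map(row => Array(row.getInt(0), row.getDouble(1)))
points.cache()
// points can now feed an iterative algorithm such as k-means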

51 Performance: 1.7 TB of real warehouse data on 100 EC2 nodes (chart).

52 This Talk Introduction Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

53 Why are previous MR-based systems slow?
Disk-based intermediate outputs. Inferior data format and layout (no control of data co-partitioning). Execution strategies (lack of optimization based on data statistics). Task scheduling and launch overhead!

54 Task Launch Overhead
Hadoop uses heartbeats to communicate scheduling decisions, so task launch delay is measured in seconds. Spark uses an event-driven architecture and can launch tasks in 5 ms, which enables better parallelism, easier straggler mitigation, elasticity, and multi-tenant resource sharing.

55 Task Launch Overhead

56 This Talk Hadoop & MapReduce Spark Shark: SQL on Spark Why is Hadoop MapReduce slow? Streaming Spark

57 What's Next?
Recall that Spark's model was motivated by two emerging uses (interactive and multi-stage apps). Another emerging use case that needs fast data sharing is stream processing: track and update state in memory as events arrive, for large-scale reporting, click analysis, spam filtering, etc.

58 Streaming Spark
Extends Spark to perform streaming computations. Runs as a series of small (~1 s) batch jobs, keeping state in memory as fault-tolerant RDDs. Intermixes seamlessly with batch and ad-hoc queries.

tweetStream
  .flatMap(_.toLowerCase.split(" "))
  .map(word => (word, 1))
  .reduceByWindow("5s", _ + _)

[Zaharia et al, HotCloud 2012]

59 Streaming Spark
(Same windowed word-count example as above.) Result: can process 42 million records/second (4 GB/s) on 100 nodes at sub-second latency. [Zaharia et al, HotCloud 2012]

60 Streaming Spark
Alpha released Feb 2013. (Same streaming model and example as above.) [Zaharia et al, HotCloud 2012]

61 Conclusion
Spark and Shark speed up your interactive and complex analytics on Hadoop data. Download and docs are available online; it is easy to run locally, on EC2, or on Mesos and YARN.

