Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Presentation transcript:

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark

Spark Review
Resilient distributed datasets (RDDs):
  – Immutable, distributed collections of objects
  – Can be cached in memory for fast reuse
Operations on RDDs:
  – Transformations: define a new RDD (map, join, …)
  – Actions: return or output a result (count, save, …)
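To make the transformation/action split concrete, here is a minimal plain-Scala sketch. It is a model built on standard collection views, not the Spark API: the object name `RddModel`, the sample data, and the `evaluations` counter are all assumptions made for illustration. The view only records the pipeline, and nothing is computed until the "action" forces it.

```scala
// Plain-Scala model of the transformation/action split (NOT the Spark API).
// All names and data here are hypothetical, chosen for illustration.
object RddModel {
  // Counts how many times the map function actually runs.
  var evaluations = 0

  val source: Seq[Int] = Seq(1, 2, 3, 4, 5)

  // "Transformations": a view only describes the pipeline; nothing runs yet.
  val doubledEvens = source.view
    .filter(_ % 2 == 0)
    .map { x => evaluations += 1; x * 2 }

  // "Action": forces the pipeline and returns a concrete result.
  def collect(): List[Int] = doubledEvens.toList
}
```

Calling `RddModel.collect()` a second time re-runs the pipeline, which loosely mirrors why an RDD that is reused across queries benefits from being cached in memory.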

Generality of RDDs
Despite their coarse-grained interface, RDDs can express surprisingly many parallel algorithms
  – These naturally apply the same operation to many items
Capture many current programming models
  – Data flow models: MapReduce, Dryad, SQL, …
  – Specialized models for iterative apps: BSP (Pregel), iterative MapReduce, incremental (CBP)
Support new apps that these models don’t

Spark Review: Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct lost partitions
Ex:
  messages = textFile(...).filter(_.startsWith("ERROR")).map(_.split('\t')(2))
Lineage: HDFSFile → FilteredRDD (filter, func = _.startsWith(...)) → MappedRDD (map, func = _.split(...))
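The lineage idea can be sketched as a toy model in plain Scala. This is an assumption-laden illustration, not Spark's implementation: `Dataset`, `Source`, `Filtered`, `Mapped`, the sample log lines, and the `cache` map are all hypothetical names. Each derived dataset stores only its parent and its function, so a lost partition can be rebuilt by replaying that chain from the source.

```scala
// Toy model of lineage-based recovery; names are illustrative, not Spark classes.
object LineageModel {
  // A dataset knows how to compute any one of its partitions.
  sealed trait Dataset[A] {
    def numPartitions: Int
    def compute(partition: Int): Seq[A]
  }

  // Stands in for a file in stable storage (HDFS on the slide).
  final case class Source[A](partitions: Vector[Seq[A]]) extends Dataset[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Seq[A] = partitions(partition)
  }

  // Lineage for filter: the parent dataset plus the predicate.
  final case class Filtered[A](parent: Dataset[A], p: A => Boolean) extends Dataset[A] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[A] = parent.compute(partition).filter(p)
  }

  // Lineage for map: the parent dataset plus the function.
  final case class Mapped[A, B](parent: Dataset[A], f: A => B) extends Dataset[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Seq[B] = parent.compute(partition).map(f)
  }

  // Hypothetical two-partition log with tab-separated fields.
  val log = Source(Vector(
    Seq("ERROR\tdisk\tfull", "INFO\tok\tnone"),
    Seq("ERROR\tnet\ttimeout")
  ))

  val messages: Dataset[String] =
    Mapped(Filtered(log, (l: String) => l.startsWith("ERROR")),
           (l: String) => l.split('\t')(2))

  // Materialized partitions; removing an entry simulates losing a node.
  val cache = scala.collection.mutable.Map[Int, Seq[String]]()

  // Serve from the cache, or recompute the partition from its lineage.
  def partition(i: Int): Seq[String] =
    cache.getOrElseUpdate(i, messages.compute(i))
}
```

Only the lost partition is recomputed; the other partitions in the cache are left untouched.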

Background: Apache Hive
Data warehouse solution developed at Facebook
SQL-like language called HiveQL to query structured data stored in HDFS
Queries compile to Hadoop MapReduce jobs

Hive Architecture

Hive Principles
SQL provides a familiar interface for users
Extensible types, functions, and storage formats
Horizontally scalable with high performance on large datasets

Hive Applications
Reporting
Ad hoc analysis
ETL for machine learning
…

Hive Downsides
Not interactive
  – Hadoop startup latency is ~20 seconds, even for small jobs
No query locality
  – If queries operate on the same subset of data, they still run from scratch
  – Reading data from disk is often the bottleneck
Requires a separate machine learning dataflow

Shark Motivation
The working set of data can often fit in memory and be reused between queries
Provide low latency on small queries
Integrate distributed UDFs into SQL

Introducing Shark
Shark = Spark + Hive
Run HiveQL queries through Spark with Hive UDF, UDAF, SerDe
Utilize Spark’s in-memory RDD caching and flexible language capabilities

Shark in the AMP Stack
[Diagram of stack components: Private Cluster, Amazon EC2, …; Mesos; Spark, Hadoop, MPI, …; Bagel (Pregel on Spark), Shark, Streaming Spark, Debug Tools, …]

Shark
~2500 lines of Scala/Java code
Implements relational operators using RDD transformations
Scalable, fault-tolerant, fast
Compatible with Hive
  – Run HiveQL queries on existing HDFS data using Hive metadata, without modifications

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

Spark:
  lines = spark.textFile("hdfs://...")
  errors = lines.filter(_.startsWith("ERROR"))
  messages = errors.map(_.split('\t')(1))
  messages.cache()
  messages.filter(_.contains("foo")).count
  messages.filter(_.contains("bar")).count

Shark:
  CREATE TABLE log(header string, message string)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION "hdfs://...";
  CREATE TABLE errors_cached AS
    SELECT message FROM log WHERE header == "ERROR";
  SELECT count(*) FROM errors_cached WHERE message LIKE "%foo%";
  SELECT count(*) FROM errors_cached WHERE message LIKE "%bar%";
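For readers without a cluster at hand, the same pattern can be tried locally. The following is a rough plain-Scala analogue, not Spark or Shark code: the object name `LogMining`, the sample log lines, and the `countContaining` helper are assumptions for illustration. The `lazy val` plays the role of the cached working set: it is built once, on first use, and then reused by every later query.

```scala
// Local plain-Scala analogue of the log mining example (hypothetical data).
object LogMining {
  // Stand-in for the HDFS log: "<header>\t<message>" per line.
  val lines: Seq[String] = Seq(
    "ERROR\tfoo failed",
    "INFO\tall good",
    "ERROR\tbar timeout",
    "ERROR\tfoo and bar"
  )

  // Parsed once and kept in memory, loosely analogous to messages.cache()
  // or to the errors_cached table.
  lazy val messages: Vector[String] =
    lines.filter(_.startsWith("ERROR")).map(_.split('\t')(1)).toVector

  // Analogue of: SELECT count(*) ... WHERE message LIKE "%pattern%"
  def countContaining(pattern: String): Int =
    messages.count(_.contains(pattern))
}
```

With this sample data, `LogMining.countContaining("foo")` and `LogMining.countContaining("bar")` each scan only the three in-memory error messages, not the original log.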

Shark Architecture
Reuse as much Hive code as possible
  – Convert the logical query plan generated by Hive into a Spark execution graph
Fully support Hive UDFs, UDAFs, storage formats, and SerDes to ensure compatibility
Rely on Spark’s fast execution, fault tolerance, and in-memory RDDs

Shark Architecture

Preliminary Benchmarks
Brown/Stonebraker benchmark, 70 GB
  – Also used on the Hive mailing list (https://issues.apache.org/jira/browse/HIVE-396)
10 Amazon EC2 High Memory nodes (30 GB of RAM per node)
Naively cache input tables
Compare Shark to Hive

Benchmark: Query 1
30 GB input table
  SELECT * FROM grep WHERE field LIKE '%XYZ%';

Benchmark: Query 2
5 GB input table
  SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;

Benchmark: Query 3
30 GB input table
  SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;
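An earlier slide states that Shark implements relational operators using RDD transformations. As a rough illustration of that mapping, and not Shark's actual operator code, Queries 2 and 3 can be written against ordinary Scala collections: WHERE becomes `filter`, the SELECT list becomes `map`, and GROUP BY with SUM becomes a group-then-reduce. The case classes, the object name `RelationalOps`, and the sample rows are hypothetical.

```scala
// Sketch: benchmark queries expressed as collection transformations.
// Row types and sample data are assumptions, not the benchmark's real schema.
object RelationalOps {
  final case class Ranking(pageURL: String, pagerank: Int)
  final case class Visit(sourceIP: String, adRevenue: Double)

  val rankings: Seq[Ranking] = Seq(
    Ranking("a.example", 5),
    Ranking("b.example", 42),
    Ranking("c.example", 11)
  )

  val visits: Seq[Visit] = Seq(
    Visit("10.0.0.1", 1.5),
    Visit("10.0.0.2", 1.0),
    Visit("10.0.0.1", 2.5)
  )

  // SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;
  def query2(rs: Seq[Ranking]): Seq[(Int, String)] =
    rs.filter(_.pagerank > 10)              // WHERE clause
      .map(r => (r.pagerank, r.pageURL))    // projection

  // SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;
  def query3(vs: Seq[Visit]): Map[String, Double] =
    vs.groupBy(_.sourceIP)                  // GROUP BY key
      .map { case (ip, group) => ip -> group.map(_.adRevenue).sum } // SUM per key
}
```

On a cluster the grouping step implies a shuffle of rows by key, which is why the aggregation query behaves differently from the simple selection queries in a benchmark.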

Current Status
Most of HiveQL fully implemented in Shark
User-selected caching with CTAS (CREATE TABLE AS SELECT)
Adding optimizations such as map-side join
Performing alpha testing of Shark on a Conviva cluster

Future Work
Automatic caching based on query analysis
  – Multi-query optimization
Distributed UDFs using Shark + Spark
  – Allow users to implement sophisticated algorithms as UDFs in Spark
  – Shark operators and Spark UDFs take/emit RDDs
  – Query processing UDFs are streamlined