Shark: SQL and Rich Analytics at Scale

Presentation transcript:

Shark: SQL and Rich Analytics at Scale
Presented by Kirti Dighe and Drushti Gawade

What is Shark?
A new data analysis system built on top of Spark and its RDD abstraction.
Compatible with Apache Hive data, metastores, and queries (HiveQL, UDFs, etc.).
Speedups of up to 100x over Hive.
Supports low-latency, interactive queries through in-memory computation.
Supports both SQL and complex analytics such as machine learning.

Shark Architecture
Shark can be used to query an existing Hive warehouse without modification, and it returns results much faster.
[Diagram of architecture]

Spark
Supports general computation.
Provides an in-memory storage abstraction, the RDD.
Engine is optimized for low latency.
Features of Shark
Supports partial DAG execution.
Optimizes join algorithms.

RDD
Spark's main abstraction is the RDD (Resilient Distributed Dataset).
An RDD is a collection stored in an external storage system, or a derived dataset.
It can contain arbitrary data types.
Benefits of RDDs: returns data at the speed of DRAM; use of lineage; speedy recovery; immutability, which is a foundation for relational processing.
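The role of lineage in recovery can be illustrated with a toy, single-machine sketch. This is plain Python, not the Spark API: the `ToyRDD` class and all of its method names are hypothetical and exist only for illustration. A derived dataset remembers how each partition is computed from its parent, so a lost in-memory partition can be rebuilt on its own rather than restored from a replica.

```python
class ToyRDD:
    """A toy stand-in for an RDD (hypothetical; not the Spark API)."""

    def __init__(self, num_partitions, compute, parent=None):
        self.num_partitions = num_partitions
        self._compute = compute   # function: partition index -> list of records
        self.parent = parent      # lineage: the dataset this one was derived from
        self._cache = {}          # partition index -> materialized records

    @classmethod
    def from_source(cls, partitions):
        # Base dataset: stands in for data held in an external storage system.
        return cls(len(partitions), lambda i: list(partitions[i]))

    def map(self, fn):
        # Derived dataset: stores only *how* to compute each partition.
        return ToyRDD(self.num_partitions,
                      lambda i: [fn(x) for x in self.partition(i)],
                      parent=self)

    def partition(self, i):
        # Serve from memory if cached; otherwise recompute from lineage.
        if i not in self._cache:
            self._cache[i] = self._compute(i)
        return self._cache[i]

    def lose_partition(self, i):
        # Simulate a worker failure that drops one in-memory partition.
        self._cache.pop(i, None)

    def collect(self):
        return [x for i in range(self.num_partitions) for x in self.partition(i)]


source = ToyRDD.from_source([[1, 2], [3, 4], [5, 6]])
squares = source.map(lambda x: x * x)
before = squares.collect()

squares.lose_partition(1)   # only partition 1 is lost
after = squares.collect()   # only partition 1 is recomputed
```

Because each partition is recomputed independently, lost partitions could in principle be rebuilt in parallel on different workers; this sketch runs them sequentially.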

Fault tolerance guarantees
Shark can tolerate the loss of any set of worker nodes.
Recovery is parallelized across the cluster.
The deterministic nature of RDDs also enables straggler mitigation.
Recovery works even in queries that combine SQL and machine learning UDFs.

Executing SQL over RDDs
Executing a SQL query involves three steps: query parsing, logical plan generation, and physical plan generation.

Engine extensions
Partial DAG execution (PDE): static query optimization; dynamic query optimization; modification of statistics.
Examples of statistics: partition sizes and record counts; lists of “heavy hitters”; approximate histograms.
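To make these statistics concrete, here is a minimal sketch of how they might be gathered for a single partition. The function name `partition_stats`, its parameters, and the sample records are all hypothetical; the byte size is a crude proxy, and the fixed-width bucket counts stand in for an approximate histogram. None of this is taken from Shark's implementation.

```python
from collections import Counter


def partition_stats(records, key, heavy_hitter_count=2, bucket_width=10):
    """Summarize one partition of dict records on a numeric `key` column."""
    values = [r[key] for r in records]
    counts = Counter(values)
    # Coarse histogram: count of values falling into each fixed-width bucket.
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "record_count": len(records),
        # Crude size proxy: length of each record's text representation.
        "approx_size_bytes": sum(len(repr(r)) for r in records),
        # The most frequent values, most frequent first.
        "heavy_hitters": [v for v, _ in counts.most_common(heavy_hitter_count)],
        "histogram": dict(sorted(histogram.items())),
    }


records = [{"k": 3}, {"k": 3}, {"k": 12}, {"k": 3}, {"k": 25}, {"k": 12}]
stats = partition_stats(records, "k")
```

A runtime optimizer could merge such per-partition summaries before deciding how to run the next stage of a query.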

Join Optimization
Skew handling and degree of parallelism.
Task scheduling overhead.
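One way per-partition size statistics could be used for skew handling is to coalesce many fine-grained partitions into a smaller number of reduce tasks with similar total size. The sketch below uses a greedy largest-first heuristic. This is an illustrative assumption rather than a statement of Shark's exact algorithm, and `assign_partitions` and the sample sizes are hypothetical.

```python
import heapq


def assign_partitions(partition_sizes, num_reducers):
    """Greedily assign partitions (largest first) to the least-loaded reducer."""
    # Heap of (current load, reducer id); heappop returns the smallest load.
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    order = sorted(range(len(partition_sizes)),
                   key=lambda p: partition_sizes[p], reverse=True)
    for p in order:
        load, r = heapq.heappop(heap)
        assignment[r].append(p)
        heapq.heappush(heap, (load + partition_sizes[p], r))
    loads = {r: sum(partition_sizes[p] for p in ps)
             for r, ps in assignment.items()}
    return assignment, loads


sizes = [90, 10, 10, 10, 30, 30, 20]   # partition 0 is heavily skewed
assignment, loads = assign_partitions(sizes, num_reducers=2)
```

With these sizes, the skewed partition shares a reducer with only one small partition while the other reducer takes the rest, so the two loads come out equal.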

Columnar Memory Store
Simply caching records as JVM objects is insufficient.
Shark employs column-oriented storage; a partition of columns is one MapReduce “record”.
Benefits: compact representation, CPU-efficient compression, cache locality.
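The benefits of a columnar layout can be sketched in a few lines. The example below is plain Python rather than Shark's JVM implementation: it packs numeric columns into typed arrays and run-length encodes a low-cardinality string column. The schema (`id`, `revenue`, `country`) and the helper names are hypothetical.

```python
from array import array


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def run_length_decode(runs):
    """Expand (value, run length) pairs back into the original sequence."""
    return [v for v, n in runs for _ in range(n)]


def to_columns(rows):
    """Convert (id, revenue, country) row tuples into one object per column."""
    return {
        "id": array("q", (r[0] for r in rows)),        # packed 64-bit integers
        "revenue": array("d", (r[1] for r in rows)),   # packed 64-bit floats
        "country": run_length_encode([r[2] for r in rows]),
    }


rows = [(1, 2.5, "US"), (2, 1.0, "US"), (3, 4.0, "US"), (4, 0.5, "DE")]
columns = to_columns(rows)
total_revenue = sum(columns["revenue"])   # scans one contiguous column
```

Each column is a single object instead of one object per field per row, and an aggregate over `revenue` touches only that column's contiguous memory.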

Machine learning support
Shark treats machine learning as a first-class citizen.
The programming model is designed to express machine learning algorithms:
1. Language integration
Shark allows queries to perform logistic regression over a user database.
Example: a data analysis pipeline that performs logistic regression over a user database.

2. Execution engine integration
A common abstraction allows machine learning computations and SQL queries to share workers and cached data.
Enables end-to-end fault tolerance.
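The language-integration idea, where rows returned by a SQL query feed directly into an iterative algorithm in the same program, can be sketched on a single machine. The example uses Python's built-in `sqlite3` module as a stand-in for the warehouse and a hand-written gradient descent loop. It does not use Shark's API, and the `users` table, its columns, and the sample rows are hypothetical.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (feature REAL, label INTEGER)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(-2.0, 0), (-1.5, 0), (-1.0, 0), (1.0, 1), (1.5, 1), (2.0, 1)])

# Select the data of interest with SQL.
rows = conn.execute("SELECT feature, label FROM users").fetchall()

# Extract features: a constant bias term plus the single feature.
points = [((1.0, x), y) for x, y in rows]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(points, iterations=200, learning_rate=0.5):
    """Batch gradient descent on the logistic loss."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        grad = [0.0, 0.0]
        for features, label in points:
            error = sigmoid(sum(wi * fi for wi, fi in zip(w, features))) - label
            for j, fj in enumerate(features):
                grad[j] += error * fj
        w = [wi - learning_rate * g / len(points) for wi, g in zip(w, grad)]
    return w


def predict(weights, x):
    return sigmoid(weights[0] + weights[1] * x) > 0.5


weights = train(points)
```

In this sketch the query result is copied into a Python list. The slide's point about engine integration is that a shared abstraction lets SQL queries and learning code share workers and cached data.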

Implementation
How to improve query processing speed: minimize tail latency and the CPU cost of processing each row.
Memory-based shuffle.
Temporary object creation.
Bytecode compilation of expression evaluation.

Experiments
Shark was evaluated using four datasets:
Pavlo et al. benchmark: 2.1 TB of data reproducing Pavlo et al.'s comparison of MapReduce vs. analytical DBMSs [25].
TPC-H dataset: 100 GB and 1 TB datasets generated by the DBGEN program [29].
Real Hive warehouse: 1.7 TB of sampled Hive warehouse data from an early industrial user of Shark.
Machine learning dataset: 100 GB synthetic dataset to measure the performance of machine learning algorithms.
Shark performs up to 100x faster than Hive.

Methodology and cluster setup
Amazon EC2 with 100 m2.4xlarge nodes, each with 8 virtual cores, 68 GB of memory, and 1.6 TB of local storage.
Pavlo et al. benchmarks: 1 GB/node rankings table; 20 GB/node uservisits table.
Selection query (clustered index):
SELECT pageURL, pageRank FROM rankings WHERE pageRank > X;

Aggregation Queries
SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP;
SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7);
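To see how the two aggregation queries differ, they can be run on a tiny made-up table. The sketch uses Python's built-in `sqlite3` module as a stand-in engine, not Shark or Hive, and the four sample rows are hypothetical. The first query produces one group per full address; the second groups by the first seven characters of the address and so produces fewer groups.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE uservisits (sourceIP TEXT, adRevenue REAL)")
conn.executemany("INSERT INTO uservisits VALUES (?, ?)", [
    ("10.0.0.1", 1.0),
    ("10.0.0.1", 2.0),
    ("10.0.0.2", 4.0),
    ("10.0.1.7", 8.0),
])

# One group per distinct source address.
by_ip = dict(conn.execute(
    "SELECT sourceIP, SUM(adRevenue) FROM uservisits GROUP BY sourceIP"))

# One group per seven-character address prefix.
by_prefix = dict(conn.execute(
    "SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) "
    "FROM uservisits GROUP BY SUBSTR(sourceIP, 1, 7)"))
```

The number of groups matters for a distributed engine because it determines how much data the aggregation has to shuffle and hold per reducer.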

Join Query
SELECT INTO Temp sourceIP, AVG(pageRank), SUM(adRevenue) as totalRevenue FROM rankings AS R, uservisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
[Figures: join query runtime from the Pavlo benchmark; join strategies chosen by optimizers]

Data Loading
Shark can query data in HDFS directly, which means its data ingress rate is at least as fast as Hadoop's.

Micro-Benchmarks
Aggregation performance:
SELECT [GROUP_BY_COLUMN], COUNT(*) FROM lineitem GROUP BY [GROUP_BY_COLUMN]

Join selection at runtime

Fault tolerance
Measuring Shark's performance in the presence of node failures: simulate failures and measure query performance before, during, and after failure recovery.

Real Hive warehouse
1. Query 1 computes summary statistics in 12 dimensions for users of a specific customer on a specific day.
2. Query 2 counts the number of sessions and distinct customer/client combinations grouped by countries with filter predicates on eight columns.
3. Query 3 counts the number of sessions and distinct users for all but 2 countries.
4. Query 4 computes summary statistics in 7 dimensions grouping by a column, and showing the top groups sorted in descending order.

Machine learning algorithms
Compare the performance of Shark with running the same workflow in Hive and Hadoop.
The workflow consisted of three steps:
1) Selecting the data of interest from the warehouse using SQL
2) Extracting features
3) Applying iterative algorithms: logistic regression and k-means clustering

[Figures: logistic regression, per-iteration runtime (seconds); k-means clustering, per-iteration runtime]

Conclusion
Shark is a data warehouse combining relational queries and complex analytics.
It generalizes MapReduce using both traditional database techniques and novel partial DAG execution.
Shark is faster than Hive and Hadoop.