Spark SQL

What did you think of this paper?

This paper
- Appeared at the "Industry" track of SIGMOD; lightly reviewed.
- Use cases and impact matter more than new technical contributions.
- Light on experiments and light on details, especially on optimization.

Key benefits of Spark SQL
- Bridges the gap between procedural and relational, allowing analysts to mix both: not just fully one or fully the other, but intermingled.
- At the same time, it doesn't force a single style of intermingling: you can issue pure SQL, write fully procedural code, or combine them, as in the sketch below.
- Not better than Impala on raw performance, but that is not their contribution.
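A minimal Scala sketch of this intermingling, not taken from the paper; the file name, schema, and thresholds are hypothetical:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("mix").getOrCreate()
import spark.implicits._

// Relational: declarative filtering and aggregation over a DataFrame.
val users  = spark.read.json("events.json")                // hypothetical input
val byDept = users.filter($"age" < 30).groupBy($"dept").count()

// Procedural: drop down to the RDD API for arbitrary Scala logic...
val scored = byDept.rdd.map(row => (row.getString(0), row.getLong(1) * 2))

// ...and back to relational: register the result and query it with SQL.
scored.toDF("dept", "score").createOrReplaceTempView("scores")
spark.sql("SELECT dept FROM scores WHERE score > 100").show()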

Impala
- From Cloudera, since 2012. Open-source SQL on Hadoop clusters.
- Supports a Protocol-Buffers-like columnar format (Parquet).
- C++ based: less overhead than Java/Scala.
- Circumvents MapReduce by using a distributed query engine similar to a parallel RDBMS.

History lesson: earliest example of "bridging the gap"
What's the earliest example of "bridging the gap" between procedural and relational?

History lesson: earliest example of "bridging the gap"
What's the earliest example of bridging the gap between procedural and relational? UDFs (user-defined functions), which have been around since the early 90s.
The rage back then was object-relational databases: OOP was starting to pick up, and people wanted to represent and reason about objects in databases. Postgres was one of the first systems to support UDFs, which were used to call custom code in the middle of SQL.
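The same idea lives on in Spark SQL. A minimal sketch, assuming a hypothetical sales.parquet file with item and price columns:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("udf").getOrCreate()

// Register custom Scala code, then call it from the middle of a SQL
// query, much like a Postgres-era UDF.
spark.udf.register("discount", (price: Double) => price * 0.9)

spark.read.parquet("sales.parquet").createOrReplaceTempView("sales")
spark.sql("SELECT item, discount(price) AS discounted FROM sales").show()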

RDDs and Spark

The paper itself
A great model for a systems paper:
- Talk about something that is useful and used by many real users.
- Argue not just that your techniques are good, but also that your limitations are not fundamentally bad.
- Back it up with extensive experiments; awesome performance numbers always help.
Won the Best Paper award at NSDI '12.

Memory vs. Disk (borrowed)
L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                             100 ns
Main memory reference                         100 ns
Compress 1K bytes with Zippy               10,000 ns
Send 2K bytes over 1 Gbps network          20,000 ns
Read 1 MB sequentially from memory        250,000 ns
Round trip within same datacenter         500,000 ns
Disk seek                              10,000,000 ns
Read 1 MB sequentially from network    10,000,000 ns
Read 1 MB sequentially from disk       30,000,000 ns
Send packet CA -> Netherlands -> CA   150,000,000 ns

Spark vs. Dremel
Similar to Dremel in that the focus is on interactive, ad-hoc tasks. Caveat: Dremel is primarily for aggregation and primarily read-only.
Both move away from the drawbacks of MapReduce, but in different ways:
- Dremel uses column-store ideas, plus disk.
- Spark uses memory (Java objects), avoids checkpointing, and adds persistence.

Spark Primitives vs. MapReduce

Disadvantages of MapReduce
1. Extremely rigid data flow: a single map (M) followed by a reduce (R). Other flows (joins, unions, splits, M-M-R-M chains) are constantly hacked in.
2. Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct. (See the Spark sketch below for contrast.)
3. Semantics are hidden inside the map and reduce functions, making programs difficult to maintain, extend, and optimize.
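For contrast, each of the hand-coded operations from point 2 is a built-in transformation on Spark RDDs. A sketch with made-up data, assuming a SparkContext sc:

// (userId, amount) and (userId, name) pair RDDs.
val orders = sc.parallelize(Seq((1, 10.0), (2, 99.0), (1, 5.0)))
val users  = sc.parallelize(Seq((1, "alice"), (2, "bob")))

val joined   = users.join(orders)                            // join
val filtered = orders.filter { case (_, amt) => amt > 20.0 } // filter
val names    = users.map(_._2)                               // projection
val totals   = orders.reduceByKey(_ + _)                     // aggregate
val sorted   = orders.sortByKey()                            // sorting
val buyers   = orders.keys.distinct()                        // distinct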

Not the first time! Similar proposals have been made to natively support relational operators on top of MapReduce. Pig: imperative style, like Spark. From Yahoo!

Another Example: Pig
visits      = load '/data/visits' as (user, url, time);
gVisits     = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);
urlInfo     = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;
topUrls     = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
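For comparison, roughly the same dataflow in Spark (Scala). This is a sketch, not from the paper: the paths and field layout are carried over from the Pig script, the inputs are assumed comma-separated, and a SparkContext sc is assumed:

// Count visits per URL: (url, count).
val visitCounts = sc.textFile("/data/visits")
  .map(_.split(","))
  .map(f => (f(1), 1))                        // fields: user, url, time
  .reduceByKey(_ + _)

// url -> (category, pRank).
val urlInfo = sc.textFile("/data/urlInfo")
  .map(_.split(","))
  .map(f => (f(0), (f(1), f(2).toDouble)))

// Join on url, regroup by category, keep the top 10 URLs per category.
val topUrls = visitCounts.join(urlInfo)
  .map { case (url, (count, (category, _))) => (category, (url, count)) }
  .groupByKey()
  .mapValues(_.toSeq.sortBy(-_._2).take(10))

topUrls.saveAsTextFile("/data/topUrls")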

Another Example: DryadLINQ
string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(x, separator));
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));
var ordered = counts.OrderByDescending(x => x.Count);  // order by the count in each Pair
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");
Execution plan graph: Get -> SelectMany -> GroupBy -> Select -> OrderBy -> Take

Not the first time! Similar proposals have been made to natively support relational operators on top of MapReduce. Unlike Spark, most of them cannot have datasets persist across queries.
- Pig: imperative style, like Spark. From Yahoo!
- DryadLINQ: imperative programming interface. From Microsoft.
- Hive: SQL-like. From Facebook.
- HadoopDB: SQL-like (a hybrid of MapReduce and databases). From Yale.

Spark: Control
Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?

Spark: Control
Spark leaves control of data, algorithms, and persistence to the user (see the sketch below). Is this a good idea?
- Good idea: the user may know which datasets need to be used and how.
- Bad idea: the system may be able to optimize and schedule computation across nodes better than the user can.
This is the standard declarative vs. imperative argument.
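A minimal sketch of that user-level control, assuming a SparkContext sc and a hypothetical HDFS path:

import org.apache.spark.storage.StorageLevel

// The user, not the system, decides what stays in memory and at what level.
val errors = sc.textFile("hdfs:///logs")
  .filter(_.contains("ERROR"))

errors.persist(StorageLevel.MEMORY_ONLY)      // explicit, user-chosen persistence
errors.count()                                // first action materializes the cache
errors.filter(_.contains("timeout")).count()  // later queries reuse the in-memory data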

What are other ways Spark can be optimized?

What are other ways Spark can be optimized? Moving from imperative toward declarative opens the door to relational query optimization:
- Reordering predicates (see the sketch below).
- Caching and fault tolerance only when needed.
- Careful scheduling.
- Careful partitioning, co-location, and persistence.
- Indexes.
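A sketch of why declarativeness matters for predicate reordering, assuming a SparkContext sc, a SparkSession spark, and a hypothetical events.csv with user and score columns:

import spark.implicits._

// Opaque: the predicate hides inside a Scala closure, so the system must
// run it exactly where the user placed it in the chain.
val viaRdd = sc.textFile("events.csv")
  .map(_.split(","))
  .map(f => (f(0), f(1).toInt))
  .filter(_._2 > 100)

// Declarative: the optimizer sees the predicate as data, so it can reorder
// it, push it below the projection, or hand it to the data source.
val viaDf = spark.read.option("header", "true").csv("events.csv")
  .select($"user", $"score")
  .filter($"score".cast("int") > 100)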

Shark
Two key ideas: a column store, and mid-query re-planning (plus other tweaks), bringing the power of relational databases to Spark. While not as much of a landmark paper by itself, it represents the evolution in thinking from imperative to declarative.

Recall…
Mid-query re-planning is not new, given the prior work on adaptive query processing. Traditional database systems plan once, based on statistics:
- value distributions, via histograms
- data layout and locality
- sizes of source relations
- selectivities of predicates
- intermediate result sizes
These estimates can be notoriously bad! A famous example: when a predicate's selectivity is unknown, the optimizer falls back to a magic constant like 1/3, so a filter over a 9-million-row table is estimated at 3 million rows regardless of the actual data.

Ways in which adaptivity can be used
- Mid-way re-optimization if observed statistics differ significantly from estimates mid-plan.
- Using statistics from previous plans to optimize the current plan.
- Starting multiple plans at the same time and converging on one.
- Routing tuples to operators randomly.
- Adaptive sharing of common expressions.
- Picking plans with the least "expected cost".

Adaptive QP
Still very much an unsolved problem… no single technique is known to be best. For more details, see the survey by Deshpande, Ives, and Raman.