Spark SQL.

What did you think of this paper?

This paper
Appeared in the “Industry” track of SIGMOD.
  Lightly reviewed.
  Use cases and impact are more important than new technical contributions.
Light on experiments.
Light on details, especially on optimization.

Key Benefits of SparkSQL
Bridging the gap between procedural and relational processing.
  Allows analysts to mix both: not fully A or fully B, but intermingled.
  At the same time, it doesn’t force one single format of intermingling.
    Can issue queries fully in SQL.
    Can issue fully procedural code.
Not better than Impala, but outperforming Impala is not the authors’ contribution.

Impala
From Cloudera, since 2012.
SQL on Hadoop clusters.
Open source.
Supports a Protocol Buffers-like format (Parquet).
C++ based: less overhead than Java/Scala.
May circumvent MR by using a distributed query engine similar to a parallel RDBMS.

History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
UDFs (user-defined functions).
  Been there since the early 90s.
  The rage back then: object-relational databases.
    OOP was starting to pick up.
    Representing and reasoning about objects in databases.
  Postgres was one of the first to use them.
  Used to call custom code in the middle of SQL.
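To make "custom code in the middle of SQL" concrete, here is a minimal sketch of a UDF using Python's standard-library sqlite3 module rather than Postgres or Spark SQL. The table, the sample rows, and the helper url_domain are hypothetical and exist only for illustration.

```python
import sqlite3


def url_domain(url):
    """Procedural helper (toy logic): return the host part of a URL."""
    return url.split("//", 1)[-1].split("/", 1)[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_name TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?)",
    [
        ("ann", "http://a.example/x"),
        ("bob", "http://a.example/y"),
        ("cat", "http://b.example/z"),
    ],
)

# Register the Python function so that SQL can call it by name.
conn.create_function("url_domain", 1, url_domain)

# Relational query with procedural code embedded in it.
rows = conn.execute(
    "SELECT url_domain(url) AS domain, COUNT(*) AS n "
    "FROM visits GROUP BY domain ORDER BY domain"
).fetchall()
print(rows)
```

The query stays declarative; only the per-row logic that SQL cannot express is pushed into the registered function.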

RDDs and Spark

The paper itself
Great model for a systems paper.
  Talk about something that is useful and used by many real users.
  Argue not just that your techniques are good, but also that your limitations are not fundamentally bad.
  Extensive experiments to back it up. Awesome performance numbers always help.
Won the best paper award at NSDI’12.

Memory vs. Disk (borrowed)
Operation                                   Latency (ns)
L1 cache reference                                   0.5
Branch mispredict                                      5
L2 cache reference                                     7
Mutex lock/unlock                                    100
Main memory reference                                100
Compress 1K bytes with Zippy                      10,000
Send 2K bytes over 1 Gbps network                 20,000
Read 1 MB sequentially from memory               250,000
Round trip within same datacenter                500,000
Disk seek                                     10,000,000
Read 1 MB sequentially from network           10,000,000
Read 1 MB sequentially from disk              30,000,000
Send packet CA->Netherlands->CA              150,000,000
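A quick back-of-the-envelope calculation with the numbers above shows why keeping data in memory matters. This is only a sketch: the 1000 MB dataset size is an assumption, and the model ignores parallelism and caching effects.

```python
# Latencies taken from the table above, in nanoseconds.
NS_PER_MB_MEMORY = 250_000
NS_PER_MB_DISK = 30_000_000
DISK_SEEK_NS = 10_000_000


def scan_seconds(megabytes, ns_per_mb, seek_ns=0):
    """Time to read `megabytes` sequentially, plus an optional initial seek."""
    return (seek_ns + megabytes * ns_per_mb) / 1e9


DATASET_MB = 1000  # assumed dataset size for the example

memory_s = scan_seconds(DATASET_MB, NS_PER_MB_MEMORY)
disk_s = scan_seconds(DATASET_MB, NS_PER_MB_DISK, seek_ns=DISK_SEEK_NS)
per_mb_ratio = NS_PER_MB_DISK / NS_PER_MB_MEMORY

print(f"memory: {memory_s:.2f} s, disk: {disk_s:.2f} s, ratio per MB: {per_mb_ratio:.0f}x")
```

Under these numbers a sequential scan from disk is about 120 times slower per megabyte than a scan from memory, which is the gap that in-memory datasets exploit when the same data is read repeatedly.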

Spark vs. Dremel
Similar to Dremel in that:
  the focus is on interactive ad-hoc tasks
    Caveat: Dremel is primarily aggregation and primarily read-only.
  both move away from the drawbacks of MR (but in different ways)
    Dremel uses column-store ideas + disk.
    Spark uses memory (Java objects) + avoiding checkpointing + persistence.

Spark Primitives vs. MapReduce

Disadvantages of MapReduce
1. Extremely rigid data flow: M -> R
   Other flows constantly hacked in: chains (M -> M -> R -> M), join, union, split.
2. Common operations must be coded by hand:
   join, filter, projection, aggregates, sorting, distinct.
3. Semantics hidden inside map-reduce functions:
   difficult to maintain, extend, and optimize.
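To illustrate points 2 and 3, the sketch below hand-codes an equi-join in MapReduce style (a reduce-side join with tagged values). It is a toy single-process simulation, not Hadoop code; run_job, the mappers, and the sample data are all hypothetical.

```python
from collections import defaultdict


def map_visits(record):
    user, url = record
    yield url, ("visit", user)        # tag each value with its source relation


def map_url_info(record):
    url, category = record
    yield url, ("info", category)


def reduce_join(url, tagged_values):
    """The join logic is buried here, invisible to the framework."""
    users = [v for tag, v in tagged_values if tag == "visit"]
    categories = [v for tag, v in tagged_values if tag == "info"]
    for user in users:
        for category in categories:
            yield user, url, category


def run_job(inputs, reducer):
    """Simulate map -> shuffle (group by key) -> reduce in one process."""
    groups = defaultdict(list)
    for records, mapper in inputs:
        for record in records:
            for key, value in mapper(record):
                groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


visits = [("ann", "a.com"), ("bob", "b.com"), ("cat", "a.com")]
url_info = [("a.com", "news"), ("b.com", "sports")]
joined = run_job([(visits, map_visits), (url_info, map_url_info)], reduce_join)
print(joined)
```

In SQL the same result is a one-line join. Here the framework sees only opaque map and reduce functions, so it cannot recognize the join or pick a better strategy for it.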

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
  PIG: imperative style, like Spark. From Yahoo!

Another Example: PIG
    -- count visits per URL
    visits = load '/data/visits' as (user, url, time);
    gVisits = group visits by url;
    visitCounts = foreach gVisits generate url, count(visits);
    -- attach category information to each URL
    urlInfo = load '/data/urlInfo' as (url, category, pRank);
    visitCounts = join visitCounts by url, urlInfo by url;
    -- keep the top 10 URLs per category
    gCategories = group visitCounts by category;
    topUrls = foreach gCategories generate top(visitCounts,10);
    store topUrls into '/data/topUrls';

Another Example: DryadLINQ
    string uri = ...;  // value elided in the source
    PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);
    string separator = ",";
    var words = input.SelectMany(x => SplitLineRecord(separator));
    var groups = words.GroupBy(x => x);
    var counts = groups.Select(x => new Pair(x.Key, x.Count()));
    var ordered = counts.OrderByDescending(x => x[2]);
    var top = ordered.Take(k);
    top.ToDryadPartitionedTable("matching.pt");
(Figure: execution plan graph with stages Get, SM, G, S, O, Take, matching the operators in the code.)

Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
Unlike Spark, most of them cannot have datasets persist across queries.
  PIG: imperative style, like Spark. From Yahoo!
  DryadLINQ: imperative programming interface. From Microsoft.
  HIVE: SQL-like. From Facebook.
  HadoopDB: SQL-like (hybrid of MR + databases). From Yale.

Spark: Control
Spark leaves control of data, algorithms, and persistence to the user. Is this a good idea?
  Good idea: the user may know which datasets need to be reused, and how.
  Bad idea: the system may be able to optimize and schedule computation across nodes on its own.
Standard argument of declarative vs. imperative.
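What "persistence is left to the user" means in practice can be sketched with a toy lazy dataset. This is not Spark's API or implementation: LazyDataset and its methods are hypothetical stand-ins that only mimic the idea of recomputing from lineage unless the user asks for the result to be kept.

```python
class LazyDataset:
    """Toy stand-in for an RDD: knows how to recompute itself from its parent."""

    def __init__(self, compute):
        self._compute = compute      # zero-argument function returning a list
        self._persist = False
        self._cached = None
        self.computations = 0        # how many times the data was recomputed

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, pred):
        return LazyDataset(lambda: [x for x in self.collect() if pred(x)])

    def persist(self):
        """User-issued hint: keep the materialized result after first use."""
        self._persist = True
        return self

    def collect(self):
        if self._persist and self._cached is not None:
            return self._cached
        self.computations += 1
        result = self._compute()
        if self._persist:
            self._cached = result
        return result


base = LazyDataset.from_list(range(10))

# Without persist(): each downstream query recomputes `squares`.
squares = base.map(lambda x: x * x)
squares.filter(lambda x: x % 2 == 0).collect()
squares.filter(lambda x: x > 10).collect()
uncached_runs = squares.computations

# With persist(): the user decides this dataset is worth keeping.
cached = base.map(lambda x: x * x).persist()
evens = cached.filter(lambda x: x % 2 == 0).collect()
large = cached.filter(lambda x: x > 10).collect()
cached_runs = cached.computations

print(uncached_runs, cached_runs, evens, large)
```

The system here never decides on its own what to keep. That is the imperative side of the trade-off: the user's knowledge of reuse is exploited, but a forgotten persist() silently costs a recomputation.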

What are other ways Spark can be optimized?
More declarative than imperative.
Relational query optimization.
  Reordering predicates.
Caching and fault tolerance only when needed.
Careful scheduling.
Careful partitioning, co-location, and persistence.
Indexes.
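As one example of what a relational optimizer can do once the semantics are visible, the sketch below reorders the predicates of a conjunctive filter. It assumes each predicate's selectivity and per-row cost are known and independent, and uses a common rank heuristic, (selectivity - 1) / cost, lowest first. The predicates, costs, and data are hypothetical.

```python
def order_predicates(predicates):
    """predicates: list of (name, fn, selectivity, cost). Lowest rank runs first."""
    return sorted(predicates, key=lambda p: (p[2] - 1.0) / p[3])


def run_filter(rows, predicates):
    """Evaluate a short-circuiting AND; return the kept rows and the total cost paid."""
    kept, total_cost = [], 0.0
    for row in rows:
        for _name, fn, _selectivity, cost in predicates:
            total_cost += cost
            if not fn(row):
                break
        else:
            kept.append(row)
    return kept, total_cost


rows = list(range(1000))
as_written = [
    ("expensive_check", lambda x: x % 2 == 0, 0.5, 10.0),  # passes 50%, cost 10
    ("cheap_check", lambda x: x < 100, 0.1, 1.0),          # passes 10%, cost 1
]

naive_rows, naive_cost = run_filter(rows, as_written)
tuned_rows, tuned_cost = run_filter(rows, order_predicates(as_written))

print(naive_cost, tuned_cost, naive_rows == tuned_rows)
```

Both orders return the same rows, but running the cheap, selective predicate first means the expensive one is evaluated on far fewer rows. If the predicates were hidden inside an opaque user function, the system could not make this swap.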

Shark
Two key ideas:
  Column store
  Mid-query re-planning
+ other tweaks.
Bringing the power of relational databases to Shark.
While this is not as much of a landmark paper by itself, it represents the evolution in thinking from imperative to declarative.

Recall…
Mid-query re-planning is not new, given the work on adaptive query processing.
Traditional database systems plan once, based on statistics:
  distributions via histograms
  data layout & locality
  sizes of source relations
  selectivities of predicates
  intermediate sizes
These can be notoriously bad! Famous example: an unknown selectivity being estimated as 1/3.
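A small worked example shows how far a fixed default can be from reality. The sketch assumes the 1/3 default is applied to a range predicate with no statistics available; the table, its distribution, and the predicate are made up for illustration.

```python
DEFAULT_SELECTIVITY = 1 / 3  # the fixed guess used when nothing is known


def estimated_rows(table_rows, selectivity=DEFAULT_SELECTIVITY):
    """Planner-style cardinality estimate: rows in times assumed selectivity."""
    return table_rows * selectivity


# Hypothetical table: 9000 ages spread uniformly over 20..69.
ages = [20 + (i % 50) for i in range(9000)]

# Predicate: age > 67 (matches only ages 68 and 69).
actual = sum(1 for a in ages if a > 67)
estimate = estimated_rows(len(ages))
error_factor = estimate / actual

print(f"estimated {estimate:.0f} rows, actual {actual}, off by {error_factor:.1f}x")
```

An error of this size on one predicate compounds through every later join estimate, which is why plans chosen once up front can turn out badly.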

Ways in which it can be used
Mid-way reoptimization if statistics differ significantly mid-plan.
Use statistics from previous plans to optimize the current plan.
Starting multiple plans at the same time, then converging on one.
Routing tuples to operators randomly.
Adaptive sharing of common expressions.
Picking plans with the least “expected cost”.
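The first item, mid-way reoptimization, can be sketched as follows: plan with an estimate, run the first stage, then re-plan the next stage with the measured size. This is an illustrative toy, not how any particular system implements it; the broadcast threshold, function names, and data are assumptions.

```python
BROADCAST_LIMIT = 1000  # hypothetical threshold, in rows


def choose_join(build_side_rows):
    """Pick a join strategy from the (estimated or measured) size of one input."""
    return "broadcast" if build_side_rows <= BROADCAST_LIMIT else "shuffle"


def run_with_replanning(rows, predicate, estimated_selectivity):
    # Plan made up front, from the estimate alone.
    initial_plan = choose_join(len(rows) * estimated_selectivity)

    # Stage 1 runs either way; its output size is now a measured statistic.
    filtered = [r for r in rows if predicate(r)]

    # Re-plan the remaining join using the observed size, not the guess.
    final_plan = choose_join(len(filtered))
    return initial_plan, final_plan, len(filtered)


rows = list(range(9000))
initial_plan, final_plan, observed = run_with_replanning(
    rows, predicate=lambda x: x % 50 >= 48, estimated_selectivity=1 / 3
)
print(initial_plan, final_plan, observed)
```

With the 1/3 guess the planner expects 3000 rows and would pick a shuffle join. After seeing that only 360 rows survive the filter, it can switch to the cheaper broadcast join for the rest of the query.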

Adaptive QP
Still very much an unsolved problem…
No one technique is known to be best.
For more details: survey by Deshpande, Ives, Raman.
