Top 10 SQL-on-Hadoop Pitfalls Monte Zweben CEO, Splice Machine.

Presentation transcript:

Top 10 SQL-on-Hadoop Pitfalls Monte Zweben CEO, Splice Machine

SQL-on-Hadoop Landscape
A crowded, confusing landscape, full of potential and pitfalls…

Pitfall #1 Individual Record Lookups

Pitfall #1: Individual Lookups and Range Queries
Issues:
• Support real-time apps?
  - Concurrent readers/writers?
  - e.g., TPC-C A.5:
Impact:
• Data delays
• Inability to support:
  - many users
  - many sources
  - real-time apps
Solution:
• Avoid analytic systems designed for full table scans
• Choose systems designed for range queries
• Verification: do queries complete in <30ms?
Examples: Affected / Unaffected
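The "<30ms" verification can be scripted. The following is a minimal sketch, assuming Python's built-in sqlite3 as a stand-in engine and a hypothetical `customer` table; the `timed_query` helper is illustrative and not part of any product named in the deck. In a real evaluation the same harness would point at the candidate SQL-on-Hadoop system through its own driver.

```python
import sqlite3
import time


def timed_query(conn, sql, params=()):
    """Run a query and return (rows, elapsed time in milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return rows, elapsed_ms


# Hypothetical table; SQLite is only a stand-in for the system under test.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (c_id INTEGER PRIMARY KEY, c_last TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    ((i, f"name{i}") for i in range(100_000)),
)
conn.commit()

# Individual record lookup by primary key.
point_rows, point_ms = timed_query(
    conn, "SELECT c_last FROM customer WHERE c_id = ?", (4242,)
)

# Short range query over the same key.
range_rows, range_ms = timed_query(
    conn,
    "SELECT c_id FROM customer WHERE c_id BETWEEN ? AND ? ORDER BY c_id",
    (100, 109),
)

print(f"point lookup: {point_ms:.3f} ms, range query: {range_ms:.3f} ms")
```

Timings depend on the machine and the engine, so the useful comparison is against the 30ms budget on the target cluster under concurrent load, not the numbers this sketch prints locally.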

Pitfall #2 Dirty Data

Pitfall #2: Fixing Bad Data
Issues:
• Fix bad data without reloading all data:
    UPDATE address SET state='CA' WHERE state='California'
Impact:
• Bad reports
• Delays to reload entire data set
Solution:
• Ability to update and delete
• Transactions for multi-row updates
• Verification: update one record in <30ms?
Examples: Affected / Unaffected
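The in-place fix from the slide can be tried end to end. This is a sketch only, assuming sqlite3 as a stand-in for a system with update support; the sample rows and the `normalize_state` helper are hypothetical. The `with conn:` block wraps the multi-row update in one transaction, so it commits on success and rolls back if an error is raised.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE address (id INTEGER PRIMARY KEY, state TEXT)")
conn.executemany(
    "INSERT INTO address (state) VALUES (?)",
    [("California",), ("CA",), ("California",), ("NY",)],
)
conn.commit()


def normalize_state(conn, bad, good):
    """Rewrite one bad spelling in place; returns the number of rows fixed."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE address SET state = ? WHERE state = ?", (good, bad)
        )
        return cur.rowcount


fixed = normalize_state(conn, "California", "CA")
remaining = conn.execute(
    "SELECT COUNT(*) FROM address WHERE state = 'California'"
).fetchone()[0]
print(f"fixed {fixed} rows, {remaining} bad rows remaining")
```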

Pitfall #3 Sharding

Pitfall #3: Sharding
Issues:
• Right key to distribute data
• Right shard size
Impact:
• Slow queries, esp. for large joins or aggregations
Solution:
• Requires design time decisions and/or tuning
• Small shards = more parallelization
• Fewer shards = less memory and compaction overhead
Examples: Affected / Unaffected
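The shard-count trade-off is easier to see with numbers. Below is a rough sketch, assuming hash-based distribution on a hypothetical `customer-N` key; `shard_for` and `rows_per_shard` are illustrative names, and real systems choose shards by their own partitioning rules.

```python
import hashlib
from collections import Counter


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash (md5 used only for spreading)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def rows_per_shard(keys, num_shards):
    """Return the row count held by each shard, indexed by shard number."""
    counts = Counter(shard_for(k, num_shards) for k in keys)
    return [counts.get(s, 0) for s in range(num_shards)]


keys = [f"customer-{i}" for i in range(10_000)]  # hypothetical distribution key

few = rows_per_shard(keys, 4)    # fewer, larger shards: less per-shard overhead
many = rows_per_shard(keys, 64)  # more, smaller shards: more parallel work units

print("4 shards :", few)
print("64 shards: largest shard holds", max(many), "rows")
```

More shards shrink the largest unit of work, which helps parallelism, while each extra shard adds its own memory and compaction cost, which is the tension the slide describes.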

Pitfall #4 Hotspotting

Pitfall #4: Hotspotting
Issues:
• Data too concentrated in a few nodes
  - Especially for time series data
  - e.g., order data
    … PRIMARY KEY (order_id, order_line)…
    INSERT INTO order_lines (order_id, order_line, create_date, SKU, qty, …) VALUES (value1, value2, …);
Impact:
• Slow queries
• Poor parallelization
Solution:
• Key design
• Salt keys, if range queries not important
Examples: Affected / Unaffected
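Key salting can be illustrated with a toy model. This sketch assumes range partitioning decided by the first two characters of the row key, which is a simplification and not how any specific product splits regions; `NUM_BUCKETS`, `salted_key`, and `region_of` are hypothetical names.

```python
import hashlib
from collections import Counter

NUM_BUCKETS = 8  # hypothetical number of salt buckets


def salted_key(key: str, buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable, hash-derived salt."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    salt = digest[0] % buckets
    return f"{salt:02d}|{key}"


def region_of(row_key: str) -> str:
    """Toy range partitioning: the region is chosen by the key's first two characters."""
    return row_key[:2]


# Sequential order IDs, as produced by a steady stream of inserts.
order_ids = [f"{n:010d}" for n in range(1_000_000, 1_001_000)]

plain = Counter(region_of(k) for k in order_ids)
salted = Counter(region_of(salted_key(k)) for k in order_ids)

print("regions hit without salt:", len(plain))
print("regions hit with salt   :", len(salted))
```

Adjacent order IDs now land in different buckets, so a scan over a contiguous range has to visit every bucket. That is why the slide limits salting to cases where range queries are not important.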

Pitfall #5 SQL Coverage
    SELECT orders.id AS orders_id,
           (SELECT COUNT(order_items.id)
            FROM order_items
            WHERE order_items.order_id = orders.id) AS orders_total_number_of_items
    FROM thelook.orders AS orders
    ORDER BY orders_total_number_of_items
    LIMIT 500
Splice Machine | Proprietary & Confidential

Pitfall #5: SQL Coverage
Issues:
• Limited SQL dialects
  - Why can't I do full join syntax like ON or OR?
  - Can I do a correlated sub-query?
    SELECT employee_number, name
    FROM employees AS Person
    WHERE salary > (
        SELECT AVG(salary)
        FROM employees
        WHERE department = Person.department);
Impact:
• Can’t run queries to meet business needs
Solution:
• Do your homework
• Compile list of toughest queries and test
Examples: Affected / Unaffected
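"Compile list of toughest queries and test" lends itself to a small harness. The sketch below assumes sqlite3 as a stand-in engine and made-up employee rows; `supported` is a hypothetical helper that only reports whether the engine accepts a query. The correlated sub-query is the one from the slide, with an ORDER BY added so the output is deterministic.

```python
import sqlite3


def supported(conn, sql):
    """Return True if the engine parses and runs the query, False otherwise."""
    try:
        conn.execute(sql).fetchall()
        return True
    except sqlite3.Error:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_number INTEGER PRIMARY KEY, name TEXT, department TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (1, "Ann", "eng", 120.0),
        (2, "Bob", "eng", 100.0),
        (3, "Cy", "ops", 80.0),
        (4, "Di", "ops", 90.0),
        (5, "Ed", "ops", 70.0),
    ],
)
conn.commit()

CORRELATED = """
SELECT employee_number, name
FROM employees AS Person
WHERE salary > (SELECT AVG(salary)
                FROM employees
                WHERE department = Person.department)
ORDER BY employee_number
"""

# The "toughest queries" list; the second entry is deliberately malformed.
candidate_queries = {
    "correlated_subquery": CORRELATED,
    "deliberately_invalid": "SELECT FROM WHERE",
}

coverage = {name: supported(conn, sql) for name, sql in candidate_queries.items()}
above_average = conn.execute(CORRELATED).fetchall()

print(coverage)
print(above_average)
```

Acceptance is only the first check. Comparing results against a trusted reference database catches dialects that accept a query but evaluate it differently.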

Pitfall #6 Concurrency

Pitfall #6: Concurrency
Issues:
• Power real-time apps
• Support 1000s of readers/writers at once
• Commit records across tables atomically
Impact:
• Inability to:
  - Power real-time apps
  - Handle many users
  - Support many input sources
  - Deliver reports as updates happen
Solution:
• Lockless, high-concurrency transactions
• Verification: TPC-C benchmark
Examples: Affected / Unaffected

Pitfall #7 Columnar

Pitfall #7: Columnar
Issues:
• OLAP queries slower
  - Especially large joins or aggregations
  - e.g., TPC-H Q6
Impact:
• Poor productivity
  - Queries may take minutes or hours instead of seconds
Solution:
• Columnar, but not all created equal:
  1. Columnar storage/compression (4x advantage)
  2. Columnar projection (10x)
  3. Columnar engine (e.g., vector processing, pipelining) (50x)
Examples: Affected / Unaffected
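Columnar projection, the second item in the list above, can be shown with a toy aggregate. This sketch assumes a made-up four-column table and counts "values touched" as a crude proxy for I/O; it models the idea only, not any engine's storage format, and the filter-plus-sum merely resembles the shape of a query like TPC-H Q6.

```python
# Row layout: each record is stored together as (orderkey, quantity, price, comment).
rows = [(i, i % 50, float(i % 7), "comment text") for i in range(1000)]

# Column layout: each attribute is stored as its own array.
columns = {
    "orderkey": [r[0] for r in rows],
    "quantity": [r[1] for r in rows],
    "price": [r[2] for r in rows],
    "comment": [r[3] for r in rows],
}


def sum_price_row_store(rows, max_qty):
    """SUM(price) WHERE quantity < max_qty, reading whole rows."""
    total, touched = 0.0, 0
    for row in rows:
        touched += len(row)  # every column of the row is read
        if row[1] < max_qty:
            total += row[2]
    return total, touched


def sum_price_column_store(cols, max_qty):
    """Same aggregate, reading only the two columns the query needs."""
    qty, price = cols["quantity"], cols["price"]
    touched = len(qty) + len(price)
    total = sum(p for q, p in zip(qty, price) if q < max_qty)
    return total, touched


row_total, row_touched = sum_price_row_store(rows, 24)
col_total, col_touched = sum_price_column_store(columns, 24)

print("row store   :", row_total, "values touched:", row_touched)
print("column store:", col_total, "values touched:", col_touched)
```

With wider tables the gap grows, since the row layout reads every column while the column layout reads only the ones the query names.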

Pitfall #8 Node Sizing

Pitfall #8: Node Sizing
Issues:
• Big boxes vs. pizza boxes?
• SSDs?
• Memory size?
Impact:
• Price/performance
• Performance bottlenecks
Solution:
• Default: stick to pizza boxes with moderate memory ( GB) and no SSDs unless a good reason
• Do your homework: profile your workload
  - CPU bound? Memory? I/O? Network?
Examples:
• Affected: every distributed system

Pitfall #9 Brittle ETL on Hadoop

Pitfall #9: Brittle ETL on Hadoop
Issues:
• Restart ETL pipeline because of errors
• Serialize ETL pipeline
Impact:
• Miss ETL window
• Delay reports to business users
Solution:
• Incremental updates with transactions to address data quality
• Rollback failed ETL steps without restarting entire process
Examples: Affected / Unaffected
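Rolling back a single failed ETL step can be sketched with savepoints. The example assumes sqlite3 as a stand-in for a transactional SQL engine, a hypothetical `staging` table, and an illustrative `run_step` helper; whether a given SQL-on-Hadoop product offers savepoints or an equivalent is something to verify per product.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so that
# BEGIN, SAVEPOINT, and COMMIT are issued explicitly below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")


def run_step(conn, name, rows):
    """Load one batch inside a savepoint; undo only this batch on failure."""
    conn.execute(f"SAVEPOINT {name}")
    try:
        conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    except sqlite3.Error:
        conn.execute(f"ROLLBACK TO {name}")  # discard this step's partial work
        conn.execute(f"RELEASE {name}")
        return False
    conn.execute(f"RELEASE {name}")
    return True


conn.execute("BEGIN")
ok1 = run_step(conn, "step1", [(1, 10.0), (2, 20.0)])
ok2 = run_step(conn, "step2", [(3, 30.0), (4, None)])      # violates NOT NULL
ok3 = run_step(conn, "step2_retry", [(3, 30.0), (4, 0.0)])  # corrected batch
conn.execute("COMMIT")

loaded = conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]
print(ok1, ok2, ok3, loaded)
```

The first step's rows survive the failure of the second, so only the broken batch is repeated rather than the whole pipeline.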

Pitfall #10 Cost-Based Optimizer

Pitfall #10: Cost-Based Optimizer
Issues:
• Choose right join strategy
• Choose right index
• Choose right ordering
Impact:
• Poor performance = poor productivity
• Manual tuning by DBAs
Solution:
• Cost-based optimizer
• However, maturity for OLAP queries may be an issue for newer products like Hive, Impala, and Drill
Examples: Affected / Unaffected
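The "choose right ordering" decision can be sketched with a toy cost model. Everything here is an assumption for illustration: the table names, cardinalities, and selectivities are invented, cost is simply the sum of estimated intermediate row counts, and only left-deep plans are enumerated. Production optimizers use far richer statistics and search strategies.

```python
from itertools import permutations

# Hypothetical table cardinalities.
CARD = {"orders": 1_000_000, "customers": 50_000, "nations": 25}

# Hypothetical join selectivities; a missing pair means no join predicate
# (a cross product, selectivity 1.0).
SEL = {
    frozenset({"orders", "customers"}): 1 / 50_000,
    frozenset({"customers", "nations"}): 1 / 25,
}


def plan_cost(order):
    """Cost of a left-deep join order: sum of estimated intermediate row counts."""
    joined = {order[0]}
    rows = CARD[order[0]]
    cost = 0.0
    for table in order[1:]:
        sel = 1.0
        for other in joined:
            sel *= SEL.get(frozenset({other, table}), 1.0)
        rows = rows * CARD[table] * sel  # estimated size of the new intermediate
        cost += rows
        joined.add(table)
    return cost


best = min(permutations(CARD), key=plan_cost)
worst = max(permutations(CARD), key=plan_cost)

print("best :", best, plan_cost(best))
print("worst:", worst, plan_cost(worst))
```

In this model the cheapest plans join the two small tables first and bring in `orders` last, while the most expensive ones start with a cross product between `orders` and `nations`. Without a cost-based optimizer, a DBA has to find that ordering by hand.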

Top 10 SQL-on-Hadoop Pitfalls
1. Individual Lookups and Range Queries
2. Fixing Bad Data
3. Sharding
4. Hotspotting
5. SQL Coverage
6. Concurrency
7. Columnar
8. Node Sizing
9. Brittle ETL on Hadoop
10. Cost-Based Optimizer

Summary
• SQL-on-Hadoop Benefits
  - Excellent price/performance
  - SQL access
  - Incredible pace of product advancements
• SQL-on-Hadoop Pitfalls
  - Crowded, confusing market
  - Relatively immature products
  - Difficult to determine product differences
• Recommendations
  - Review Top 10 pitfalls
  - Truly understand your workloads
  - Do POCs to verify

Questions? Monte Zweben