BI and SQL Analytics with Hadoop in the Cloud

Presentation transcript:

BI and SQL Analytics with Hadoop in the Cloud
Alex Gutow // Product Marketing, Analytic Database
Henry Robinson // Impala Technical Lead

Agenda
    Cloud drivers and benefits
    BI and analytic DBs in the cloud
    Details on Apache Impala (incubating) on cloud
    Impala on cloud best practices
    Impala vs Redshift

What’s Driving Analytics to the Cloud?
Big data deployments in cloud are accelerating:
    Executive Mandate: Minimize on-prem datacenter footprint
    Increased Agility: End-user self-service
    Elasticity: Optimize infrastructure usage
    Lower Overall TCO
Speaker notes: As public cloud growth surges, we’re seeing broad interest across our customer base. It spans verticals (retail, public sector, manufacturing, travel) and use cases, with lots of interest in customer 360, internet of things, application delivery, and BI & analytics.

What does the cloud provide?
Faster and easier deployments:
    No hardware procurement delays
    Infra + service deployments via software automation
Cost-efficiency:
    Offload data center and data center operations
        Public cloud vendors manage the data center and infrastructure
    Pay-for-use
        Workloads that don’t need 24/7 hardware can pay just for their duration
        Note: non-transient data needs to live outside the cluster (e.g. S3)
    Elastically size-to-demand
        Incrementally add/remove hardware as you need it
        Note: requires an elastically scalable service
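The pay-for-use point can be made concrete with a little arithmetic. The sketch below compares an always-on cluster with a transient one that is billed only while a job runs. The node count, hourly price, and job length are hypothetical placeholders chosen for illustration; they are not figures from this talk.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def monthly_cost(nodes: int, price_per_node_hour: float, hours_running: float) -> float:
    """Compute-only cost for a cluster billed per node-hour."""
    return nodes * price_per_node_hour * hours_running


# Hypothetical workload: a 10-node cluster that runs a 2-hour ETL job each night.
NODES = 10
PRICE = 0.50            # assumed $/node-hour, illustration only
JOB_HOURS_PER_DAY = 2
DAYS = 30

always_on = monthly_cost(NODES, PRICE, HOURS_PER_MONTH)
transient = monthly_cost(NODES, PRICE, JOB_HOURS_PER_DAY * DAYS)

print(f"always-on cluster: ${always_on:,.2f} per month")
print(f"transient cluster: ${transient:,.2f} per month")
print(f"transient as a share of always-on: {transient / always_on:.1%}")
```

The comparison covers compute only. As the slide notes, the transient option works only if non-transient data lives outside the cluster (e.g. in S3), and that storage is billed separately.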

Two Analytic Database Patterns
Reduce Operating Costs
    Only pay for what you need, when you need it
    Transient clusters
    Elastic workload
    Object storage centric
    Cloud-native deployment
New Insights, New Revenue
    Explore and analyze all data, wherever it lives
    Long-running clusters
    Sized to demand
    HDFS or object storage
    Lift-and-shift deployment
ETL, BI/Analytics

Patterns of Cloud-Native Applications
Flexibility, Self-Service Models, and New Cost Dynamics
    Decoupled Storage and Compute for Elastic Scale (diagram labels: Object Store, COMPUTE)
    Embrace Transience for Lower Costs (diagram labels: 1hr, SPIN UP, SPIN DOWN)
    Compartmentalize for Greater Isolation (diagram label: Object Store)

Traditional Monolithic Analytic Databases
    Rigid Data Model with Tightly Coupled Storage/Compute
    Static Sizing
    Limited to SQL with Data Movement Necessary
    No Cloud Elasticity or Cloud Storage Integration
(diagram labels: COMPUTE, STORE)

Apache Impala (incubating): Cloud-Native Capabilities
Data Flexibility
    Faster, more agile data acquisition
    Data portability: Open formats and open storage
Elastic Scalability
    Elastic scale on-prem or in the cloud
    Cloud-native pay-per-use and transience
    Proven at big data scale
Go Beyond SQL
    Open Architecture: Open formats and open storage
    Shared data across SQL and non-SQL workloads
Hybrid
    Runs across multi-cloud & on-prem
    Multi-storage over S3, HDFS, Kudu, Isilon, DSSD, etc.
(diagram label: Shared Data)
Speaker notes:
Cloud Elasticity
    Ability to grow/shrink cluster sizes (native architecture)
    Elastic compute scale (via direct query from S3)
    Transient workloads (via direct query from S3)
Data Agility
    Faster, more agile data acquisition
    Data portability: open formats and open storage (even more so with S3)
Scalability
    Proven through hundreds of nodes
    Proven with high concurrency
Hybrid cloud
    Runs across clouds and on-prem
    Runs across S3, HDFS, Kudu, Isilon, DSSD, etc.

BI Patterns Enabled
On-Demand Elasticity
    Grow/shrink for peak and off-peak needs
    Flexibility for unplanned usage changes
Data Portability
    Directly query from S3 without rigid ETL and data modeling
    Flexibly share data across object store-based workloads and tenants
Pay-per-Use Economics
    Cost-efficient transience for periodic batch jobs like ETL
Seamless Query Across Filesystems
    Query data across HDFS (EBS), S3, Kudu, etc.
    Same capabilities on-prem and in the cloud
(diagram labels: Object Store, COMPUTE, 1h)

How Apache Impala (incubating) delivers

How Impala delivers
    Impala decouples compute from storage
    Compute layer is stateless and can be expanded elastically
    Best-of-breed execution engine translates performance gains directly to cloud hardware

From disks to the cloud
    Impala uses the open-standard HDFS interface as an abstraction over many different storage systems (not all: see Kudu, HBase)
    Some performance optimizations still rely on the idea that the IO subsystem is locally attached, and that locality is key
    No abstraction is perfect!

Scans without disks
    Impala maps read requests to individual disks to achieve high parallelism from locally-attached storage
        Each disk gets one queue
        Scheduler tries to balance load while keeping read requests local to a replica
    Cloud storage has no disks. So which queue should get the read requests?
    Solution: Add a separate queue to serve cloud storage requests. The scheduler treats all such reads as remote and of equal cost.
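The queueing idea can be sketched in a few lines. This is a minimal illustration, not Impala's actual IO manager: the `ReadRequest` type, the queue names, and the `route` helper are all hypothetical. Requests for locally attached data go to a per-disk queue, and anything without a local disk (such as S3 data) goes to one shared remote queue.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional

REMOTE_QUEUE = "remote"  # hypothetical name for the single cloud-storage queue


@dataclass(frozen=True)
class ReadRequest:
    path: str
    offset: int
    length: int
    local_disk_id: Optional[int]  # None when the data is not on a local disk (e.g. S3)


def queue_for(request: ReadRequest) -> str:
    """One queue per local disk; all cloud-storage reads share the remote queue."""
    if request.local_disk_id is None:
        return REMOTE_QUEUE
    return f"disk-{request.local_disk_id}"


def route(requests: Iterable[ReadRequest]) -> Dict[str, List[ReadRequest]]:
    """Group read requests by the queue that should serve them."""
    queues: Dict[str, List[ReadRequest]] = defaultdict(list)
    for request in requests:
        queues[queue_for(request)].append(request)
    return dict(queues)


requests = [
    ReadRequest("hdfs://nn/warehouse/t/part-0", 0, 1024, local_disk_id=0),
    ReadRequest("hdfs://nn/warehouse/t/part-1", 0, 1024, local_disk_id=1),
    ReadRequest("s3a://bucket/t/part-2", 0, 1024, local_disk_id=None),
    ReadRequest("s3a://bucket/t/part-3", 0, 1024, local_disk_id=None),
]
for name, queued in sorted(route(requests).items()):
    print(name, len(queued))
```

Because every remote read lands in the same queue, there is no replica locality for the scheduler to optimise, which matches the slide's point that cloud reads are treated as remote and of equal cost.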

Scans without disks – or blocks!
    HDFS file blocks are the unit of IO parallelism: many nodes can process many blocks in parallel
    S3 has no concept of block size
    Need to choose a block size that doesn’t lead to extra IO for Parquet
    Solution: synthesize the block metadata in Impala based on an expected block size
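Synthesizing block metadata amounts to cutting an object's byte range into fixed-size pieces that can be handed to different nodes. The sketch below shows that splitting under simple assumptions; `synthesize_blocks` is a hypothetical helper, not a function from Impala.

```python
from typing import List, Tuple


def synthesize_blocks(file_size: int, block_size: int) -> List[Tuple[int, int]]:
    """Split a file of file_size bytes into (offset, length) ranges.

    Every range is block_size bytes long except possibly the last one.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if file_size < 0:
        raise ValueError("file_size must be non-negative")

    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


MB = 1024 * 1024
# Hypothetical 300 MB object split with an assumed 128 MB synthetic block size.
for offset, length in synthesize_blocks(300 * MB, 128 * MB):
    print(f"offset={offset // MB} MB, length={length // MB} MB")
```

The Parquet concern on the slide comes from how the synthetic size lines up with the way files were written: a range that cuts through the middle of a Parquet row group can force a reader to fetch bytes outside its own range, so the expected block size is best chosen to match the writer's.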

Metadata updates
    Impala moves a lot of files around when INSERTing data
    Updating metadata in S3 can be very costly: changing the location of a file means copying the entire file, so INSERT performance suffers
    Solution: give the user the option of a reduced consistency guarantee in exchange for better performance, by writing files directly to their final location
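A toy model shows why the staging step is expensive on an object store. `ToyObjectStore`, the `_staging` prefix, and both insert functions below are hypothetical illustrations, not Impala or S3 APIs. The one property taken from the slide is that a rename is a full copy followed by a delete.

```python
from typing import Dict


class ToyObjectStore:
    """In-memory stand-in for an object store that has no native rename."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}
        self.bytes_written = 0  # total bytes the store had to write

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.bytes_written += len(data)

    def rename(self, src: str, dst: str) -> None:
        # Changing an object's location means copying it in full, then deleting the source.
        self.put(dst, self.objects[src])
        del self.objects[src]


def insert_with_staging(store: ToyObjectStore, table: str, files: Dict[str, bytes]) -> None:
    """Write to a staging prefix first, then move every file into place."""
    for name, data in files.items():
        store.put(f"{table}/_staging/{name}", data)
    for name in files:
        store.rename(f"{table}/_staging/{name}", f"{table}/{name}")


def insert_direct(store: ToyObjectStore, table: str, files: Dict[str, bytes]) -> None:
    """Write every file straight to its final location."""
    for name, data in files.items():
        store.put(f"{table}/{name}", data)


files = {"part-0.parq": b"x" * 1000, "part-1.parq": b"y" * 500}

staged, direct = ToyObjectStore(), ToyObjectStore()
insert_with_staging(staged, "sales", files)
insert_direct(direct, "sales", files)

print("bytes written with staging:", staged.bytes_written)
print("bytes written directly:    ", direct.bytes_written)
```

Both paths end with the same objects in place, but the staged path writes every byte twice. The price of the direct path is the reduced consistency guarantee the slide mentions: if the INSERT fails partway through, files already written are visible in the table's final location.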

Deployment best practices
Choosing between S3 and EBS

Impala: EBS vs S3, single user workload

Impala: EBS vs S3, multi-user workload

Impala: Scalability of compute on S3

Impala: EBS vs S3 - takeaways
    EBS holds the performance advantage, as expected
    S3 offers competitive performance / $
    Storage operational costs are higher with EBS:
        Access to HDFS must be mediated by an ‘always on’ cluster
        Scaling compute does not scale storage bandwidth
        Scaling storage bandwidth is costly in time (re-replication of block data)

Choosing between Impala on S3 or on EBS
    Periodic or transient workloads -> use S3
        Saves money to only pay during runtime
    Concurrent or 24/7 workloads -> use EBS
        Greater performance for continuously running workloads
    What if I have both? Use both together! (Impala queries seamlessly across both)
        EBS for your regularly accessed data
        S3 for the long tail, for more cost-effective storage
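The guidance above can be written down as a small decision helper. This is only a restatement of the slide's rules of thumb; the function name and its two boolean inputs are hypothetical simplifications, and a real decision would also weigh the benchmark results and costs discussed elsewhere in the talk.

```python
from enum import Enum


class Storage(Enum):
    S3 = "S3"
    EBS = "EBS (HDFS)"


def storage_for_dataset(cluster_always_on: bool, regularly_accessed: bool) -> Storage:
    """Pick a storage tier for one dataset, following the slide's rules of thumb."""
    if not cluster_always_on:
        # Periodic or transient workloads: data must outlive the cluster, so use S3.
        return Storage.S3
    # Concurrent or 24/7 workloads: keep hot data on EBS, the long tail on S3.
    return Storage.EBS if regularly_accessed else Storage.S3


for always_on in (False, True):
    for hot in (False, True):
        choice = storage_for_dataset(always_on, hot)
        print(f"always_on={always_on}, regularly_accessed={hot} -> {choice.value}")
```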

Instance type best practices
    Both storage types need some attached storage for temporary results
    S3 - bandwidth scales with cluster, so more, smaller instances are typically better
        Key is ensuring sufficient memory to efficiently execute the workload
    EBS - similar sizing exercise to on-premise
        Storage naturally less elastic, and local IO throughput still matters
    Our benchmarks (3TB TPC-DS) run on r3.2xlarge

Apache Impala vs Amazon Redshift

Benchmark setup
    Match per-cluster resources (CPU, memory) while choosing best instances for each system
    Benchmark workload: 3TB TPC-DS
        Impala -> standard schema with partition keys
        Redshift -> general schema and ‘fixed reporting’ schema with many optimizations
    Measure cost and performance of data loading and query workloads

Impala vs Redshift: Multi-user cost
ETL + Multi-user queries
    Exploratory BI can be expensive on Redshift
    Impala >200% cheaper than Redshift General Purpose
    Impala 8-28% cheaper than Redshift Fixed Reporting
Speaker notes:
ETL + Multi-user queries: This section covers loading data from S3, then running "compute stats", for Redshift and Impala on EBS. 4 concurrent streams were used for the run; higher concurrency causes Impala queries to fail (memory limit). Admission control was used for Impala. Each stream used different query parameters, and no query was repeated in the test.
Redshift un-optimized: No point running, since the single-user tests took too long.
Highlights:
    Impala on EBS is 40% cheaper than Redshift
    Impala on S3 is 24% cheaper than Redshift

Impala vs Redshift: Multi-user throughput
Multi-user queries
    Impala 4-10x faster than Redshift General Purpose
    Impala 42-90% faster than Redshift Fixed Reporting
    Exploratory BI can be slow on Redshift
Speaker notes:
ETL + Single user queries: This section covers loading data from S3, then running a single-user test. For Impala on S3, the ETL time is the time it takes to run "compute stats" on all tables. Queries were generated using the TPC-DS v2.1.0 kit; 70 of the 99 queries were used in all the tests. The queries selected are those fully supported by Impala that didn't require any modifications or workarounds; queries with variants were omitted because the Rollup workaround is very inefficient.
Redshift un-optimized: This schema is very close to what Impala's schema looks like. Redshift doesn't support partitioning, so columns that are usually partitioned were sorted. Small tables were replicated and large tables were evenly distributed across the 8 nodes in the cluster; no PK/FK or hash partitioning was used.
Redshift optimized: This schema resembles the best-case scenario for Redshift, utilizing hash partitioning, sorting, and PK/FK constraints. The Redshift planner makes assumptions based on the PK/FK relations without enforcing them.
Impala on EBS: Impala used gp2 mounted 2TB EBS volumes for HDFS as well as spill data.
Impala on S3: Impala used gp2 mounted 2TB EBS volumes for spill data, while all the remaining data was on S3. ETL cost for Impala on S3 is based on the time it takes to run "analyze table".
Detailed query results can be found here.
Highlights:
    Un-optimized Redshift schema is very slow
    Running "analyze table" against S3 is very slow
    Impala on EBS is 39% cheaper than Redshift

Impala vs Redshift – Key Takeaways
    Impala’s de-coupled architecture enables new cloud benefits:
        Direct query from S3
        Elastic scalability
        Data portability and flexibility
        Transient clusters
    Impala amongst best in class for performance and cost in AWS
    Querying data in-situ is critical to flexibility and agility, eliminating expensive ingest steps and leaving authoritative data in one place

Cloudera’s Analytic Database Solution
    Navigator Optimizer: Identify, offload, & optimize workloads to Hadoop
    Navigator: Audit, lineage, encryption, key management, & policy lifecycles
    Hue: Intelligent SQL editor
    BI Partners: Integration with the leading BI tools
    Impala: Interactive query engine for BI & SQL analytics
    Hive-on-Spark: Large-scale ETL & batch processing engine
    Multi-Storage, Multi-Environment
Speaker notes: We’ve discussed many capabilities. How do you get started? Every Hadoop deployment breaks down into these three core platform use cases. There is a linear progression from data integration (getting data ready) to data discovery/analytics (looking at data, building models), to real-time data applications (deploying models for business transformation).
Data Integration Use Cases:
    Data Warehouse Offload
    ETL Offload
    Mainframe Offload
    Active Archive
    Real-Time Data Pipelines
    Infrastructure Consolidation
Data Discovery & Analytics Use Cases:
    Self-Service Data Discovery
    Analytics Modernization
    Data Science / Advanced Analytics
Real-Time Data Applications Use Cases:
    Monitoring and Detection
    Real-Time Analytics
    Recommendation Engine
    Personalization

More Details
    New cloud scenarios with Impala on S3
        http://blog.cloudera.com/blog/2016/08/analytics-and-bi-on-amazon-s3-with-apache-impala-incubating/
    Impala on S3 vs EBS
        http://blog.cloudera.com/blog/2016/09/apache-impala-incubating-on-amazon-performance-and-cost-considerations-for-s3-vs-ebs/
    Impala compared to Redshift
        http://blog.cloudera.com/blog/2016/09/apache-impala-incubating-vs-amazon-redshift-s3-integration-elasticity-agility-and-cost-performance-benefits-on-aws/
    Try it out!
        http://www.cloudera.com/downloads

Thank You!

Cloud Deployment Patterns
Transient Workloads, Elastic Combination Workloads, Persistent Workloads
    Spin-up, run, and spin-down hardware
    Persistent data in S3
    Grow/shrink based on demand
    Use HDFS or S3