1
BI and SQL Analytics with Hadoop in the Cloud
Alex Gutow // Product Marketing, Analytic Database
Henry Robinson // Impala Technical Lead
2
Agenda
- Cloud drivers and benefits
- BI and analytic DBs in the cloud
- Details on Apache Impala (incubating) on cloud
- Impala on cloud best practices
- Impala vs Redshift
3
What’s Driving Analytics to the Cloud?
Big data deployments in cloud are accelerating:
- Executive Mandate: Minimize on-prem datacenter footprint
- Increased Agility: End-user self-service
- Elasticity: Optimize infrastructure usage
- Lower Overall TCO

Notes:
As public cloud growth surges, we’re seeing broad interest across our customer base. It spans verticals (retail, public sector, manufacturing, travel) and use cases, with lots of interest in customer 360, internet of things, application delivery, and BI & analytics.
4
What does the cloud provide?
Faster and easier deployments:
- No hardware procurement delays
- Infra + service deployments via software automation

Cost-efficiency:
- Offload data center and data center operations
- Public cloud vendors manage the data center and infrastructure

Pay-for-use:
- Workloads that don’t need 24/7 hardware can pay just for their duration
- Note: non-transient data needs to live outside the cluster (e.g. S3)

Elastically size-to-demand:
- Incrementally add/remove hardware as you need it
- Note: requires an elastically scalable service
5
Two Analytic Database Patterns
Reduce Operating Costs
- Only pay for what you need, when you need it
- Transient clusters
- Elastic workload
- Object storage centric
- Cloud-native deployment

New Insights, New Revenue
- Explore and analyze all data, wherever it lives
- Long-running clusters
- Sized to demand
- HDFS or object storage
- Lift-and-shift deployment

Workload labels on the slide: ETL, BI/Analytics
6
Patterns of Cloud-Native Applications Flexibility, Self-Service Models, and New Cost Dynamics
- Decoupled Storage and Compute for Elastic Scale (diagram labels: Object Store, COMPUTE)
- Embrace Transience for Lower Costs (diagram labels: SPIN UP, 1hr, SPIN DOWN)
- Compartmentalize for Greater Isolation (diagram label: Object Store)
7
Traditional Monolithic Analytic Databases
- Rigid Data Model with Tightly Coupled Storage/Compute
- Static Sizing
- Limited to SQL with Data Movement Necessary
- No Cloud Elasticity or Cloud Storage Integration
(diagram labels: COMPUTE, STORE)
8
Apache Impala (incubating): Cloud-Native Capabilities
Data Flexibility
- Faster, more agile data acquisition
- Data portability: Open formats and open storage

Elastic Scalability
- Elastic scale on-prem or in the cloud
- Cloud-native pay-per-use and transience
- Proven at big data scale

Go Beyond SQL
- Open Architecture: Open formats and open storage
- Shared data across SQL and non-SQL workloads

Hybrid
- Runs across multi-cloud & on-prem
- Multi-storage over S3, HDFS, Kudu, Isilon, DSSD, etc

Shared Data

Notes:
Update elasticity to match 4

Cloud Elasticity
- Ability to grow/shrink cluster sizes (native architecture)
- Elastic compute scale (via direct query from S3)
- Transient workloads (via direct query from S3)

Data Agility
- Faster, more agile data acquisition
- Data portability: open formats and open storage (even more so with S3)

Scalability
- Proven through hundreds of nodes
- Proven with high concurrency

Hybrid cloud
- Runs across clouds and on-prem
- Runs across S3, HDFS, Kudu, Isilon, DSSD, etc
9
BI Patterns Enabled

On-Demand Elasticity
- Grow/shrink for peak and off-peak needs
- Flexibility for unplanned usage changes

Data Portability
- Directly query from S3 without rigid ETL and data modeling
- Flexibly share data across object store-based workloads and tenants

Pay-per-Use Economics
- Cost-efficient transience for periodic batch jobs like ETL

Seamless Query Across Filesystems
- Query data across HDFS (EBS), S3, Kudu, etc
- Same capabilities on-prem and in the cloud
10
How Apache Impala (incubating) delivers
11
How Impala delivers
Impala decouples compute from storage:
- Compute layer is stateless and can be expanded elastically
- Best-of-breed execution engine translates performance gains directly to cloud hardware
12
From disks to the cloud
- Impala uses the open standard HDFS interface as an abstraction over many different storage systems (not all: see Kudu, HBase)
- Some performance optimizations still rely on the idea that the IO subsystem is locally attached, and that locality is key
- No abstraction is perfect!
13
Scans without disks
- Impala maps read requests to individual disks to achieve high parallelism from locally-attached storage
  - Each disk gets one queue
  - Scheduler tries to balance load while keeping read requests local to a replica
- Cloud storage has no disks. So which queue should get the read requests?
- Solution: Add a separate queue to serve cloud storage requests. Scheduler treats all reads as remote, and equal cost
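The queue assignment described on this slide can be illustrated with a minimal Python sketch. It is not Impala's actual IO manager code: the `ReadRequest` type, the `REMOTE_QUEUE` name, and the use of `local_disk_id=None` to mark cloud storage reads are all assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ReadRequest:
    path: str
    offset: int
    length: int
    # Hypothetical field: None means the data is not on a locally attached disk.
    local_disk_id: Optional[int]


# Hypothetical name for the single queue that serves all cloud storage reads.
REMOTE_QUEUE = "remote"


def queue_for(request: ReadRequest):
    """One queue per local disk; every non-local read shares one queue."""
    if request.local_disk_id is None:
        return REMOTE_QUEUE
    return ("disk", request.local_disk_id)


def enqueue_all(requests):
    """Group read requests by the queue that should serve them."""
    queues = defaultdict(list)
    for req in requests:
        queues[queue_for(req)].append(req)
    return dict(queues)


requests = [
    ReadRequest("hdfs://nn/warehouse/t/part-0.parq", 0, 1024, local_disk_id=0),
    ReadRequest("hdfs://nn/warehouse/t/part-1.parq", 0, 1024, local_disk_id=1),
    ReadRequest("s3a://bucket/t/part-2.parq", 0, 1024, local_disk_id=None),
    ReadRequest("s3a://bucket/t/part-3.parq", 0, 1024, local_disk_id=None),
]
queues = enqueue_all(requests)
for name, reqs in queues.items():
    print(name, len(reqs))
```

In this sketch the two S3 reads land in the same queue regardless of which node issues them, which mirrors the slide's point that cloud reads are treated as remote and of equal cost.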
14
Scans without disks – or blocks!
- HDFS file blocks are the unit of IO parallelism: many nodes can process many blocks in parallel
- S3 has no concept of block size
- Need to choose a block size that doesn’t lead to extra IO for Parquet
- Solution: synthesize the metadata in Impala based on expected block size
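As a rough illustration of synthesizing block metadata, the Python sketch below splits a file of known size into fixed-size pseudo-blocks and counts how many Parquet row groups would straddle a block boundary. The 128 MiB and 64 MiB block sizes, the equal-sized contiguous row groups, and both function names are assumptions for the example, not values or APIs taken from Impala.

```python
MiB = 1024 * 1024


def synthesize_blocks(file_size: int, block_size: int):
    """Return (offset, length) pairs covering the file in block_size chunks."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def row_groups_spanning_blocks(file_size: int, row_group_size: int, block_size: int) -> int:
    """Count row groups that cross a synthetic block boundary.

    Assumes row groups are laid out back to back from offset 0, which is a
    simplification of a real Parquet file.
    """
    spanning = 0
    start = 0
    while start < file_size:
        end = min(start + row_group_size, file_size)  # exclusive end offset
        if start // block_size != (end - 1) // block_size:
            spanning += 1
        start = end
    return spanning


blocks = synthesize_blocks(300 * MiB, 128 * MiB)
print(len(blocks), "synthetic blocks")

# Blocks aligned with 128 MiB row groups versus blocks half that size.
aligned = row_groups_spanning_blocks(300 * MiB, 128 * MiB, 128 * MiB)
misaligned = row_groups_spanning_blocks(300 * MiB, 128 * MiB, 64 * MiB)
print(aligned, misaligned)
```

A row group that spans two synthetic blocks may be read by a node that owns only part of it, which is the kind of extra IO the slide warns about when the block size is chosen badly.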
15
Metadata updates
- Impala moves a lot of files around when INSERTing data
- Updating metadata in S3 can be very costly: changing the location of a file means copying the entire file, so INSERT performance suffers
- Solution: give the user the flexibility of a reduced consistency guarantee but better performance, by writing files to their final location directly
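The cost difference between staging files and writing them directly can be seen in a toy model. The sketch below uses an in-memory stand-in for an object store in which a move is a full copy followed by a delete; the `ToyObjectStore` class, the `_staging` prefix, and the file names are hypothetical and do not reflect Impala's real implementation.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store with no rename operation."""

    def __init__(self):
        self.objects = {}
        self.bytes_written = 0

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data
        self.bytes_written += len(data)

    def move(self, src: str, dst: str) -> None:
        # No rename primitive: copy the whole object, then delete the source.
        self.put(dst, self.objects[src])
        del self.objects[src]


def insert_with_staging(store, table_prefix, files):
    """Write to a staging prefix first, then move each file into place."""
    for name, data in files.items():
        store.put(f"{table_prefix}/_staging/{name}", data)
    for name in files:
        store.move(f"{table_prefix}/_staging/{name}", f"{table_prefix}/{name}")


def insert_direct(store, table_prefix, files):
    """Write each file straight to its final location."""
    for name, data in files.items():
        store.put(f"{table_prefix}/{name}", data)


files = {"part-0": b"x" * 1000, "part-1": b"y" * 500}

staged = ToyObjectStore()
insert_with_staging(staged, "t", files)

direct = ToyObjectStore()
insert_direct(direct, "t", files)

print(staged.bytes_written, direct.bytes_written)
```

Both paths end with the same objects in place, but the staged path writes every byte twice. The price of the direct path is the one the slide names: files become visible in the final location one at a time, so a concurrent reader may see a partially completed INSERT.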
16
Deployment best practices
Choosing between S3 and EBS
17
Impala: EBS vs S3, single user workload
18
Impala: EBS vs S3, multi-user workload
19
Impala: Scalability of compute on S3
20
Impala: EBS vs S3 - takeaways
- EBS holds the performance advantage, as expected
- S3 offers competitive performance / $
- Storage operational costs are higher with EBS:
  - Access to HDFS must be mediated by an ‘always on’ cluster
  - Scaling compute does not scale storage bandwidth
  - Scaling storage bandwidth is costly in time (re-replication of block data)
21
Choosing between Impala on S3 or on EBS
- Periodic or transient workloads -> use S3
  - Saves money to only pay during runtime
- Concurrent or 24/7 workloads -> use EBS
  - Greater performance for continuously running workloads
- What if I have both? Use both together! (Impala queries seamlessly across both)
  - EBS for your regularly accessed data
  - S3 for the long tail, for more cost-effective storage
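To make the "only pay during runtime" point concrete, here is a small arithmetic sketch in Python. The node count, the hourly rate, and the two-hours-per-day schedule are made-up numbers for illustration, not AWS prices or figures from the benchmark, and the model ignores storage charges.

```python
HOURS_PER_MONTH = 730  # approximate average number of hours in a month


def monthly_compute_cost(nodes: int, hourly_rate: float, hours_running: float) -> float:
    """Compute-only cost: nodes x price per node-hour x hours the cluster is up."""
    return nodes * hourly_rate * hours_running


# Hypothetical 10-node cluster at a hypothetical $0.50 per node-hour.
always_on = monthly_compute_cost(nodes=10, hourly_rate=0.50, hours_running=HOURS_PER_MONTH)

# Same cluster, spun up for a 2-hour batch job on each of 30 days.
transient = monthly_compute_cost(nodes=10, hourly_rate=0.50, hours_running=2 * 30)

print(f"always-on: ${always_on:.2f}, transient: ${transient:.2f}")
```

Under these assumed numbers the transient cluster costs a small fraction of the always-on one, which only works if the data lives somewhere that outlasts the cluster, such as S3.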
22
Instance type best practices
- Both storage types need some attached storage for temporary results
- S3: bandwidth scales with cluster, so more smaller instances are typically better
  - Key is ensuring sufficient memory to efficiently execute the workload
- EBS: similar sizing exercise to on-premise
  - Storage naturally less elastic, and local IO throughput still matters
- Our benchmarks (3TB TPC-DS) run on r3.2xlarge
23
Apache Impala vs Amazon Redshift
24
Benchmark setup
- Match per-cluster resources (CPU, memory) while choosing best instances for each system
- Benchmark workload: 3TB TPC-DS
  - Impala -> standard schema with partition keys
  - Redshift -> general schema and ‘fixed reporting’ schema with many optimizations
- Measure cost and performance of data loading and query workloads
26
Impala vs Redshift: Multi-user cost
ETL + Multi-user queries
- Exploratory BI can be expensive on Redshift
- Impala >200% cheaper than Redshift General Purpose
- Impala 8-28% cheaper than Redshift Fixed Reporting

Notes:
ETL + Multi-user queries: This section covers loading data from S3, then running "compute stats" for Redshift and Impala on EBS. 4 concurrent streams were used for the run; higher concurrency causes Impala queries to fail (memory limit). Admission control was used for Impala. Each stream used different query parameters, and no query was repeated in the test.
Redshift un-optimized: No point running, since single-user tests took too long.
Highlights:
- Impala on EBS is 40% cheaper than Redshift
- Impala on S3 is 24% cheaper than Redshift
27
Impala vs Redshift: Multi-user throughput
Multi-user queries
- Impala 4-10x faster than Redshift General Purpose
- Impala 42-90% faster than Redshift Fixed Reporting
- Exploratory BI can be slow on Redshift

Notes:
ETL + Single user queries: This section covers loading data from S3, then running a single-user test. For Impala on S3, the ETL time is the time it takes to run "compute stats" on all tables. Queries were generated using the TPC-DS v2.1.0 kit; 70 out of the 99 queries were used in all the tests. The queries selected are the ones fully supported by Impala that didn't require any modifications or workarounds. Queries with variants were omitted, as the Rollup workaround is very inefficient.
Redshift un-optimized: This schema is very close to what Impala's schema looks like. Redshift doesn't support partitioning, so columns that are usually partitioned were sorted. Small tables were replicated and large tables were evenly distributed across the 8 nodes in the cluster; no PK/FK or hash partitioning was used.
Redshift optimized: This schema resembles the best-case scenario for Redshift, utilizing hash partitioning, sorting, and PK/FK constraints. The Redshift planner makes assumptions based on the PK/FK relations without enforcing them.
Impala on EBS: Impala used gp2 mounted 2TB EBS volumes for HDFS as well as spill data.
Impala on S3: Impala used gp2 mounted 2TB EBS volumes for spill data, while all the remaining data was on S3. ETL cost for Impala on S3 is based on the time it takes to run "analyze table".
Detailed query results can be found here.
Highlights:
- Un-optimized Redshift schema is very slow
- Running "analyze table" against S3 is very slow
- Impala on EBS is 39% cheaper than Redshift
28
Impala vs Redshift – Key Takeaways
- Impala’s de-coupled architecture enables new cloud benefits:
  - Direct query from S3
  - Elastic scalability
  - Data portability and flexibility
  - Transient clusters
- Impala amongst best in class for performance and cost in AWS
- Querying data in-situ is critical to flexibility and agility, eliminating expensive ingest steps and leaving authoritative data in one place
29
Cloudera’s Analytic Database Solution
- Navigator Optimizer: Identify, offload, & optimize workloads to Hadoop
- Navigator: Audit, lineage, encryption, key management, & policy lifecycles
- Hue: Intelligent SQL editor
- BI Partners: Integration with the leading BI tools
- Impala: Interactive query engine for BI & SQL analytics
- Hive-on-Spark: Large-scale ETL & batch processing engine
- Multi-Storage, Multi-Environment

Notes:
We’ve discussed many capabilities. How do you get started? Every Hadoop deployment breaks down into these three core platform use cases. There is a linear progression from data integration (getting data ready) to data discovery/analytics (looking at data, building models), to real-time data applications (deploying models for business transformation).
- Data Integration Use Cases: Data Warehouse Offload, ETL Offload, Mainframe Offload, Active Archive, Real-Time Data Pipelines, Infrastructure Consolidation
- Data Discovery & Analytics Use Cases: Self-Service Data Discovery, Analytics Modernization, Data Science / Advanced Analytics
- Real-Time Data Applications Use Cases: Monitoring and Detection, Real-Time Analytics, Recommendation Engine, Personalization
30
More Details
- New cloud scenarios with Impala on S3
- Impala on S3 vs EBS: considerations-for-s3-vs-ebs/
- Impala compared to Redshift: elasticity-agility-and-cost-performance-benefits-on-aws/
Try it out!
31
Thank You!
32
Cloud Deployment Patterns
Transient Workloads
- Spin-up, run, and spin-down hardware
- Persistent data in S3

Elastic Combination Workloads
- Grow/shrink based on demand
- Use HDFS or S3

Persistent Workloads