© Cloudera, Inc. All rights reserved.

Friction-Free ELT: Automating Data Transformation with Cloudera Impala
Marcel Kornacker // Cloudera, Inc.

Outline

- Evolution of ELT in the context of analytics: traditional systems, Hadoop today
- Cloudera's vision for ELT: make most of it disappear, automate data transformation

Traditional ELT

- Extract: physical extraction from the source data store, which could be an RDBMS acting as an operational data store, or log data materialized as JSON
- Load: the target analytic DBMS converts the transformed data into its binary format (typically columnar)
- Transform: data cleansing and standardization; conversion of naturally complex/nested data into a flat relational schema

Traditional ELT

Three aspects to the traditional ELT process:

1. Semantic transformation, such as data standardization/cleansing » makes data more queryable, adds value
2. Representational transformation, from source to target schema (from complex/nested to flat relational) » a "lateral" transformation that doesn't change semantics; adds operational overhead
3. Data movement, from source to staging area to target system » adds yet more operational overhead

Traditional ELT

The goals of "analytics with no ELT":

- simplify aspect 1
- eliminate aspects 2 and 3

ELT with Hadoop Today

A typical ELT workflow with Hadoop looks like this: raw source data initially lands in HDFS (examples: text/XML/JSON log files) ...

ELT with Hadoop Today

... map that data into a table to make it queryable ...

CREATE TABLE RawLogData (…)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '…'
LOCATION '/raw-log-data/';

ELT with Hadoop Today

... create the target table at a different location ...

CREATE TABLE LogData (…)
STORED AS PARQUET
LOCATION '/log-data/';

ELT with Hadoop Today

... convert the raw source data to the target format ...

INSERT INTO LogData SELECT * FROM RawLogData;

ELT with Hadoop Today

... the data is then available for batch reporting/analytics (via Impala, Hive, Pig, Spark) or interactive analytics (via Impala, Search).
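The conversion step in this workflow can be sketched in Python: reading JSON-lines records and pivoting them into a column-oriented layout, which is roughly what the INSERT ... SELECT into the Parquet table achieves. This is a simplified illustration of the idea, not Impala's actual implementation; the field names are invented.

```python
import json

def to_columnar(json_lines):
    """Pivot row-oriented JSON-lines records into a column-oriented
    layout (one list per column), mimicking what the conversion into
    a columnar format does conceptually."""
    rows = [json.loads(line) for line in json_lines]
    columns = {}
    # First pass: collect the full set of column names.
    for row in rows:
        for key in row:
            columns.setdefault(key, [])
    # Second pass: append each row's value (or NULL) to every column.
    for row in rows:
        for key, col in columns.items():
            col.append(row.get(key))  # missing fields become NULLs
    return columns

raw = ['{"ts": 1, "msg": "start"}', '{"ts": 2, "msg": "stop"}']
print(to_columnar(raw))  # {'ts': [1, 2], 'msg': ['start', 'stop']}
```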

ELT with Hadoop Today

Compared to traditional ELT, this has several advantages:

- Hadoop acts as a centralized location for all data: raw source data lives side by side with the transformed data
- data does not need to be moved between multiple platforms/clusters
- data in the raw source format is queryable as soon as it lands, although at reduced performance compared to an optimized columnar data format

ELT with Hadoop Today

However, even this still has drawbacks:

- new data needs to be loaded periodically into the target table, and doing that reliably and within SLAs can be a challenge
- you now have two tables: one with current but slow data, another with lagging but fast data

(diagram: another user-managed data pipeline, from Text/XML/JSON through Parquet to Impala, Hive, Spark, and Pig)

A Vision for Analytics with No ELT

- no explicit loading/conversion step to move raw data into a target table
- a single view of the data that is up-to-date and (mostly) in an efficient columnar format
- conversion is automated, with custom logic where needed

(diagram: Text/XML/JSON and Parquet behind one automated view, queried by Impala, Hive, Spark, and Pig)

A Vision for Analytics with No ELT

- support for complex/nested schemas » avoid remapping of raw data into a flat relational schema
- background and incremental data conversion » retain an in-place, single view of the entire data set, with most data in an efficient format
- bonus: schema inference and schema evolution » start analyzing data as soon as it arrives, regardless of its complexity

Support for Complex Schemas in Impala

- Standard relational: all columns have scalar values: CHAR(n), DECIMAL(p, s), INT, DOUBLE, TIMESTAMP, etc.
- Complex types: structs, arrays, maps; in essence, a nested relational schema
- Supported file formats: Parquet, JSON, XML, Avro
- Design principle for the SQL extensions: maintain SQL's way of dealing with multi-valued data

Support for Complex Schemas in Impala

Sample schema:

CREATE TABLE Customers (
  cid BIGINT,
  address STRUCT<street: STRING, zip: INT, city: STRING>,
  orders ARRAY<STRUCT<
    total: DECIMAL(9, 2),
    items: ARRAY<STRUCT<iid: BIGINT, price: DECIMAL(9, 2), qty: INT>>>>
)

SQL Syntax Extensions: Structs

- Paths in expressions: a generalization of column references to drill through structs
- can appear anywhere a conventional column reference is legal

SELECT address.zip, c.address.street
FROM Customers c
WHERE address.city = 'Oakland'

SQL Syntax Extensions: Arrays and Maps

Basic idea: nested collections are referenced like tables in the FROM clause:

SELECT SUM(i.price * i.qty)
FROM Customers.orders.items i
WHERE i.price > 10
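To make the semantics concrete, here is a small Python sketch that evaluates the same query over hypothetical in-memory records shaped like the Customers schema; all data values are invented for illustration.

```python
# Invented sample data matching the nested Customers schema.
customers = [
    {"cid": 1, "orders": [
        {"total": 50, "items": [{"iid": 1, "price": 20, "qty": 2},
                                {"iid": 2, "price": 5,  "qty": 2}]},
    ]},
    {"cid": 2, "orders": [
        {"total": 30, "items": [{"iid": 3, "price": 30, "qty": 1}]},
    ]},
]

# SELECT SUM(i.price * i.qty) FROM Customers.orders.items i WHERE i.price > 10
# The path Customers.orders.items becomes nested iteration.
total = sum(i["price"] * i["qty"]
            for c in customers
            for o in c["orders"]
            for i in o["items"]
            if i["price"] > 10)
print(total)  # 20*2 + 30*1 = 70
```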

SQL Syntax Extensions: Arrays and Maps

Parent/child relationships don't require join predicates:

SELECT c.cid, o.total
FROM Customers c, c.orders o

is identical to

SELECT c.cid, o.total
FROM Customers c INNER JOIN c.orders o

SQL Syntax Extensions: Arrays and Maps

- All join types are supported (INNER, OUTER, SEMI, ANTI)
- advanced querying capabilities via correlated inline views and subqueries

SELECT c.cid FROM Customers c LEFT ANTI JOIN c.orders o

SELECT c.cid FROM Customers c
WHERE EXISTS (SELECT * FROM c.orders.items WHERE iid = 117)

SQL Syntax Extensions: Arrays and Maps

Aggregates over collections are concise and easy to read:

SELECT c.cid, COUNT(c.orders), AVG(c.orders.items.price)
FROM Customers c

instead of

SELECT c.cid, a, b
FROM Customers c,
  (SELECT COUNT(*) AS a FROM c.orders),
  (SELECT AVG(price) AS b FROM c.orders.items)

SQL Syntax Extensions: Arrays and Maps

Aggregates over collections can replace subqueries:

SELECT c.cid FROM Customers c
WHERE COUNT(c.orders) > 10

instead of

SELECT c.cid FROM Customers c
WHERE (SELECT COUNT(*) FROM c.orders) > 10

Complex Schemas with Parquet

- Parquet optimizes storage of complex schemas by inlining the structural data into the columnar data (repetition/definition levels)
- no performance penalty for accessing nested collections!
- Parquet translates the logical hierarchy into physical collocation: you don't need parent/child joins, and queries run faster
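A minimal Python sketch of the repetition/definition-level idea, for a single repeated integer field. This is a heavily simplified version of the Dremel-style encoding Parquet uses (here max repetition level and max definition level are both 1); real Parquet handles arbitrary nesting depth.

```python
def shred(records):
    """Encode a repeated int field "items" into a flat column plus
    repetition and definition levels (Dremel-style, simplified)."""
    values, rep, defn = [], [], []
    for rec in records:
        items = rec.get("items", [])
        if not items:
            # Empty list: emit one NULL entry so the record is recoverable.
            values.append(None); rep.append(0); defn.append(0)
        else:
            for i, v in enumerate(items):
                values.append(v)
                rep.append(0 if i == 0 else 1)  # 0 marks the start of a new record
                defn.append(1)                  # the value is present
    return values, rep, defn

cols = shred([{"items": [1, 2]}, {"items": []}, {"items": [3]}])
print(cols)  # ([1, 2, None, 3], [0, 1, 0, 0], [1, 1, 0, 1])
```

Because the structure travels inside the column itself, the nested lists can be reassembled without any parent/child join, which is the point made above.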

Automated Data Conversion

Automated data conversion takes care of:

- format conversion: from text/JSON/Avro/… to Parquet
- compaction: multiple smaller files are coalesced into a single larger one

Sample workflow: create a table for the data:

CREATE TABLE LogData (…)
WITH CONVERSION TO PARQUET;

... then attach data to the table:

LOAD DATA INPATH '/raw-log-data/file1' INTO LogData
SOURCE FORMAT JSON;

Automated Data Conversion

(diagram: a single-table view over JSON and Parquet files, queried by Impala, Hive, Spark, and Pig)

Automated Data Conversion

The conversion process is:

- atomic: the switch from the source to the target data files is atomic from the perspective of a running query (but any running query sees the full data set)
- redundant: with the option to retain the original data
- incremental: Impala's catalog service automatically detects new data files that are not yet in the target format
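The incremental aspect can be illustrated with a toy Python function that computes which files still need conversion. The file names are hypothetical; the real catalog service works from HDFS metadata rather than simple name sets.

```python
def find_unconverted(source_files, converted):
    """Return the source files that have no converted counterpart yet,
    in the spirit of the catalog service's incremental detection."""
    return sorted(f for f in source_files if f not in converted)

# Hypothetical state: three JSON files landed, one already converted.
source_files = {"file1.json", "file2.json", "file3.json"}
already_converted = {"file2.json"}

todo = find_unconverted(source_files, already_converted)
print(todo)  # ['file1.json', 'file3.json']
```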

Automating Data Transformation: Derived Tables

- Specify the data transformation via a "view" definition
- the derived table is expressed as Select, Project, Join, Aggregation
- custom logic is incorporated via UDFs, UDAs

CREATE DERIVED TABLE CleanLogData AS
SELECT StandardizeName(name), StandardizeAddr(addr, zip), …
FROM LogData
STORED AS PARQUET;

Automating Data Transformation: Derived Tables

From the user's perspective:

- the table is queryable like any other table (but doesn't allow INSERTs)
- it reflects all data visible in the source tables at the time of the query (not at the time of CREATE)
- performance is close to that of a table created with CREATE TABLE … AS SELECT (i.e., that of a static snapshot)

Automating Data Transformation: Derived Tables

From the system's perspective, the table is the union of:

- physically materialized data, derived from the input tables as of some point in the past
- a view over the yet-unprocessed data of the input tables

The table is updated incrementally (and in the background) when new data is added to the input tables.
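A toy Python illustration of this union: `standardize_name` is a stand-in for the StandardizeName UDF from the earlier example, and all rows are invented. A query against the derived table sees the materialized data plus the transformation applied on the fly to the not-yet-processed delta.

```python
def standardize_name(name):
    # Stand-in for the StandardizeName UDF: trim and title-case.
    return name.strip().title()

# Rows transformed and materialized by a past background update.
materialized = [{"name": "Ada Lovelace"}]

# New base-table rows that arrived since that update.
unprocessed = [{"name": "  alan turing "}]

# The derived table = materialized data UNION transformed delta.
result = materialized + [{"name": standardize_name(r["name"])} for r in unprocessed]
print(result)  # [{'name': 'Ada Lovelace'}, {'name': 'Alan Turing'}]
```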

Automating Data Transformation: Derived Tables

(diagram: a single-table view combining materialized data with a delta query; base-table updates feed the delta, and Impala, Hive, Spark, and Pig query the combined view)

Schema Inference and Schema Evolution

Schema inference from data files is useful to reduce the barrier to analyzing complex source data:

- as an example, log data often has hundreds of fields
- the time required to create the DDL manually is substantial

Example: schema inference from structured data files:

CREATE TABLE LogData LIKE PARQUET '/log-data.pq'

- available today: Parquet
- future formats: XML, JSON, Avro
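A simplified Python sketch of what schema inference does: derive a column-to-SQL-type mapping from a sample of JSON records. This handles flat fields only; real inference must also deal with nested structures and type conflicts.

```python
import json

def infer_schema(json_lines):
    """Infer a flat column -> SQL-type mapping from JSON records.
    The first observed type for each column wins (simplification)."""
    sql_type = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}
    schema = {}
    for line in json_lines:
        for key, value in json.loads(line).items():
            schema.setdefault(key, sql_type.get(type(value), "STRING"))
    return schema

lines = ['{"ts": 1, "msg": "start"}', '{"ts": 2, "bytes": 3.5}']
print(infer_schema(lines))  # {'ts': 'BIGINT', 'msg': 'STRING', 'bytes': 'DOUBLE'}
```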

Schema Inference and Schema Evolution

Schema evolution is a necessary follow-on to schema inference:

- every schema evolves over time; explicit maintenance is as time-consuming as the initial creation
- algorithmic schema evolution requires sticking to generally safe schema modifications, i.e., adding new fields:
  - adding new top-level columns
  - adding fields within structs

Schema Inference and Schema Evolution

Example workflow:

LOAD DATA INPATH '/path' INTO LogData
SOURCE FORMAT JSON WITH SCHEMA EXPANSION;

- scans the data to determine new columns/fields to add
- synchronous: if there is an error, the load is aborted and the user is notified
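The schema-expansion behavior can be sketched as a merge that only permits additive changes and fails synchronously on a conflict. This is a toy model of the semantics described above, not Impala's implementation.

```python
def expand_schema(existing, incoming):
    """Merge newly inferred columns into an existing schema.
    Only additive changes are allowed; a type conflict on an
    existing column aborts the load with an error."""
    merged = dict(existing)
    for col, typ in incoming.items():
        if col in merged and merged[col] != typ:
            raise ValueError(f"type conflict on column {col!r}: "
                             f"{merged[col]} vs {typ}")
        merged.setdefault(col, typ)  # new columns are simply added
    return merged

base = {"ts": "BIGINT", "msg": "STRING"}
print(expand_schema(base, {"msg": "STRING", "host": "STRING"}))
# {'ts': 'BIGINT', 'msg': 'STRING', 'host': 'STRING'}
```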

Timeline of Features in Impala

- CREATE TABLE … LIKE: available today for Parquet; Impala 2.3+ for Avro, JSON, XML
- nested types: Impala 2.3
- background format conversion: Impala 2.4
- derived tables: after Impala 2.4

Conclusion

Hadoop offers a number of advantages over traditional multi-platform ETL solutions:

- availability of all data sets on a single platform
- data becomes accessible through SQL as soon as it lands

However, this can be improved further with:

- a richer analytic SQL that is extended to handle nested data
- an automated background conversion process that preserves an up-to-date view of all data while providing BI-typical performance
- a declarative transformation process that focuses on application semantics and removes operational complexity
- simple automation of initial schema creation and subsequent maintenance that makes dealing with large, complex schemas less labor-intensive

Thank you

Marcel Kornacker