Integration of Oracle and Hadoop: hybrid databases affordable at scale

Slides:



Advertisements
Similar presentations
High Performance Analytical Appliance MPP Database Server Platform for high performance Prebuilt appliance with HW & SW included and optimally configured.
Advertisements

Evaluation of distributed open source solutions in CERN database use cases HEPiX, spring 2015 Kacper Surdy IT-DB-DBF M. Grzybek, D. L. Garcia, Z. Baranowski,
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark 2.
CERN IT Department CH-1211 Geneva 23 Switzerland t Sequential data access with Oracle and Hadoop: a performance comparison Zbigniew Baranowski.
Hadoop Ecosystem Overview
SQL on Hadoop. Todays agenda Introduction Hive – the first SQL approach Data ingestion and data formats Impala – MPP SQL.
Scale-out databases for CERN use cases Strata Hadoop World London 6 th of May,2015 Zbigniew Baranowski, CERN IT-DB.
Database Services for Physics at CERN with Oracle 10g RAC HEPiX - April 4th 2006, Rome Luca Canali, CERN.
Hadoop tutorials. Todays agenda Hadoop Introduction and Architecture Hadoop Distributed File System MapReduce Spark Cluster Monitoring 2.
CERN IT Department CH-1211 Geneva 23 Switzerland t CF Computing Facilities Agile Infrastructure Monitoring CERN IT/CF.
Hadoop IT Services Hadoop Users Forum CERN October 7 th,2015 CERN IT-D*
CERN - IT Department CH-1211 Genève 23 Switzerland t High Availability Databases based on Oracle 10g RAC on Linux WLCG Tier2 Tutorials, CERN,
Impala. Impala: Goals General-purpose SQL query engine for Hadoop High performance – C++ implementation – runtime code generation (using LLVM) – direct.
1 © Cloudera, Inc. All rights reserved. Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks Nong Li | Lenni Kuff | Stephen.
Cloudera Kudu Introduction
Scalable data access with Impala Zbigniew Baranowski Maciej Grzybek Daniel Lanza Garcia Kacper Surdy.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
Data Analytics and Hadoop Service in IT-DB Visit of Cloudera - April 19 th, 2016 Luca Canali (CERN) for IT-DB.
Practical Hadoop: do’s and don’ts by example Kacper Surdy, Zbigniew Baranowski.
Data Analytics Challenges Some faults cannot be avoided Decrease the availability for running physics Preventive maintenance is not enough Does not take.
1 Gaurav Kohli Xebia Breaking with DBMS and Dating with Relational Hbase.
Hadoop Introduction. Audience Introduction of students – Name – Years of experience – Background – Do you know Java? – Do you know linux? – Any exposure.
From RDBMS to Hadoop A case study Mihaly Berekmeri School of Computer Science University of Manchester Data Science Club, 14th July 2016 Hayden Clark,
Hadoop file format studies in IT-DB Analytics WG meeting 20 th of May, 2015 Daniel Lanza, IT-DB.
Pilot Kafka Service Manuel Martín Márquez. Pilot Kafka Service Manuel Martín Márquez.
Integration of Oracle and Hadoop: hybrid databases affordable at scale
OMOP CDM on Hadoop Reference Architecture
Apache Spark: A Unified Engine for Big Data Processing
Connected Infrastructure
Univa Grid Engine Makes Work Management Automatic and Efficient, Accelerates Deployment of Cloud Services with Power of Microsoft Azure MICROSOFT AZURE.
5/7/ :44 AM © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN.
PROTECT | OPTIMIZE | TRANSFORM
Smart Building Solution
Database Services Katarzyna Dziedziniewicz-Wojcik On behalf of IT-DB.
Data Analytics and CERN IT Hadoop Service
Hadoop and Analytics at CERN IT
Running virtualized Hadoop, does it make sense?
Collecting heterogeneous data into a central repository
Scaling SQL with different approaches
Smart Building Solution
New Big Data Solutions and Opportunities for DB Workloads
Couchbase Server is a NoSQL Database with a SQL-Based Query Language
Connected Infrastructure
Enabling Scalable and HA Ingestion and Real-Time Big Data Insights for the Enterprise OCJUG, 2014.
Data Platform and Analytics Foundational Training
Data Analytics and CERN IT Hadoop Service
Data Analytics and CERN IT Hadoop Service
Powering real-time analytics on Xfinity using Kudu
Lab #2 - Create a movies dataset
Data Analytics and CERN IT Hadoop Service
Designed for Big Data Visual Analytics, Zoomdata Allows Business Users to Quickly Connect, Stream, and Visualize Data in the Microsoft Azure Platform MICROSOFT.
Data Analytics – Use Cases, Platforms, Services
Big Data - in Performance Engineering
DeFacto Planning on the Powerful Microsoft Azure Platform Puts the Power of Intelligent and Timely Planning at Any Business Manager’s Fingertips Partner.
Apache Spark for RDBMS Practitioners:
20 Questions with Azure SQL Data Warehouse
Data science and machine learning at scale, powered by Jupyter
Hadoop for SQL Server Pros
Introduction to Apache
Appcelerator Arrow: Build APIs in Minutes. Connect to Any Data Source
Managing batch processing Transient Azure SQL Warehouse Resource
XtremeData on the Microsoft Azure Cloud Platform:
Overview of big data tools
Introduction to SAP HANA
IBM C IBM Big Data Engineer. You want to train yourself to do better in exam or you want to test your preparation in either situation Dumpspedia’s.
Big-Data Analytics with Azure HDInsight
Copyright © JanBask Training. All rights reserved Get Started with Hadoop Hive HiveQL Languages.
Architecture of modern data warehouse
Presentation transcript:

Integration of Oracle and Hadoop: hybrid databases affordable at scale Luca Canali, Zbigniew Baranowski, Prasanth Kothuri CERN IT, Geneva (CH)

Advantages of integrating Oracle and Hadoop Best of two worlds: Oracle, optimized for Online Transactional System Hadoop, scalable distributed data processing platform Hybrid systems: Move (read-only) data from Oracle to Hadoop Query Hadoop data from Oracle (using Oracle APIs) Also possible: query Oracle from Hadoop Increase scalability and lower ratio cost/performance Hadoop data formats and engines for high performance analytics ….without need of changing the end-user apps connecting to Oracle

Oracle optimized for OLTP, Hadoop affordable at scale Interconnect Node 1 Node 2 Node n The shared nothing architecture allows to scale for high capacity and throughput on commodity HW Example of Oracle RAC deployed with shared storage

The Hadoop ecosystem is heterogeneous and evolving Main Hadoop components at the CERN-IT Hadoop service (2016):

Hadoop data formats and data ingestion Are an important dimension to the architecture Compression and encoding for analytic workloads Columnar formats (Parquet) Data ingestion also very important Sqoop to transfer from databases Kafka, very successful for message-oriented ingestion

Examples from CERN and HEP ATLAS DDM for reports FCC reliability studies Accelerator and industrial controls at CERN Archive and reporting use cases to be explored Example: Speedup of a reporting query from CERN Network experts Running in Oracle in 12 hours (no parallel query allowed in prod) Moved to Hadoop and run in parallel in minutes (throw HW to the problem)

The biggest (RDBMS) database at CERN CERN Accelerator Logging System An archive system for metrics from most of devices and systems installed and used by LHC 500 TB in Oracle 500 GB produced per day (15 billion data points) Offloaded to Hadoop with Apache Sqoop Daily export with 10 parallel streams, duration 3 hours (40 MB/s) Parquet format with Snappy compression: factor 3.3 Size on HDFS: 140 TB

Techniques for integrating Oracle and Hadoop Export data from Oracle to HDFS Sqoop good enough for most cases Other options possible (custom ingestion, Oracle DataPump, streaming, ..) Query Hadoop from Oracle Access tables in Hadoop engines using DB links in Oracle Build hybrid views: transparently combine data in Oracle and Hadoop Use Hadoop frameworks to process data in Oracle DBs Use Hadoop engines (Impala, Spark) to process data exported from Oracle Read data in a RDBMS directly from Spark SQL with JDBC

Offloading from Oracle to Hadoop Step1: Offload data to Hadoop Step2: Offload queries to Hadoop Oracle database Hadoop cluster Table data export Apache Sqoop Data formats: Parquet, Avro SQL Oracle Hadoop SQL engines: Impala, Hive Offload interface: DB LINK, External table Offloaded SQL

How to access Hadoop from an Oracle query Query Apache Hive/Impala tables from Oracle using a database link Query offloaded via ODBC gateway to Impala (or Hive) SQL operations that can be offloaded Filtering predicates are pushed to Hadoop Problem: grouping aggregates are not pushed There are techniques to work around this problem Create aggregation with views in Hive/Impala DBMS_HS_PASSTHROUGH – to push exact SQL statement to Hadoop create database link my_hadoop using 'impala-gateway'; select * from big_table@my_hadoop where col1 = :val1; Oracle database HDFS SQL ODBC gateway Impala/Hive execute OracleNet Thrift Hadoop Max ~20k rows/s

Making data sources transparent to end-user Hybrid views on Oracle Recent (read-write) data in Oracle Archive data in Hadoop Advantages: Hadoop performance with unchanged applications (Oracle APIs) Split point has to be updated after each successful data offload create view hybrid_view as select * from online_table where date > '2016-10-01' union all select * from archival_table@hadoop where date <= '2016-10-01'

Specialized products for hybrid DB and offloading Oracle BigData SQL Custom engine to query Hadoop from Oracle Based on Oracle external tables (predicate pushing) Gluent Inc Similar functionality to BigData SQL Uses Apache Impala to process data on Hadoop Leverages hybrid views on Oracle for data integrity Predicate pushing and partition pruning Data retrieval >10x faster than ODBC gateway

Query Oracle from Apache Spark Spark SQL using JDBC to query Oracle directly To access metadata and/or lookup tables Use to retrieve metadata that can quickly become stale Example code to query Oracle into a Spark DataFrame (Python): df = sqlContext.read.format('jdbc').options( url="jdbc:oracle:thin:@ORACLE_DB/orcl.cern.ch", user="myuser", password="mypass", fetchSize=1000, dbtable="(select id, payload from my_oracle_table) df", driver="oracle.jdbc.driver.OracleDriver" ).load()

Conclusions Hadoop performs at scale, excellent for data analytics Oracle proven for concurrent transactional workloads Solutions are available to integrate Oracle and Hadoop There is value in hybrid systems (Oracle + Hadoop): Oracle APIs for legacy applications and OLTP workloads Scalability on commodity HW for analytic workloads

Hadoop Service at CERN Service provided by CERN-IT for Experiments and CERN users Projects ongoing with Experiments, Accelerators sector and IT Hadoop Users Forum for open discussions: subscribe to egroup it-analytics-wg Getting started material: Hadoop tutorials https://indico.cern.ch/event/546000/ Related talks/posters at CHEP 2016: First results from a combined analysis of CERN computing infrastructure metrics, talk on Tuesday at 12:00 A study of data representations in Hadoop to optimize data storage and search performance of the ATLAS EventIndex, poster on Tuesday 16:30 Big Data Analytics for the Future Circular Collider Reliability and Availability Studies, talk on Thursday at 14:15 Hadoop and friends - first experience at CERN with a new platform for high throughput analysis steps, talk on Thursday at 14:45 Developing and optimizing applications for the Hadoop environment, talk on Thursday at 15:15