OMOP CDM on Hadoop Reference Architecture

Target audience: technical or IT staff currently using, or hoping to use, the OHDSI OMOP CDM.

Goals:
- Show a design for the CDM using Apache open source ‘big data’ tools as an alternative to Oracle or Postgres.
- Show ingestion paths or approaches for various data types and data sources.
- Map the tools most commonly used as of November 2016 to the business purposes they fit. The selection is based on market share and is not exclusive to Cloudera.

Glossary of acronyms:
- OHDSI: Observational Health Data Sciences and Informatics
- OMOP: Observational Medical Outcomes Partnership
- CDM: Common Data Model
- HDFS: Hadoop Distributed File System
- RDBMS: Relational Database Management System
- SQL: Structured Query Language

Revision history:
- Original: November 29, 2016

Send questions to sdolley at cloudera dot com.
Original technical authors: Derek Kane, Tom White.

N.B. This is offered as one opinion on the currently most selected set of choices to fill technical capabilities, and is not inclusive of all approaches.

OMOP CDM on Hadoop Reference Architecture: BIG DATA SUPERSET ARCHITECTURE FOR A DATA LAKE

Diagram boxes (technical capability: tools):
- BATCH PROCESSING: SPARK, HIVE, SQOOP, MAPREDUCE, PIG
- ANALYTIC SQL: IMPALA
- SEARCH ENGINE: SOLR
- MACHINE LEARNING: SPARK
- REAL-TIME PROCESSING: SPARK STREAMING, KAFKA, FLUME
- UPDATEABLE, ANALYTIC STORAGE: KUDU
- WORKLOAD MANAGEMENT: YARN
- WORKFLOW MANAGEMENT: OOZIE
- SECURITY: SENTRY, RECORD SERVICE, HDFS Encryption, TLS/SSL, Kerberos
- FILE STORAGE: PARQUET
- FILESYSTEM: HDFS
- ONLINE NOSQL: HBASE
- Three boxes labeled "Source"

Legend: Box titles in black bold such as “BATCH PROCESSING” are the technical capability required to meet some business requirement(s). Tool names in dark gray such as “SPARK” or “Kerberos” are software/Apache projects that can be implemented to meet the technical capability need. Boxes near the bottom of the architecture tend toward storing data; middle boxes tend toward managing or organizing the data; boxes at the top tend toward enabling users to access the data.

NB: alternate, less common tool options can be found later in this document.

MINIMUM SOFTWARE TO STAND UP CDM IN HADOOP

Diagram boxes (technical capability: tools):
- BATCH PROCESSING: SPARK, HIVE, SQOOP, MAPREDUCE, PIG
- ANALYTIC SQL: IMPALA
- SEARCH ENGINE: SOLR
- MACHINE LEARNING: SPARK
- REAL-TIME PROCESSING: SPARK STREAMING, KAFKA, FLUME
- UPDATEABLE, ANALYTIC STORAGE: KUDU
- WORKLOAD MANAGEMENT: YARN
- WORKFLOW MANAGEMENT: OOZIE
- SECURITY: SENTRY, RECORD SERVICE, HDFS Encryption, TLS/SSL, Kerberos
- FILE STORAGE: PARQUET
- FILESYSTEM: HDFS
- ONLINE NOSQL: HBASE
- Three boxes labeled "Source"

Legend: Box titles in black bold such as “BATCH PROCESSING” are the technical capability required to meet some business requirement(s). Tool names in dark gray such as “SPARK” or “Kerberos” are software/Apache projects that can be implemented to meet the technical capability need. Boxes near the bottom of the architecture tend toward storing data; middle boxes tend toward managing or organizing the data; boxes at the top tend toward enabling users to access the data.

Hadoop Ingestion Paths

Ingestion method varies by data source and by the complexity of the transformations needed. There are three approaches; select one or many:
- Flat file ingestion: CSV/TSV/TXT -> Hive
- RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS -> SQOOP -> Hive
- Complex ingestion (JSON, XML, nested sources, custom, etc.): source file -> Spark -> Hive

Note: Data is most often physically stored in HDFS. This graphic shows the most commonly used tools for getting data into HDFS.
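The choice among the three ingestion paths can be sketched as a small routing function. This is an illustrative sketch only, not part of the original deck: the function names, the extension sets, and the rule that a `jdbc:` prefix marks a relational source are all assumptions made for the example.

```python
from pathlib import PurePosixPath
from typing import List

# Tool chains as named on the slide, keyed by a hypothetical source category.
INGESTION_PATHS = {
    "flat_file": ["Hive"],          # CSV/TSV/TXT -> Hive
    "rdbms": ["Sqoop", "Hive"],     # RDBMS -> Sqoop -> Hive
    "complex": ["Spark", "Hive"],   # JSON, XML, nested, custom -> Spark -> Hive
}

# Assumption: flat files are recognized by extension only.
FLAT_EXTENSIONS = {".csv", ".tsv", ".txt"}


def classify_source(source: str) -> str:
    """Return the ingestion category for a source path or JDBC URL."""
    # Assumption: relational sources are given as JDBC connection strings.
    if source.lower().startswith("jdbc:"):
        return "rdbms"
    suffix = PurePosixPath(source).suffix.lower()
    if suffix in FLAT_EXTENSIONS:
        return "flat_file"
    # Everything else is treated as needing custom parsing.
    return "complex"


def ingestion_path(source: str) -> List[str]:
    """Return the ordered list of tools for the given source."""
    return INGESTION_PATHS[classify_source(source)]


if __name__ == "__main__":
    for src in ["/landing/person.csv",
                "jdbc:postgresql://dbhost/ehr",
                "/landing/claims.json"]:
        print(src, "->", " -> ".join(ingestion_path(src)))
```

In practice the decision also depends on how much transformation the data needs, which a file extension alone cannot capture; a delimited file that needs heavy reshaping may still be better served by the Spark path.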

Hadoop Ingest/Egress, Hive & Impala

GETTING DATA IN
Ingestion method varies by data source and by the complexity of the transformations needed. There are three approaches; select one or many:
- Flat file ingestion: CSV/TSV/TXT -> Hive
- RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS -> SQOOP -> Hive
- Complex ingestion (JSON, XML, nested sources, custom, etc.): source file -> Spark -> Hive

DATA STORED & MANAGED (the same for all three approaches)
- Hive Catalog
- HDFS

GETTING DATA OUT (the same for all three approaches)
- Impala, running SQL from SQL Render for OHDSI apps
- Other SQL, other apps (optional)

SQL on Hadoop

Hive and Impala share the same metadata repository (data dictionary). Data ingested into HDFS and made available in Hive is also available in Impala.
- Hive: for ETL and batch processing
- Impala: for fast, in-memory analytics, reached through ODBC/JDBC tools
- Metadata: in the Hive Catalog
- Data: in HDFS
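As a rough illustration of the shared catalog, the sketch below generates a HiveQL statement that exposes delimited files in HDFS as an external table, plus the Impala statement commonly used to make a table created through Hive visible to Impala. The database name `omop`, the HDFS directory, and the column subset of the CDM `person` table are assumptions chosen for the example, and whether `INVALIDATE METADATA` is required depends on the Impala version and configuration.

```python
from typing import List, Tuple


def hive_external_table_ddl(database: str, table: str,
                            columns: List[Tuple[str, str]],
                            hdfs_dir: str, delimiter: str = ",") -> str:
    """Build a HiveQL CREATE EXTERNAL TABLE statement over delimited text files."""
    cols = ",\n  ".join(f"{name} {sql_type}" for name, sql_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {cols}\n"
        f")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"STORED AS TEXTFILE\n"
        f"LOCATION '{hdfs_dir}'"
    )


def impala_refresh_statement(database: str, table: str) -> str:
    """Build the Impala statement that reloads catalog metadata for one table."""
    return f"INVALIDATE METADATA {database}.{table}"


if __name__ == "__main__":
    # Illustrative subset of person columns; names and types are assumptions.
    person_columns = [
        ("person_id", "BIGINT"),
        ("gender_concept_id", "INT"),
        ("year_of_birth", "INT"),
    ]
    # Run the first statement in Hive, the second in Impala.
    print(hive_external_table_ddl("omop", "person", person_columns,
                                  "/data/omop/person"))
    print(impala_refresh_statement("omop", "person"))
```

The script only prints the statements; submitting them would require a Hive client and an Impala client, which are left out to keep the sketch self-contained.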

Cloudera options

- Batch data ingest: Sqoop, Flume
- Programming framework: Spark, MapReduce
- Streaming: Kafka, Spark Streaming
- Workflow engine: Oozie
- Security: Sentry, Kerberos, RecordService, HDFS Encryption
- Interactive database (SQL): Impala, Hive on Spark, Hive on MapReduce
- Search: SOLR / Cloudera Search
- Other: Pig
- Data storage file format: Parquet, Avro
- Storing data (key value): HBase
- Storing data (columnar): Kudu
- Storing data (files): Hadoop File System, AWS S3, AWS EBS, EMC Isilon OneFS

With other options listed

- Batch data ingest: Sqoop, Flume
- Programming framework: Spark, MapReduce
- Streaming: Kafka, Spark Streaming, Storm
- Workflow engine: Oozie, NiFi (aka Hortonworks DataFlow)
- Security: Sentry, Kerberos, RecordService, HDFS Encryption, Ranger, Knox
- Interactive database (SQL): Impala, Hive on Spark, Hive on MapReduce, Hive on Tez, HAWQ, Presto
- Search: SOLR / Cloudera Search
- Other: Pig
- Data storage file format: Parquet, Avro
- Storing data (key value): HBase, Cassandra, Accumulo
- Storing data (columnar): Kudu
- Storing data (files): Hadoop File System, EMC Isilon OneFS, S3

NB: due to the authors' lack of familiarity with many of the alternative tools listed here, we cannot guarantee that the tools shown provide the technical capability shown.

Minimum projects/products needed for ingesting syndicated data into OMOP CDM for analysis

- Interactive database (SQL), for running queries: Impala, Hive
- Storing data (files): Hadoop File System (HDFS)
- Batch data ingest, for ingesting data into CDM: Sqoop
  AND/OR
- Programming framework, for ingesting data into CDM: Spark

NB: this is one approach. Multiple tools exist as alternatives, and adding tools to this architecture can make your process faster, more organized, or more secure, or bring other benefits.
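For the Sqoop option in this minimum stack, an import from a relational source into a Hive table might be assembled as below. This is a sketch under stated assumptions: the JDBC URL, user name, password file path, and table names are placeholders, and the helper `sqoop_import_command` is hypothetical. It uses only standard `sqoop import` options and builds the argument list without running it.

```python
import shlex
from typing import List


def sqoop_import_command(jdbc_url: str, username: str, password_file: str,
                         source_table: str, hive_table: str,
                         num_mappers: int = 4) -> List[str]:
    """Build the argument list for a Sqoop import that lands in a Hive table."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the source RDBMS
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", source_table,           # table to read from the RDBMS
        "--hive-import",                   # load the result into Hive
        "--hive-table", hive_table,        # target Hive table name
        "--num-mappers", str(num_mappers), # parallel map tasks
    ]


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    cmd = sqoop_import_command(
        jdbc_url="jdbc:postgresql://dbhost/ehr",
        username="etl_user",
        password_file="/user/etl_user/.db_password",
        source_table="person",
        hive_table="person",
    )
    # Print a shell-safe version of the command for review.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

On a host where Sqoop is installed, the list could be passed to `subprocess.run(cmd, check=True)`. Once the table is in Hive, it can be queried from Impala through the shared catalog.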