OMOP CDM on Hadoop Reference Architecture

Presentation transcript:

OMOP CDM on Hadoop Reference Architecture

Target audience: technical or IT staff currently using or hoping to use the OHDSI OMOP CDM

Goals:
- Show a design for the CDM using 'big data' tools that are Apache open source, as an alternative to Oracle or Postgres
- Show ingestion paths or approaches for various data types and data sources
- Map the tools most commonly used as of Nov 2016 to the business purposes they fit, based on market share but not exclusive to Cloudera

Glossary of acronyms:
- OHDSI: Observational Health Data Sciences and Informatics
- OMOP: Observational Medical Outcomes Partnership
- CDM: Common Data Model
- HDFS: Hadoop Distributed File System
- RDBMS: Relational Database Management System
- SQL: Structured Query Language

Revision history:
- Original: November 29, 2016

Send questions to sdolley at cloudera dot com
Original technical authors: Derek Kane, Tom White

N.B. This is offered as one opinion on the currently most selected set of choices to fill technical capabilities, and is not inclusive of all approaches.

OMOP CDM on Hadoop Reference Architecture: big data superset architecture for a data lake

- BATCH PROCESSING: SPARK, HIVE, SQOOP, MAPREDUCE, PIG
- ANALYTIC SQL: IMPALA
- SEARCH ENGINE: SOLR
- MACHINE LEARNING: SPARK
- REAL-TIME PROCESSING: SPARK STREAMING, KAFKA, FLUME
- UPDATEABLE, ANALYTIC STORAGE: KUDU
- WORKLOAD MANAGEMENT: YARN
- WORKFLOW MANAGEMENT: OOZIE
- SECURITY: SENTRY, RECORD SERVICE, HDFS Encryption, TLS/SSL, Kerberos
- FILE STORAGE: PARQUET
- FILESYSTEM: HDFS
- ONLINE NOSQL: HBASE
- Data sources: three boxes, each labeled "Source"

Legend: Box titles in black bold such as "BATCH PROCESSING" are the technical capability required to meet some business requirement(s). Tool names in dark gray such as "SPARK" or "Kerberos" are software/Apache projects that can be implemented to meet the technical capability need. Boxes near the bottom of the architecture tend toward storing data; middle boxes tend toward managing or organizing the data; boxes at the top tend toward enabling users to access the data.

NB: alternate, less common tool options can be found later in this document.

Minimum software to stand up CDM in Hadoop

- BATCH PROCESSING: SPARK, HIVE, SQOOP, MAPREDUCE, PIG
- ANALYTIC SQL: IMPALA
- SEARCH ENGINE: SOLR
- MACHINE LEARNING: SPARK
- REAL-TIME PROCESSING: SPARK STREAMING, KAFKA, FLUME
- UPDATEABLE, ANALYTIC STORAGE: KUDU
- WORKLOAD MANAGEMENT: YARN
- WORKFLOW MANAGEMENT: OOZIE
- SECURITY: SENTRY, RECORD SERVICE, HDFS Encryption, TLS/SSL, Kerberos
- FILE STORAGE: PARQUET
- FILESYSTEM: HDFS
- ONLINE NOSQL: HBASE
- Data sources: three boxes, each labeled "Source"

Legend: Box titles in black bold such as "BATCH PROCESSING" are the technical capability required to meet some business requirement(s). Tool names in dark gray such as "SPARK" or "Kerberos" are software/Apache projects that can be implemented to meet the technical capability need. Boxes near the bottom of the architecture tend toward storing data; middle boxes tend toward managing or organizing the data; boxes at the top tend toward enabling users to access the data.

Hadoop Ingestion Paths

Ingestion method varies by data source and complexity of transformations needed. Three approaches; select one or many:

- Flat file ingestion: CSV/TSV/TXT -> Hive
- RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS -> SQOOP -> Hive
- Complex ingestion (JSON, XML, nested sources, custom, etc.): source file -> Spark -> Hive

Note: All data is most often physically stored in HDFS. This graphic shows the most commonly used tools to manage getting data into HDFS.
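The choice among the three ingestion paths can be sketched in a few lines of Python. This is only an illustration of the decision on this slide: the function name `choose_ingestion_path`, the file-extension rule, and the JDBC-prefix rule are assumptions made for the example, not part of the original deck.

```python
from pathlib import Path
from typing import List

# Assumption for illustration: flat files are recognized by extension,
# relational sources by a JDBC URL prefix. The slide names only the categories.
FLAT_FILE_EXTENSIONS = {".csv", ".tsv", ".txt"}
RDBMS_PREFIXES = (
    "jdbc:mysql:",
    "jdbc:postgresql:",
    "jdbc:oracle:",
    "jdbc:sqlserver:",
)


def choose_ingestion_path(source: str) -> List[str]:
    """Return the tool chain from the slide for a given source (hypothetical helper)."""
    if source.lower().startswith(RDBMS_PREFIXES):
        # Relational databases: RDBMS -> Sqoop -> Hive
        return ["RDBMS", "Sqoop", "Hive"]
    if Path(source).suffix.lower() in FLAT_FILE_EXTENSIONS:
        # Delimited text files can be loaded through Hive directly
        return ["Flat file", "Hive"]
    # JSON, XML, nested or custom sources: transform with Spark first
    return ["Source file", "Spark", "Hive"]
```

For example, `choose_ingestion_path("person.csv")` selects the flat-file path, while any source that is neither a flat file nor a JDBC URL falls through to the Spark path.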

Hadoop Ingest/Egress, Hive & Impala

GETTING DATA IN
Ingestion method varies by data source and complexity of transformations needed. Three approaches; select one or many:

- Flat file ingestion: CSV/TSV/TXT -> Hive
- RDBMS ingestion from relational databases (MySQL, PostgreSQL, Oracle, SQL Server, etc.): RDBMS -> SQOOP -> Hive
- Complex ingestion (JSON, XML, nested sources, custom, etc.): source file -> Spark -> Hive

DATA STORED & MANAGED (same for all three approaches)
- HDFS
- Hive Catalog

GETTING DATA OUT (same for all three approaches)
- Impala, receiving SQL from SQL Render on behalf of OHDSI apps
- Other SQL, other apps (optional)
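As a rough sketch of the RDBMS path above, the following Python helper assembles a Sqoop import command that lands a relational table in Hive. The helper name `build_sqoop_import`, the JDBC URL, the user name, and the table names are hypothetical; the flags used (`--connect`, `--username`, `-P`, `--table`, `--hive-import`, `--hive-table`, `--num-mappers`) are standard Sqoop import options, but the right set for a given site is an assumption to verify locally.

```python
from typing import List


def build_sqoop_import(
    jdbc_url: str,
    username: str,
    table: str,
    hive_table: str,
    num_mappers: int = 4,
) -> List[str]:
    """Build the argument list for a Sqoop import into Hive (hypothetical helper)."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,        # JDBC URL of the source database
        "--username", username,
        "-P",                         # prompt for the password on the console
        "--table", table,             # source table in the RDBMS
        "--hive-import",              # load the imported data into Hive
        "--hive-table", hive_table,   # target Hive table
        "--num-mappers", str(num_mappers),
    ]


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    command = build_sqoop_import(
        "jdbc:postgresql://dbhost/ehr", "etl_user", "person", "cdm.person"
    )
    print(" ".join(command))
```

Returning a list rather than a single string keeps the command safe to hand to `subprocess.run` without shell quoting.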

SQL on Hadoop

Hive and Impala share the same metadata repository (data dictionary). Data ingested into HDFS and made available in Hive is also available in Impala.

- HDFS: where the ingested data is stored
- Hive: for ETL and batch processing
- Metadata: held in the Hive Catalog
- Impala: for fast, in-memory analytics, used from ODBC/JDBC tools
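To make the shared catalog concrete, here is a minimal sketch that generates the two statements typically involved: a Hive DDL statement that registers Parquet files in the catalog, and the Impala statement that asks Impala to reload metadata for a table created outside it. The helper names, the `cdm` database, the HDFS path, and the column subset are illustrative assumptions; whether `INVALIDATE METADATA` is needed depends on the Impala version and configuration.

```python
from typing import Dict


def hive_create_table(
    database: str, table: str, columns: Dict[str, str], location: str
) -> str:
    """Hive DDL registering existing Parquet files in the shared catalog (sketch)."""
    column_list = ", ".join(
        f"{name} {sql_type}" for name, sql_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE {database}.{table} ({column_list}) "
        f"STORED AS PARQUET LOCATION '{location}'"
    )


def impala_invalidate_metadata(database: str, table: str) -> str:
    """Impala statement that reloads catalog metadata for one table (sketch)."""
    return f"INVALIDATE METADATA {database}.{table}"


if __name__ == "__main__":
    # Illustrative subset of the CDM person table; the path is hypothetical.
    person_columns = {
        "person_id": "BIGINT",
        "gender_concept_id": "INT",
        "year_of_birth": "INT",
    }
    print(hive_create_table("cdm", "person", person_columns, "/data/cdm/person"))
    print(impala_invalidate_metadata("cdm", "person"))
```

The first statement would be run in Hive and the second in Impala, after which the same table can be queried from either engine.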

Cloudera options

- Batch data ingest: Sqoop, Flume
- Programming framework: Spark, MapReduce
- Streaming: Kafka, Spark Streaming
- Workflow Engine: Oozie
- Security: Sentry, Kerberos, RecordService, HDFS Encryption
- Interactive Database (SQL): Impala, Hive on Spark, Hive on MapReduce
- Search: SOLR / Cloudera Search
- Other: Pig
- Data storage file format: Parquet, Avro
- Storing data (key value): Hbase
- Storing data (columnar): Kudu
- Storing data (files): Hadoop File System, AWS S3, AWS EBS, EMC Isilon OneFS

With other options listed

- Batch data ingest: Sqoop, Flume
- Programming framework: Spark, MapReduce
- Streaming: Kafka, Spark Streaming, Storm
- Workflow Engine: Oozie, Nifi (aka Hadoop Data Flow)
- Security: Sentry, Kerberos, RecordService, HDFS Encryption, Ranger, Knox
- Interactive Database (SQL): Impala, Hive on Spark, Hive on MapReduce, Hive on Tez, Hawq, Presto
- Search: SOLR / Cloudera Search
- Other: Pig
- Data storage file format: Parquet, Avro
- Storing data (key value): Hbase, Cassandra, Accumulo
- Storing data (columnar): Kudu
- Storing data (files): Hadoop File System, EMC Isilon OneFS, S3

NB: due to the authors' lack of familiarity with many of the alternative tools listed here, we cannot guarantee that the tools shown can provide the technical capability shown.

Minimum projects/products needed for ingesting syndicated data into OMOP CDM for analysis

- Interactive Database (SQL), for running queries: Impala, Hive
- Storing data (files): Hadoop File System (HDFS)
- Batch data ingest, for ingesting data into CDM: Sqoop
  AND/OR
- Programming framework, for ingesting data into CDM: Spark

NB: this is one approach. Multiple tools exist as alternatives, and adding tools to this architecture can make your process faster, more organized, or more secure, or bring other benefits.