An Information Architecture for Hadoop Mark Samson – Systems Engineer, Cloudera.



2 Background
© Cloudera, Inc. All rights reserved.
- The trend is for organisations to build business-wide Hadoop implementations
  - Enterprise Data Hub / Data Lake / Hadoop as a Service
- Many data sources
- Many lines of business
- Many use cases
- Many engines and tools available to process and analyse data
- Need to meet SLAs for data consumers
- How do I organise my information architecture within Hadoop to cope with this variety?
Need a Logical Information Architecture for Hadoop!

3 What are the requirements?
- Ingest data in its full fidelity, in as close to its original, raw form as possible
- Provide a data discovery and exploration facility for analysts and data scientists
- Bring together and link multiple data sets
- Serve data efficiently to business users and applications, meeting SLAs

4 Where does an Enterprise Data Hub fit?
(Diagram labels: Data Sources, Enterprise Data Hub, Data Consumers)
- Data Sources can be:
  - Databases / DWs
  - File Sources
  - Machines, Sensors (IoT)
  - Internet (Social Media etc.)
  - Mobile
- Data consumers can be:
  - Analysts
  - Data Scientists
  - Business Users (Reports)
  - Applications
- Enterprise Data Hub sits in between! (but it's not the only thing in between)

5 How does data arrive?
(Diagram labels: Data Sources, Enterprise Data Hub, Data Consumers)
Data can arrive in any form, e.g.
- Event data
- Log files
- Streaming, e.g. via MQ, Kafka
- Relational tables with any data model
  - Star schema
  - 3NF
- Files with any format
  - Text
  - JSON
  - XML
  - Avro
  - …

6 Raw Layer
(Diagram labels: Data Sources, Raw Layer, Data Consumers)
- Principle: ingest data raw, in full fidelity, as close as possible to the form in which it arrives
- Data organised in HDFS by data source, e.g. /landing/
- Writeable by ingestion processes, e.g. Flume, Sqoop
- Readable by transformation processes, e.g. Hive, Pig, MR, Spark
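
As an illustration of organising the Raw Layer by data source, here is a minimal sketch of a path-building helper. Only the `/landing` root comes from the slide; the `source/dataset/date` layout beneath it, the function name `landing_path` and the example source names are assumptions made for this sketch, not part of the presentation.

```python
import posixpath
from datetime import date

# Root taken from the slide; everything beneath it is an assumed layout.
LANDING_ROOT = "/landing"


def landing_path(source: str, dataset: str, ingest_date: date) -> str:
    """Build an HDFS-style directory for one day's raw data from one source.

    Hypothetical layout: /landing/<source>/<dataset>/<YYYY-MM-DD>
    """
    for name, value in (("source", source), ("dataset", dataset)):
        # Reject empty names and embedded separators so that each source
        # stays inside its own subtree.
        if not value or "/" in value:
            raise ValueError(f"{name} must be a single, non-empty path component")
    return posixpath.join(LANDING_ROOT, source, dataset, ingest_date.isoformat())


if __name__ == "__main__":
    print(landing_path("crm", "accounts", date(2015, 6, 3)))
```

Keeping the source name as the first component under the root means write access for an ingestion process and read access for a transformation process can each be granted on a single subtree.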

7 Discovery Layer
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Data Consumers)
- Used for Discovery and Exploration by small teams of Analysts and Data Scientists
- Users or teams given their own "sandpits" (at a cost?)
- Mix of views and materialised data
- Some data sets "enriched", e.g. by joining reference data
- Tools: Impala, Solr, Spark

8 Shared Layer
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Data Consumers)
- Available across LOBs (subject to security constraints)
- Incentives for Analyst / Data Science teams to move their data and use cases into this Layer
- Data from multiple sources joined together
- Tools: Impala, Hive, Pig, Spark

9 Optimised Layer
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Data Consumers)
- Build this when you need to operationalise the use case
- Organised by data consumer and use case, not by source
- Data modeled to provide optimised performance
  - Often denormalised
- Uses optimised storage formats, e.g. Parquet with partitioning, HBase
- Accessed by low latency query engines, e.g. HBase, Impala, Solr
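
The two ideas on this slide, denormalising and partitioning, can be sketched in plain Python. This is an illustration under stated assumptions rather than the presenter's implementation: the helper names, the `/optimised/...` base path and the sample column names are hypothetical, and no Parquet files are written. The `key=value` directory naming follows the common Hive-style partition convention.

```python
import posixpath
from collections import defaultdict


def denormalise(facts, reference, key):
    """Copy reference attributes onto each fact row so reads need no join."""
    return [{**row, **reference.get(row[key], {})} for row in facts]


def partition_dir(base, row, partition_keys):
    """Return a Hive-style partition directory, e.g. base/year=2015/month=6."""
    parts = [f"{k}={row[k]}" for k in partition_keys]
    return posixpath.join(base, *parts)


def group_by_partition(base, rows, partition_keys):
    """Group rows by the partition directory they would be written to."""
    groups = defaultdict(list)
    for row in rows:
        groups[partition_dir(base, row, partition_keys)].append(row)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical sample data for the sketch.
    orders = [
        {"order_id": 1, "customer_id": "c1", "year": 2015, "month": 6},
        {"order_id": 2, "customer_id": "c2", "year": 2015, "month": 7},
    ]
    customers = {"c1": {"segment": "retail"}, "c2": {"segment": "corporate"}}
    wide = denormalise(orders, customers, "customer_id")
    for path in sorted(group_by_partition("/optimised/sales", wide, ["year", "month"])):
        print(path)
```

A query engine that filters on the partition columns only needs to read the matching directories, which is one reason partitioned layouts suit low-latency access.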

10 What About Real Time?
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers)
To operationalise use cases in real time:
- Low latency components, e.g. Kafka, Flume, Spark Streaming
  - Consume straight from sources
  - Transform/analyse it
  - Deliver it direct to the Optimised Layer for low-latency query
  - Or deliver direct to consumer
- Generally still persist raw data in Raw Layer
- Follows the Lambda Architecture
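
The Lambda-style flow described on this slide can be sketched with in-memory stand-ins. This is a toy model under assumptions: the class and function names are hypothetical, Python lists and counters stand in for Kafka, HDFS and a low-latency store, and page-view counting is an invented example use case.

```python
from collections import Counter


def batch_view(raw_events):
    """Batch path: recompute page-view counts from everything in the Raw Layer."""
    return Counter(event["page"] for event in raw_events)


class SpeedLayer:
    """Speed path: update a low-latency view as each event arrives."""

    def __init__(self):
        self.raw_log = []      # stands in for persisting to the Raw Layer
        self.view = Counter()  # stands in for a store in the Optimised Layer

    def on_event(self, event):
        self.raw_log.append(event)      # still persist the raw event
        self.view[event["page"]] += 1   # transform and deliver immediately


def query(batch, speed, page):
    """Serve a result by merging the batch view with the speed view."""
    return batch[page] + speed[page]


if __name__ == "__main__":
    history = [{"page": "home"}, {"page": "home"}, {"page": "cart"}]
    batch = batch_view(history)
    speed = SpeedLayer()
    speed.on_event({"page": "home"})  # arrives after the last batch run
    print(query(batch, speed.view, "home"))
```

In this model the speed view only covers events that arrived after the last batch run; once a new batch run has absorbed them from the Raw Layer, the speed view would be reset.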

11 This is a Complex, Multi-Tenant Architecture
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers)
Critical Enablers
- A broad and open ecosystem
- Security and Governance
  - Authentication
  - Authorisation
  - Auditing
  - Lineage and Metadata
  - Encryption
- Resource Management
- Chargeback
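
To make authorisation and auditing across layers concrete, here is a small sketch of a per-layer access policy with an audit trail. Only the Raw Layer rule (writeable by ingestion processes, readable by transformation processes) comes from the slides; the remaining policy entries, the role names and the `check` function are assumptions for illustration and do not represent the API of any Hadoop security component.

```python
# Per-layer policy: which roles may read or write. Only the "raw" entry
# reflects the slides; the others are assumed for the sake of the example.
POLICY = {
    "raw": {"read": {"transformation"}, "write": {"ingestion"}},
    "discovery": {"read": {"analyst"}, "write": {"analyst"}},
    "shared": {"read": {"analyst", "business_user"}, "write": {"transformation"}},
    "optimised": {"read": {"business_user", "application"}, "write": {"transformation"}},
}


def check(role, action, layer, audit_log):
    """Decide whether role may perform action on layer, and record the decision."""
    allowed = role in POLICY.get(layer, {}).get(action, set())
    audit_log.append((role, action, layer, allowed))  # every decision is audited
    return allowed


if __name__ == "__main__":
    log = []
    print(check("ingestion", "write", "raw", log))  # ingestion may land raw data
    print(check("analyst", "write", "raw", log))    # analysts may not alter raw data
    print(log)
```

Denied requests are logged as well as granted ones, so the audit trail shows attempted access, not just successful access.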

12 Considerations
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers)
- This is not prescriptive
  - There could be more or fewer layers, depending on use cases
- This is a logical architecture
  - There may be multiple physical clusters due to non-functional requirements, e.g.
    - Compliance and security, e.g. some data can only be kept in the EU
    - If there are tight SLAs, some engines perform better on dedicated clusters, e.g. HBase, Kafka

13 Conclusion
Move from Big Data Spaghetti
(Diagram labels: Data Sources, EDWs, Marts, Search Servers, Document Stores, Storage, Data Consumers)

14 Conclusion
Move from Big Data Spaghetti …to Big Data Lasagne!
(Diagram labels: Data Sources, Raw Layer, Discovery Layer, Shared Layer, Optimised Layer, Speed Layer, Data Consumers)

15 Visit us at Booth #101
- Book signings
- Theater sessions
- Technical demos
- Giveaways
Highlights:
- Apache Kafka is now fully supported with Cloudera
- Learn why Cloudera is the leader for security & governance in Hadoop

Thank you