An Information Architecture for Hadoop
Mark Samson – Systems Engineer, Cloudera
Background
© Cloudera, Inc. All rights reserved.
- The trend is for organisations to build business-wide Hadoop implementations: an Enterprise Data Hub / Data Lake / Hadoop as a Service
- Many data sources, many lines of business, many use cases
- Many engines and tools available to process and analyse data
- Need to meet SLAs for data consumers
- How do I organise my information architecture within Hadoop to cope with this variety? We need a logical information architecture for Hadoop.
What Are the Requirements?
- Ingest data in its full fidelity, in as close to its original, raw form as possible
- Provide a data discovery and exploration facility for analysts and data scientists
- Bring together and link multiple data sets
- Serve data efficiently to business users and applications, meeting SLAs
Where Does an Enterprise Data Hub Fit?
- Data sources can be: databases / data warehouses, file sources, machines and sensors (IoT), the internet (social media etc.), mobile
- Data consumers can be: analysts, data scientists, business users (reports), applications
- The Enterprise Data Hub sits in between (but it is not the only thing in between)
How Does Data Arrive?
Data can arrive in any form, e.g.:
- Event data: log files; streaming, e.g. via MQ or Kafka
- Relational tables with any data model: star schema, 3NF
- Files in any format: text, JSON, XML, Avro, ...
Raw Layer
- Principle: ingest data raw, in full fidelity – as close as possible to the form in which it arrives
- Data organised in HDFS by data source, e.g. /landing/
- Writeable by ingestion processes, e.g. Flume, Sqoop
- Readable by transformation processes, e.g. Hive, Pig, MapReduce, Spark
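The "organised by data source" idea can be sketched as a path-building convention. This is a hypothetical helper: the source/dataset/date layout under the landing root is an illustrative assumption, not a Cloudera standard.

```python
from datetime import date

def raw_layer_path(source: str, dataset: str, ingest_date: date) -> str:
    """Build an HDFS path for raw, full-fidelity data, organised by source.

    Assumed layout: /landing/<source>/<dataset>/<yyyy-mm-dd>/
    so each ingestion run (Flume, Sqoop, ...) writes to its own dated directory.
    """
    return f"/landing/{source}/{dataset}/{ingest_date.isoformat()}/"

# A hypothetical Sqoop or Flume job loading CRM orders on 1 June 2015
# would target:
print(raw_layer_path("crm", "orders", date(2015, 6, 1)))
# /landing/crm/orders/2015-06-01/
```

Keeping the layout purely source-driven (no consumer-specific structure) is what lets any downstream engine reprocess the raw data later.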
Discovery Layer
- Used for discovery and exploration by small teams of analysts and data scientists
- Users or teams are given their own "sandpits" (at a cost?)
- A mix of views and materialised data
- Some data sets are "enriched", e.g. by joining in reference data
- Tools: Impala, Solr, Spark
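Enrichment by joining reference data can be sketched in plain Python. The country-to-region lookup and the event fields are invented for illustration; in practice this would be an Impala or Spark join in a team's sandpit.

```python
# Hypothetical reference data: country code -> sales region.
country_regions = {"DE": "EMEA", "FR": "EMEA", "US": "AMER"}

# Hypothetical events as they might sit in the Raw Layer.
events = [
    {"user": "u1", "country": "DE", "amount": 10.0},
    {"user": "u2", "country": "US", "amount": 25.0},
]

def enrich(events, reference):
    """Return enriched copies of the events, joined to the reference data.

    Unmatched keys are kept visible as 'UNKNOWN' rather than dropped,
    so data-quality problems surface during exploration.
    """
    return [
        {**e, "region": reference.get(e["country"], "UNKNOWN")}
        for e in events
    ]

for row in enrich(events, country_regions):
    print(row)
```

The originals are left untouched (each output row is a copy), mirroring the principle that the Raw Layer itself is never modified.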
Shared Layer
- Available across lines of business (subject to security constraints)
- Incentives for analyst / data science teams to move their data and use cases into this layer
- Data from multiple sources is joined together
- Tools: Impala, Hive, Pig, Spark
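Bringing data from multiple sources together usually means conforming different source schemas to one shared schema. A minimal sketch, with made-up field names (the CRM calls the key `cust_id`, the web logs call it `uid`):

```python
def conform(record, mapping, source):
    """Map a source-specific record onto the shared schema.

    mapping: shared column name -> source column name.
    The originating source is tagged on each row for lineage.
    """
    row = {shared: record[src] for shared, src in mapping.items()}
    row["source"] = source
    return row

# Hypothetical per-source column mappings.
crm_map = {"customer_id": "cust_id", "city": "town"}
web_map = {"customer_id": "uid", "city": "city"}

shared = [
    conform({"cust_id": "c1", "town": "Leeds"}, crm_map, "crm"),
    conform({"uid": "c2", "city": "York"}, web_map, "weblogs"),
]
print(shared)
```

In the Shared Layer this conforming step would typically be a Hive or Spark job whose output table every line of business can query.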
Optimised Layer
- Build this when you need to operationalise a use case
- Organised by data consumer and use case, not by source
- Data modelled for optimised performance; often denormalised
- Uses optimised storage formats, e.g. Parquet with partitioning, or HBase
- Accessed by low-latency query engines, e.g. HBase, Impala, Solr
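"Parquet with partitioning" relies on the Hive/Impala convention of encoding partition keys in directory names, so the query engine can skip directories that cannot match a predicate. A sketch of that path convention (the root and table names are invented):

```python
def partitioned_path(root: str, table: str, **partitions) -> str:
    """Build a Hive/Impala-style partitioned directory path.

    Each partition key becomes a <key>=<value> path segment, e.g.
    <root>/<table>/year=2015/month=06/ - a query filtering on
    year and month then only reads the matching directories.
    """
    segments = "/".join(f"{k}={v}" for k, v in partitions.items())
    return f"{root}/{table}/{segments}/"

print(partitioned_path("/optimised/sales", "orders_parquet", year=2015, month="06"))
# /optimised/sales/orders_parquet/year=2015/month=06/
```

Note the layout is chosen per use case (here, time-based queries), which is why this layer is organised by consumer rather than by source.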
What About Real Time? The Speed Layer
To operationalise use cases in real time:
- Use low-latency components, e.g. Kafka, Flume, Spark Streaming
- Consume straight from the sources, transform/analyse the data, and deliver it directly to the Optimised Layer for low-latency query – or directly to the consumer
- Generally still persist the raw data in the Raw Layer
- This follows the Lambda Architecture
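The Lambda Architecture's defining move is the query-time merge of a complete-but-stale batch view with a recent-but-partial speed view. A minimal sketch, with invented page-view counts:

```python
def lambda_query(key, batch_view, speed_view):
    """Answer a query by combining the batch view (complete up to the
    last batch run) with the speed view (events streamed in since then)."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)

# Batch view: counts computed over the Raw Layer by the last batch job.
batch_view = {"page_a": 100, "page_b": 40}
# Speed view: counts accumulated in real time (e.g. Kafka + Spark Streaming).
speed_view = {"page_a": 3}

print(lambda_query("page_a", batch_view, speed_view))  # 103
print(lambda_query("page_b", batch_view, speed_view))  # 40
```

Because the raw events are still persisted in the Raw Layer, the batch view can always be recomputed from scratch, and the speed view is discarded once the batch catches up.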
This is a Complex, Multi-Tenant Architecture – Critical Enablers
- A broad and open ecosystem
- Security and governance: authentication, authorisation, auditing, lineage and metadata, encryption
- Resource management and chargeback
Considerations
- This is not prescriptive: there could be more or fewer layers, depending on use cases
- This is a logical architecture: there may be multiple physical clusters due to non-functional requirements
- e.g. compliance and security – some data can only be kept in the EU
- e.g. if there are tight SLAs, some engines perform better on dedicated clusters, such as HBase or Kafka
Conclusion
Move from Big Data Spaghetti: point-to-point connections between data sources and data consumers via EDWs, marts, search servers, document stores and storage...
Move from Big Data Spaghetti ...to Big Data Lasagne: data flowing from sources through the Raw, Discovery, Shared and Optimised Layers (plus the Speed Layer) to the data consumers!
Visit Us at Booth #101
- Book signings, theater sessions, technical demos, giveaways
- Highlights: Apache Kafka is now fully supported with Cloudera; learn why Cloudera is the leader for security & governance in Hadoop
Thank you