Slide 1: Enabling Data Management in a Big Data World
Craig Soules, Garth Goodson, Tanya Shastri
Slide 2: The problem with data management
Hadoop is a collection of tools:
– Not tightly integrated
– Everyone's stack looks a little different
– Everything falls back to files
Slide 3: Agenda
– Traditional data management
– Hadoop's eco-system
– Natero's approach to data management
Slide 4: What is data management?
What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?
Slide 5: Traditional data management
External data sources → Extract/Transform/Load (ETL) → data warehouse (integrated storage + data processing) → users, via SQL.
Slide 6: Key lessons of traditional systems
Data requires the right abstraction:
– Schemas have value
– Tables are easy to reason about; they are referenced by name, not location
Narrow interface:
– SQL defines the data sources and the processing, but not where and how the data is kept
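The "name, not location" lesson is easy to demonstrate with any SQL engine. A minimal sketch using Python's built-in sqlite3 (the table name and columns here are invented for illustration): consumers address the logical table name, and the engine alone decides where and how the rows are physically stored.

```python
import sqlite3

# The table is addressed by its name, never by a file path or byte layout;
# the engine owns the physical storage decisions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.execute("INSERT INTO events VALUES ('alice', 'login')")

# Consumers only ever see the logical name "events".
rows = conn.execute("SELECT user, action FROM events").fetchall()
print(rows)  # [('alice', 'login')]
```

Because the interface is this narrow, the system can relocate, reformat, or index the underlying data without breaking any consumer, which is exactly the leverage file-based access loses.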
Slide 7: Hadoop eco-system
– Storage: HDFS storage layer, HBase
– Ingest from external data sources: Sqoop + Flume
– Processing framework: Map-Reduce, with Pig, HiveQL, and Mahout on top
– Metadata: Hive Metastore (HCatalog)
– Workflow: Oozie
– Auditing: Cloudera Navigator
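To make the processing-framework layer concrete, here is a minimal sketch of the map-reduce programming model in plain Python — not Hadoop itself, just the two-phase shape (map emits key/value pairs; reduce groups by key and aggregates) that every tool in this layer ultimately compiles down to:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs for each input record.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group values by key, then aggregate each group.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "big deal"]))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

Pig and HiveQL generate pipelines of exactly such phases, which is why controlling the compilation step (as later slides propose) controls all the processing above it.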
Slide 8: Key challenges
More varied data sources, with many more access / retention requirements.
Slide 9: Key challenges
Data accessed through multiple entry points.
Slide 10: Key challenges
Lots of new consumers of the data.
Slide 11: Key challenges
One access control mechanism: files.
Slide 12: Steps to data management
1. Provide access at the right level
2. Limit the processing interfaces
3. Schemas and provenance provide control
4. Enforce policy
Slide 13: Case study: Natero
Cloud-based analytics service:
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
Single shared Hadoop eco-system:
– Need customer-level isolation and user-level access controls
Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn't have to think about policy
Source-driven policy management
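"Source-driven policy management" suggests that policy is attached once, at the data source, and then follows the data automatically. A minimal sketch of that idea under assumed semantics (this is not Natero's actual implementation; dataset names and policy tags are invented): each derived dataset inherits the union of its sources' policies, so users never attach policy by hand.

```python
def derive(name, sources):
    # A derived dataset inherits the union of its sources' policy tags.
    policy = set().union(*(s["policy"] for s in sources))
    return {"name": name, "policy": policy}

# Policy is declared once, on the sources.
crm = {"name": "crm_export", "policy": {"pii"}}
logs = {"name": "click_logs", "policy": {"retain-90d"}}

# Anything computed from both carries both obligations.
joined = derive("sessions_by_user", [crm, logs])
print(sorted(joined["policy"]))  # ['pii', 'retain-90d']
```

This is why provenance tracking (step 3) is a prerequisite for enforcement (step 4): without knowing a dataset's sources, the system cannot compute which policies apply to it.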
Slide 14: Natero application stack
The Hadoop eco-system stack (external data sources, Sqoop + Flume, HDFS storage layer, HBase, Map-Reduce with Pig, HiveQL, and Mahout), extended with four components matching the four steps:
– Access-aware workflow compiler
– Schema extraction
– Policy and metadata manager
– Provenance-aware scheduler
Slide 15: Natero execution example
Natero UI → job compiler → scheduler, with the metadata manager consulted for the job's sources along the way.
Result: fine-grain access control, auditing, enforceable policy — and easy for users.
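The execution flow above implies that access checks happen when a job is compiled, before anything is scheduled. A sketch of that compile-time gate under assumed semantics (the ACL table, group names, and function are hypothetical, not Natero's API): the compiler asks the metadata manager whether the requesting user may read every source, and refuses to produce a plan otherwise.

```python
# Hypothetical metadata-manager ACL: source name -> groups allowed to read it.
ACL = {"revenue": {"finance"}, "clicks": {"finance", "analyst"}}

def compile_job(user_groups, sources):
    # Deny compilation if any source is unreadable by the user's groups.
    denied = [s for s in sources if not ACL.get(s, set()) & user_groups]
    if denied:
        raise PermissionError(f"access denied to sources: {denied}")
    return {"plan": sources}  # stand-in for a compiled execution plan

print(compile_job({"analyst"}, ["clicks"]))  # {'plan': ['clicks']}
# compile_job({"analyst"}, ["revenue"]) would raise PermissionError.
```

Gating at the compiler is what makes the guarantees auditable: every job that reaches the scheduler has already passed policy, so no raw Map-Reduce path can bypass the check.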
Slide 16: The right level of abstraction
Our abstraction comes with trade-offs:
– More control and compliance
– No more raw Map-Reduce (though it is possible to integrate with Pig/Hive)
What's the right level of abstraction for you? It depends on the kinds of execution you need to support.
Slide 17: Hadoop projects to watch
– HCatalog: data discovery / schema management / access
– Falcon: lifecycle management / workflow execution
– Knox: centralized access control
– Navigator: auditing / access management
Slide 18: Lessons learned
– If you want control over your data, you also need control over data processing
– File-based access control is not enough
– Metadata is crucial
– Users aren't motivated by policy: policy shouldn't get in the way of use, but you might get IT to reason about the sources