Slide 1: Enabling Data Management in a Big Data World
Craig Soules, Garth Goodson, Tanya Shastri
Slide 2: The problem with data management
Hadoop is a collection of tools:
– Not tightly integrated
– Everyone's stack looks a little different
– Everything falls back to files
Slide 3: Agenda
– Traditional data management
– Hadoop's eco-system
– Natero's approach to data management
Slide 4: What is data management?
What do you have?
– What data sets exist?
– Where are they stored?
– What properties do they have?
Are you doing the right thing with it?
– Who can access data?
– Who has accessed data?
– What did they do with it?
– What rules apply to this data?
Slide 5: Traditional data management
External data sources → Extract/Transform/Load (ETL) → data warehouse (integrated storage + data processing) → users, via SQL.
Slide 6: Key lessons of traditional systems
Data requires the right abstraction:
– Schemas have value
– Tables are easy to reason about; they are referenced by name, not location
Narrow interface:
– SQL defines the data sources and the processing, but not where and how the data is kept
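The "name, not location" lesson is easy to demonstrate with any SQL engine. A minimal sketch using Python's built-in sqlite3 (the table name and columns here are invented for illustration): consumers address the logical table name, and the engine alone decides where and how the rows are physically stored.

```python
import sqlite3

# The table is addressed by its name, never by a file path or byte layout;
# the engine owns the physical storage decisions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user TEXT, action TEXT)")
conn.execute("INSERT INTO events VALUES ('alice', 'login')")

# Consumers only ever see the logical name "events".
rows = conn.execute("SELECT user, action FROM events").fetchall()
print(rows)  # [('alice', 'login')]
```

Because the interface is this narrow, the system can relocate, reformat, or index the underlying data without breaking any consumer, which is exactly the leverage file-based access loses.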
Slide 7: Hadoop eco-system
– Storage: HDFS storage layer, HBase
– Ingest from external data sources: Sqoop + Flume
– Processing framework: Map-Reduce, with Pig, HiveQL, and Mahout on top
– Metadata: Hive Metastore (HCatalog)
– Workflow: Oozie
– Auditing: Cloudera Navigator
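To make the processing-framework layer concrete, here is a minimal sketch of the map-reduce programming model in plain Python — not Hadoop itself, just the two-phase shape (map emits key/value pairs; reduce groups by key and aggregates) that every tool in this layer ultimately compiles down to:

```python
from collections import defaultdict

def map_phase(records):
    # Map: emit (key, value) pairs for each input record.
    for line in records:
        for word in line.split():
            yield word, 1

def reduce_phase(pairs):
    # Shuffle + reduce: group values by key, then aggregate each group.
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "big deal"]))
print(counts)  # {'big': 2, 'data': 1, 'deal': 1}
```

Pig and HiveQL generate pipelines of exactly such phases, which is why controlling the compilation step (as later slides propose) controls all the processing above it.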
Slide 8: Key challenges
More varied data sources, with many more access / retention requirements.
Slide 9: Key challenges
Data accessed through multiple entry points.
Slide 10: Key challenges
Lots of new consumers of the data.
Slide 11: Key challenges
One access control mechanism: files.
Slide 12: Steps to data management
1. Provide access at the right level
2. Limit the processing interfaces
3. Schemas and provenance provide control
4. Enforce policy
Slide 13: Case study: Natero
Cloud-based analytics service:
– Enable business users to take advantage of big data
– UI-driven workflow creation and automation
Single shared Hadoop eco-system:
– Need customer-level isolation and user-level access controls
Goals:
– Provide the appropriate level of abstraction for our users
– Finer granularity of access control
– Enable policy enforcement
– Users shouldn't have to think about policy
Source-driven policy management
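"Source-driven policy management" suggests that policy is attached once, at the data source, and then follows the data automatically. A minimal sketch of that idea under assumed semantics (this is not Natero's actual implementation; dataset names and policy tags are invented): each derived dataset inherits the union of its sources' policies, so users never attach policy by hand.

```python
def derive(name, sources):
    # A derived dataset inherits the union of its sources' policy tags.
    policy = set().union(*(s["policy"] for s in sources))
    return {"name": name, "policy": policy}

# Policy is declared once, on the sources.
crm = {"name": "crm_export", "policy": {"pii"}}
logs = {"name": "click_logs", "policy": {"retain-90d"}}

# Anything computed from both carries both obligations.
joined = derive("sessions_by_user", [crm, logs])
print(sorted(joined["policy"]))  # ['pii', 'retain-90d']
```

This is why provenance tracking (step 3) is a prerequisite for enforcement (step 4): without knowing a dataset's sources, the system cannot compute which policies apply to it.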
Slide 14: Natero application stack
The Hadoop eco-system stack (external data sources, Sqoop + Flume, HDFS storage layer, HBase, Map-Reduce with Pig, HiveQL, and Mahout), extended with four components matching the four steps:
– Access-aware workflow compiler
– Schema extraction
– Policy and metadata manager
– Provenance-aware scheduler
Slide 15: Natero execution example
Natero UI → job compiler → scheduler, with the metadata manager consulted for the job's sources along the way.
Result: fine-grain access control, auditing, enforceable policy — and easy for users.
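The execution flow above implies that access checks happen when a job is compiled, before anything is scheduled. A sketch of that compile-time gate under assumed semantics (the ACL table, group names, and function are hypothetical, not Natero's API): the compiler asks the metadata manager whether the requesting user may read every source, and refuses to produce a plan otherwise.

```python
# Hypothetical metadata-manager ACL: source name -> groups allowed to read it.
ACL = {"revenue": {"finance"}, "clicks": {"finance", "analyst"}}

def compile_job(user_groups, sources):
    # Deny compilation if any source is unreadable by the user's groups.
    denied = [s for s in sources if not ACL.get(s, set()) & user_groups]
    if denied:
        raise PermissionError(f"access denied to sources: {denied}")
    return {"plan": sources}  # stand-in for a compiled execution plan

print(compile_job({"analyst"}, ["clicks"]))  # {'plan': ['clicks']}
# compile_job({"analyst"}, ["revenue"]) would raise PermissionError.
```

Gating at the compiler is what makes the guarantees auditable: every job that reaches the scheduler has already passed policy, so no raw Map-Reduce path can bypass the check.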
Slide 16: The right level of abstraction
Our abstraction comes with trade-offs:
– More control and compliance
– No more raw Map-Reduce (though it is possible to integrate with Pig/Hive)
What's the right level of abstraction for you? It depends on the kinds of execution you need to support.
Slide 17: Hadoop projects to watch
– HCatalog: data discovery / schema management / access
– Falcon: lifecycle management / workflow execution
– Knox: centralized access control
– Navigator: auditing / access management
Slide 18: Lessons learned
– If you want control over your data, you also need control over data processing
– File-based access control is not enough
– Metadata is crucial
– Users aren't motivated by policy: policy shouldn't get in the way of use, but you might get IT to reason about the sources