Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Nong Li | Lenni Kuff | Stephen Romanoff
© Cloudera, Inc. All rights reserved.

Presentation transcript:

1 Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Nong Li | Lenni Kuff | Stephen Romanoff

2 Introducing RecordService
Nong Li | Lenni Kuff | Stephen Romanoff

3 Introducing RecordService
RecordService is a distributed, scalable data access service for unified authorization in Hadoop.

4 Motivation
- As the Hadoop ecosystem expands, new components continue to be added.
  - This speaks to the overall flexibility of Hadoop.
  - This is good: more functionality, more workloads, more use cases.
- As use cases for Hadoop mature, user requirements and expectations increase:
  - Security
  - Performance
  - Compatibility
- The flexibility of Hadoop has come at the cost of increased complexity.

5 [Diagram: Storage, Compute]

6 [Diagram: Storage, Compute, …]

7 Example: Security
- Challenge: provide unified fine-grained security across compute frameworks.
  - Integrating a consistent security layer into every component is not scalable.
  - Securing data at the file level precludes fine-grained access control (column/row).
  - File ACLs are not enough: a user can view all or nothing.
  - Currently, files must be split and data duplicated, at large operational cost.
- Solution: add a level of abstraction, a secure service to access datasets in "record" format.
  - Fine-grained constraints can now be applied on a projection of the dataset.
  - The same access control policy can be applied uniformly across compute frameworks, decoupled from the underlying storage layer.

8 Example: Security
- How to provide unified access control across compute frameworks?
  - Securing data at the file level precludes fine-grained access control (column/row).
  - File ACLs are not enough: a user can view all or nothing.
  - Currently, files must be split and data duplicated, at large operational cost.
- Solution: add a layer of abstraction, a secure service to access datasets in "record" format.
  - Fine-grained constraints can now be applied on a projection of the dataset.
  - The same access control policy can be applied uniformly across compute frameworks, decoupled from the underlying storage layer.

9 Introducing RecordService

10 Architecture Summary
- Simplifies
  - Provides a higher-level, logical abstraction for data (i.e., tables or views).
  - Returns schemed objects (instead of paths and bytes). Applications need not worry about storage APIs and file formats.
  - HCatalog? A similar concept, but RecordService is secure and performant. The plan is to support HCatalog as a data model on RecordService.
- Secures
  - Secure service that does not execute arbitrary user code.
  - Central location for all authorization checks, using Sentry metadata.
- Accelerates
  - Unified data access path allows platform-wide performance improvements.

11 Transition – Nong starts here?

12 Architecture

13 Architecture
- Runs as a distributed service: Planner Servers and Worker Servers.
- Servers do not store any state.
  - Easy HA and fault tolerance.
- Planner Servers are responsible for request planning.
  - Retrieve and combine metadata (NN, HMS, Sentry).
  - Split generation: create tasks for workers.
  - Perform authorization.
- Worker Servers read from storage and construct records.
  - IO, file parsing, predicate evaluation.
  - Run as the "source" for a DAG computation.

14 Architecture – Server APIs
- Planner and Worker services expose Thrift APIs: PlanRequest(), Exec(), Fetch().
- PlanRequest()
  - Accepts SQL to specify the request: supports SELECT and PROJECT.
  - Gives access to tables and views stored in HMS.
  - Does not run operators that require data exchange; "map only".
  - Generates a list of tasks that contain the request, each with locality.
- Exec()/Fetch()
  - Return records in a canonical, optimized columnar format.

15 Architecture – Fault tolerance
- Cluster state is persisted in ZK (ZooKeeper): membership, delegation tokens, secret keys.
- Servers do not communicate with each other directly, which helps scalability.
- Planner services
  - Expected to run a few (e.g., 3) for HA.
  - Fault tolerance is handled by clients getting a list of planners and failing over.
  - Plan requests are short.
- Worker services
  - Expected to run on each node in the cluster with data.
  - Fault tolerance is handled by the framework (e.g., MR) rescheduling the task.
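The client-side failover described on this slide can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the `PlannerUnavailable` exception, the host names, and the shape of a task are all hypothetical and are not taken from the RecordService client library.

```python
from typing import Callable, Sequence


class PlannerUnavailable(Exception):
    """Raised by the (hypothetical) transport when a planner cannot be reached."""


def plan_with_failover(
    planners: Sequence[str],
    send_plan_request: Callable[[str, str], list],
    sql: str,
) -> list:
    """Try each planner in order and return the first successful plan.

    Plan requests are short and planners hold no state, so simply
    retrying the same request against the next planner is safe.
    """
    last_error = None
    for address in planners:
        try:
            return send_plan_request(address, sql)
        except PlannerUnavailable as err:
            last_error = err  # remember the failure and try the next planner
    raise PlannerUnavailable(
        f"no planner reachable out of {len(planners)}"
    ) from last_error


def fake_send(address: str, sql: str) -> list:
    """Stand-in transport for the demo: the first planner is down."""
    if address == "planner-1":
        raise PlannerUnavailable(address)
    # One task, carrying the request and a locality hint (hypothetical shape).
    return [{"planner": address, "request": sql, "hosts": ["worker-a"]}]


PLANNERS = ["planner-1", "planner-2", "planner-3"]  # hypothetical host names
tasks = plan_with_failover(PLANNERS, fake_send, "SELECT name FROM v")
```

Because the servers keep no per-request state, the sketch needs no session recovery: the only thing the client carries between attempts is the list of planners.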

16 Architecture – Security
- Authentication uses Kerberos and delegation tokens.
- Planner authorizes the request using metadata in Sentry.
  - Column-level ACLs.
  - Row-level ACLs: create a view with a predicate.
  - Masking: create a view with the masking function in the select list.
- Tasks generated by the planner are signed with a shared key.
- Worker runs the generated tasks.
  - Does not authorize; relies on the signed tasks.
  - Runs as a user with full access to data; does not run user code.
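The split between an authorizing planner and a non-authorizing worker relies on the worker being able to tell that a task really came from a planner. A minimal sketch of that idea follows. The slides do not say which signing scheme RecordService uses; HMAC-SHA256 over a JSON payload, the function names, and the task fields are assumptions made purely for illustration.

```python
import hashlib
import hmac
import json


def sign_task(task: dict, shared_key: bytes) -> dict:
    """Planner side: serialize the already-authorized task and sign it."""
    payload = json.dumps(task, sort_keys=True).encode("utf-8")
    signature = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    return {"payload": payload.decode("utf-8"), "signature": signature}


def verify_and_load(signed: dict, shared_key: bytes) -> dict:
    """Worker side: run nothing unless the signature matches the payload."""
    payload = signed["payload"].encode("utf-8")
    expected = hmac.new(shared_key, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through comparison timing.
    if not hmac.compare_digest(expected, signed["signature"]):
        raise PermissionError("task signature mismatch; refusing to run")
    return json.loads(payload)


SHARED_KEY = b"demo-key-shared-by-planner-and-worker"  # hypothetical key

# The planner has checked that the user may read columns ccn and name of view v.
signed = sign_task({"table": "v", "columns": ["ccn", "name"], "split": 0}, SHARED_KEY)
task = verify_and_load(signed, SHARED_KEY)

# A client that rewrites the task to point at the raw table is rejected.
tampered = dict(signed, payload=signed["payload"].replace('"v"', '"data"'))
try:
    verify_and_load(tampered, SHARED_KEY)
    tamper_detected = False
except PermissionError:
    tamper_detected = True
```

The design choice this illustrates is that authorization happens once, at planning time; the worker only needs the shared key, not access to the authorization metadata.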

17 Architecture – Security example
    CREATE VIEW v AS
      SELECT mask(credit_card_number) AS ccn, name, balance, region
      FROM data
      WHERE region = "Europe"
1. Restrict access to the data set: disable access to the 'data' table and the underlying files in HDFS.
2. Give access by creating the view, v.
3. Set column-level permissions on v per user if necessary.
- Write path (ingest) is unchanged. The job is expected to run as a privileged user.
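To make the effect of the view concrete, the sketch below applies the same three restrictions (row predicate, masked column, per-user column list) to a couple of in-memory rows. It is not RecordService code: the sample rows, the masking rule (keep the last four digits), and the helper names are hypothetical, since the slide does not define what `mask` does.

```python
def mask(card_number: str) -> str:
    """Hypothetical masking rule: hide everything except the last four digits."""
    return "*" * (len(card_number) - 4) + card_number[-4:]


# Stand-in for the restricted 'data' table (made-up sample rows).
DATA = [
    {"credit_card_number": "4111111111111111", "name": "Ana",
     "balance": 120.0, "region": "Europe"},
    {"credit_card_number": "5500000000000004", "name": "Bo",
     "balance": 75.5, "region": "Asia"},
]


def view_v(rows):
    """Equivalent of view v: row-level filter plus masking in the select list."""
    for row in rows:
        if row["region"] == "Europe":  # WHERE region = "Europe"
            yield {
                "ccn": mask(row["credit_card_number"]),
                "name": row["name"],
                "balance": row["balance"],
                "region": row["region"],
            }


def read_columns(rows, allowed):
    """Column-level permission: return only the columns granted to the user."""
    return [{column: row[column] for column in allowed} for row in rows]


# A user granted only ccn and name on v never sees balances, raw card
# numbers, or rows from other regions.
visible = read_columns(view_v(DATA), ["ccn", "name"])
```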

18 Client APIs – Integration with ecosystem
- Similar APIs designed to integrate with MapReduce and Spark.
- Client APIs make things simpler. Clients don't need to:
  - Interact with HMS.
  - Care about the underlying storage format: the worker always returns records in a canonical format.
  - Care about storage engine details (e.g. S3).

19 Client Integration APIs
- Drop-in replacements for common existing InputFormats: Text, Avro.
- Can be used with Spark as well.
- SparkSQL: integration with the Data Sources API.
  - Predicate pushdown, projection.
- Migration should be easy.

20 MR Example
    // Before: read Avro files directly from an HDFS path.
    //FileInputFormat.setInputPaths(job, new Path(args[0]));
    //job.setInputFormatClass(AvroKeyInputFormat.class);
    // After: name a table and read it through the RecordService input format.
    RecordServiceConfig.setInputTable(configuration, null, args[0]);
    job.setInputFormatClass(
        com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);

21 Spark Example
    // Comment out one or the other
    val file = sc.recordServiceTextFile(path)
    //val file = sc.textFile(path)

22 Spark SQL Example
    ctx.sql(s"""
      |CREATE TEMPORARY TABLE $tbl
      |USING com.cloudera.recordservice.spark.DefaultSource
      |OPTIONS (
      |  RecordServiceTable '$db.$tbl',
      |  RecordServiceTableSize '$size'
      |)
    """.stripMargin)

23 Performance
- Shares some core components with Impala.
  - IO management, optimized C++ code, runtime code generation, use of low-level storage APIs.
  - Highly efficient implementation of the scan functionality.
- Optimized columnar on-wire format.
  - Inspired by Apache Parquet.
- Accelerates performance for many workloads.

24 Terasort
- Approximately the worst-case scenario: a minimal schema with a single STRING column.
- Custom RecordServiceTeraInputFormat (similar to TeraInputFormat).
- 78-node cluster (12 cores / 24 hyper-threads, 12 disks).
- Ran at 1 billion, 50 billion, and 1 trillion (~100 TB) scales.
- See the GitHub repo for more details and runnable examples.

25 TeraChecksum

26 Spark SQL
- Represents a more expected use case.
- Data is fully schemed.
- TPC-DS at the 500 GB scale factor, on Parquet.
- 5-node cluster.

27 Spark SQL
- ~15% improvement in query times; queries are not scan bound.

28 Spark SQL

29 RecordService at CapOne
- Early development partner.
- Let's hear their use case.

Implementing Record Service in Capital One's Data Lake
Stephen Romanoff, Director, Data Management

31 Capital One at a glance

32 We are building the technology foundation to ensure our analytics leadership
Key Objectives
- Build an analytics architecture centered on a Hadoop-based Enterprise Data Hub.
- Provide state-of-the-art analytical tools and unconstrained data storage and processing.
- Empower our associates to dream and disrupt.
Delivery Principles
- Fast prototyping, scaled agile delivery.
- Smaller, cross-functional teams.
- Collaboration, leveraging the power of open source.

33 Data duplication was the only way to meet fine-grained access control needs
- Data is co-mingled: multiple business lines, affiliates, NPI, credit.
- Users need access for both SQL and non-SQL processing.
- We need to provide fine-grained controls for all types of access.
- Duplicating data was the only option.
[Diagram: HDFS, original file duplicated with horizontal and vertical filters. Labels: Source File; LOB A; LOB B; Affiliate split; NPI File; Non-NPI; Other splits; SQL Access; Non SQL Access; Map Reduce, Spark, Pig, Impala, Hive.]

34 Sentry + Record Service provides us fine-grained access controls across Hadoop compute frameworks
- No data duplication.
- Fine-grained access controls for SQL, Pig, MR, and Spark processing.
- Optimized IO scanners provide high performance.
- Abstraction from physical storage of data.
- Existing applications migrated with minor code changes.
[Diagram labels: HDFS; Pig, MR, Spark; Hive Meta Store; Sentry + Record Service; Table, View 1, View 2, View n.]

35 State of the project
- Available in beta already.
  - Integration with Spark and MR. Pig soon (via HCatalog).
  - Looking into other compute abstractions, e.g. Crunch.
  - More InputFormat support.
- Need your help!
  - We'll continually refresh the beta, in particular the client libraries.
- Apache 2.0 licensed.
  - Intent to donate to the Apache Software Foundation.

36 Conclusion
- RecordService provides a schemed data access service for Hadoop.
- Logical data access instead of physical.
  - Much more powerful abstraction.
  - Demonstrated security enforcement and improved performance.
  - Simpler: clients don't need to worry about low-level details such as storage APIs and file formats.
- Opens the door for future improvements.

37 Contributing!
- Mailing list:
- Discussion forum: http://community.cloudera.com/t5/Beta-Releases/bd-p/Beta
- Contributions:
- Documentation:
- Bug reporting: open a GitHub issue
- Beta download: rvice/0-1-0.html

38 Thank you