What does it mean to virtualize the Hadoop File System?

What does it mean to virtualize the Hadoop File System? Tom Phelan, Chief Architect for BlueData

It is HDFS …

Unless it is not

Outline: There are questions to be answered. Three “What”s: What is HDFS? What does it mean to virtualize HDFS? What are the different methods of virtualization (instances, advantages and considerations)? And a “When”: When to choose HDFS storage virtualization?

What is HDFS? Before we can virtualize it, we need to understand what “it” is.

HDFS: It is a distributed file system built with NameNodes and DataNodes. Source: David Engfer via slideshare.net http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500

HDFS Implementation: hadoop-hdfs.jar; org.apache.hadoop.fs.FileSystem; org.apache.hadoop.hdfs.FileSystem; org.apache.hadoop.hdfs.DistributedFileSystem

HDFS Implementation: It is a stack of Java code used by Hadoop applications to access data. Layers shown in the diagram: YARN; Hadoop Distributed File System API/Java class; Distributed File System Client; protocol at TCP/IP level (“over the wire”).

HDFS Layers of Potential Virtualization: Generic Java classes (Java class org.apache.hadoop.fs.FileSystem). HDFS over-the-wire protocol (Java class org.apache.hadoop.hdfs.DFSClient).
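
The idea of a generic file system class sitting above interchangeable implementations can be sketched as follows. This is a toy model in Python, not Hadoop code: the class names ToyDistributedFS and ToyCustomFS, the "customfs" scheme, and the REGISTRY table are all hypothetical, and the sketch only assumes that an implementation is chosen from the scheme of the URI an application passes in.

```python
from abc import ABC, abstractmethod
from urllib.parse import urlparse


class FileSystem(ABC):
    """Toy stand-in for the generic file system layer (not the Hadoop class)."""

    def __init__(self, authority: str):
        self.authority = authority

    @abstractmethod
    def read(self, path: str) -> bytes:
        """Return the contents stored at path."""


class ToyDistributedFS(FileSystem):
    """Hypothetical implementation that would speak the HDFS wire protocol."""

    def read(self, path: str) -> bytes:
        return f"hdfs[{self.authority}]:{path}".encode()


class ToyCustomFS(FileSystem):
    """Hypothetical implementation backed by some other storage service."""

    def read(self, path: str) -> bytes:
        return f"custom[{self.authority}]:{path}".encode()


# Hypothetical scheme-to-implementation table.
REGISTRY = {"hdfs": ToyDistributedFS, "customfs": ToyCustomFS}


def get_file_system(uri: str) -> FileSystem:
    """Pick an implementation from the URI scheme; the caller never names it."""
    parsed = urlparse(uri)
    try:
        impl = REGISTRY[parsed.scheme]
    except KeyError:
        raise ValueError(f"no file system registered for scheme {parsed.scheme!r}")
    return impl(parsed.netloc)
```

An application written against FileSystem keeps working when the table entry changes, which is what makes this layer a candidate for virtualization.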

HDFS Implementation (diagram; labels: Host, Wire Protocol, NameNode, Resource Manager, DataNode, Node Manager, App, HDFS Impl, DFSClient, Local Disk)

HDFS Virtualization: The virtualization of either the HDFS Implementation or the Protocols.

Outline: There are questions to be answered. Three “What”s: What is HDFS? What does it mean to virtualize HDFS? What are the different methods of virtualization (instances, advantages and considerations)? And a “When”: When to choose HDFS storage virtualization?

HDFS Virtualization Methods: Virtualize the HDFS Implementation. Implement one of the Hadoop Compatible File System (HCFS) protocols: implement an HCFS via the over-the-wire protocol (hdfs.DFSClient), or implement an HCFS via the FileSystem protocol (fs.FileSystem).

Virtualize the HDFS Implementation: This is the only method of HDFS virtualization that requires Hadoop compute virtualization. It is simple: install a Hadoop distro into a cluster of virtualized compute nodes and run the HDFS services in the cluster, storing data on vdisks/VMDKs. Instances of this type of HDFS virtualization include VMware BDE, OpenStack Sahara, Cloudera Director, and Hortonworks Cloudbreak.

Virtualize the HDFS Implementation (diagram; labels: Host, VM, NameNode, Resource Manager, Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk)

Virtualize the HDFS Implementation. Advantages: simple; no new Java code; compute/data locality. Considerations: requires data ingest time; the clusters become stateful.

HDFS Virtualization Methods: Virtualize the HDFS Implementation. Implement a Hadoop Compatible File System (HCFS): implement an HCFS via the over-the-wire protocol (hdfs.DFSClient), or implement an HCFS via the FileSystem protocol (fs.FileSystem).

Implement an HCFS via the over-the-wire protocol: Use the unmodified hadoop-hdfs jar and set fs.defaultFS to hdfs://1.2.3.4:8020/path. Instance: EMC Isilon.
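
To make the role of the default file system URI concrete, here is a small Python sketch of how such a value breaks down and how an unqualified path could be resolved against it. The helper names describe_default_fs and qualify are hypothetical, and the sketch assumes only that unqualified absolute paths inherit the scheme and authority of the default file system; it is not the Hadoop implementation.

```python
from urllib.parse import urlparse


def describe_default_fs(value: str) -> dict:
    """Split a default file system URI into its parts."""
    parsed = urlparse(value)
    return {
        "scheme": parsed.scheme,
        "host": parsed.hostname,
        "port": parsed.port,
        "path": parsed.path,
    }


def qualify(default_fs: str, path: str) -> str:
    """Return path as a full URI, borrowing scheme and authority if missing."""
    if urlparse(path).scheme:
        # Already fully qualified, e.g. it names another file system.
        return path
    if not path.startswith("/"):
        # Relative paths need a working directory, which this sketch omits.
        raise ValueError("only absolute paths are handled in this sketch")
    base = urlparse(default_fs)
    return f"{base.scheme}://{base.netloc}{path}"
```

With the value from the slide, every unqualified path an application opens is sent to whatever service answers at 1.2.3.4:8020, whether that is a NameNode or a storage system that speaks the same protocol.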

Implement an HCFS via the over-the-wire protocol (diagram; labels: Host, Storage Service, Local Disk, NameNode, Resource Manager, Node Manager, App, HDFS Impl, DFSClient, DataNode)

Implement an HCFS via the over-the-wire protocol. Advantages: multi-protocol; no new Java code; enterprise storage services. Considerations: open source / proprietary; no compute/data locality.

HDFS Virtualization Methods: Virtualize the HDFS Implementation. Implement a Hadoop Compatible File System (HCFS): implement an HCFS via the over-the-wire protocol (hdfs.DFSClient), or implement an HCFS via the FileSystem protocol (fs.FileSystem).

Implement an HCFS via the FileSystem Java classes: Write the Java code that implements the class, build a jar file, put the jar file in the YARN services' class path, and edit the core-site.xml file. Instances: S3 and S3a/S3n (org.apache.hadoop.fs.FileSystem; https://github.com/Aloisius/hadoop-s3a), GlusterFS (org.apache.hadoop.fs.FilterFileSystem; https://github.com/gluster/glusterfs-hadoop), Tachyon (org.apache.hadoop.fs.FileSystem; https://github.com/amplab/tachyon), Apache Ignite (org.apache.hadoop.fs.AbstractFileSystem; https://github.com/apache/ignite).
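
The "edit the core-site.xml file" step can be illustrated with a short Python sketch that generates the property entries. It relies on the Hadoop convention that a property named fs.<scheme>.impl maps a URI scheme to an implementing class; the "customfs" scheme and the class name com.example.CustomFileSystem used in the test are hypothetical placeholders, not a real product.

```python
import xml.etree.ElementTree as ET


def core_site_xml(properties: dict) -> str:
    """Render name/value pairs in the layout used by core-site.xml."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    if hasattr(ET, "indent"):
        # Pretty-print where available (Python 3.9 and later).
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


def impl_property(scheme: str, class_name: str) -> dict:
    """Build the single property that binds a scheme to a class."""
    return {f"fs.{scheme}.impl": class_name}
```

Calling core_site_xml(impl_property("customfs", "com.example.CustomFileSystem")) yields a configuration element with one property; the jar containing that class still has to be on the class path for the entry to be usable.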

Implement an HCFS via the FileSystem Java classes (two diagrams; labels: Host, NameNode, Resource Manager, Storage Service, Node Manager, Local Disk, DataNode, App, HDFS Impl, CustomFS Impl, DFSClient)

Implement an HCFS via the FileSystem Java classes. Advantages: open source / proprietary; multiple file access protocols supported. Considerations: these are file systems; new Java code; possibly no compute/data locality; may lag the latest HDFS feature set.

HDFS Virtualization: Is there another way?

HDFS Virtualization: Virtualize the HDFS Implementation. Implement a Hadoop Compatible File System (HCFS): implement an HCFS via the over-the-wire protocol, or implement an HCFS via the FileSystem Java classes. Virtualize the Hadoop Compatible File System Protocol.

Virtualize the Hadoop Compatible File System Protocol: Translate the Hadoop File System calls into native calls to the back-end file systems. Insert an intelligent caching layer. Instance: BlueData EPIC software (org.apache.hadoop.fs.FileSystem).

Virtualize the Hadoop Compatible File System Protocol (diagram; labels: Host, Storage Service, NameNode, Resource Manager, Local Disk, Node Manager, DTAP Service, App, DataNode, DTAP Impl, HDFS Impl, DFSClient)

HDFS mem cache (diagram; labels: Page Cache, DataNode, page, HDFS Implementation, DFSClient): Application is cache aware.

Extend mem cache to any File System or Object storage (diagram; labels: Page Cache, HDFS, GlusterFS, Object Store, page, DTAP Service, DTAP FileSystem Implementation): Application is cache unaware.

Virtualize the Hadoop Compatible File System Protocol. Advantages: not a file system; transparent in-memory cache with write-back and read-ahead; supports multiple protocols; supports compute/data locality. Considerations: new Java code; open source / proprietary; may lag the latest HDFS feature set.
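
The read-ahead and write-back behavior named above can be sketched generically in Python. This is an illustration of the two caching techniques only, not BlueData's implementation: DictBackend and CachingLayer are hypothetical names, blocks are addressed by integer, and the sketch never evicts anything, which a real cache would have to do.

```python
class DictBackend:
    """Hypothetical back-end store: block number -> bytes, with I/O counters."""

    def __init__(self, blocks):
        self.blocks = dict(blocks)
        self.reads = 0
        self.writes = 0

    def read_block(self, n):
        self.reads += 1
        return self.blocks.get(n, b"")

    def write_block(self, n, data):
        self.writes += 1
        self.blocks[n] = data


class CachingLayer:
    """In-memory cache in front of a back end; callers are unaware of it."""

    def __init__(self, backend, read_ahead=2):
        self.backend = backend
        self.read_ahead = read_ahead
        self.cache = {}
        self.dirty = set()

    def read(self, n):
        if n not in self.cache:
            # Read-ahead: on a miss, fetch the block and the next few as well.
            for i in range(n, n + self.read_ahead + 1):
                if i not in self.cache:
                    self.cache[i] = self.backend.read_block(i)
        return self.cache[n]

    def write(self, n, data):
        # Write-back: update memory now, defer the back-end write.
        self.cache[n] = data
        self.dirty.add(n)

    def flush(self):
        # Push every deferred write to the back end, then forget them.
        for n in sorted(self.dirty):
            self.backend.write_block(n, self.cache[n])
        self.dirty.clear()
```

Because the application only calls read and write, the same layer could sit in front of any back end that offers block reads and writes, which is the sense in which the application stays cache unaware.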

Let’s Review

Outline: There are questions to be answered. Three “What”s: What is HDFS? What does it mean to virtualize HDFS? What are the different methods of virtualization (instances, advantages and considerations)? And a “When”: When to choose HDFS storage virtualization?

A Few Words about Performance: Performance measurements are an art as well as a science. Bottlenecks in applications. Bottlenecks in infrastructure: network, CPU, disk. Configuration is key: block size, distro, security.

Virtualize the HDFS Implementation: Performance (VMware BDE). Source of graph: VMware Technical Paper, “Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers”.

Implement an HCFS via the over-the-wire protocol: Performance (Isilon). Source of graph: Stefan Radtke blog post, http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html

Implement an HCFS via the FileSystem Java classes: Performance (Tachyon). Source of graph: Haoyuan Li, https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf

Virtualize the Hadoop Compatible File System Protocol: Performance (BlueData). Source of graph: BlueData customer proof-of-concept results.

Virtualized HDFS solutions provide good performance, even with remote storage and even in virtualized environments.

When it comes to Hadoop storage virtualization, speed is not the whole story. Other factors to consider when implementing a virtualized HDFS option: use of a virtualized compute environment; open source / proprietary solution; required Hadoop File System features; lifespan of the Hadoop cluster.

When it comes to Hadoop storage virtualization, speed is not the whole story. Other factors to consider when selecting storage: data accessibility (Hadoop File System protocol; NFS, object store, other protocols); enterprise storage services (data protection, geographical replication, offline backup).

Consider a Virtualized HDFS Solution when any of the following are true: Hadoop and non-Hadoop applications are required to access the same data; you do not want to replicate the data; enterprise storage data services are required; you need to run Hadoop in a virtual compute environment.

Hadoop File System: Volume, Velocity, Variety, Virtualization

Q & A. Twitter: @tapbluedata; email: tap@bluedata.com; web: www.bluedata.com. Visit our booth in the Expo.