What does it mean to virtualize the Hadoop File System?


1 What does it mean to virtualize the Hadoop File System?
Tom Phelan, Chief Architect, BlueData

2 It is HDFS …

3 Unless it is not

4 Outline
There are questions to be answered …
Three “What”s:
  What is HDFS?
  What does it mean to virtualize HDFS?
  What are the different methods of virtualization?
    Instances
    Advantages and considerations
And a “When”:
  When to choose HDFS storage virtualization?

5 What is HDFS?
Before we can virtualize it, we need to understand what “it” is.

6 HDFS
It is a distributed file system built with NameNodes and DataNodes.
Source: David Engfer via slideshare.net

7 HDFS Implementation
hadoop-hdfs.jar
  org.apache.hadoop.fs.FileSystem
  org.apache.hadoop.hdfs.FileSystem
  org.apache.hadoop.hdfs.DistributedFileSystem

8 HDFS Implementation
It is a stack of Java code used by Hadoop applications to access data.
[Diagram. Labels, top to bottom: YARN; Hadoop Distributed File System API / Java class; HDFS Implementation; Distributed File System Client Protocol at TCP/IP level (“over the wire”); HDFS Implementation]

9 HDFS Layers of Potential Virtualization
  Generic Java classes: Java class org.apache.hadoop.fs.FileSystem
  HDFS over-the-wire protocol: Java class org.apache.hadoop.hdfs.DFSClient
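The two layers can be pictured with a toy model. The sketch below is not Hadoop code: `ToyFileSystem`, `WireProtocolFs`, `CustomBackendFs`, and the host name are hypothetical stand-ins. It only illustrates that the application calls one generic interface, and that a replacement can be slotted in either behind the wire protocol (keep the client, change what answers at the address) or at the Java class level (change the class that serves the scheme).

```java
import java.net.URI;

// Toy model only: these types are hypothetical stand-ins, not the real Hadoop classes.
final class LayerSketch {

    /** Stand-in for the generic Java layer (the role of org.apache.hadoop.fs.FileSystem). */
    interface ToyFileSystem {
        String open(String path);
    }

    /**
     * Stand-in for the unmodified client that speaks the wire protocol.
     * Whatever answers at host:port may be a NameNode or another service
     * that implements the same protocol.
     */
    static final class WireProtocolFs implements ToyFileSystem {
        private final String endpoint;

        WireProtocolFs(String endpoint) {
            this.endpoint = endpoint;
        }

        @Override
        public String open(String path) {
            return "wire-protocol read of " + path + " via " + endpoint;
        }
    }

    /** Stand-in for a replacement Java class that bypasses the wire protocol. */
    static final class CustomBackendFs implements ToyFileSystem {
        @Override
        public String open(String path) {
            return "native backend read of " + path;
        }
    }

    /** Chooses an implementation from the URI scheme. */
    static ToyFileSystem forUri(URI uri) {
        if ("hdfs".equals(uri.getScheme())) {
            return new WireProtocolFs(uri.getHost() + ":" + uri.getPort());
        }
        return new CustomBackendFs();
    }

    /** Application code: identical whichever layer was swapped underneath. */
    static String readAs(String uriText) {
        URI uri = URI.create(uriText);
        return forUri(uri).open(uri.getPath());
    }

    public static void main(String[] args) {
        System.out.println(readAs("hdfs://namenode.example:8020/data/part-0"));
        System.out.println(readAs("customfs://bucket/data/part-0"));
    }
}
```

The application-facing call, `readAs`, does not change between the two cases; only the object behind the interface does.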

10 HDFS Implementation
[Diagram: hosts connected by the wire protocol. Labels: NameNode, Resource Manager, HDFS Implementation (one host); DataNode, Node Manager, App, HDFS Impl, DFSClient, Local Disk (another host)]

11 HDFS Virtualization
The virtualization of either the HDFS Implementation or the Protocols

12 Outline
There are questions to be answered …
Three “What”s:
  What is HDFS?
  What does it mean to virtualize HDFS?
  What are the different methods of virtualization?
    Instances
    Advantages and considerations
And a “When”:
  When to choose HDFS storage virtualization?

13 HDFS Virtualization Methods
  Virtualize the HDFS Implementation
  Implement one of the Hadoop Compatible File System (HCFS) Protocols
    Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    Implement a HCFS via the FileSystem protocol (fs.FileSystem)

14 Virtualize the HDFS Implementation
This is the only method of HDFS virtualization that requires Hadoop compute virtualization.
Simple: install a Hadoop distro into a cluster of virtualized compute nodes and run the HDFS services in the cluster, storing data on vdisks/vmdks.
Instances of this type of HDFS virtualization include:
  VMware BDE
  OpenStack Sahara
  Cloudera Director
  Hortonworks Cloudbreak

15 Virtualize the HDFS Implementation
[Diagram: three hosts, each running a VM. Labels: NameNode, Resource Manager (one VM); Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk (each of the other two VMs)]

16 Virtualize the HDFS Implementation
Advantages:
  Simple
  No new Java code
  Compute/data locality
Considerations:
  Requires data ingest time
  The clusters become stateful

17 HDFS Virtualization Methods
  Virtualize the HDFS Implementation
  Implement a Hadoop Compatible File System – HCFS
    Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    Implement a HCFS via the FileSystem protocol (fs.FileSystem)

18 Implement a HCFS via the over-the-wire protocol
Use the unmodified hadoop-hdfs jar
fs.defaultFS: hdfs:// :8020/path
Instance: EMC Isilon

19 Implement a HCFS via the over-the-wire protocol
[Diagram: three hosts. Labels: Storage Service, Local Disk, NameNode, Resource Manager (one host); Node Manager, App, HDFS Impl, DFSClient, DataNode, Local Disk (each of the other two hosts)]

20 Implement a HCFS via the over-the-wire protocol
Advantages:
  Multi-protocol
  No new Java code
  Enterprise storage services
Considerations:
  Open source / proprietary
  No compute / data locality

21 HDFS Virtualization Methods
  Virtualize the HDFS Implementation
  Implement a Hadoop Compatible File System – HCFS
    Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
    Implement a HCFS via the FileSystem protocol (fs.FileSystem)

22 Implement a HCFS via the FileSystem Java classes
Write the Java code that implements the class, build a jar file, put the jar file in the YARN services' class path, and edit the core-site.xml file.
Instances:
  S3 and S3a/S3n – org.apache.hadoop.fs.FileSystem
  GlusterFS – org.apache.hadoop.fs.FilterFileSystem
  Tachyon – org.apache.hadoop.fs.FileSystem
  Apache Ignite – org.apache.hadoop.fs.AbstractFileSystem
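What the jar and the configuration edit accomplish together can be sketched in plain Java. This is a toy model, not Hadoop's loader: `ToyFileSystem`, `CustomFs`, and the `customfs` scheme are hypothetical, and a `java.util.Properties` object plays the role of core-site.xml. The key format is modeled on Hadoop's `fs.<scheme>.impl` configuration convention, which maps a URI scheme to an implementing class name.

```java
import java.util.Properties;

// Toy model of the registration step; the types here are hypothetical, not Hadoop's.
final class ImplLookupSketch {

    /** Stand-in for the class a Hadoop Compatible File System would extend. */
    interface ToyFileSystem {
        String describe();
    }

    /** Stand-in for the custom class that would be packaged in the jar. */
    static final class CustomFs implements ToyFileSystem {
        public CustomFs() {
        }

        @Override
        public String describe() {
            return "custom file system";
        }
    }

    /** Resolves a scheme to a class name via an fs.<scheme>.impl key, then instantiates it. */
    static ToyFileSystem load(Properties conf, String scheme) {
        String className = conf.getProperty("fs." + scheme + ".impl");
        if (className == null) {
            throw new IllegalArgumentException("No file system registered for scheme: " + scheme);
        }
        try {
            // Class.forName only succeeds if the class is on the class path,
            // which is why the jar has to be visible to the services.
            Class<?> raw = Class.forName(className);
            return raw.asSubclass(ToyFileSystem.class).getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot load " + className, e);
        }
    }

    /** Registers the stand-in class under the hypothetical scheme and loads it. */
    static String demo() {
        Properties conf = new Properties();
        // Plays the role of the core-site.xml entry.
        conf.setProperty("fs.customfs.impl", CustomFs.class.getName());
        return load(conf, "customfs").describe();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

If either half is missing (no configuration entry, or no class on the class path), the lookup fails, which is the practical reason both steps appear in the slide.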

23 Implement a HCFS via the FileSystem Java classes
[Diagram: three hosts. Labels: NameNode, Resource Manager, Storage Service (one host); Node Manager, App, HDFS Impl, CustomFS Impl, DFSClient, DataNode, Storage Service, Local Disk (each of the other two hosts)]

24 Implement a HCFS via the FileSystem Java classes
[Diagram: three hosts. Labels: Local Disk, Storage Service, NameNode, Resource Manager (one host); Node Manager, App, CustomFS Impl, HDFS Impl, DFSClient, DataNode, Storage Service, Local Disk (each of the other two hosts)]

25 Implement a HCFS via the FileSystem Java classes
Advantages:
  Open source / proprietary
  Multiple file access protocols supported
Considerations:
  These are file systems
  New Java code
  Possibly no compute / data locality
  May lag latest HDFS feature set

26 HDFS Virtualization
Is there another way?

27 HDFS Virtualization
  Virtualize the HDFS Implementation
  Implement a Hadoop Compatible File System – HCFS
    Implement a HCFS via the over-the-wire protocol
    Implement a HCFS via the FileSystem Java classes
  Virtualize the Hadoop Compatible File System Protocol

28 Virtualize the Hadoop Compatible File System Protocol
Translate the Hadoop File System calls into native calls to the back-end file systems
Insert an intelligent caching layer
Instance: BlueData EPIC software – org.apache.hadoop.fs.FileSystem
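The two ideas on this slide, translation and caching, can be combined in a small sketch. It is an illustration under stated assumptions rather than a description of any product: `Backend`, `TranslatingCache`, and the path-to-key rule are invented for the example, and the cache is a plain in-memory map with no eviction.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a hypothetical translation layer with a read-through cache.
final class CachingTranslationSketch {

    /** Hypothetical native API of some back-end store (not a real product interface). */
    interface Backend {
        byte[] fetch(String nativeKey);
    }

    /** Translates Hadoop-style paths into back-end keys and keeps results in memory. */
    static final class TranslatingCache {
        private final Backend backend;
        private final Map<String, byte[]> cache = new HashMap<>();
        private int backendCalls = 0;

        TranslatingCache(Backend backend) {
            this.backend = backend;
        }

        /** Translation rule assumed for this sketch: drop the leading slash. */
        static String translate(String hadoopPath) {
            return hadoopPath.startsWith("/") ? hadoopPath.substring(1) : hadoopPath;
        }

        byte[] read(String hadoopPath) {
            String key = translate(hadoopPath);
            byte[] cached = cache.get(key);
            if (cached != null) {
                return cached; // served from memory, no back-end call
            }
            backendCalls++;
            byte[] data = backend.fetch(key);
            cache.put(key, data);
            return data;
        }

        int backendCalls() {
            return backendCalls;
        }
    }

    /** Reads /data/a twice and /data/b once; returns how often the back end was hit. */
    static int demo() {
        TranslatingCache fs = new TranslatingCache(
                key -> ("contents of " + key).getBytes(StandardCharsets.UTF_8));
        fs.read("/data/a");
        fs.read("/data/a");
        fs.read("/data/b");
        return fs.backendCalls();
    }

    public static void main(String[] args) {
        System.out.println("back-end calls: " + demo());
    }
}
```

The caller never sees the cache or the native key format, which is the sense in which the caching is transparent to the application.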

29 Virtualize the Hadoop Compatible File System Protocol
[Diagram: four hosts. Labels: Storage Service, Local Disk (one host); NameNode, Resource Manager, Local Disk (one host); Node Manager, DTAP Service, App, DTAP Impl, HDFS Impl, DFSClient, DataNode, Local Disk (each of the other two hosts)]

30 HDFS mem cache
Application is cache aware
[Diagram. Labels: Page Cache; DataNode; page; HDFS Implementation; DFSClient]

31 Extend mem cache to any File System or Object storage
Application is cache unaware
[Diagram. Labels: Page Cache; HDFS; GlusterFS; Object Store; page; DTAP Service; DTAP FileSystem Implementation]

32 Virtualize the Hadoop Compatible File System Protocol
Advantages:
  Not a file system
  Transparent in-memory cache: write back, read ahead
  Supports multiple protocols
  Supports compute / data locality
Considerations:
  New Java code
  Open source / proprietary
  May lag latest HDFS feature set
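Read ahead, one of the cache behaviors listed above, can be sketched as follows. The names (`BlockSource`, `ReadAheadCache`) and the one-block look-ahead are assumptions made for the example; the prefetch here is synchronous for simplicity, whereas a real cache would more plausibly fetch in the background.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical read-ahead cache with a look-ahead of one block.
final class ReadAheadSketch {

    /** Hypothetical source of numbered blocks (stands in for remote storage). */
    interface BlockSource {
        String readBlock(int index);
    }

    static final class ReadAheadCache {
        private final BlockSource source;
        private final int blockCount;
        private final Map<Integer, String> cache = new HashMap<>();
        private final List<Integer> fetched = new ArrayList<>();

        ReadAheadCache(BlockSource source, int blockCount) {
            this.source = source;
            this.blockCount = blockCount;
        }

        /** Returns the requested block and prefetches the next one if it exists. */
        String read(int index) {
            String block = load(index);
            if (index + 1 < blockCount) {
                load(index + 1); // read ahead: the next sequential read will hit the cache
            }
            return block;
        }

        /** Fetches from the source only when the block is not cached yet. */
        private String load(int index) {
            String block = cache.get(index);
            if (block == null) {
                fetched.add(index);
                block = source.readBlock(index);
                cache.put(index, block);
            }
            return block;
        }

        /** Order in which blocks were actually fetched from the source. */
        List<Integer> fetchedOrder() {
            return fetched;
        }
    }

    /** Reads three blocks sequentially and reports which source fetches occurred. */
    static List<Integer> demo() {
        ReadAheadCache cache = new ReadAheadCache(i -> "block-" + i, 3);
        cache.read(0); // fetches block 0 and prefetches block 1
        cache.read(1); // block 1 is cached; prefetches block 2
        cache.read(2); // block 2 is cached; nothing left to prefetch
        return cache.fetchedOrder();
    }

    public static void main(String[] args) {
        System.out.println("fetched from source: " + demo());
    }
}
```

In a sequential scan, only the first read waits on the source; later reads find their block already in memory.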

33 Let’s Review

34 Outline
There are questions to be answered …
Three “What”s:
  What is HDFS?
  What does it mean to virtualize HDFS?
  What are the different methods of virtualization?
    Instances
    Advantages and considerations
And a “When”:
  When to choose HDFS storage virtualization?

35 A Few Words about Performance
Performance measurements are an art as well as a science
Bottlenecks in applications
Bottlenecks in infrastructure
  network
  CPU
  disk
Configuration is key
  block size
  distro
  security

36 Virtualize the HDFS Implementation
Performance – VMware BDE
Source of graph: VMware Technical Paper, “Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers”

37 Implement a HCFS via the over-the-wire protocol
Performance – Isilon
Source of graph: Stefan Radtke blog post

38 Implement a HCFS via the FileSystem Java classes
Performance – Tachyon
Source of graph: Haoyuan Li

39 Virtualize the Hadoop Compatible File System Protocol
Performance – BlueData
Source of graph: BlueData customer proof-of-concept results

40 Virtualized HDFS solutions provide good performance
  Even with remote storage
  Even in virtualized environments

41 When it comes to Hadoop storage virtualization, speed is not the whole story
Other factors to consider when implementing a virtualized HDFS option:
  Use of a virtualized compute environment
  Open source / proprietary solution
  Required Hadoop File System features
  Lifespan of Hadoop cluster

42 When it comes to Hadoop storage virtualization, speed is not the whole story
Other factors to consider when selecting storage:
  Data accessibility
    Hadoop File System protocol
    NFS, object store, other protocols
  Enterprise storage services
    data protection
    geographical replication
    offline backup

43 Consider a Virtualized HDFS Solution
When any of the following are true:
  Hadoop and non-Hadoop applications are required to access the same data
  Do not want to replicate the data
  Enterprise storage data services required
  Need to run Hadoop in a virtual compute environment

44 Volume, Velocity, Variety
Hadoop File System Volume, Velocity, Variety Virtualization

45 Q & A
Visit our booth in the Expo

