Published by Howard Alexander. Modified over 8 years ago.
1
A Novel Approach to Improving the Efficiency of Storing and Accessing Small Files on Hadoop: a Case Study by PowerPoint Files
Authors: Bo Dong, Jie Qiu, Qinghua Zheng, Xiao Zhong, Jingwei Li, Ying Li
Presenters: Yue Zhu, Linghan Zhang
2
S-2 Outline
Background & Motivation
Design
Evaluation of Experiments
Conclusions
3
S-3 Hadoop Distributed File System (HDFS)
Very large distributed file system
–10K nodes, 100 million files, 10 PB
Design pattern: master-slaves
Two types of machines in an HDFS cluster
–NameNode (the master): the heart of an HDFS filesystem. It maintains and manages the file system metadata, e.g., what blocks make up a file and on which DataNodes those blocks are stored.
–DataNode (the slave): where HDFS stores the actual data. There are usually quite a few of these.
4
S-4 HDFS – Data Organization
HDFS is a block-structured file system.
A file can be made up of several blocks, which are stored across a cluster of one or more machines with data storage capacity.
Each block of a file is replicated across a number of machines to improve fault tolerance.
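As a rough illustration of the block layout described on this slide, the sketch below computes how many blocks a file occupies and how many block replicas the cluster ends up storing. The 64 MB block size and the replication factor of 3 are the values quoted elsewhere in these slides; the function names are illustrative and not part of any Hadoop API.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, the block size used in the experiments
REPLICATION = 3                 # default number of replicas mentioned in the slides


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed for a non-empty file of file_size bytes."""
    return math.ceil(file_size / block_size)


def replica_count(file_size, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Total number of block replicas stored across the DataNodes."""
    return block_count(file_size, block_size) * replication
```

A 200 MB file therefore spans four blocks, while a 2.5 MB file still needs a block (and its replicas) of its own, which is the starting point for the small-file problem.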
5
S-5 Problem of Small Files in HDFS
Disk utilization
–A 2.5 MB file stored with a block size of 64 MB uses only 2.5 MB of disk space, but other files cannot be written into the free space of that 64 MB block.
High memory usage
–The in-memory metadata for each file is 368 bytes, including the default 3 replicas (16 GB of memory for 24 million files).
–DataNodes periodically send reports with their lists of blocks to the NameNode, and the NameNode gathers the reports and stores them in memory.
High access costs
–When reading files, the HDFS client first consults the NameNode for file metadata, which happens once for each file access.
–Sequential files are usually not placed sequentially at the block level but on different DataNodes.
6
S-6 A Case Study on PPT Files
A PPT file is converted into a number of preview pictures.
A PPT file and all its pictures belong to one PPT courseware.
7
S-7 Outline
Background & Motivation
Design
Evaluation of Experiments
Conclusions
8
S-8 The novel approach – Basic Idea
MERGING small files into larger ones.
PREFETCHING to mitigate the load on the NameNode and improve access efficiency.
Example: File number: N 3
9
S-9 The novel approach – A share system
An HDFS-based file share system could look like this:
–User interface layer: an interface for uploading, browsing, downloading, etc.
–Business process layer: file converting, file merging, file naming, web server and cache functions. It communicates with HDFS through HDFS clients.
–Storage layer: persistence functions using the HDFS cluster.
10
S-10 The novel approach – Uploading
① The PPT file arrives at the web server;
② The PPT file is converted into multiple picture series;
③ The local index file is built and the files are merged;
④ The merged file is uploaded.
11
S-11 The novel approach – Browsing
[Figure: browsing process. Only the step labels survive; they are grouped below by their numbering.]
① Check cache for the target file ② File is in cache, browsing over.
② Check metadata ③ tract target DataNode ④ Split file from block, return.
② Mapping ③ Metadata of the merged file ④ DataNode ⑤ Fetch target block ⑥ Split ⑦ Prefetching.
12
S-12 The novel approach – Download
The download process applies only to PPT files:
–If the file has been prefetched, it is read from the cache.
–If not, the download process is almost the same as the browsing process, except that no prefetching takes place.
13
S-13 The novel approach – File merging
Count the files to be merged: the PPT file, the pictures, and the local index file (a fixed-length index).
Calculate the sum of the lengths of all files, including the local index file, and compare it with the HDFS block size:
–Less than the HDFS block size: the merged file fits in one block, the files are stored in the default order, and the local index can be established.
–Exceeds the HDFS block size: the merged file is broken into blocks. Two strategies are adopted for this case.
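The merging step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the slides only say the local index has a fixed length, so the entry layout used here (a 32-byte name, an 8-byte offset, and an 8-byte length) and the names `ENTRY`, `merge_files`, and `read_file` are assumptions made for the example.

```python
import struct

# Hypothetical fixed-length index entry: 32-byte name, 8-byte offset, 8-byte length.
# Names longer than 32 bytes would be truncated by this layout.
ENTRY = struct.Struct(">32sQQ")


def merge_files(files):
    """Merge (name, data) pairs into one byte string: local index first, then data."""
    index_size = ENTRY.size * len(files)   # known up front because entries are fixed length
    offset = index_size
    entries, payload = [], []
    for name, data in files:
        entries.append(ENTRY.pack(name.encode("utf-8"), offset, len(data)))
        payload.append(data)
        offset += len(data)
    return b"".join(entries) + b"".join(payload)


def read_file(merged, n_files, name):
    """Look up one original file through the local index and slice it out."""
    for i in range(n_files):
        raw_name, offset, length = ENTRY.unpack_from(merged, i * ENTRY.size)
        if raw_name.rstrip(b"\0").decode("utf-8") == name:
            return merged[offset:offset + length]
    raise KeyError(name)


def fits_in_one_block(merged, block_size=64 * 1024 * 1024):
    """True when the merged file (index included) fits in a single HDFS block."""
    return len(merged) <= block_size
```

Because every index entry has the same size, the total index length is known before any offsets are assigned, so the offsets of the data can be computed in a single pass.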
14
S-14 The novel approach – strategies for file merging
Strategy 1
Target: try to keep each picture series within one block.
Method: adjust the order of the picture series, treating each series as a whole.
Steps:
1. Calculate the prefix length: prefix length = local index file + PPT file + standard resolution picture series.
2. Compare the prefix length with the HDFS block size. If it exceeds the block size, the process is over; otherwise go to step 3.
3. Adjust the order of the other picture series to try to fill the HDFS block. If that is not possible, follow the default order; the process is over.
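A possible reading of Strategy 1 is sketched below. The slide does not say how the remaining series are chosen to fill the block, so the largest-first greedy choice here is an assumption, and `order_series` is a hypothetical helper name.

```python
def order_series(prefix_len, others, block_size):
    """Return the storage order (names) of the non-standard picture series.

    prefix_len: length of local index + PPT file + standard resolution series.
    others:     list of (name, length) pairs in the default order.
    """
    default_order = [name for name, _ in others]
    if prefix_len > block_size:
        # Step 2: the prefix already exceeds the block, nothing to adjust.
        return default_order

    # Step 3 (assumed heuristic): greedily pick whole series, largest first,
    # that still fit in the space left in the first block.
    remaining = block_size - prefix_len
    first_block = []
    for name, length in sorted(others, key=lambda s: s[1], reverse=True):
        if length <= remaining:
            first_block.append(name)
            remaining -= length

    if not first_block:
        # No series fits in the remaining space: follow the default order.
        return default_order
    rest = [name for name in default_order if name not in first_block]
    return first_block + rest
```

With a block size of 100, a prefix of 40, and series of lengths 30, 70, and 25, the two smaller series are moved next to the prefix so that neither of them is split across the block boundary.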
15
S-15 The novel approach – strategies for file merging
Strategy 2
Target: try to keep each picture series within one block (same as Strategy 1).
Method: vacant domain.
Steps:
1. Check the offset of each file and whether any file crosses the boundary between two blocks. If not, go to step 3; otherwise go to step 2.
2. Adjust the file order: move the file on the boundary to the next block, add an index file at the start of the next block, and reset the offset to just after the index file.
3. Loop steps 1 and 2.
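The vacant-domain idea can be illustrated with a small layout routine. This is a sketch under stated assumptions: every block starts with an index file of the same length, each file fits in one block after that index, and the function name `layout_with_vacant_domain` is made up for the example.

```python
def layout_with_vacant_domain(file_lengths, index_len, block_size):
    """Place files so that none crosses a block boundary.

    Returns one (block_number, offset_in_block) pair per file. Space left
    unused at the end of a block is the "vacant domain".
    """
    placements = []
    block, offset = 0, index_len            # block 0 starts with its index file
    for length in file_lengths:
        if index_len + length > block_size:
            raise ValueError("file cannot fit in a single block after the index")
        if offset + length > block_size:
            # The file would cross the boundary: move it to the next block,
            # which starts with its own index file.
            block += 1
            offset = index_len
        placements.append((block, offset))
        offset += length
    return placements
```

The trade-off is that some space at the end of each block is left vacant in exchange for never having to read one file from two blocks.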
16
S-16 The novel approach – file mapping
WHY? Original files are merged into one, so each original file must be mapped to its merged file.
ONE SOLUTION: name mapping
Following the given naming discipline, a file name has four domains: name domain + resolution domain + serial number domain + block number domain
[example]: A_1280_05_01.jpg
THE OTHER SOLUTION: building a global table (more general)
Create a record for each original file. The table is kept within the NameNode and is also persisted on disk.
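The name-mapping solution can be illustrated by parsing the example file name into its four domains. The underscore separator is taken from the example on the slide; the `MappedName` type and `parse_name` function are hypothetical names for this sketch.

```python
import os
from typing import NamedTuple


class MappedName(NamedTuple):
    name: str        # name domain (identifies the courseware)
    resolution: int  # resolution domain
    serial: int      # serial number domain (position within the series)
    block: int       # block number domain


def parse_name(filename):
    """Split a file name such as 'A_1280_05_01.jpg' into its four domains."""
    stem, _extension = os.path.splitext(filename)
    # Split from the right so the name domain may itself contain underscores.
    name, resolution, serial, block = stem.rsplit("_", 3)
    return MappedName(name, int(resolution), int(serial), int(block))
```

With this scheme the location of a picture can be derived from its name alone, without a lookup in a separate table.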
17
S-17 The novel approach – prefetching
WHY? PPT courseware files are related to each other, while HDFS doesn't provide prefetching.
HOW? Two-level prefetching: 1. local index file prefetching; 2. correlated file prefetching.
WHY? To save the effort of file mapping and of interaction with the NameNode.
TARGET: prefetching records (metadata and index information)
EXAMPLE:
CONTENTS? Files in the same PPT courseware.
QUANTITY? Pictures in the same series and after the target picture; + the PPT file; + pictures in all series after the target picture.
WHEN TRIGGERED? In the browsing process:
1. Check whether the file exists. If yes, read it; if no, go to step 2.
2. Check whether a prefetching record exists. If yes, perform file prefetching; if no, go to step 3.
3. Local index file prefetching is triggered.
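The trigger logic in the three steps above can be sketched with plain dictionaries standing in for the cache and the prefetching records. Everything named here (`browse`, `load_index`, `fetch_by_record`, and the returned source labels) is hypothetical; the sketch only shows the order of the checks, not how the real system talks to HDFS.

```python
def browse(name, file_cache, prefetch_records, load_index, fetch_by_record):
    """Return (data, source) for a requested file.

    file_cache:       dict of file name -> bytes already prefetched
    prefetch_records: dict of file name -> record (e.g. offset and length)
    load_index:       callable(name) -> dict of records from the local index file
    fetch_by_record:  callable(record) -> bytes read from the merged file
    """
    # Step 1: the file itself is already available.
    if name in file_cache:
        return file_cache[name], "cache"

    # Step 2: a prefetching record exists, so no mapping or NameNode lookup is needed.
    if name in prefetch_records:
        source = "record"
    else:
        # Step 3: trigger local index file prefetching, which yields the
        # records for the files of the same courseware.
        prefetch_records.update(load_index(name))
        source = "index-prefetch"

    data = fetch_by_record(prefetch_records[name])
    file_cache[name] = data
    return data, source
```

After the first request for a courseware, later requests for its other files are answered from the prefetched records or from the cache, which is where the savings in NameNode interaction come from.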
18
S-18 Outline
Background & Motivation
Design
Evaluation of Experiments
Conclusions
19
S-19 Experiments - setup
One master node (NameNode)
–IBM X3650, 8 Intel Xeon CPU 2.00 GHz, 16 GB memory, 3 TB disk
Eight slave nodes (DataNodes)
–IBM X3610, 8 Intel Xeon CPU 2.00 GHz, 8 GB memory, 3 TB disk
Ubuntu server 9.04
Hadoop 0.20.1
Java version 1.6.0
HDFS block size 64 MB
Connected by 1.0 Gbps Ethernet network
20
S-20 Memory Usage
Hadoop archive (HAR): a general small-file solution that archives small files into larger files.
In the HAR setup, all files in one PPT courseware are stored in one HAR file.
21
S-21 Time Efficiency
MSPF: milliseconds per accessing a file.
22
S-22 Conclusions
The proposed approach combines a file merging method with a two-level prefetching mechanism to mitigate small file problems on HDFS.
The paper focuses on storing small files, not on processing them with a MapReduce framework such as Hadoop.
Our project not only provides efficient I/O for small files in HDFS, but also looks at how to work with small files using MapReduce.
23
S-23 Questions?