Dynamic and Scalable Distributed Metadata Management in Gluster File System
Huang Qiulan, Computing Center, Institute of High Energy Physics, Chinese Academy of Sciences


Topics
- Introduction to Gluster
- Gluster in IHEP: deployment, performance and issues
- Design and implementation: what we have done
- Experiment results
- Summary

Gluster Introduction

Gluster Overview
- Gluster is an open-source distributed file system
- Linear scale-out: supports several petabytes and thousands of client connections
  - No separate metadata server
  - An elastic hashing algorithm distributes data efficiently
  - Fully distributed architecture
- Global namespace with POSIX support
- High reliability
  - Data replication
  - Data self-heal
- Design and implementation based on a stackable, modular user-space architecture

Gluster Architecture
- Brick server: stores data on EXT3/EXT4/XFS
- Client: accesses the file system via TCP (native protocol), NFS, or Samba

Stackable and Modular Design
- Gluster has a stackable, modular structure
- Each functional module is called a translator
- All translators together form a tree
- This greatly reduces the complexity of the system
- It also makes it easy to extend system functionality (a conceptual sketch follows below)
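To illustrate how translators stack, here is a conceptual Python sketch in which each translator wraps the next one and can observe or modify every operation passing through. It mirrors only the stacking idea; Gluster's real translators are C shared objects with a different interface, and the class names here are illustrative.

import os

class Translator:
    # base class: forwards every operation to its child translator
    def __init__(self, child=None):
        self.child = child

    def lookup(self, path):
        return self.child.lookup(path) if self.child else None

class Posix(Translator):
    # leaf translator: talks to the local file system
    def lookup(self, path):
        return os.stat(path) if os.path.exists(path) else None

class IoStats(Translator):
    # pass-through translator that observes each call
    def lookup(self, path):
        print(f"lookup({path})")
        return super().lookup(path)

stack = IoStats(Posix())          # translators chained into a (degenerate) tree
stack.lookup("/etc/hosts")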

Gluster in IHEP

Deployment of Gluster File System
- Version: 3.2.7 (with optimizations)
- I/O servers: 5 (39 bricks)
- Storage capacity: 315 TB
- Serves cosmic ray / astrophysics experiments (YBJ-ARGO)
[Diagram: computing cluster connected over 10 Gb Ethernet to brick storage on SATA disk arrays, RAID 6 (main and extended)]

Gluster Performance
- Per-server peak I/O throughput reaches over 850 MB/s; the 10 Gb Ethernet link is fully saturated
- As read/write requests increase, wait I/O on the data servers rises above 40%
- When wait I/O exceeds 20%, "ls" performance degrades noticeably

Gluster Issues
- Metadata problems:
  - When a data server is busy, "ls" performance suffers the most
  - As the number of bricks increases, "mkdir" and "rmdir" performance gets worse
  - The directory tree becomes inconsistent
  - When one brick has problems, client requests get stuck
  - Ownership of link files changes to root:root
- Most of these problems are metadata-related
- What can we do for Gluster?

Design and Implementation

Architecture of our system

Client-side volume file showing where the new metadata translator is inserted:

volume testvol-client-0
    type protocol/client
    option remote-host
    option remote-subvolume /data02/gdata01
    option transport-type tcp
end-volume

volume testvol-client-1
    type protocol/client
    option remote-host
    option remote-subvolume /data03/gdata01
    option transport-type tcp
end-volume

volume testvol-client-2
    type protocol/client
    option remote-host
    option remote-subvolume /data03/gdata01
    option transport-type tcp
end-volume

volume testvol-dht
    type cluster/distribute
    subvolumes testvol-client-0 testvol-client-1 testvol-client-2
end-volume

# Add new translator
volume testvol-md
    type cluster/md
    subvolumes testvol-dht testvol-md-replica-0 testvol-md-replica-1
end-volume

volume testvol-stat-prefetch
    type performance/stat-prefetch
    subvolumes testvol-md
end-volume

volume testvol
    type debug/io-stats
    option latency-measurement on
    option count-fop-hits on
    subvolumes testvol-stat-prefetch
end-volume

[Diagram: client-side translator stack from bottom to top — clients (bricks), distribute/stripe/replication, MD volume, read-ahead, io-cache, stat-prefetch, VFS]

How to distribute metadata? Adaptive Directory Sub-tree Partition algorithm (ADSP)
- An improved sub-tree partition algorithm
- Partitions the namespace into sub-trees at directory granularity
- Stores sub-trees on the storage device using a flat structure
- Records sub-tree distribution information and file attributes in extended attributes
- Adjusts sub-tree placement adaptively based on the load of the metadata cluster (see the sketch below)
- ADSP answers HOW and WHERE to distribute metadata
- Features: high scalability, load balance, flexibility
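To make the adaptive placement idea concrete, the following Python sketch assigns each new directory sub-tree to the least-loaded metadata server and moves a sub-tree away when a server becomes much hotter than average. The server names, the request-count load metric and the rebalance threshold are assumptions for illustration only; the slides do not give the actual policy.

# Load-aware sub-tree placement sketch (illustrative, not the production ADSP code)
class SubtreePlacer:
    def __init__(self, servers):
        self.servers = servers                  # metadata servers in the cluster
        self.load = {s: 0 for s in servers}     # per-server access counters
        self.placement = {}                     # directory sub-tree -> server

    def place(self, dir_path):
        # a new sub-tree goes to the least-loaded server
        server = min(self.servers, key=self.load.get)
        self.placement[dir_path] = server
        return server

    def record_access(self, dir_path):
        self.load[self.placement[dir_path]] += 1

    def rebalance(self, threshold=2.0):
        # adaptively move one sub-tree off any server loaded far above average
        avg = sum(self.load.values()) / len(self.servers)
        for server in self.servers:
            if avg and self.load[server] > threshold * avg:
                victim = next(d for d, s in self.placement.items() if s == server)
                self.placement[victim] = min(self.servers, key=self.load.get)

placer = SubtreePlacer(["mds-1", "mds-2", "mds-3"])   # hypothetical server names
placer.place("/ybjgfs/argo/public")
placer.record_access("/ybjgfs/argo/public")
placer.rebalance()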

ADSP Implementation
- Flat structure on the storage device: the UUID (GFID) of a directory is used as its on-disk directory name
- The metadata layout is stored in extended attributes
- File metadata is also stored in extended attributes
[Figure: file-system namespace (/ybjgfs/argo/public, /ybjgfs/asgamma, /ybjgfs/argo/user) mapped to flat sub-tree objects (argo/, argo/public, asgamma/, argo/user/, gfid(public)/, ...) spread over metadata servers ID=1, ID=2, ..., ID=n; the sub-directories and files under /ybjgfs/argo/public carry their file metadata and layout=2 in extended attributes]
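As an illustration of the flat layout described above, this Python sketch (Linux only) creates a sub-tree object named by a fresh UUID and records its layout and namespace path in extended attributes. The store path and the user.adsp.* attribute names are hypothetical, chosen just for this example; they are not the keys used by the actual system.

import os
import uuid

SUBTREE_STORE = "/brick/subtrees"               # hypothetical flat store on a brick

def create_subtree(namespace_path, layout_id):
    gfid = uuid.uuid4().hex                     # UUID used as the on-disk name
    obj = os.path.join(SUBTREE_STORE, gfid)
    os.makedirs(obj, exist_ok=True)
    # record the layout and the namespace path this sub-tree corresponds to
    os.setxattr(obj, "user.adsp.layout", str(layout_id).encode())
    os.setxattr(obj, "user.adsp.namespace", namespace_path.encode())
    return gfid

def read_layout(gfid):
    obj = os.path.join(SUBTREE_STORE, gfid)
    return int(os.getxattr(obj, "user.adsp.layout"))

gfid = create_subtree("/ybjgfs/argo/public", layout_id=2)
print(read_layout(gfid))                        # -> 2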

How to locate data? Distributed Unified Layout Algorithm (DULA)
- An improved consistent hashing algorithm: data is located without any routing information, with average time complexity O(1)
- The hash ring is divided into intervals of equal length
- All storage devices are mapped onto the hash ring, with each device owning one interval
- A file is located by computing Hash(GFID) and finding the interval [start, end] it falls into (a lookup sketch follows below)
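A minimal sketch of this equal-interval lookup, assuming an MD5-based hash and a hypothetical brick list; the hash function and interval bookkeeping in the real system may differ.

import hashlib

RING = 2 ** 32                                  # size of the hash ring

def locate(gfid, devices):
    # the ring is split into len(devices) equal intervals; interval i -> devices[i]
    h = int(hashlib.md5(gfid.encode()).hexdigest(), 16) % RING
    interval = RING // len(devices)
    return devices[min(h // interval, len(devices) - 1)]

print(locate("a1b2c3d4e5f6", ["brick-0", "brick-1", "brick-2"]))   # O(1) lookup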

Experiment results

Metadata performance (1)
Metadata performance is greatly improved by ADSP:
- Directory operations: ADSP is about 2-3 times faster than Gluster
- File operations: ADSP is about 2 times faster than Gluster, and ADSP also outperforms Lustre

Metadata performance (2)

Metadata performance (3)

Metadata performance (4)
- 1,800,000 files under testdir, traversed with "ls -lRa testdir"
- Our system (ZEFS) takes 276 minutes, while Gluster takes 3643 minutes
- ZEFS is about 13 times faster than Gluster on this workload

Summary
- Extended the Gluster framework with a metadata module
  - The ADSP algorithm is responsible for metadata distribution and organization
  - The DULA algorithm solves data placement in the cluster
- Metadata performance is greatly improved
  - Single client, single process: directory operations are about 2-3 times faster than Gluster, file operations about 2 times faster, and our system outperforms Lustre
  - Multiple clients, multiple processes: for highly concurrent access to small files, our system is about 3-4 times faster than Gluster; file size has little effect on the performance of directory operations. The overall trend is that larger files give lower performance, but the effect is not pronounced.
- Better scalability than Gluster

Thank you. Questions?
Author: HUANG QIULAN / CC / IHEP