Lustre, ZFS, and Data Integrity
ZFS BOF, Supercomputing 2009
Andreas Dilger, Sr. Staff Engineer, Lustre ZFS Team, Sun Microsystems

Topics
● Overview
● ZFS/DMU features
● ZFS Data Integrity
● Lustre Data Integrity
● ZFS-Lustre Integration
● Current Status

Lustre Requirements
Filesystem Limits
● 240 PB file system size
● 1 trillion (10^12) files per file system
● 1.5 TB/s aggregate bandwidth
● 100k clients
Single File/Directory Limits
● 10 billion files in a single directory
● 0 to 1 PB file size range
Reliability
● 100h filesystem integrity check
● End-to-end data integrity

ZFS Meets Future Requirements
Capacity
● Single filesystems of 512 TB+ (theoretical limit: 2^64 devices * 2^64 bytes)
● Trillions of files in a single file system (theoretical limit: 2^48 files per fileset; see the arithmetic check below)
● Dynamic addition of capacity/performance
Reliability and Integrity
● Transaction based, copy-on-write
● Internal data redundancy (up to triple parity/mirror and/or 3 copies)
● End-to-end checksum of all data/metadata
● Online integrity verification and reconstruction
Functionality
● Snapshots, filesets, compression, encryption
● Online incremental backup/restore
● Hybrid storage pools (HDD + SSD)
● Portable code base (BSD, Solaris, ... Linux, OS/X?)
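A quick arithmetic check (mine, not on the slide): the theoretical per-fileset file limit leaves ample headroom over the one-trillion-file requirement from the previous slide,

\[ 2^{48} \approx 2.8 \times 10^{14} \ \text{files per fileset} \gg 10^{12} \ \text{files required.} \]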

Lustre/ZFS Project Background
ZFS-FUSE (2006)
● The first port of ZFS to Linux, running in user space on the FUSE framework
Kernel DMU (2008)
● An LLNL engineer ported the core parts of ZFS to Red Hat Linux as a kernel module
Lustre-kDMU (2008 – present)
● Modify Lustre to use a ZFS/DMU storage backend suitable for production environments

ZFS Overview
ZFS POSIX Layer (ZPL)
● Interface to user applications
● Directories, namespace, ACLs, xattrs
Data Management Unit (DMU)
● Objects, datasets, snapshots
● Copy-on-write atomic transactions (see the write sketch below)
● Name->value lookup tables (ZAP)
Adaptive Replacement Cache (ARC)
● Manages RAM and flash cache space
Storage Pool Allocator (SPA)
● Allocates blocks at I/O time
● Manages multiple disks directly
● Mirroring, parity, compression
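To make the DMU's transactional model concrete, the sketch below shows roughly how a storage consumer such as the Lustre OSD might write into a DMU object inside one atomic, copy-on-write transaction. The function names (dmu_tx_create, dmu_tx_hold_write, dmu_tx_assign, dmu_write, dmu_tx_commit) follow the OpenSolaris-era DMU interface as I recall it; exact signatures, flags, and error handling are simplified, it requires the real ZFS/DMU headers to build, and it should be treated as a sketch rather than verbatim kernel code.

```c
/*
 * Hedged sketch: write `len` bytes at `off` into DMU object `object`
 * within one atomic, copy-on-write transaction. Locking and most error
 * handling are omitted; signatures are simplified approximations of the
 * OpenSolaris-era DMU interface, not verbatim ZFS code.
 */
static int
osd_write_sketch(objset_t *os, uint64_t object, uint64_t off,
                 uint64_t len, const void *buf)
{
	dmu_tx_t *tx = dmu_tx_create(os);
	int rc;

	/* Declare what the transaction will touch so space can be reserved. */
	dmu_tx_hold_write(tx, object, off, len);

	/* Assign the transaction to an open transaction group (txg). */
	rc = dmu_tx_assign(tx, TXG_WAIT);
	if (rc != 0) {
		dmu_tx_abort(tx);
		return rc;
	}

	/* Copy-on-write update; nothing is overwritten in place. */
	dmu_write(os, object, off, len, buf, tx);

	/* Commit: the change joins the txg and is synced to disk atomically. */
	dmu_tx_commit(tx);
	return 0;
}
```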

Lustre/Kernel DMU Modules

Lustre Software Stack Updates
[Diagram: the Lustre 1.8 and 2.0 server stacks (MDS using fsfilt over ldiskfs on the MDT, OSS using obdfilter/fsfilt over ldiskfs on the OST) compared with the 2.1 stacks, where the MDS (MDD) and OSS (OFD) sit on an OSD layer backed by either ldiskfs or the DMU.]

How Safe Is Your Data?
There are many sources of data corruption
– Software/firmware: client, NIC, router, server, HBA, disk
– RAM, CPU, disk cache, media
– Client network, storage network, cables
[Diagram: the I/O path from the application on the client, across the client and server transports, network links, and routers, to the object storage server and its storage; corruption can be introduced at any hop.]

ZFS: Industry-Leading Data Integrity

End-to-End Data Integrity
Lustre Network Checksum
● Detects data corruption over the network
● Ext3/4 does not checksum data on disk
ZFS stores data/metadata checksums
● Fast (Fletcher-4 by default, or none; a simplified sketch follows)
● Strong (SHA-256)
Combine for End-to-End Integrity
● Integrate the Lustre and ZFS checksums
● Avoid recomputing the full checksum over the data
● Always overlap checksum coverage
● Use a scalable tree hash method
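For reference, ZFS's default fast checksum is a Fletcher-style sum over 32-bit words using four 64-bit accumulators. The block below is a minimal, unoptimized sketch of that scheme for illustration; the real fletcher_4 code in ZFS additionally handles byte order, partial buffers, and SIMD variants, so treat this as an approximation rather than the actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Four 64-bit running sums, as in ZFS's fletcher_4 family of checksums. */
typedef struct { uint64_t a, b, c, d; } cksum4_t;

/*
 * Simplified Fletcher-4-style checksum over a buffer of 32-bit words.
 * Assumes `size` is a multiple of 4 bytes and ignores endianness,
 * which the real implementation does not.
 */
static void
fletcher4_sketch(const void *buf, size_t size, cksum4_t *ck)
{
	const uint32_t *w = buf;
	const uint32_t *end = w + size / sizeof(uint32_t);
	uint64_t a = 0, b = 0, c = 0, d = 0;

	for (; w < end; w++) {
		a += *w;	/* plain sum of words      */
		b += a;		/* first-order weighting   */
		c += b;		/* second-order weighting  */
		d += c;		/* third-order weighting   */
	}
	ck->a = a; ck->b = b; ck->c = c; ck->d = d;
}
```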

Hash Tree and Multiple Block Sizes
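The diagram itself is not reproduced in this transcript, but the underlying idea can be sketched: checksum fixed-size leaf chunks, then repeatedly checksum the concatenation of child checksums, so that peers using different block sizes can compare interior or root hashes instead of recomputing one flat checksum over all the data. The code below is an illustrative Merkle-style sketch under that assumption (leaf size, padding, and the pairwise fold are my choices, not the Lustre/ZFS algorithm); it reuses cksum4_t and fletcher4_sketch() from the Fletcher sketch above.

```c
#include <stdint.h>
#include <stdlib.h>

enum { LEAF_SIZE = 4096 };	/* hypothetical leaf chunk size */

/*
 * Build a checksum tree over `buf`: hash each LEAF_SIZE chunk, then fold
 * pairs of child checksums into parents until one root remains.
 * Illustrative only; the trailing partial chunk is truncated to a whole
 * number of 32-bit words and odd nodes are promoted unchanged.
 */
static cksum4_t
hash_tree_root(const uint8_t *buf, size_t size)
{
	cksum4_t zero = { 0, 0, 0, 0 };
	size_t nleaves = (size + LEAF_SIZE - 1) / LEAF_SIZE;
	cksum4_t *level;
	size_t i, n;

	if (nleaves == 0)
		return zero;		/* empty buffer: nothing to hash */

	level = calloc(nleaves, sizeof(*level));
	if (level == NULL)
		return zero;		/* allocation failure (sketch only) */

	/* Leaf level: checksum each fixed-size chunk independently. */
	for (i = 0; i < nleaves; i++) {
		size_t off = i * LEAF_SIZE;
		size_t len = (off + LEAF_SIZE <= size) ? LEAF_SIZE : size - off;
		fletcher4_sketch(buf + off, len & ~(size_t)3, &level[i]);
	}

	/* Interior levels: fold pairs of child checksums into parents. */
	for (n = nleaves; n > 1; n = (n + 1) / 2) {
		for (i = 0; i + 1 < n; i += 2) {
			cksum4_t pair[2] = { level[i], level[i + 1] };
			fletcher4_sketch(pair, sizeof(pair), &level[i / 2]);
		}
		if (n & 1)		/* odd child promoted unchanged */
			level[n / 2] = level[n - 1];
	}

	cksum4_t root = level[0];
	free(level);
	return root;
}
```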

End-to-End Integrity: Client Write

End-to-End Integrity: Server Write

Integrity Checking Targets
The scale of the proposed filesystem is enormous
● 1 trillion files checked in 100h
● 2-4 PB of MDT filesystem metadata
● ~3 million files/sec, 3 GB/s+ for one pass (derivation below)
Need to handle CMD (clustered metadata) coherency as well
● Distributed link counts on files and directories
● Directory parent/child relationships
● Filename to FID to inode mapping
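As a quick check on the scan rate (my arithmetic, assuming one uniform pass over the 100-hour window):

\[ \frac{10^{12}\ \text{files}}{100\ \text{h} \times 3600\ \text{s/h}} \approx 2.8 \times 10^{6}\ \text{files/s}, \]

which is where the roughly 3 million files/sec figure comes from.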

Distributed Coherency Checking
● Traverse the MDT filesystem(s) only once
● Piggyback on ZFS resilver/scrub to avoid extra I/O
● Lustre processes blocks after checksum validation
● Validate OST/remote MDT data opportunistically
● Track which objects have not been validated (1 bit/object; a minimal bitmap sketch follows this list)
● Each object has a back-reference to validate its user
● The back-reference avoids the need to keep a global reference table
● Lazily scan the OST/MDT in the background
● Load-sensitive scanning
● Continuous online verification/fix of distributed state
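The "1 bit/object" bookkeeping mentioned above can be as small as a bitmap indexed by local object ID: objects start unvalidated and are marked as the lazy background scan confirms them. Below is a minimal self-contained sketch; the naming and sizing are mine, not Lustre code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Minimal validated-object bitmap: one bit per local object ID. */
typedef struct {
	uint64_t  nobjects;
	uint8_t  *bits;
} validated_map_t;

static int
vmap_init(validated_map_t *m, uint64_t nobjects)
{
	m->nobjects = nobjects;
	m->bits = calloc((nobjects + 7) / 8, 1);	/* all objects start unvalidated */
	return m->bits != NULL ? 0 : -1;
}

static void
vmap_mark_validated(validated_map_t *m, uint64_t objid)
{
	m->bits[objid / 8] |= (uint8_t)(1u << (objid % 8));
}

static int
vmap_is_validated(const validated_map_t *m, uint64_t objid)
{
	return (m->bits[objid / 8] >> (objid % 8)) & 1;
}
```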

Lustre DMU Development Status
Data Alpha
● OST basic functionality
MD Alpha
● MDS basic functionality, beginning of recovery support
Beta
● Layouts, performance improvements, full recovery
GA
● Full performance in the production release

Questions? Thank You

Hash Tree For Non-contiguous Data

T10-DIF vs. Hash Tree
T10-DIF
● CRC-16 Guard Word: detects all 1-bit errors, all adjacent 2-bit errors, and single 16-bit burst errors, with a bounded undetected bit error rate
● 32-bit Reference Tag: detects misplaced writes and misplaced reads, but only when the displacement is not a whole multiple of 2^n TB (see the note below)
Lustre/ZFS Hash Tree
● Fletcher-4 Checksum: detects all 1-bit errors, all errors affecting 4 or fewer 32-bit words, and single 128-bit burst errors, with a bounded undetected bit error rate
● Hash Tree: detects misplaced reads, misplaced writes, phantom writes, and bad RAID reconstruction
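The reference-tag caveat follows from the tag width (my note, assuming the standard 512-byte sectors used with T10-DIF):

\[ 2^{32}\ \text{sectors} \times 512\ \text{B/sector} = 2\ \text{TiB}, \]

so a block that lands a whole multiple of 2 TiB away from its intended location still carries a matching reference tag and the misplacement goes undetected, whereas the hash tree's positional coverage does not wrap.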

ZFS Hybrid Pool Example: Test Configurations
Hybrid storage pool
● Five 400 GB 4200 RPM SATA drives
● 32 GB SSD ZIL device
● 80 GB SSD cache device
Traditional storage pool
● Seven 146 GB 10,000 RPM SAS drives
Common to both configurations
● 4 Xeon 7350 processors (16 cores)
● 32 GB FB DDR2 ECC DRAM
● OpenSolaris with ZFS

ZFS Hybrid Pool Example: Results
[Chart: raw capacity (about 2 TB), power consumption, read IOPS, and write IOPS for the hybrid pool (DRAM + SSD + 5x 4200 RPM SATA) versus the traditional pool (DRAM + 7x 10K RPM SAS); the relative factors shown are 3.2x, 1.1x, 2x, and 1/5.]

Lustre kDMU Roadmap