Data Storage: File Systems and Storage Software. Rainer Schwemmer, LHC Workshop, 14.03.2013.

Short Intro to File Systems
What is a (computer) file system?
–A method for storing and retrieving data on a hard disk.
–The file system manages a folder/directory structure, which provides an index to the files, and it defines the syntax used to access them.
–It is system software that takes commands from the operating system to read and write the disk clusters (groups of sectors).
What's the difference to a database?
–It is a database.
–There is usually only one fixed schema though:

Path/Name    | Meta Information       | Location(s)
/etc/passwd  | Size, owner, access, … | 0x…
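
As a rough illustration of that "one fixed schema" idea, here is a minimal Python sketch (purely illustrative, all field names and the address are invented) of the single table a file system effectively maintains: path and name on one side, metadata plus the data location(s) on the other.

from dataclasses import dataclass, field

@dataclass
class FileRecord:
    size: int                                       # bytes
    owner: str
    mode: int                                       # access permissions
    locations: list = field(default_factory=list)   # block / extent addresses on disk

catalog = {
    "/etc/passwd": FileRecord(size=2048, owner="root", mode=0o644,
                              locations=[0x3A0F00]),   # address is made up
}

def stat(path):
    # Look up the single "row" the file system keeps for a path.
    return catalog[path]

print(stat("/etc/passwd"))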

How Does it Work? (conceptual)
Typically based on B-Trees
–Optimized search tree
–Reading/writing is the dominant cost, not comparing keys
–Self-balancing
Attributes are stored in INodes
Small files are typically also stored directly inside INodes
Large files are split up into clusters or, more recently, extents
Extents for large files are allocated with rather complex algorithms
–These are usually the algorithms that make or break the FS
–The pattern in which data is stored determines how fast it can be read later
–Most of the tuning parameters of an FS live here
Metadata: all the other information that is necessary to describe the layout of your data
–Beware: this can eat a significant amount of your disk space!
[Diagram: root node with directory metadata (Dir1, Dir2, foo, foo/bar1, foo/bar2, foo/myFile) pointing to an INode whose extents (Extent1, Extent2, Extent3) hold the actual data]
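
A small Python sketch of the structures just described, under a deliberately simplified layout (these type and field names are invented, not any real file system's on-disk format): attributes live in an inode, small files are inlined into the inode, and large files point to extents.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Extent:
    start_block: int    # first physical block of a contiguous run
    length: int         # number of blocks in the run

@dataclass
class Inode:
    size: int
    owner: str
    inline_data: Optional[bytes] = None              # small files live inside the inode
    extents: list = field(default_factory=list)      # large files point to extents

def read_file(inode, read_block):
    # Return the file contents either from the inline data or by walking the extents.
    if inode.inline_data is not None:
        return inode.inline_data[:inode.size]
    data = b"".join(read_block(blk)
                    for ext in inode.extents
                    for blk in range(ext.start_block, ext.start_block + ext.length))
    return data[:inode.size]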

How Does it Work? (rough example)
Usually things are not so neatly organized
[Diagram: the same directory tree, but myFile1 (Extent 1) and myFile2 (Extent 1, Extent 2) are scattered across the disk]
File fragmentation is still a big issue:
–Writing many files at the same time
–At approximately the same speed
–For a long time
→ Will bring your file system to its knees
Luckily our data is mostly transient
If you ever plan on having a central log server: beware!
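
The fragmentation problem is easy to reproduce with a toy allocator. In this sketch (a deliberately naive "next free block" policy, not a real FS algorithm), several files growing at the same speed end up with perfectly interleaved, i.e. maximally fragmented, block lists.

def allocate_interleaved(n_files=4, blocks_per_file=6):
    # All files grow in lock-step; each append simply takes the next free block.
    next_free = 0
    layout = {f"file{i}": [] for i in range(n_files)}
    for _ in range(blocks_per_file):
        for blocks in layout.values():
            blocks.append(next_free)
            next_free += 1
    return layout

for name, blocks in allocate_interleaved().items():
    print(name, blocks)        # file0 gets blocks 0, 4, 8, ...: nothing is contiguous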

Scaling it up: Distributed File Systems
File systems are not limited to single storage units
[Diagram: the directory tree with myFile1 (Extent 1) and the extents of foo/myFile2 (Extent 1, 2, 3) spread over Disk 1, Disk 2 and Disk 3]

Scaling it up: Distributed File Systems
Not even to the same computer
[Diagram: Computer 1 holds the metadata on an SSD; the extents of foo/myFile2 sit on Disk 1 and Disk 2 attached to Computer 2; the two are connected by a high-speed network]
Metadata operations are the biggest bottleneck
–They are also the most critical ones
–Corrupted metadata can potentially destroy all your data: the data is still there, but without the structural information it has no meaning anymore
–You want to be sure that the metadata is on disk and up to date before you write any actual data (see the sketch below)
–Put it on super fast disks, on a very fast computer
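
A minimal sketch of that ordering rule, assuming a simple setup where the metadata lives in its own file (ideally on the fast SSD): the metadata record is flushed and fsync'd before the first data byte is written. Paths and the record format are illustrative only.

import os

def append_with_metadata_first(meta_path, data_path, payload):
    # 1. Record the intended change in the metadata store and force it to disk.
    with open(meta_path, "a") as meta:
        meta.write(f"append {data_path} {len(payload)} bytes\n")
        meta.flush()
        os.fsync(meta.fileno())          # metadata is now durable (on the fast SSD)
    # 2. Only then write the actual data blocks.
    with open(data_path, "ab") as data:
        data.write(payload)
        data.flush()
        os.fsync(data.fileno())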

Scaling it up: Distributed File Systems
While we are at it …
[Diagram: a metadata controller (Computer 1, SSD) and file system servers/clients (Computer 2 … Computer N) with their disk controllers, all on a high-speed network; access via NFS/Samba over Ethernet or SCSI over Fibre Channel]

Modern Cluster File Systems
Load Balancing:
–By distributing files over multiple disks, throughput and IOPS can be improved ALMOST infinitely
Tiered Storage:
–Can migrate data between fast (disk) and slow (tape) storage depending on usage patterns
Redundancy, Replication and Fault Tolerance:
–Can replicate data on the fly to cope with node or disk failures (not RAID!), or just to improve the speed of local access
De-duplication:
–Recognizes that files are identical and stores the content only once, with multiple pointers to the same data
Copy on Write / Snapshotting:
–If a de-duplicated file is modified, a new extent is created covering the modification, while the rest of the file is still shared (see the toy sketch below)
–A certain state of the FS can be locked; all future modifications then trigger the copy-on-write mechanism → versioned history of all files
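
As a toy illustration of the last two features, the following sketch (in-memory only, invented names) de-duplicates by content hash and performs a copy on write when a shared file is modified.

import hashlib

store = {}    # content hash -> [data, reference count]
files = {}    # file name    -> content hash

def write_file(name, data):
    digest = hashlib.sha256(data).hexdigest()
    if digest in store:                    # de-duplication: identical content stored once
        store[digest][1] += 1
    else:
        store[digest] = [data, 1]
    files[name] = digest

def modify_file(name, new_data):
    old = files[name]
    store[old][1] -= 1                     # copy on write: leave the shared copy untouched
    if store[old][1] == 0:
        del store[old]
    write_file(name, new_data)             # the modified content gets its own entry

write_file("a.dat", b"same bytes")
write_file("b.dat", b"same bytes")         # stored only once, refcount = 2
modify_file("b.dat", b"changed bytes")     # a.dat still sees the original content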

Unfortunately…
Still can't write efficiently into a single file from many sources → does this sound familiar to you?
In fact, everything that concerns metadata is essentially single-threaded (cluster-wide)
–Allocating more space to a file is a metadata operation
–There are tricks around this, but they only work up to a certain scale and usually make life "interesting"
There are also problems in the underlying storage layer
–Encapsulation used to be good, but it's not cutting it anymore → see modern IP/Ethernet devices
–The block device model (essentially arrays) has become quite a crutch
–File system and storage are not really communicating enough with each other
–Blatantly obvious if you ever had to set up and tune a multi-purpose RAID system

Our Problem
Many sources producing fragments of data that logically belong together
–For a change, I'm not talking about event building here
–Fully built and accepted events belonging to a lumi section / block / run
What we ideally would like to do:
[Diagram: every compute node writes its events directly into shared stream files (StreamFile 1 … StreamFile N) on a cluster FS]

Our Solution
What we are actually doing:
[Diagram: each compute node writes StreamFile 1, StreamFile 2, … to its local FS; an aggregation + copy process merges them into aggregated stream files on a cluster FS and copies them to CASTOR]
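
A rough sketch of what such an aggregation + copy step could look like (paths, file naming and the CASTOR transfer itself are placeholders, not the actual LHCb code): concatenate the per-node pieces of a stream into one file on the cluster FS, which a separate process then copies to CASTOR.

import glob
import shutil

def aggregate_stream(stream_name, node_dirs, out_path):
    # Concatenate <node>/<stream_name>* from every farm node into one file on the cluster FS.
    with open(out_path, "wb") as out:
        for node_dir in sorted(node_dirs):
            for part in sorted(glob.glob(f"{node_dir}/{stream_name}*")):
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, out)
    # A separate copy process would then transfer out_path to CASTOR.

# aggregate_stream("StreamFile1", ["/farm/node001", "/farm/node002"],
#                  "/clusterfs/StreamFile1.aggregated")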

Deferred Processing
Farm nodes usually come with a little bit of local storage
(A little bit of storage) × (1000+ machines) = a lot
–LHCb farm currently has > 1 PB of local storage
Accelerator duty cycle < 100%
Store data which we currently can't process on the local FS and process it in the inter-fill gap / technical stop
–LHCb: > 30% gain in physics performance
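
A conceptual sketch of the deferred-processing idea (all paths and function names here are hypothetical, not LHCb's implementation): events that cannot be processed while the beam is on are parked on the node's local disk, and the backlog is drained during the inter-fill gap or a technical stop.

import os
import pickle
import uuid

BUFFER_DIR = "/localdisk/deferred"          # illustrative path on a farm node

def defer(event):
    # Park an event on the local file system for later processing.
    with open(os.path.join(BUFFER_DIR, f"{uuid.uuid4()}.evt"), "wb") as f:
        pickle.dump(event, f)

def drain_backlog(beam_is_on, process):
    # Run during the inter-fill gap or a technical stop; stop if beam comes back.
    for name in os.listdir(BUFFER_DIR):
        if beam_is_on():
            break
        path = os.path.join(BUFFER_DIR, name)
        with open(path, "rb") as f:
            process(pickle.load(f))
        os.remove(path)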

What happens if?
One machine dies
–Data is still intact
–Problem is fixed much later
–Machine comes back, but everybody else is done → annoying
Local hard disk dies
–Data is now gone
–Do we care? It's only a small fraction
–On the other hand, we suddenly have a lot more opportunities for disks to break
Some machines are faster than others

Unified Farm File System
What happens when one machine dies? Local hard disk dies?
→ Use replication to cover failures
–Replication takes up a lot of space though
–Software RAID 5/6 would be possible, but not across 100s of disks
Some machines are faster than others
→ Start pulling in files from other machines

Distributed RAID File System
FUSE to the rescue
–File System in User Space
–Makes it easy to write highly specialized file systems
–NTFS on Linux, SSHFS, SNMPFS, CVMFS, … are based on FUSE
Currently under investigation in LHCb (see the sketch below):
–File system where file extents are actually RAID chunks
–Can configure how many chunks per file; don't need to stripe over all disks
–Reading a file: read from neighbour machines in the same rack
–Disk/machine failure: rebuild data from parity on the fly while reading
[Diagram: chunks (Chunk 1, Chunk 2, Chunk 3, Parity) of a file spread across the local disks of several compute nodes]
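
A minimal sketch of the parity arithmetic behind such a layout, assuming a RAID-5-like scheme with a single XOR parity chunk (this illustrates the principle only, not the FUSE file system under investigation): an extent is split into data chunks plus parity, and any one lost chunk can be rebuilt from the survivors while reading.

def xor_chunks(chunks):
    # Byte-wise XOR of equally sized chunks.
    out = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            out[i] ^= b
    return bytes(out)

def encode(extent, n_data=3):
    # Split an extent into n_data equally sized data chunks plus one XOR parity chunk.
    size = -(-len(extent) // n_data)         # ceiling division
    chunks = [extent[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(n_data)]
    return chunks + [xor_chunks(chunks)]

def rebuild(chunks, missing):
    # Reconstruct one missing chunk (data or parity) from all the others.
    return xor_chunks([c for i, c in enumerate(chunks) if i != missing])

chunks = encode(b"event data spread over the rack")
assert rebuild(chunks, 1) == chunks[1]       # a chunk lost with its disk can be recovered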

FOR THE FUTURE

Object Storage Devices (OSD)
A method for solving the metadata bottleneck
Currently a bit overhyped
–We'll have to see what can be salvaged from the hype once it actually spreads a bit more
Move a lot of the low-level metadata operations into the storage device itself
–Create / destroy / allocate
–A storage device can be a single disk
–Can also be a small set of disks
A little bit like a file system on top of a set of other, simple file systems
–The master FS creates an object on a particular sub-FS or set of sub-FSs
–It delegates control of this object to the sub-FS

OSD: How does it work?
Clients ask the metadata controller to create an object
This object can have many different properties (sketched below):
–Redundancy scheme: replication (how many copies) or RAID-like redundancy (how many disks)
–Bandwidth guarantees: assign a corresponding number of disks
–Parallel object: create a component for every client in a parallel stream
–How the object will be accessed
–Checksums
All the tuning you usually have to set up for a whole file system can now be done for individual files (this is a huge hassle in classical systems)
[Diagram: client nodes send a create request to the metadata controller, which creates the object components on the object stores]
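
To make the idea concrete, here is a hypothetical per-object create request in Python; none of these field names come from a real OSD protocol, they simply mirror the property list above (redundancy, bandwidth, parallel components, access pattern, checksums).

from dataclasses import dataclass

@dataclass
class ObjectCreateRequest:
    name: str
    replicas: int = 1               # replication: how many copies
    raid_width: int = 0             # RAID-like redundancy: stripe over this many devices
    bandwidth_mb_s: int = 0         # bandwidth guarantee -> controller assigns enough disks
    parallel_components: int = 1    # one component per writer in a parallel stream
    access_pattern: str = "sequential"
    checksum: str = "crc32c"

def create_object(meta_controller, request):
    # The metadata controller picks suitable object stores and creates the components;
    # afterwards the clients talk to those stores directly.
    stores = meta_controller.pick_stores(request)
    for store in stores:
        store.create_component(request.name)
    return stores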

OSD: How does it work?
Reading and writing are now managed by the object stores directly
The meta server does not have to be bothered anymore
Each object store has its own local meta controller
If a client wants to know the status of an object, the meta controller retrieves a summary from the sub-stores
[Diagram: client nodes read / write / allocate directly on the object stores; the central metadata controller is no longer involved]

How will it benefit us? DAQ example
An object is created for parallel writing
Every compute node gets its own slice of the DAQ object on a nearby storage element (can even be local)
No need to coordinate with a meta controller beyond the creation of the original object
The object has read access defined as concatenation with arbitrary ordering
A client (i.e. the CASTOR copy process) reads the data inside the object as a continuous file from the different storage elements
[Diagram: compute nodes write their slices of the DAQ object in parallel; a reader consumes them as one concatenated stream]
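
A toy sketch of that DAQ pattern with in-memory stand-ins for the object stores (names and ordering policy are invented): each compute node appends to its own component without touching a metadata controller, and a reader consumes the object as the concatenation of all components.

components = {}     # node name -> list of byte blobs (that node's slice of the object)

def write_slice(node, payload):
    # Each compute node appends only to its own component; no metadata controller involved.
    components.setdefault(node, []).append(payload)

def read_object(order=None):
    # The reader (e.g. the copy process to CASTOR) sees the concatenation of all components.
    for node in (order or sorted(components)):
        for payload in components[node]:
            yield payload

write_slice("node001", b"events 1-100 ")
write_slice("node002", b"events 101-200 ")
print(b"".join(read_object()))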

Other Benefits
Fixes a lot of the problems that are currently caused by the block device model
–Don't have to rebuild a lot of free space on a RAID disk anymore
–Tuning can be adjusted to the underlying storage, and individually per file, depending on access patterns
More inherent redundancy in case of individual component failure
–pNFS is actually made for this kind of storage model
–Built-in high-availability mechanics
Data can be moved closer to a client in the background if necessary
File systems that currently support OSDs, or soon will:
–PanFS (Panasas) → appliance based on OSD
–Lustre: Soon™ and/but OSS
–pNFS
–glusterFS

Closing Remarks & Things to Keep in Mind
Expect to see a lot more NFS/Samba-like file systems in the future
–SANs are disappearing and are being replaced with specialized NAS (Network Attached Storage) appliances
–Not only because FC is essentially an extortion racket
–High-speed Ethernet can easily keep up with FC nowadays
–NAS boxes are pre-tuned to the applications that you want to run on them → less work for users
The OSD approach will probably make storage much more distributed
–Eventually we might not even have to buy dedicated storage anymore
–Could use local disks in our filter farms instead

Thanks
Ulrich Fuchs, Roberto Divia, Wainer Vandelli, Gerry Bauer