File System I/O
Andrew Hanushevsky
First INFN International School on Architectures, tools and methodologies for developing efficient large scale scientific computing applications
Ce.U.B., Bertinoro, Italy, 12-17 October 2009

Goals
• Sensitize you to file system limitations
  – How I/O choices make or break performance
• Show what to do and what not to do
  – Keeping performance high
• How to broadly translate advice
  – Databases and frameworks
10/16/2009, Andrew Hanushevsky

Disk Mechanics
• Disk surface is divided into sectors
  – Usually 512 bytes
• An I/O operation requires that the disk
  – Move the head to the right circular track (seek time)
  – Wait until the proper sector arrives (rotational delay)
  – Then transfer the data
[Figure: disk rotation]

Mechanical Devices Are Slow

Characteristic         Seagate Barracuda 180   Seagate Cheetah X15-36LP   Seagate Barracuda 36ES
Type                   High Capacity           High Performance           Desktop
Capacity               181.6 GB                36.7 GB                    18.4 GB
Min seek time          0.8 ms                  0.3 ms                     1.0 ms
Avg seek time          7.4 ms                  3.6 ms                     9.5 ms
Spindle speed          7200 rpm                15K rpm                    7200 rpm
Avg rotational delay   4.17 ms                 2 ms                       4.17 ms
Max transfer rate      160 MB/s                MB/s                       25 MB/s
Sector size

Source: ftp://ftp.prenhall.com/pub/esm/sample_chapters/engineering_computer_science/stallings/coa6e/pdf/ch6.pdf

Slowness In Perspective
[Figure: timeline for the Seagate Barracuda 180 showing channel wait time (cwt), seek time 7.4 ms, rotational delay 4.17 ms, and data transfer (x)]
• Reading 64K requires, on average, 11.77 ms, excluding channel wait time (cwt)
  – Actual data transfer (x) occupies the disk for only 1.7% of the total time
  – You need to read almost 2 MB to achieve 50% channel utilization!
• The faster Cheetah drive completes the same read in 5.652 ms, about 48% of the Barracuda's time
  – But channel utilization drops to less than 1%, requiring a read of 3 MB for 50% channel utilization
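
The arithmetic on this slide can be reproduced with a few lines of C. This is only a sketch under stated assumptions: the function names are made up for illustration, and the 0.20 ms transfer time is back-computed from the slide's 11.77 ms total (11.77 - 7.4 - 4.17) rather than derived from the table's transfer rate.

```c
/* Illustrative helpers (hypothetical names, not from the slides). */

/* Average time in ms to service one request:
   seek + rotational delay + data transfer. */
double access_time_ms(double seek_ms, double rot_ms, double xfer_ms)
{
    return seek_ms + rot_ms + xfer_ms;
}

/* Fraction of the service time spent actually moving data. */
double channel_utilization(double seek_ms, double rot_ms, double xfer_ms)
{
    return xfer_ms / access_time_ms(seek_ms, rot_ms, xfer_ms);
}

/* Request size in bytes at which transfer time equals positioning time,
   i.e. 50% utilization. The rate is given in bytes per millisecond
   (160 MB/s is taken here as 160000 bytes/ms, an assumption about units). */
double bytes_for_half_utilization(double seek_ms, double rot_ms,
                                  double bytes_per_ms)
{
    return (seek_ms + rot_ms) * bytes_per_ms;
}
```

With the Barracuda 180 figures, access_time_ms(7.4, 4.17, 0.20) gives the slide's 11.77 ms, channel_utilization(7.4, 4.17, 0.20) is about 0.017, and bytes_for_half_utilization(7.4, 4.17, 160000.0) is about 1.85 million bytes, in line with "almost 2 MB".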

File System Mechanics
• File system groups N sectors into an I/O unit
  – Usually 8 to 256 sectors (4K to 128K, sometimes more)
• Data is always read and written in I/O units, or blocks
  – Simplifies mapping files into memory
    • This is why a block size is typically a multiple of the page size
• Data, in unit sizes, is cached in memory
  – Speeds future access to data within the block
• Additional subsequent blocks may be pre-read
  – In the hope that they will be wanted in the future
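
To see these sizes on a real system, one option is to ask the OS directly. The sketch below uses stat() and sysconf(), both standard POSIX calls; the function names are hypothetical. Note that st_blksize is the preferred I/O size the file system reports, which is assumed here to stand in for the I/O unit described above and may differ from the on-disk allocation unit.

```c
#include <sys/stat.h>
#include <unistd.h>

/* Preferred I/O block size reported for the file system holding `path`,
   or -1 if stat() fails. */
long fs_block_size(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_blksize;
}

/* Size of a memory page in bytes, or -1 on error. */
long page_size(void)
{
    return sysconf(_SC_PAGESIZE);
}

/* Returns 1 if the block size is a whole multiple of the page size. */
int block_is_page_multiple(long block, long page)
{
    return block > 0 && page > 0 && block % page == 0;
}
```

For example, block_is_page_multiple(fs_block_size("."), page_size()) checks the relationship for the current directory's file system.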

File System & Slowness
• File system tries to hide disk slowness
  – Memory caching to avoid disk I/O
    • Also done in high-end disk controller caches
  – Pre-reading to keep channel utilization high
    • Done in the background to minimize impact
    • Also done in some high-end RAID disk controllers
  – Offset ordering
    • Reduces seek time
    • Also done in high-end disk controllers

File System Performance Varies

Operation                                          Ext3           Ext4           Improvement
Creation of eight 1 GB files                       155.9 sec      145.1 sec      6.9 %
Write speed                                        55.4 MB/sec    59.3 MB/sec    7.0 %
Deletion of eight 1 GB files                       11.87 sec      0.33 sec       97.2 %
10,000 random reads and writes in an 8 GB file     80.0 ops/sec   88.7 ops/sec   10.9 %

What This Implies
• FS performance  Disk performance
• Behavior of the application is the determinant
  – How much application data per I/O request?
  – Sequential access?
  – Random access?
    • What is the r/w cycle length?
    • How many different blocks will be hit before a block revisit?
• All of these have a profound effect
  – Independent of file system or disk device
    • These might make it a little better or worse

Effect of I/O Request Size
• Recall that the FS reads and writes data in blocks
  – Assume a block is 4K; then reading…
    • 512 bytes = 12.5% efficiency
    • 1024 bytes = 25.0% efficiency
    • 2048 bytes = 50.0% efficiency
    • 4096 bytes = 100% efficiency
  – Application should try to keep efficiency high
    • Each read should be as large as possible
    • Subsequent reads should use cached data
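
The efficiency figures above are the requested bytes divided by the bytes the file system has to move. A minimal sketch, assuming the request starts on a block boundary (the function name is hypothetical):

```c
#include <stddef.h>

/* Fraction of transferred bytes that the application actually asked for,
   assuming the request begins on a block boundary. */
double request_efficiency(size_t request, size_t block)
{
    if (request == 0 || block == 0)
        return 0.0;
    /* The file system moves whole blocks, so round the request up. */
    size_t blocks = (request + block - 1) / block;
    return (double)request / (double)(blocks * block);
}
```

With a 4096-byte block, request_efficiency(512, 4096) is 0.125 and request_efficiency(4096, 4096) is 1.0, matching the list. A request that is not block aligned can touch one extra block, so its efficiency would be lower than this estimate.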

Effect of Sequential Access
• This is the easiest for FS and disk
  – Large disk devices are inherently sequential
• However, your app is not alone
  – Application interleaving produces random I/O
    • FS read-ahead attempts to alleviate some of this
• Each sequential request should be large
  – 1-4 MB per request usually works best
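
A sequential reader that follows this advice might look like the sketch below. It uses plain POSIX open() and read(); the function name and the 1 MiB chunk size (the low end of the 1-4 MB range above) are illustrative choices, not something the slides prescribe.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (1024 * 1024)  /* 1 MiB per read request (assumed value) */

/* Read a whole file front to back in CHUNK_SIZE requests.
   Returns the total number of bytes read, or -1 on error. */
long long read_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char *buf = malloc(CHUNK_SIZE);
    if (buf == NULL) {
        close(fd);
        return -1;
    }

    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0) {
        /* Application processing of buf[0..n) would go here. */
        total += n;
    }

    free(buf);
    close(fd);
    return n < 0 ? -1 : total;
}
```

The point of the large buffer is that each read() call hands the file system a request big enough to amortize the positioning cost, instead of many small requests.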

Effect of Random I/O
• Random I/O is a performance destroyer
  – True for mechanical disks but not for SSDs
• Cycle length is important
[Figure: read offsets cycling through 7 pages with a 6-page cache. Cycle length of 7 pages > cache size of 6 pages: the page containing offset 0 is replaced, so reading offset 2048 requires a re-read!]
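
The effect in the figure can be reproduced with a small simulation. The sketch below assumes least-recently-used (LRU) replacement, which the slide does not state; the function name and the MAX_FRAMES limit are hypothetical.

```c
#define MAX_FRAMES 64  /* arbitrary upper bound for this sketch */

/* Count cache hits when `accesses` page reads cycle through pages
   0..cycle_len-1 in order, using an LRU cache holding `frames` pages.
   Returns -1 for invalid arguments. */
long lru_hits_cyclic(int cycle_len, int frames, long accesses)
{
    int page[MAX_FRAMES];        /* page held by each frame, -1 if empty */
    long last_used[MAX_FRAMES];  /* access index of the frame's last use */
    long hits = 0;

    if (frames < 1 || frames > MAX_FRAMES || cycle_len < 1)
        return -1;
    for (int i = 0; i < frames; i++) {
        page[i] = -1;
        last_used[i] = -1;
    }

    for (long t = 0; t < accesses; t++) {
        int want = (int)(t % cycle_len);
        int found = -1;
        int victim = 0;  /* least recently used frame seen so far */
        for (int i = 0; i < frames; i++) {
            if (page[i] == want) {
                found = i;
                break;
            }
            if (last_used[i] < last_used[victim])
                victim = i;
        }
        if (found >= 0) {
            hits++;
            last_used[found] = t;
        } else {
            page[victim] = want;  /* miss: replace the LRU page */
            last_used[victim] = t;
        }
    }
    return hits;
}
```

Under this assumption, a 7-page cycle against a 6-page cache never hits: each page is evicted just before it is needed again. Growing the cache to 7 pages, or shortening the cycle to 6, turns every access after the first pass into a hit.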

Contrived Example?
• Yes, memory caches are very large today
  – Typically 1,048,576 pages (4 GB)
• Works great if your application is alone
  – Not true for multi-core batch nodes
    • Can shrink to 131,072 pages (500 MB) for 8 cores
    • Effective size per running job
  – Definitely not true for file servers
    • They serve thousands of simultaneous clients

Advising The File System
• Can use posix_fadvise()
  – Tells the file system how the app will access data
    • Starting at an offset for some number of bytes
    • Allows the file system to better manage the memory cache
  – Few OSes support this API
    • Not present in MacOS X 10.3, FreeBSD 6.0, NetBSD 3.0, OpenBSD 3.8, AIX 5.1, HP-UX 11, IRIX 6.5, OSF/1 5.1, Solaris 10, Cygwin 1.5.x, mingw, Interix 3.5, BeOS
    • Present in others but ignored (e.g., OpenSolaris)
  – So, for now, consider it Linux specific
    • Or using HP/UX

posix_fadvise() Details
#include <fcntl.h>
int posix_fadvise(int fd, off_t offset, off_t len, int advice);
Advice:
  POSIX_FADV_NORMAL       Standard processing
  POSIX_FADV_SEQUENTIAL   Doubles the read-ahead size for entire file
  POSIX_FADV_RANDOM       Disables read-ahead for entire file
  POSIX_FADV_WILLNEED     Initiates block read for specified byte range (also see readahead())
  POSIX_FADV_DONTNEED     Discards cached file pages in specified byte range
  POSIX_FADV_NOREUSE      Data will not be used again
                          Problematic: some OSes support this, some ignore it (e.g., Linux)
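
A short usage sketch for the call above. The wrapper names are hypothetical; posix_fadvise() itself and the advice constants are standard POSIX. Because the advice is only a hint, this sketch treats a failing posix_fadvise() as non-fatal and just reports its return value.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L  /* expose posix_fadvise() under strict modes */
#endif
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a file for reading and hint that it will be read sequentially.
   Returns the descriptor, or -1 if open() fails. If advice_err is not NULL
   it receives the posix_fadvise() result: 0 on success, otherwise an error
   number (returned directly, not through errno). */
int open_sequential(const char *path, int *advice_err)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /* offset 0 with len 0 covers the file from the start to its end. */
    int rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (advice_err != NULL)
        *advice_err = rc;
    return fd;
}

/* Hint that cached pages for the given byte range are no longer needed.
   Returns the posix_fadvise() result. */
int drop_cached_range(int fd, off_t offset, off_t len)
{
    return posix_fadvise(fd, offset, len, POSIX_FADV_DONTNEED);
}
```

A typical pattern would be to call open_sequential() once, read the file in large requests, and call drop_cached_range() on ranges that will not be revisited, so the cache is left for data that will be.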

If It Were So Simple...
• Application data framework complications
  – Databases like MySQL
  – Persistency frameworks like ROOT
    • Most HEP applications use one or more
  – Actual disk device is hidden
  – Hard if not impossible to directly apply advice
• What to do?

Databases & Performance
• Translate advice to schema development
  – Avoid wide tables when not needed
    • Increases payload if only some of the data is wanted
  – Use indices for sparsely accessed rows
    • Allows database to optimize access
  – Normalize the tables within reason
    • Keep related data together

Frameworks & Performance
• Know how the framework lays out data
  – This is the most difficult part
    • Consult framework experts
  – Carefully construct your data objects
    • Keep useful payload as large as possible
  – Be cognizant of any compression done by the framework
  – Cluster related payloads as much as possible
  – Avoid scattered references
    • This reduces widely spaced random reads

Conclusions
• Be aware you're dealing with mechanics
  – Disks are slow and unwieldy devices
• Overlap I/O and CPU as much as possible
  – Choose algorithms that make this possible
    • This also requires deft multi-threading
• Carefully lay out your data
  – Keep in mind the database and framework
    • Use the advice in this section