Improving Performance using the Linux IO Scheduler
Shaun de Witt (STFC)
ISGC 2016

Role of the Scheduler
- Optimise access to storage
- CPU operations take a few processor cycles (each cycle is < 1 ns)
- Seek operations take roughly 8 ms or more (around 25 million times longer)
- The scheduler reorders IO requests to minimise seek time
- Two main operations, sorting and merging (sketched below):
  - Sorting arranges block reads sequentially
  - Merging combines two adjacent block reads into one
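To make the two operations concrete, here is a minimal sketch in Python (block numbers and request lengths are invented): pending requests are sorted into ascending block order, and requests that touch adjacent blocks are merged into one larger request.

```python
# Toy illustration of the two core scheduler operations: sort pending block
# requests into ascending order, then merge requests that are adjacent on disk.
# Block numbers are made up for illustration only.

def sort_and_merge(requests):
    """requests: list of (start_block, length) tuples."""
    merged = []
    for start, length in sorted(requests):                  # sorting
        if merged and merged[-1][0] + merged[-1][1] == start:
            # this request starts where the previous one ends: merge them
            merged[-1] = (merged[-1][0], merged[-1][1] + length)
        else:
            merged.append((start, length))
    return merged

pending = [(900, 8), (100, 8), (108, 8), (500, 8)]
print(sort_and_merge(pending))   # [(100, 16), (500, 8), (900, 8)]
```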

Where the Scheduler Sits
[diagram of the Linux storage stack, © Werner Fischer & Georg Schonberger]

Available Schedulers
- The Linux 2.6 kernel provides four schedulers:
  - CFQ (default)
  - Deadline
  - Anticipatory
  - Noop
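The schedulers available for a block device can be listed through sysfs; the active one is shown in square brackets (e.g. "noop deadline [cfq]"). A small sketch, assuming a device named sda:

```python
# Read /sys/block/<device>/queue/scheduler to see which schedulers the kernel
# offers for a device and which one is currently active (shown in brackets).
# The device name "sda" is an assumption; adjust for your hardware.

def available_schedulers(device="sda"):
    with open(f"/sys/block/{device}/queue/scheduler") as f:
        entries = f.read().split()
    active = next((e.strip("[]") for e in entries if e.startswith("[")), None)
    return active, [e.strip("[]") for e in entries]

if __name__ == "__main__":
    active, all_scheds = available_schedulers("sda")
    print("active:", active, "available:", all_scheds)
```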

Completely Fair Queueing (CFQ)
- Each process issuing IO gets its own queue, and each queue gets its own timeslice
- The scheduler services each queue in round-robin order until the end of its timeslice
- Reads are prioritised over writes
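A highly simplified sketch of the round-robin idea, not the real CFQ implementation; the queue names and the per-visit budget (standing in for a time-based slice) are invented for illustration:

```python
# Caricature of CFQ's round-robin servicing: each queue is visited in turn and
# drained for (at most) one slice before the scheduler moves on.
# Real CFQ uses time-based slices; a per-visit request budget stands in here.

from collections import deque

def round_robin(queues, budget_per_visit=2):
    """queues: dict mapping owner -> deque of requests."""
    dispatched = []
    while any(queues.values()):
        for owner, q in queues.items():
            budget = budget_per_visit
            while q and budget > 0:
                dispatched.append((owner, q.popleft()))
                budget -= 1
    return dispatched

qs = {"proc-A": deque(range(6)), "proc-B": deque(range(3))}
print(round_robin(qs))
# [('proc-A', 0), ('proc-A', 1), ('proc-B', 0), ('proc-B', 1), ...]
```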

Deadline
- Three queues:
  - An elevator queue containing all requests, sorted by block address
  - A read FIFO queue containing reads in time order
  - A write FIFO queue containing writes in time order
- Each request in a FIFO queue has an 'expiration time' (500 ms for reads by default)
- Requests are normally served from the elevator queue, but once requests in a FIFO queue expire, that queue takes precedence (see the sketch below)
- Gives good average latency but can reduce global throughput
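A sketch of the dispatch decision under simplified assumptions (only the read FIFO is shown, and removing a served request from the other queue is omitted):

```python
# Deadline-style dispatch: serve from the sorted elevator queue unless the
# request at the head of the FIFO queue has passed its expiration time,
# in which case the FIFO takes precedence.

import heapq
from collections import deque

READ_EXPIRE_MS = 500   # default read deadline

def next_request(elevator, read_fifo, now_ms):
    """elevator: heap of (block, request); read_fifo: deque of (deadline_ms, request)."""
    # (removing the chosen request from the other structure is omitted for brevity)
    if read_fifo and read_fifo[0][0] <= now_ms:
        return read_fifo.popleft()[1]        # deadline passed: FIFO wins
    if elevator:
        return heapq.heappop(elevator)[1]    # otherwise seek-optimised order
    return None

elevator = [(120, "req-1"), (80, "req-2")]
heapq.heapify(elevator)
fifo = deque([(1000, "req-1"), (1400, "req-2")])
print(next_request(elevator, fifo, now_ms=900))    # req-2 (nothing has expired)
print(next_request(elevator, fifo, now_ms=1100))   # req-1 (its deadline has passed)
```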

Anticipatory
- Similar to the deadline scheduler, but with anticipation: it expects a subsequent read
- Read requests are serviced within their deadline, but the scheduler then pauses for ~6 ms, waiting for a read of the next block
- Works well when most reads are dependent (e.g. reading sequential blocks in a streaming operation)
- Can be poor for vector reads

NOOP
- The simplest scheduler
- Only does basic merging; no sorting
- Recommended for non-seeking devices (e.g. SSDs)

Hardware Set-up at RAL
- Disk servers configured in RAID6 or RAID60 with 3 partitions
  - Allows two independent disk failures without losing data
- Hardware RAID controllers, from various vendors depending on procurement
- Unusual: horizontal striping, to minimise RAID storage overhead

RAID Configuration
[diagram comparing a 'typical' RAID6 configuration with the RAL RAID6 configuration across filesystems FS1, FS2 and FS3]

Impact of Using Hardware RAID
- The RAID controller has a fairly big cache, so it can absorb writes quickly for long periods
- But CFQ now sees only one device (or one per filesystem)
- The scheduler sorts IO operations assuming a well-defined disk layout
- The RAID controller has a different layout and will re-sort them
- Not normally a significant overhead, until the IO operations become numerous and effectively random
- Then double sorting and the different cache sizes of the kernel and the RAID controller lead to massive inefficiencies
- Almost all application time is spent in IO wait (WIO)

What made RAL look at this

CMS Problems
- Accessing disk-only storage under LHC Run 2 load
- All disk servers (~20) running at 100% WIO
- Only certain workflows affected (pile-up)
- Job efficiencies running at <30%
  - Other Tier 1 sites also had problems, but fared much better (~60% CPU efficiency)
  - 'Normal' CPU efficiencies run at >95%

Solutions…
- Limit the number of transfers to each disk server
  - Jobs time out waiting for storage system scheduling
  - Means a few jobs run well, but most fail
- Limit the number of pile-up jobs on the processing farm
  - No reliable mechanism for identifying them
- Add more disk servers
  - Would work, but has cost implications

Finally…
- Try switching the scheduler…
  - Can be done and reverted 'on the fly' (sketched below); does not require a reboot
  - Can be applied to a single server to investigate the impact
- Preemptive CAVEAT: don't try this at home!
  - Software RAID and different RAID controllers will behave differently
  - Only do this if either your storage system is already 'overloaded', or you have tested the impact with your hardware configuration
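A sketch of switching and reverting via sysfs. It assumes a device called sdb, needs root, and the change does not persist across a reboot:

```python
# Switch the I/O scheduler for a block device by writing its name into
# /sys/block/<device>/queue/scheduler. Requires root; not persistent.
# The device name "sdb" is an assumption.

SYSFS = "/sys/block/{dev}/queue/scheduler"

def set_scheduler(device, scheduler):
    path = SYSFS.format(dev=device)
    with open(path) as f:
        before = f.read().strip()
    with open(path, "w") as f:
        f.write(scheduler)                 # e.g. "noop"
    with open(path) as f:
        after = f.read().strip()
    print(f"{device}: {before} -> {after}")

# set_scheduler("sdb", "noop")   # switch
# set_scheduler("sdb", "cfq")    # revert
```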

Anecdotal Impact
- Switched to the NOOP scheduler
- WIO reduced from 100% to 65%
- Throughput increased from 50 Mb/s to 200 Mb/s
- Based on making the change to one server under production workload
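One simple way to watch this kind of change on a server is to sample the iowait field of /proc/stat over an interval; a rough sketch (the interval length is arbitrary):

```python
# Fraction of CPU time spent in I/O wait over a sampling interval, taken from
# the aggregate "cpu" line of /proc/stat (field 5 is iowait, in jiffies).

import time

def cpu_iowait_fraction(interval_s=5.0):
    def snapshot():
        with open("/proc/stat") as f:
            fields = [int(x) for x in f.readline().split()[1:]]
        return sum(fields), fields[4]      # total jiffies, iowait jiffies
    total0, wio0 = snapshot()
    time.sleep(interval_s)
    total1, wio1 = snapshot()
    return (wio1 - wio0) / (total1 - total0)

print(f"iowait: {cpu_iowait_fraction():.1%}")
```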

Anecdotal Impact

Attempt to Quantify
- Heavy WIO reproduced by:
  - Multiple jobs reading/writing whole files (1 GB of random data)
  - Different read/write mixes
  - Using the xrootd protocol
  - 200 different files for reading (to remove any effect of read caching)
- A rough load-generator sketch follows
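A rough sketch of this kind of load generator. It uses plain local file I/O rather than the xrootd protocol, and the paths, file counts and job counts are illustrative only:

```python
# Reproduce a heavy, mixed read/write load: writers stream 1 GiB files of
# random data, readers read whole files back; the read/write mix is a knob.
# Local POSIX I/O stands in for xrootd here; all paths are illustrative.

import os
import random
from multiprocessing import Process

FILE_SIZE = 1 << 30          # 1 GiB
CHUNK = 1 << 20              # 1 MiB

def write_job(path):
    with open(path, "wb") as f:
        for _ in range(FILE_SIZE // CHUNK):
            f.write(os.urandom(CHUNK))

def read_job(path):
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass

def run_mix(read_files, n_jobs=100, read_fraction=0.5, scratch="/tmp/iotest"):
    os.makedirs(scratch, exist_ok=True)
    jobs = []
    for i in range(n_jobs):
        if random.random() < read_fraction:
            jobs.append(Process(target=read_job, args=(random.choice(read_files),)))
        else:
            jobs.append(Process(target=write_job, args=(f"{scratch}/out-{i}.dat",)))
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()

# Example: a 50-50 mix over 200 pre-generated input files (paths are made up)
# run_mix([f"/data/in-{i:03d}.dat" for i in range(200)], n_jobs=100, read_fraction=0.5)
```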

50-50 Job Mix

50-50 Job Mix (100 jobs)
[charts comparing CFQ and NOOP]

Read Dominated (80-20) Job Mix

Read-Dominated Job Mix
[charts comparing CFQ and NOOP]

Write-Dominated (20-80) Job Mix

Write-Dominated Job Mix
[charts comparing CFQ and NOOP]

Summary & Conclusion
- Attempted controlled tests showed no significant benefit from using NOOP over CFQ
  - The ATLAS problems bear this out
- … or, to spin the results positively: no detriment was observed from using NOOP instead of CFQ
- Left with the question: what is it about the pile-up jobs that has such a bad impact on the disk servers, and why does NOOP give such a big gain there?
- Questions? (Because I still have many!)