Tuning the Guts. Dennis Shasha and Philippe Bonnet, 2013

Outline
Configuration
- IO stack: SSDs and HDDs, RAID controller, Storage Area Network, block layer and file system, virtual storage
- Multi-core
- Network stack
Tuning
- Tuning IO priorities
- Tuning virtual storage
- Tuning for maximum concurrency
- Tuning for RAM locality
- The priority inversion problem
- Transferring large files
- How to throw hardware at a problem

IO Architecture (diagram)
Components shown: Processor [core i7], RAM (byte addressable) on the memory bus, Southbridge chipset [z68], PCI Express SSD, SATA ports (SSD, HDD), RAID controller attached over PCI; the drives are block addressable.
Link bandwidths shown: 16 GB/sec, 2x21 GB/sec, 5 GB/sec, 3 GB/sec.
LOOK UP: Smart Response Technology (SSD caching managed by z68)

IO
Exercise 3.1: How many IOs per second can a core i7 processor issue? (Assume that the core i7 performs at 180 GIPS and that it takes a fixed number of instructions per IO.)
Exercise 3.2: How many IOs per second can your laptop CPU issue? (Look up the MIPS number associated with your processor.)
Exercise 3.3: Define the IO architecture for your laptop/server.
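For Exercise 3.1, a minimal back-of-the-envelope sketch in Python. The instructions-per-IO figure is an assumption chosen for illustration (the slide's own number did not survive the transcript), so treat the result as the shape of the calculation, not as a benchmark.

    # Upper bound on IO submission rate: instruction throughput / cost per IO.
    GIPS = 180                    # core i7 throughput assumed in the exercise
    INSTRUCTIONS_PER_IO = 50_000  # hypothetical cost of issuing one IO

    ios_per_second = (GIPS * 1e9) / INSTRUCTIONS_PER_IO
    print(f"Upper bound: {ios_per_second:,.0f} IOs per second")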

Hard Drive (diagram)
Components shown: controller, disk interface, read/write head, disk arm, actuator, spindle, platter, tracks.

Solid State Drive (diagram; example on a disk with 1 channel and 4 chips)
- The controller maps a logical address space onto the physical address space of the flash memory array (chips 1-4 organized in channels), handling scheduling & mapping, wear leveling, and garbage collection.
- Flash operations on a chip: read, program, erase. A read decomposes into command, page read, and page transfer; a write requires a page program.
- Four parallel reads are channel bound; four parallel writes are chip bound (the page program dominates).

RAID Controller (diagram)
- Sits behind the host bus adapter and PCI bridge, between CPU/RAM and the disks.
- Caching: write-back / write-through, protected by batteries.
- Logical disk organization: JBOD or RAID.

RAID
Redundant Array of Inexpensive Disks
- RAID 0: striping [n disks]
- RAID 1: mirroring [2 disks]
- RAID 10: each stripe is mirrored [2n disks]
- RAID 5: floating parity [3+ disks]
Exercise 3.4: A - What is the advantage of striping over magnetic disks? B - What is the advantage of striping over SSDs?
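To make the layouts concrete, a small sketch (Python, not from the slides) that maps a logical block number to a (disk, stripe) location under RAID 0 striping and under one common rotating-parity placement for RAID 5; other parity placements exist.

    def raid0_map(lba, n_disks):
        """RAID 0: block i lands on disk i mod n, at stripe i div n."""
        return lba % n_disks, lba // n_disks

    def raid5_map(lba, n_disks):
        """RAID 5, illustrative rotating-parity layout: each stripe holds
        n-1 data blocks plus one parity block, and the parity disk moves
        from stripe to stripe."""
        data_per_stripe = n_disks - 1
        stripe = lba // data_per_stripe
        parity_disk = (n_disks - 1) - (stripe % n_disks)
        disk = lba % data_per_stripe
        if disk >= parity_disk:
            disk += 1  # skip the parity disk in this stripe
        return disk, stripe

    for lba in range(8):
        print(lba, "RAID 0 ->", raid0_map(lba, 4), "RAID 5 ->", raid5_map(lba, 4))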

Storage Area Network
A storage area network is one or more devices communicating via a serial SCSI protocol (such as FC, SAS or iSCSI). (Using SANs and NAS, W. Preston, O'Reilly)
SAN topologies:
- Point-to-point
- Bus: synchronous (Parallel SCSI, ATA), CSMA (Gb Ethernet)
- Arbitrated Loop (FC)
- Fabric (FC)

Case: TPC-C Top Performer (redo log configuration shown in figure; source link omitted)
- Total system cost: 30,528,863 USD
- Performance: 30,249,688 tpmC
- Total #processors: 108
- Total #cores: 1728
- Total storage: 1.76 PB
- Total #users: 24,300,000
LOOK UP: TPC-C OLTP Benchmark

IO (diagram of the IO stack)
DBMS IOs: asynchronous IO, direct IO.
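A minimal sketch of what direct IO looks like from an application (Python on Linux; the file name is a hypothetical pre-created test file of at least 4 KB on a filesystem that supports O_DIRECT, and the 4 KB alignment is the usual requirement). Asynchronous IO would additionally go through an interface such as libaio or io_uring, which is what the fio jobs later in the deck use.

    import os, mmap

    BLOCK = 4096
    path = "testfile"           # hypothetical test file, at least one block long

    # O_DIRECT bypasses the OS page cache; buffer, offset and size must be aligned.
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap yields a page-aligned buffer
    try:
        nread = os.readv(fd, [buf])  # one aligned block, straight from the device
        print(f"read {nread} bytes without going through the page cache")
    finally:
        buf.close()
        os.close(fd)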

Block Device Interface
Memory <- read(LBA)
Block device specific driver.
Linear space of logical block addresses: B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 … Bi Bi+1 …
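As an illustration of the read(LBA) abstraction, a short sketch (Python on Linux; the 512-byte logical block size and the /dev/sda device node are assumptions, and reading a raw block device normally requires root):

    import os

    LBA_SIZE = 512              # assumed logical block size
    device = "/dev/sda"         # hypothetical block device node (root required)

    def read_lba(fd, lba, count=1):
        """Return `count` logical blocks starting at logical block address `lba`."""
        return os.pread(fd, count * LBA_SIZE, lba * LBA_SIZE)

    fd = os.open(device, os.O_RDONLY)
    try:
        block = read_lba(fd, 0)  # LBA 0 usually holds the partition table
        print(len(block), "bytes read from LBA 0")
    finally:
        os.close(fd)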

Performance Contract
HDD
- The block device abstraction hides a lot of complexity while providing a simple performance contract:
  - Sequential IOs are orders of magnitude faster than random IOs.
  - Contiguity in the logical space favors sequential IOs.
SSD
- No intrinsic performance contract.
- A few invariants:
  - No need to avoid random IOs.
  - Applications should avoid writes smaller than a flash page.
  - Applications should fill up the device IO queues (but not overflow them) so that the SSD can leverage its internal parallelism.
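A rough way to check the contract on a given device is to time sequential versus random reads. The sketch below (Python, buffered IO, a hypothetical pre-created test file) only illustrates the two access patterns; the page cache will flatter both numbers, and the fio jobs at the end of the deck are the better measurement tool.

    import os, random, time

    path = "testfile"           # hypothetical pre-created file, e.g. 256 MB
    BLOCK = 4096
    n_blocks = os.path.getsize(path) // BLOCK
    sample = 10_000

    fd = os.open(path, os.O_RDONLY)
    try:
        t0 = time.perf_counter()
        for i in range(sample):                      # sequential pattern
            os.pread(fd, BLOCK, (i % n_blocks) * BLOCK)
        seq = time.perf_counter() - t0

        t0 = time.perf_counter()
        for _ in range(sample):                      # random pattern
            os.pread(fd, BLOCK, random.randrange(n_blocks) * BLOCK)
        rnd = time.perf_counter() - t0
    finally:
        os.close(fd)

    print(f"sequential: {seq:.2f}s  random: {rnd:.2f}s  ratio: {rnd/seq:.1f}x")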

Virtual Storage
- No host caching: the database system expects that the IOs it submits are transferred to disk as directly as possible. It is thus critical to disable file system caching on the host for the IOs submitted by the virtual machine.
- Hardware-supported IO virtualization: a range of modern CPUs incorporate components that speed up access to IO devices through direct memory access and interrupt remapping (i.e., Intel's VT-d and AMD's AMD-Vi). This should be activated.
- Statically allocated disk space: static allocation avoids overhead when the database grows and favors contiguity in the logical address space (good for HDDs).

Processor (figure omitted). Source: Principles of Computer System Design, Kaashoek and Saltzer.

Dealing with Multi-core (figure omitted)
SMP: Symmetric Multiprocessor. NUMA: Non-Uniform Memory Access.
LOOK UP: Understanding NUMA

Dealing with Multi-core (diagram: two sockets, each with two cores; each core has a CPU and an L1 cache, each socket a shared L2 cache; the system bus connects the sockets to RAM and to IO/NIC interrupts)
Cache locality is king:
- Processor affinity
- Interrupt affinity
- Spinning vs. blocking
LOOK UP: SW for shared multi-core, Interrupts and IRQ tuning

Network Stack (figure omitted)
The DBMS listens on a (host, port) pair.
LOOK UP: Linux Network Stack

DNS (figure omitted). Source: Principles of Computer System Design, Kaashoek and Saltzer.

Tuning the Guts
- RAID levels
- Controller cache
- Partitioning
- Priorities
- Number of DBMS threads
- Processor/interrupt affinity
- Throwing hardware at a problem

RAID Levels
Log file
- RAID 1 is appropriate: fault tolerance with high write throughput. Writes are synchronous and sequential; no benefit in striping.
Temporary files
- RAID 0 is appropriate: no fault tolerance, high throughput.
Data and index files
- RAID 5 is best suited for read-intensive apps.
- RAID 10 is best suited for write-intensive apps.

Controller Cache
Read-ahead:
- Prefetching at the disk controller level.
- No information on access pattern.
- Not recommended.
Write-back vs. write-through:
- Write-back: transfer terminated as soon as data is written to cache.
  - Batteries to guarantee write-back in case of power failure.
  - Fast cache flushing is a priority.
- Write-through: transfer terminated as soon as data is written to disk.

Partitioning
There is parallelism (i) across servers, and (ii) within a server, both at the CPU level and throughout the IO stack. To leverage this parallelism:
- Rely on multiple instances / multiple partitions per instance: a single database is split across several instances, and different partitions can be allocated to different CPUs (partition servers) / disks (partitions).
- Problem #1: How to control overall resource usage across instances/partitions?
  - Fix: static mapping vs. instance caging.
  - Control the number/priority of threads spawned by a DBMS instance.
- Problem #2: How to manage priorities?
- Problem #3: How to map threads onto the available cores?
  - Fix: processor/interrupt affinity.

Instance Caging
Allocating a number of CPUs (cores) or a percentage of the available IO bandwidth to a given DBMS instance.
Two policies:
- Partitioning: the total number of CPUs is partitioned across all instances.
- Over-provisioning: more than the total number of CPUs is allocated to all instances.
(Figure: against the machine's maximum core count, partitioning allocates, e.g., 2 + 2 + 1 + 1 CPUs to four instances, while over-provisioning allocates, e.g., 2 + 3 + 2 + 1 CPUs, exceeding the core count.)
LOOK UP: Instance Caging

Number of DBMS Threads
Given the DBMS process architecture, how many threads should be defined for:
- Query agents (max per query, max per instance): multiprogramming level (see tuning the writes and index tuning)
- Log flusher (see tuning the writes)
- Page cleaners (see tuning the writes)
- Prefetcher (see index tuning)
- Deadlock detection (see lock tuning)
Fix the number of DBMS threads based on the number of cores available at the HW/VM level:
- Partitioning vs. over-provisioning
- Provisioning for monitoring, back-up, and expensive stored procedures

Priorities
Mainframe OSs have allowed configuring thread priority as well as IO priority for some time. Now it is possible to set IO priorities on Linux as well:
- Threads associated with synchronous IOs (writes to the log, page cleaning under memory pressure, query agent reads) should have higher priority than threads associated with asynchronous IOs (prefetching, page cleaning with no memory pressure); see tuning the writes and index tuning.
- Synchronous IOs should have higher priority than asynchronous IOs.
LOOK UP: Getting Priorities Straight, Linux IO priorities
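A small sketch of how IO priorities can be set on Linux, using Python to drive the util-linux ionice tool (the PIDs and the class choices below are illustrative assumptions, not values from the slides):

    import subprocess

    def set_io_priority(pid, io_class, level=None):
        """Set the Linux IO scheduling class of a running process with `ionice`.
        io_class: 1 = realtime, 2 = best-effort, 3 = idle; level 0-7 applies
        to classes 1 and 2 only (0 is the highest)."""
        cmd = ["ionice", "-c", str(io_class)]
        if level is not None:
            cmd += ["-n", str(level)]
        subprocess.run(cmd + ["-p", str(pid)], check=True)

    # Hypothetical example: favor a thread doing synchronous IOs (log writer)
    # over a thread doing asynchronous IOs (prefetcher).
    set_io_priority(1234, io_class=2, level=0)  # log writer: best-effort, top level
    set_io_priority(5678, io_class=3)           # prefetcher: idle class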

The Priority Inversion Problem
(Figure: transaction states. T3 holds the lock on x and is running; T1 has requested the lock on x and is waiting; T2 is running. Priorities: T1 = #1, T2 = #2, T3 = #3.)
Three transactions T1, T2, T3 in priority order (high to low):
- T3 obtains the lock on x and is preempted.
- T1 blocks on the lock on x, so it is descheduled.
- T2 does not access x and runs for a long time.
Net effect: T1 waits for T2.
Solution:
- No thread priority, or
- Priority inheritance.

Processor/Interrupt Affinity
Mapping of a thread context or interrupt to a given core:
- Allows cache line sharing between application threads, or between an application thread and an interrupt handler (or even RAM sharing in NUMA).
- Avoid dispatching all IO interrupts to core 0 (which then dispatches software interrupts to the other cores).
- Should be combined with VM options.
- Especially important in a NUMA context.
- Affinity policy set at the OS level or at the DBMS level.
LOOK UP: Linux CPU tuning
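A minimal sketch of setting processor affinity from user space on Linux (Python's os.sched_setaffinity; the core numbers are illustrative, and interrupt affinity is a separate knob, typically written to /proc/irq/<n>/smp_affinity by root):

    import os

    # Pin the calling process (pid 0) to cores 2 and 3, e.g. to keep a query
    # agent on one NUMA node; the core choice is an assumption for illustration.
    os.sched_setaffinity(0, {2, 3})
    print("now allowed on cores:", sorted(os.sched_getaffinity(0)))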

Processor/Interrupt Affinity
- IOs should complete on the core that issued them:
  - IO affinity in SQL Server.
  - Log writers distributed across all NUMA nodes.
- Locking of a shared data structure across cores, and especially across NUMA nodes:
  - Avoid multiprogramming for query agents that modify data.
  - Query agents should be on the same NUMA node.
- DBMSs have pre-set NUMA affinity policies.
LOOK UP: Oracle and NUMA, SQL Server and NUMA

Transferring Large Files (with TCP)
With the advent of compute clouds, it is often necessary to transfer large files over the network when loading a database. To speed up the transfer of large files:
- Increase the size of the TCP buffers.
- Increase the socket buffer size (Linux).
- Set up TCP large windows (and timestamps).
- Rely on selective acknowledgements.
LOOK UP: TCP Tuning
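A hedged sketch of the socket-level side of this advice (Python; the buffer size, host, port, and file name are assumptions, and the sysctl names in the comments are the usual Linux knobs behind the window-scaling, timestamp, and selective-acknowledgement settings listed above):

    import socket

    BUF_SIZE = 4 * 1024 * 1024                  # assumed; size to bandwidth x delay
    HOST, PORT = "loader.example.com", 5001     # hypothetical endpoint

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Per-socket send/receive buffers (capped by net.core.wmem_max / rmem_max).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, BUF_SIZE)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, BUF_SIZE)

    # System-wide settings (root, outside this script):
    #   sysctl -w net.ipv4.tcp_window_scaling=1   # TCP large windows
    #   sysctl -w net.ipv4.tcp_timestamps=1       # timestamps
    #   sysctl -w net.ipv4.tcp_sack=1             # selective acknowledgements
    #   sysctl -w net.core.rmem_max=16777216 net.core.wmem_max=16777216

    sock.connect((HOST, PORT))
    with open("dump.csv", "rb") as f:           # hypothetical large file to ship
        sock.sendfile(f)                        # zero-copy where the OS supports it
    sock.close()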

Throwing Hardware at a Problem
More memory:
- Increase buffer size without increasing paging.
More disks:
- Log on a separate disk.
- Mirror frequently read files.
- Partition large files.
More processors:
- Off-load non-database applications onto other CPUs.
- Off-load data mining applications to an old database copy.
- Increase throughput to shared data (shared memory or shared disk architecture).

Virtual Storage
Experimental setup:
- Device: Intel 710 (100 GB, 2.5in SATA 3Gb/s, 25nm, MLC): 170 MB/s sequential write, 85 usec write latency; random reads (full range, iodepth 32), 75 usec read latency.
- Host: Core i5 CPU 2.67 GHz, Ubuntu, noop scheduler.
- VM: VirtualBox 4.2 (nice -5), all accelerations enabled, 4 CPUs, 8 GB VDI disk (fixed), SATA controller.
- Guest: Ubuntu, noop scheduler, 4 cores.
Experiments with the flexible I/O tester (fio): sequential writes (seqWrites.fio) and random reads (randReads.fio).

[seqWrites]
ioengine=libaio
iodepth=32
rw=write
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp

[randReads]
ioengine=libaio
iodepth=32
rw=randread
bs=4k,4k
direct=1
numjobs=1
size=50m
directory=/tmp
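A small driver sketch for repeating the comparison (Python calling the fio binary; it assumes fio is installed and that the two job files above sit in the current directory):

    import subprocess

    # Run each fio job and keep machine-readable JSON output for later comparison
    # (run once on the host, once inside the guest).
    for job in ("seqWrites.fio", "randReads.fio"):
        result = subprocess.run(["fio", "--output-format=json", job],
                                capture_output=True, text=True, check=True)
        with open(job + ".json", "w") as out:
            out.write(result.stdout)
        print("finished", job)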

Virtual Storage (results; charts omitted): performance on host vs. performance on guest. Default page size is 4k, default iodepth is 32.

Virtual Storage (results, continued; charts omitted). Page size is 4k, iodepth is 32.

Virtual Storage (results, continued; charts omitted). Page size is 4k; experiments with 32k show negligible difference.