Boost Linux Performance with Enhancements from Oracle


Boost Linux Performance with Enhancements from Oracle
Chris Mason, Director of Linux Kernel Engineering

Linux Performance on Large Systems
- Exadata hardware
- How large systems are different
- Finding bottlenecks
- Optimizations in Oracle's Unbreakable Enterprise Kernel

Exadata Hardware: X2-8
- 8 sockets, Intel X7560
- 8 cores per socket, 2 threads per core
- 1TB of RAM
- 8 IB QDR ports (40Gb/sec each)
- Other assorted slots, ports, cards

X2-8 NUMA: Non-Uniform Memory Access
- The X2-8 consists of four blades
  - Each blade has two CPU sockets
  - Each blade has 256GB of RAM
  - Each blade has one or more IB cards
  - Fast interconnect to the other blades
- The CPUs access resources on the same blade much faster than resources on remote blades
- NUMA lowers hardware costs but increases the work that must be done in software to optimize the system
- Linux already includes extensive optimizations and frameworks to run well on NUMA systems

Finding Bottlenecks
- Are my CPUs idle?
- Am I waiting on the disk or the network?
- Am I bottlenecked on a single CPU?
- Where is my CPU spending all its time?
  - Application
  - System time (kernel overhead)
  - Softirq processing (kernel overhead)
- mpstat -P ALL 1
  - Gives a per-CPU report of time spent waiting for IO, busy in application or kernel code, handling interrupts, etc.
- Large systems often have a small number of CPUs pegged at 100% while others are mostly idle
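As a rough illustration of reading that mpstat output, the snippet below (mine, not from the talk) filters the per-CPU rows and flags CPUs with almost no idle time. The sample data is made up, and it assumes %idle is the last column, as in sysstat's mpstat with timestamps trimmed.

```shell
# Flag CPUs whose %idle (assumed to be the last column) is below a limit.
pegged_cpus() {
  awk -v limit="${1:-5}" '$1 ~ /^[0-9]+$/ && $NF+0 < limit { print $1 }'
}

# Illustrative sample, not a real run: CPU 0 is saturated, CPU 1 is idle.
sample="CPU %usr %nice %sys %iowait %irq %soft %steal %idle
0   55.0  0.0  42.0   0.0    1.0   2.0    0.0    0.0
1    1.0  0.0   1.0   0.0    0.0   0.0    0.0   98.0"

printf '%s\n' "$sample" | pegged_cpus
```

On a live system you would pipe `mpstat -P ALL 1` through the same filter to spot the "one CPU pegged, the rest idle" pattern the slide describes.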

Finding Bottlenecks: latencytop
- Tracks why each process waits in the kernel
- Can quickly determine if you're waiting on disk, network, kernel locks, or anything that sleeps
- GUI mode to select a specific process
- latencytop -c mode collects information on each process over a long period of time

Finding Bottlenecks: perf
- When the system is CPU bound, perf can tell us why
- Profiling can be limited to a single CPU
  - Very useful when only one CPU is saturated
- Profiles can include full back traces
  - Explains the full call chain that leads to lock contention
- Example usage:
  - perf record -g -C 16 (record profiles on CPU 16 with call traces)
  - perf record -g -a (record profiles on all CPUs)
  - perf report -g (produce a call-graph report from the recorded profile)

Optimizing Workloads
- Fast networking and storage IO rates add contention in new areas
- Spread interrupts over CPUs local to the cards
- Push softirq handling out over all the CPUs
- Reduce lock contention in both the kernel and the application
  - Lock contention is much more expensive on NUMA systems
- Use cpusets to control CPU allocation for specific workloads

Interrupt Processing
- Interrupts process events from the hardware
  - Receiving network packets
  - Disk IO completion
- The Linux irqbalance daemon spreads interrupt processing over CPUs based on load
- irqbalance modifications:
  - Only process IRQs on CPUs local to the card
  - Usually hand tuned on NUMA systems, but we added code to do this automatically
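The hand-tuning the slide mentions is usually done by writing a CPU mask to /proc/irq/&lt;irq&gt;/smp_affinity. The sketch below (my illustration, not Oracle's irqbalance patch) converts a NUMA node's cpulist (e.g. "0-7,16-23", the format in /sys/devices/system/node/nodeX/cpulist) into that hex mask; it only handles CPU numbers below 64.

```shell
# Convert a kernel cpulist string such as "0-7,16-23" into the hex bitmask
# format that /proc/irq/<irq>/smp_affinity expects. CPUs >= 64 would need
# a wider, comma-separated mask, which this sketch does not handle.
cpulist_to_mask() {
  mask=0
  for range in $(printf '%s' "$1" | tr ',' ' '); do
    lo=${range%-*}; hi=${range#*-}   # a single CPU yields lo == hi
    cpu=$lo
    while [ "$cpu" -le "$hi" ]; do
      mask=$(( mask | (1 << cpu) ))
      cpu=$(( cpu + 1 ))
    done
  done
  printf '%x\n' "$mask"
}

# Usage (needs root; irq 42 and node0 are placeholders for your hardware):
#   cpulist_to_mask "$(cat /sys/devices/system/node/node0/cpulist)" \
#       > /proc/irq/42/smp_affinity
```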

Softirq Processing
- Softirqs handle portions of the interrupt processing
  - Waking up processes
  - Copying data from the kernel to application memory (networking receives)
  - Various kernel data structure updates
- Softirqs normally run on the same CPU that received the interrupt, but slightly later
- Spreading interrupt processing across CPUs also spreads the resulting softirq work across CPUs
- Interrupts must be handled on CPUs local to the card for performance, but softirqs can be spread farther away

Spreading Softirqs for Storage: IO Affinity
- Records the CPU that issued an IO
- When the IO completes, the softirq is sent back to the issuing CPU
- Very effective for solid-state storage on large systems
- Reduces contention on scheduler locks because wakeups are done on the same CPU where the process last ran
- Enabled by default in Oracle's Unbreakable Enterprise Kernel
- >2x improvement in SSD IO/s in one OLTP-based test
  - Almost 5x faster after removing driver lock contention
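The closest mainline knob to this behavior that I'm aware of is the block layer's rq_affinity setting, which steers IO completion work toward the issuing CPU. The sketch below is my illustration of toggling it; "sda" is a placeholder device, and it prints its plan instead of writing unless APPLY=1 is set (writing requires root).

```shell
# Sketch: set /sys/block/<dev>/queue/rq_affinity, the mainline block-layer
# knob for completing IOs near the CPU that issued them (1 = same CPU
# group; 2, where supported, forces the exact issuing CPU).
set_rq_affinity() {
  dev=$1; val=$2
  knob="/sys/block/$dev/queue/rq_affinity"
  if [ "${APPLY:-0}" = "1" ]; then
    echo "$val" > "$knob"            # real write, needs root
  else
    echo "would write $val to $knob" # dry run by default
  fi
}

set_rq_affinity sda 2
```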

Spreading Softirqs for Networking: Receive Packet Steering
- Spreads softirqs for TCP/IP receives across a mask of CPUs selected by the admin
- /sys/class/net/XX/queues/rx-N/rps_cpus
  - XX is the network interface
  - N is the queue number (some cards have many)
  - Contains a mask of CPUs to use, in the taskset format
- Shotgun-style spreading
  - A hash of the network headers picks the CPU
  - Fairly random CPU selection for the softirq
  - Not optimal on the X2-8 due to poor locality
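Enabling RPS on one receive queue comes down to writing that hex mask. A minimal sketch, with eth0, queue 0, and the mask as placeholders; like the other sysfs examples here, it dry-runs unless APPLY=1 (the real write needs root).

```shell
# Sketch: write a taskset-format CPU mask to a queue's rps_cpus file to
# spread receive softirqs for that queue across the masked CPUs.
set_rps() {
  iface=$1; queue=$2; mask=$3
  knob="/sys/class/net/$iface/queues/rx-$queue/rps_cpus"
  if [ "${APPLY:-0}" = "1" ]; then
    echo "$mask" > "$knob"
  else
    echo "would write $mask to $knob"
  fi
}

# Steer softirqs for eth0 queue 0 across CPUs 0-7 (mask ff).
set_rps eth0 0 ff
```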

Receive Flow Steering
- The second stage of Receive Packet Steering
- /sys/class/net/XX/queues/rx-N/rps_flow_cnt
  - Size of the hash table for recording flows (e.g. 8192)
- As processes wait for packets, the kernel remembers which sockets they are waiting on and which CPU they last used
- When packets come in, the softirq is directed to the CPU where the process last slept
- More directed than Receive Packet Steering alone
- Together with Receive Packet Steering:
  - 50% faster IPoIB results on a two-socket system
  - 100-200% faster on the X2-8
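RFS also needs the global flow table, /proc/sys/net/core/rps_sock_flow_entries, and the kernel's scaling documentation suggests giving each receive queue an equal share via rps_flow_cnt. The sketch below is mine (eth0, 4 queues, and 32768 entries are placeholder values); it dry-runs unless APPLY=1.

```shell
# Each queue's rps_flow_cnt is the global flow-table size divided by the
# number of receive queues, per the kernel's RFS guidance.
per_queue_flows() {
  echo $(( $1 / $2 ))
}

setup_rfs() {
  iface=$1; nqueues=$2; total=$3
  cnt=$(per_queue_flows "$total" "$nqueues")
  maybe_write() {
    if [ "${APPLY:-0}" = "1" ]; then echo "$1" > "$2"
    else echo "would write $1 to $2"; fi
  }
  maybe_write "$total" /proc/sys/net/core/rps_sock_flow_entries
  q=0
  while [ "$q" -lt "$nqueues" ]; do
    maybe_write "$cnt" "/sys/class/net/$iface/queues/rx-$q/rps_flow_cnt"
    q=$(( q + 1 ))
  done
}

setup_rfs eth0 4 32768
```

With these placeholder numbers each of the four queues gets 8192 flow entries, matching the example size on the slide.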

RDS Improvements
- RDS is one of the main network transports used in Exadata systems
  - Reliable Datagram Sockets, optimized for Oracle's use
  - Enables network RDMA operations when used with InfiniBand
- Original X2-8 target: 4x faster than a two-socket system
- Original X2-8 numbers: slightly slower than a two-socket system
- Final X2-8 numbers: 8x faster than the original two-socket numbers

RDS Improvements
- RDS was heavily saturating one or two cores on the system while leaving the rest of the X2-8 idle
- Allocate two MSI IRQs for each RDS connection instead of two for the whole system
  - Spreads interrupts across multiple CPUs
- Reduce lock contention in the RDS code
- Optimize RDMA key management for NUMA
- Reduce wakeups on remote CPUs
- Switch a number of data structures over to RCU
  - Read-Copy-Update: http://lwn.net/Articles/262464/

IPC Semaphores
- Heavily used by Oracle to wake up processes as database transactions commit
- Problematic for years due to high spinlock contention inside the kernel
  - Problematic in almost every Unix as well
- Accounted for 90% of the system time during X2-8 database runs
- The new code doesn't register in system profiles (<1% of system time)
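The fix described here is kernel-internal, but the System V semaphore limits involved are easy to inspect: /proc/sys/kernel/sem holds four values (SEMMSL, SEMMNS, SEMOPM, SEMMNI). The helper below is my illustration of labeling them; the sample values are common defaults, not measurements from the talk.

```shell
# Label the four System V semaphore limits as found in
# /proc/sys/kernel/sem: max semaphores per set, max semaphores
# system-wide, max ops per semop call, max semaphore sets.
label_sem_limits() {
  awk '{ printf "semmsl=%s semmns=%s semopm=%s semmni=%s\n", $1, $2, $3, $4 }'
}

# On a live system:  label_sem_limits < /proc/sys/kernel/sem
echo "250 32000 100 128" | label_sem_limits
```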

Cpusets
- Create simple containers associated with a set of CPUs and memory
- Can break up a large system for a number of smaller workloads
- Example benchmark: high database lock contention on a single row
  - Spreading across all the X2-8 CPUs is much slower than a simple two-socket system
  - Containing the workload to 32 CPUs is slightly faster than a simple two-socket system (5-10%)
- http://www.kernel.org/doc/man-pages/online/pages/man7/cpuset.7.html
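A minimal sketch of carving out such a container through the cpuset filesystem interface from cpuset(7). The mount point, set name, and CPU/memory-node ranges are placeholders, and the control-file names vary by mount options (cpuset.cpus here, plain cpus on older noprefix mounts); it dry-runs unless APPLY=1 and would need root to apply.

```shell
# Sketch: create a cpuset, restrict it to a CPU range and memory nodes,
# and move the current shell into it so children inherit the confinement.
make_cpuset() {
  name=$1; cpus=$2; mems=$3
  root=/dev/cpuset                    # assumes a cpuset mount here
  if [ "${APPLY:-0}" = "1" ]; then
    mkdir -p "$root/$name"
    echo "$cpus" > "$root/$name/cpuset.cpus"
    echo "$mems" > "$root/$name/cpuset.mems"
    echo $$ > "$root/$name/tasks"     # confine this shell and its children
  else
    printf 'would create %s with cpus=%s mems=%s\n' "$root/$name" "$cpus" "$mems"
  fi
}

# Confine a workload to 32 CPUs on one pair of blades (NUMA nodes 0-1).
make_cpuset dbset 0-31 0-1
```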

Optimization Summary
- Includes a long series of optimizations between the 2.6.18 and 2.6.32 kernels
- Many NUMA-targeted improvements
- Focused optimizations for the IO, networking, and IPC stacks
- Extensive profiling with Exadata workloads
- Work is spread effectively across all the CPUs, with less lock contention and system time overhead

Resources
- Linux home page: oracle.com/linux
- Follow us on Twitter: @ORCL_Linux
- Free download, Oracle Linux: edelivery.oracle.com/linux
- Read the Oracle Linux blog: blogs.oracle.com/linux
- Shop online, Oracle Unbreakable Linux Support: oracle.com/store

© 2010 Oracle Corporation