1 Multi-threaded Performance Pitfalls Ciaran McHale.


2 Multi-threaded Performance Pitfalls License Copyright © 2008 Ciaran McHale. Permission is hereby granted, free of charge, to any person obtaining a copy of this training course and associated documentation files (the “Training Course"), to deal in the Training Course without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Training Course, and to permit persons to whom the Training Course is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Training Course. THE TRAINING COURSE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE TRAINING COURSE OR THE USE OR OTHER DEALINGS IN THE TRAINING COURSE.

3 Multi-threaded Performance Pitfalls Purpose of this presentation Some issues in multi-threading are counter-intuitive Ignorance of these issues can result in poor performance - Performance can actually get worse when you add more CPUs This presentation explains the counter-intuitive issues

4 1. A case study

5 Multi-threaded Performance Pitfalls Architectural diagram web browser load balancing router J2EE App Server1 J2EE App Server2 J2EE App Server6... CORBA C++ server on 8-CPU Solaris box DB

6 Multi-threaded Performance Pitfalls Architectural notes The customer felt J2EE was slower than CORBA/C++ So, the architecture had: - Multiple J2EE App Servers acting as clients to… - Just one CORBA/C++ server that ran on an 8-CPU Solaris box The customer assumed the CORBA/C++ server “should be able to cope with the load”

7 Multi-threaded Performance Pitfalls Strange problems were observed Throughput of the CORBA server decreased as the number of CPUs increased - It ran fastest on 1 CPU - It ran slower but “fast enough” with moderate load on 4 CPUs (development machines) - It ran very slowly on 8 CPUs (production machine) The CORBA server ran faster if a thread pool limit was imposed Under a high load in production: - Most requests were processed in < 0.3 second - But some took up to a minute to be processed - A few took up to 30 minutes to be processed This is not what you hope to see

8 2. Analysis of the problems

9 Multi-threaded Performance Pitfalls What went wrong? Investigation showed that scalability problems were caused by a combination of: - Cache consistency in multi-CPU machines - Unfair mutex wakeup semantics These issues are discussed in the following slides Another issue contributed (slightly) to scalability problems: - Bottlenecks in application code - A discussion of this is outside the scope of this presentation

10 Multi-threaded Performance Pitfalls Cache consistency RAM access is much slower than the CPU - Solution: high-speed cache memory sits between the CPU and RAM Cache memory works great: - In a single-CPU machine - In a multi-CPU machine, if the threads of a process are “bound” to a CPU Cache memory can backfire if the threads in a program are spread over all the CPUs: - Each CPU has a separate cache - The cache consistency protocol requires cache flushes to RAM (the cache consistency protocol is driven by calls to lock() and unlock())

11 Multi-threaded Performance Pitfalls Cache consistency (cont’d) The overhead of cache consistency protocols worsens as: - The cost of each cache synchronization increases (this grows with the number of CPUs) - The frequency of cache synchronization increases (this grows with the rate of mutex lock() and unlock() calls) Lessons: - Increasing the number of CPUs can decrease the performance of a server - Work around this by: -Having multiple server processes instead of just one -Binding each process to a CPU (avoids the need for cache synchronization) - Try to minimize the need for mutex lock() and unlock() calls in application code -Note: malloc() / free() and new / delete use a mutex

12 Multi-threaded Performance Pitfalls Unfair mutex wakeup semantics A mutex does not guarantee First In First Out (FIFO) wakeup semantics - To do so would prevent two important optimizations (discussed on the following slides) Instead, a mutex provides: - Unfair wakeup semantics -Can cause temporary starvation of a thread -But guarantees to avoid infinite starvation - High speed lock() and unlock()

13 Multi-threaded Performance Pitfalls Unfair mutex wakeup semantics (cont’d) Why does a mutex not provide fair wakeup semantics? Because most of the time, speed matters more than fairness - When FIFO wakeup semantics are required, developers can write a FIFOMutex class and take the performance hit

14 Multi-threaded Performance Pitfalls Mutex optimization 1 Pseudo-code: void lock() { if ((rand() % 100) < 98) { add thread to head of list; // LIFO wakeup } else { add thread to tail of list; // FIFO wakeup } } Notes: - Last In First Out (LIFO) wakeup increases the likelihood of cache hits for the woken-up thread (avoids the expense of cache misses) - Occasionally putting a thread at the tail of the queue prevents infinite starvation

15 Multi-threaded Performance Pitfalls Mutex optimization 2 Assume several threads concurrently execute the following code: for (i = 0; i < 1000; i++) { lock(a_mutex); process(data[i]); unlock(a_mutex); } A thread context switch is (relatively) expensive - Context switching on every unlock() would add a lot of overhead Solution (this is an unfair optimization): - Defer context switches until the end of the current thread’s time slice - Current thread can repeatedly lock() and unlock() mutex in a single time slice

16 3. Improving Throughput

17 Multi-threaded Performance Pitfalls Improving throughput A 20X increase in throughput was obtained by a combination of: - Limiting the size of the CORBA server’s thread pool -This decreased the maximum length of the mutex wakeup queue -Which decreased the maximum wakeup time - Using several server processes (each with a small thread pool) rather than one server process (with a very large thread pool) - Binding each server process to one CPU -This avoided the overhead of cache consistency -Binding was achieved with the pbind command on Solaris -Windows has an equivalent of process binding: -Use the SetProcessAffinityMask() system call -Or, in Task Manager, right-click on a process and choose the menu option (this menu option is visible only if you have a multi-CPU machine)

18 4. Finishing up

19 Multi-threaded Performance Pitfalls Recap: architectural diagram web browser load balancing router J2EE App Server1 J2EE App Server2 J2EE App Server6... CORBA C++ server on 8-CPU Solaris box DB

20 Multi-threaded Performance Pitfalls The case study is not an isolated incident The project’s high-level architecture is quite common: - Multi-threaded clients communicate with a multi-threaded server - The server process is not “bound” to a single CPU - The server’s thread pool size is unlimited (this is the default in many middleware products) It is likely that many projects have similar scalability problems: - But the system load is not (yet) high enough to trigger them The problems are not specific to CORBA - They are independent of your choice of middleware technology Multi-core CPUs are becoming more common - So, expect to see these scalability issues occurring more frequently

21 Multi-threaded Performance Pitfalls Summary: important things to remember Recognize the danger signs: - Performance drops as the number of CPUs increases - Wide variation in response times when there is a high number of threads Good advice for multi-threaded servers: - Put a limit on the size of a server’s thread pool - Have several server processes, each with a small number of threads, instead of one process with many threads - Bind each server process to a CPU