Using the VTune Analyzer on Multithreaded Applications

Presentation transcript:

Using the VTune Analyzer on Multithreaded Applications Intel Software College

Objectives
- Two uses for the VTune Performance Analyzer:
  - Improving the efficiency of computation
  - Improving your threading model
- Example: mandelbrot3.exe

Agenda
- Determine whether a sample section of code can be threaded
- Determine whether the implemented threads are balanced

Use VTune Analyzer on Serial Applications
- Determine which parts of your application (if any) will speed it up when threaded
  - CPU-bound apps can potentially run twice as fast on dual-core processors
  - Memory-bound apps may potentially run 50% faster
  - I/O-bound apps may not run any faster
- Find the main performance bottlenecks in your application (sampling)
- Determine whether it makes sense to thread there
  - If not, look further up the program’s calling sequence to find a more appropriate place to consider (callgraph)
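The speedup figures on this slide can be sanity-checked with Amdahl's law. The slides do not name that law, so treat this as an illustrative sketch only. The function name `estimated_speedup` and the two-thirds parallel fraction are assumptions chosen to reproduce the slide's "twice as fast" and "50% faster" numbers.

```python
def estimated_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / ((1 - p) + p / n).

    parallel_fraction (p) is the share of run time that scales with more
    workers; workers (n) is the number of cores or threads.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Hypothetical workloads on a dual-core processor (n = 2).
cpu_bound = estimated_speedup(1.0, 2)         # everything scales
memory_bound = estimated_speedup(2 / 3, 2)    # assumed: two thirds scales
io_bound = estimated_speedup(0.0, 2)          # nothing scales

print(f"CPU-bound:    {cpu_bound:.2f}x")
print(f"Memory-bound: {memory_bound:.2f}x")
print(f"I/O-bound:    {io_bound:.2f}x")
```

In practice the parallel fraction is not known in advance; sampling is what tells you how much of the run time sits in code that could be threaded.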

Use VTune Analyzer on Multithreaded Applications
- Determine whether your current threading model is balanced
  - In the thread view, each thread should consume about the same amount of time
  - If not, shift work between the threads
- Use the “Samples Over Time” view
- Look for idle CPU time
  - Could more threads use the idle resources?
  - In the process view, idle time can show up as ntoskrnl or intelppm.sys
  - Sometimes the process view actually says “idle task”
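One way to put a number on "balanced" is to compare the busiest thread with the average thread. The sketch below is not a VTune feature; the function names and the per-thread sample counts are hypothetical, standing in for values read off the thread view.

```python
def load_imbalance(samples_per_thread):
    """Ratio of the busiest thread to the mean; 1.0 means perfectly balanced."""
    if not samples_per_thread:
        raise ValueError("need at least one thread")
    mean = sum(samples_per_thread) / len(samples_per_thread)
    if mean == 0:
        raise ValueError("no samples were collected")
    return max(samples_per_thread) / mean


def idle_fraction(samples_per_thread):
    """Share of CPU capacity left idle while waiting for the busiest thread."""
    return 1.0 - 1.0 / load_imbalance(samples_per_thread)


# Hypothetical sample counts for four threads.
balanced = [250, 250, 250, 250]
skewed = [700, 100, 100, 100]

for name, counts in (("balanced", balanced), ("skewed", skewed)):
    print(f"{name}: imbalance {load_imbalance(counts):.2f}, "
          f"idle {idle_fraction(counts):.0%}")
```

A ratio well above 1.0 suggests that work should be shifted away from the busiest thread.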

Example: mandelbrot3.exe
- Single-threaded app; analyze it to see whether it can be threaded
- Demo machine: four-socket, dual-core, hyper-threaded (HT) machine
  - 4 sockets * 2 cores per socket * 2 HT CPUs per core = 16 processors as seen by the OS
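The processor count above is simple multiplication, and it can be compared with what the operating system reports. A minimal sketch: `logical_processors` is a hypothetical helper, while `os.cpu_count()` is the standard-library call and may return `None` on some platforms.

```python
import os


def logical_processors(sockets: int, cores_per_socket: int, threads_per_core: int) -> int:
    """Number of logical processors the OS should see for a given topology."""
    return sockets * cores_per_socket * threads_per_core


# Topology of the demo machine described on the slide.
demo_machine = logical_processors(sockets=4, cores_per_socket=2, threads_per_core=2)
print("Demo machine:", demo_machine)

# What the machine running this script reports (None if undetermined).
print("This machine:", os.cpu_count())
```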

PROCESS VIEW: The analyzer shows a lot of idle time, which we’d expect on a 16-CPU system running little besides our example program.

MODULE VIEW: Shows us the executables that consumed time in the Mandelbrot process.

THREAD VIEW: Shows us the threads in the Mandelbrot application (only one thread does any significant computation).

SOURCE VIEW: 99% of the CPU time used by the application was spent in the GenerateScanLine function, specifically in CalculatePixels. If we threaded here, we’d be creating, synchronizing and destroying a thread for each pixel on the screen, and thread management would dominate the execution time. The same would be true for creating a thread per scan line. We need to look further up the calling sequence to find a better place to create threads.
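To see why the granularity matters, it helps to count how many threads each choice would create. The sketch below assumes a hypothetical 1024 x 768 display (the slides do not give the resolution) and the 16 logical CPUs of the demo machine; `threads_needed` is an illustrative helper, not part of the example program.

```python
def threads_needed(width: int, height: int, logical_cpus: int) -> dict:
    """Threads created under three different threading granularities."""
    return {
        "one thread per pixel": width * height,
        "one thread per scan line": height,
        "one thread per logical CPU": logical_cpus,
    }


# Assumed display size; 16 logical CPUs as on the demo machine.
counts = threads_needed(width=1024, height=768, logical_cpus=16)

for strategy, count in counts.items():
    # Threads beyond the CPU count cannot run at the same time; they only
    # add creation, synchronization and teardown cost.
    excess = count - counts["one thread per logical CPU"]
    print(f"{strategy}: {count} threads ({excess} more than there are CPUs)")
```

With these assumed numbers, the per-pixel and per-scan-line choices create far more threads than the machine can run at once, which is the reason for looking higher in the calling sequence.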

CALLGRAPH: The best spot is Generate_Display, which calls GenerateScanline. The callgraph shows us that both of these functions are on the critical path, which is a good sign for our plan.

SOURCE VIEW for the modified code: We get a speedup, but now that we’ve gone multithreaded, let’s make sure each thread is doing the same amount of work.

CHECK THE LOAD BALANCING using SAMPLES OVER TIME: Each square is a slice of time, and red squares indicate higher CPU usage in that slice. Many squares for each thread are red at the same instant, which indicates the threads are running in parallel. BUT the threads finish at different times, which means some threads have more work to do than others. This means there will be idle CPUs once the first threads finish.

We saw that different threads are doing different amounts of work. Some threads are drawing parts of the screen that aren’t covered by much of the fractal, so they don’t consume much CPU time. SO, we interleaved the threads and scan lines (on a 16-CPU system) so that the first thread does the first line, the 17th line, the 33rd line, and so forth. The second thread does the second line, the 18th, and so on. This should distribute the computation more evenly over all the threads. The above is the Samples Over Time view of the amended code. (2156 ms to 297 ms)
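The effect of interleaving can be sketched without any threads at all, by comparing how much work each thread would receive under the two assignments. Everything here is an assumption for illustration: the tiny 48 x 64 image, the iteration limit, the view window, and the function names are not taken from mandelbrot3.exe. Iteration counts stand in for CPU time.

```python
def row_cost(y, width, height, max_iter=50):
    """Total escape-time iterations for one scan line (a proxy for CPU time)."""
    total = 0
    imag = -1.5 + 3.0 * y / (height - 1)
    for x in range(width):
        c = complex(-2.0 + 3.0 * x / (width - 1), imag)
        z = 0j
        n = 0
        while n < max_iter and abs(z) <= 2.0:
            z = z * z + c
            n += 1
        total += n
    return total


def block_rows(thread, n_threads, height):
    """Contiguous band of scan lines for one thread."""
    return range(thread * height // n_threads, (thread + 1) * height // n_threads)


def interleaved_rows(thread, n_threads, height):
    """Every n_threads-th scan line, starting at the thread's own index."""
    return range(thread, height, n_threads)


def imbalance(assign, costs, n_threads):
    """Busiest thread's load divided by the mean load (1.0 is ideal)."""
    loads = [sum(costs[y] for y in assign(t, n_threads, len(costs)))
             for t in range(n_threads)]
    return max(loads) / (sum(loads) / n_threads)


WIDTH, HEIGHT, THREADS = 48, 64, 16
costs = [row_cost(y, WIDTH, HEIGHT) for y in range(HEIGHT)]

print("thread 0 lines (0-based):", list(interleaved_rows(0, THREADS, HEIGHT)))
print("block imbalance:      ", round(imbalance(block_rows, costs, THREADS), 2))
print("interleaved imbalance:", round(imbalance(interleaved_rows, costs, THREADS), 2))
```

With 0-based indices, thread 0 gets lines 0, 16, 32 and 48, which are the first, 17th, 33rd and 49th lines in the slide's counting. If the two timings on the slide are run times before and after the change, their ratio is about 7.3.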
