Maximizing Application Performance by Threading, SIMD and Micro-Architecture Tuning
Koby Gottlieb, Intel Corporation, Feb
Copyright © 2004 Intel Corporation. All Rights Reserved.

Agenda
Threading gains and challenges
Optimization methodology, project milestones
–Developing a benchmark
–VTune™ Performance Analyzer
–Threading: overview of approaches
–Intel® Thread Checker
–Intel® Thread Profiler
–Streaming SIMD Extensions (SSE) and micro-architectural issues
Project example
[Mark] is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States or other countries

Dual-Core Systems
One package with 2 cores
Software impact
–2 cores → 2 processors
–2 cores → 2x resources
Use threads to exploit the full resources of dual-core processors
Efficiently Utilize Dual Cores

Threads Defined
OS creates a process for each program loaded
–Each process executes as a separate thread
Additional threads can be created within the process
–Each thread has its own Stack and Instruction Pointer
–All threads share code and data
[Figure: a Process containing shared Code and Data, plus thread1(), thread2(), … threadN(), each with its own Stack and IP]

Threading Software
OpenMP* threads
Windows* threads
POSIX* threads (pthreads)
If both cores are fully busy, then a 2x speedup is possible
* Other names and brands may be claimed as the property of others.

Correctness Bug: Data Races
Suppose: a=1, b=2
Thread1: x = a + b
Thread2: b = 42
What is the value of x if:
–Thread1 runs before Thread2? x = 3
–Thread2 runs before Thread1? x = 43
Outcome depends on thread execution order
Data race: concurrent read, modify, write of the same address
Challenges Unique to Threading

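The order dependence on this slide can be reproduced deterministically. The sketch below is only an illustration in Python's standard `threading` module: it forces each of the two orders by joining one thread before starting the next. The names `run_in_order`, `thread1` and `thread2` are hypothetical and chosen to mirror the slide.

```python
import threading

def run_in_order(first, second):
    """Run the two thread bodies one after the other on shared state; return x."""
    shared = {"a": 1, "b": 2, "x": None}

    def thread1():
        shared["x"] = shared["a"] + shared["b"]   # x = a + b

    def thread2():
        shared["b"] = 42                          # b = 42

    bodies = {"thread1": thread1, "thread2": thread2}
    for name in (first, second):
        t = threading.Thread(target=bodies[name])
        t.start()
        t.join()   # force the order: finish this thread before starting the next
    return shared["x"]

print(run_in_order("thread1", "thread2"))  # 3
print(run_in_order("thread2", "thread1"))  # 43
```

In a real threaded run nothing forces the order, so either value can appear from one run to the next.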
Solving Data Races: Synchronization
Thread1: Acquire(L); a = 1; b = 2; x = a + b; Release(L)
Thread2: Acquire(L); b = 42; Release(L)
Acquisition of mutex L ensures atomic access
–Only one thread can hold the lock at a time
Example APIs:
–EnterCriticalSection(), LeaveCriticalSection()
–pthread_mutex_lock(), pthread_mutex_unlock()

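As a minimal sketch of the Acquire(L) / Release(L) pattern, the Python example below uses `threading.Lock` in place of the Win32 and pthreads calls named on the slide. The function `locked_total` and its parameters are hypothetical. Several threads perform a read, modify, write on one shared counter, and the lock makes each update atomic with respect to the others.

```python
import threading

def locked_total(num_threads=4, increments=10_000):
    """Increment a shared counter from several threads under one lock."""
    lock = threading.Lock()          # plays the role of mutex L
    state = {"count": 0}

    def worker():
        for _ in range(increments):
            with lock:               # Acquire(L) on entry, Release(L) on exit
                state["count"] += 1  # read, modify, write of shared data

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["count"]

print(locked_total())  # 40000
```

The lock serializes the protected region, so the time spent inside it counts toward the serial portion of the program.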
Amdahl’s Law
$T_{parallel} = \left[(1 - P) + \frac{P}{N}\right] T_{serial} + O$
P = parallel portion of process
N = number of processors (cores)
O = parallel overhead
If only 1/2 of the code is parallel, a 2X speedup is out of reach: with N = 2 and no overhead the bound is about 1.33X
[Figure: time bars showing the serial (1-P) and parallel P portions of the total time T]

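The bound can be checked numerically. The sketch below assumes the usual form of Amdahl's Law with an additive overhead term, T_parallel = ((1 - P) + P/N) * T_serial + O, using the slide's symbols P, N and O. The function names `parallel_time` and `speedup` are hypothetical.

```python
def parallel_time(t_serial, p, n, overhead=0.0):
    """Run time on n cores when a fraction p of the serial time is parallel."""
    return ((1.0 - p) + p / n) * t_serial + overhead

def speedup(p, n, overhead=0.0, t_serial=1.0):
    """Serial time divided by parallel time."""
    return t_serial / parallel_time(t_serial, p, n, overhead)

print(speedup(0.5, 2))                  # half the code parallel, 2 cores
print(speedup(1.0, 2))                  # fully parallel, 2 cores
print(speedup(0.5, 2, overhead=0.25))   # overhead can erase the gain entirely
```

With half the code parallel the two-core speedup is 4/3, and an overhead equal to a quarter of the serial time brings it back to 1.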
Threads Introduce New Class of Problems
Correctness bugs
–Data races
–Deadlock
–and more…
Performance bottlenecks
–Overhead
–Load balance
–and more…
Intel® Threading Tools can help!
–Intel® Thread Checker finds correctness bugs
–Thread Profiler feature pinpoints bottlenecks

Methodology & Milestones: Getting Started
–Most of the world's apps are not threaded:
  There are 106,177 registered projects in ( ).
  Almost none of the applications are performance sensitive.
  Some performance-sensitive apps are too small, too big, or too complex.
–Is the app a representative picture of the real software world?
–If so, we have a problem in our multi-core strategy.
–Learning the app: there is no need to understand every algorithm, but an overall understanding is a must. The call graph of the VTune™ analyzer is a great tool for this task.
–Develop a benchmark: a representative benchmark must be defined before optimizing. A good benchmark must be automatic (VTune™ analyzer tuning assistant), not too short (above 30 seconds) and not too long. Surprisingly, selecting a good benchmark is time consuming and difficult.

Using VTune™ Performance Analyzer
Sampling is surprisingly easy to use:
–Easy to get good results from sampling without any training.
–Time breakdown is the first step of the threading decision-making process.
–Hot spots are candidates for vectorization.
Call graph as a tool to understand the code and select the threading direction:
–Requires setting the /fixed:no flag for the linker.
–Call graph provides a hierarchical view and overall timing.
–Call graph overhead makes it too inaccurate for timing; use Sampling for correct time estimates.

Threading
The most challenging part of the project: how to thread.
–Added difficulty: shared resources such as the FSB or the L2 cache may eliminate the speedup potential.
–Functional or data decomposition?
–In many cases you can find mostly functional parallelism, which only scales to 2-3 threads.
–Examples: identify the stages and let thread 0 work on the front end of data element N+1 while thread 1 works on the back end of data element N. Assign a thread per channel in stereo.
–For good data decomposition, the code should be designed in advance to be threaded. A desirable goal is to maintain the exact results in order to simplify testing, so breaking the input into chunks does not work if there is any history between data elements.
–If data decomposition works on only a relatively small part of the project, there is almost no speedup because of the synchronization overhead.
OpenMP is very convenient for data decomposition experimentation.
–Supported by the Intel® compiler. It became more legitimate with its introduction in the MS .NET 2005 compiler*.

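The point about history between data elements can be seen in a small experiment. The Python sketch below is illustrative only: `scale`, `smooth` and `in_chunks` are hypothetical names, and the chunks are processed sequentially to stand in for what independent threads would each compute. A stateless operation gives identical results when the input is chunked, while an operation that carries state from one element to the next does not.

```python
def scale(samples):
    """Each output depends only on its own input: no history."""
    return [2.0 * s for s in samples]

def smooth(samples, prev=0.0):
    """Each output depends on the previous output: history between elements."""
    out = []
    for s in samples:
        prev = 0.5 * s + 0.5 * prev
        out.append(prev)
    return out

def in_chunks(func, samples, chunk_size):
    """Process fixed-size chunks independently, as separate threads would."""
    out = []
    for start in range(0, len(samples), chunk_size):
        out.extend(func(samples[start:start + chunk_size]))
    return out

data = [4.0, 8.0, 2.0, 6.0]
print(in_chunks(scale, data, 2) == scale(data))    # True
print(in_chunks(smooth, data, 2) == smooth(data))  # False
```

The second chunk of `smooth` restarts from a zero state, so its output differs from the serial run and the exact-results goal is lost.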
Debugging the Threaded App
Convert the app to serial code and debug it first, running thread 0 before thread 1 and then in reverse order.
–This methodology is good for 75% of the bugs and does not require any tricky debugging technique.
–Then try running in parallel and start looking for shared data elements.
Intel® Thread Checker to the rescue.
–“No, it is not broken, just build a very small example and be patient”. It takes a long time.
–Intel® Thread Checker gives excellent analysis capabilities:
  The location of the faulty data element allocation
  The read location
  The write location
  The call stack that brings us to this location

Intel® Thread Checker 2.0 Features
Locates threading bugs:
–Data races (storage conflicts)
–Deadlocks (potential and actual)
–Win32 threading API usage problems
–Memory leaks and overwrites
Isolates bugs to source code line
Describes possible causes of errors and suggests resolutions
Categorizes errors by severity level

Diagnostics List
[Screen shot: Intel® Thread Checker Diagnostics List in terse mode, showing the summary and legend and the verbose diagnostics]

Source Code View
[Screen shot: Intel® Thread Checker]
Each diagnostic in the list links to its source code line(s)

Help with Diagnostics
[Screen shot: Intel® Thread Checker]
1) Right-click here...
2) More help!

Intel® Thread Checker Example: From Sphinx final report.

Threading, Performance
Check what percentage of the code is threaded.
–This sets the upper bound for potential performance.
–Can use VTune™ analyzer to see how much time each thread runs.
–Check whether the total instruction count of the threaded app equals the instruction count of the original app. In many cases there is a huge overhead for threading, or just a bug (doing some work twice).
Evaluate the amount of parallel work.
–Even if both threads spend the same amount of time, they may not be doing it at the same time.
–If an (already debugged) threaded app runs much slower than the scalar app, look for false sharing issues: “No, converting each local variable to an array of 2 variables is not a good idea for threading efficiency.” From one of my meetings, trying to explain why the threaded app was 14X slower than the original app.
Check the critical path.
–Intel® Thread Profiler is great for the job after you figure out how to use it and its cryptic terminology.
–Note that the Win32 API Thread Profiler is not the same tool as the OpenMP Thread Profiler.

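The difference between equal busy time and truly parallel work can be made concrete. The Python sketch below is hypothetical: the busy intervals are invented (start, end) pairs, not output from VTune™ analyzer or Thread Profiler, and the helpers `busy_time` and `overlap` assume the intervals within each thread do not overlap each other.

```python
def busy_time(intervals):
    """Total time one thread spends running."""
    return sum(end - start for start, end in intervals)

def overlap(a, b):
    """Total time during which both threads are running at once."""
    return sum(max(0.0, min(a_end, b_end) - max(a_start, b_start))
               for a_start, a_end in a
               for b_start, b_end in b)

thread0 = [(0.0, 5.0)]
ping_pong = [(5.0, 10.0)]   # same busy time, but runs only after thread0 stops
concurrent = [(1.0, 6.0)]   # same busy time, mostly at the same time

print(busy_time(thread0), busy_time(ping_pong), overlap(thread0, ping_pong))    # 5.0 5.0 0.0
print(busy_time(thread0), busy_time(concurrent), overlap(thread0, concurrent))  # 5.0 5.0 4.0
```

In the first case the two threads are perfectly balanced yet never run together, so the app gains nothing from the second core.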
The Thread Profiler Feature
Pinpoints threading performance bottlenecks in apps threaded with:
–Microsoft* Windows* threads on Microsoft* Windows* systems
–POSIX* pthreads on Linux* systems
–OpenMP* on Microsoft* Windows* and Linux* systems
Plugs into the VTune™ environment
–Microsoft* Windows* for IA-32 systems
–Linux* for IA-32 systems
Intel® Threading Tools

Thread Profiler Feature Analysis
Monitors execution flows to find the Critical Path
–Longest execution flow is the Critical Path
Analyzes the Critical Path
–System utilization
  Over-subscribed vs. under-subscribed
–Thread state transitions
  Blocked -> Running
Captures threads timeline
–Visualize threading structure

Thread Profiler Critical Path
Start with the critical path
Separate according to system utilization
Add overhead
Further analyze by thread state
[Figure: Critical Path View. Timeline of Thread 1, Thread 2 and Thread 3 over times T0 to T15, with the events Acquire lock L, Wait for Threads 2 & 3, Wait for L and Release L. Legend: Idle, Serial, Parallel, Under-subscribed, Over-subscribed; Cruise time, Overhead, Blocking time, Impact time.]
Analysis shown for 2-way system

Intel® Thread Profiler (OpenMP) Example: From FAAD final report.

Intel® Thread Profiler (Win32 API)
From FAAD
From GainMPEG: So what’s wrong with this picture?

Streaming SIMD Extensions Coding & Micro-architecture
Intel® Streaming SIMD Extensions
–Optimize the slow thread first in the case of functional decomposition.
–In C++, use the class libraries.
–In C, use intrinsics.
–Use inline assembly if the compiler does not behave as expected.
–For integer code or code with many shuffle instructions, inline assembly might be the only solution. But will it be accepted back into the open source tree?
Micro-architectural issues
–Use the VTune™ analyzer tuning assistant
  It's simpler than trying to learn all the ugly stuff.
  It actually works and finds big issues in some cases.
[Chart: Clock Ticks (ms)]

Micro-arch tuning: VTune Tuning Assistant
Phase 1 – identify main slow-down reasons
–The CPI is high
–High branch mispredictions impact
–Many L2 demand misses
Use precise events to focus on instructions of interest.

Example: Phase 2 – focus on problem sources
–Branch mispredictions
–L2 load misses

Impact: Web Publications
From audio encoding
LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, "In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one."
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.
The successful projects have high impact.

The LAME example: What is the LAME Project?
An educational tool used for learning about MP3 encoding. Its goal is to improve:
–Psycho-acoustics quality.
–The speed of MP3 encoding.
LAME is the most popular state-of-the-art MP3 encoder/decoder used by today’s leading products.
Project goals:
–Speeding up the encoding of an audio stream.
–Turning LAME into a Multi-Threaded (MT) engine.
–Be 1:1 bit compatible with the original version.
–Optimize specifically for SMT platforms.
–64-bit port and CMP-related optimizations.

MP3 Encoding Overview
Break up the audio stream into frames (uniform chunks, typically ~1K)
[Figure: MP3 encoder block diagram. Labels: Audio Stream, Perceptual Model, Analysis Filterbank, MDCT, Quantization, Bitstream Encode; specifically in LAME: Read Frame, Psycho-Acoustic, Huffman Encoding; frames shown: Frame 1 to Frame 4.]

LAME MT – Intuitive approach
The intuitive approach: [Figure: Frames 1 to 6 divided between Thread 1 and Thread 2]
This is actually Data Decomposition
An unbreakable dependence due to Huffman Encoding

LAME MT – Functional Decomposition
[Figure: Frames 1 to 6 flow through two threads, T1 and T2, which split the stages Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization and Huffman Encoding. One group of stages is labeled Floating Point Intensive, the other Integer Intensive.]

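A two-stage functional decomposition of this kind can be sketched as a simple pipeline. The Python example below is not LAME code: `run_pipeline`, `analyze` and `encode` are hypothetical stand-ins, and the bounded `queue.Queue` plays the role of the buffer that lets the first stage run ahead of the second. Because a single consumer reads the queue in FIFO order, the output matches a serial run frame for frame.

```python
import queue
import threading

def run_pipeline(frames, front_end, back_end, depth=1):
    """Two-stage pipeline: one thread per stage, frames handed over in order."""
    handoff = queue.Queue(maxsize=depth)   # buffer between the two stages
    results = []
    done = object()                        # sentinel marking end of stream

    def stage1():
        for frame in frames:
            handoff.put((frame, front_end(frame)))
        handoff.put(done)

    def stage2():
        while True:
            item = handoff.get()
            if item is done:
                break
            frame, analysis = item
            results.append(back_end(frame, analysis))

    threads = [threading.Thread(target=stage1), threading.Thread(target=stage2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

def analyze(frame):
    """Stand-in for the front-end stage (hypothetical)."""
    return sum(frame)

def encode(frame, analysis):
    """Stand-in for the back-end stage (hypothetical)."""
    return (len(frame), analysis)

frames = [[1, 2], [3, 4], [5, 6]]
print(run_pipeline(frames, analyze, encode))  # [(2, 3), (2, 7), (2, 11)]
```

Keeping the frame order intact is what makes a goal such as 1:1 compatibility with the serial output straightforward to test.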
Results

Results due to Multi-Threading
                              SMT Platform (CBR / VBR)   SMP Platform (CBR / VBR)
Using Microsoft’s Compiler*   22% / 32%                  38% / 62%
Using Intel® Compiler         % / 29%                    44% / 59%

Overall Performance Results
                                        HT Platform (CBR / VBR)   CMP Platform (CBR / VBR)
LAME MT code + Using Intel® Compiler    % / 70%                   78% / 109%
The LAME example: a high-quality threading job.

Some Observations
What can be accepted:
–Threading. There is always something to thread, but not always with significant gain.
–Differentiation via micro-architecture. Must be done on the same micro-architecture. If not, we may find that we helped some competitor instead of Intel.
–Streaming SIMD Extensions opportunities.
–64-bit porting. A huge opportunity. Can be used if the student can’t find other options. Porting the assembly code will definitely show benefit. It is a big task waiting to be done.
Things that didn't go as expected:
–Finding good and influential candidates. It becomes more difficult every semester.
–One semester is too short for many apps.
–Returning code to the moderators: only some parts of some projects were accepted by the open source moderators. None of the projects was fully accepted.

Backup