Copyright © 2004 Intel Corporation. All Rights Reserved. Maximizing Application’s Performance by Threading, SIMD and micro architecture tuning. Koby Gottlieb.

Presentation transcript:

Maximizing Application’s Performance by Threading, SIMD and micro architecture tuning
Koby Gottlieb, Intel Corporation, Feb

Agenda
Threading gains and challenges
Optimization methodology, project milestones
–Developing a benchmark
–VTune™ Performance Analyzer
–Threading: Overview of approaches
–Intel® Thread Checker
–Intel® Thread Profiler
–Streaming SIMD Extensions (SSE) and micro architectural issues
Project example
[Mark] is a trademark or registered trademark of Intel Corporation or its subsidiaries in the United States or other countries

Dual-Core Systems
One package with 2 cores
Software impact
–2 Cores  2 processors
–2 Cores  2x resources
Use threads to exploit full resources of dual core processors
Efficiently Utilize Dual Cores

Threads Defined
OS creates process for each program loaded
–Each process executes as a separate thread
Additional threads can be created within the process
–Each thread has its own Stack and Instruction Pointer
–All threads share code and data
[Diagram: a Process holding shared Code and Data, with thread1(), thread2(), … threadN() each having its own Stack and IP]

Threading Software
OpenMP* threads
Windows* threads
POSIX* threads (pthreads)
If both cores are fully busy, then a 2x speedup is possible
* Other names and brands may be claimed as the property of others.

Correctness Bug: Data Races
Suppose: a=1, b=2
Thread1: x = a + b
Thread2: b = 42
What is the value of x if:
–Thread1 runs before Thread2? x = 3
–Thread2 runs before Thread1? x = 43
Data race: concurrent read, modify, write of same address
Outcome depends on thread execution order
Challenges Unique to Threading

Solving Data Races: Synchronization
Thread1: Acquire(L); a = 1; b = 2; x = a + b; Release(L)
Thread2: Acquire(L); b = 42; Release(L)
Acquisition of mutex L ensures atomic access
–Only one thread can hold the lock at a time
Example APIs:
–EnterCriticalSection(), LeaveCriticalSection()
–pthread_mutex_lock(), pthread_mutex_unlock()
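
The Acquire(L)/Release(L) pattern on this slide can be sketched in a few lines. This is a minimal illustration in Python rather than the Win32 or pthreads APIs the slide names; `threading.Lock` plays the role of mutex L, and the `shared` dictionary and `run_once` helper are hypothetical names introduced only for this sketch.

```python
import threading

lock = threading.Lock()                  # plays the role of mutex L
shared = {"a": 0, "b": 0, "x": None}     # hypothetical shared data


def thread1():
    with lock:                           # Acquire(L) ... Release(L)
        shared["a"] = 1
        shared["b"] = 2
        shared["x"] = shared["a"] + shared["b"]


def thread2():
    with lock:                           # Acquire(L) ... Release(L)
        shared["b"] = 42


def run_once():
    """Start both threads, wait for them, and return the computed x."""
    workers = [threading.Thread(target=thread2),
               threading.Thread(target=thread1)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return shared["x"]
```

Because Thread1 writes `b` and reads it back while holding the lock, Thread2's write can land only before or after that whole block, so `x` no longer depends on the execution order. The final value of `b` still does.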

Amdahl’s Law
If only 1/2 of the code is parallel, a 2X speedup is unlikely
P = parallel portion of process
N = number of processors (cores)
O = parallel overhead
[Diagram: time bar dividing the total time T into the parallel portion P and the serial portion (1-P)]
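
The formula itself was an image and is not in the transcript. A commonly quoted form of Amdahl's law with an overhead term, assumed here rather than taken from the slide, is T_parallel = ((1 - P) + P/N + O) × T_serial, which gives a speedup of 1 / ((1 - P) + P/N + O). The function name `amdahl_speedup` and the choice to express O as a fraction of the serial run time are assumptions of this sketch.

```python
def amdahl_speedup(p, n, overhead=0.0):
    """Estimated speedup for parallel fraction p on n cores.

    overhead is the parallel overhead O, expressed as a fraction of the
    serial run time (an assumption of this sketch).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n + overhead)


# Half the code parallel, two cores, no overhead: about 1.33x, not 2x.
half_parallel = amdahl_speedup(0.5, 2)
```

Under this form, a 2x speedup on two cores needs P = 1 and O = 0, which matches the slide's point that a half-parallel program cannot reach it.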

Threads Introduce a New Class of Problems
Correctness bugs
–Data races
–Deadlock
–and more…
Performance bottlenecks
–Overhead
–Load balance
–and more…
Intel® Threading Tools can help!
–Intel® Thread Checker finds correctness bugs
–Thread Profiler feature pinpoints bottlenecks

Methodology & Milestones: Getting Started
–Most of the world's apps are not threaded:
  There are 106,177 registered projects in ( ).
  Almost none of the applications are performance sensitive.
  Some performance-sensitive apps are too small, too big, or too complex.
–Is the app a representative picture of the real software world?
–If so, we have a problem in our multi-core strategy.
–Learn the app. There is no need to understand every algorithm, but an overall understanding is a must. The call graph of the VTune™ analyzer is a great tool for this task.
–Develop a representative benchmark. The benchmark must be defined before optimizing. A good benchmark must be automatic (VTune™ analyzer tuning assistant), not too short (above 30 seconds) and not too long. Surprisingly, selecting a good benchmark is time consuming and difficult.

Using VTune™ Performance Analyzer
Sampling is surprisingly easy to use:
–Easy to get good results from sampling without any training.
–Time breakdown is the first step for the threading decision-making process.
–Hot spots might be vectorized.
Call graph as a tool to understand the code and select threading direction:
–Setting the /fixed:no flag for the linker.
–Call graph provides a hierarchical view and overall timing.
–Call graph overhead makes it too inaccurate for timing; must use Sampling for correct time estimates.

Threading
The most challenging part of the project: how to thread.
–Added difficulty: shared resources like the FSB or L2 may eliminate the speedup potential.
–Functional or data decomposition?
–In many cases you can find mostly functional parallelism, which only scales to 2-3 threads.
–Examples: identify the stages and let thread 0 work on the front end of data element N+1 while thread 1 works on the back end of data element N. Assign a thread per channel in stereo.
–For good data decomposition, the code should be designed in advance to be threaded. A desirable goal is to maintain the exact results in order to simplify testing, so breaking the input into chunks does not work if there is any history between data elements.
–If data decomposition works on only a relatively small part of the project, there is almost no speedup because of the synchronization overhead.
OpenMP is very convenient for data decomposition experimentation.
–Supported by the Intel® compiler. It became more legitimate with its introduction in the MS.NET 2005 compiler*.
* Other names and brands may be claimed as the property of others.
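
The remark about history between data elements can be made concrete with a small sketch. The `smooth` filter below is a hypothetical stand-in for any stage whose output depends on earlier input; it is not taken from any of the projects in the talk.

```python
def smooth(samples, state=0.0):
    """Toy stateful filter: each output depends on all earlier samples."""
    out = []
    for sample in samples:
        state = 0.5 * state + sample
        out.append(state)
    return out, state


data = [1.0, 2.0, 3.0, 4.0]

# Reference: process the whole input in one pass.
whole, _ = smooth(data)

# Naive data decomposition: each chunk starts from a fresh state.
first, carry = smooth(data[:2])
second_independent, _ = smooth(data[2:])

# Carrying the state across the chunk boundary restores the exact
# results, but the second chunk now has to wait for the first.
second_carried, _ = smooth(data[2:], carry)
```

Here `first + second_independent` differs from `whole`, while `first + second_carried` matches it exactly. That is the trade-off the slide describes: bit-exact output forces the chunks back into sequence.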

Debugging the Threaded App
Convert the app to serial code and debug it first, running thread 0 before thread 1 and then in reverse order.
–This methodology is good for 75% of the bugs and does not require any tricky debugging technique.
–Then try running in parallel and start looking for shared data elements.
Intel® Thread Checker to the rescue.
–“No, it is not broken, just build a very small example and be patient.” It takes a long time.
–Intel® Thread Checker gives excellent analysis capabilities:
  the location of the faulty data element allocation
  the read location
  the write location
  the call stack that brings us to this location
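
The "serial first, in both orders" step can be sketched as a small harness. The helper name `run_serially` and the dictionary-based state are assumptions of this illustration; the two worker bodies reuse the data-race example from earlier in the talk (a=1, b=2).

```python
from itertools import permutations


def run_serially(workers, make_state):
    """Run the thread bodies one after another, once per ordering."""
    results = {}
    for order in permutations(range(len(workers))):
        state = make_state()             # fresh shared data for each ordering
        for index in order:
            workers[index](state)
        results[order] = state
    return results


def thread1(state):
    state["x"] = state["a"] + state["b"]


def thread2(state):
    state["b"] = 42


results = run_serially([thread1, thread2],
                       lambda: {"a": 1, "b": 2, "x": None})

# Different final states across orderings point at order-dependent shared data.
distinct_outcomes = {tuple(sorted(state.items())) for state in results.values()}
order_dependent = len(distinct_outcomes) > 1
```

No threads are started, so an ordinary debugger is enough at this stage. Any ordering whose result differs from the others marks shared data to inspect before going parallel.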

Intel® Thread Checker 2.0 Features
Locates threading bugs:
–Data races (storage conflicts)
–Deadlocks (potential and actual)
–Win32 threading API usage problems
–Memory leaks and overwrites
Isolates bugs to source code line
Describes possible causes of errors and suggests resolutions
Categorizes errors by severity level

Diagnostics List
[Screen shot: Intel® Thread Checker. Callouts: Diagnostics List in Terse mode; Summary and legend; Verbose diagnostics]

Source Code View
[Screen shot: Intel® Thread Checker]
Each diagnostic in the list links to its source code line(s)

Help with Diagnostics
[Screen shot: Intel® Thread Checker. Callouts: 1) Right-click here... 2) More help!]

Intel® Thread Checker Example: From Sphinx final report.

Threading, Performance
Check what percentage of the code is threaded.
–This sets the upper bound for potential performance.
–Can use VTune™ analyzer to see how much time each thread runs.
–Check whether the total instruction count of the threaded app equals the instruction count of the original app. In many cases there is a huge overhead for threading, or just a bug (doing some work twice).
Evaluate the amount of parallel work.
–Even if both threads spend the same amount of time, they may not be doing it at the same time.
–If an (already debugged) threaded app runs much slower than the scalar app, look for false sharing issues: “No, converting each local variable to an array of 2 variables is not a good idea for threading efficiency.” From one of my meetings, trying to explain why the threaded app was 14X slower than the original app.
Check the critical path.
–Intel® Thread Profiler is great for the job after you figure out how to use it and its cryptic terminology.
–Note that the Win32 API Thread Profiler is not the same tool as the OpenMP Thread Profiler.
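
The point that equal busy time does not imply parallel work can be checked with a little interval arithmetic. This is a sketch under stated assumptions: the per-thread busy intervals are hypothetical hand-written data (in practice they would come from a profiler timeline), and each thread's own intervals are assumed not to overlap one another.

```python
def overlap_seconds(busy_a, busy_b):
    """Total time during which both threads were busy at once.

    Each argument is a list of (start, end) pairs in seconds; the intervals
    within one list are assumed not to overlap each other.
    """
    total = 0.0
    for a_start, a_end in busy_a:
        for b_start, b_end in busy_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


# Both threads are busy for 5 seconds, but strictly one after the other.
back_to_back = overlap_seconds([(0.0, 5.0)], [(5.0, 10.0)])

# Same busy time per thread, this time genuinely concurrent.
concurrent = overlap_seconds([(0.0, 5.0)], [(0.0, 5.0)])
```

In the first case each thread reports the same 5 seconds of work, yet the overlap is zero and the run is effectively serial.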

The Thread Profiler Feature
Pinpoints threading performance bottlenecks in apps threaded with:
–Microsoft* Windows* threads on Microsoft* Windows* systems
–POSIX* pthreads on Linux* systems
–OpenMP* on Microsoft* Windows* and Linux* systems
Plugs into VTune™ environment
–Microsoft* Windows* for IA-32 systems
–Linux* for IA-32 systems
Intel® Threading Tools
* Other names and brands may be claimed as the property of others.

Thread Profiler Feature Analysis
Monitors execution flows to find the Critical Path
–Longest execution flow is the Critical Path
Analyzes the Critical Path
–System utilization: over-subscribed vs. under-subscribed
–Thread state transitions: Blocked -> Running
Captures threads timeline
–Visualize threading structure

Thread Profiler Critical Path
[Diagram: timeline of Thread 1, Thread 2 and Thread 3 over times T0 to T15, with events Acquire lock L, Wait for L, Release L, and Wait for Threads 2 & 3; below it, the Critical Path View. Utilization legend: Idle, Serial, Parallel, Under-subscribed, Over-subscribed. Thread state legend: Cruise time, Overhead, Blocking time, Impact time.]
Start with the critical path
Separate according to system utilization
Add overhead
Further analyze by thread state
Analysis shown for 2-way system

Intel® Thread Profiler (OpenMP) Example: From FAAD final report.

Intel® Thread Profiler (Win32 API)
From FAAD
From GainMPEG: So what’s wrong with this picture?

Streaming SIMD Extensions Coding & Micro-architecture
Intel® Streaming SIMD Extensions
–Optimize the slow thread first in case of functional decomposition.
–In C++, use the class libraries.
–In C, use intrinsics.
–Use inline assembly if the compiler does not behave as expected.
–For integer code or code with many shuffle instructions, inline assembly might be the only solution. But will it be accepted back into the open source tree?
Micro architectural issues
–Use the VTune™ analyzer tuning assistant:
  It's simpler than trying to learn all the ugly stuff.
  It actually works and finds big issues in some cases.
[Chart axis: Clock Ticks (ms)]

Micro arch tuning: VTune Tuning Assist
Phase 1 – identify main slow-down reasons
–The CPI is high
–High branch mispredictions impact
–Many L2 Demand Misses
Use precise events to focus on instructions of interest.

Example: Phase 2 – focus on problem sources
–Branch mispredictions
–L2 load misses

Impact: Web Publications
From audio encoding:
“LAME MT is, as you might have guessed, a multithreaded version of the LAME MP3 encoder. LAME MT was created as a demonstration of the benefits of multithreading specifically on a Hyper-Threaded CPU like the Pentium 4. You can even download a paper (in Word format) describing the programming effort.
Rather than run multiple parallel threads, LAME MT runs the MP3 encoder's psycho-acoustic analysis function on a separate thread from the rest of the encoder using simple linear pipelining. That is, the psycho-acoustic analysis happens one frame ahead of everything else, and its results are buffered for later use by the second thread. The author notes, "In general, this approach is highly recommended, for it is exponentially harder to debug a parallel application than a linear one."
We have results for two different 64-bit versions of LAME MT from different compilers, one from Microsoft and one from Intel, doing two different types of encoding, variable bit rate and constant bit rate. We are encoding a massive 10-minute, 6-second 101MB WAV file here, as we have done in our previous CPU reviews.”
The successful projects have high impact.

The LAME example: What is the LAME Project?
An educational tool used for learning about MP3 encoding. Its goal is to improve:
–Psycho-acoustics quality.
–The speed of MP3 encoding.
LAME is the most popular state of the art MP3 encoder/decoder used by today’s leading products.
Project goals:
–Speeding up the encoding of an audio stream.
–Turning LAME into a Multi-Threaded (MT) engine.
–Be 1:1 bit compatible with the original version.
–Optimize specifically for SMT platforms.
–64 bit port and CMP related optimizations.

MP3 Encoding Overview
Break up the audio stream into frames (uniform chunks, typically ~1K)
[Diagram: an Audio Stream split into Frame 1, Frame 2, Frame 3, Frame 4. Stages shown: Perceptual Model, Analysis Filterbank, MDCT, Quantization, Bitstream Encode. Specifically in LAME: Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization, Huffman Encoding.]

LAME MT – Intuitive approach
The intuitive approach:
[Diagram: Frames 1 to 6 divided between Thread 1 and Thread 2]
This is actually Data Decomposition
An unbreakable dependence due to Huffman Encoding

LAME MT – Functional Decomposition
[Diagram: threads T1 and T2 working on Frames 1 to 6; stages shown: Read Frame, Psycho-Acoustic, Analysis Filterbank, MDCT, Quantization, Huffman Encoding; labels: Floating Point Intensive, Integer Intensive]
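
The functional decomposition on this slide, with analysis running ahead on one thread and its results buffered for a second thread, can be sketched as a two-stage pipeline. The `analyze` and `encode` functions are hypothetical stand-ins, not LAME code, and the `lookahead` parameter is an assumption of this sketch.

```python
import queue
import threading


def analyze(frame):
    """Hypothetical stand-in for the psycho-acoustic analysis stage."""
    return sum(frame) / len(frame)


def encode(frame, analysis):
    """Hypothetical stand-in for the remaining encoder stages."""
    return [sample - analysis for sample in frame]


def encode_serial(frames):
    return [encode(frame, analyze(frame)) for frame in frames]


def encode_pipelined(frames, lookahead=1):
    """Analyze frames on one thread while a second thread encodes them."""
    buffered = queue.Queue(maxsize=lookahead)   # holds analyzed frames
    encoded = []

    def front_end():
        for frame in frames:
            buffered.put((frame, analyze(frame)))
        buffered.put(None)                      # end-of-stream marker

    def back_end():
        while True:
            item = buffered.get()
            if item is None:
                break
            frame, analysis = item
            encoded.append(encode(frame, analysis))

    workers = [threading.Thread(target=front_end),
               threading.Thread(target=back_end)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return encoded
```

Frames still leave the back end in their original order, so the pipelined output equals the serial output, which is what the 1:1 bit compatibility goal requires. In CPython the global interpreter lock prevents a real speedup for pure-Python stages; the sketch shows only the structure.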

Results

Results due to Multi-Threading
                              SMT Platform (CBR / VBR)   SMP Platform (CBR / VBR)
Using Microsoft’s Compiler*   22% / 32%                  38% / 62%
Using Intel® Compiler         % / 29%                    44% / 59%
* Other names and brands may be claimed as the property of others.

Overall Performance Results
                                        HT Platform (CBR / VBR)   CMP Platform (CBR / VBR)
LAME MT code + Using Intel® Compiler    % / 70%                   78% / 109%
The LAME example: high quality threading job.

Some Observations
What can be accepted:
–Threading. There is always something to thread, but not always with significant gain.
–Differentiation via micro architecture. Must be done on the same micro architecture. If not, we may find that we helped some competitor instead of Intel.
–Streaming SIMD Extensions opportunities.
–64 bit porting. A huge opportunity. Can be used if the student can’t find other options. Porting the assembly code will definitely show benefit. It is a big task waiting to be done.
Things that didn't go as expected:
–Finding good and influential candidates. It becomes more difficult every semester.
–One semester is too short for many apps.
–Returning code to the moderators: only some parts of some projects were accepted by the open source moderators. None of the projects were fully accepted.

Backup