Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property.

Presentation transcript:

Software & Services Group, Developer Products Division Copyright© 2010, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

Introduction to application optimization using Intel® performance tools
Andrei Anufrienko, Intel Compiler Group

The objectives of this course
Get a basic understanding of:
- the main factors of processor performance;
- basic performance improvement techniques;
- Intel® tools for performance analysis;
- the main options and components of the Intel compiler;
- the theoretical foundations of some performance optimizations.

You will be able to:
- describe the main problems of processor performance;
- investigate an application using the VTune™ Performance Analyzer and find problem areas;
- identify the main problems of the analyzed application;
- develop a strategy to improve application performance;
- describe the main components of the compiler and their functions;
- control the level of optimization with command-line options.

Course plan
- Intel microprocessor architecture and the main factors affecting processor performance;
- VTune™ Performance Analyzer usage;
- The role of the compiler in improving application performance;
- Some theoretical concepts: control flow graph, data-flow analysis;
- Permutation optimizations and their applicability; dependencies;
- Vectorization;
- Parallelization using OMP directives and auto-parallelization;
- The main components of the compiler, their tasks and interconnection.

Intel microprocessor architecture and the main factors affecting processor performance.

Simplified processor model (diagram). Labels: operative memory (RAM), system bus, processor, control unit, arithmetic logic unit (ALU), registers, commands, data, input-output unit, input-output bus, external memory.

Simplified processor model
Components:
- Control Unit (CU)
- Arithmetic and Logic Unit (ALU)
- System registers
- Front Side Bus (FSB)
- Memory
- Peripheral devices
The Control Unit (CU) decodes instructions received from memory, controls the ALU, and transfers data between the CPU registers, memory and peripheral devices.
The ALU consists of several parts that perform arithmetic and logical operations on the system registers.
System registers are a small piece of memory inside the CPU used for temporary storage of the information being processed.
The system bus is used for data transfer between the CPU and memory, as well as between the CPU and peripherals.

High performance is one of the key factors in the competition between computer system manufacturers.
Processor performance is directly related to the amount of computational work that can be done per unit of time. Roughly speaking:
Performance = Number of instructions / Time
We will discuss performance on the basis of the IA32 and IA32E architectures (IA32 with EM64T).
Factors affecting processor performance:
- CPU clock frequency;
- Amount and speed of accessible memory;
- The performance of the instructions and the completeness of the instruction set;
- Usage of the internal registers;
- The quality of pipelining;
- The quality of prediction;
- The quality of prefetching;
- Superscalarity;
- The quality of vectorization;
- Parallelization and multi-core.

Clock rate
Because the processor is made of different components working at different speeds, a processor timer provides synchronization by sending periodic synchronization signals. Its frequency is called the clock speed of the processor.
Memory speed and amount
- MB of memory;
- new system registers and a new memory mode: 16 MB of memory;
- the first 32-bit processor: 4 GB;
- EM64T (Extended Memory 64 Technology): ~2^64 bytes.
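The memory sizes on this slide follow directly from the width of the address. The sketch below is only an illustration: the function names are hypothetical, and the address widths (24, 32 and 64 bits) are assumptions chosen because they reproduce the 16 MB, 4 GB and 2^64-byte figures quoted above.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** address_bits


def human_readable(n_bytes):
    """Format a byte count using binary units (1 KiB = 1024 B)."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]
    index = 0
    while n_bytes >= 1024 and index < len(units) - 1:
        n_bytes //= 1024
        index += 1
    return f"{n_bytes} {units[index]}"


# Assumed address widths; the slide itself only gives the resulting sizes.
for bits in (24, 32, 64):
    print(bits, "address bits ->", human_readable(addressable_bytes(bits)))
```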

The performance of the instructions and completeness of the instruction set
Performance depends on how well the instructions are implemented and how well the basic instruction set covers the tasks that occur in practice.
CISC and RISC stand for complex and reduced instruction set computing.
Modern Intel processors are a hybrid of CISC and RISC: before execution, the processor converts CISC instructions into sequences of simpler RISC-like instructions.

Registers and memory
System registers have the smallest access time, so the number of available registers affects the performance of the microprocessor.
Register spilling: a lack of system registers causes heavy traffic between the registers and the application stack.
IA32E (EM64T technology) added additional system registers.
Today memory access is much slower than computation. Two characteristics describe the properties of memory:
- Response time (latency): the number of processor cycles required to transfer data from memory.
- Bandwidth: the number of items that can be transferred between the processor and memory in one cycle.
There are two possible strategies for improving performance: reduce the response time, or prefetch the necessary memory.

Memory access time is reduced by the cache system (a small amount of memory located on the processor). Memory blocks are preloaded into the cache.
If the address is in the cache, there is a "hit" and the data is obtained much faster. Otherwise there is a "cache miss" and additional time is needed. In that case the memory block is read into the cache over one or more bus cycles, which is called a cache line fill. (The size of a cache line is 64 bytes.)
There are different kinds of cache:
- fully associative cache (each block can be placed anywhere in the cache);
- direct-mapped cache (each block can be loaded into exactly one place);
- various hybrid options, such as set-associative caches.
Set-associative access: the least significant bits of the block address determine the set into which the block can be loaded; each set holds several cache lines, and placement within the set is associative.
The quality of memory access is the main key to performance.
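How an address is mapped to a place in a set-associative cache can be sketched as follows. The 64-byte line size comes from the slide; the number of sets (64) and the function name are assumptions made only for this illustration and do not describe any particular processor.

```python
LINE_SIZE = 64   # bytes per cache line (from the slide)
NUM_SETS = 64    # hypothetical number of sets, chosen for illustration


def split_address(address, num_sets=NUM_SETS, line_size=LINE_SIZE):
    """Split a byte address into (tag, set index, offset inside the line)."""
    offset = address % line_size        # position of the byte inside its line
    block_number = address // line_size  # which 64-byte memory block it is
    set_index = block_number % num_sets  # low bits of the block number
    tag = block_number // num_sets       # remaining high bits identify the block
    return tag, set_index, offset


# Two addresses that are num_sets * line_size bytes apart compete for the same set.
print(split_address(0x12345))
print(split_address(0), split_address(NUM_SETS * LINE_SIZE))
```

In a direct-mapped cache each set holds one line, so the second of two such addresses would evict the first; with higher associativity several of them can stay in the set at once.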

Modern computing architectures contain a complicated cache hierarchy.
Nehalem (Core i7), latencies in cycles:
- L1: latency 4
- L2: latency 11
- L3: latency 38
- Operative memory: latency > 100
Proactive memory access is implemented by hardware prefetching, which is based on the history of cache misses. It tries to detect and prefetch independent streams of data.
There is also a special set of instructions that asks the processor to load the specified memory into the cache (software prefetching).
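The effect of these latencies on a typical load can be estimated with a simple expected-value model. This is a rough sketch under stated assumptions: each listed latency is treated as the full cost of a hit at that level, main memory is taken at its lower bound of 100 cycles, and the hit rates are invented for the example rather than measured.

```python
def average_latency(latencies, hit_rates):
    """Expected access latency in cycles.

    latencies: cost of a hit at each cache level, followed by main memory.
    hit_rates: fraction of accesses reaching a cache level that hit there.
    """
    expected = 0.0
    reach = 1.0  # probability that an access gets as far as this level
    for latency, hit_rate in zip(latencies[:-1], hit_rates):
        expected += reach * hit_rate * latency
        reach *= 1.0 - hit_rate
    expected += reach * latencies[-1]  # everything else goes to main memory
    return expected


NEHALEM_LATENCIES = [4, 11, 38, 100]   # L1, L2, L3, memory (lower bound)
ASSUMED_HIT_RATES = [0.9, 0.8, 0.5]    # hypothetical values for illustration

print(average_latency(NEHALEM_LATENCIES, ASSUMED_HIT_RATES))
```

Even with only 1% of accesses going all the way to memory in this made-up scenario, main memory contributes about one cycle to the average, which is why locality matters so much.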

The principle of locality. The quality of the prefetch.
Locality of reference means that variables or related data are reused. There is a difference between temporal locality (reuse of certain data and resources) and spatial locality (use of data located close together in memory).
The caching mechanism uses the principle of temporal locality. (Before a new cache line is loaded into the cache, some cache line has to be freed. The cache mechanism selects the one with the oldest access time.)
The prefetching engine uses the principle of spatial locality. It tries to detect the pattern of memory accesses in order to preload into the cache the memory that will be needed soon.
The size of a preloaded block (cache line) is 64 bytes. Thus with good spatial locality (data used together in a calculation is located close together in memory) fewer cache lines have to be loaded into the cache.
One known performance problem is "cache aliasing": unfortunate memory locations of the objects participating in a calculation cause useful cache lines to be replaced by other needed addresses.
Example (figure): z = sqrt(y^2 + x^2). If x, y and z are adjacent in memory, one cache line has to be loaded; if they are scattered, up to three cache lines have to be loaded.
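Both ideas on this slide can be tried out in a few lines. The sketch below is a toy model, not a description of real hardware: it assumes a fully associative cache with least-recently-used replacement, and the addresses and function names are made up for the example. Only the 64-byte line size is taken from the slide.

```python
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (from the slide)


def cache_lines_touched(addresses, line_size=LINE_SIZE):
    """Number of distinct cache lines needed to hold the given byte addresses."""
    return len({address // line_size for address in addresses})


def simulate_lru(addresses, num_lines, line_size=LINE_SIZE):
    """Return (hits, misses) for a fully associative cache that evicts the
    line with the oldest access time."""
    cache = OrderedDict()  # keys are line numbers, oldest access first
    hits = misses = 0
    for address in addresses:
        line = address // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits, misses


# x, y, z as adjacent 8-byte values versus scattered over memory.
print(cache_lines_touched([0x1000, 0x1008, 0x1010]))
print(cache_lines_touched([0x1000, 0x2000, 0x3000]))
print(simulate_lru([0, 8, 16, 64, 0, 128, 64], num_lines=2))
```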

Pipeline (diagram). Stages: instruction fetch, register fetch, instruction decode, execution, data fetch, write back. The diagram shows instructions 1 to 7 moving through the stages, one tick at a time.

The quality of pipelining, instruction-level parallelism
Pipelining means that successive instructions are processed at the same time, each in a different phase of the pipeline.
Typical instruction execution can be divided into the following steps:
- instruction fetch (IF);
- instruction decoding / register selection (ID);
- operation execution / calculation of effective memory addresses (EX);
- memory access (MEM);
- storing the result (WB).
Pipelining improves the throughput of the processor, but if instructions depend on the results of previous instructions, there will be delays. Thus the benefit of pipelining depends on the level of instruction parallelism.
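The throughput gain can be estimated with a simple cycle count. This sketch assumes an idealized pipeline in which every stage takes one cycle and each dependency delay costs a whole number of stall cycles; the function names are hypothetical.

```python
STAGES = 5  # IF, ID, EX, MEM, WB


def unpipelined_cycles(num_instructions, stages=STAGES):
    """Each instruction passes through all stages before the next one starts."""
    return num_instructions * stages


def pipelined_cycles(num_instructions, stages=STAGES, stall_cycles=0):
    """The first instruction needs `stages` cycles; each following one
    completes one cycle later, plus any stalls caused by dependencies."""
    if num_instructions == 0:
        return 0
    return stages + (num_instructions - 1) + stall_cycles


n = 1000
print(unpipelined_cycles(n), pipelined_cycles(n))
print(unpipelined_cycles(n) / pipelined_cycles(n))  # approaches 5 for large n
```

In this model every stall cycle adds directly to the total, so a stream of dependent instructions loses part of the ideal speedup.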

The quality of prediction
Instructions may depend on data and on control logic (data dependence and control flow dependence).
The efficiency of the pipeline is limited by conditional branches in the instruction flow. If there is a conditional branch, the following instructions are not known until the condition has been calculated. Should the pipeline be stopped?
The branch predictor is designed to solve this problem. The predictor selects one possible path, and instruction fetching and processing continue along it. All processed instructions are kept in pipeline storage. If the predictor's assumption was correct, they are all marked as valid; otherwise a "branch misprediction" has happened, the pipeline storage has to be cleared and new instructions have to be fetched.
There are static and dynamic predictors:
- A static predictor uses some simple rules. Trivial prediction: a forward branch is predicted as not taken, and a backward branch is predicted as taken.
- A dynamic predictor collects statistics on every branch and bases its choice on this information.
There is also branch target prediction, which predicts the targets of unconditional jumps.
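The two kinds of predictor can be sketched as follows. The static rule is the one stated on the slide. For the dynamic predictor the slide only says that statistics are collected per branch; the two-bit saturating counter used here is one common textbook scheme, chosen as an assumption, and all names are hypothetical.

```python
def static_predict(branch_address, target_address):
    """Static rule: backward branches are predicted taken, forward ones not taken."""
    return target_address < branch_address


class TwoBitPredictor:
    """Dynamic predictor: one 2-bit saturating counter per branch address.
    Counter values 0-1 predict 'not taken', 2-3 predict 'taken'."""

    def __init__(self):
        self.counters = {}

    def predict(self, branch_address):
        return self.counters.get(branch_address, 1) >= 2

    def update(self, branch_address, taken):
        counter = self.counters.get(branch_address, 1)
        counter = min(3, counter + 1) if taken else max(0, counter - 1)
        self.counters[branch_address] = counter


def count_mispredictions(predictor, branch_address, outcomes):
    """Feed a sequence of actual outcomes to the predictor and count its misses."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch_address) != taken:
            misses += 1
        predictor.update(branch_address, taken)
    return misses


# A loop branch that is taken nine times and then falls through.
loop_outcomes = [True] * 9 + [False]
print(count_mispredictions(TwoBitPredictor(), 0x40, loop_outcomes))
```

In this run the predictor misses only on the first iteration and on the loop exit, which is the behaviour that makes loop branches cheap.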

Superscalarity
A superscalar processor is a processor that is able to perform multiple operations per clock cycle. It has several execution units.
The superscalar technique has several identifying characteristics:
- Instructions are issued from a sequential instruction stream.
- A special device detects data dependences between instructions at run time.
- The CPU accepts multiple instructions per clock cycle.
A modern CPU is always superscalar and pipelined.
Each execution unit has its own specialization. "Diversity" of instructions and a high level of instruction parallelism give the best CPU efficiency.
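Why instruction parallelism matters for a superscalar processor can be shown with a toy issue model. The sketch assumes in-order issue, a result that becomes available one cycle after issue, and identical execution units; real processors have specialized units and longer latencies. The function name and the dependency lists are made up for the example.

```python
def schedule_in_order(dependencies, width):
    """Issue cycle (1-based) of each instruction.

    dependencies[i] lists the earlier instructions whose results
    instruction i needs; at most `width` instructions issue per cycle,
    and instructions issue strictly in program order.
    """
    issue_cycle = []
    current_cycle = 1
    issued_this_cycle = 0
    for needed in dependencies:
        # An instruction can issue one cycle after its producers.
        earliest = max((issue_cycle[j] + 1 for j in needed), default=1)
        cycle = max(current_cycle, earliest)
        if cycle == current_cycle and issued_this_cycle == width:
            cycle += 1  # all issue slots of this cycle are already used
        if cycle != current_cycle:
            current_cycle = cycle
            issued_this_cycle = 0
        issue_cycle.append(cycle)
        issued_this_cycle += 1
    return issue_cycle


independent = [[], [], [], []]    # four unrelated instructions
chain = [[], [0], [1], [2]]       # each instruction needs the previous result
print(schedule_in_order(independent, width=2))
print(schedule_in_order(chain, width=2))
```

With two issue slots the independent instructions finish in two cycles, while the dependent chain still needs four: the second slot stays empty.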


Simplified processor model, extended (diagram). Labels: operative memory (RAM), prefetching, caches, control unit (CU), branch prediction, registers, arithmetic logic units (ALU), superscalar, input-output unit, input-output bus, external memory.

Vector instructions and vectorization
A typical vector instruction performs an elementary operation on two fixed-length vector sequences in memory or in vector registers:
C(1:n) = A(1:n) + B(1:n)
Fortran array sections are a convenient notation for vector operations.
Vectorization is the process of converting scalar calculations, in which an operation is performed on a pair of operands, to a vector representation, in which an operation is performed on a pair of vector operands. Each vector contains several scalar operands.
The Pentium III processor of the x86 family introduced SSE (Streaming SIMD Extensions). There were eight 128-bit registers (XMM0-XMM7) and 70 new instructions, including instructions for working with floating-point numbers.
SSE2, SSE3, SSSE3, SSE4, SSE4.2 and AVX are further extensions of SSE.
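The transformation from a scalar loop to a vector loop can be imitated in plain Python. This is only a model of the idea: the four-element chunk stands for one 128-bit register holding four 32-bit floats, the list slices stand for vector loads and stores, and the function name is hypothetical. No real SIMD instructions are executed.

```python
LANES = 4  # a 128-bit register holds four 32-bit floating-point values


def vector_add(a, b, lanes=LANES):
    """Compute c[i] = a[i] + b[i], processing `lanes` elements per step."""
    if len(a) != len(b):
        raise ValueError("operands must have the same length")
    n = len(a)
    c = [0.0] * n
    vector_part = n - n % lanes
    # Main loop: one "vector instruction" handles a whole chunk at once.
    for i in range(0, vector_part, lanes):
        c[i:i + lanes] = [x + y for x, y in zip(a[i:i + lanes], b[i:i + lanes])]
    # Remainder loop: leftover elements are processed one by one (scalar).
    for i in range(vector_part, n):
        c[i] = a[i] + b[i]
    return c


print(vector_add([1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
                 [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]))
```

The split into a main vector loop and a scalar remainder loop mirrors what a vectorizing compiler typically has to generate when the trip count is not a multiple of the vector length.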

Look-ahead and out-of-order execution
Modern x86 family microprocessors have advanced mechanisms that look ahead in the instruction flow and identify instructions that can be computed in parallel.
If the look-ahead buffer contains enough instructions that can be processed together, the processor pipeline works with maximum efficiency. This approach leads to execution in an order different from the program order (out-of-order execution).
Implementing out-of-order mechanisms makes the processor architecture more complicated and causes additional energy costs.
There are Intel processors without out-of-order support (Itanium, Atom). In this case instruction scheduling is the key factor for good processor performance.
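The benefit of a look-ahead buffer can be illustrated with a toy scheduler. The model is an assumption made for this sketch: results are available one cycle after issue, all execution units are identical, and the processor may pick any ready instruction among the oldest `window` instructions that have not been issued yet. The function name and the example program are hypothetical.

```python
def schedule_out_of_order(dependencies, width, window):
    """Issue cycle (1-based) of each instruction.

    dependencies[i] lists the earlier instructions that instruction i needs.
    Each cycle, up to `width` ready instructions are issued, chosen from
    the oldest `window` instructions not issued so far.
    """
    count = len(dependencies)
    issue_cycle = [None] * count
    cycle = 0
    while any(c is None for c in issue_cycle):
        cycle += 1
        pending = [i for i in range(count) if issue_cycle[i] is None][:window]
        issued = 0
        for i in pending:
            if issued == width:
                break
            # Ready only if every producer was issued in an earlier cycle.
            ready = all(issue_cycle[j] is not None and issue_cycle[j] < cycle
                        for j in dependencies[i])
            if ready:
                issue_cycle[i] = cycle
                issued += 1
    return issue_cycle


# Two independent chains of three instructions, written one after the other.
program = [[], [0], [1], [], [3], [4]]
print(schedule_out_of_order(program, width=2, window=1))  # no look-ahead
print(schedule_out_of_order(program, width=2, window=6))  # full look-ahead
```

With a window of one instruction the program needs six cycles; with look-ahead over all six instructions the two chains overlap and it needs three. On a processor without out-of-order support, a compiler can get a similar effect by interleaving the two chains in the generated code.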


Parallelization and multi-core
Multitasking is a method in which multiple tasks, also known as processes, share the common resources of a microprocessor.
Multithreaded computers have hardware support for executing multiple threads efficiently. Threads are parts of a process and share the same memory. Multithreading makes it possible to divide a calculation into several parts that are processed in parallel.
Hyper-threading technology mixes the instruction sequences of different threads to improve instruction-level parallelism (Pentium 4, Core i7).
Cores: a microprocessor contains several superscalar pipelines that have their own calculation resources but share the system bus, memory and upper-level caches.
Multiprocessor solutions contain several processors.
Multiprocessor and multi-core systems make it possible to increase application performance by creating multiple threads.
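Dividing a calculation into parts that are handled by several threads can be sketched as below. This is an illustration of the decomposition only: `parallel_sum` is a hypothetical name, and in the standard CPython interpreter threads running pure Python code do not execute simultaneously, so the example shows the structure of a multithreaded computation rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, num_threads=4):
    """Split `values` into chunks, sum each chunk in its own task,
    then combine the partial results."""
    chunk_size = max(1, (len(values) + num_threads - 1) // num_threads)
    chunks = [values[i:i + chunk_size]
              for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial_sums = pool.map(sum, chunks)  # one partial sum per chunk
        return sum(partial_sums)


print(parallel_sum(list(range(1, 101))))
```

The threads share the same list in memory, as described above; each one only reads its own chunk, so no synchronization is needed until the partial sums are combined.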

Main characteristics of an application that affect its performance
- Calculation efficiency;
- Memory usage effectiveness;
- Correct branch prediction;
- Efficient use of vector instructions;
- The effectiveness of parallelization;
- The level of instruction parallelism.

Performance measuring
What factors affect the performance of a specific program?
- Compiler quality;
- Performance of the computer system.
Consumers need criteria for determining the performance of a computer system:
- a representative set of typical tasks;
- a universal testing scheme;
- independence from microprocessor (MP) manufacturers.
SPEC (Standard Performance Evaluation Corporation, spec.org) is a non-profit organization that creates, supports and maintains a standard set of tests for comparing the performance of different computer systems. It develops and publishes standard suites for performance measurement.
- The CPU suite is designed to measure performance and can be used to compare programs running on different computer systems.
- The OMP suite measures performance on tests that use the OpenMP standard for shared-memory parallel processing.

Optimizing compiler role
A compiler translates the entire source program into an equivalent program in machine code or assembly language.
Does the compiler have any role in the struggle for the performance of the microprocessor (MP)?
- The compiler is used during testing and debugging of the functionality of a new MP.
- The performance gains of a new computer system that come from a new instruction set or a larger number of registers can be demonstrated only with an optimizing compiler that supports these innovations.
- The compiler is able to hide the architects' mistakes.

List of literature for deeper study
1. Randy Allen, Ken Kennedy. "Optimizing Compilers for Modern Architectures"
2. David F. Bacon, Susan L. Graham, Oliver J. Sharp. "Compiler Transformations for High-Performance Computing"
3. Aart J.C. Bik. "The Software Vectorization Handbook"
4. Richard Gerber, Aart J.C. Bik, Kevin B. Smith, Xinmin Tian. "The Software Optimization Cookbook"
5. Intel® 64 and IA-32 Architectures Software Developer's Manual
6. Intel® 64 and IA-32 Architectures Optimization Reference Manual
7. Agner Fog. "Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms"

Thank you!