Hyper-Threading and Intel Compilers
Andrey Naraikin, Senior Software Engineer, Software Products Division, Intel Nizhny Novgorod Lab
November 29, 2002
Agenda
Hyper-Threading Technology overview
Introduction: Intel SW development tools
–Motivation
–Challenges
–Intel SW tools
Intel Compilers overview
–Technologies supported
–SPEC and other benchmarks
–Some features supported by Intel Compilers
Today’s Processors (Hyper-Threading Overview)
Single-processor systems
–Instruction-level parallelism (ILP)
–Performance improved with more CPU resources
Multiprocessor systems
–Thread-level parallelism (TLP)
–Performance improved by adding more CPUs
Hyper-Threading technology brings TLP to a single-processor system.
Today’s Software
Sequential tasks: Open File, Edit, Spell Check
Parallel tasks: Open DBs, Address Book, InBox, Meeting
Multi-Processing
Multi-tasking workload + more processor resources => improved MT performance
Run parallel tasks using multiple processors (CPU 1, CPU 2, CPU 3)
Hyper-Threading: Quick View
Hyper-Threading Technology
[Diagram: a multiprocessor has two sets of processor execution resources, each with its own architecture state (AS); a Hyper-Threading processor has one set of execution resources shared by two architecture states.]
AS = architecture state (eax, ebx, control registers, etc.) plus xAPIC
Hyper-Threading Technology looks like two processors to software.
Hyper-Threading Architecture Overview
Pentium, VTune and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Hyper-Threading Architecture Details
Resource Utilization
[Diagram: execution-unit occupancy over time (processor cycles), multiprocessing vs. with Hyper-Threading; each box represents a processor execution unit.]
Performance Benefit (Hyper-Threading Technology)
Workloads: A1 Engineering, A2 Genetics, A3 Chemistry, A4 Engineering, A5 Weather, A6 Genetics, A7 CFD, A8 FEA, A9 FEA
Source: “Hyper-Threading Technology: Impact on Compute-Intensive Workloads,” Intel Technology Journal, Vol. 6, 2002.
Key Point (Hyper-Threading Technology)
Hyper-Threading Technology gives better utilization of processor resources and more computing power for multithreaded applications.
Collateral
Web sites
Documentation and application notes
–IA-32 Intel® Architecture Software Developer’s Manual
–Intel® Pentium® 4 and Intel® Xeon™ Processor Optimization Manual
–Intel App Note AP-485, “Intel Processor Identification and the CPUID Instruction”
–Intel App Note AP-949, “Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor”
–Intel App Note, “Detecting Support for Jackson Technology Enabled Processors”
Collateral (Cont’d)
–Intel Technology Journal
–Intel Threading Tools
–OpenMP
–HT overview
Optimization Path (Intel SW Development Tools)
[Chart: stepwise performance advantage, from 1x with a standard compiler to 15x, via the Intel Compiler, analysis with VTune™, performance libraries (IPP or MKL), and OpenMP threading with the Intel Compiler; most steps need little or no code change, some a minor (1-line) change. Intermediate points: 4x, 7x, 9x, 13x.]
Sunset Simulation
Optimized performance: 15x faster
Intel® Compilers (Intel SW Development Tools – Compilers)
C, C++ and Fortran 95
–Available on Windows* and Linux*
–Available for 32-bit and 64-bit platforms
Utilization of latest processor/platform features
–Optimizations for the NetBurst™ architecture (Pentium® 4 and Xeon™ processors)
–Optimizations for the Itanium® architecture
Seamless integration into the Windows* IDE and the Linux* environment
Source- and binary-compatible with the Microsoft* compiler; mostly source-compatible with GNU gcc
Benchmarks: Intel® Compilers 6.0 for Windows*
–SPECint_base2000: leading C++ compiler 703; Intel® C++ Compiler 825. Faster integer performance!
–SPECfp_base2000 (geomean of Fortran tests): CVF* 6.6 686; Intel® Fortran Compiler 881. Faster floating-point performance!
Configuration info: Intel® Pentium® 4 processor, 2.4 GHz, Intel® D850MD motherboard (850 chipset), 256 MB memory, Windows* XP Professional (build 2600), nVidia* GeForce 3 graphics.
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. Users’ results depend on application characteristics (loopy vs. flat), the mix of C and C++, and other factors. For more information on performance tests and on the performance of Intel products, reference [...] or call (U.S.) [...].
Intel® C++ Compiler 6.0 for Linux*: PovRay Image Rendering Time
[Chart: rendering time in seconds and % improvement (60%–160% scale) for gcc 2.96 with O2 and fast-math optimization, Intel® 6.0 with comparable optimization, and Intel® 6.0 with maximum optimization.]
Configuration info: Intel® Pentium® 4 processor, 2.0 GHz, 256 MB memory, nVidia* GeForce 2 graphics card, Linux* 2.4.7, PovRay 3.1g
Special Performance Features
Auto-vectorization for the NetBurst™ architecture
Software pipelining for the EPIC architecture
Auto-parallelization and OpenMP-based parallelization
–for Hyper-Threading and multiprocessor systems
Data prefetching
Profile-guided optimization (PGO)
Interprocedural optimization (IPO)
CPU dispatch
–Establishes the code path at run time depending on the actual processor type
–Allows a single binary with optimal performance across processor families
Techniques Overview (Features by Intel Compilers)
Exploit parallelism to speed up applications
Vectorization
–Supported by programming languages and compilers
–Motivated by modern architectures: superscalar, deeply pipelined cores; SIMD; software pipelining on the Itanium™ architecture
Parallelization
–OpenMP™ directives for shared-memory multiprocessor systems
–MPI computations for clusters
Intel Processors and Vectorization (Features by Intel Compilers – Vectorization)
–Pentium® with MMX™ technology, Pentium® II processors: integer types, 64 bits
–Pentium® III processor: Streaming SIMD Extensions (SSE), single-precision floating point
–Pentium® 4 processor: Streaming SIMD Extensions 2 (SSE2), double-precision floating point; integer types, 128 bits
Automatic Vectorization
The compiler automatically transforms sequential code for SIMD execution (icl -Qx[MKW]; hardware SIMD instructions plus run-time library calls):

for (i = 0; i < n; i++) {
  a[i] = a[i] + b[i];
  a[i] = sin(a[i]);
}

becomes (pseudocode, VL = vector length):

for (i = 0; i < n; i += VL) {
  a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);
  a(i : i+VL-1) = _vmlSin(a(i : i+VL-1));
}
Vectorization Example (icl -QxW)
[Diagram: scalar vs. vector addition of arrays a and b]

double a[N], b[N];
int i;
for (i = 0; i < N; i++)
  a[i] = a[i] + b[i];
Reduction Example
[Diagram: the loop kernel accumulates partial sums of a; the postlude combines them into x]

float a[N], x;
int i;
x = 0.0;
for (i = 0; i < N; i++)
  x += a[i];
Parallel Program Development (Features by Intel Compilers – Parallelization)
Approaches, in increasing ease of use/maintenance:
–Explicit threading using operating system calls
–Industry-standard OpenMP* directives
–Automatically, using the compiler
Autoparallelization

float a[N], b[N], c[N];
int i;
for (i = 0; i < N; i++)
  c[i] = a[i]*b[i];

icl -Qparallel foo.c   { -xparallel on Linux }
foo.c
foo.c(7) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
./foo.exe   (the executable detects and uses the number of processors)
-Qpar_report[n] gives helpful messages from the compiler.
OpenMP™ Directives
OpenMP* standard
–A set of directives to enable writing multithreaded programs
Shared-memory parallelism at the programming-language level
–Portability
–Performance
Supported by Intel® Compilers
–Windows*, Linux*
–IA-32 and Itanium™ architectures
Simple Directives
Pointers and procedure calls with escaped pointers prevent the analysis needed for autoparallelization; use simple directives instead:

foo(float *a, float *b, float *c)
{
  int i;
  #pragma parallel
  for (i = 0; i < N; i++) {
    *c++ = (*a++) * bar(b++);
  }
}
OpenMP* Directives

void foo()
{
  int a[1000], b[1000], c[1000], x[1000], i, NUM;
  /* parallel region */
  #pragma omp parallel private(NUM) shared(x, a, b, c)
  {
    NUM = omp_get_num_threads();
    #pragma omp for private(i)  /* work-sharing for loop */
    for (i = 0; i < 1000; i++) {
      x[i] = bar(a[i], b[i], c[i], NUM);  /* assume bar has no side effects */
    }
  }
}

icl -Qopenmp -c foo.c   { -xopenmp on Linux }
foo.c
foo.c(10) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
foo.c(7) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.
OpenMP™ + Vectorization
Combined speedup
Order of use might be important
–Parallelization overhead
–Vectorize inner loops
–Parallelize outer loops
Supported by Intel® Compilers
Intel® Compilers (Intel SW Development Tools)
–Leading-edge compiler technologies
–Compatible with leading industry-standard compilers
–Processor-optimized code generation
–Support a single source code across Intel processor families
Make performance a feature of your applications today: stay competitive.
To be continued…