Hyper-Threading Intel Compilers Andrey Naraikin Senior Software Engineer Software Products Division Intel Nizhny Novgorod Lab November 29, 2002.


Agenda
Hyper-Threading Technology Overview
Introduction: Intel SW Development Tools
– Motivation
– Challenges
– Intel SW Tools
Intel Compilers Overview
– Technologies supported
– SPEC and other benchmarks
– Some features supported by Intel Compilers

Today’s Processors (Hyper-Threading Overview)
Single Processor Systems
– Instruction Level Parallelism (ILP)
– Performance improved with more CPU resources
Multiprocessor Systems
– Thread Level Parallelism (TLP)
– Performance improved by adding more CPUs
Hyper-Threading technology brings TLP to a single-processor system.

Today’s Software (Hyper-Threading Overview)
Sequential tasks: Open File, Edit, Spell Check
Parallel tasks: Open DB’s, Address Book, InBox, Meeting

Multi-Processing (Hyper-Threading Overview)
Multi-tasking workload + more processor resources => improved MT performance
Run parallel tasks using multiple processors (CPU 1, CPU 2, CPU 3)

Hyper-Threading: Quick View

Dual-Core Architecture (Hyper-Threading Technology)
(Diagram: a Hyper-Threading processor has one set of Processor Execution Resources shared by two Architecture States; a multiprocessor has separate Processor Execution Resources, each with its own Architecture State)
AS = Architecture State (eax, ebx, control registers, etc.), xAPIC
Hyper-Threading Technology looks like two processors to software

Hyper-Threading Architecture Overview
Pentium, VTune and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.

Hyper-Threading Architecture Details

Resource Utilization (Hyper-Threading Overview)
(Diagram: execution-unit occupancy over time, in processor cycles, for multiprocessing vs. multiprocessing with Hyper-Threading; each box represents a processor execution unit)

Performance Benefit (Hyper-Threading Technology)
(Chart: performance benefit on nine compute-intensive workloads)

Code  Description
A1    Engineering
A2    Genetics
A3    Chemistry
A4    Engineering
A5    Weather
A6    Genetics
A7    CFD
A8    FEA
A9    FEA

“Hyper-Threading Technology: Impact on Compute-Intensive Workloads,” Intel Technology Journal, Vol. 6, 2002.

Key Point
Hyper-Threading Technology gives better utilization of processor resources
Hyper-Threading Technology gives more computing power for multithreaded applications

Collateral
Web Sites
– – –
Documentation and application notes
– IA-32 Intel® Architecture Software Developer’s Manual
– Intel Pentium® 4 and Intel Xeon™ Processor Optimization Manual
– Intel App Note AP-485, “Intel Processor Identification and the CPUID Instruction”
– Intel App Note AP-949, “Using Spin-Loops on Intel Pentium 4 Processor and Intel Xeon Processor”
– Intel App Note, “Detecting Support for Jackson Technology Enabled Processors”

Collateral (Cont’d)
Intel Technology Journal –
Intel Threading Tools –
OpenMP –
HT Overview –

Performance Advantage: Optimization Path (Intel SW Development Tools)
(Figure: cumulative speedups from 1x to 15x faster, with analysis by VTune™ at each step)
1x – Standard compiler
4x – Intel Compiler, little or no code change
7x – Intel Compiler + Performance Libraries (IPP or MKL), minor code change
9x–13x – Intel Compiler + OpenMP threading, minor code change (1 line)
15x faster overall

Sunset Simulation: Optimized Performance – 15x faster (Intel SW Development Tools)

Intel® Compilers
C, C++ and Fortran95
– Available on Windows* and Linux*
– Available for 32-bit and 64-bit platforms
Utilization of latest processor/platform features
– Optimizations for NetBurst™ architecture (Pentium® 4 and Xeon™ processors)
– Optimizations for Itanium® architecture
Seamless integration into the Windows* IDE and the Linux* environment
Source- and binary-compatible with the Microsoft* compiler; mostly source-compatible with GNU gcc
(Intel SW Development Tools – Compilers)

Benchmarks: Intel® Compilers 6.0 for Windows*
SPECint_base2000: leading C++ compiler = 703; Intel® C++ Compiler 6.0 = 825 – faster integer performance
SPECfp_base2000 (geomean of Fortran): CVF* 6.6 = 686; Intel® Fortran Compiler = 881 – faster floating-point performance
Configuration info: Intel® Pentium® 4 processor, 2.4 GHz; Intel® D850MD (“Medford”) motherboard, 850 chipset; 256 MB memory; Windows* XP Professional Edition (build 2600); nVidia* GeForce 3 graphics
Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. Users’ results are dependent upon the application characteristics (loopy vs. flat), mix of C and C++, and other factors.

Intel® C++ Compiler 6.0 for Linux*: PovRay Image Rendering Time
(Chart: rendering time in seconds and % improvement, 60%–160%, for gcc 2.96 with -O2 and fast-math optimization vs. Intel® 6.0 comparable optimization vs. Intel® 6.0 maximum optimization)
Configuration info: Intel® Pentium® 4 processor, 2.0 GHz, 256 MB memory, nVidia* GeForce 2 graphics card, Linux* 2.4.7, PovRay 3.1G

Special Performance Features
Auto-Vectorization for NetBurst™ architecture
Software Pipelining for EPIC architecture
Auto-Parallelization and OpenMP-based parallelization
– for Hyper-Threading and multi-processor systems
Data Pre-Fetching
Profile-Guided Optimization (PGO)
Inter-Procedural Optimization (IPO)
CPU Dispatch
– Establishes code path at runtime depending on the actual processor type
– Allows a single binary with optimal performance across processor families

Techniques Overview
Exploit parallelism to speed up applications
Vectorization
– Supported by programming languages and compilers
– Motivated by modern architectures: superscalarity, deeply pipelined cores; SIMD; software pipelining on the Itanium™ architecture
Parallelization
– OpenMP™ directives for shared-memory multiprocessor systems
– MPI computations for clusters
(Features by Intel Compilers)

Intel Processors and Vectorization

Type of processor                                      Vectorization features supported
Pentium® with MMX™ technology, Pentium® II processors  Integer types, 64 bits
Pentium® III processor                                 Streaming SIMD Extensions (SSE): single-precision floating point
Pentium® 4 processor                                   Streaming SIMD Extensions 2 (SSE2): double-precision floating point; integer types, 128 bits

(Features by Intel Compilers – Vectorization)

Automatic Vectorization
Compiler automatically transforms sequential code for SIMD execution (icl -Qx[MKW]); vectorized math calls go through the run-time library to HW SIMD instructions:

for (i = 0; i < n; i++) {
    a[i] = a[i] + b[i];
    a[i] = sin(a[i]);
}

becomes, in vector notation (VL = vector length):

for (i = 0; i < n; i = i + VL) {
    a(i : i+VL-1) = a(i : i+VL-1) + b(i : i+VL-1);
    a(i : i+VL-1) = _vmlSin(a(i : i+VL-1));
}

Vectorization Example (icl -QxW)

double a[N], b[N];
int i;
for (i = 0; i < N; i++)
    a[i] = a[i] + b[i];

(Diagram: scalar execution adds elements of a and b one pair at a time; vector execution adds several pairs per instruction.)

Reduction Example

float a[N], x;
int i;
x = 0.0;
for (i = 0; i < N; i++)
    x += a[i];

(Diagram: the loop kernel accumulates partial sums; a postlude combines them into x.)

Parallel Program Development
(Diagram: approaches ordered by ease of use/maintenance)
– Explicit threading using operating system calls
– With industry-standard OpenMP* directives
– Automatically, using the compiler
(Features by Intel Compilers – Parallelization)

Autoparallelization

float a[N], b[N], c[N];
int i;
for (i = 0; i < N; i++)
    c[i] = a[i]*b[i];

icl -Qparallel foo.c   { -parallel on Linux }
foo.c
foo.c(7) : (col. 2) remark: LOOP WAS AUTO-PARALLELIZED.
./foo.exe -- executable detects and uses the number of processors
-Qpar_report[n] - get helpful messages from the compiler

OpenMP™ Directives
OpenMP* standard
– A set of directives to enable the writing of multithreaded programs
Shared-memory parallelism expressed at the programming-language level
– Portability
– Performance
Supported by Intel® Compilers
– Windows*, Linux*
– IA-32 and Itanium™ architectures

Simple Directives

foo(float *a, float *b, float *c)
{
    int i;
    #pragma parallel
    for (i = 0; i < N; i++) {
        *c++ = (*a++) * bar(b++);
    }
}

Pointers and procedure calls with escaped pointers prevent the analysis needed for autoparallelization – use simple directives instead.

OpenMP* Directives

void foo()
{
    int a[1000], b[1000], c[1000], x[1000], i, NUM;
    /* parallel region */
    #pragma omp parallel private(NUM) shared(x, a, b, c)
    {
        NUM = omp_get_num_threads();
        #pragma omp for private(i)  /* work-sharing for loop */
        for (i = 0; i < 1000; i++) {
            x[i] = bar(a[i], b[i], c[i], NUM);  /* assume bar has no side-effects */
        }
    }
}

icl -Qopenmp -c foo.c   { -openmp on Linux }
foo.c
foo.c(10) : (col. 1) remark: OpenMP DEFINED LOOP WAS PARALLELIZED.
foo.c(7) : (col. 1) remark: OpenMP DEFINED REGION WAS PARALLELIZED.

OpenMP™ + Vectorization
Combined speedup
Order of use might be important
– Parallelization overhead
– Vectorize inner loops
– Parallelize outer loops
Supported by Intel® Compilers

Intel® Compilers
Leading-edge compiler technologies
Compatible with leading industry-standard compilers
Processor-optimized code generation
Single source code supported across Intel processor families
Make performance a feature of your applications today – stay competitive
(Intel SW Development Tools)

Collateral
Intel Technology Journal –
Intel Threading Tools –
OpenMP –
HT Overview –

To be continued…