Overview of Intel® Core 2 Architecture and Software Development Tools June 2009.

Overview of Architecture & Tools
We will discuss:
 What lecture materials are available
 What labs are available
 What target courses could be impacted
 Some high-level discussion of the underlying technology

Objectives
After completing this module, you will:
 Be aware of and have access to several hours' worth of multi-core topics, including architecture, compiler technology, profiling technology, OpenMP, and cache effects
 Be able to create exercises on how to avoid common threading hazards associated with some multi-core systems – such as poor cache utilization, false sharing, and threading load imbalance
 Be able to create exercises on how to use selected compiler directives & switches to improve behavior on each core
 Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load-imbalance issues, poor cache reuse, and false-sharing issues

Agenda Multi-core Motivation Tools Overview Taking advantage of Multi-core Taking advantage of parallelism within each core (SSEx) Avoiding Memory/Cache effects

Why is the Industry moving to Multi-core? To increase performance and reduce power consumption: it is much more efficient to run several cores at a lower frequency than one single core at a much higher frequency.

Power and Frequency
[chart: power (W) vs. frequency (GHz) curve for a single-core architecture]
A modest drop in frequency yields a large drop in power; the lower frequency allows headroom for a 2nd core.

Agenda Multi-core Motivation Tools Overview Taking advantage of Multi-core Taking advantage of parallelism within each core (SSEx) Avoiding Memory/Cache effects

Processor-independent optimizations
/Od — Disables optimizations
/O1 — Optimizes for binary size and for speed: server code
/O2 — Optimizes for speed (default); vectorization on Intel 64
/O3 — Optimizes for the data cache: loopy floating-point code
/Zi — Creates symbols for debugging
/Ob0 — Turns off inlining, which can sometimes help the analysis tools do a more thorough job

AutoVectorization optimizations
/QaxSSE2 — Intel Pentium 4 and compatible Intel processors
/QaxSSE3 — Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support
/QaxSSE3_ATOM — Can generate MOVBE instructions for Intel processors and can optimize for the Intel(R) Atom(TM) processor and Intel(R) Centrino(R) Atom(TM) processor technology
/QaxSSSE3 — Intel(R) Core(TM)2 processor family with SSSE3
/QaxSSE4.1 — Intel(R) 45nm Hi-k next-generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions
/QaxSSE4.2 — Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors; can also generate Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family
Intel has a long history of providing auto-vectorization switches along with support for new processor instructions, and backward support for older instructions is maintained. Developers should keep an eye on new developments in order to leverage the power of the latest processors.

More Advanced optimizations
/Qipo — Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo); in other words, code generation in module A can be improved by what is happening in module B. May enable other optimizations such as auto-parallelization and auto-vectorization.
/Qparallel — Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel
/Qopenmp — Enables the compiler to generate multi-threaded code based on the OpenMP* directives

Lab 1 - AutoParallelization Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature. Follow the VectorSum activity in the student lab doc. Try auto-parallel compilation on the lab called VectorSum. Extra credit: parallelize manually and see whether you can beat the auto-parallel option – see the OpenMP section for constructs to try.

Parallel Studio to find where to parallelize Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code. Parallel Amplifier specifically is used to find hotspot information – where in your code the application spends most of its time. Parallel Amplifier does not require instrumenting your code in order to find hotspots; compiling with symbol information (/Zi) is a good idea. Compiling with /Ob0 turns off inlining and sometimes gives a more thorough analysis in Parallel Studio.

Parallel Amplifier Hotspots

What does hotspot analysis show?

What about drilling down?

The call stack The call stack shows the callee/caller relationship among functions in the code

Found potential parallelism

Lab 2 – Mandelbrot Hotspot Analysis Objective: Use sampling to find some parallelism in the Mandelbrot application Follow the Mandelbrot activity called Mandelbrot Sampling in the student lab doc Identify candidate loops that could be parallelized

Agenda Multi-core Motivation Tools Overview Taking advantage of Multi-core  High level overview – Intel® Core Architecture Taking advantage of parallelism within each core (SSEx) Avoiding Memory/Cache effects

Intel® Core 2 Architecture — snapshot in time during Penryn, Yorkfield, Harpertown
Mobile Platform Optimized: 1-4 execution cores; 3/6 MB L2 cache sizes; 64-byte L2 cache line; 64-bit
Desktop Platform Optimized: 2-4 execution cores; 2x3, 2x6 MB L2 cache sizes; 64-byte L2 cache line; 64-bit
Server Platform Optimized: 4 execution cores; 2x6 MB L2 caches; 64-byte L2 cache line; DP/MP support; 64-bit
Software developers should know the number of cores, the cache line size, and the cache sizes to tackle the Cache Effects materials.

Memory Hierarchy (approximate access latencies)
L1 Cache — ~1s of cycles
L2 Cache — ~1s-10 cycles
Main Memory — ~100s of cycles
Magnetic Disk — ~1000s of cycles

High Level Architectural view — Intel® Core™ Microarchitecture memory sub-system
[diagram: Intel Core 2 Duo and Intel Core 2 Quad processors, connected to memory via 64-byte cache lines]
Legend: A = Architectural State; E = Execution Engine & Interrupt; C = 2nd Level Cache; B = Bus Interface
The dual-core part has a shared cache; the quad-core part has both shared and separated caches.

With a separated cache
[diagram: CPU1 and CPU2 with separate L2 caches, connected to memory over the Front Side Bus (FSB)]
A cache line shared between the cores must be shipped between the separate L2 caches – costing roughly half of an access to memory. (Intel® Core™ Microarchitecture – Memory Sub-system)

Advantages of Shared Cache – using Advanced Smart Cache® Technology
[diagram: CPU1 and CPU2 sharing one L2 cache, connected to memory over the Front Side Bus (FSB)]
The L2 is shared, so there is no need to ship the cache line between cores. (Intel® Core™ Microarchitecture – Memory Sub-system)

False Sharing
A performance issue in programs where cores write to different memory addresses BUT within the same cache line. Known as ping-ponging – the cache line is shipped back and forth between cores.
[diagram: Core 0 writes X[0] while Core 1 writes X[1]; each write invalidates the other core's copy of the shared line]
False sharing is not an issue in a shared cache; it is an issue in separated caches.

Agenda Multi-core Motivation Tools Overview Taking advantage of Multi-core Taking advantage of parallelism within each core (SSEx) Avoiding Memory/Cache effects

Super Scalar Execution
Multiple execution units (FP, SIMD, INT) allow SIMD parallelism; many instructions can be retired in a single clock cycle; multiple operations execute within a single core at the same time.

History of SSE Instructions
Intel SSE — 70 instructions: single-precision vectors, streaming operations
Intel SSE2 — 144 instructions: double-precision vectors; 8/16/32- and 64/128-bit vector integer
Intel SSE3 — 13 instructions: complex data
Intel SSSE3 — 32 instructions: decode
Intel SSE4.1 — 47 instructions: video accelerators, graphics building blocks, advanced vector instructions
Will be continued by Intel SSE4.2 (XML processing, end 2008). See - instructions-paper.pdf
A long history of new instructions; most require using packing & unpacking instructions.

SSE Data Types & Speedup Potential
A 128-bit XMM register holds: 4x floats (SSE); 2x doubles, 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, or 1x 128-bit integer (SSE-2 and later)
The potential speedup (in the targeted loop) is roughly the packing factor – e.g. for floats, speedup ~ 4X.

Goal of SSE(x)
Scalar processing (traditional mode): one instruction produces one result — X + Y = Z
SIMD processing (with SSE/2/3/4): one instruction produces multiple results — {x3,x2,x1,x0} + {y3,y2,y1,y0} = {x3+y3, x2+y2, x1+y1, x0+y0}
Uses the full width of the XMM registers; many functional units; a choice of many instructions. But not all loops can be vectorized, and loops containing most function calls can't be vectorized.

Lab 3 – IPO assisted Vectorization Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop containing a function call. Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line. To add switches to the make environment, use nmake all CF="/QxSSE3" as an example.

Agenda Multi-core Motivation Tools Overview Taking advantage of Multi-core Taking advantage of parallelism within each core (SSEx) Avoiding Memory/Cache effects

Cache effects Cache effects can sometimes impact the speed of an application by as much as 10X or even 100X. To take advantage of the cache hierarchy in your machine, you should use and re-use data already in cache as much as possible. Avoid accessing memory in non-contiguous memory locations – especially in loops. You may need to consider a loop interchange to access data in a more efficient manner.

Loop Interchange
Very important for the vectorizer!

Before (the inner k loop strides through b):
for(i=0;i<NUM;i++)
  for(j=0;j<NUM;j++)
    for(k=0;k<NUM;k++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

After interchange (the inner j loop is the fast, unit-stride index):
for(i=0;i<NUM;i++)
  for(k=0;k<NUM;k++)
    for(j=0;j<NUM;j++)
      c[i][j] = c[i][j] + a[i][k] * b[k][j];

Non-unit-stride skipping through memory can cause cache thrashing – particularly for array sizes of 2^n.

Unit Stride Memory Access (C/C++)
[diagram: matrices a and b laid out row-major in memory]
For b, the fastest-incremented index j gives consecutive memory accesses; for a, the fastest loop index k likewise walks consecutive memory.

Poor Cache Utilization – with Eggs
The carton represents a cache line; the refrigerator represents main memory; the table represents the cache.
A request for an egg not already on the table brings a whole new carton of eggs from the refrigerator, but the user fries only one egg from each carton. When the table fills up, old cartons are evicted and most of their eggs are wasted.
[slides: user requests one specific egg; a 2nd specific egg; a 3rd egg – a carton is evicted; pan ready to fry eggs]

Good Cache Utilization – with Eggs
A request for one egg brings a new carton of eggs from the refrigerator; the user then specifically requests eggs from cartons already on the table, frying all the eggs in a carton before an egg from the next carton is requested, and eventually asks for all the eggs.
[slides: user requests eggs 1-8, then eggs 9-16]
Carton eviction doesn't hurt here, because all the eggs in the cartons on the table have already been fried – just as the previous user had used all the eggs on the table.

Lab 4 – Matrix Multiply Cache Effects Objective: Explore the impact of poor cache utilization on performance with Parallel Studio, and explore how to manipulate loops to achieve significantly better cache utilization & performance

BACKUP