Published by Aubrie Underwood
1
Overview of Intel® Core 2 Architecture and Software Development Tools June 2009
2
Overview of Architecture & Tools
We will discuss:
- What lecture materials are available
- What labs are available
- What target courses could be impacted
- Some high-level discussion of underlying technology
3
Objectives
After completing this module, you will:
- Be aware of and have access to several hours' worth of multi-core (MC) topics, including Architecture, Compiler Technology, Profiling Technology, OpenMP, and Cache Effects
- Be able to create exercises on how to avoid coding common threading hazards associated with some MC systems, such as poor cache utilization, false sharing, and threading load imbalance
- Be able to create exercises on how to use selected compiler directives and switches to improve behavior on each core
- Be able to create exercises on how to take advantage of the VTune analyzer to quickly identify load imbalance issues, poor cache reuse, and false sharing issues
4
Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects
5
Why is the Industry moving to Multi-core?
- In order to increase performance and reduce power consumption
- It is much more efficient to run several cores at a lower frequency than one single core at a much faster frequency
6
Power and Frequency
[Figure: Power vs. Frequency Curve for Single Core Architecture; x-axis: Frequency (GHz), 0 to 3.4; y-axis: Power (W), 9 to 359]
- Dropping frequency = large drop in power
- Lower frequency allows headroom for a 2nd core
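The shape of the curve can be motivated with a common first-order model. The model is an assumption here and is not stated on the slide: dynamic power scales as capacitance times voltage squared times frequency, and supply voltage scales roughly linearly with frequency, so power grows roughly with the cube of frequency. The 0.8 scaling factor below is a hypothetical value chosen for illustration.

```latex
\begin{align*}
P_{\text{dyn}} &\approx C\,V^{2} f, \qquad V \propto f \;\Rightarrow\; P_{\text{dyn}} \propto f^{3} \\
\frac{P(0.8 f_0)}{P(f_0)} &\approx 0.8^{3} = 0.512 \\
\frac{P_{\text{2 cores}}(0.8 f_0)}{P_{\text{1 core}}(f_0)} &\approx 2 \times 0.512 = 1.024
\end{align*}
```

Under that model, two cores at 80% of the original frequency draw about the same power as one core at full frequency, while offering up to 2 x 0.8 = 1.6 times the throughput if the work parallelizes perfectly.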
7
Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects
8
Processor-independent optimizations
/Od    Disables optimizations
/O1    Optimizes for binary size and for speed: server code
/O2    Optimizes for speed (default): vectorization on Intel 64
/O3    Optimizes for data cache: loopy floating-point code
/Zi    Creates symbols for debugging
/Ob0   Turns off inlining, which can sometimes help the analysis tools do a more thorough job
9
AutoVectorization optimizations
/QaxSSE2        Intel Pentium 4 and compatible Intel processors
/QaxSSE3        Intel(R) Core(TM) processor family with Streaming SIMD Extensions 3 (SSE3) instruction support
/QaxSSE3_ATOM   Can generate MOVBE instructions for Intel processors and can optimize for the Intel(R) Atom(TM) processor and Intel(R) Centrino(R) Atom(TM) Processor Technology, with SSE3 instruction support
/QaxSSSE3       Intel(R) Core(TM)2 processor family with SSSE3
/QaxSSE4.1      Intel(R) 45nm Hi-k next generation Intel Core(TM) microarchitecture with support for SSE4 Vectorizing Compiler and Media Accelerator instructions
/QaxSSE4.2      Can generate Intel(R) SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel(R) Core(TM) i7 processors. Can also generate Intel(R) SSE4 Vectorizing Compiler and Media Accelerator, Intel(R) SSSE3, SSE3, SSE2, and SSE instructions, and can optimize for the Intel(R) Core(TM) processor family
- Intel has a long history of providing auto-vectorization switches along with support for new processor instructions; backward support for older instructions is maintained
- Developers should keep an eye on new developments in order to leverage the power of the latest processors
10
More Advanced optimizations
/Qipo        Interprocedural optimization performs a static, topological analysis of your application. With /Qipo (-ipo), the analysis spans all of your source files built with /Qipo (-ipo). In other words, code generation in module A can be improved by what is happening in module B. May enable other optimizations such as auto-parallelization and auto-vectorization
/Qparallel   Enables the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel
/Qopenmp     Enables the compiler to generate multi-threaded code based on the OpenMP* directives
11
Lab 1 - AutoParallelization
Objective: Use auto-parallelization on a simple code to gain experience with the compiler's auto-parallelization feature
- Follow the VectorSum activity in the student lab doc
- Try auto-parallel compilation on the lab called VectorSum
- Extra credit: parallelize manually and see whether you can beat the auto-parallel option; see the OpenMP section for constructs to try
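For the extra-credit step, a manual parallelization might look like the following minimal sketch. The function name `vector_sum` and its signature are hypothetical; the lab's actual VectorSum source is not reproduced in these slides. The sketch assumes the kernel is a plain summation loop and uses an OpenMP `parallel for` with a `reduction` clause.

```c
#include <stddef.h>

/* Hypothetical stand-in for the lab's VectorSum kernel: adds up n doubles.
 * The reduction clause gives each thread a private partial sum and combines
 * the partial sums at the end, so threads never race on `total`.
 * If OpenMP is not enabled at compile time, the pragma is ignored and the
 * loop simply runs serially. */
double vector_sum(const double *v, size_t n)
{
    double total = 0.0;
    long i;  /* signed loop index, accepted by older OpenMP implementations */

    #pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += v[i];
    }
    return total;
}
```

With the Intel compiler on Windows, the /Qopenmp switch listed earlier enables the directive; other compilers have their own equivalent switch.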
12
Parallel Studio to find where to parallelize
- Parallel Studio will be used in several labs to find appropriate locations to add parallelism to the code
- Parallel Amplifier specifically is used to find hotspot information: where in your code does the application spend most of its time?
- Parallel Amplifier does not require instrumenting your code in order to find hotspots, but compiling with symbol information (/Zi) is a good idea
- Compiling with /Ob0 turns off inlining and sometimes seems to give a more thorough analysis in Parallel Studio
13
Parallel Amplifier Hotspots
14
What does hotspot analysis show?
15
What about drilling down?
16
The call stack
The call stack shows the callee/caller relationship among functions in the code
17
Found potential parallelism
18
Lab 2 – Mandelbrot Hotspot Analysis
Objective: Use sampling to find some parallelism in the Mandelbrot application
- Follow the Mandelbrot activity called Mandelbrot Sampling in the student lab doc
- Identify candidate loops that could be parallelized
19
Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
  - High level overview – Intel® Core Architecture
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects
20
Intel® Core 2 Architecture
Snapshot in time during Penryn, Yorkfield, Harpertown
Mobile Platform Optimized
- 1-4 execution cores
- 3/6 MB L2 cache sizes
- 64-byte L2 cache line
- 64-bit
Desktop Platform Optimized
- 2-4 execution cores
- 2x3, 2x6 MB L2 cache sizes
- 64-byte L2 cache line
- 64-bit
Server Platform Optimized
- 4 execution cores
- 2x6 MB L2 caches
- 64-byte L2 cache line
- DP/MP support
- 64-bit
[Diagrams: 2-core and 4-core configurations for each platform with their L2 cache sizes. **Feature names TBD]
Software developers should know the number of cores, the cache line size, and the cache sizes to tackle the Cache Effects materials
21
Memory Hierarchy
CPU
L1 Cache        ~1's of cycles
L2 Cache        ~1's to 10 cycles
Main Memory     ~100's of cycles
Magnetic Disk   ~1000's of cycles
22
Intel® Core™ Microarchitecture – Memory Sub-system
High Level Architectural view
[Diagram: Intel Core 2 Duo Processor with two A/E pairs, one C, and one B; Intel Core 2 Quad Processor with four A/E pairs, two caches C1 and C2, and two B; both connect to Memory with a 64B cache line]
Legend: A = Architectural State; E = Execution Engine & Interrupt; C = 2nd Level Cache; B = Bus Interface
- Dual core has shared cache
- Quad core has both shared and separated cache
23
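Intel® Core™ Microarchitecture – Memory Sub-system
With a separated cache
[Diagram: CPU1 and CPU2 with separate L2 caches, connected to Memory over the Front Side Bus (FSB); a cache line is shipped between the caches]
- Shipping an L2 cache line costs ~half an access to memory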
24
Intel® Core™ Microarchitecture – Memory Sub-system
Advantages of Shared Cache – using Advanced Smart Cache® Technology
[Diagram: CPU1 and CPU2 sharing one L2 cache, connected to Memory over the Front Side Bus (FSB)]
- L2 is shared: no need to ship the cache line
25
False Sharing
- Performance issue in programs where cores may write to different memory addresses BUT in the same cache line
- Known as ping-ponging: the cache line is shipped between cores
[Diagram: timeline of Core 0 and Core 1 alternately writing X[0] and X[1] (X[0] = 1, X[1] = 1, X[0] = 0, X[1] = 0, X[0] = 2); each write ships the cache line holding both elements to the other core]
- False sharing is not an issue in a shared cache
- It is an issue in a separated cache
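One common way to avoid the ping-ponging is to pad per-thread data so that no two threads write to the same cache line. The sketch below is illustrative only: the names `padded_counter`, `count_positive`, and `NUM_COUNTERS` are hypothetical, and the 64-byte line size is the figure quoted in the architecture slides.

```c
#include <stddef.h>

#define CACHE_LINE_BYTES 64   /* cache line size quoted in the slides */
#define NUM_COUNTERS 4        /* hypothetical: one counter per thread */

/* Each counter occupies a full cache line's worth of bytes, so the `value`
 * fields of neighboring counters are 64 bytes apart and can never fall in
 * the same cache line. */
struct padded_counter {
    long value;
    char pad[CACHE_LINE_BYTES - sizeof(long)];
};

/* Counts the positive elements of data[] in NUM_COUNTERS interleaved slices.
 * Slice t (elements t, t + NUM_COUNTERS, ...) is handled by one loop
 * iteration, which writes only counters[t]. */
void count_positive(const int *data, size_t n,
                    struct padded_counter counters[NUM_COUNTERS])
{
    int t;

    #pragma omp parallel for
    for (t = 0; t < NUM_COUNTERS; t++) {
        size_t i;
        counters[t].value = 0;
        for (i = (size_t)t; i < n; i += NUM_COUNTERS) {
            if (data[i] > 0) {
                counters[t].value++;   /* touches only this slice's line */
            }
        }
    }
}
```

Replacing `struct padded_counter` with a plain `long counters[NUM_COUNTERS]` would put all four counters in one 64-byte line, which is the pattern the slide describes.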
26
Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects
27
Super Scalar Execution
[Diagram: execution units labeled FP, SIMD, INT]
- Multiple execution units allow SIMD parallelism
- Many instructions can be retired in a clock cycle
- Multiple operations executed within a single core at the same time
28
History of SSE Instructions
Intel SSE (1999)      70 instructions: single-precision vectors, streaming operations
Intel SSE2 (2000)     144 instructions: double-precision vectors, 8/16/32/64/128-bit vector integer
Intel SSE3 (2004)     13 instructions: complex data
Intel SSSE3 (2006)    32 instructions: decode
Intel SSE4.1 (2007)   47 instructions: video accelerators, graphics building blocks, advanced vector instructions
Will be continued by Intel SSE4.2 (XML processing, end 2008)
See http://download.intel.com/technology/architecture/new-instructions-paper.pdf
- Long history of new instructions
- Most require using packing & unpacking instructions
29
SSE Data Types & Speedup Potential
SSE: 4x floats
SSE-2, SSE-3, SSE-4: 2x doubles, 16x bytes, 8x 16-bit shorts, 4x 32-bit integers, 2x 64-bit integers, 1x 128-bit integer
Potential speedup (in the targeted loop) is roughly the same as the amount of packing, i.e. for floats, speedup ~ 4X
30
Goal of SSE(x)
Scalar processing
- Traditional mode
- One instruction produces one result: X + Y = X + Y
SIMD processing with SSE(2,3,4)
- One instruction produces multiple results: (x3, x2, x1, x0) + (y3, y2, y1, y0) = (x3+y3, x2+y2, x1+y1, x0+y0)
- Uses full width of XMM registers
- Many functional units
- Choice of many instructions
- Not all loops can be vectorized
- Can't vectorize most function calls
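The packed addition on this slide can be written by hand with SSE intrinsics, as in the minimal sketch below. The function name `add_floats` is hypothetical. The intrinsics `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` come from `<xmmintrin.h>`; the preprocessor guard is an assumption added so the code still compiles, as plain scalar code, on targets without SSE.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

/* Computes out[i] = x[i] + y[i] for n floats. */
void add_floats(const float *x, const float *y, float *out, size_t n)
{
    size_t i = 0;
#if HAVE_SSE
    /* SIMD part: one _mm_add_ps adds four packed floats at once. */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* unaligned load of x[i..i+3] */
        __m128 vy = _mm_loadu_ps(y + i);   /* unaligned load of y[i..i+3] */
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
#endif
    /* Scalar remainder (or the whole loop when SSE is unavailable). */
    for (; i < n; i++) {
        out[i] = x[i] + y[i];
    }
}
```

In practice the auto-vectorizer is expected to generate equivalent code from the plain scalar loop, which is why the labs focus on compiler switches rather than hand-written intrinsics.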
31
Lab 3 – IPO assisted Vectorization
Objective: Explore how inlining a function can dramatically improve performance by allowing vectorization of a loop that contains a function call
- Open the SquareChargeCVectorizationIPO folder and use "nmake all" to build the project from the command line
- To add switches to the make environment, use, for example: nmake all CF="/QxSSE3"
32
Agenda
- Multi-core Motivation
- Tools Overview
- Taking advantage of Multi-core
- Taking advantage of parallelism within each core (SSEx)
- Avoiding Memory/Cache effects
33
Cache effects
- Cache effects can sometimes impact the speed of an application by as much as 10X or even 100X
- To take advantage of the cache hierarchy in your machine, you should use and re-use data already in cache as much as possible
- Avoid accessing memory in non-contiguous memory locations, especially in loops
- You may need to consider a loop interchange to access data in a more efficient manner
34
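Loop Interchange
Very important for the vectorizer!
Before (fast loop index is k, so b[k][j] is accessed with non-unit stride):
for (i = 0; i < NUM; i++)
    for (j = 0; j < NUM; j++)
        for (k = 0; k < NUM; k++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
After (fast loop index is j, so b[k][j] and c[i][j] are accessed with unit stride):
for (i = 0; i < NUM; i++)
    for (k = 0; k < NUM; k++)
        for (j = 0; j < NUM; j++)
            c[i][j] = c[i][j] + a[i][k] * b[k][j];
Non-unit-stride skipping in memory can cause cache thrashing, particularly for array sizes of 2^n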
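The two loop orderings on this slide can be wrapped in functions and compared directly, as in the sketch below. The function names `matmul_ijk` and `matmul_ikj` and the value of `NUM` are hypothetical choices for illustration; both functions accumulate into `c`, so the caller is assumed to zero `c` first.

```c
#define NUM 64   /* hypothetical matrix dimension; the lab's value is not given */

/* Original ordering (i, j, k): the innermost loop walks b[k][j] down a
 * column, jumping NUM doubles between consecutive accesses. */
void matmul_ijk(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM])
{
    int i, j, k;
    for (i = 0; i < NUM; i++)
        for (j = 0; j < NUM; j++)
            for (k = 0; k < NUM; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* Interchanged ordering (i, k, j): the innermost loop walks b[k][j] and
 * c[i][j] along a row, so consecutive accesses are adjacent in memory. */
void matmul_ikj(double a[NUM][NUM], double b[NUM][NUM], double c[NUM][NUM])
{
    int i, j, k;
    for (i = 0; i < NUM; i++)
        for (k = 0; k < NUM; k++)
            for (j = 0; j < NUM; j++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}
```

For each element c[i][j], both versions add the same products in the same order of k, so they should produce identical results; only the memory access pattern differs.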
35
Unit Stride Memory Access (C/C++)
[Diagram: matrix b, rows b00 ... b0N-1 through bN-10 ... bN-1N-1, with row k highlighted; j is the fastest incremented index, giving consecutive memory access along the row]
[Diagram: matrix a, rows a00 ... a0N-1 through aN-10 ... aN-1N-1, with row i highlighted; k is the next fastest loop index, giving consecutive memory access along the row]
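The reason the last index gives consecutive access is that C stores two-dimensional arrays in row-major order. The short sketch below makes the layout concrete by computing how far an element lies from the start of the array; the helper name `element_offset` and the row length `N` are hypothetical.

```c
#include <stddef.h>

#define N 8   /* hypothetical row length, for illustration only */

/* Returns the distance, in elements, from m[0][0] to m[row][col].
 * In a row-major array with N columns this is row * N + col:
 * stepping the column index moves 1 element, stepping the row index
 * moves a full row of N elements. */
size_t element_offset(double m[][N], size_t row, size_t col)
{
    const char *base = (const char *)m;
    const char *elem = (const char *)&m[row][col];
    return (size_t)(elem - base) / sizeof(double);
}
```

A loop whose innermost index is the column index therefore touches adjacent memory, while one whose innermost index is the row index jumps N elements each iteration.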
36
Poor Cache Utilization - with Eggs
- Carton represents cache line
- Refrigerator represents main memory
- Table represents cache
- When the table is filled up, old cartons are evicted and most eggs are wasted
- A request for an egg not already on the table brings a new carton of eggs from the refrigerator, but the user only fries one egg from each carton. When the table fills up, an old carton is evicted
[Animation: refrigerator, table, and pan ready to fry eggs. User requests one specific egg; user requests a 2nd specific egg; user requests a 3rd egg and a carton is evicted]
37
Good Cache Utilization - with Eggs
- A request for one egg brings a new carton of eggs from the refrigerator
- User specifically requests eggs from a carton already on the table
- User fries all eggs in a carton before an egg from the next carton is requested
- Carton eviction doesn't hurt us because we've already fried all the eggs in the cartons on the table, just like the previous user
[Animation: refrigerator and table. Previous user had used all eggs on the table; user requests eggs 1-8; user requests eggs 9-16; user eventually asks for all the eggs]
38
Lab 4 – Matrix Multiply Cache Effects
Objective: Explore the impact of poor cache utilization on performance with Parallel Studio, and explore how to manipulate loops to achieve significantly better cache utilization and performance
40
BACKUP