Investigating Architectural Balance using Adaptable Probes

Overview  Gap between peak and sustained performance well known problem in HPC  Generally attributed to memory system, but difficult to identify bottleneck  Application benchmarks too complex to isolate specific architectural features  Microbenchmarks too narrow to predict actual code performance  We use an adaptable probe to isolate performance limitations: Give application and hardware developers possible optimizations  Sqmat uses 4 paramters to captures behavior broad range of scientific code: Working set size(N), Computational Intensity(M), Indirection(S), Irregularity(S)  Architectures examined: Intel Itanium2, AMD Opteron, IBM Power3, IBM Power4

Sqmat overview
 Sqmat is based on matrix multiplication and linear solvers
 A Java program is used to generate optimally unrolled C code
 Square a set of matrices M times (use enough matrices to exceed cache); M controls computational intensity (CI) - the ratio between flops and memory accesses
 Each matrix is of size NxN; N controls working set size: 2N^2 registers required per matrix
 Direct storage: Sqmat's matrix entries stored contiguously in memory
 Indirect: entries accessed indirectly through a pointer; parameter S controls the degree of indirection: S matrix entries stored contiguously, then a random jump in memory (see the sketch below)
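The real probe is generated as optimally unrolled C; the compact sketch below only illustrates the shape of the direct-storage kernel (N, M, and the matrix count NMAT are illustrative values, and square_once is a hypothetical helper, not the generator's output):

    #include <string.h>

    #define N    4      /* working set: 2*N^2 registers needed per matrix */
    #define M    8      /* computational intensity: squarings per load/store */
    #define NMAT 4096   /* enough matrices to exceed the cache */

    static double mats[NMAT][N][N];

    /* one squaring: c = a * a */
    static void square_once(double a[N][N], double c[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double s = 0.0;
                for (int k = 0; k < N; k++)
                    s += a[i][k] * a[k][j];
                c[i][j] = s;
            }
    }

    void sqmat_direct(void) {
        double tmp[N][N];
        for (int m = 0; m < NMAT; m++) {    /* stream over matrices, unit stride */
            for (int r = 0; r < M; r++) {   /* M squarings raise flops per word */
                square_once(mats[m], tmp);
                memcpy(mats[m], tmp, sizeof tmp);
            }
        }
    }

Each squaring costs N^2(2N-1) flops against 2N^2 words moved, so raising M directly raises the flops-per-word ratio.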

Unit Stride Algorithmic Peak
 Curve increases until the memory system is fully utilized, plateaus when the FPU units saturate
 Itanium2 requires a longer time to reach the plateau due to its register spill penalty
 The SIMD nature of Opteron's SSE2 inhibits a high algorithmic peak
 Power3 effectively hides the latency of cache access
 Power4's deep pipeline inhibits its ability to find sufficient ILP to saturate the FPUs

Slowdown due to Indirection (unit-stride access via indirection, S=∞)
 Opteron and Power3/4 show less than a 10% penalty once M>8 - demonstrating that the bandwidth between cache and processor effectively delivers both addresses and values
 Itanium2 shows a high penalty for indirection - the issue is currently under investigation
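To make the S parameter concrete, here is one hedged guess at how such an index stream could be built (build_indices is a hypothetical helper; the actual probe's layout generator may differ): every S-th index jumps to a random position and the next S-1 follow contiguously, so S=1 makes every access random while large S approaches unit stride.

    #include <stdlib.h>

    /* Fill idx[0..n-1] with runs of S contiguous offsets, each run starting
     * at a random position. span is the size of the underlying data array. */
    void build_indices(size_t *idx, size_t n, size_t S, size_t span) {
        size_t base = 0;
        for (size_t i = 0; i < n; i++) {
            if (i % S == 0)                      /* random jump every S entries */
                base = (size_t)rand() % (span - S);
            idx[i] = base + (i % S);             /* then S contiguous entries */
        }
    }

The kernel then reads data[idx[i]] instead of data[i], paying for the extra address stream and, at small S, for the irregular jumps.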

Cost of Irregularity (1)
 Itanium2 and Opteron perform well for irregular accesses due to:
   Itanium2's L2 caching of FP values (reduces the cost of a cache miss)
   Opteron's low memory latency thanks to its on-chip memory controller

Cost of Irregularity (2)
[Charts: slowdown for irregular access vs. M (1-512) on Power3 and Power4, N=4, for irregularity levels from 100% (S=1) down to 0.39% (S=256), plus fully random accesses]
 Power3 and Power4 perform poorly for irregular accesses due to:
   Power3's high cache-miss penalty (35 cycles) and limited prefetch abilities
   Power4 requiring 4 cache-line hits to activate prefetching

Tolerating Irregularity  S50 Start with some M at S=  (indirect unit stride) For a given M, how large must S be to achieve at least 50% of the original performance?  M50 Start with M=1, S=  At S=1 (every access random), how large must M be to achieve 50% of the original performance

Tolerating Irregularity
 S50: what % of memory accesses can be random before performance decreases by half?
 M50: how much computational intensity is required to hide the penalty of all-random access?
 The probe stresses the balance points of processor design (PMEO-04)
 Gather/scatter is expensive on commodity cache-based systems:
   Power4's S50 is only 1.6% (1 in 64)
   Itanium2 is much less sensitive at 25% (1 in 4)
 A huge amount of computation may be required to hide the overhead of irregular data access:
   Itanium2 requires a CI of about 9 flops/word
   Power4 requires a CI of almost 75!
 Interested in developing application-driven architectural probes for evaluation of emerging petascale systems

Emerging Architectures
 General-purpose processors are badly suited for data-intensive ops:
   Large caches not useful
   Low memory bandwidth
   Superscalar methods of increasing ILP are inefficient
   Power consumption
 Application-specific ASICs: good, but expensive/slow to design
 Solution: general-purpose "memory-aware" processors
   Large number of ALUs: to exploit data parallelism
   Huge memory bandwidth: to keep the ALUs busy
   Concurrency: overlap memory with computation

VIRAM Overview
 MIPS core (200 MHz)
 Main memory system:
   8 banks w/ 13 MB of on-chip DRAM
   Large 6.4 GB/s on-chip peak bandwidth
 Cache-less vector unit:
   Energy-efficient way to express fine-grained parallelism and exploit bandwidth
   Single issue, in order
 Low power consumption: 2.0 W
 Peak vector performance:
   1.6/3.2/6.4 Gops
   1.6 GFLOPS (single precision)
 Fabricated by IBM; taped out 02/2003
 To hide DRAM access latency, load/store and arithmetic instructions are deeply pipelined (15 stages)
 We use a simulator with Cray's vcc compiler (see the example below)
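Loops of the following shape are what a vectorizing compiler such as vcc maps onto a vector unit like VIRAM's: independent element-wise work with no loop-carried dependence, strip-mined into vector-register-length chunks (this SAXPY example is illustrative, not taken from the talk):

    /* y = a*x + y: each iteration is independent, so the compiler can
     * emit vector loads, a vector multiply-add, and vector stores */
    void saxpy(long n, float a, const float *x, float *y) {
        for (long i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }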

VIRAM Power Efficiency
 Comparable performance at a lower clock rate
 Large power/performance advantage for VIRAM from PIM technology and the data-parallel execution model

Stream Processing
 Stream: an ordered set of records (homogeneous, arbitrary data type)
 Stream programming: data is streams, computation is a kernel
 A kernel loops through all stream elements (in sequential order) and performs a compound (multiword) operation on each stream element (vs. vector processing, where a single arithmetic operation is performed on each vector element, then stored in a register) - see the sketch below
 Example: stereo depth extraction
 Data and functional parallelism
 High computational rate
 Little data reuse
 Producer-consumer and spatial locality
 Ex: multimedia, signal processing, graphics
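A minimal sketch of the stream model in plain C (the StereoPair record layout and the kernel body are invented for illustration; a real stream kernel would run on the stream processor's clusters, not as a scalar loop):

    /* one record of the input stream (hypothetical layout) */
    typedef struct { float left, right; } StereoPair;

    /* kernel: one sequential pass, a compound operation per record;
     * results stream straight to the consumer (producer-consumer
     * locality, little data reuse) */
    void depth_kernel(const StereoPair *in, float *out, long n) {
        for (long i = 0; i < n; i++) {
            float d = in[i].left - in[i].right;
            out[i] = d * d;      /* stand-in for the real depth computation */
        }
    }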

Imagine Overview
 "Vector VLIW" processor
 Coprocessor to an off-chip host processor
 8 arithmetic clusters controlled in SIMD w/ VLIW instructions
 Central 128 KB Stream Register File (SRF), 32 GB/s
   SRF can overlap computation w/ memory (double buffering)
   SRF can reuse intermediate results (producer-consumer locality)
 Stream-aware memory system with 2.7 GB/s off-chip bandwidth
 544 GB/s intercluster communication
 Host sends instructions to the stream controller; the SC issues commands to the on-chip modules

VIRAM and Imagine
 Imagine: an order of magnitude higher performance
 VIRAM: twice the memory bandwidth, less power consumption
 Notice the peak Flop/Word ratios

                         VIRAM         Imagine (memory)   Imagine (SRF)
   Bandwidth (GB/s)      6.4           2.7                32
   Peak Float (32-bit)   1.6 GF/s      20 GF/s            20 GF/s
   Peak Float/Word       1             ~30                2.5
   Speed (MHz)           200           —                  —
   Chip area             15x18 mm      12x12 mm
   Data widths           64/32/16      32/16/8
   Transistors           130 x 10^6    —
   Power consumption     2 Watts       10 Watts

 Example: VIRAM moves 6.4 GB/s = 1.6 Gwords/s of 32-bit data, so 1.6 GF/s / 1.6 Gwords/s = 1 flop/word; Imagine's SRF delivers 8 Gwords/s against 20 GF/s, i.e. 2.5 flops/word

SQMAT: Performance Crossover
 Large number of ops/word: N=3x3 matrices squared M=10 times
 Crossover point: L=64 (cycles), L=256 (MFlop)
 Imagine's power becomes apparent: almost 4x VIRAM at L=1024
   Codes at this end of the spectrum greatly benefit from the Imagine architecture

Stencil Probe
 Stencil computations are at the core of a wide range of scientific applications
   Applications include Jacobi solvers, complex multigrid, and block-structured AMR
 We are developing an adaptable stencil probe to model this range of computations (a representative kernel is sketched below)
 Findings isolate the importance of streaming memory accesses, which engage automatic prefetch engines and thus greatly increase memory throughput
 Previous L1 tiling techniques are mostly ineffective for stencil computations on modern microprocessors:
   Small blocks inhibit automatic prefetching performance
   Modern large on-chip L2/L3 caches have similar bandwidth to L1
 Currently investigating the tradeoffs between blocking and prefetching (paper in preparation)
 Interested in exploring the potential benefits of enhancing commodity processors with explicitly programmable prefetching
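For reference, a representative Jacobi-style 2-D stencil of the kind such a probe models (grid sizes and coefficients are illustrative; this is not the probe itself). The unit-stride inner loop is exactly the streaming access pattern that engages hardware prefetch engines:

    #define NX 1024
    #define NY 1024

    /* one relaxation sweep: b gets the 4-point average of a's neighbors */
    void jacobi_sweep(double (*a)[NX], double (*b)[NX]) {
        for (int j = 1; j < NY - 1; j++)
            for (int i = 1; i < NX - 1; i++)   /* unit stride in i: streams
                                                  through memory, prefetch-friendly */
                b[j][i] = 0.25 * (a[j][i-1] + a[j][i+1] +
                                  a[j-1][i] + a[j+1][i]);
    }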