Memory Intensive Benchmarks: IRAM vs. Cache-Based Machines
Parry Husbands (LBNL); Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL); Rupak Biswas (NASA Ames)
IPDPS 2002

Motivation
- Observation: current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones)
  - E.g., optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz Pentium 4
  - Even worse when parallel efficiency is considered: overall ~10% across application benchmarks
- Is memory bandwidth the problem?
  - Performance is directly related to how well the memory system performs
  - But the "gap" between processor performance and DRAM access time continues to grow (60%/yr vs. 7%/yr)

Solutions?
- Better software: ATLAS, FFTW, Sparsity, PHiPAC
- Power and packaging are important too!
  - New buildings and infrastructure needed for many recent/planned installations
- Alternative architectures
  - One idea: tighter integration of processor and memory
  - BlueGene/L (~25 cycles to main memory)
  - VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM

VIRAM Overview
(Die photo: 14.5 mm x 20.0 mm)
- MIPS core (200 MHz)
- Main memory system
  - 13 MB of on-chip DRAM
  - Large on-chip bandwidth: 6.4 GB/s peak to the vector unit
- Vector unit
  - Energy-efficient way to express fine-grained parallelism and exploit bandwidth
  - Typical power consumption: 2.0 W
- Peak vector performance
  - 1.6/3.2/6.4 Gops
  - 1.6 Gflops (single precision)
- Fabrication by IBM; tape-out in O(1 month)
- Our results use a simulator with Cray's vcc compiler
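For the per-cycle arguments made later (the Ops/Cycle chart, the limit of four address generations per cycle), it may help to restate these quoted peaks relative to the 200 MHz clock. This is nothing more than division of the numbers above; it adds no data beyond what the slide states:

```latex
\frac{6.4\ \mathrm{GB/s}}{200\ \mathrm{MHz}} = 32\ \text{bytes/cycle},
\qquad
\frac{1.6\ \mathrm{Gop/s}}{200\ \mathrm{MHz}} = 8\ \text{ops/cycle (64-bit)},
\quad
\frac{3.2}{0.2} = 16\ \text{(32-bit)},
\quad
\frac{6.4}{0.2} = 32\ \text{(16-bit)}.
```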

Our Task
- Evaluate the use of processor-in-memory (PIM) chips as a building block for high-performance machines
- For now, focus on serial performance
- Benchmark VIRAM on scientific computing kernels
  - VIRAM was originally designed for multimedia applications
  - Can we use on-chip DRAM for vector processing vs. conventional SRAM? (DRAM is denser)
- Isolate the performance-limiting features of the architectures
  - More than just memory bandwidth

Benchmarks Considered
- Transitive closure (small & large data sets)
- NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit): fetch-and-increment a stream of "random" addresses
- Sparse matrix-vector product: order 10000, #nonzeros
- Computing a histogram
  - Different algorithms investigated: 64-element sorting kernel; privatization; retry
- 2D unstructured mesh adaptation

Per-step operation counts:

            Transitive   GUPS        SPMV    Histogram   Mesh
  Ops/step  2            1           2       1           N/A
  Mem/step  2 ld 1 st    2 ld 2 st   3 ld    2 ld 1 st   N/A

The Results
- Comparable performance with a lower clock rate

Power Efficiency
- Large power/performance advantage for VIRAM, from:
  - PIM technology
  - The data-parallel execution model

Ops/Cycle

GUPS
- 1 op, 2 loads, 1 store per step
- Mix of indexed and unit-stride operations
- Address generation is the key limit here (only 4 addresses generated per cycle on VIRAM); see the sketch below
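A minimal scalar sketch of the GUPS update loop, as a reconstruction of the access pattern described above rather than the authors' benchmark code; the index stream is assumed to be precomputed and already within the table bounds:

```c
#include <stddef.h>
#include <stdint.h>

/* GUPS core loop: fetch-and-increment table entries at pseudo-random
 * locations. Per step: 1 op (the add), 2 loads (the unit-stride index
 * stream plus the indexed table entry), and 1 indexed store. */
void gups(uint64_t *table, const uint64_t *index_stream, size_t n_updates)
{
    for (size_t i = 0; i < n_updates; i++) {
        uint64_t j = index_stream[i];   /* load 1: unit stride              */
        table[j] += 1;                  /* load 2 + store: indexed accesses */
    }
}
```

On a vector machine the indexed load and store become a gather and a scatter, and each element needs its own generated address, which is why the four-per-cycle address-generation rate dominates.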

Histogram
- 1 op, 2 loads, 1 store per step
- Like GUPS, but duplicate indices restrict the available parallelism and make the loop more difficult to vectorize (see the sketch below)
- The sort-based method performs best on VIRAM on real data
- Competitive when the histogram doesn't fit in cache
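A sketch of the scalar loop and of the privatization variant mentioned above; this is an illustration of the general techniques, not the benchmark source, and the names are hypothetical (the 64-element sorting kernel and the retry scheme are not shown):

```c
#include <stddef.h>
#include <string.h>

/* Naive histogram: like GUPS, but when several elements of data[] fall in
 * the same bin, the read-modify-write on hist[bin] creates a dependence
 * between iterations, so the loop cannot be vectorized directly.
 * All data[i] values are assumed to be < n_bins. */
void histogram_scalar(unsigned *hist, const unsigned *data, size_t n)
{
    for (size_t i = 0; i < n; i++)
        hist[data[i]]++;                          /* duplicates collide here */
}

/* Privatization: each of n_copies interleaved strips accumulates into its
 * own private histogram, then the copies are reduced into hist. The strips
 * are independent, which restores parallelism at the cost of extra memory. */
void histogram_private(unsigned *hist, size_t n_bins,
                       const unsigned *data, size_t n,
                       unsigned *priv, size_t n_copies)
{
    memset(priv, 0, n_copies * n_bins * sizeof(unsigned));
    for (size_t c = 0; c < n_copies; c++)
        for (size_t i = c; i < n; i += n_copies)
            priv[c * n_bins + data[i]]++;
    for (size_t b = 0; b < n_bins; b++)           /* reduce the copies */
        for (size_t c = 0; c < n_copies; c++)
            hist[b] += priv[c * n_bins + b];
}
```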

Which Problems are Limited by Bandwidth?
- What is the bottleneck in each case?
  - Transitive and GUPS are limited by bandwidth (near the 6.4 GB/s peak)
  - SPMV and Mesh are limited by address generation, bank conflicts, and parallelism
  - Histogram is limited by lack of parallelism, not memory bandwidth

Summary and Future Directions
- Performance advantage
  - Large on applications limited only by bandwidth
  - More address generators/sub-banks would help irregular performance
- Performance/power advantage
  - Over both low-power and high-performance processors
  - Both PIM and data parallelism are key
- The performance advantage for VIRAM depends on the application
  - Need fine-grained parallelism to utilize the on-chip bandwidth
- Future steps
  - Validate our work on the real chip!
  - Extend to multi-PIM systems
  - Explore system balance issues
    - Other memory organizations (banks, bandwidth vs. size of memory)
    - Number of vector units
    - Network performance vs. on-chip memory

The Competition

          SPARC IIi      MIPS R10K     P III          P 4        Alpha EV6
Make      Sun Ultra 10   Origin 2000   Intel Mobile   Dell       Compaq DS10
Clock     333 MHz        180 MHz       600 MHz        1.5 GHz    466 MHz
L1                       32+32 KB      32 KB          12+8 KB    64+64 KB
L2        2 MB           1 MB          256 KB                    2 MB
Mem       256 MB         1 GB          128 MB         1 GB       512 MB

Transitive Closure (Floyd-Warshall)
- 2 ops, 2 loads, 1 store per step
- Good for vector processors: abundant, regular parallelism and unit-stride access (see the sketch below)
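A minimal Floyd-Warshall sketch of the boolean transitive-closure kernel (a standard formulation, not necessarily the benchmark's exact code). The inner j loop is unit-stride and its iterations are independent, which is what makes it vectorize so well:

```c
#include <stddef.h>

/* Transitive closure over an n x n boolean adjacency matrix, row-major.
 * Per inner step: 2 loads (reach[i][j], reach[k][j]), 2 ops (AND, OR),
 * and 1 store, matching the per-step counts quoted above. */
void transitive_closure(unsigned char *reach, size_t n)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            unsigned char via_k = reach[i * n + k];    /* invariant in j */
            for (size_t j = 0; j < n; j++)             /* unit stride    */
                reach[i * n + j] |= (unsigned char)(via_k & reach[k * n + j]);
        }
}
```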

SPMV
- 2 ops, 3 loads per step
- Mix of indexed and unit-stride operations
- Good performance with ELLPACK, but only when every row has the same number of nonzeros (see the sketch below)
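A sketch of sparse matrix-vector multiply in the ELLPACK layout (an illustration of the format rather than the authors' kernel; the array names are hypothetical). Each inner step loads a matrix value and a column index with unit stride, gathers x[col], and performs a multiply and an add, i.e. 3 loads and 2 ops:

```c
#include <stddef.h>

/* ELLPACK SpMV: the matrix is stored as two n_rows x max_nnz arrays, val[]
 * and col[], padded where a row has fewer nonzeros (padded entries use
 * val = 0.0 and col = 0 so the gather stays in bounds). The format gives
 * long, regular vectors, but the padding wastes work unless every row has
 * about the same number of nonzeros. */
void spmv_ellpack(double *y, const double *val, const size_t *col,
                  const double *x, size_t n_rows, size_t max_nnz)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        for (size_t k = 0; k < max_nnz; k++)
            sum += val[i * max_nnz + k] * x[col[i * max_nnz + k]];
        y[i] = sum;
    }
}
```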

Mesh Adaptation
- Single level of refinement of a mesh with 4802 triangular elements, 2500 vertices, and 7301 edges
- Extensive reorganization was required to take advantage of vectorization
- Many indexed memory operations (limited, again, by address generation); see the sketch below
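A small sketch of the kind of indexed access an edge-based pass over the mesh performs; this is purely illustrative of the gather pattern, not the adaptation code itself, and every structure here is hypothetical:

```c
#include <stddef.h>

/* For each edge, gather the coordinates of its two endpoints through the
 * edge-to-vertex index arrays and compute the midpoint at which a new
 * vertex would be inserted. Each iteration issues several indexed loads,
 * which is why kernels like this are limited by address generation rather
 * than by raw memory bandwidth. */
void edge_midpoints(double *mid_x, double *mid_y,
                    const size_t *edge_v0, const size_t *edge_v1,
                    const double *vx, const double *vy, size_t n_edges)
{
    for (size_t e = 0; e < n_edges; e++) {
        size_t a = edge_v0[e], b = edge_v1[e];    /* unit-stride loads */
        mid_x[e] = 0.5 * (vx[a] + vx[b]);         /* indexed gathers   */
        mid_y[e] = 0.5 * (vy[a] + vy[b]);
    }
}
```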