Suitability of Alternative Architectures for Scientific Computing in 5-10 Years LDRD 2002 Strategic-Computational Review July 31, 2001 PIs: Xiaoye Li,


Suitability of Alternative Architectures for Scientific Computing in 5-10 Years LDRD 2002 Strategic-Computational Review July 31, 2001 PIs: Xiaoye Li, Bob Lucas, Lenny Oliker, Katherine Yelick Others: Brian Gaeke, Parry Husbands, Hyun Jin Kim, Hyn Jin Moon

B. Gaeke, P. Husbands, X. Li, R. Lucas, L. Oliker, K. Yelick Outline  Project Goals  FY01 progress report  Benchmark kernels definition  Performance on IRAM, comparisons with “conventional” machines  Management plan  Funding opportunities in future

Motivation and Goal  NERSC-3 (now) and NERSC-4 (in 2-3 years) consist of large clusters of commodity SMPs. What about 5-10 years from now?  Future architecture technologies:  PIM (e.g. IRAM, DIVA, Blue Gene)  SIMD/Vector/Stream (e.g. IRAM, Imagine, Playstation)  Low power, narrow data types (e.g., MMX, IRAM, Imagine)  Feasibility of building large-scale systems:  What will the commodity building blocks (nodes and networks) be?  Driven by NERSC and DOE scientific application codes.  Where do the needs diverge from big-market applications?  Influence future architectures

Computational Kernels and Applications  Kernels  Designed to stress memory systems  Some taken from the Data Intensive Systems Stressmarks  Unit and constant stride memory  Transitive-closure  FFT  Dense, sparse linear algebra (BLAS 1 and 2)  Indirect addressing  Pointer-jumping, Neighborhood (Image), sparse CG  NSA Giga-Updates Per Second (GUPS)  Frequent branching as well as irregular memory access  Unstructured mesh adaptation  Examples of NERSC/DOE applications that may benefit:  Omega3P, accelerator design (SLAC; AMR and sparse linear algebra)  Paratec, material science package (LBNL; FFT and dense linear algebra)  Camille, 3D atmospheric circulation model (preconditioned CG)  HyperClaw, simulate gas dynamics in AMR framework (LBNL)  NWChem, quantum chemistry (PNNL; global arrays and linear algebra)
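As a concrete illustration of the indirect-addressing kernels listed above, pointer-jumping can be sketched in a few lines. This is a minimal scalar Python model of the access pattern, not the benchmarked Stressmark code:

```python
# Pointer jumping: each element repeatedly replaces its pointer with its
# successor's pointer, so chain lengths halve every pass. The dependent,
# irregular loads are what stress the memory system.
def pointer_jump(nxt):
    nxt = list(nxt)
    while True:
        new = [nxt[nxt[i]] for i in range(len(nxt))]
        if new == nxt:
            return nxt   # every element now points at its chain's terminal node
        nxt = new
```

On a linked chain 0 -> 1 -> 2 -> 3 (with 3 pointing to itself), every element converges to the terminal node 3 in O(log n) passes.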

VIRAM Overview (UCB)  14.5 mm x 20.0 mm die  MIPS core (200 MHz)  Single-issue, 8 Kbyte I&D caches  Vector unit (200 MHz)  32 64b elements per register  256b datapaths (16b, 32b, 64b ops)  4 address generation units  Main memory system  12 MB of on-chip DRAM in 8 banks  12.8 GBytes/s peak bandwidth  Typical power consumption: 2.0 W  Peak vector performance  1.6/3.2/6.4 Gops without multiply-add  1.6 Gflops (single-precision)  Same process technology as Blue Gene  But targeted at a single chip for multimedia
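The quoted 1.6/3.2/6.4 Gops peaks follow from simple datapath arithmetic. A sketch: the factor of two parallel arithmetic units is inferred from the numbers, not stated on the slide:

```python
# Peak-rate arithmetic for the VIRAM numbers quoted above (a sketch; the
# assumption of two parallel vector arithmetic units is inferred, not given).
CLOCK_HZ = 200e6        # vector unit clock
DATAPATH_BITS = 256     # width of each vector datapath
ARITH_UNITS = 2         # assumed parallel arithmetic units

def peak_gops(element_bits):
    lanes = DATAPATH_BITS // element_bits   # elements processed per cycle per unit
    return lanes * ARITH_UNITS * CLOCK_HZ / 1e9

for bits in (64, 32, 16):
    print(bits, peak_gops(bits))   # -> 1.6, 3.2, 6.4 Gops
```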

Status of IRAM Benchmarking Infrastructure  Improved the VIRAM simulator.  Refining the performance model for double-precision FP performance.  Making the backend modular to allow for other microarchitectures.  Packaging the benchmark codes.  Build and test scripts plus input data (small and large data sets).  Added documentation.  Prepare for final chip benchmarking  Tape-out scheduled by UCB for 9/01.

Media Benchmarks  FFT uses in-register permutations, generalized reduction  All others written in C with Cray vectorizing compiler

Power Advantage of PIM+Vectors  100x100 matrix-vector multiplication (column layout)  Results from the LAPACK manual (vendor-optimized assembly)  VIRAM performance improves with larger matrices!  VIRAM power includes on-chip main memory!

Benchmarks for Scientific Problems  Transitive-closure (small & large data set)  Pointer-jumping (small & large working set)  Computing a histogram  Used for image processing of a 16-bit greyscale image: 1536 x 1536  2 algorithms: 64-element sorting kernel; privatization  Needed for sorting  Neighborhood image processing (small & large images)  NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)  Sparse matrix-vector product:  Order 10000, #nonzeros  2D unstructured mesh adaptation  initial grid: 4802 triangles, final grid: 24010
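The privatization algorithm for the histogram benchmark can be sketched as follows. This is a minimal scalar model of the vectorized idea; the lane count and the merge loop are illustrative:

```python
# Histogram by privatization: each of NLANES "lanes" accumulates into its
# own private copy of the histogram so concurrent updates never collide,
# then the private copies are merged with a reduction.
NLANES = 4
NBINS = 1 << 16          # 16-bit greyscale image

def histogram_privatized(pixels):
    private = [[0] * NBINS for _ in range(NLANES)]
    # In the vector version each lane processes a strided slice in parallel.
    for lane in range(NLANES):
        for p in pixels[lane::NLANES]:
            private[lane][p] += 1
    # Merge step: sum the private copies bin by bin.
    return [sum(private[lane][b] for lane in range(NLANES))
            for b in range(NBINS)]
```

The alternative on the slide, a small sorting kernel, avoids the private copies by sorting 64-element groups so equal pixel values become contiguous runs.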

Benchmark Performance on IRAM Simulator  IRAM (200 MHz, 2 W) versus Mobile Pentium III (500 MHz, 4 W)

Conclusions and VIRAM Future Directions  VIRAM outperforms Pentium III on scientific problems  With lower power and clock rate than the Mobile Pentium  Vectorization techniques developed for the Cray PVPs are applicable.  PIM technology provides a low-power, low-cost memory system.  A similar combination is used in the Sony Playstation.  Small ISA changes can have large impact  Limited in-register permutations sped up the 1K FFT by 5x.  Memory system can still be a bottleneck  Indexed/variable stride costly, due to address generation.  Future work:  Ongoing investigations into impact of lanes, subbanks  Technical paper in preparation – expect completion 09/01  Run benchmarks on real VIRAM chips  Examine multiprocessor VIRAM configurations

Project Goals for FY02 and Beyond  Use established data-intensive scientific benchmarks with other emerging architectures:  IMAGINE (Stanford Univ.)  Designed for graphics and image/signal processing  Peak 20 GFLOPS (32-bit FP)  Key features: vector processing, VLIW, a streaming memory system. (Not a PIM-based design.)  Preliminary discussions with Bill Dally.  DIVA (DARPA-sponsored: USC/ISI)  Based on PIM “smart memory” design, but for multiprocessors  Move computation to data  Designed for irregular data structures and dynamic databases.  Discussions with Mary Hall about benchmark comparisons

Management Plan  Roles of different groups and PIs  Senior researchers working on a particular class of benchmarks  Parry: sorting and histograms  Sherry: sparse matrices  Lenny: unstructured mesh adaptation  Brian: simulation  Jin and Hyun: specific benchmarks  Plan to hire an additional postdoc for next year (focus on Imagine)  Undergrad model used for targeted benchmark efforts  Plan for using computational resources at NERSC  Few resources used, except for comparisons

Future Funding Prospects  FY2003 and beyond  DARPA-initiated DIS program  Related projects are continuing under Polymorphic Computing  New BAA coming in “High Productivity Systems”  Interest from other DOE labs (LANL) in the general problem  General model  Most architectural research projects need benchmarking  Work has higher quality if done by people who understand the apps.  Expertise for hardware projects is different: system-level design, circuit design, etc.  Engagement from both the IRAM and Imagine groups shows the level of interest

Long-Term Impact  Potential impact on Computer Science  Promote research on new architectures and microarchitectures  Understand future architectures  Preparation for procurements  Provide visibility of NERSC in core CS research areas  Correlate applications: DOE vs. large-market problems  Influence future machines through research collaborations

The End

Integer Benchmarks  Strided access important, e.g., RGB  narrow types limited by address generation  Outer loop vectorization and unrolling used  helps avoid short vectors  spilling can be a problem

Status of benchmarking software release  Build and test scripts (Makefiles, timing, analysis, ...)  Standard random number generator  Optimized GUPS inner loop  GUPS C codes: Pointer Jumping, Pointer Jumping w/Update, Transitive, Field, Conjugate Gradient (Matrix), Neighborhood  Optimized vector histogram code  Vector histogram code generator  GUPS  Per-code status tracked: Docs, Test cases (small and large working sets), Optimized / Unoptimized  Future work: Write more documentation, add better test cases as we find them; incorporate media benchmarks, AMR code, library of frequently-used compiler flags & pragmas

Status of benchmarking work  Two performance models:  simulator (vsim-p), and trace analyzer (vsimII)  Recent work on vsim-p:  Refining the performance model for double-precision FP performance.  Recent work on vsimII:  Making the backend modular  Goal: Model different architectures w/ same ISA.  Fixing bugs in the memory model of the VIRAM-1 backend.  Better comments in code for better maintainability.  Completing a new backend for a new decoupled cluster architecture.

Comparison with Mobile Pentium  GUPS: VIRAM gets 6x more GUPS (measured at 16-bit, 32-bit, and 64-bit data element widths for Mobile Pentium and VIRAM)  Transitive, PointerUpdate: VIRAM 30-50% faster than P-III  Execution time for VIRAM rises much more slowly with data size than for P-III
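For reference, the GUPS kernel being compared is just a stream of random read-modify-write updates scattered across a large table. A minimal sketch; the table size and update count below are illustrative, not from the slide:

```python
import random

# GUPS (Giga-Updates Per Second): each iteration is an indexed load, one
# ALU op, and an indexed store -- the address-generation-bound pattern
# both machines are measured on.
def gups_kernel(table, indices):
    for i in indices:
        table[i] ^= i
    return table

table = [0] * 1024
updates = [random.randrange(1024) for _ in range(4096)]
gups_kernel(table, updates)
```

Because XOR is its own inverse, an even number of identical updates cancels out, which makes the benchmark easy to verify.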

Sparse CG  Solve Ax = b; Sparse matrix-vector multiplication dominates.  Traditional CRS format requires:  Indexed load/store for X/Y vectors  Variable vector length, usually short  Other formats for better vectorization:  CRS with narrow band (e.g., RCM ordering)  Smaller strides for X vector  Segmented-Sum (Modified the old code developed for Cray PVP)  Long vector length, of same size  Unit stride  ELL format: make all rows the same length by padding zeros  Long vector length, of same size  Extra flops
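The CRS-vs-ELL trade-off can be made concrete with a small sketch (scalar Python; the comments mark what becomes indexed vs. unit-stride, fixed-length access when vectorized):

```python
# Sparse matrix-vector product in the two formats discussed above.
def spmv_crs(val, col, rowptr, x):
    # CRS: short, variable-length rows; x is gathered with indexed loads.
    n = len(rowptr) - 1
    y = [0.0] * n
    for i in range(n):
        for k in range(rowptr[i], rowptr[i + 1]):
            y[i] += val[k] * x[col[k]]      # indexed load of x
    return y

def spmv_ell(eval_, ecol, x):
    # ELL: every row padded to the longest row length L, so the j-loop runs
    # long uniform vectors -- at the cost of extra flops on padding zeros.
    n, L = len(eval_), len(eval_[0])
    y = [0.0] * n
    for j in range(L):
        for i in range(n):                  # long vector, same length each pass
            y[i] += eval_[i][j] * x[ecol[i][j]]
    return y
```

Both routines compute the same y; padded ELL entries contribute zero, which is exactly the "extra flops" cost noted on the slide.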

SMVM Performance  DIS matrix: N = 10000, M = (~17 nonzeros per row)  IRAM results (MFLOPS), by number of sub-banks (1 / 2 / 4 / 8), for CRS, CRS banded, SEG-SUM, and ELL  CRS banded: 110  ELL (4.6x more flops): 511 (111) / 570 (124) / 612 (133) / 632 (137), parenthesized values discounting the extra padding flops  Mobile PIII (500 MHz), CRS: 35 MFLOPS

2D Unstructured Mesh Adaptation  Powerful tool for efficiently solving computational problems with evolving physical features (shocks, vortices, shear layers, crack propagation)  Complicated logic and data structures  Difficult to achieve high efficiency  Irregular data access patterns (pointer chasing)  Many conditionals / integer intensive  Adaptation is a tool for making the numerical solution cost-effective  Three types of element subdivision

Vectorization Strategy and Performance Results  Color elements based on vertices (not edges)  Guarantees no conflicts during vector operations  Vectorize across each subdivision (1:2, 1:3, 1:4) one color at a time  Difficult: many conditionals, low flops, irregular data access, dependencies  Initial grid: 4802 triangles, final grid: 24010 triangles  Preliminary results demonstrate VIRAM 4.5x faster than Mobile Pentium III 500  Higher code complexity (requires graph coloring + reordering)  Time (ms) measured for Pentium III and VIRAM with 1, 2, and 4 lanes
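The coloring step of this strategy can be sketched with a greedy vertex-based coloring. This is illustrative only; the actual coloring and reordering code is not shown on the slide:

```python
# Greedy element coloring over shared vertices: triangles that share a
# vertex receive different colors, so all elements of one color can be
# subdivided in a single vector operation with no write conflicts.
def color_elements(triangles):
    # triangles: list of 3-tuples of vertex ids
    vertex_colors = {}          # vertex id -> set of colors already touching it
    colors = []
    for tri in triangles:
        used = set()
        for v in tri:
            used |= vertex_colors.get(v, set())
        c = 0
        while c in used:        # smallest color unused at all three vertices
            c += 1
        colors.append(c)
        for v in tri:
            vertex_colors.setdefault(v, set()).add(c)
    return colors
```

Elements are then sorted by color, and each subdivision case (1:2, 1:3, 1:4) is vectorized within one color class at a time.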