
Compilers and Applications
Kathy Yelick
Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen
Computer Science Division, UC Berkeley

Compiling for VIRAM
Long-term success of DIS technology depends on a simple programming model, i.e., a compiler
Needs to handle a significant class of applications
  – IRAM: multimedia, graphics, speech and image processing
  – ISTORE: databases, signal processing, other DIS benchmarks
Needs to utilize hardware features for performance
  – IRAM: vectorization
  – ISTORE: scalability of shared-nothing programming model

IRAM Compilers
IRAM/Cray vectorizing compiler [Judd]
  – Production compiler
      Used on the T90, C90, as well as the T3D and T3E
      Being ported (by SGI/Cray) to the SV2 architecture
  – Has C, C++, and Fortran front-ends (focus on C)
  – Extensive vectorization capability
      outer loop vectorization, scatter/gather, short loops, …
  – VIRAM port is under way
IRAM/VSUIF vectorizing compiler [Krashinsky]
  – Based on VSUIF from Corinna Lee’s group at Toronto,
      which is based on MachineSUIF from Mike Smith’s group at Harvard,
      which is based on the SUIF compiler from Monica Lam’s group at Stanford
  – This is a “research” compiler, not intended for compiling large complex applications
  – It has been working since 5/99.

IRAM/Cray Compiler Status
MIPS backend developed this year
  – Validated using a commercial test suite for code generation
  – Generated code run through vas assembler
Vector backend recently started
  – Testing with vsim under way this week
Leveraging from Cray
  – Automatic vectorization
  – Basic instruction scheduling framework
[Diagram labels: Frontends (C, Fortran, C++), Vectorizer, PDGCS, Code Generators (IRAM, C90)]

ISTORE Compiler
Titanium language is an extension of Java
  – tc is the Titanium compiler
Recent progress:
  – improved portability of generated code and the compiler itself, including port to Cray parallel machines
  – additions to generate annotations on C code to improve fine-grained parallelism (on Tera MTA) and vectorization
New benchmarking efforts
  – database primitives: sorting, hash-join and index-nested-loop join
  – 3d FFT and linear solvers (LU)
[Diagram labels: Java, Titanium, tc, Optimizer, Code Gen, C + comm, C compiler (cc), ISTORE, t3e]

Applications
Hand-written kernels for single-chip VIRAM
  – focus on multimedia kernels, see IRAM hardware talk
Compiled programs for single-chip VIRAM
  – 2 examples from IRAM/VSUIF: decryption and mvm
  – most effort devoted to IRAM/Cray compiler
Performance benchmarks for ISTORE
  – 3d FFT
  – Others
SAM benchmarks for ISTORE

Automatic Vectorization
Vectorizing compilers very successful on scientific applications
  – not entirely automatic, especially for C/C++
  – good tools for training users
Multimedia applications have
  – shorter vector lengths
  – can sometimes exploit outer loop vectorization for longer vectors
  – often leads to non-unit strides
  – tree traversals could be written as scatter/gather (breadth-first), although automating this is far from solved
      e.g., image compression
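
The access patterns named on this slide can be illustrated with a small sketch. It is not output from either compiler; the function names and the maximum vector length `MVL = 64` are assumptions made for the example. Each list comprehension stands in for one vector instruction.

```python
MVL = 64  # assumed maximum vector length (hypothetical value)


def strip_mined_scale(x, a, mvl=MVL):
    """Scale x by a, processing at most mvl elements per 'vector instruction'."""
    out = [0.0] * len(x)
    for start in range(0, len(x), mvl):
        end = min(start + mvl, len(x))
        # One strip: a single vector multiply over up to mvl elements.
        out[start:end] = [a * v for v in x[start:end]]
    return out


def strided_load(mem, base, stride, n):
    """Load n elements starting at base, stepping by stride (non-unit stride)."""
    return [mem[base + i * stride] for i in range(n)]


def gather(mem, indices):
    """Indexed load: fetch mem[i] for every i in indices."""
    return [mem[i] for i in indices]


# A 3x4 image stored row by row: reading column 1 needs stride 4 (the row width).
image = list(range(12))
column_1 = strided_load(image, base=1, stride=4, n=3)
```

Vectorizing the outer loop over rows is what turns a unit-stride walk along a row into the stride-4 column access shown in the last lines.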

IRAM/VSUIF Decryption (IDEA)
IDEA Decryption operates on 16-bit ints
Compiled with IRAM/VSUIF (with unrolling by hand)
Note scalability of both #lanes and data width
[Chart: x-axis “# lanes”]
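
As a rough illustration of why IDEA suits narrow data widths, the sketch below shows the cipher's 16-bit multiplication modulo 2^16 + 1 (where the word 0 stands for 2^16) applied elementwise to a list of words. This is only one of IDEA's operations, not the benchmark code, and the function names are invented for the example.

```python
def idea_mul(a, b):
    """IDEA multiplication of two 16-bit words modulo 2**16 + 1.

    The word 0 represents 2**16, so the result always fits in 16 bits.
    """
    a = a or 0x10000
    b = b or 0x10000
    r = (a * b) % 0x10001
    return r & 0xFFFF  # maps 2**16 back to the word 0


def idea_mul_vector(words, subkey):
    """Apply idea_mul to every word independently (one word per vector element)."""
    return [idea_mul(w, subkey) for w in words]
```

Every word is processed independently of its neighbours, which is the kind of elementwise work that can be spread across lanes.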

VIRAM/VSUIF Matrix/Vector Multiply
VIRAM/VSUIF does reasonably well on long loops
[Chart labels: mvm, vmm; 256x256 single matrix]
Compare to 1600 Mflop/s (peak without multadd)
Note BLAS-2 (little reuse)
    ~350 on Power3 and EV6
Problems specific to VSUIF
  – hand strip-mining results in short loops
  – reductions
  – no multadd support
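
The reduction problem listed above can be seen by writing matrix-vector multiply with its two loop orders. This is a generic sketch with invented function names, not the benchmark source. In the dot-product form each row ends in a sum across the vector; in the column form the inner loop is an elementwise update of y with no reduction.

```python
def mvm_dot(A, x):
    """y = A x with the inner loop over columns: each row needs a reduction."""
    y = [0.0] * len(A)
    for i in range(len(A)):
        acc = 0.0
        for j in range(len(x)):
            acc += A[i][j] * x[j]  # accumulates into one scalar (reduction)
        y[i] = acc
    return y


def mvm_column(A, x):
    """y = A x with the inner loop over rows: y += A[:, j] * x[j], no reduction."""
    y = [0.0] * len(A)
    for j in range(len(x)):
        xj = x[j]
        for i in range(len(A)):
            y[i] += A[i][j] * xj  # independent update of each y[i]
    return y


def mvm_flops(n):
    """Floating-point operations for an n x n multiply: one multiply and one add per entry."""
    return 2 * n * n
```

With row-major storage the column form reads A with a stride of one row, so the reduction is traded for non-unit-stride access. Each matrix entry is used once in either form, which is the "little reuse" noted for BLAS-2.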

3D FFT on ISTORE
Performance of large 3D FFTs depends on 2 factors
  – speed of 1D FFT on a single node (next slide)
  – network bandwidth for “transposing” data
  – 1.3 Tflop FFT possible w/ 1K IRAM nodes and 0.5 TB/s bw
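
The two factors correspond to the two steps of the usual way to build a 3D transform from 1D ones: transform along one axis, rearrange the data so another axis becomes contiguous, and repeat. The sketch below is a single-process illustration with invented names; it uses a naive O(n^2) DFT in place of a tuned 1D FFT, and `rotate_axes` stands in for the step that would be network traffic on a multi-node machine.

```python
import cmath


def dft(v):
    """Naive 1D discrete Fourier transform (stand-in for a fast 1D FFT)."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]


def transform_last_axis(a):
    """Apply the 1D transform to every innermost row of a 3D nested list."""
    return [[dft(row) for row in plane] for plane in a]


def rotate_axes(a):
    """Return b with b[j][k][i] == a[i][j][k] (the 'transpose' step)."""
    n0, n1, n2 = len(a), len(a[0]), len(a[0][0])
    return [[[a[i][j][k] for i in range(n0)] for k in range(n2)] for j in range(n1)]


def fft3d(a):
    """3D transform: three rounds of (1D transforms on the last axis, then rotate)."""
    for _ in range(3):
        a = rotate_axes(transform_last_axis(a))
    return a  # after three rotations the original axis order is restored
```

Only `rotate_axes` touches data across rows, so in a distributed version it is the step whose cost is set by network bandwidth, while `dft` is purely local work.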

1D FFT on IRAM

Processor (vendor)                     Time     Precision
TigerSHARC DSP (Analog Devices)        41 us    32 bit
IRAM                                   37 us    32 bit
TMS320C6000 DSP (Texas Instruments)    124 us   32 bit
DSP56002 DSP (Motorola)                908 us   24 bit

FFT study on IRAM [Randi Thomas]
  – hand-coded and scheduled
  – use of ISA features to make in-register FFTs fast (128 point)
  – bit-reversal time not included; will also use ISA support
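
To show where the bit-reversal step sits relative to the butterfly stages, here is a textbook iterative radix-2 FFT. It is a generic sketch with invented names, not the hand-coded IRAM kernel, and it makes no use of the ISA features mentioned above.

```python
import cmath


def reverse_bits(i, bits):
    """Reverse the lowest `bits` bits of i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reverse_permute(x):
    """Reorder x so that element i moves to the bit-reversed index of i."""
    n = len(x)
    bits = n.bit_length() - 1
    out = list(x)
    for i in range(n):
        j = reverse_bits(i, bits)
        if i < j:
            out[i], out[j] = out[j], out[i]
    return out


def fft(x):
    """Iterative decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    a = bit_reverse_permute([complex(v) for v in x])
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t          # butterfly: upper output
                a[start + k + half] = u - t   # butterfly: lower output
                w *= w_step
        size *= 2
    return a
```

The permutation is a separate data-movement pass from the arithmetic stages, which is why its time can be reported apart from the butterfly time.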

Other ISTORE Applications
Working on several performance applications for ISTORE
  – Database primitives: sorts, joins, scans, etc. [Kar Ming Tang]
  – RT_STAP
      QR Decomposition vectorizes easily, partially complete in IRAM/VSUIF
  – Conjugate Gradient [Samson Kwok]
      Dominated by sparse matrix-vector multiply
      Current performance: 500/250 Mflops (single/double) on VIRAM
      Compare to 10s of Mflops on most RISC machines
  – Dense linear algebra [Simon Yau]
  – Considering other DIS benchmarks, such as MoM
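
The sparse matrix-vector multiply that dominates Conjugate Gradient can be sketched as follows. Compressed sparse row (CSR) storage is an assumption made for the example; the slide does not say which format the benchmark uses, and the names are invented.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A x for a matrix A stored in compressed sparse row (CSR) form.

    values  : nonzero entries, row by row
    col_idx : column index of each nonzero
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        gathered = [x[c] for c in col_idx[lo:hi]]  # indexed load (gather) from x
        y.append(sum(v * g for v, g in zip(values[lo:hi], gathered)))
    return y


# The matrix [[4, 0, 1], [0, 3, 0], [2, 0, 5]] in CSR form.
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
```

The reads of x are indirect through `col_idx`, so memory behaviour rather than arithmetic tends to set the speed of this kernel.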

Conclusions
Significant compiler progress:
  – Cray collaboration key [Dave Judd Eagan]
  – Good tech transfer model
  – Vector code gen and instruction scheduling next steps
Even VSUIF version indicates reasonable performance
  – Commercial-quality compiler will allow non-toy applications, e.g., Speech
Benchmarks
  – Have been used to help with final ISA design
  – Simulated results validate performance claims
  – Models show real advantage to Intelligence in Memory (and Disk)
  – Machines scale, and with a simpler programming and optimization model than conventional multiprocessors