Slide 1 Exploiting On-Chip Bandwidth
The vector ISA + compiler technology uses high bandwidth to mask latency.
Compiled matrix-vector multiplication: 2 flops/element
–An easy compilation problem; stresses memory bandwidth
–Compare to 304 Mflops (64-bit) for the Power3 (hand-coded)
–Performance normally scales with the number of lanes
–Needs more memory banks than the default DRAM macro provides
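For reference, a minimal C sketch of the kernel the slide is measuring (the function name and data layout are illustrative, not the IRAM benchmark source). The inner loop is exactly the "easy compilation problem": each element of A contributes one multiply and one add, so a vectorizing compiler can map it onto vector load / multiply-add / reduce operations and the limit becomes memory bandwidth, not arithmetic.

/* Dense matrix-vector multiply: y = A * x, with A stored row-major.
 * 2 flops (1 mul + 1 add) per matrix element touched once from
 * memory, so the kernel stresses bandwidth rather than the ALUs. */
void matvec(int n, const double *A, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)      /* vectorizable inner loop */
            sum += A[i * n + j] * x[j];  /* 1 mul + 1 add per element */
        y[i] = sum;
    }
}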

Slide 2 Compiling Media Kernels on IRAM
The compiler generates code for narrow data widths, e.g., 16-bit integers.
The compilation model is simple and more scalable (across generations) than MMX, VIS, etc.
–Strided and indexed loads/stores are simpler than pack/unpack
–The maximum vector length is longer than the datapath width (256 bits); all lane scalings run the same single executable
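As an illustration (not taken from the slides), here is the kind of narrow-width, strided media kernel the slide has in mind: scaling one channel of interleaved 16-bit stereo audio. A vector compiler maps the stride-2 accesses directly onto strided loads/stores, whereas a pack/unpack-style SIMD ISA such as MMX would need shuffle sequences to isolate the channel. The function name and Q8 fixed-point convention are assumptions for the example.

#include <stdint.h>

/* Scale the left channel of interleaved 16-bit stereo frames by a
 * Q8 fixed-point gain.  The stride-2 access pattern is a natural
 * strided vector load/store; no pack/unpack is needed. */
void scale_left_channel(int16_t *samples, int nframes, int16_t gain)
{
    for (int i = 0; i < 2 * nframes; i += 2)              /* stride 2 */
        samples[i] = (int16_t)((samples[i] * gain) >> 8); /* fixed-point scale */
}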

Slide 3 Protein Folding on IRAM?
Vectorization of the basic algorithms is well known, e.g.:
–Spectral methods (large FFTs); probably hand-code the inner FFT
–The naïve O(n^2) algorithm for forces vectorizes over atoms
»Hierarchical methods (fast multipole) also vectorize, over the inner loop (e.g., matrix-vector multiply) or by packing a set of interaction evaluations
–Monte Carlo methods vectorize
The difficulty comes from handling irregularities in the hardware:
–Unpredictable network delays, processor failures, ...
–These lead to an event-driven model: compute on the next pair of atoms when the 2nd one arrives
IRAM benefits from larger units of work (see the sketch below):
–E.g., compute a set of interactions when the next chunk of k atoms arrives; vectorization/parallelism within a chunk
–Larger messages also amortize message overhead
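A hypothetical sketch of the chunked evaluation described above, under the assumption of some scalar pairwise kernel pair_force (the struct, names, and scalar accumulation are illustrative). When a chunk of k newly arrived atoms is available, its interactions against all local atoms are accumulated at once: the inner loop over local atoms has independent iterations, so it vectorizes, and a larger k amortizes the per-message overhead.

#include <stddef.h>

typedef struct { double x, y, z; } vec3;

/* Assumed pairwise interaction kernel (e.g., a force magnitude);
 * defined elsewhere in a real code. */
extern double pair_force(vec3 a, vec3 b);

/* Accumulate interactions of a newly arrived chunk of k atoms
 * against nlocal resident atoms.  The inner loop vectorizes. */
void chunk_forces(const vec3 *local, double *f, size_t nlocal,
                  const vec3 *chunk, size_t k)
{
    for (size_t c = 0; c < k; c++)           /* per arriving atom */
        for (size_t i = 0; i < nlocal; i++)  /* vectorizable inner loop */
            f[i] += pair_force(local[i], chunk[c]);
}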