CS 252 Spring 2000 Jeff Herman John Loo Xiaoyi Tang


A Comparison of the VIRAM-1 and Embedded VLIW Architectures for Use on SVD

Motivation
  SVD applications
    Smart antennas
    Image processing
    Medical imaging
  VLIW
    Trend in high-performance embedded computing
  Vector
    Out of favor
    Flynn bottleneck is a limiting factor in parallelism
    Known for linear algebra performance

C67 Architecture (mapped)
  [Block diagram: instruction RAM (cache optional); 8-way decode logic; register files A and B; functional units L1, S1, M1, D1 and D2, M2, S2, L2; data RAM (>4 banks)]

C67 Architecture
  Split register files
    16 registers per register file
    One cross path per register file
  Instruction latencies
    Branches: 6 cycles
    Load: 5 cycles
    FP add/multiply: 4 cycles

TM 1100
  VLIW processor core architecture
    5-issue VLIW
    2 FP adders/multipliers
    2 load/store units
    128 general-purpose 32-bit registers
    16 KB data cache, 32 KB instruction cache
  Instruction latencies
    3 cycles for branches, load, FP add/multiply
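The latency figures on the C67 and TM 1100 slides can be compared with a small calculation. The sketch below is only illustrative: the load, multiply, accumulate chain is a hypothetical example rather than a kernel from the presentation, and it assumes a fully dependent chain with no overlap, which a VLIW compiler would normally try to hide by scheduling independent operations into the delay slots.

```python
# Latencies in cycles, as listed on the C67 and TM 1100 slides.
LATENCY = {
    "C67": {"branch": 6, "load": 5, "fp": 4},
    "TM 1100": {"branch": 3, "load": 3, "fp": 3},
}

# Hypothetical dependent chain: load an operand, multiply it,
# then add the product to an accumulator.
MAC_CHAIN = ["load", "fp", "fp"]


def chain_cycles(cpu, ops):
    """Cycles for a fully dependent chain of operations (no overlap assumed)."""
    return sum(LATENCY[cpu][op] for op in ops)


if __name__ == "__main__":
    for cpu in LATENCY:
        print(cpu, chain_cycles(cpu, MAC_CHAIN), "cycles")
```

Under these assumptions the longer C67 latencies mean more independent work must be found per iteration to keep its functional units busy.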

VIRAM-1 Microarchitecture
  2-way-issue superscalar MIPS IV core
  Asynchronous vector unit
    Communication to scalar core through queue
  32 general-purpose vector and flag registers
  32 scalar and control registers
  2 VAFU, 2 FFU, 1 VMFU
  4-lane standard configuration

VIRAM-1 Microarchitecture

Testing Conditions
  SVD routine from CLAPACK
  Random test matrices with a rank of 10
  Matrix dimension ratio of 10
    Sizes range from 100x10 to 300x30
  Suboptimal parameters used
    Trends should still hold
  Assumed 200 MHz clock rate
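One way to picture the test inputs described above is the sketch below. It is not the authors' generator: the function names, the step of 10 columns between sizes, the uniform value range, and the seed are all assumptions made for illustration. The matrix is built as the product of two random factors, so its rank is at most the requested rank.

```python
import random


def test_shapes(min_cols=10, max_cols=30, step=10, ratio=10):
    """Matrix shapes (rows, cols) with rows = ratio * cols.

    The endpoints 100x10 and 300x30 come from the slide; the step is assumed.
    """
    return [(ratio * n, n) for n in range(min_cols, max_cols + 1, step)]


def low_rank_matrix(rows, cols, rank=10, seed=0):
    """Random rows x cols matrix of rank at most `rank` (product of two factors)."""
    rng = random.Random(seed)
    left = [[rng.uniform(-1.0, 1.0) for _ in range(rank)] for _ in range(rows)]
    right = [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rank)]
    return [
        [sum(left[i][k] * right[k][j] for k in range(rank)) for j in range(cols)]
        for i in range(rows)
    ]


if __name__ == "__main__":
    for rows, cols in test_shapes():
        matrix = low_rank_matrix(rows, cols)
        print(len(matrix), "x", len(matrix[0]))
```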

Ideal 'C67 and TM 1100 Performance Gap
  Same memory bottlenecks in both processors
  Programming model
    C67
      Assembly-coded kernels
      1700 lines
    TM 1100
      Only C-level optimizations

VIRAM Performance Summary
  Gains from vector unit limited by Amdahl's law.
    Vector instructions comprise only ~15% of total code.
    Not much else of SVD can be vectorized.
    Gains limited by what cannot be vectorized.
    Perhaps streamline LAPACK or handcode assembly?
  Sub-linear scalability.
    Scaling IRAM is cheap but gains diminish.
    Efficiency and scalability increase with size of data set.
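The Amdahl's law limit mentioned above can be made concrete with a short calculation. This is a sketch under a loud assumption: it treats the vectorizable part as a fraction of the original run time, whereas the slide reports ~15% of instructions, which is not the same quantity. The 0.15 fraction and the per-lane speedups below are therefore illustrative inputs, not measured results.

```python
def amdahl_speedup(vector_fraction, vector_speedup):
    """Overall speedup when only `vector_fraction` of run time is accelerated.

    Amdahl's law: 1 / ((1 - f) + f / s).
    """
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


def amdahl_limit(vector_fraction):
    """Upper bound on speedup as the vector unit becomes infinitely fast."""
    return 1.0 / (1.0 - vector_fraction)


if __name__ == "__main__":
    fraction = 0.15  # assumed run-time fraction, for illustration only
    for lanes in (1, 2, 4, 8):
        # Assumes the vectorized part speeds up linearly with lane count.
        print(lanes, "lanes:", round(amdahl_speedup(fraction, lanes), 3))
    print("limit:", round(amdahl_limit(fraction), 3))
```

With a small vectorizable fraction, adding lanes quickly stops helping, which is consistent with the sub-linear scalability noted on the slide.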

Concluding Remarks
  The two architectures have different limitations
    VIRAM: scalar core
    VLIW: memory bandwidth
  VLIW cannot match the performance of VIRAM when computing SVD.
  VLIW with vector coprocessor?