Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal

Presentation transcript:

Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
Michael Bedford Taylor, Walter Lee, Saman Amarasinghe, Anant Agarwal
Presented by: Sarah Lynn Bird

Scalar Operand Networks
  “A set of mechanisms that joins the dynamic operands and operations of a program in space to enact the computation specified by a program graph”
  Physical interconnection network
  Operation-operand matching system

Example Scalar Operand Networks
  Register file
  Raw microprocessor

Design Issues
  Delay scalability
    Intra-component delay
    Inter-component delay
    Managing latency
  Bandwidth scalability
  Deadlock and starvation
  Efficient operation-operand matching
  Handling exceptional events

Operation-Operand Matching
  5-tuple of costs: <SO, SL, NHL, RL, RO>
  SO (send occupancy): the number of cycles that the ALU wastes in sending
  SL (send latency): the number of cycles of delay for the message on the send side of the network
  NHL (network hop latency): the number of cycles of delay per hop
  RL (receive latency): the number of cycles of delay between the arrival of the final input and the issue of the instruction that consumes it
  RO (receive occupancy): the number of cycles that an ALU wastes before employing a remote value
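The five costs can be read as terms in a simple per-operand cost. The sketch below is only an illustration under a stated assumption: the costs add up linearly and the hop latency is paid once per hop. The class name `SonCost` and its methods are hypothetical and do not come from the slides. The example tuple <0, 1, 1, 1, 0> is the n = 0 point of the parameterized tuples quoted on the occupancy slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SonCost:
    """Hypothetical container for the 5-tuple <SO, SL, NHL, RL, RO>, in cycles."""
    so: int   # send occupancy
    sl: int   # send latency
    nhl: int  # network hop latency (per hop)
    rl: int   # receive latency
    ro: int   # receive occupancy

    def operand_cost(self, hops: int) -> int:
        """Cycles to move one operand across `hops` hops.

        Assumption: the five costs are simply additive.
        """
        if hops < 0:
            raise ValueError("hops must be non-negative")
        return self.so + self.sl + self.nhl * hops + self.rl + self.ro

    def alu_cycles_lost(self) -> int:
        """Cycles in which an ALU is busy with communication instead of computing."""
        return self.so + self.ro


baseline = SonCost(so=0, sl=1, nhl=1, rl=1, ro=0)
print(baseline.operand_cost(hops=3))   # 0 + 1 + 1*3 + 1 + 0 = 5
print(baseline.alu_cycles_lost())      # 0
```

Separating `alu_cycles_lost` from `operand_cost` mirrors the distinction the definitions draw: occupancy consumes ALU cycles, while latency only delays the operand.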

Raw Design
  16 cores on a chip
  Each core
    8-stage in-order single-issue pipeline
    4-stage pipelined FPU
    32KB data cache
    32KB instruction cache
  2 static networks
    Instructions from a 64KB cache
    Point-to-point for operand transport
  2 dynamic networks
    Memory traffic, interrupts, user-level messages

Experiments
  Beetle: a cycle-accurate simulator
    Actual scalar operand network
    Parameterized scalar operand network without contention
    Data cache misses modeled correctly
    Assume no instruction cache misses
  Memory model
    Compiler maps memory to tiles
    Each location has one home site
  Benchmarks
    From SPEC92, SPEC95, and the Raw benchmark suite
    Dense matrix codes, 1 Secure Hash Algorithm

Benchmark Scaling
  Benchmark speedups on many tiles relative to the speed of the benchmark on one tile

  Benchmark      Tiles: 2       4       8      16      32      64
  cholesky          1.622   3.234   5.995   9.185  11.898  12.934
  vpenta            1.714   3.112   6.093  12.132  24.172  44.872
  mxm               1.933   3.731   6.207   8.900  14.836  20.472
  fpppp-kernel      1.511   3.336   5.724   6.143   5.988   6.536
  sha               1.123   1.955   1.976   2.321   2.536   2.523
  swim              1.601   2.624   4.691   8.301  17.090  28.889
  jacobi            1.430   2.757   4.953   9.304  15.881  22.756
  life              1.807   3.365   6.436  12.049  21.081  36.095
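One way to read the scaling numbers is as parallel efficiency, that is, speedup divided by tile count. The sketch below does this arithmetic for four of the rows, copied from the table. The names `TILES`, `SPEEDUPS`, and `efficiency` are illustrative and not part of the presentation.

```python
# Tile counts and a subset of the speedup rows, copied from the table above.
TILES = [2, 4, 8, 16, 32, 64]
SPEEDUPS = {
    "cholesky": [1.622, 3.234, 5.995, 9.185, 11.898, 12.934],
    "vpenta":   [1.714, 3.112, 6.093, 12.132, 24.172, 44.872],
    "sha":      [1.123, 1.955, 1.976, 2.321, 2.536, 2.523],
    "life":     [1.807, 3.365, 6.436, 12.049, 21.081, 36.095],
}


def efficiency(speedups, tiles=TILES):
    """Parallel efficiency per tile count: speedup / number of tiles."""
    return [s / t for s, t in zip(speedups, tiles)]


for name, row in SPEEDUPS.items():
    cells = " ".join(f"{e:5.2f}" for e in efficiency(row))
    print(f"{name:10s} {cells}")
```

By this measure vpenta retains about 70% efficiency at 64 tiles, while sha falls to about 4%.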

Effect of Send & Receive Occupancy
  64 tiles
  Parameterized network without contention
  <n, 1, 1, 1, 0> and <0, 1, 1, 1, n>

Effect of Send or Receive Latencies
  Applications with coarser-grain parallelism are less sensitive to send/receive latencies
  Overall, applications are less sensitive to send/receive latencies than to send/receive occupancies

Other Experiments
  Removing contention
  Increasing hop latency
  Comparing with other networks

Conclusions
  Many difficult issues with designing scalar operand networks
  Send and receive occupancies have the biggest impact on performance
  Network contention, multicast, and send/receive latencies have a smaller impact