Baring It All to Software: Raw Machines E. Waingold, M. Taylor, D. Srikrishna, V. Sarkar, W. Lee, V. Lee, J. Kim, M. Frank, P. Finch, R. Barua, J. Babb, S. Amarasinghe, A. Agarwal (Presented by Linda Deng)

Hitting a wall
Already in 1997?
As # of transistors increases, so does wire delay
New complex hardware → verification costs
Emerging stream-based multimedia

The radical Raw idea
Lots of simple interconnected tiles
Each tile contains:
  – Instruction/data memories
  – ALU
  – Registers
  – Configurable logic
  – Programmable switch for routing
Complex operations synthesized into HW

A Raw processor

The programmer’s job
Software deals with wire delay
Wire delay = hops in mesh network
One cycle to move from a tile to its neighbor
Compiler knows # of cycles needed to move
  – Statically schedules operations
Register renaming, instruction scheduling, dependency checking…
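The one-cycle-per-hop rule above can be made concrete with a small sketch. It is not the Raw compiler's actual scheduler: the function names, the unit operation latency, and the choice to ignore switch contention are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Tile = Tuple[int, int]  # (row, column) position in the mesh


def hop_latency(src: Tile, dst: Tile) -> int:
    """Cycles to move a value between tiles, at one cycle per mesh hop."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def static_schedule(
    ops: List[Tuple[str, List[str]]],
    placement: Dict[str, Tile],
    op_latency: int = 1,  # assumed: every operation takes one cycle
) -> Dict[str, int]:
    """Assign a start cycle to each operation at compile time.

    `ops` lists (name, dependencies) in topological order. An operation
    starts once all its inputs have arrived and its tile is free.
    Contention inside the switches is ignored in this sketch.
    """
    start: Dict[str, int] = {}
    tile_free: Dict[Tile, int] = {}
    for name, deps in ops:
        tile = placement[name]
        ready = 0
        for dep in deps:
            arrival = start[dep] + op_latency + hop_latency(placement[dep], tile)
            ready = max(ready, arrival)
        begin = max(ready, tile_free.get(tile, 0))
        start[name] = begin
        tile_free[tile] = begin + op_latency
    return start


ops = [("a", []), ("b", []), ("c", ["a", "b"])]
placement = {"a": (0, 0), "b": (0, 1), "c": (1, 1)}
schedule = static_schedule(ops, placement)
print(schedule)
```

Because every hop count is known before the program runs, the start cycle of `c` is fixed at compile time and no hardware dependency check is needed for it.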

What’s the big deal?
Distributed registers
  – Bigger register namespace → higher ILP
Distributed static RAM
  – Shorter memory latency
No specialized logic structures in HW
  – Smaller tiles → more tiles → greater parallelism
  – More chip area for memory/logic
  – Faster clock
  – Less complexity → easier verification

The hard-working compiler
Parallelism vs. communication/synchronization?
  – But the latter’s overhead is low
  – So partitioning can be fine-grained
Tile placement to minimize communication latency and bandwidth use
Programs for tiles/switches (scheduling/routing)
Logic synthesis tool for configurable logic
  – Pattern-matching algorithms to find candidate instructions
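As a rough illustration of the placement step, the sketch below scores a placement by traffic-weighted hop count and searches a tiny mesh exhaustively. The cost function, the names, and the brute-force search are assumptions for illustration only; a real compiler would need a heuristic, since exhaustive search does not scale beyond a handful of tiles.

```python
from itertools import permutations
from typing import Dict, List, Tuple

Tile = Tuple[int, int]
Edge = Tuple[str, str, int]  # (producer, consumer, words communicated)


def comm_cost(edges: List[Edge], placement: Dict[str, Tile]) -> int:
    """Total traffic weighted by Manhattan hop distance between tiles."""
    total = 0
    for u, v, words in edges:
        (r1, c1), (r2, c2) = placement[u], placement[v]
        total += words * (abs(r1 - r2) + abs(c1 - c2))
    return total


def best_placement(
    parts: List[str], tiles: List[Tile], edges: List[Edge]
) -> Tuple[int, Dict[str, Tile]]:
    """Try every assignment of partitions to tiles; keep the cheapest."""
    best_cost, best_map = None, None
    for perm in permutations(tiles, len(parts)):
        candidate = dict(zip(parts, perm))
        cost = comm_cost(edges, candidate)
        if best_cost is None or cost < best_cost:
            best_cost, best_map = cost, candidate
    return best_cost, best_map


parts = ["p0", "p1", "p2", "p3"]
tiles = [(0, 0), (0, 1), (1, 0), (1, 1)]  # a 2x2 mesh
edges = [("p0", "p1", 10), ("p1", "p2", 1), ("p2", "p3", 10), ("p0", "p3", 1)]

naive = {"p0": (0, 0), "p1": (1, 1), "p2": (0, 1), "p3": (1, 0)}
naive_cost = comm_cost(edges, naive)
best_cost, best_map = best_placement(parts, tiles, edges)
print(naive_cost, best_cost)
```

In this example the naive placement puts the two heavily communicating pairs on diagonally opposite tiles, while the best placement keeps every communicating pair on neighboring tiles.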

Some remaining dynamic events…
What happens when the compiler can’t resolve something statically?
Reserve bandwidth between potential communicators
Conservative estimates for dynamic routing
Assign dependency checking to tiles
Predict tile for offset, even though base is unknown
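One way to read the last point is sketched below, under an assumption the slides do not state: data is interleaved across tiles one word at a time, so word address `a` lives on tile `a mod N`. Under that assumption, the tile of `base + offset` relative to the tile of `base` depends only on the offset, so the compiler can tell whether two references hit the same tile without knowing the base. All names here are hypothetical.

```python
def home_tile(base: int, offset: int, num_tiles: int) -> int:
    """Tile holding word (base + offset), assuming word-level interleaving."""
    return (base + offset) % num_tiles


def tile_delta(offset: int, num_tiles: int) -> int:
    """Tile of (base + offset) relative to the tile of base.

    Computable at compile time because it does not depend on base.
    """
    return offset % num_tiles


def same_tile(offset_a: int, offset_b: int, num_tiles: int) -> bool:
    """True if two references off the same unknown base share a tile."""
    return (offset_a - offset_b) % num_tiles == 0


NUM_TILES = 4
# The relative prediction holds for every possible base address.
for base in range(16):
    for offset in range(12):
        actual = (home_tile(base, offset, NUM_TILES)
                  - home_tile(base, 0, NUM_TILES)) % NUM_TILES
        assert actual == tile_delta(offset, NUM_TILES)

print(same_tile(3, 11, NUM_TILES), same_tile(3, 4, NUM_TILES))
```

References that provably share a tile can be ordered locally on that tile; references that provably differ need no ordering between them at all.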

Prototype time: RawLogic
Implemented with FPGAs
Limited feature support
  – Static sequences converted into state machines
  – Hardwired into RawLogic
  – Inflexible, with amazingly long compilation times
Framework in C/Verilog for compilation
  – Produced binary code for state machines
But larger benchmarks were emulated
And Raw machine has faster clock than FPGA

The numbers

Looking ahead
“In 10 to 15 years, we believe that billion-transistor chip densities, faster switching speeds, and growing compiler sophistication will allow a Raw machine’s performance-to-cost ratio to surpass that of traditional architectures for future, general-purpose workloads.”
Agarwal’s Tilera started shipping 64-core TILE64 in 2007, working on 36- and 120-core?