The Design and Application of Berkeley Emulation Engines
John Wawrzynek, Bob Brodersen, Chen Chang
University of California, Berkeley
Berkeley Wireless Research Center
FDIS, July 20, 2005

Berkeley Emulation Engine (BEE), 2002
FPGA-based system for real-time hardware emulation:
- Emulation speeds up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents (although not a logic-gate emulator), corresponding to 600 Gops (16-bit adds)
- 2400 external parallel I/Os providing 192 Gbps raw bandwidth
- 20 Xilinx Virtex-E 2000 chips, 16 1 MB ZBT SRAM chips
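As a sanity check on these headline numbers, a minimal sketch of the implied per-cycle parallelism and per-pin data rate (derived only from the figures on this slide; the per-pin toggle rate is not stated explicitly):

```python
# Back-of-the-envelope check of the BEE headline numbers.
emu_clock_hz = 60e6          # emulation speed from the slide
gops_16bit_adds = 600e9      # 600 Gops of 16-bit adds
io_pins = 2400               # external parallel I/Os
raw_bandwidth_bps = 192e9    # 192 Gbps raw I/O bandwidth

# Parallel 16-bit adders implied if all ops complete in one 60 MHz cycle
adders_per_cycle = gops_16bit_adds / emu_clock_hz
print(f"~{adders_per_cycle:,.0f} 16-bit adds per emulation cycle")   # ~10,000

# Per-pin data rate implied by the raw I/O bandwidth
per_pin_mbps = raw_bandwidth_bps / io_pins / 1e6
print(f"~{per_pin_mbps:.0f} Mbps per I/O pin")                       # ~80 Mbps
```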

Real-Time Processing Allows In-System Emulation
[Demo figure: a BEE emulating the link between a transmitter and a receiver; labels include Frame O.K., Data Match, Data Out, the receiver output on a SCSI connector, and the transmitter output spectrum.]

Matlab/Simulink Programming Tools: Discrete-Time Block Diagrams with FSMs
- Tool flow developed by MathWorks, Xilinx, and UCB.
- User specifies the design as block diagrams (for datapaths) and finite state machines (for control).
- Tools automatically map to both FPGA and ASIC implementations.
- User-assisted partitioning with automatic system-level routing.
[Tool-flow figure: control (StateFlow, Matlab) and datapath block diagrams, plus user macros (HDL, CoreGen, Module Compiler black boxes), combine in a Matlab/Simulink model used for functional simulation and hardware emulation; diagram signal labels include DI, DO, A, R/W, S1, S2.]
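For readers unfamiliar with this programming model, here is a minimal software analogue of the "datapath block + FSM controller" split, written in plain Python. It is a hypothetical illustration only (the example design, a gated accumulator, and all names are invented; this is not the Simulink/System Generator tool flow itself):

```python
# Hypothetical illustration of the "datapath + FSM controller" split, not the real tool flow.

def fsm_next(state, start, done):
    """Control: a tiny finite state machine (IDLE -> RUN -> IDLE)."""
    if state == "IDLE" and start:
        return "RUN"
    if state == "RUN" and done:
        return "IDLE"
    return state

def datapath_step(acc, sample, enable, clear):
    """Datapath: a gated accumulator, the kind of block drawn in Simulink."""
    if clear:
        return 0
    return acc + sample if enable else acc

# Discrete-time simulation: one loop iteration per clock cycle.
samples = [3, 1, 4, 1, 5, 9, 2, 6]
state, acc = "IDLE", 0
for cycle, sample in enumerate(samples):
    start = (cycle == 1)                 # start accumulating on cycle 1
    done = (cycle == len(samples) - 2)   # stop near the end of the stream
    acc = datapath_step(acc, sample, enable=(state == "RUN"), clear=start)
    state = fsm_next(state, start, done)
print("accumulated:", acc)
```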

BEE Status
- Four BEE processing units built; three in near-continuous "production" use.
- Other supported universities: CMU, USC, Tampere, UMass, Stanford.
- Successful tapeouts: 3.2M-transistor PicoRadio chip, 1.8M-transistor LDPC decoder chip.
- Systems emulated: QPSK radio transceiver, BCJR decoder, MPEG IDCT.
- Ongoing projects: UWB mixed-signal SoC, MPEG/PRISM transcoder, PicoRadio multi-node system, Infineon SIMD processor for SDR.

Lessons from BEE
1. Real-time performance vastly eases the debugging/verification/tuning process.
2. The Simulink-based tool flow is a very effective FPGA programming model in the DSP domain.
3. System emulation tasks are significant computations in their own right; high-performance emulation hardware makes for high-performance general computing.
Is this the right way to build high-end (super)computers? BEE could be scaled up with the latest FPGAs and multiple boards → BEE2 (and beyond).

BEE2 Hardware
1. Modular design, scalable from a few to hundreds of FPGAs.
2. High memory capacity and bandwidth to support general computing applications.
3. High-bandwidth, low-latency inter-module communication to support massive parallelism.
4. All off-the-shelf components; no custom chips.
Thanks to Xilinx for engineering assistance, FPGAs, and interaction on application development.

Basic Computing Element
Single Xilinx Virtex-II Pro 70 FPGA (130 nm technology):
- ~70K logic cells
- 1704-pin package with 996 user I/O pins
- 2 PowerPC 405 cores
- 326 dedicated 18-bit multipliers
- 5.8 Mbit on-chip SRAM
- 20 duplex multi-gigabit serial communication links (MGTs)
- 4 physical DDR2-400 banks; per FPGA, up to 12.8 GByte/s memory bandwidth and a maximum of 8 GByte capacity
Virtex-4 (90 nm) is out now with 2x capacity and 2x frequency; Virtex-5 (65 nm) is due next spring.
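The quoted memory bandwidth follows from the four DDR2 banks; a minimal check (assuming a 64-bit-wide data path per bank, which the slide does not state):

```python
# Per-FPGA DDR2 bandwidth check (assumes 64-bit / 8-byte wide banks).
banks = 4
transfers_per_sec = 400e6      # DDR2-400: 400 MT/s per bank
bytes_per_transfer = 8         # assumed 64-bit data path per bank
bandwidth = banks * transfers_per_sec * bytes_per_transfer
print(f"{bandwidth / 1e9:.1f} GByte/s")   # 12.8 GByte/s, matching the slide
```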

Compute Module Diagram
[Block diagram of a BEE2 compute module; off-module links run over 10GigE or Infiniband.]

Compute Module
- 14 x 17 inch, 22-layer PCB
- Module also includes I/O for administration and maintenance: 10/100 Ethernet, HDMI/DVI, USB
- Completed 12/04

Inter-Module Connections
[Diagram: modules connected in a global communication tree carrying stream packets, with separate paths for admin, UI, and NFS traffic.]

Alternative Topology: 3D Mesh or Torus
- The 4 compute FPGAs can be used to extend the system to a 3D mesh/torus.
- 6 directional links per node: 4 off-board MGT links, 2 on-board LVCMOS links.
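To make the six-link requirement concrete, here is a small sketch of how a node in a 3D torus finds its six neighbors, one per direction. The dimensions, coordinates, and the mapping of directions to link types are purely illustrative; the actual BEE2 wiring is not described here:

```python
# Hypothetical 3D-torus neighbor computation; dimensions and addressing are illustrative only.
def torus_neighbors(x, y, z, dims=(4, 4, 4)):
    """Return the six neighbor coordinates (+/-X, +/-Y, +/-Z) with wraparound."""
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z), ((x - 1) % dx, y, z),   # e.g. the 2 on-board LVCMOS links
        (x, (y + 1) % dy, z), (x, (y - 1) % dy, z),   # the remaining 4 directions would
        (x, y, (z + 1) % dz), (x, y, (z - 1) % dz),   # use the off-board MGT links
    ]

print(torus_neighbors(0, 0, 0))   # six links per node, matching the slide
```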

Rack Cabin Capacity
- 40 compute modules in 5 chassis (8U each) per rack
- ~40 TeraOPS, ~1.5 TeraFLOPS
- 150 Watt AC/DC power supply to each blade
- ~6 kW power consumption
- Hardware cost: ~$500K
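The per-rack figures are consistent with the per-module numbers; a minimal check using only the values on this slide:

```python
# Rack-level arithmetic implied by the slide's figures.
modules_per_rack = 40
watts_per_module = 150
rack_power_kw = modules_per_rack * watts_per_module / 1000
print(f"~{rack_power_kw:.0f} kW per rack")              # ~6 kW, as stated

rack_teraops = 40
per_module_teraops = rack_teraops / modules_per_rack
print(f"~{per_module_teraops:.0f} TeraOPS per module")  # ~1 TeraOPS of fixed-point peak
```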

Why Are These Systems Interesting?
1. Best solution in several domains:
   a) Emulation for custom chip design
   b) Extreme real-time signal processing tasks
   c) Scientific computing and supercomputing
2. Good model for how to build future chips and systems:
   a) Massively parallel
   b) Fine-grained reconfigurability enables robust performance/power efficiency on a wide range of problems, and manufacturing defect tolerance.

Moore's Law in the FPGA World
- 100X higher performance and 100X more efficient than microprocessors
- FPGA performance doubles every 12 months

Extreme Digital Signal Processing
- Massive arithmetic-operations-per-second requirement
- "Stream-based" computation model with a real-time requirement
- High-bandwidth data I/O
- Low numerical precision requirements: mostly fixed-point operations, rarely floating point
- Dominated by data-flow processing with few control branch points
BEE2 is a promising computing platform for the Allen Telescope Array (ATA, 350 antennas) and the proposed Square Kilometer Array (SKA, 1K antennas): the SETI spectrometer and image formation for radio astronomy research.
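As an illustration of the "mostly fixed-point" point, here is a minimal sketch of the kind of multiply-accumulate that dominates these streaming kernels. The Q1.15 format and all names are hypothetical; this is not code from the BEE2 tool flow:

```python
# Hypothetical Q1.15 fixed-point multiply-accumulate, the staple operation of
# streaming DSP kernels; not taken from the BEE2 libraries.
FRAC_BITS = 15

def to_q15(x: float) -> int:
    """Quantize a value in [-1, 1) to a 16-bit signed fixed-point integer."""
    return max(-32768, min(32767, int(round(x * (1 << FRAC_BITS)))))

def q15_mac(acc: int, a: int, b: int) -> int:
    """Multiply two Q1.15 samples and accumulate into a wider register."""
    return acc + (a * b)          # product is Q2.30; accumulate without rescaling

samples = [to_q15(v) for v in (0.5, -0.25, 0.125)]
coeffs  = [to_q15(v) for v in (0.9,  0.1,  -0.3)]
acc = 0
for s, c in zip(samples, coeffs):
    acc = q15_mac(acc, s, c)
print("result:", acc / (1 << (2 * FRAC_BITS)))   # convert back to float for inspection
```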

SETI Spectrometer
Target: 0.7 Hz channels over 800 MHz → a ~1-billion-channel real-time spectrometer.
Result: one BEE2 module meets the target and yields 333 GOPS (16-bit multiplies, 32-bit adds) at 150 Watts (similar to a desktop computer).
- >100x the peak integer throughput of a current Pentium 4 system
- >100x better throughput per unit energy
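A quick check of the channel count, assuming one channel per 0.7 Hz of analyzed bandwidth as stated:

```python
# Channel-count check for the SETI spectrometer target.
bandwidth_hz = 800e6       # 800 MHz of analyzed bandwidth
resolution_hz = 0.7        # 0.7 Hz per channel
channels = bandwidth_hz / resolution_hz
print(f"~{channels / 1e9:.2f} billion channels")   # ~1.14 billion, i.e. "1 billion" on the slide
```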

FPGA versus DSP Chips
- Benchmarks: spectrometer and polyphase filter bank (PFB) with 18-bit multiplies; correlator with 4-bit multiplies and 32-bit accumulates.
- Cost based on street price; peak numbers assumed for DSPs, mapped designs (automatic Simulink tools) for FPGAs.
- TI DSPs: C6415-7E, 130 nm (720 MHz); C6415T-1G, 90 nm (1 GHz). FPGAs: 130 nm parts.
- Metrics include chips only (not the system); FPGAs provide extra benefit at the PC-board level.
[Charts compare energy efficiency, performance, and cost-performance.]

Active Application Areas
- High-performance DSP: SETI spectroscopy, ATA/SKA image formation
- Scientific computation and simulation: E&M simulation for antenna design
- Communication systems development platform: algorithms for SDR and cognitive radio, large wireless ad-hoc sensor networks, in-the-loop emulation of SoCs and reconfigurable architectures
- Bioinformatics: BLAST (Basic Local Alignment Search Tool) biosequence alignment
- System design acceleration: full-chip transistor-level circuit simulation (Xilinx), RAMP (Research Accelerator for Multiple Processors)

Opportunity for a New Research Platform: RAMP (Research Accelerator for Multiple Processors)
Krste Asanovic (MIT), Christos Kozyrakis (Stanford), Dave Patterson (UCB), Jan Rabaey (UCB), John Wawrzynek (UCB)
July 2005

Change in the Computer Landscape
- Old conventional wisdom: uniprocessor performance 2X every 1.5 years ("Moore's Law").
- New conventional wisdom: 2X CPUs per socket every ~2 years.
- Problem: compilers, operating systems, and architectures are not ready for 1000s of CPUs per chip, but that's where we're headed.
- How do we do research on 1000-CPU systems in compilers, OS, and architecture?

FPGA Boards as a New Research Platform
- Given that ~25 soft CPUs can fit in one FPGA, what if we made a 1000-CPU system from ~40 FPGAs? 64-bit simple RISC at 100 MHz.
- The research community does the logic design ("gate shareware") to create an out-of-the-box massively parallel processor that runs standard binaries of OS and applications: processors, caches, coherency, switches, Ethernet interfaces, …
- Recreate the synergy of the old VAX + BSD Unix?

Why Is RAMP Attractive? Priorities for Research Parallel Computers
1a. Cost of purchase
1b. Cost of ownership (staff to administer it)
1c. Scalability (1000 CPUs much better than 100)
4. Observability (measure, trace everything)
5. Reproducibility (to debug, run experiments)
6. Community synergy (share code, …)
7. Flexibility (change for different experiments)
8. Performance

Why Is RAMP Attractive? Grading SMP vs. Cluster vs. RAMP

                                       SMP             Cluster          RAMP
Cost of purchase (1 CPU, 1 GB DRAM)*   D ($40k, $4k)   B ($2k, $0.4k)   A+ ($0.1k, $0.2k)
Cost of ownership                      A               D                B
Scalability                            C               A                A
Observability                          D               C                A+
Reproducibility                        B               D                A+
Community                              D               A                A
Flexibility                            D               C                A+
Performance (clock)                    A (2 GHz)       A (3 GHz)        D (0.2 GHz)

* Costs from the TPC-C benchmark: IBM eServer p5 595, IBM eServer x346 / Apple Xserve, BWRC BEE2

Internet in a Box?
Could RAMP radically change research in distributed computing? (Armando Fox, Ion Stoica, Scott Shenker)
Existing distributed environments (like PlanetLab) are very hard to use for development:
- The computers are live on the Internet and subject to all kinds of problems (security, ...), and there is no reproducibility.
- You cannot reserve the whole thing for yourself and change the OS or routing or ....
- Very expensive to support, which is why the biggest ones are on the order of 200 to 300 nodes, and there are lots of restrictions on using them.

Internet in a Box?
- RAMP promises a private "internet in a box" for $50k to $100k: a collection of 1000 computers running independent OSes that could take real checkpoints and have reproducible behavior.
- We can set parameters for network delays, bandwidth, number of disks, disk latency and bandwidth, ...
- Every board could run synchronously to the same clock cycle, so that we could take a checkpoint at clock cycle 4,000,000,000, reload later from that point, and cause the network interrupt to occur at exactly clock cycle 4,000,000,100 for CPU 104 every single time.
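To illustrate the determinism being described, here is a minimal sketch of cycle-exact checkpoint and replay in plain Python. The interrupt name, CPU number, and cycle counts follow the slide's example (scaled down so the toy loop finishes quickly); everything else is hypothetical and is not RAMP code:

```python
# Hypothetical illustration of cycle-deterministic checkpoint/replay, not RAMP code.
# The slide's real example: checkpoint at cycle 4,000,000,000, interrupt at cycle
# 4,000,000,100 on CPU 104; scaled down here for a quick run.
import copy

class Machine:
    """A toy 'cluster' whose only per-CPU state is a work counter."""
    def __init__(self, n_cpus):
        self.cycle = 0
        self.cpu_state = [0] * n_cpus

    def step(self, events):
        self.cycle += 1
        for i in range(len(self.cpu_state)):
            self.cpu_state[i] += 1                      # stand-in for one cycle of work
        if self.cycle in events:                        # deliver events at exact cycles
            cpu, name = events[self.cycle]
            print(f"cycle {self.cycle}: {name} -> CPU {cpu}")

def run(machine, until_cycle, events):
    while machine.cycle < until_cycle:
        machine.step(events)

events = {1_100: (104, "network interrupt")}            # scaled-down stand-in for 4,000,000,100
m = Machine(n_cpus=1000)
run(m, 1_000, events)                                   # reach the checkpoint cycle
checkpoint = copy.deepcopy(m)                           # snapshot the entire machine state
run(m, 1_200, events)                                   # interrupt fires at cycle 1,100
m = copy.deepcopy(checkpoint)                           # reload the checkpoint...
run(m, 1_200, events)                                   # ...and it fires at cycle 1,100 again
```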