Climate Machine Update David Donofrio RAMP Retreat 8/20/2008.

Presentation transcript:

Climate Machine Update David Donofrio RAMP Retreat 8/20/2008

Agenda
Project Overview
Tensilica Architecture and Design Flow
Tensilica Tools Demo
Why we need RAMP
Current Progress
Next Steps

A New Approach to HPC
Current HPC design approach:
– Leverage commodity processors from Intel, AMD, etc.
– Once the machine is built, optimize problems to run on it
– Power wall prevents scaling to exaflop performance
– Power is the new design point
Moore's Law is still in effect, but the number of processors doubles every 18 months rather than the clock rate
Olukotun and Sutter

A New Approach to HPC
Our approach:
– Identify the application, then tailor the machine using semi-custom design
– Optimize the CPU architecture and further extend it with a semi-custom ISA
– Leverage auto-tuning to access architecture-specific optimizations
– Even if each simple core is 1/4 as computationally efficient as a complex core, you can fit hundreds on a single die and be 100x more power efficient
Learn from the embedded market, where Flops / Watt and rapid design cycles are crucial
– Start with building blocks from embedded designs rather than a full-custom ASIC
– Preserve the ability to run general-purpose C code
Application target: 1 km scale climate model
Tailor machine architecture to the application to reduce waste
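The "1/4 as efficient, 100x more power efficient" point can be checked with a little arithmetic. The sketch below takes one reading of the slide (performance per watt, compared core against core) and asks how small a simple core's power draw must be for the claim to hold. The function names are illustrative, and no per-core power figures appear in the slides.

```python
from fractions import Fraction

def perf_per_watt_gain(relative_performance, relative_power):
    """Performance-per-watt gain of a simple core over a complex core.

    Both arguments are ratios (simple core / complex core).
    """
    return Fraction(relative_performance) / Fraction(relative_power)

def power_budget_for_gain(relative_performance, target_gain):
    """Largest relative power draw that still reaches target_gain."""
    return Fraction(relative_performance) / Fraction(target_gain)

# From the slide: a simple core delivers 1/4 of a complex core's
# performance, and the target is a 100x gain in power efficiency.
simple_core_perf = Fraction(1, 4)
budget = power_budget_for_gain(simple_core_perf, 100)

print(budget)                                        # 1/400
print(perf_per_watt_gain(simple_core_perf, budget))  # 100
```

Under this reading, each simple core would need to draw no more than 1/400 of the complex core's power.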

Climate Model Resource Requirements
DOE has identified high-resolution climate modeling as a leading justification for exascale computing
Must express 20M-way parallelism
Requires performance of 200 Pflops peak
Simulation must run 1000x faster than real time
Amenable to massively concurrent architectures composed of power-efficient embedded cores
Actively working with the climate science community to enable a new icosahedral model
Randall / CSU NASA
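A rough sketch of what these requirements imply per thread, assuming the 200 Pflops peak is spread evenly over the 20M-way parallelism (the slides do not state this). The 100-year run length is a hypothetical example, not a figure from the talk.

```python
PEAK_FLOPS = 200 * 10**15      # 200 Pflops peak, from the slide
PARALLEL_THREADS = 20 * 10**6  # 20M-way parallelism, from the slide
REALTIME_SPEEDUP = 1000        # simulation runs 1000x faster than real time

def flops_per_thread(peak_flops, threads):
    """Peak throughput each thread must supply if work is split evenly."""
    return peak_flops / threads

def wall_clock_days(simulated_years, speedup, days_per_year=365):
    """Wall-clock days needed to simulate the given number of years."""
    return simulated_years * days_per_year / speedup

print(flops_per_thread(PEAK_FLOPS, PARALLEL_THREADS))  # 10000000000.0 (10 Gflops)
# Hypothetical run length: one simulated century.
print(wall_clock_days(100, REALTIME_SPEEDUP))          # 36.5
```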

Tensilica Processor Design Flow
Complete solution: hardware, software and verification
Fully customizable
– Required base ISA ensures general-purpose applications still run
Processor configuration is submitted to Tensilica's servers, where synthesis is performed
– Returned design can be spun for ASIC or FPGA
– Bit file available for Avnet boards
Building-block approach drastically reduces design cycle time compared to full-custom design
Tensilica Inc.

Tensilica Architecture Features
Verilog-like TIE language allows for custom ISA extensions
– Functional and performance verification built in
– Auto-generated compiler intrinsics
– 64-bit IEEE-DP floating point coded up in TIE and available
Custom VLIW support
Inter-processor communication easily enabled through:
– TIE Ports
– TIE Queues
  Access to direct HW support for interprocessor communication
– TIE Lookups
  Allows interface to external ROMs or other RTL blocks
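To make the queue-based communication idea concrete, here is a minimal software model of a bounded FIFO between a producer core and a consumer core. It is not TIE syntax and does not reflect Tensilica's actual interface. The class, the cycle-stepped loop, and the consumer_period parameter are all assumptions made for illustration.

```python
from collections import deque

class BoundedFifo:
    """Fixed-depth FIFO, standing in for a hardware queue between two cores."""

    def __init__(self, depth):
        self.depth = depth
        self.slots = deque()

    def can_push(self):
        return len(self.slots) < self.depth

    def can_pop(self):
        return len(self.slots) > 0

    def push(self, value):
        self.slots.append(value)

    def pop(self):
        return self.slots.popleft()

def simulate(values, depth, consumer_period):
    """Step both cores cycle by cycle.

    The consumer tries to pop once every consumer_period cycles; the
    producer tries to push every cycle and stalls while the FIFO is full.
    Returns (values received in order, number of producer stall cycles).
    """
    fifo = BoundedFifo(depth)
    pending = deque(values)
    received = []
    stalls = 0
    cycle = 0
    while pending or fifo.can_pop():
        if cycle % consumer_period == 0 and fifo.can_pop():
            received.append(fifo.pop())
        if pending:
            if fifo.can_push():
                fifo.push(pending.popleft())
            else:
                stalls += 1
        cycle += 1
    return received, stalls

print(simulate([1, 2, 3, 4], depth=2, consumer_period=3))  # ([1, 2, 3, 4], 3)
print(simulate([1, 2, 3, 4], depth=2, consumer_period=1))  # ([1, 2, 3, 4], 0)
```

The model shows the two properties a hardware queue gives you: ordering is preserved, and a slow consumer back-pressures the producer instead of dropping data.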

Tensilica Architecture Overview Tensilica Inc.

Tensilica Performance Debug
Processor viewed as a black box
State can be compressed (via HW) and pushed out the JTAG port
– Intended for program replay
Xtensa trace port gives real-time visibility into internal pipeline state with unprecedented detail
– Cache hit / miss with virtual address
– Branch taken / not taken
– Call / return
– Resource dependency
– Etc.
Opportunity for hundreds of performance counters to be made available
Tensilica Inc.
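As a sketch of how trace events like those listed above could feed performance counters, the snippet below tallies a stream of event names and derives a cache miss rate. The event names and the list-of-strings trace format are hypothetical, since the slides do not describe the trace port's actual encoding.

```python
from collections import Counter

def tally_events(trace):
    """Count how often each event type appears in the trace."""
    return Counter(trace)

def cache_miss_rate(counters):
    """Fraction of cache accesses that missed (0.0 if there were none)."""
    accesses = counters["cache_hit"] + counters["cache_miss"]
    return counters["cache_miss"] / accesses if accesses else 0.0

# Hypothetical event stream; names are illustrative only.
trace = [
    "cache_hit", "cache_hit", "cache_miss", "branch_taken",
    "branch_not_taken", "cache_hit", "call", "return",
]

counters = tally_events(trace)
print(counters["cache_hit"])      # 3
print(cache_miss_rate(counters))  # 0.25
```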

Tensilica Tools Demo

Why we need RAMP
Fast, accurate emulation enables:
– Dual nested loop of HW / SW co-design
  Preliminary work using Stanford SM sim shows significant improvement in power efficiency using automated HW/SW co-tuning
RAMP is critical to accelerate:
– Rapid prototyping and analysis of Tensilica architectural options
– Inter-processor communication architecture exploration
– Running FULL climate code, providing a more complete performance picture
  Cycle-accurate simulator currently running at ~100 kHz vs. 50 MHz on V5
– Extensive HW performance counter data enables an emulation environment with similar resolution but much greater speed
Tensilica-provided emulation environment kick-starts this effort
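The gap between the ~100 kHz cycle-accurate simulator and the 50 MHz FPGA emulation can be put in wall-clock terms. The clock rates below come from the slide. The trillion-cycle workload is a hypothetical figure chosen only to show the scale.

```python
SOFTWARE_SIM_HZ = 100_000      # ~100 kHz cycle-accurate simulator
FPGA_HZ = 50_000_000           # 50 MHz on the Virtex-5

def emulation_speedup(fpga_hz, sim_hz):
    """How many times faster the FPGA advances target cycles."""
    return fpga_hz / sim_hz

def seconds_to_run(target_cycles, hz):
    """Wall-clock seconds to advance the target by target_cycles."""
    return target_cycles / hz

# Hypothetical workload size: 10**12 target cycles.
workload_cycles = 10**12

print(emulation_speedup(FPGA_HZ, SOFTWARE_SIM_HZ))       # 500.0
print(seconds_to_run(workload_cycles, SOFTWARE_SIM_HZ))  # 10000000.0 (about 116 days)
print(seconds_to_run(workload_cycles, FPGA_HZ))          # 20000.0 (about 5.6 hours)
```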

Current Status
ML505 used for initial design exploration
– Basic Xtensa processor + JTAG and memory controller is ~50% of a Virtex-5 50T
– Runs at 50 MHz
  ASIC in 65G process runs at 650 MHz
On-chip debug working
Can load / run programs using main memory synthesized from BRAM
DRAM interface coded, currently being debugged
RTL license recently obtained; full simulation environment (in ModelSim) being brought up

Next Steps…
Transition to BEE3 from ML505
Bring up XTOS environment on a single Xtensa processor on BEE3
Run a single column of climate code on a single processor
– Demo at SC'08 in November
– Continue HW / SW co-tuning optimization
Begin multi-processor emulation
– Emulation of a single socket, 32 cores, using networked BEE3s
– Running the full 2-million-line climate model

Backup

The Need for Exascale Computing
DOE has identified high-resolution climate modeling as a leading justification for exascale computing
– 1 km resolution targeted for an accurate cloud-resolving model
Difficult to scale existing systems
– HPC design using commodity processors estimated to draw 179 MW
– BlueGene design estimated to draw 20 MW
– Leveraging embedded cores and a more application-specific design, a power envelope of 3-5 MW is projected
Icosahedral
LBNL will seek an external vendor to build the machine if our approach is proven valid. LBNL is not entering the commercial HPC market.
Randall / CSU
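The three power estimates can be compared directly. This sketch only divides the figures quoted on the slide, and assumes all three designs target the same workload. The function name is illustrative.

```python
COMMODITY_MW = 179          # commodity-processor design estimate
BLUEGENE_MW = 20            # BlueGene design estimate
EMBEDDED_MW_RANGE = (3, 5)  # projected envelope for the embedded-core design

def power_reduction_range(baseline_mw, target_range_mw):
    """Return (smallest, largest) reduction factor versus the baseline."""
    low_mw, high_mw = target_range_mw
    return baseline_mw / high_mw, baseline_mw / low_mw

print(power_reduction_range(COMMODITY_MW, EMBEDDED_MW_RANGE))  # roughly 36x to 60x
print(power_reduction_range(BLUEGENE_MW, EMBEDDED_MW_RANGE))   # roughly 4x to 6.7x
```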