RAMP Retreat Summer 2006 Breakout Session Leaders & Questions
Greg Gibeling, Derek Chiou, James Hoe, John Wawrzynek & Christos Kozyrakis
6/21/2006

Breakout Topics
- RDL & Design Infrastructure
- RAMP White
- Caches, Network & I/O (Uncore)
- RAMP2 Hardware (BEE3)
- OS, VM and Compiler (Software Stack)

RDL & Design Infrastructure
Leader/Reporter: Greg Gibeling
Topics:
- Features & schedule
- Proposals: multi-platform migration
- Languages: which languages, priorities, assignments for support
- Debugging: models & requirements
- Retargeting to ASICs (platform optimization)

RDL & DI Notes (1)
Languages
- Hardware: Verilog, BlueSpec (IBM uses VHDL)
- Software?
Multi-platform
- Integration of hardware simulations
- Control of multiplexing: needed for efficiency! Possible through channel & link parameters
Features
- Meta-types
- Component (and unit) libraries

RDL & DI Notes (2)
Debugging
- Split target model: the RDL target design is exposed to a second level of RDL, which allows statistics aggregation and modeling of noisy channels
- Integration with unit internals: event & state extraction, connection to processor debugging tools
- People clearly want this ASAP

RDL & DI Notes (3)
Debugging (integrated)
- Message tracing: causality diagrams, a framework to debug through units
- Checkpoints
- Injection
- Single stepping: may not be widely used, but cheap to implement
- Watch/breakpoints

RDL & DI Notes (4)
Why Java?
- Runs on various platforms (recompilation is generally pretty painful)
- Decent type system in Java 1.5
- Well suited to plugin infrastructure (e.g., OSGi); see the sketch below
When to use RDL
- Detailed timing model
- Great at abstracting inter-chip communication
- Good platform for partitioning designs: concise, logical specification
- Support for the debugging framework
- With standard interfaces, good for sharing
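A minimal sketch of what a plugin-style backend interface for an RDLC-like tool could look like in Java; the names, types, and signatures here are illustrative assumptions, not the actual RDLC API.

```java
// Hypothetical sketch of a plugin interface for platform backends in an
// RDLC-style tool; names and signatures are illustrative only.
import java.util.List;

interface PlatformBackend {
    /** Human-readable name of the target platform (e.g. "BEE2", "XUP", "software"). */
    String platformName();

    /** Generate platform-specific implementations for the given target units. */
    List<GeneratedFile> generate(List<TargetUnit> units, ChannelParameters params);
}

/** Placeholder types; a real tool would define these around its own IR. */
record TargetUnit(String name) {}
record ChannelParameters(int linkWidthBits, int latencyCycles) {}
record GeneratedFile(String path, String contents) {}
```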

RDL & DI Notes (5)
Basic infrastructure
- First system bringup: interfaces with workstations, initial board support
- Standard interfaces (RDL and otherwise): processor replacements
Board support
- Currently a heroic effort
- Solutions: standardized components? Generators?

RDL & DI Notes (6)
Timelines
- Greg's goals: 10/2006 should see RCF/RDLC3; 11/2006 should see documentation
- Debugging (integrated) should be ASAP
Manpower
- Board support
- First board bringup
- RDL & RDLC users: standard interfaces, features & documentation

RAMP White
Leader/Reporter: Derek Chiou
Topics:
- Two-day breakout; the first day should be pro/con
- Overall preliminary plan: evaluation; who is doing exactly what?
- ISA for RAMP White: OpenSPARC, 32-bit Leon, PowerPC 405, processor agnosticism
- Implementation: reimplementation will be required; test suites from companies are very useful

RAMP White Notes (1)
Use the embedded PowerPC core first
- Available
- Debugged
- Can run a full OS today
- FPGA chip space is already committed
PowerPC and SPARC are both candidates
- PowerPC pros: the embedded processor is PowerPC
- SPARC pros: 64-bit available today
Wait and see on a soft core for RAMP White

RAMP White Notes (2)
- >= 256 processors (can buy 64 processors today)
- Reasonable speed: 10s of MHz
- With 280K LUTs in a Virtex-5, assume 50% for processors but only 80% utilization for ease of place-and-route, giving roughly 100K LUTs for processors
- Need 4 processors per FPGA (16 per board, 16 boards), so about 25K LUTs per processor (see the worked budget below)
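A worked version of the LUT budget above, using only the numbers on this slide (280K LUTs, 50% area share for processors, 80% achievable utilization, 4 processors per FPGA, 16 boards):

\begin{align*}
\text{LUTs available for processors} &\approx 280\text{K} \times 0.5 \times 0.8 = 112\text{K} \approx 100\text{K} \\
\text{LUTs per processor} &\approx 100\text{K} / 4 = 25\text{K} \\
\text{Total processors} &= 16\ \text{per board} \times 16\ \text{boards} = 256
\end{align*}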

RAMP White Notes (3)
Embedded PowerPC core (it's there, and it offers better performance than any soft core)
- Soft L1 data cache (no L2)
- Hard L1 instruction cache
- Emulation????
Ring coherence (a la IBM); a simplified sketch follows this slide
Linux on top of the embedded PowerPC core
NFS mount for disk access
Mark's port of Peh's and Dally's router
To do:
- Ring coherence + L1 data cache + memory interface
- RDL for modules
- Software port
- Timing models for memory, ring, cache, processor?
- Integration
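A highly simplified illustration of the general idea behind ring snooping, where one read request circulates around the ring until an owner supplies the line; this is neither IBM's protocol nor the planned RAMP White design, just a sketch of the concept.

```java
// Simplified ring-snooping sketch: a read miss circulates the ring once;
// any node holding the line downgrades to SHARED and supplies the data.
enum State { INVALID, SHARED, MODIFIED }

class RingNode {
    final int id;
    final java.util.Map<Long, State> cache = new java.util.HashMap<>();
    RingNode next;                       // successor on the ring

    RingNode(int id) { this.id = id; }

    /** Returns true if some node on the ring supplied the line; false means
     *  the request came back to the requester and memory must supply it. */
    boolean snoopRead(long addr, int requesterId) {
        if (id != requesterId && cache.getOrDefault(addr, State.INVALID) != State.INVALID) {
            cache.put(addr, State.SHARED);   // owner downgrades and supplies data
            return true;
        }
        return id == requesterId ? false : next.snoopRead(addr, requesterId);
    }
}
```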

RAMP White Notes (4)
RAMP White Greek-letter versions
- Beta: a more general fabric using the same router; still use ring coherence
- Gamma: James Hoe's coherence engine
- Delta: soft-core integration

Caches, Networks & I/O (Uncore)
Leader/Reporter: James Hoe
Topics:
- CPU, cache and memories
- Hybrid FPGA cosimulation
- Network & storage
- Interfaces: especially components, not sub-frameworks
- Phased uncore capabilities

Uncore Notes (1)
A full system has more than just CPUs and memory; I/O is very important
Getting RAMP to "work"
- Just like the real thing (from SW and the OS's perspective)
- Software porting/development
- Performance studies
Someone has to build the "uncore"
- Co-simulation
- Direct HW support for paravirtualization / VMs

Uncore Notes (2)
Why make RAMP White generic?
- What is a more interesting target system?
- What is a more relevant target system?
- Building a system without an application in mind?
- Would anyone care about RAMP "vanilla"?

Uncore Notes (3)
Why insist on directory-based cache coherence for 1000 nodes?
- Today's large SMPs (at 100+ ways) are actually snoopy-based
- Plug in 8-core CMPs and that is a 1000-node snoopy system (which industry may be more interested in)

Uncore Notes (4)
Let's pin down a reference system architecture (including the uncore)
- Minimum modules required?
- Optional modules supported?
- Fix standard interfaces between modules
- RDL script for RAMP White??
Need more than a block diagram for RAMP White

Uncore Notes (5)
Requests and ideas for RDL
- Compensate for the skewed raw performance of components (for timing measurements): host I/O bandwidth is large relative to emulated CPU throughput, so we need knobs to dial in different rates for experiments (see the sketch below)
- Some form of HW/SW co-simulation
- Built-in performance monitoring
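An illustrative sketch (not RDL itself) of the kind of "rate knob" being requested: a channel wrapper that delivers at most one message every N model cycles, so a fast host link can be dialed down to a chosen target rate for an experiment.

```java
// Rate-limited channel sketch; the class and its parameters are illustrative.
import java.util.ArrayDeque;
import java.util.Queue;

class RateLimitedChannel<T> {
    private final Queue<T> pending = new ArrayDeque<>();
    private final int targetCyclesPerMsg;     // the experiment knob
    private long cyclesSinceLastDelivery = 0;

    RateLimitedChannel(int targetCyclesPerMsg) {
        this.targetCyclesPerMsg = targetCyclesPerMsg;
    }

    void send(T msg) { pending.add(msg); }

    /** Advance one target clock cycle; a message is delivered only when the
     *  modeled bandwidth allows it, otherwise null is returned. */
    T tick() {
        cyclesSinceLastDelivery++;
        if (!pending.isEmpty() && cyclesSinceLastDelivery >= targetCyclesPerMsg) {
            cyclesSinceLastDelivery = 0;
            return pending.poll();
        }
        return null;
    }
}
```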

Uncore Notes (6)
Sanity check
- 1000 processing nodes: no problem
- I/O: we can fake it somehow
- DRAM for 1000 processing nodes: not easy to cheat on this one

RAMP2 Hardware (BEE3)
Leader/Reporter: Dan Burke & John Wawrzynek
Topics:
- Follow-up to XUP: should RAMP embrace XUP at the low end? Inexpensive small systems
- Size & scaling of the new platform: more than 40 FPGAs?
- Technical questions: reconsider use of SRAM; DRAM capacity; presence of on-board hard CPUs; on-board interfaces (PCI Express)
- Project questions: timelines (definitely need one); packaging; pricing (especially FPGAs); design for the largest FPGA and change the part at solder time?; evaluation of Chen Chang's design

RAMP2 HW Notes (1)
Follow-up to XUP
- XUP has been useful to the project, particularly for early development efforts
- Xilinx will continue to design and support new XUP boards
- No V4 version is planned; a V5 version will be out Q2 next year
- For BEE3, can't really count on V5 FX in Q2 next year
- Perhaps use a separate (AMCC) PowerPC processor chip

RAMP2 HW Notes (2)
Size and scaling of the new platform
- Given the potential processor core density issue, we will need to plan on a system that can scale past 40 FPGAs
- Better compatibility with the new XUP is important, e.g., a common DRAM standard (better sharing of memory controllers); use the Cypress CY7300 for USB compatibility with the Xilinx core
- Our design and production of BEE3 is timed to the production of V5 parts; we need to better understand the RAMP team schedule for RAMP White
- Hope to be able to choose the package and have flexibility in part sizes and ideally part feature set
- How about a daughterboard for the FPGA (the DRC approach)?

RAMP2 HW Notes (3)
Technical questions
- Reconsider use of SRAM: the group thought SRAM is a bad idea. SRAM is faster, smaller (in capacity), and simpler to interface to; newer parts will make interfacing simpler, speed is not a big concern for RAMP, but the smaller capacity is a big concern
- 8 GB DDR2 DIMM modules are on the horizon; a target will be 1 GByte/processor (see the capacity estimate below)
- Presence of on-board hard CPUs: are hard cores in FPGAs useful (e.g., the PPC405 in V2Pro)? Would commodity chips on the PCB be useful (e.g., for management)?
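A rough capacity check combining the 1 GByte/processor target and 8 GB DIMMs above with the ~1000-node figure from the uncore notes; the per-FPGA line assumes the 4-processors-per-FPGA estimate from the RAMP White notes.

\begin{align*}
\text{Total DRAM for 1000 processors} &\approx 1000 \times 1\,\text{GB} = 1\,\text{TB} \\
\text{8 GB DIMMs required} &\approx 1000\,\text{GB} / 8\,\text{GB} \approx 125 \\
\text{DRAM per FPGA (4 processors)} &\approx 4\,\text{GB} = \text{half of one DIMM}
\end{align*}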

RAMP2 HW Notes (4)
Enclosures
- Using a standard form factor will help with module packaging; need to look carefully at IBM BladeCenter (adopted by IBM and Intel); ATCA is gaining momentum
- Power may be a problem
- Can we accommodate custom ASIC integration (perhaps through a slight generalization of the DRAM interface)?
- What does Google do for packaging in their data centers? Is it racks of 1U modules?

RAMP2 HW Notes (5)
Interesting idea from Chuck Thacker: "Design the new board based on the needs of RAMP White"!
- Previously suggested by others
- Can we estimate the logic capacity, memory BW, network BW, etc.?

OS, VM & Compiler
Leader/Reporter: Christos Kozyrakis
Topics:
- Debugging HW and SW (RDL)
- Phased approach: proxy, full kernel, VMMs, hypervisor; HW/SW schedule and dependencies
- High-level applications

Software Notes (1)
RAMP milestones
- Pick ISA
- Deploy basic VMM
- Deploy OS

Software Notes (2)
VMM approach: use a split VMM system (a la VMware/Xen); see the sketch below
- Run a full VMM on an x86 host that allows access to devices
- Run a simple VMM on RAMP that communicates with the host for device accesses through some network
- A timing model may be used if I/O performance is important
- Should talk with Sun & IBM about their VMM systems for SPARC and PowerPC
- May be able to port a very basic Xen system on our own
Questions
- Accurate I/O timing with paravirtualization (you also need repeatability)
- SW/system-level/I/O issues for a large-scale machine may be more important than coherence
Related issue: do we want global cache coherence in White?
- Benefit vs. complexity (schedule, etc.)
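An illustrative sketch of one way the "simple VMM on RAMP" could forward a trapped device access to the full VMM on the x86 host over a network connection; the message encoding, method names, and port are invented for illustration and are not Xen's or VMware's interfaces.

```java
// Device-forwarding proxy sketch for a split VMM; encoding is invented.
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.net.Socket;

class DeviceProxy {
    private final DataOutputStream out;
    private final DataInputStream in;

    DeviceProxy(String hostVmm, int port) throws Exception {
        Socket s = new Socket(hostVmm, port);
        out = new DataOutputStream(s.getOutputStream());
        in = new DataInputStream(s.getInputStream());
    }

    /** Forward a trapped MMIO read to the host VMM and return its reply. */
    long mmioRead(long physAddr, int sizeBytes) throws Exception {
        out.writeByte(0);            // 0 = read request (invented encoding)
        out.writeLong(physAddr);
        out.writeByte(sizeBytes);
        out.flush();
        return in.readLong();        // host performs the real device access
    }
}
```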

Software Notes (3)
Separate infrastructure from RAMP
- Example: RDL should not be tied to RAMP White (note: this is in progress with some current RDL applications)
- Same with the BEE3 design work
- Most of our tools are applicable to others

Software Notes (4)
Debugging support: RDL-scope (see the trigger sketch below)
- Arbitrary conditions on RDL-level events to trigger debugging
- Get traces of messages
- Track lineage of messages: traceability, accountability, relating events to program constructs
- Infinite checkpoints for instructions & data
- Checkpoint support: swappable & observable designs
- Single step at the instruction, RDL, or cycle level (note: not a commonly used feature)
- Such features may attract people to RDL more than retiming (note: this is already the case with current RDL applications)
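A sketch of the "arbitrary trigger condition" idea: a predicate over channel-level events that, when satisfied, fires a debugging action such as starting a trace or taking a checkpoint. The types and fields here are illustrative assumptions, not an actual RDL-scope API.

```java
// Event-triggered debugging sketch; names and fields are illustrative.
import java.util.function.Predicate;

record ChannelEvent(String channel, long targetCycle, byte[] payload) {}

class DebugTrigger {
    private final Predicate<ChannelEvent> condition;
    private final Runnable action;

    DebugTrigger(Predicate<ChannelEvent> condition, Runnable action) {
        this.condition = condition;
        this.action = action;
    }

    void observe(ChannelEvent e) {
        if (condition.test(e)) action.run();   // e.g., dump trace, take checkpoint
    }
}

// Example: fire when any message appears on a "ring" channel after cycle 1e6.
// new DebugTrigger(e -> e.channel().equals("ring") && e.targetCycle() > 1_000_000,
//                  () -> System.out.println("trigger hit"));
```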

Software Notes (5)
What is our schedule?
- What can we have up and running within 1 year?
- Does it have to be RAMP White?
Do we need to migrate RDL maintenance from Greg?
- Note: the work should be spread out at least
Do we have enough manpower for this SW work?
- Compilers, VMMs, applications, etc.

Software Notes (6)
Application domains
- Enterprise/desktop: a full-featured OS on all nodes; running a JVM is a big plus here; should be able to run web servers, middleware, and DBs
- Embedded: while eventually an app may directly control a number of nodes, it is easier to start with all nodes running the OS
The base design should allow all nodes to run the OS
- Easiest starting point for SW
- Various researchers may decide to run the OS on a subset of nodes, managing the rest directly: a simple runtime with app-specific policies, common in embedded systems

Software Notes (7)
A simple kernel for embedded systems should support
- Fast remapping of computation
- Protection across processes
Emulation of an attached disk
- iSCSI + a timing model for disks (see the sketch below)
RAMP VMM uses:
- (a) Attract VMM researchers (might require x86)
- (b) Our own convenience: get an OS running, access to devices, etc.
- We may achieve (b) without (a)
Some researchers will want to turn cache coherence off anyway!
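A minimal sketch of the kind of disk timing model that could sit alongside an iSCSI-backed functional disk: it only computes how long a request should appear to take in target time. The parameter values are illustrative assumptions, not measurements of any particular drive.

```java
// Simple disk service-time model: fixed average seek + half-rotation + transfer.
class DiskTimingModel {
    static final double SEEK_MS = 4.0;       // assumed average seek time
    static final double ROTATION_MS = 2.0;   // half rotation at 15k RPM
    static final double MB_PER_S = 80.0;     // assumed sustained transfer rate

    /** Modeled service time in milliseconds for a request of `bytes`. */
    static double serviceTimeMs(long bytes) {
        double transferMs = (bytes / (MB_PER_S * 1_000_000.0)) * 1000.0;
        return SEEK_MS + ROTATION_MS + transferMs;
    }
}
```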