UC Berkeley 1 Time dilation in RAMP Zhangxi Tan and David Patterson Computer Science Division UC Berkeley.

Presentation transcript:

UC Berkeley 1 Time dilation in RAMP Zhangxi Tan and David Patterson Computer Science Division UC Berkeley

2 A time machine
Using RAMP as a datacenter simulator
– Vary DC configurations: processors, disks, network, etc.
– Evaluate different system implementations: MapReduce with a 10 Gbps, 2 ms delay interconnect or a 100 Gbps, 80 ms delay interconnect
– Explore and predict what happens if you upgrade the hardware in your cluster: more powerful CPUs, faster/larger disks
– Try things in the future!
RAMP inside

3 The problems
Emulate many fast computers in an FPGA. What are the problems?
– First comment, half a year ago at the RadLab retreat: 100 MHz is too slow and can't reflect a GHz machine
– Targets are becoming more and more complex
  Implementing them in an FPGA is hard, and cycle accuracy is desired
  How many cores can we put in an FPGA? (Original vision cores per chip. Now, 1 Leon on V2P30, 2-3 on V2P70)

4 Methodologies
RDL
– Target cycle, host cycle, start, stop, channel model…
  Transfer data between units with extra start/stop control
  Replace the original transfer logic with RDL
  Control the target clock: if there is no data, still send something to keep the target time “running”
  A bad control logic implementation may cause deadlock
RDLizing units (build channels, units) if you want them to talk with each other
– Compared to porting apps for MicroBlaze?
– Is RDLizing obvious and simple??
Model: event driven or clock driven?
Time dilation
– Remove target cycle control
  Is stepping every clock cycle the way to debug a 1000-node system?
– Use a standard data transfer interface
– Rescale everything to a “virtual wall clock” and “slow down” events accordingly
  Events: timer interrupt, data sent/received, etc.

5 Basic Idea
“Slow down” the passage of time to make the target appear faster
– 10 ms wall clock time = 2 ms target time
  Network: shorter time to send a packet -> bandwidth increases, latency decreases
  Disk: shorter time to read/write
  CPU: shorter time to do computation
– The virtual wall clock is the time coordinate in the target; the implementation only controls event intervals
[Figure: timelines comparing "No time dilation" (wall clock 10 ms, perceived event interval 10 ms) with "Time dilation" (wall clock 10 ms, virtual wall clock 2 ms, perceived event interval 2 ms)]
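A minimal sketch of the wall-clock-to-target mapping on this slide, assuming a single constant time dilation factor. The function names (`dilation_factor`, `to_target_us`, `scale_rate`) are illustrative and do not come from the slides.

```python
def dilation_factor(wall_us: int, target_us: int) -> float:
    """Factor implied by a wall-clock interval and the target interval it stands for."""
    return wall_us / target_us


def to_target_us(wall_us: float, factor: float) -> float:
    """Duration of a wall-clock interval as perceived inside the target."""
    return wall_us / factor


def scale_rate(physical_rate: float, factor: float) -> float:
    """A rate (bits/s, operations/s) as measured against the virtual wall clock."""
    return physical_rate * factor


factor = dilation_factor(10_000, 2_000)   # 10 ms wall clock = 2 ms target time
print(factor)                             # dilation factor
print(to_target_us(10_000, factor))       # perceived interval in microseconds
print(scale_rate(100e6, factor))          # a 100 Mbps physical link, as perceived
```

The same factor shortens every perceived interval, which is why network, disk, and CPU all appear faster at once.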

6 Real world examples
Network
  Sending 100 Mb of data between two events, at the same rate with the same logic
  Real: 1 sec between events, perceived BW: 100 Mbps
  Time dilation: 100 ms perceived between events, perceived BW: 1 Gbps
CPU and OS
  Timer interrupt before time dilation: 10 ms
  Timer interrupt after time dilation: 50 ms in wall clock time, 10 ms perceived in target
  The OS updates its timer (jiffies) every 10 ms, on each timer interrupt
  Reprogram the timer to slow the interrupt down
  – No OS modifications
  – No HW changes
  Speeds up the processor by 5x
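The arithmetic behind both examples can be checked with a short sketch. The helper names are hypothetical; only the numbers (100 Mb, 1 sec, 100 ms, 10 ms, 50 ms, 5x) are taken from the slide.

```python
def perceived_bandwidth_mbps(megabits: float, perceived_interval_ms: float) -> float:
    """Bandwidth the target believes it has: data sent divided by perceived time."""
    return megabits * 1000 / perceived_interval_ms


def wall_timer_period_ms(target_period_ms: float, speedup: float) -> float:
    """Wall-clock period to program into the timer so the target still sees target_period_ms."""
    return target_period_ms * speedup


# Network: the same 100 Mb transfer, perceived over 1 s and over 100 ms.
print(perceived_bandwidth_mbps(100, 1000))   # no dilation, in Mbps
print(perceived_bandwidth_mbps(100, 100))    # with dilation, in Mbps

# CPU and OS: a 10 ms jiffy with a 5x processor speedup.
print(wall_timer_period_ms(10, 5))           # wall-clock timer period in ms
```

Between two timer interrupts the processor now gets five times as many host cycles, while the OS still counts one 10 ms jiffy.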

7 Experiments
HW emulator (FPGA): 32-bit Leon3, 50 MHz, 90 MHz DDR memory, 8K L1 cache (4K Inst and 4K Data)
– Target system: Linux 2.6 kernel, 50 MHz / 250 MHz / 500 MHz / 1 GHz / 2 GHz
– Run the Dhrystone benchmark
– Tomorrow: HW/SW co-simulation example
Concept
  Time Dilation Factor = wall clock time / emulated clock time
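As a rough illustration of the definition, the sketch below derives the factor for each target frequency listed on this slide. It assumes that one host cycle emulates one target cycle, so that N cycles take N / host_hz of wall clock time and N / target_hz of emulated time; that assumption, and the names used, are not stated in the slides.

```python
HOST_HZ = 50_000_000  # Leon3 emulator clock from the slide
TARGETS_HZ = [50_000_000, 250_000_000, 500_000_000, 1_000_000_000, 2_000_000_000]


def time_dilation_factor(target_hz: int, host_hz: int = HOST_HZ) -> float:
    """wall clock time / emulated clock time for the same number of cycles."""
    return target_hz / host_hz


for hz in TARGETS_HZ:
    print(f"{hz / 1e6:.0f} MHz target -> TDF {time_dilation_factor(hz):g}")
```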

8 Dhrystone result (w/o memory TD)
How close to a 3 GHz x86 (~8000 Dhrystone MIPS)?
Memory, cache, CPI

9 Problems
Similar to time dilation in VMs
– To Infinity and Beyond: Time-Warped Network Emulation, NSDI 06
Everything is scaled linearly, including memory!
– VM is lucky: networking code can fit in cache easily.
– RAMP has more knobs to tweak.
Solution: slow down the memory and redo the experiment

10 Dhrystone w. Memory TD
Keep the memory access latency constant
- 90 MHz DDR DRAM w. 200 ns latency in all targets (50 MHz to 2 GHz)
- Latency is pessimistic, but reflects the trend
RAMP Blue result + time dilation vs. real system?
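Holding the latency fixed in nanoseconds means a memory access costs more target cycles as the target clock rises. The sketch below works this out for the 200 ns figure and the target frequencies used in the experiments; the function name is illustrative.

```python
def memory_latency_cycles(latency_ns: int, target_hz: int) -> int:
    """Target CPU cycles spent waiting on one memory access of fixed duration."""
    return latency_ns * target_hz // 1_000_000_000


for hz in (50_000_000, 250_000_000, 500_000_000, 1_000_000_000, 2_000_000_000):
    print(f"{hz / 1e6:.0f} MHz target: {memory_latency_cycles(200, hz)} cycles per access")
```

This is the sense in which memory is no longer scaled linearly with the CPU: the processor speeds up, but each miss costs proportionally more cycles.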

11 Limitation of naïve time dilation
Fixed CPI (memory/CPU) model
Next steps
– Variable time dilation factor: distribution and state (statistical model)
– Emulate OOO with time dilation
  Peek at each instruction and dilate it
– Going deterministic? No, I’ll do statistical
Proposed model
  [Figure: a unit containing a time dilation counter]
  No extra control between units
  Reprogram the Time Dilation Counter (TDC) in each unit to get a different target configuration
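The slides do not describe how the Time Dilation Counter is built. The following is only a speculative software sketch of one possible reading: each unit owns a counter that releases an event every fixed number of host cycles, and reprogramming the counter changes the target configuration. The class, its fields, and its methods are all hypothetical.

```python
class TimeDilationCounter:
    """Hypothetical per-unit counter: releases one event every `period_cycles` host cycles."""

    def __init__(self, host_hz: int, target_interval_ns: int, tdf: int):
        self.host_hz = host_hz
        self.target_interval_ns = target_interval_ns
        self.reprogram(tdf)

    def reprogram(self, tdf: int) -> None:
        # Wall-clock interval = target interval * TDF, expressed in host cycles.
        cycles = self.target_interval_ns * tdf * self.host_hz // 1_000_000_000
        self.period_cycles = max(1, cycles)
        self.count = 0

    def tick(self) -> bool:
        """Advance one host cycle; return True when the unit's event is due."""
        self.count += 1
        if self.count >= self.period_cycles:
            self.count = 0
            return True
        return False


# A 10 ms target timer on a 50 MHz host with a factor of 5.
tdc = TimeDilationCounter(host_hz=50_000_000, target_interval_ns=10_000_000, tdf=5)
print(tdc.period_cycles)   # host cycles between timer events
tdc.reprogram(1)           # same unit, no dilation
print(tdc.period_cycles)
```

Because each unit only counts its own host cycles, no start/stop handshake between units is needed in this sketch, which matches the "no extra control between units" goal on the slide.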

12 Discussions!