UC Berkeley 1 A Disk and Thermal Emulation Model for RAMP Zhangxi Tan and David Patterson.

Presentation transcript:

UC Berkeley 1 A Disk and Thermal Emulation Model for RAMP Zhangxi Tan and David Patterson

2 Outline
- Introduction and retrospective overview
- Improvement since June 06
- Disk and temperature emulation
- Future work

3 June 06 status
Internet in a box, Version 0
- 3 Xilinx XUP boards ($299 * 3) with 12 processors
- uClinux and a research application (i3)
Limitations
- Software base is poor
  - No MMU, no fork, no full version of Linux
  - Every piece of software needs porting
- Processor is too slow (100 MHz vs 3 GHz)
- No local storage per node

4 Improvement (Jun 06 -> Jan 07)
Processor
- MicroBlaze -> LEON 3
- 32-bit RISC/microcontroller -> 32-bit SPARC V8
- No MMU -> MMU/configurable TLB
- Single precision floating point -> IEEE 754 floating point
- Direct-mapped cache -> Direct-mapped/set-associative cache
OS and Software
- uClinux 2.4 (no protection, no fork) -> Full Linux
- Every piece of software needs porting -> Run latest Debian GNU/Linux binaries directly (supports apt-get)
Others
- No disk emulation -> Emulate local disk with Ethernet-attached storage
- Slow processor only -> Emulate fast systems with "Time Dilation"
- (nothing) -> Emulate system temperature

5 Agenda
- Introduction and retrospective overview
- Improvement since June 06
- Disk and temperature emulation
- Future work

6 Disk and Thermal Emulation
Local disk is an essential part of a datacenter
- Local physical storage
- Variable disk specifications (a VM only has a functional model)
- In the context of a real workload
Temperature is a critical issue in the datacenter (DC)
- Cooling, reliability
- How the workload affects datacenter temperature is an interesting topic

7 Methodology
- HW emulator (FPGA): 32-bit LEON3 at 50 MHz, 90 MHz DDR memory, 8K L1 cache (4K instruction and 4K data)
- Target system: Linux 2.6 kernel, 50 MHz to 2 GHz
- PC: storage, trace logger and model solver (offline or online)
Emulating an IDE disk with Ethernet-based network storage (ATA over Ethernet) + DiskSim
- AoE: encapsulates IDE commands in Ethernet packets
- DiskSim: widely used disk simulator (provides access timing based on the disk specification)
Thermal emulation is done by the Mercury suite (ASPLOS '06)
- Sample CPU/disk activity periodically and send it to a central emulator
- The emulator takes the system configuration and predicts temperature based on Newton's law of cooling
- Disk state helps with power estimation
Time dilation makes the "target" look faster
- Reprogram the HW timer to make "jiffies" longer in terms of wall clock
- Slow down memory accordingly when speeding up the processor
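The time dilation step can be sketched in a few lines. This is only an illustration of the idea on the slide, not the RAMP implementation: the function names and the 100 Hz tick rate are assumptions, and the 50 MHz host clock is the emulator frequency quoted above.

```python
# Minimal sketch of time dilation, assuming a 50 MHz host (the FPGA emulator)
# and a hypothetical kernel tick rate of 100 Hz. Names are illustrative only.
HOST_HZ = 50_000_000


def time_dilation_factor(target_hz, host_hz=HOST_HZ):
    """Ratio between the clock the target believes it has and the real host clock."""
    return target_hz / host_hz


def timer_reload_cycles(target_hz, tick_hz=100, host_hz=HOST_HZ):
    """Host cycles between timer interrupts so that one jiffy of target time
    lasts TDF / tick_hz seconds of wall-clock time."""
    tdf = time_dilation_factor(target_hz, host_hz)
    return round(host_hz * tdf / tick_hz)


if __name__ == "__main__":
    for target in (50_000_000, 250_000_000, 500_000_000, 1_000_000_000, 2_000_000_000):
        print(target, time_dilation_factor(target), timer_reload_cycles(target))
```

Under these assumptions the host executes as many cycles per jiffy as the dilated target would, which is why the target appears to run at the higher frequency.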

8 Experiments
Thermal emulation model (validated in Mercury)
- Physical layout from Dell PowerEdge GHz Xeon, 10K RPM SCSI
Emulated disk model (validated disk model in DiskSim)
- Seagate Cheetah 9LP, 10K RPM, 5 ms average seek time
Several programs run in the target system with various time dilation factors
- Dhrystone: CPU-intensive benchmark
- Postmark: a file system benchmark (disk intensive)
- Unix command with a pipe (both disk and CPU intensive)
  - cat alargefile | grep 'a search pattern' > searchresultfile
  - 100 MB file size
Emulation output
- Performance statistics
- System temperature

9 Dhrystone result (w/o memory TD)
- How close to a 3 GHz x86 (~8000 Dhrystone MIPS)?
- Memory, cache, CPI

10 Dhrystone with memory TD
Keep the memory access latency constant
- 90 MHz DDR DRAM with 200 ns latency in all targets (50 MHz to 2 GHz)
- The latency is pessimistic, but it reflects the trend
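As a rough worked example of what a constant memory latency implies, the sketch below converts the 200 ns figure from the slide into target clock cycles at each emulated frequency. The function name is hypothetical and the constant-delay model is the naive one the deck itself describes.

```python
# Sketch: a fixed 200 ns memory latency expressed in target CPU cycles.
# The faster the emulated processor, the more cycles each access costs.
MEMORY_LATENCY_NS = 200  # figure quoted on the slide


def memory_latency_cycles(target_hz, latency_ns=MEMORY_LATENCY_NS):
    """Number of target clock cycles that elapse during one memory access."""
    return latency_ns * target_hz // 1_000_000_000


if __name__ == "__main__":
    for target in (50_000_000, 250_000_000, 500_000_000, 1_000_000_000, 2_000_000_000):
        print(f"{target / 1e6:.0f} MHz: {memory_latency_cycles(target)} cycles")
```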

11 Postmark file system benchmark
- Speed-up factor is larger than the TDF (because of overhead)
- How close to a modern SATA disk? Twice the throughput when running the same benchmark.

12 Disk emulation performance
Overhead analysis
- < 1.4 ms to send a packet (no zero-copy, VM)
- Bursts of requests (service time < 10 ms, including DiskSim), AoE protocol segmentation
- A larger TDF offsets the overhead
Overall, the emulated disk time is still a little longer than the simulated timing in DiskSim (~2.8 ms)
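One way to see why a larger TDF offsets the overhead is a simple additive model, sketched below. It assumes the software overhead is a fixed wall-clock cost while the simulated disk time is stretched by the TDF; this model and the function name are assumptions for illustration, not something the slides state. The 1.4 ms overhead is the upper bound quoted above, and 2.8 ms is used only as an illustrative simulated service time.

```python
# Sketch: disk service time as seen by the target under time dilation,
# assuming a fixed wall-clock software overhead per request (an assumption).
def target_service_ms(simulated_ms, overhead_wall_ms, tdf):
    """Target-visible service time.

    Wall-clock time spent is simulated_ms * tdf + overhead_wall_ms;
    dividing by tdf converts it back into target time.
    """
    wall_ms = simulated_ms * tdf + overhead_wall_ms
    return wall_ms / tdf


if __name__ == "__main__":
    for tdf in (1, 5, 10, 20, 40):
        print(tdf, round(target_service_ms(2.8, 1.4, tdf), 3))
```

In this model the target always sees slightly more than the simulated time, and the excess shrinks as 1/TDF.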

13 Emulated disk R/W time in target
- Pretty deterministic result with different TDFs

14 CPU Temperature Emulation
[Plots: emulated CPU temperature at target clocks of 50 MHz, 250 MHz, 500 MHz, 1 GHz and 2 GHz]
- Calibration is needed to get the correct absolute value
- The trend is accurate
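A temperature solver based on Newton's law of cooling can be sketched with a single lumped component. This is not the Mercury model: the coefficients, the function name and the single-node simplification are all assumptions chosen to show the shape of the calculation.

```python
# Sketch: one-node thermal model integrated with explicit Euler steps.
# dT/dt = P / C - k * (T - T_ambient)
# All parameter values below are hypothetical, not calibrated to any machine.
def simulate_temperature(power_w, ambient_c=25.0, k=0.05,
                         heat_capacity=50.0, dt=1.0, steps=2000):
    """Return the temperature trace (deg C) of a component dissipating power_w."""
    temp = ambient_c
    trace = []
    for _ in range(steps):
        heating = power_w / heat_capacity      # degrees per second from power
        cooling = k * (temp - ambient_c)       # Newton's law of cooling
        temp += (heating - cooling) * dt
        trace.append(temp)
    return trace


if __name__ == "__main__":
    low = simulate_temperature(power_w=20.0)
    high = simulate_temperature(power_w=50.0)
    print(round(low[-1], 2), round(high[-1], 2))
```

With constant power the trace settles at ambient + P / (k * C), so a busier (higher power) workload ends at a higher temperature. Absolute values depend entirely on the chosen coefficients, which matches the slide's remark that calibration is needed.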

15 Disk Temperature Emulation
[Plots: emulated disk temperature at target clocks of 50 MHz, 250 MHz, 500 MHz, 1 GHz and 2 GHz]

16 Limitations and Conclusion
Limitations
- AoE limits the maximum number of sectors per read/write to 2 (Ethernet packet size limitation)
- Naive memory dilation (constant delay)
Conclusion
- Doing disk emulation in SW is pretty "lightweight", if
  - time dilation makes the SW disk fast enough
  - there is a separate network channel for disk emulation
Future work
- Better statistical time dilation model (CPI, distribution), still with simple HW
- Emulate a real-life disk controller (e.g. Intel ICH) for less overhead
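The two-sector limit follows from simple arithmetic on the Ethernet payload. The sketch below assumes a standard 1500-byte payload, 512-byte sectors, and 22 bytes of AoE plus ATA header inside the payload; the header size reflects my reading of the AoE frame layout and should be treated as an assumption, as should the function name.

```python
# Sketch: how many 512-byte sectors fit in one AoE frame.
# AOE_HEADER_BYTES is an assumed value (AoE common header + ATA header).
SECTOR_BYTES = 512
AOE_HEADER_BYTES = 22


def sectors_per_frame(payload_bytes=1500, header_bytes=AOE_HEADER_BYTES,
                      sector_bytes=SECTOR_BYTES):
    """Whole sectors that fit in one Ethernet payload after the AoE headers."""
    return (payload_bytes - header_bytes) // sector_bytes


if __name__ == "__main__":
    print(sectors_per_frame())        # standard Ethernet payload
    print(sectors_per_frame(9000))    # jumbo frame, if the network supports it
```

Under these assumptions a standard frame carries 2 sectors, so larger requests have to be split into many packets. A jumbo-frame payload would carry more, though the deck does not discuss that option.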