Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison.

Slides:



Advertisements
Similar presentations
Hardware Lesson 3 Inside your computer.
Advertisements

Debugging operating systems with time-traveling virtual machines Sam King George Dunlap Peter Chen CoVirt Project, University of Michigan.
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
Virtual Switching Without a Hypervisor for a More Secure Cloud Xin Jin Princeton University Joint work with Eric Keller(UPenn) and Jennifer Rexford(Princeton)
Multi-core processors. 2 Processor development till 2004 Out-of-order Instruction scheduling Out-of-order Instruction scheduling.
Parallelism Lecture notes from MKP and S. Yalamanchili.
Managing Wire Delay in Large CMP Caches Bradford M. Beckmann David A. Wood Multifacet Project University of Wisconsin-Madison MICRO /8/04.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
SLA-Oriented Resource Provisioning for Cloud Computing
Flash: An efficient and portable Web server Authors: Vivek S. Pai, Peter Druschel, Willy Zwaenepoel Presented at the Usenix Technical Conference, June.
Enabling Efficient On-the-fly Microarchitecture Simulation Thierry Lafage September 2000.
Variability in Architectural Simulations of Multi-threaded Workloads Alaa R. Alameldeen and David A. Wood University of Wisconsin-Madison
Computer Architecture Lab at Combining Simulators and FPGAs “An Out-of-Body Experience” Eric S. Chung, Brian Gold, James C. Hoe, Babak Falsafi {echung,
CS-3013 & CS-502, Summer 2006 Virtual Machine Systems1 CS-502 Operating Systems Slides excerpted from Silbershatz, Ch. 2.
(C) 2003 Mulitfacet ProjectUniversity of Wisconsin-Madison Simulating a $2M Commercial Server on a $2K PC Alaa Alameldeen, Milo Martin, Carl Mauer, Kevin.
Computer Architecture Lab at 1 P ROTO F LEX : FPGA-Accelerated Hybrid Functional Simulator Eric S. Chung, Eriko Nurvitadhi, James C. Hoe, Babak Falsafi,
G Robert Grimm New York University Disco.
Introduction to Systems Architecture Kieran Mathieson.
Multiprocessors ELEC 6200 Computer Architecture and Design Instructor: Dr. Agrawal Yu-Chun Chen 10/27/06.
(C) 2002 Milo MartinHPCA, Feb Bandwidth Adaptive Snooping Milo M.K. Martin, Daniel J. Sorin Mark D. Hill, and David A. Wood Wisconsin Multifacet.
Figure 1.1 Interaction between applications and the operating system.
Evaluating Non-deterministic Multi-threaded Commercial Workloads Computer Sciences Department University of Wisconsin—Madison
Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.
EECC722 - Shaaban #1 Lec # 4 Fall Operating System Impact on SMT Architecture The work published in “An Analysis of Operating System Behavior.
(C) 2004 Daniel SorinDuke Architecture Using Speculation to Simplify Multiprocessor Design Daniel J. Sorin 1, Milo M. K. Martin 2, Mark D. Hill 3, David.
February 11, 2003Ninth International Symposium on High Performance Computer Architecture Memory System Behavior of Java-Based Middleware Martin Karlsson,
Xen and the Art of Virtualization. Introduction  Challenges to build virtual machines Performance isolation  Scheduling priority  Memory demand  Network.
Presented by Deepak Srinivasan Alaa Aladmeldeen, Milo Martin, Carl Mauer, Kevin Moore, Min Xu, Daniel Sorin, Mark Hill and David Wood Computer Sciences.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
Introduction and Overview Questions answered in this lecture: What is an operating system? How have operating systems evolved? Why study operating systems?
(C) 2003 Mulitfacet ProjectUniversity of Wisconsin-Madison Evaluating a $2M Commercial Server on a $2K PC and Related Challenges Mark D. Hill Multifacet.
1 CS503: Operating Systems Spring 2014 Dongyan Xu Department of Computer Science Purdue University.
Kinshuk Govil, Dan Teodosiu*, Yongqiang Huang, and Mendel Rosenblum
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
High Performance Computing Processors Felix Noble Mirayma V. Rodriguez Agnes Velez Electric and Computer Engineer Department August 25, 2004.
CDA 3101 Fall 2013 Introduction to Computer Organization Computer Performance 28 August 2013.
SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery Daniel J. Sorin, Milo M. K. Martin, Mark D. Hill,
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Architectural Characterization of an IBM RS6000 S80 Server Running TPC-W Workloads Lei Yang & Shiliang Hu Computer Sciences Department, University of.
Simulating a $2M Commercial Server on a $2K PC Alaa R. Alameldeen, Milo M.K. Martin, Carl J. Mauer, Kevin E. Moore, Min Xu, Daniel J. Sorin, Mark D. Hill.
Operating Systems CSE 411 Multi-processor Operating Systems Multi-processor Operating Systems Dec Lecture 30 Instructor: Bhuvan Urgaonkar.
(C) 2003 Daniel SorinDuke Architecture Dynamic Verification of End-to-End Multiprocessor Invariants Daniel J. Sorin 1, Mark D. Hill 2, David A. Wood 2.
1 Lecture 1: Computer System Structures We go over the aspects of computer architecture relevant to OS design  overview  input and output (I/O) organization.
Exploiting Instruction Streams To Prevent Intrusion Milena Milenkovic.
SIMULATION OF MULTIPROCESSOR SYSTEM AND NETWORK Manish Patel Nov 8 th 2004 Advisor: Dr. Chung-E-Wang Department of Computer Science California State University,
Shouqing Hao Institute of Computing Technology, Chinese Academy of Sciences Processes Scheduling on Heterogeneous Multi-core Architecture.
Disco: Running Commodity Operating Systems on Scalable Multiprocessors Presented by: Pierre LaBorde, Jordan Deveroux, Imran Ali, Yazen Ghannam, Tzu-Wei.
(1) SIMICS Overview. (2) SIMICS – A Full System Simulator Models disks, runs unaltered OSs etc. Accuracy is high (e.g., pollution effects factored in)
Silberschatz, Galvin and Gagne ©2009Operating System Concepts – 8 th Edition Chapter 4: Threads.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
Multi-Core CPUs Matt Kuehn. Roadmap ► Intel vs AMD ► Early multi-core processors ► Threads vs Physical Cores ► Multithreading and Multi-core processing.
CS203 – Advanced Computer Architecture
Virtualization.
Computer Sciences Department University of Wisconsin-Madison
Chapter 13: I/O Systems Modified by Dr. Neerja Mhaskar for CS 3SH3.
Presented by Yoon-Soo Lee
Current Generation Hypervisor Type 1 Type 2.
CLUSTER COMPUTING Presented By, Navaneeth.C.Mouly 1AY05IS037
Virtualization overview
Chapter 4: Threads.
Operating Systems (CS 340 D)
Section 1: Introduction to Simics
Combining Simulators and FPGAs “An Out-of-Body Experience”
Introduction to Computing
Improving Multiple-CMP Systems with Token Coherence
Five Key Computer Components
Simulating a $2M Commercial Server on a $2K PC
Operating Systems (CS 340 D)
Chapter 4 Multiprocessors
Presentation transcript:

Full-System Timing-First Simulation Carl J. Mauer Mark D. Hill and David A. Wood Computer Sciences Department University of Wisconsin—Madison

Full-System Timing-First Simulation Carl Mauer The Problem Design of future computer systems uses simulation Increasing functionality and timing demands from: –Micro-architecture complexity –Multiprocessing –Complex workloads How can simulators be structured to manage complexity? (Simulated) Target System Target Application Host Computer DesignEvaluate Software Hardware

Full-System Timing-First Simulation Carl Mauer Solutions Separates the complexity of functional correctness from timing accuracy + Models multiprocessor timing + Decouples complexity + Leverages existing simulators * Timing-directed discussed in paper Timing-First (This paper) Functional Simulator Timing Simulator  Complete Timing  Partial Function  No Timing  Complete Function Timing and Functional Simulator Integrated (SimOS) - Complex Functional Simulator Timing Simulator Functional-First (Trace-driven) - Timing feedback

Full-System Timing-First Simulation Carl Mauer Outline Overview Introduction Timing-First Simulation Evaluation Conclusion

Full-System Timing-First Simulation Carl Mauer Challenges to Timing Simulation Execution driven simulation is getting harder Increasing micro-architecture complexity –Multiple “in-flight” instructions –Speculative execution –Out-of-order execution Increasing use of multiprocessing –Chip multiprocessors –Hardware multi-threading

Full-System Timing-First Simulation Carl Mauer Challenges to Functional Simulation Commercial workloads have high functional fidelity demands (Simulated) Target System Target Application Database Operating System SPEC Benchmarks Kernels Web Server RAM Processor PCI Bus Ethernet Controller Fiber Channel Controller Graphics Card SCSI Controller CD- ROM SCSI Disk … DMA Controller Terminal I/O MMU Controller IRQ Controller Status Registers Serial PortMMU Real Time Clock

Full-System Timing-First Simulation Carl Mauer Outline Overview Introduction Timing-First Simulation Evaluation Conclusion

Full-System Timing-First Simulation Carl Mauer Timing-First Simulation Timing Simulator –does functional execution of user and privileged operations –does detailed multiprocessor timing simulation –does NOT implement full instruction set or any devices Functional Simulator –does full-system multiprocessor simulation –does NOT model detailed micro-architectural timing Timing Simulator Functional Simulator Commit Verify CPU Other Chips RAM Network add load Cache CPU Execute I/O Devices

Full-System Timing-First Simulation Carl Mauer Timing-First Operation As instruction retires, step CPU in functional simulator Verify instruction’s execution Reload state if timing simulator deviates from functional –Very uncommon case –Example: Instructions with unidentified side-effects Timing Simulator Functional Simulator CPU RAM Network add load Cache CPU Execute Commit Verify Other Chips I/O Devices Reload

Full-System Timing-First Simulation Carl Mauer Benefits of Timing-First Support detailed multi-processor timing models Avoid re-implementing full-system simulator (hard!) Faster development, same performance –99% correct is (much!) easier than 100% –Functional simulation is (much!) faster than timing simulation –Independent correctness checker helps development Questions: –How much error is introduced with this approach? –Are there simulation performance penalties?

Full-System Timing-First Simulation Carl Mauer Outline Overview Introduction Timing-First Simulation Evaluation Conclusion

Full-System Timing-First Simulation Carl Mauer Evaluation Our implementation, TFsim uses: –Functional Simulator: Virtutech Simics –Timing simulator: Implemented in less than one-person year Evaluated using OS intensive commercial workloads –OS Boot: > 1 billion instructions of Solaris 8 startup –OLTP: Database workload with 1 GB of data –Dynamic Web: Apache serving message board, using code and data similar to slashdot.org –Static Web: Apache web server serving static web pages –Barnes-Hut: Scientific SPLASH-2 benchmark

Full-System Timing-First Simulation Carl Mauer Measured Deviations Less than 2 deviations per 10,000 instructions (0.02%)

Full-System Timing-First Simulation Carl Mauer Analysis of Results Runs full-system workloads! Timing performance impact of deviations –Worst case: less than 3% performance error –Average case: less than 0.3% performance error ‘Overhead’ of redundant execution –18% on average for uniprocessors –18% (2 processors) up to 36% (16 processors)

Full-System Timing-First Simulation Carl Mauer Performance Comparison Absolute simulation performance comparison –In kilo-instructions committed per second (KIPS) –RSIM Scaled: 107 KIPS –Uniprocessor TFsim: 119 KIPS (Simulated) Target System Target Application Host Computer Out-of-Order MP SPARC V9 SPLASH-2 Kernels 400 MHz SPARC running Solaris Out-of-Order MP Full-system SPARC V9 SPLASH-2 Kernels 1.2 GHz Pentium running Linux match close different TFsimRSIM

Full-System Timing-First Simulation Carl Mauer Conclusions Execution-driven simulators are increasingly complex How to manage complexity? Our answer: –Introduces relatively little performance error (worst case: 3%) –Has low-overhead (18% uniprocessor average) –Rapid development time Timing-First Simulation Functional Simulator Timing Simulator  Complete Timing  Partial Function  No Timing  Complete Function

Full-System Timing-First Simulation Carl Mauer Questions?

Full-System Timing-First Simulation Carl Mauer Related Work

Full-System Timing-First Simulation Carl Mauer Bundled Retires