NATIONAL INSTITUTE OF TECHNOLOGY KARNATAKA, SURATHKAL. Presentation on ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS


Paper authors: Daniel Sanchez, Christos Kozyrakis
Presented by: Vaibhav Ashtikar (13IS24F), Govind Dhonddev (13IS06F)

Architectural Simulator
A piece of software that models a computer system or its components: given an input, it predicts the output and the performance.
Evaluating hardware designs: different designs can be compared without building costly physical hardware.
Access to non-existing hardware: components or systems that do not exist yet can still be studied.
Detailed performance metrics: a single simulator run can often generate a large set of performance data.
Debugging: debugging on real hardware typically requires rebooting and re-running the code to reproduce a problem. In contrast, some simulators provide a fully controlled environment and let software developers run code backward once an error is detected.

Ideal Architectural Simulator
A piece of software that models a computer system or its components: given an input, it predicts the output and the performance.
The ideal simulator is fast, accurate, able to execute a wide range of workloads, and easy to use and modify.


Problem: architectural simulation is time-consuming
Current detailed simulators are slow (about 200 KIPS).
Time to simulate 1 second of a GHz-clocked target: about 4 months at 200 KIPS, about 3 hours at 200 MIPS.
Simulation performance wall: targets keep getting more complex (multicore, memory hierarchy, ...) and simulators are hard to parallelize.
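The quoted simulation times can be checked with a little arithmetic. The target parameters are missing from the line above, so this sketch assumes a 1000-core target clocked at 2 GHz that retires one instruction per cycle per core. These values are assumptions chosen because they reproduce the quoted times; they are not taken from the slide.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 24 * SECONDS_PER_HOUR

def simulation_seconds(cores, clock_hz, target_seconds, sim_ips, ipc=1.0):
    """Host seconds needed to simulate `target_seconds` of target time."""
    instructions = cores * clock_hz * target_seconds * ipc
    return instructions / sim_ips

# Assumed target: 1000 cores at 2 GHz, IPC = 1 (hypothetical values).
slow = simulation_seconds(1000, 2e9, 1.0, 200e3)  # simulator running at 200 KIPS
fast = simulation_seconds(1000, 2e9, 1.0, 200e6)  # simulator running at 200 MIPS

slow_days = slow / SECONDS_PER_DAY    # roughly 4 months
fast_hours = fast / SECONDS_PER_HOUR  # roughly 3 hours
```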

Architectural Simulator: sequential vs. parallel simulation
Sequential simulation: the more cores are simulated, the slower the simulation gets.
Parallel simulation: either scales poorly because of excessive synchronization, or sacrifices accuracy by allowing events to be reordered. This is a tradeoff.

Tradeoff between speed and accuracy
[Figure: performance measures (speed, accuracy, scaling) for sequential, parallel, and OOO simulation]

ZSIM: three techniques
Speeding up detailed core models.
Bound-weave.
Lightweight user-level virtualization.

ZSIM: speeding up detailed core models (goal: fast)
Uses instruction-driven timing models built on DBT (dynamic binary translation).

ZSIM: bound-weave (goal: accuracy)
A two-phase parallelization technique that scales parallel simulation on multicore hosts.

ZSIM: lightweight user-level virtualization (goal: wide range of workloads)
Supports complex workloads, e.g. multiprogrammed and client-server applications.
Bridges the gap between user-level and full-system simulation.


Dynamic Binary Translation
Simulate each basic block using host instructions.
Binary code:
    Load r3, 16(r1)
    Add r4, r3, r2
    Jump 0x48074
Translated code:
    Load t1, simRegs[1]
    Load t2, 16(t1)
    Store t2, simRegs[3]
    Load t1, simRegs[2]
    Load t2, simRegs[3]
    Add t3, t1, t2
    Store t3, simRegs[4]
    Store 0x48074, simPc
    J dispatch_loop
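The translation step on this slide can be sketched in a few lines of Python. This is a toy illustration, not zsim's implementation: the three-instruction guest ISA, the `state` dictionary (standing in for `simRegs` and `simPc`), and every function name are hypothetical. Each guest basic block is translated once into a list of host-level callables, cached by its start address, and then executed from a dispatch loop.

```python
def make_load(rd, offset, rbase):
    def host_op(state):
        state["regs"][rd] = state["mem"][state["regs"][rbase] + offset]
    return host_op

def make_add(rd, ra, rb):
    def host_op(state):
        state["regs"][rd] = state["regs"][ra] + state["regs"][rb]
    return host_op

def make_jump(target):
    def host_op(state):
        state["pc"] = target
    return host_op

MAKERS = {"load": make_load, "add": make_add, "jump": make_jump}

def translate_block(block):
    """Translate one guest basic block into a list of host callables."""
    return [MAKERS[op](*args) for op, *args in block]

def run(program, state, max_blocks=10):
    """Dispatch loop with a translation cache keyed by guest PC.
    Assumes every block ends in a jump that updates the PC."""
    cache = {}
    translations = 0
    for _ in range(max_blocks):
        pc = state["pc"]
        if pc not in program:
            break  # control left the known program
        if pc not in cache:
            cache[pc] = translate_block(program[pc])
            translations += 1
        for host_op in cache[pc]:
            host_op(state)
    return translations

# The slide's block: Load r3, 16(r1); Add r4, r3, r2; Jump 0x48074
program = {0x48000: [("load", 3, 16, 1), ("add", 4, 3, 2), ("jump", 0x48074)]}
state = {"pc": 0x48000, "regs": [0] * 8, "mem": {116: 5}}
state["regs"][1] = 100  # base address held in r1
state["regs"][2] = 7
translations = run(program, state)
```

Because translated blocks are cached, a block that executes many times pays the translation cost only once, which is where this style of simulation gets its speed.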

Dynamic Binary Translation
Parallelization alone is not sufficient for simulating a thousand-core system.
An instrumentation-based approach eliminates the need for functional modeling of x86.
Timing models: simple core model and OOO core model.

Simple Core Model (SCM)
Instruments load and store instructions.
The SCM counts cycles and instructions, and drives the memory hierarchy.
Simulates at up to 90 MIPS per simulated core.
Limitation: it does not represent the out-of-order (OOO) cores used in desktop and server processor chips.
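A minimal sketch of what such a simple core model might look like, assuming one cycle per instruction plus whatever latency the memory hierarchy reports for each load or store. The class, the callback names, and the toy cache (64-byte lines, 4-cycle hits, 100-cycle misses) are all hypothetical and only illustrate the counting scheme.

```python
class SimpleCore:
    """IPC = 1 core: one cycle per instruction plus memory access latency."""

    def __init__(self, memory_latency):
        self.memory_latency = memory_latency  # callable: (addr, is_write) -> cycles
        self.cycles = 0
        self.instructions = 0

    def basic_block(self, num_instructions):
        # Invoked once per instrumented basic block.
        self.instructions += num_instructions
        self.cycles += num_instructions

    def load(self, addr):
        # Invoked by the instrumentation before each load.
        self.cycles += self.memory_latency(addr, False)

    def store(self, addr):
        # Invoked by the instrumentation before each store.
        self.cycles += self.memory_latency(addr, True)

def make_toy_cache(hit_cycles=4, miss_cycles=100, line_bytes=64):
    """Hypothetical infinite cache: first touch of a line misses, later ones hit."""
    lines = set()
    def latency(addr, is_write):
        line = addr // line_bytes
        if line in lines:
            return hit_cycles
        lines.add(line)
        return miss_cycles
    return latency

core = SimpleCore(make_toy_cache())
core.basic_block(5)   # five instructions, one cycle each
core.load(0x1000)     # first touch of the line: miss
core.load(0x1008)     # same 64-byte line: hit
core.store(0x2000)    # different line: miss
```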

OOO Core Model
Models branch prediction, the instruction length predecoder, instruction decoding, issue stalls, and register renaming.
Conventional simulators with OOO models run at around 100 KIPS.
ZSIM accelerates the OOO core model by pushing most of the work into the instrumentation phase.

OOO core modeling
Basic block → instrumented basic block + basic block descriptor.
Basic block listing on the slide:
    mov (%rbp),%rcx
    add %rax,%rbx
    mov %rdx,(%rbp)
    ja 40530a
Instrumented listing on the slide:
    BasicBlock(DecodedBBL)
    Load(addr = -0x38(%rbp))
    mov -0x38(%rbp),%rcx
    lea -0x2040(%rbp),%rdx
    add %rax,%rdx
    mov %rdx,-0x2068(%rbp)
    Store(addr = -0x2068(%rbp))
    cmp $0x1fff,%rax
    jne a
Basic block descriptor: instruction → μop decoding; μop dependencies, functional units, and latency; front-end delays.
Instruction-driven approach: simulate all stages at once for each instruction / μop. This requires that:
The schedule of a given μop does not depend on future μops.
The execution time of every μop is known in advance.
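The two conditions of the instruction-driven approach can be illustrated with a small scheduling sketch. It walks the μops once, in program order, and fixes each μop's start cycle from information about earlier μops only, using a latency that is known up front. This is a deliberately simplified model with hypothetical names, latencies, and port assignments; it is not zsim's OOO core model.

```python
def schedule_uops(uops, issue_width=4):
    """uops: list of (dst_reg or None, [src_regs], port, latency).
    Returns the completion cycle of each uop."""
    reg_ready = {}  # register -> cycle at which its value is available
    port_free = {}  # functional-unit port -> next cycle it can accept a uop
    finish = []
    for i, (dst, srcs, port, latency) in enumerate(uops):
        issue = i // issue_width  # front end delivers issue_width uops per cycle
        start = max([issue, port_free.get(port, 0)] +
                    [reg_ready.get(r, 0) for r in srcs])
        done = start + latency    # latency is known in advance
        port_free[port] = start + 1  # assume fully pipelined ports
        if dst is not None:
            reg_ready[dst] = done
        finish.append(done)
    return finish

# Hypothetical decoding of part of the instrumented listing above.
uops = [
    ("rcx", ["rbp"], "load", 4),          # mov -0x38(%rbp),%rcx
    ("rdx", ["rbp"], "alu", 1),           # lea -0x2040(%rbp),%rdx
    ("rdx", ["rax", "rdx"], "alu", 1),    # add %rax,%rdx
    (None, ["rdx", "rbp"], "store", 1),   # mov %rdx,-0x2068(%rbp)
]
finish = schedule_uops(uops)
```

Since no μop ever needs to be revisited, all pipeline stages can be accounted for in a single pass per μop instead of being simulated cycle by cycle.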

Parallelism and Interference: path-altering interference
If two accesses are simulated out of order, their paths through the memory hierarchy change.
Root cause: the two accesses address the same line (unless both are reads).
Example: the second access, if executed out of order, turns the first access into a miss.

Parallelism and Interference: path-preserving interference
If two accesses are simulated out of order, their timing changes, but their paths through the memory hierarchy remain unaffected.
Example: two accesses to different cache sets in the same bank.
In small intervals (1-10K cycles), path-altering interference is very rare (fewer than 1 in 10K accesses).
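The two kinds of interference can be captured in a small classifier. This is a simplified sketch of the rules as the slides state them (same line with at least one write is path-altering; sharing a bank otherwise affects only timing). The `Access` record and the function name are hypothetical, and effects such as evictions caused by set conflicts are ignored.

```python
from collections import namedtuple

# Hypothetical access record: cache line address, bank index, read/write flag.
Access = namedtuple("Access", "line bank is_write")

def classify_interference(a, b):
    """Classify what reordering two accesses could change."""
    if a.line == b.line and (a.is_write or b.is_write):
        return "path-altering"    # reordering can turn a hit into a miss
    if a.bank == b.bank:
        return "path-preserving"  # same paths, but they contend for the bank
    return "none"

same_line_write = classify_interference(Access(10, 0, False), Access(10, 0, True))
same_line_reads = classify_interference(Access(10, 0, False), Access(10, 0, False))
same_bank = classify_interference(Access(10, 0, False), Access(12, 0, True))
different_banks = classify_interference(Access(10, 0, True), Access(11, 1, True))
```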

Bound-weave Algorithm
Motivation: simulation must stay accurate in the presence of interference, and path-altering interference is rare within a small interval.
Simulation proceeds in small intervals, each with two phases:
1. Bound phase: each core is simulated for the interval assuming zero-load latency for memory accesses.
2. Weave phase: the same interval is simulated in parallel, using the prior knowledge of events gathered in the bound phase so that it scales efficiently.
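The two phases can be sketched on a toy model. The code below is a sequential illustration under assumptions of my own, not zsim's implementation: a single latency constant `ZERO_LOAD`, banks that serve one access at a time, and traces given as (compute cycles before the access, bank) pairs. The bound phase simulates each core alone with zero-load latencies and records its accesses; the weave phase replays the recorded accesses in time order and adds the contention delays.

```python
import heapq

ZERO_LOAD = 10  # assumed uncontended access latency in cycles (hypothetical)

def bound_phase(core_traces):
    """core_traces: {core: [(compute_cycles_before_access, bank), ...]}.
    Returns {core: [(issue_cycle, bank), ...]} assuming no contention."""
    bounds = {}
    for core, trace in core_traces.items():
        cycle, events = 0, []
        for compute, bank in trace:
            cycle += compute
            events.append((cycle, bank))
            cycle += ZERO_LOAD
        bounds[core] = events
    return bounds

def weave_phase(bounds):
    """Replay recorded accesses in time order with bank contention.
    Returns {core: extra cycles on top of the bound-phase estimate}."""
    delay = {core: 0 for core in bounds}
    next_idx = {core: 0 for core in bounds}
    bank_free = {}
    heap = [(events[0][0], core) for core, events in bounds.items() if events]
    heapq.heapify(heap)
    while heap:
        cycle, core = heapq.heappop(heap)
        _, bank = bounds[core][next_idx[core]]
        start = max(cycle, bank_free.get(bank, 0))  # wait if the bank is busy
        bank_free[bank] = start + ZERO_LOAD
        delay[core] += start - cycle
        next_idx[core] += 1
        if next_idx[core] < len(bounds[core]):
            # Later accesses of this core shift by the delay accumulated so far.
            nxt = bounds[core][next_idx[core]][0] + delay[core]
            heapq.heappush(heap, (nxt, core))
    return delay

core_traces = {0: [(5, 0), (5, 1)], 1: [(5, 0)]}
bounds = bound_phase(core_traces)
delay = weave_phase(bounds)
```

In this example both cores access bank 0 at cycle 5, so core 1 waits 10 cycles behind core 0, while core 0's later access to bank 1 is unaffected.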

Bound Phase
Limiting the skew between simulated cores: threads execute in parallel and synchronize on a barrier at the end of each interval.
Moderating parallelism: only as many threads as the host has hardware threads are allowed to run concurrently. Example: when the simulated cores outnumber the 32 hardware threads of the host, the barrier wakes up only 32 threads at a time in each interval.
Avoiding systematic bias: at the end of each interval, the barrier shuffles the thread wake-up order to avoid consistently prioritizing a few threads.
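The wake-up policy can be sketched as follows. This is a simplification with hypothetical names: it releases threads in fixed batches of at most `host_threads`, whereas a real barrier would more likely wake the next thread as soon as a running one finishes its interval. The simulated thread count of 128 is an arbitrary illustrative value.

```python
import random

def wake_order(sim_threads, host_threads, rng):
    """Shuffle the wake-up order, then release threads in host-sized batches."""
    order = list(sim_threads)
    rng.shuffle(order)  # a new random order each interval avoids systematic bias
    return [order[i:i + host_threads]
            for i in range(0, len(order), host_threads)]

rng = random.Random(0)  # seeded only to make the sketch reproducible
batches = wake_order(range(128), 32, rng)
```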

Bound-Weave Example
[Figure: a 2-core host simulating a 4-core system, with 1000-cycle intervals and the components divided among 2 domains]

Complex Workloads
Multiprocess simulation.
Scheduler: a simple round-robin scheduler.
Avoiding simulator-OS deadlock.
Time virtualization: the rdtsc counter is virtualized.
System virtualization: the simulated process sees a pregenerated, virtualized view of the system.
Fast-forwarding: DBT makes it possible to run through pre-processing quickly, at close to native speed.
Challenge: accounting for OS execution time.
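A round-robin scheduler of the kind mentioned above can be sketched in a few lines. The class and method names are hypothetical, and the sketch assumes the scheduler is consulted once per interval; the slides do not describe zsim's scheduler in more detail than "simple round robin".

```python
from collections import deque

class RoundRobinScheduler:
    """Shares a fixed number of simulated cores among runnable threads."""

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.runnable = deque()

    def add(self, thread):
        self.runnable.append(thread)

    def next_interval(self):
        """Pick up to num_cores threads, then move them to the back of the queue."""
        count = min(self.num_cores, len(self.runnable))
        chosen = [self.runnable.popleft() for _ in range(count)]
        self.runnable.extend(chosen)
        return chosen

sched = RoundRobinScheduler(num_cores=2)
for name in ("a", "b", "c"):
    sched.add(name)
first = sched.next_interval()
second = sched.next_interval()
third = sched.next_interval()
```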

Accuracy
For 18 out of 29 benchmarks, zsim is within 10% of the real system (average absolute performance error around 9.7%).
[Figure: absolute performance error, simulated vs. real]
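For readers who want to reproduce this kind of summary, the sketch below assumes the usual definition of absolute performance error, |simulated - real| / real, applied to a performance metric such as IPC. The slide does not spell out the formula, and the benchmark names and values here are made up purely for illustration.

```python
def abs_perf_error(simulated, real):
    """Absolute performance error relative to the real system (assumed definition)."""
    return abs(simulated - real) / real

# Hypothetical (simulated IPC, real IPC) pairs, for illustration only.
samples = {
    "bench_a": (1.05, 1.00),
    "bench_b": (0.80, 1.00),
    "bench_c": (1.96, 2.00),
}
errors = {name: abs_perf_error(s, r) for name, (s, r) in samples.items()}
within_10 = sum(e <= 0.10 for e in errors.values())  # benchmarks within 10%
mean_error = sum(errors.values()) / len(errors)      # average absolute error
```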

Accuracy: cache-level MPKI (misses per kilo-instruction) error
The error grows along the memory hierarchy.
TLB misses are currently not modelled.

Thousand-Core Performance
Simulating a single system in sequence: 1.32 trillion instructions take from 1.8 hours (IPC1-NC: IPC1 cores, no contention) to 8.9 hours (OOO-C: OOO cores with contention).
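The aggregate simulation rate implied by these figures follows from dividing the instruction count by the run time. The sketch below only restates the slide's numbers as MIPS; it assumes the quoted times cover the whole 1.32 trillion instructions.

```python
INSTRUCTIONS = 1.32e12  # from the slide

def mips(instructions, hours):
    """Aggregate simulation rate in millions of instructions per second."""
    return instructions / (hours * 3600) / 1e6

ipc1_nc = mips(INSTRUCTIONS, 1.8)  # IPC1 cores, no contention: about 204 MIPS
ooo_c = mips(INSTRUCTIONS, 8.9)    # OOO cores with contention: about 41 MIPS
```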

Speedup
[Figure: simulation performance of single-threaded ZSIM on SPEC2006 using 4 models: IPC1 or OOO cores, each with and without contention]

Speedup
[Figure: average ZSIM speedup on the workloads as the number of host threads increases from 1 to 32 (16 cores with 2 hardware threads per core)]

Comparison with other simulators
Other parallel simulators report 1-10 MIPS.
Many factors (host, workloads, memory-intensive applications) can contribute to the difference.
ZSIM is 2-3 orders of magnitude faster than other simulators.

Conclusion
New techniques to achieve both speed and accuracy:
DBT-based timing models.
Bound-weave parallelization.
Lightweight virtualization of user processes.
Together, these raise the simulation speed (in MIPS) of thousand-core simulation.

References
[1] Daniel Sanchez and Christos Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," ISCA 2013.