Pre-Silicon Simulation of Multi-Core Benchmarks Shubu Mukherjee Principal Engineer Director, SPEARS Group Intel Corporation Panel in Symposium on Workload.


Pre-Silicon Simulation of Multi-Core Benchmarks
Shubu Mukherjee, Principal Engineer and Director, SPEARS Group, Intel Corporation
Panel in Symposium on Workload Characterization, Sep 27, 2007

Slide 2: Detailed Model Good for Core Analysis
- A single-core simulation model executes ~12 milliseconds of a real machine's execution
  - Assumes core model speed = 1 KIPS (kilo simulated instructions per second)
  - Assumes each simulation run takes about 10 hours
[Diagram labels: Core, Uncore, Socket]
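The 12-millisecond figure follows from the two stated assumptions once a real-machine instruction rate is fixed. The sketch below works through that arithmetic. The real-machine rate of 3 billion instructions per second (`real_ips`) is an assumption chosen here because it reproduces the slide's number; the slide does not state it, and the function name is hypothetical.

```python
def simulated_window_ms(sim_kips, run_hours, real_ips=3e9):
    """Milliseconds of real-machine execution covered by one simulation run.

    sim_kips  -- simulator speed in kilo simulated instructions per second
    run_hours -- wall-clock length of one simulation run
    real_ips  -- ASSUMED real-machine rate in instructions per second
                 (not given on the slide)
    """
    # Total instructions the simulator gets through in one run.
    simulated_instructions = sim_kips * 1_000 * run_hours * 3_600
    # Time the real machine would need for the same instructions, in ms.
    return simulated_instructions / real_ips * 1_000


# 1 KIPS for 10 hours is 36 million instructions, roughly 12 ms of real time.
print(simulated_window_ms(sim_kips=1, run_hours=10))
```

Changing `real_ips` shows how sensitive the window is to the assumed machine: a faster real machine makes the same 10-hour run cover an even shorter slice of execution.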

Slide 3: Four-Socket Platform Model Too Slow
- A 1-socket simulation model executes ~1-3 milliseconds of a real machine's execution
- A 4-socket simulation model executes only hundreds of microseconds of a real machine's execution (recall that disk latency is in milliseconds)
- Need at least a 10x boost in platform performance model speed
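One way to see why the window shrinks as the platform grows is to assume a sequential simulator whose throughput is shared evenly among all simulated cores. The sketch below uses that simplification, plus an assumed 4 cores per socket and the 12 ms single-core window from the previous slide. None of these modeling choices come from the talk, and uncore and platform overheads are ignored, so the output is only a rough consistency check against the slide's ranges.

```python
def per_core_window_us(single_core_window_ms, cores, speedup=1.0):
    """Microseconds of real execution simulated per core in one run.

    ASSUMPTION: a sequential simulator divides its throughput evenly
    across all simulated cores; uncore cost is ignored.
    """
    return single_core_window_ms * 1_000 * speedup / cores


CORES_PER_SOCKET = 4  # assumed for illustration, not stated on the slide

for sockets in (1, 4):
    cores = sockets * CORES_PER_SOCKET
    base = per_core_window_us(12, cores)
    boosted = per_core_window_us(12, cores, speedup=10)
    print(f"{sockets} socket(s): {base:.0f} us now, {boosted:.0f} us with a 10x boost")
```

Under these assumptions a 10x boost brings the four-socket window back into the millisecond range, which is the scale of the disk latency the slide mentions.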

Slide 4: What Does a 10x Speed Improvement Give Us?
- Improved accuracy
  - Via greater coverage of benchmark slices
  - Better glassjaw analysis
- Faster turnaround
  - Improved latency
  - Faster debugging
- Improved benchmarking
  - Greater coverage of benchmarks
  - Enables multithreaded (cooperative) benchmarks

Slide 5: Approaches to Boost Simulation Speed (one key charter for SPEARS)
- Improve basic infrastructure
- Create faster core models that are less accurate
- Go parallel in a modular fashion
- Use accelerators, such as FPGAs

Slide 6: What's Novel Here?
Parallel simulation is an old technology
- Distributed, discrete-event simulation, Fujimoto, 1990
- Wisconsin Wind Tunnel I + II, Reinhardt, et al. 1992 & Mukherjee, et al.
- Customized for specific applications (e.g., shared memory)
So, what are the challenges?
- The starting point is several million lines of non-parallel C++ code (!)
- This is production software, so it must be stable (unlike "research" software)
- The parallel infrastructure must be modular: built once, then used repeatedly without changing any architecture model code
- New problems must be dealt with: load imbalance at multiple levels
Current status: infrastructure created, work in progress
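The modularity requirement can be pictured as a thin synchronization layer wrapped around unchanged sequential socket models. The sketch below is a minimal illustration of one common scheme, quantum-based barrier synchronization with one thread per socket. It is not the SPEARS infrastructure: every name is hypothetical, the "socket model" is a one-line stand-in, and Python threads are used only to show the control structure, not to obtain real speedup.

```python
import threading


def run_sockets_in_lockstep(num_sockets, num_quanta, cycles_per_quantum):
    """Advance one thread per simulated socket, one quantum at a time."""
    clocks = [0] * num_sockets  # simulated cycle count per socket
    skews = []                  # clock skew observed at each barrier

    def record_skew():
        # Runs in exactly one thread after all sockets have arrived and
        # before any is released, so the clocks are stable here.
        skews.append(max(clocks) - min(clocks))

    barrier = threading.Barrier(num_sockets, action=record_skew)

    def socket_thread(socket_id):
        for _ in range(num_quanta):
            # Stand-in for calling the unmodified, sequential socket model.
            clocks[socket_id] += cycles_per_quantum
            # No socket starts the next quantum until all have finished this one.
            barrier.wait()

    threads = [
        threading.Thread(target=socket_thread, args=(s,))
        for s in range(num_sockets)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return clocks, skews


clocks, skews = run_sockets_in_lockstep(num_sockets=4, num_quanta=5,
                                        cycles_per_quantum=1_000)
print(clocks, skews)
```

The barrier is also where load imbalance shows up: a socket that finishes its quantum early sits idle until the slowest socket arrives.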

Slide 7: Speedup of the Pthread-per-Socket Model (on Clovertowns)
- Speedup scales linearly with problem size
- A lot more room for improvement exists
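One source of that remaining headroom can be illustrated with the load-imbalance challenge from the previous slide. Assuming a thread-per-socket model in which all threads wait for the slowest one, the achievable speedup is bounded by total work divided by the largest per-socket share. The sketch below computes that bound. The work figures are made-up illustrations, not measurements from the talk, and synchronization overhead is ignored.

```python
def imbalance_bounded_speedup(work_per_socket):
    """Upper bound on speedup when every thread waits for the slowest one.

    work_per_socket -- host time each socket model needs (any consistent unit)
    Sequential time is the sum; parallel time is the maximum.
    """
    return sum(work_per_socket) / max(work_per_socket)


# Hypothetical workloads, for illustration only.
print(imbalance_bounded_speedup([10, 10, 10, 10]))  # perfectly balanced
print(imbalance_bounded_speedup([16, 8, 8, 8]))     # one overloaded socket
```

With four balanced sockets the bound equals the thread count, while a single overloaded socket caps it well below that, however many host cores are available.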