Projects Using gem5 ParaDIME (2012 – 2015) RoMoL (2013 – 2018)

Slides:



Advertisements
Similar presentations
RAMP Gold : An FPGA-based Architecture Simulator for Multiprocessors Zhangxi Tan, Andrew Waterman, David Patterson, Krste Asanovic Parallel Computing Lab,
Advertisements

Multiprocessors— Large vs. Small Scale Multiprocessors— Large vs. Small Scale.
Exploring Memory Consistency for Massively Threaded Throughput- Oriented Processors Blake Hechtman Daniel J. Sorin 0.
Lecture 6: Multicore Systems
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.
International Symposium on Low Power Electronics and Design Energy-Efficient Non-Minimal Path On-chip Interconnection Network for Heterogeneous Systems.
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
The Locality-Aware Adaptive Cache Coherence Protocol George Kurian 1, Omer Khan 2, Srini Devadas 1 1 Massachusetts Institute of Technology 2 University.
System Simulation Of 1000-cores Heterogeneous SoCs Shivani Raghav Embedded System Laboratory (ESL) Ecole Polytechnique Federale de Lausanne (EPFL)
This project and the research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/ ] under.
CS 258 Parallel Computer Architecture Lecture 15.1 DASH: Directory Architecture for Shared memory Implementation, cost, performance Daniel Lenoski, et.
University of Houston So What’s Exascale Again?. University of Houston The Architects Did Their Best… Scale of parallelism Multiple kinds of parallelism.
Continuously Recording Program Execution for Deterministic Replay Debugging.
What Great Research ?s Can RAMP Help Answer? What Are RAMP’s Grand Challenges ?
1 Multi - Core fast Communication for SoPC Multi - Core fast Communication for SoPC Technion – Israel Institute of Technology Department of Electrical.
OPERATING SYSTEM OVERVIEW
1 Breakout thoughts (compiled with N. Carter): Where will RAMP be in 3-5 Years (What is RAMP, where is it going?) Is it still RAMP if it is mapping onto.
1 Dr. Frederica Darema Senior Science and Technology Advisor NSF Future Parallel Computing Systems – what to remember from the past RAMP Workshop FCRC.
Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.
Figure 1.1 Interaction between applications and the operating system.
An Efficient Programmable 10 Gigabit Ethernet Network Interface Card Paul Willmann, Hyong-youb Kim, Scott Rixner, and Vijay S. Pai.
COMPUTER SYSTEMS An Integrated Approach to Architecture and Operating Systems Chapter 14 Epilogue: A Look Back at the Journey ©Copyright 2008 Umakishore.
Senior Design May AbstractDesign Alex Frisvold Alex Meyer Nazmus Sakib Eric Van Buren Our project is to develop a working emulator for an Android.
Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.
“The Architecture of Massively Parallel Processor CP-PACS” Taisuke Boku, Hiroshi Nakamura, et al. University of Tsukuba, Japan by Emre Tapcı.
TRACEREP: GATEWAY FOR SHARING AND COLLECTING TRACES IN HPC SYSTEMS Iván Pérez Enrique Vallejo José Luis Bosque University of Cantabria TraceRep IWSG'15.
Virtualization: Not Just For Servers Hollis Blanchard PowerPC kernel hacker.
Uncovering the Multicore Processor Bottlenecks Server Design Summit Shay Gal-On Director of Technology, EEMBC.
TILEmpower-Gx36 - Architecture overview & performance benchmarks – Presented by Younghyun Jo 2013/12/18.
1 Computer Architecture Research Overview Rajeev Balasubramonian School of Computing, University of Utah
Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee and Margaret Martonosi.
Chapter 2 Parallel Architecture. Moore’s Law The number of transistors on a chip doubles every years. – Has been valid for over 40 years – Can’t.
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.
Heterogeneous Multikernel OS Yauhen Klimiankou BSUIR
Next Generation Operating Systems Zeljko Susnjar, Cisco CTG June 2015.
An Architecture and Prototype Implementation for TCP/IP Hardware Support Mirko Benz Dresden University of Technology, Germany TERENA 2001.
Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili.
Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.
An Accurate and Detailed Prefetching Simulation Framework for gem5 Martí Torrents, Raúl Martínez, and Carlos Molina Computer Architecture.
Lluc Álvarez, Lluís Vilanova, Miquel Moretó, Marc Casas, Marc Gonzàlez, Xavier Martorell, Nacho Navarro, Eduard Ayguadé, Mateo Valero Coherence.
Computer Science and Engineering Copyright by Hesham El-Rewini Advanced Computer Architecture CSE 8383 May 2, 2006 Session 29.
(1) SIMICS Overview. (2) SIMICS – A Full System Simulator Models disks, runs unaltered OSs etc. Accuracy is high (e.g., pollution effects factored in)
Background Computer System Architectures Computer System Software.
Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.
VU-Advanced Computer Architecture Lecture 1-Introduction 1 Advanced Computer Architecture CS 704 Advanced Computer Architecture Lecture 1.
Compiler Research How I spent my last 22 summer vacations Philip Sweany.
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Processes Chapter 3. Processes in Distributed Systems Processes and threads –Introduction to threads –Distinction between threads and processes Threads.
PERFORMANCE OF THE OPENMP AND MPI IMPLEMENTATIONS ON ULTRASPARC SYSTEM Abstract Programmers and developers interested in utilizing parallel programming.
Fall 2012 Parallel Computer Architecture Lecture 4: Multi-Core Processors Prof. Onur Mutlu Carnegie Mellon University 9/14/2012.
Outline Installing Gem5 SPEC2006 for Gem5 Configuring Gem5.
Microarchitecture.
For Massively Parallel Computation The Chaotic State of the Art
The Multikernel: A New OS Architecture for Scalable Multicore Systems
Structural Simulation Toolkit / Gem5 Integration
Decoupled Access-Execute Pioneering Compilation for Energy Efficiency
University of Technology
OS Virtualization.
Address Translation for Manycore Systems
Directory-based Protocol
HPC User Forum 2012 Panel on Potential Disruptive Technologies Emerging Parallel Programming Approaches Guang R. Gao Founder ET International.
Using Packet Information for Efficient Communication in NoCs
Evaluating Asymmetric Multiprocessing for Mobile Applications
A High Performance SoC: PkunityTM
Introduction to Heterogeneous Parallel Computing
High Performance Computing
EE 4xx: Computer Architecture and Performance Programming
Operating System Introduction.
Chip&Core Architecture
CSE 542: Operating Systems
Presentation transcript:

Experiences with gem5 @ RoMoL project Miquel Moretó gem5 Users Workshop, June 2015

Projects Using gem5 ParaDIME (2012 – 2015) RoMoL (2013 – 2018) Partners: BSC, TU Dresden, UNeuchatel, IMEC, GmbH Low power systems RoMoL (2013 – 2018) ERC Advanced Grant Hardware – Software Co-design Mont Blanc 3 (starting Oct 2015) Partners: Bull, ARM, BSC, CNRS, UniCan Trace-driven gem5 (CNRS) Multiscale simulation

List of gem5 Users @ BSC Victor Garcia: Research on CPU+GPU architectures with gem5gpu Lluis Vilanova: gem5 to McPAT converter (https://projects.gso.ac.upc.edu/projects/gem5-mcpat) Gulay Yalcin: Fault injection and voltage scaling in different pipeline structures Ruben Titos: HW acceleration of message passing Milan Stanic: Low power vector processors (ARM) Lluc Alvarez: Heterogeneous memory systems with scratchpad memories alongside L1 caches Emilio Castillo: Adaptive DVFS techniques for large-scale multicores Performance analysis and visualization tools Full system simulation support for Minor CPU with x86 ISA Paul Caheny: Coherence protocols in multicores Ivan Perez: Booksim Network Simulator integrated in Ruby

Heterogeneous Memory Systems Hybrid memory hierarchy L1 cache + Local memories (LM) More difficult to manage, but More energy efficient Less coherence traffic LM Management in OpenMP (ISCA’15) Strided accesses served by the LM Irregular accesses served by the L1 cache HW support for coherence and consistency L1 L1 C C LM LM L1 L1 C C LM LM Cluster Interconnect L1 L1 C C LM LM L1 L1 C C LM LM L2 L3 DRAM DRAM Lluc Alvarez et al. Coherence Protocol for Transparent Management of Scratchpad Memories in shared Memory Manycore Architectures. ISCA 2015.

Heterogeneous Memory Systems x86, syscall emulation, 64 ooo cores, Ruby, NAS benchmarks Some required modifications for these experiments: MOESI Coherence Protocol Deadlocks Consistency Model Issues Functional accesses in Ruby New implementations Scratchpad memory, DMA controllers, additional directories Modified coherence protocol Ruby – Sequencer hooks

Adaptive DVFS Techniques Runtime system computation power according to task criticality and total power budget 16% and 45% improvements in execution time and EDP big big big big little little little little Reconfig overhead grows with the number of cores Runtime Support Unit (RSU) Extended ISA to notify task execution RSU reconfigures core speed

Adaptive DVFS for large-scale multicores x86, full system, 32 ooo cores, Ruby, PARSEC benchmarks Some required modifications for these experiments: Interrupts clobbering Context switches issues with FP/SSE registers Pipeline flush due to aliased loads New implementations Basic DVFS Controller (replaced later with the ARM one). Enabling different clock domains for the Ruby memory model (submitted patch).

Performance Analysis & Visualization Objectives: Understand execution of parallel applications Tune performance of simulated parallel applications Extend gem5 to generate execution traces of applications without affecting their timing Runtime system informs the timing model to start tracing Gather statistics and generate trace during simulation Traces are visualized with paraver after/during simulation Performance analysis and visualization tools. To better understand the execution of parallel applications, we extended gem5 to generate execution traces of applications without affecting their timing. We are able to generate traces that are processed later with paraver (equivalent to Vampir) and are very helpful to understand the behavior of the application.

Tune Applications: Blackscholes 32 cores CPU time idle time time Performance problem: too fine parallelization Solution: increase block size

Performance Analysis: Bodytrack I/O Tasks 16 cores CPU Bound Tasks Idle threads time

High IPC Low IPC Idle IPC Performance Analysis 4 big cores Task state 4 little cores High IPC Low IPC Idle IPC

Conclusions Pros Cons Wishlist Detailed simulation HW-SW interaction Simulation time Some commits break your code Wishlist Simulation time reduction Dyniamically switch level of detail? Better testing infrastructure OS boot? Support for more than 8 cores with ARM ISA

THANK YOU!

Experiences with gem5 @ RoMoL project Miquel Moretó gem5 Users Workshop, June 2015