Computation Spreading: Employing Hardware Migration to Specialize CMP Cores On-the-fly
Koushik Chakraborty, Philip Wells, Gurindar Sohi

Paper Overview
- Multiprocessor code reuse
  - Poor resource utilization
- Computation Spreading
  - New model for assigning computation within a program to CMP cores in hardware
  - Case study: OS and user computation
- Investigate performance characteristics

Talk Outline
- Motivation
- Computation Spreading (CSP)
  - Case study: OS and user computation
- Implementation
- Results
- Related Work and Summary

Homogeneous CMP
- Many existing systems are homogeneous
  - Sun Niagara, IBM Power 5, Intel Xeon MP
- Multithreaded server applications
  - Composed of server threads; typically each thread handles a client request
  - OS assigns software threads to cores
  - The entire computation of one thread executes on a single core (barring migration)

Code Reuse
- Many client requests are similar
  - Similar service across multiple threads
  - The same code paths are traversed on multiple cores
- Instruction footprint classification (see the sketch below)
  - Exclusive: accessed by a single core
  - Common: accessed by many cores
  - Universal: accessed by all cores
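
To make the classification concrete, here is a minimal sketch, not the paper's measurement infrastructure: it assumes a hypothetical fetch trace of (core, instruction-block address) pairs and bins each block as exclusive, common, or universal by counting how many distinct cores fetched it.

```python
from collections import defaultdict

def classify_instruction_blocks(fetch_trace, num_cores):
    """Bin instruction blocks by how many distinct cores fetched them.

    fetch_trace: iterable of (core_id, block_addr) pairs -- a hypothetical
    trace format assumed for this illustration.
    """
    cores_touching = defaultdict(set)          # block_addr -> set of core ids
    for core_id, block_addr in fetch_trace:
        cores_touching[block_addr].add(core_id)

    footprint = {"exclusive": 0, "common": 0, "universal": 0}
    for cores in cores_touching.values():
        if len(cores) == 1:
            footprint["exclusive"] += 1        # touched by a single core
        elif len(cores) == num_cores:
            footprint["universal"] += 1        # touched by every core
        else:
            footprint["common"] += 1           # touched by several cores
    return footprint

# Example: a toy trace on a 4-core CMP
trace = [(0, 0x100), (1, 0x100), (2, 0x100), (3, 0x100),  # universal block
         (0, 0x200), (1, 0x200),                           # common block
         (2, 0x300)]                                       # exclusive block
print(classify_instruction_blocks(trace, num_cores=4))
# {'exclusive': 1, 'common': 1, 'universal': 1}
```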

Multiprocessor Code Reuse

Implications
- Lack of instruction stream specialization
  - Redundancy in predictive structures
  - Poor capacity utilization
  - Destructive interference
- No synergy among multiple cores
  - Lost opportunity for cooperation
  - Exploit core proximity in the CMP

Talk Outline
- Motivation
- Computation Spreading (CSP)
  - Case study: OS and user computation
- Implementation
- Results
- Related Work and Summary

Computation Spreading (CSP)
- Computation fragment = a portion of the dynamic instruction stream
- Collocate similar computation fragments from multiple threads
  - Enhance constructive interference
- Distribute dissimilar computation fragments from a single thread
  - Reduce destructive interference
- Reassignment is the key

Example
Three threads T1, T2, T3 each execute computation fragments A, B, and C over time:
  T1: A1 B1 C1 A1 B1 C1 ...
  T2: B2 C2 A2 B2 C2 A2 ...
  T3: C3 A3 B3 C3 A3 B3 ...
Canonical assignment: each thread stays on its own core (P1 runs all of T1, P2 all of T2, P3 all of T3).
CSP assignment: each core runs one kind of fragment from all threads (P1 runs the A fragments, P2 the B fragments, P3 the C fragments).
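
The example can also be written down as a tiny scheduling exercise. The sketch below is purely illustrative: it takes the three per-thread fragment streams from the slide and prints the canonical (thread-per-core) assignment next to a CSP-style (fragment-type-per-core) assignment.

```python
# Per-thread dynamic streams from the example slide; each entry is
# (fragment_type, thread_id), e.g. ("A", 1) is fragment A1.
streams = {
    1: [("A", 1), ("B", 1), ("C", 1)],
    2: [("B", 2), ("C", 2), ("A", 2)],
    3: [("C", 3), ("A", 3), ("B", 3)],
}

# Canonical: every fragment of thread t runs on core t.
canonical = {core: streams[core] for core in streams}

# CSP: all fragments of the same type are collocated on one core.
core_for_type = {"A": 1, "B": 2, "C": 3}
csp = {core: [] for core in core_for_type.values()}
for thread_stream in streams.values():
    for frag_type, thread_id in thread_stream:
        csp[core_for_type[frag_type]].append((frag_type, thread_id))

print("canonical:", canonical)
print("CSP:      ", csp)
# Under CSP, core 1 sees only A fragments (A1, A2, A3), so its i-cache and
# branch predictor are trained by one code region instead of three.
```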

Key Aspects
- Dynamic specialization
  - Homogeneous multicore acquires specialization by retaining mutually exclusive predictive state
- Data locality
  - Data dependencies between different computation fragments
  - Careful fragment selection to avoid loss of data locality

Selecting Fragments
- Server workload characteristics
  - Large data and instruction footprints
  - Significant OS computation
- User computation and OS computation: a natural separation (see the sketch below)
  - Exclusive instruction footprints
  - Relatively independent data footprints
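
One way to see why the user/OS split is a natural fragment boundary: the dynamic instruction stream can be cut wherever the privilege level changes (system-call or trap entry and return). The sketch below assumes a hypothetical trace of (pc, privileged) pairs and is only an illustration of that boundary, not the paper's mechanism.

```python
def split_user_os_fragments(trace):
    """Split a dynamic instruction trace into user/OS computation fragments.

    trace: iterable of (pc, is_privileged) pairs -- a hypothetical trace
    format assumed for illustration. A new fragment starts whenever the
    privilege level flips (syscall/trap entry or return).
    """
    fragments = []
    current, current_priv = [], None
    for pc, is_privileged in trace:
        if is_privileged != current_priv and current:
            fragments.append(("OS" if current_priv else "user", current))
            current = []
        current_priv = is_privileged
        current.append(pc)
    if current:
        fragments.append(("OS" if current_priv else "user", current))
    return fragments

# Toy trace: user code, a system call, then user code again.
trace = [(0x1000, False), (0x1004, False),     # user fragment
         (0x80000, True), (0x80004, True),     # OS fragment (trap handler)
         (0x1008, False)]                      # back to user
for kind, pcs in split_user_os_fragments(trace):
    print(kind, [hex(p) for p in pcs])
```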

Data Communication
[Figure: the user and OS portions of threads T1 and T2 (T1-User, T1-OS, T2-User, T2-OS) and the data communication among them across Core 1 and Core 2]

Relative Inter-core Data Communication
[Chart: Apache and OLTP]
- OS-User communication is limited

Talk Outline
- Motivation
- Computation Spreading (CSP)
  - Case study: OS and user computation
- Implementation
- Results
- Related Work and Summary

Implementation
- Migrating computation (see the sketch below)
  - Transfer register state through the memory subsystem (~2 KB of register state in SPARC V9)
  - Memory state follows through coherence
- Lightweight Virtual Machine Monitor (VMM)
  - Migrates computation as dictated by the CSP policy
  - Implemented in hardware/firmware
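
As a rough illustration of the migration step (a hedged sketch, not the actual VMM): the source core spills its ~2 KB of architectural register state to a buffer in the memory subsystem, and the destination core fills its registers from that buffer; cached data is not copied and instead follows later through coherence. The buffer and the byte-level modeling below are assumptions made for the example.

```python
REGISTER_STATE_BYTES = 2 * 1024   # ~2 KB of SPARC V9 register state (from the slide)

def migrate_vcpu(vcpu_regs, memory_buffer):
    """Illustrative two-step migration: spill register state to memory on the
    source core, then fill it on the destination core. Cached data is not
    copied; it moves later, on demand, via the coherence protocol."""
    # Step 1 (source core): spill architectural register state to a buffer
    # that lives in the memory subsystem.
    memory_buffer[:len(vcpu_regs)] = vcpu_regs
    # Step 2 (destination core): fill register state from that buffer; in
    # hardware these reads would miss and pull the lines over the interconnect.
    return bytearray(memory_buffer[:len(vcpu_regs)])

# Toy usage: 2 KB of register state moves from "core 0" to "core 1".
regs_on_core0 = bytearray(range(256)) * 8            # 2048 bytes of fake state
shared_buffer = bytearray(REGISTER_STATE_BYTES)
regs_on_core1 = migrate_vcpu(regs_on_core0, shared_buffer)
assert regs_on_core1 == regs_on_core0
```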

Implementation (cont.)
[Figure, baseline: the software stack (threads, virtual CPUs, physical cores) with user computation, OS computation, user cores, and OS cores labeled]

Implementation (cont.)
[Figure: the same software stack (threads, virtual CPUs, physical cores), now with user cores and OS cores shown separately]

CSP Policy
- The policy dictates computation assignment (a sketch of the two policies follows)
- Thread Assignment Policy (TAP)
  - Maintains affinity between VCPUs and physical cores
- Syscall Assignment Policy (SAP)
  - OS computation is assigned based on system calls
- TAP and SAP use identical assignment for user computation
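
A minimal sketch of how the two OS-assignment policies might differ; the 4/4 user/OS core split and the modulo hashing are assumptions for illustration, not the paper's exact mechanism. TAP keeps a VCPU's OS computation on a core tied to that VCPU, while SAP picks the core from the system-call number, so the same syscall from different threads lands on the same core.

```python
USER_CORES = [0, 1, 2, 3]    # assumed split of an 8-core CMP; illustrative only
OS_CORES = [4, 5, 6, 7]

def assign_user(vcpu_id):
    """Both policies place a VCPU's user computation the same way:
    by affinity between the VCPU and a user core."""
    return USER_CORES[vcpu_id % len(USER_CORES)]

def assign_os_tap(vcpu_id, syscall_no):
    """Thread Assignment Policy: a VCPU's OS computation always goes to
    the OS core associated with that VCPU (the syscall number is ignored)."""
    return OS_CORES[vcpu_id % len(OS_CORES)]

def assign_os_sap(vcpu_id, syscall_no):
    """Syscall Assignment Policy: OS computation is assigned by system
    call, so the same syscall from different VCPUs lands on the same core."""
    return OS_CORES[syscall_no % len(OS_CORES)]

# Two VCPUs issuing the same syscall (number 5):
for vcpu in (0, 1):
    print(vcpu, assign_os_tap(vcpu, 5), assign_os_sap(vcpu, 5))
# TAP sends them to different OS cores (4 and 5);
# SAP sends both to the same OS core (OS_CORES[5 % 4] = core 5).
```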

Talk Outline
- Motivation
- Computation Spreading (CSP)
  - Case study: OS and user computation
- Implementation
- Results
- Related Work and Summary

Simulation Methodology
- Virtutech Simics MAI running Solaris 9
- CMP system: 8 out-of-order processors
  - 2-wide, 8 stages, 128-entry ROB, 3 GHz
- 3-level memory hierarchy
  - Private L1 and L2, directory-based MOSI
  - L3: shared, exclusive, 8 MB, 16-way (75-cycle load-to-use)
  - Point-to-point ordered interconnect (25-cycle latency)
  - Main memory: 255-cycle load-to-use, 40 GB/s
- Measure impact on predictive structures

L2 Instruction Reference

Result Summary
- Branch predictors
  - 9-25% reduction in mispredictions
- L2 data references
  - 0-19% reduction in load misses
  - Moderate increase in store misses
- Interconnect messages
  - Moderate reduction (after accounting for the extra messages needed for migration)

Performance Potential
[Chart; label: Migration Overhead]

Talk Outline
- Motivation
- Computation Spreading (CSP)
  - Case study: OS and user computation
- Implementation
- Results
- Related Work and Summary

Related Work
- Software redesign: staged execution
  - Cohort Scheduling [Larus and Parkes 01], STEPS [Ailamaki 04], SEDA [Welsh 01], LARD [Pai 98]
  - CSP: similar execution, in hardware
- OS and user interference [several]
  - Structural separation to avoid interference
  - CSP avoids interference and exploits synergy

Summary
- Extensive code reuse in CMPs
  - 45-66% of instruction blocks are universally accessed in server workloads
- Computation Spreading
  - Localize similar computation and separate dissimilar computation
  - Exploits core proximity in CMPs
- Case study: OS and user computation
  - Demonstrates substantial performance potential

Thank You!

Backup Slides

L2 Data Reference
- L2 load misses are comparable; slight to moderate increase in L2 store misses

Multiprocessor Code Reuse

Performance Potential