CMP-MSI, Feb. 11th 2007
Core to Memory Interconnection Implications for Forthcoming On-Chip Multiprocessors
Carmelo Acosta (1), Francisco J. Cazorla (2), Alex Ramírez (1,2), Mateo Valero (1,2)
(1) UPC-Barcelona  (2) Barcelona Supercomputing Center
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Introduction
- As process technology advances, deciding what to do with the additional transistors becomes ever more important.
- The current trend is to replicate cores:
  - Intel: Pentium 4, Core Duo, Core 2 Duo, Core 2 Quad
  - AMD: Opteron Dual-Core, Opteron Quad-Core
  - IBM: POWER4, POWER5
  - Sun Microsystems: Niagara T1, Niagara T2
Introduction
(Die photos: POWER4 (CMP) and POWER5 (CMP+SMT).)
- The memory subsystem (shown in green) spreads over more than half of the chip area.
Introduction
- Each L1 cache is connected to each L2 bank through a bus-based interconnection network.
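As a side note on the banked L2 organization shown in the figures, the C sketch below illustrates one possible way a cache-line address could be interleaved across the four shared L2 banks; the 64-byte line size and line-level interleaving are assumptions for illustration, not taken from the slides.

#include <stdint.h>
#include <stdio.h>

#define LINE_SIZE     64   /* assumed L2 line size in bytes         */
#define NUM_L2_BANKS   4   /* four banks (b0-b3), as in the figures */

/* Drop the line offset, then use the low line-address bits as the bank index. */
static unsigned l2_bank(uint64_t paddr)
{
    return (unsigned)((paddr / LINE_SIZE) % NUM_L2_BANKS);
}

int main(void)
{
    uint64_t addrs[] = { 0x1000, 0x1040, 0x1080, 0x10C0 };
    for (int i = 0; i < 4; i++)
        printf("addr 0x%llx -> L2 bank %u\n",
               (unsigned long long)addrs[i], l2_bank(addrs[i]));
    return 0;
}

With this interleaving, consecutive lines map to consecutive banks, so requests from the cores spread over the bus-connected banks.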
Goal
- Is prior research in the SMT field directly applicable to the new CMP+SMT scenario?
- No: well-known SMT ideas, such as the instruction fetch policy, have to be revisited.
ICOUNT
(Diagram: two threads sharing the fetch stage and ROB; a load from the blue thread misses in L2 and its fetch stalls.)
- ICOUNT keeps the processor's resources balanced between the running threads.
- All resources held by the blue thread remain unused until the L2 miss is resolved.
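To make the policy concrete, here is a minimal C sketch of ICOUNT-style thread selection: each cycle, fetch from the runnable thread holding the fewest in-flight instructions, so no thread monopolizes the shared resources. The structure and field names are illustrative assumptions, not the simulator's code.

#include <limits.h>
#include <stdio.h>

#define NUM_THREADS 2

struct thread_state {
    int in_flight;   /* instructions fetched but not yet committed   */
    int stalled;     /* fetch currently blocked (e.g., I-cache miss) */
};

/* Return the thread to fetch from this cycle, or -1 if none can fetch. */
static int icount_pick(const struct thread_state t[NUM_THREADS])
{
    int best = -1, best_count = INT_MAX;
    for (int i = 0; i < NUM_THREADS; i++)
        if (!t[i].stalled && t[i].in_flight < best_count) {
            best_count = t[i].in_flight;
            best = i;
        }
    return best;
}

int main(void)
{
    struct thread_state t[NUM_THREADS] = { { 5, 0 }, { 3, 0 } };
    printf("fetch from thread %d\n", icount_pick(t));   /* thread 1 */
    return 0;
}

Note that when a thread's load misses in L2 its in-flight count stays high, so ICOUNT fetches less from it, but the resources it already holds remain allocated until the miss returns.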
FLUSH
(Diagram: on the L2 miss, the blue thread's pending instructions are flushed and its fetch stalls.)
- When a thread experiences an L2 miss, FLUSH is triggered: all resources devoted to that thread's pending instructions are freed.
- The freed resources allow the other threads to make additional forward progress.
- Because an L2 miss is detected late, FLUSH can be combined with L2 miss prediction to act earlier.
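The C sketch below illustrates the FLUSH reaction, assuming a simple per-thread list of in-flight instruction sequence numbers (the data structures are assumptions for illustration): on a detected or predicted L2 miss, squash the offending thread's instructions younger than the load, free their entries, and stall its fetch until the data returns.

#include <stdio.h>

#define ROB_SIZE 16

struct thread {
    unsigned long rob[ROB_SIZE];  /* sequence numbers of in-flight insts */
    int rob_count;
    int fetch_stalled;
};

/* Squash every instruction younger than the missing load. */
static void flush_on_l2_miss(struct thread *t, unsigned long load_seq)
{
    int kept = 0;
    for (int i = 0; i < t->rob_count; i++)
        if (t->rob[i] <= load_seq)
            t->rob[kept++] = t->rob[i];
    printf("flushed %d instructions\n", t->rob_count - kept);
    t->rob_count = kept;          /* freed entries return to the shared pool */
    t->fetch_stalled = 1;         /* no fetch until the L2 fill arrives      */
}

int main(void)
{
    struct thread t = { { 100, 101, 102, 103, 104 }, 5, 0 };
    flush_on_l2_miss(&t, 101);    /* seq 101 is the load that missed */
    return 0;
}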
Single vs Multi Core
(Diagrams: one core vs. several cores, each with private I$/D$, connected to four shared L2 banks b0-b3.)
- With more cores there is more pressure on both the interconnection network and the shared L2 banks.
- As a result, the L2 access latency becomes more unpredictable, which is bad for FLUSH.
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Simulation Methodology
- Trace-driven SMT simulator derived from SMTsim.
- C2T2, C3T2, and C4T2 multicore configurations (CXTY, where X = number of cores and Y = threads per core).
(Figure and table omitted: multicore diagram with shared L2 banks, and a table of core details where * marks per-thread resources.)
Simulation Methodology
- Instruction fetch policies: ICOUNT and FLUSH.
- Workloads classified by type:
  - ILP: all threads have good memory behavior.
  - MEM: all threads have bad memory behavior.
  - MIX: mixes both types of threads.
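As a rough illustration of this classification, the C sketch below labels a two-thread workload as ILP, MEM, or MIX from per-thread L2 miss rates; the 1% cut-off is an assumed value, not the criterion used in the study.

#include <stdio.h>

enum wtype { ILP, MEM, MIX };

/* Classify a two-thread workload from its per-thread L2 miss rates. */
static enum wtype classify(double miss_rate0, double miss_rate1)
{
    const double bad = 0.01;  /* assumed "bad memory behavior" threshold */
    int bad0 = miss_rate0 > bad;
    int bad1 = miss_rate1 > bad;
    if (bad0 && bad1) return MEM;
    if (!bad0 && !bad1) return ILP;
    return MIX;
}

int main(void)
{
    printf("%d\n", classify(0.002, 0.050));  /* prints 2 (MIX) */
    return 0;
}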
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Results: Single-Core (2 threads)
- FLUSH yields a 22% average speedup over ICOUNT, obtained mainly in the MEM and MIX workloads.
Results: Multi-Core (2 threads/core)
- FLUSH drops to a 9% average slowdown with respect to ICOUNT in the four-core configuration.
- The more cores, the lower the speedup of FLUSH.
Results: L2 Hit Latency on Multi-Core
- The more cores, the higher and more dispersed the L2 hit latency.
(Figure: distribution of L2 hit latency, in cycles, per configuration.)
Results: L2 Miss Prediction
- In this four-core example, the best choice is predicting an L2 miss after 90 cycles.
- But in this other four-core example, the best choice is not to predict L2 misses at all.
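One plausible reading of "predicting an L2 miss after N cycles" is a timeout-based predictor, sketched in C below (the mechanism and names are assumptions for illustration): a load still outstanding after the threshold is treated as an L2 miss, so FLUSH can act before the miss is actually detected; a threshold of 0 disables prediction.

#include <stdio.h>

struct load {
    long issue_cycle;   /* cycle the load accessed the cache hierarchy */
    int  completed;     /* data already returned                       */
};

/* Predict an L2 miss once the load has been outstanding >= threshold cycles. */
static int predict_l2_miss(const struct load *ld, long now, long threshold)
{
    if (threshold <= 0 || ld->completed)
        return 0;
    return (now - ld->issue_cycle) >= threshold;
}

int main(void)
{
    struct load ld = { 1000, 0 };
    printf("%d\n", predict_l2_miss(&ld, 1095, 90));  /* 1: predicted miss */
    printf("%d\n", predict_l2_miss(&ld, 1095, 0));   /* 0: prediction off */
    return 0;
}

With a banked, bus-based L2 whose hit latency varies, a fixed threshold can misfire on slow hits, which may be why not predicting is the better choice in the second example.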
Overview
- Introduction
- Simulation Methodology
- Results
- Conclusions
Conclusions
- Future high-degree CMPs open new challenging research topics in CMP+SMT cooperation.
- The characteristics of the CMP's outer cache level and interconnection network may heavily affect SMT intra-core performance.
- For example, FLUSH relies on a predictable L2 hit latency, which is heavily affected in a CMP+SMT scenario: FLUSH drops from a 22% average speedup to a 9% average slowdown when moving from the single-core to the quad-core configuration.
CMP-MSI, Feb. 11th 2007
Thank you. Questions?