
TMA: A Trap-Based Memory Architecture
Håkan Zeffer, Zoran Radović, Martin Karlsson, Erik Hagersten
Uppsala University, Sweden
ICS'06, June 30th, 2006

Simultaneous Multithreading (SMT)
- Diminishing performance returns from ILP
- Increased chip parallelism from hardware threading (TLP)
  - IBM Power5, Intel Pentium 4, Sun T1 (Niagara)
- "No processor should come without multiple threads" [Dr. Tremblay]
[Figure: SMT pipeline with a shared fetch unit, decode/rename stage, and integer, floating-point, memory and branch pipes fed from L1I and L1D caches]

Chip Multiprocessors (CMPs)
[Figure: a CMP with four cores (P), each with private I and D caches, sharing an L2 over an on-chip interconnect]
- Examples: Piranha, IBM Power4, IBM Power5, Sun UltraSPARC IV+, Sun T1, Intel Core Duo, AMD dual-core Opteron

Multi-CMP Systems
[Figure: four CMPs (CMP 1-4) connected by an inter-chip interconnect; each CMP has four cores with private I/D caches and a shared L2]
- Larger systems are sometimes built from multiple CMPs
  - Piranha, IBM Power4 and IBM Power5

Multi-CMP Coherence
[Figure: four CMPs on an interconnect; intra-CMP coherence within each chip, inter-CMP coherence between chips]
- Intra-CMP protocol for coherence within a CMP
- Inter-CMP protocol for coherence between CMPs
- Interactions between the protocols increase complexity

Shared-Memory Trends
- Today's chips = yesterday's mid-range servers
  - The Sun T1 has 32 hardware threads on a single die
- Is it worth implementing multi-CMP systems?
  - Increased development cost
  - Increased verification cost
  - How big is the market?

Trap-Based Memory Architectures
- TMA: Trap-based Memory Architecture
- Basic idea
  - Optimize for commercial single-chip performance
  - Let simple HW and SW support enable scalability
- Coherence-violation detection in hardware
  - Trap on inter-chip coherence violations
- Solve inter-chip coherence misses in software
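To make the HW/SW division concrete, here is a toy, compilable sketch of the flow; all names, the state variable and the handler below are invented for illustration and are not the paper's code. Hardware detects the inter-chip violation (modeled here as an explicit check), software resolves the miss, and the access is simply retried.

```c
/* Toy model of the TMA flow; every identifier here is invented. */
#include <stdint.h>
#include <stdio.h>

enum lstate { INVALID, SHARED };       /* local state of one line      */

static enum lstate state = INVALID;
static uint64_t    local_copy;         /* this node's cached copy      */
static uint64_t    home_copy = 42;     /* valid copy at the home node  */

/* In real TMA this runs via the trap vector, on the hardware thread
 * that caused the miss, and fetches over the cluster network.        */
static void inter_chip_load_miss(void)
{
    local_copy = home_copy;            /* stands in for a network get  */
    state = SHARED;
}

static uint64_t checked_load(void)
{
    if (state == INVALID)              /* detected by hardware in TMA  */
        inter_chip_load_miss();        /* resolved in software         */
    return local_copy;                 /* the retried load now hits    */
}

int main(void)
{
    printf("%llu\n", (unsigned long long)checked_load());   /* 42 */
    return 0;
}
```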

Outline
- Introduction
- TMA and TMA Lite
- Evaluation methodology
- Results
- Related work
- Future work
- Conclusions

TMA Lite
- TMA Lite is a "minimal" TMA implementation
- Runtime system
  - Deadlock avoidance
  - Coherence protocol
- Per-application "scalability"
- Binary transparency
- No memory-system modifications
- Simple processor-core modifications
  - An inter-node load coherence check
  - An inter-node store coherence check

A TMA Lite System
- TMA Lite nodes
  - Single-chip systems with load and store coherence-check support
  - HW maintains intra-chip coherence
- TMA Lite cluster network
  - "InfiniBand-like": high bandwidth, low latency
  - Remote memory access (put, get and atomic)
- TMA Lite software
  - Coherence and consistency between nodes

The Load Check
- Magic value convention
  - Each cache line in state invalid contains a predefined value
- Hardware
  - A comparator on the load path detects this value
  - A trap is generated when the value is found
[Figure: the loaded data is compared against a magic-value register; the check can be enabled/disabled by system software and raises a load trap on a match]
- False misses
  - Occur when the magic value is used within an application
  - Easy to detect and solve within the coherence protocol
  - Rare
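A compilable emulation of the magic-value convention (the value, the handler and all names below are illustrative; in TMA Lite the comparison is a hardware comparator on the load path, not a software test): invalid lines are filled with the magic value, so a coherence miss is detected from the loaded value alone, with no state bits consulted on the load path. The false-miss case is noted in the comments.

```c
/* Emulated magic-value load check; value and names are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define WORDS 8
static const uint64_t MAGIC = 0xBADC0FFEE0DDF00Dull; /* magic-value register */

static uint64_t line[WORDS];     /* local copy, currently invalid      */
static uint64_t home[WORDS];     /* valid copy at the home node        */

static void load_trap(void)      /* software coherence protocol        */
{
    /* If the application itself stored MAGIC, this is a "false miss":
     * the protocol sees the line is already valid and just returns.   */
    memcpy(line, home, sizeof line);   /* stands in for a network get  */
}

static uint64_t checked_load(int i)
{
    uint64_t v = line[i];
    if (v == MAGIC) {            /* the comparator on the load path    */
        load_trap();
        v = line[i];             /* retry the load                     */
    }
    return v;
}

int main(void)
{
    for (int i = 0; i < WORDS; i++) { home[i] = 100 + i; line[i] = MAGIC; }
    printf("%llu\n", (unsigned long long)checked_load(3));   /* 103 */
    return 0;
}
```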

The Store Check
- Write permission cache (WPC)
  - Can be seen as a very small cache
  - Operates on virtual addresses
  - Accessed in parallel with the data TLB
  - Write permission for lines in the WPC is guaranteed by the protocol
[Figure: pipeline timing; the WPC is accessed in parallel with the data TLB during address generation, and TLB and WPC traps resolve alongside the L1 tag compare]
- The write permission cache has to be filled
  - A fill occurs on every WPC miss
  - Even if the node already has write permission
  - The overhead is often severe
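A small compilable model of a WPC (16 entries as in the evaluated configuration; the direct-mapped organization and all names are assumptions): the store path checks a tiny table indexed by the coherence-unit part of the virtual address, and a fill runs on every miss, even when the node already holds write permission, which is where the overhead the slide mentions comes from.

```c
/* Toy write permission cache; organization and names are assumed. */
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define WPC_ENTRIES 16
#define CU_SHIFT    6                   /* 64-byte coherence unit (assumed) */

static uint64_t wpc_tag[WPC_ENTRIES];
static bool     wpc_valid[WPC_ENTRIES];
static unsigned wpc_fills;              /* counts the fill overhead    */

static void wpc_fill(uint64_t unit)     /* runs on every WPC miss      */
{
    /* Real protocol work: obtain or merely confirm write permission.
     * The fill runs even if this node already had permission.         */
    unsigned idx   = unit % WPC_ENTRIES;
    wpc_tag[idx]   = unit;
    wpc_valid[idx] = true;
    wpc_fills++;
}

static void checked_store(uint64_t *p, uint64_t v)
{
    uint64_t unit = (uintptr_t)p >> CU_SHIFT;  /* virtual unit address */
    unsigned idx  = unit % WPC_ENTRIES;
    if (!wpc_valid[idx] || wpc_tag[idx] != unit)
        wpc_fill(unit);                 /* the store "trap"            */
    *p = v;                             /* permission now guaranteed   */
}

int main(void)
{
    uint64_t x = 0;
    checked_store(&x, 7);               /* WPC miss: fill              */
    checked_store(&x, 8);               /* WPC hit                     */
    printf("x=%llu fills=%u\n", (unsigned long long)x, wpc_fills);
    return 0;
}
```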

Simulator and Benchmarks
- Simics: full-system simulator
- Vasa: timing and memory-model extension
  - Cycle accurate
  - Power5-like SMT processor model
  - Models latency and bandwidth of caches, memory and network
- SPLASH-2 benchmarks

System Parameters
- Scaled-down Power5 chip
  - 1 or 2 processor cores per chip
  - 2 SMT threads per processor core
  - Write-through L1
  - Write-back L2 and L3 (L2 on-die, L3 tags on-die)
- The HW distributed shared memory system
  - Directory: fully mapped bit vector in dedicated SRAM
  - Coherence protocol: HW, highly optimized, non-blocking
- The TMA Lite system
  - Directory: fully mapped bit vector in ordinary DRAM memory
  - Coherence protocol: SW
    - A binary patch to Solaris modifies the trap vector
    - The coherence protocol runs on the hardware thread that caused the miss
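The only coherence state TMA Lite keeps is the fully mapped bit vector in ordinary DRAM. A sketch of what such an entry and a sharer walk might look like (the layout, field names and message function are assumptions, not the paper's format):

```c
/* Hypothetical fully mapped directory entry, as kept in ordinary DRAM. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t sharers;   /* one presence bit per node (up to 64 nodes) */
    uint8_t  owner;     /* owning node for modified units             */
    uint8_t  dirty;     /* nonzero if some node holds it modified     */
} dir_entry_t;

/* Would be a cluster-network message in a real system. */
static void send_inval(int node) { printf("inval -> node %d\n", node); }

/* Invalidate all other sharers before granting write permission. */
static void invalidate_sharers(dir_entry_t *e, int requester)
{
    uint64_t s = e->sharers & ~(1ULL << requester);
    while (s) {
        int node = __builtin_ctzll(s);  /* lowest set bit (GCC/Clang) */
        send_inval(node);
        s &= s - 1;                     /* clear that bit             */
    }
    e->sharers = 1ULL << requester;
    e->owner   = (uint8_t)requester;
    e->dirty   = 1;
}

int main(void)
{
    dir_entry_t e = { .sharers = 0xB, .owner = 0, .dirty = 0 }; /* nodes 0,1,3 */
    invalidate_sharers(&e, 3);          /* prints invals for nodes 0 and 1 */
    return 0;
}
```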

Execution Time Breakdown
[Figure: execution time normalized to the HW DSM; 4 nodes, load comparator + 16-entry WPC]

Coherence Protocol Breakdown
[Figure: breakdown of time spent in the coherence protocol]

SW Flexibility: Coherence Unit Size
[Figure: execution time for varying coherence-unit sizes, normalized to the HW DSM; 4 nodes, load comparator + 16-entry WPC]
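This knob exists because the protocol is software: the coherence-unit size is a runtime parameter rather than a wired constant. A minimal sketch of the idea (the shift values are illustrative); larger units amortize protocol invocations but can increase false sharing, which is the trade-off the figure explores.

```c
/* The SW protocol can choose the coherence-unit size per application. */
#include <stdint.h>
#include <stdio.h>

static unsigned cu_shift = 6;       /* 64 B units by default (assumed) */

static uint64_t cu_index(uint64_t vaddr) { return vaddr >> cu_shift; }

int main(void)
{
    uint64_t a = 0x12345;
    printf("64B unit index: 0x%llx\n", (unsigned long long)cu_index(a));
    cu_shift = 13;                  /* switch to 8 KB units            */
    printf("8KB unit index: 0x%llx\n", (unsigned long long)cu_index(a));
    return 0;
}
```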

Related Work
- SW only
  - Page-based systems: IVY, Munin, Cashmere, GeNIMA, TreadMarks and many more
    - Virtual memory used for coherence detection
  - Fine-grained systems: Shasta, Blizzard, Sirocco, DSZOOM
    - Coherence checks instrumented into applications
- HW support + software protocol
  - FLASH, Typhoon, S3.mp
    - A coherence processor executes the coherence protocol
  - SMTp
    - An SMT thread executes the coherence protocol

Future Work
- More mature TMA implementations
  - Coherence detection on physical addresses
  - System (instead of per-application) scalability
  - (Note: the proceedings version has a figure-text error; the online PDF is correct)
- One proposal is already available as a tech report
  - Available at:
  - New coherence detection scheme: no "false" load or store coherence misses
  - A new way to decouple inter- and intra-chip coherence
  - In-DRAM remote-access caching
  - Commercial applications
  - Many more experiments
  - Very promising results

Conclusions
- Shared-memory trends
  - SMT and CMP
  - Mid-range servers on a single chip
- Trap-based Memory Architecture
  - Designed for commercial single-chip performance
  - Simple and small HW structures for scalable shared memory
- TMA Lite
  - A "minimal" TMA implementation
  - Competitive with HW DSM when its flexibility is used
  - Promising for HPC when the runtime system is under control
- Given the right HW/SW tradeoff, simple and efficient scalable shared memory is possible
- A more mature TMA architecture appears in the next paper (the tech report)

Questions?

The Coherence Protocol