Diamonds are a Memory Controller’s Best Friend* *Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs.

Presentation transcript:

Diamonds are a Memory Controller’s Best Friend*
*Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs, from ISCA ’09. Those responsible for the original title have been sacked.
Dennis Abts, Google
Natalie Enright Jerger, University of Toronto
John Kim, KAIST
Dan Gibson, Univ of Wisconsin
Mikko Lipasti, Univ of Wisconsin

Executive Summary
On what tiles should memory controllers reside?
–Three-tiered simulation approach: heuristic-guided search, detailed network simulation, full-system simulation
Diamond MC placement works well for on-chip meshes and tori
–Diamonds minimize maximum channel load
–Diamonds deliver lower and more predictable runtimes

Background
Diverse on-chip communication
–Cache-to-cache
–LD/ST to Memory
–Off-chip traffic (e.g., I/O)
Processors/chip on the rise
–Pins available for memory not rising as fast: Memory bandwidth becomes more precious
–Reality: Many Cores, Few Memory Controllers
Tiled architectures gaining popularity
–Commonly employ on-chip meshes or tori

The Problem
What Memory Controller placement is best overall?
–Flip-chip packaging allows flexible escape routes
–n tiles and m ports: Don’t worry, there are only C(n, m) configurations!
–What are the characteristics of the best configuration?
  Performance: Low runtime for a set of objective workloads
  Throughput: Low latency as a function of offered load
  Fairness: Similar (low) average memory latency across all nodes
  Predictability: Low latency and runtime variance
Slight Simplification: Assume n = k^2 and m = 2k

Baseline Placement: row0_7
Ports to MCs located at top and bottom of chip
Conceptually similar to real parts:
–Tilera’s Tile64: 64 cores, 4 MCs (4 ports each, top/bottom of chip)
–Intel TeraFLOPs: 80 cores, 2 MCs (8 ports each, top/bottom of chip)
X-Dimension Traffic Encounters Congestion on Rows with Memory Controllers

Three-Tiered Approach
Tiers: Link Contention Simulation, Detailed Network Simulation, Full System
[Figure labels: More Runs, Shorter Runtimes, More Detail]

Tier 0.5: Exhaustive Search
It turns out C(k^2, 2k) is tractable for k<7
–(At least on the link contention simulator – only 3,268,760 possibilities for k=5)
Patterns Emerge!
Another Contender
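
The size of the search space can be checked with a short sketch. It assumes the simplification n = k^2 tiles and m = 2k ports, so the count is the binomial coefficient C(k^2, 2k). The function names are illustrative and do not come from the slides.

```python
from itertools import combinations
from math import comb


def num_placements(k):
    """Number of ways to choose 2k memory-controller tiles out of k*k."""
    return comb(k * k, 2 * k)


def all_placements(k):
    """Lazily enumerate every placement as a tuple of (x, y) tiles."""
    tiles = [(x, y) for y in range(k) for x in range(k)]
    return combinations(tiles, 2 * k)


print(num_placements(5))  # 3268760, the k=5 figure quoted on the slide
```

Each placement produced by `all_placements` would then be scored by the link contention simulator. The enumeration is lazy, so the full set of placements is never held in memory.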

Tier 1: Heuristic-Guided Search
k>6: Intractable to search all configurations
–Use search heuristics and random search
Genetic Algorithm:
–Represent designs as a population of strings (Bit Vectors)
–Generate new designs by combining members of the population via genetic crossover (Bit Selection)
–Occasionally, mutate new population members (Swap adjacent bits)
–Reduce population size by removing least-fit members – Survival of the Fittest
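
The four genetic-algorithm steps can be sketched as follows. This is a minimal illustration, not the authors' implementation. The repair step that restores exactly m set bits after crossover is an assumption, as are the mutation rate, the parent selection, and the convention that a lower fitness value is better (for example, a maximum channel load).

```python
import random


def crossover(a, b, n_bits, m, rng):
    """Bit selection: each bit of the child is copied from parent a or b."""
    full = (1 << n_bits) - 1
    mask = rng.getrandbits(n_bits)
    child = (a & mask) | (b & ~mask & full)
    # Repair (assumption, not stated on the slides): flip random bits
    # until exactly m memory-controller bits are set.
    while bin(child).count("1") != m:
        bit = 1 << rng.randrange(n_bits)
        too_many = bin(child).count("1") > m
        if too_many and (child & bit):
            child ^= bit
        elif not too_many and not (child & bit):
            child |= bit
    return child


def mutate(vec, n_bits, rng):
    """Swap two adjacent bits; the number of set bits is unchanged."""
    i = rng.randrange(n_bits - 1)
    if ((vec >> i) & 1) != ((vec >> (i + 1)) & 1):
        vec ^= (1 << i) | (1 << (i + 1))
    return vec


def next_generation(pop, fitness, n_bits, m, rng, mutation_rate=0.1):
    """Breed len(pop) children, then keep the len(pop) fittest overall."""
    children = []
    for _ in range(len(pop)):
        a, b = rng.sample(pop, 2)
        child = crossover(a, b, n_bits, m, rng)
        if rng.random() < mutation_rate:
            child = mutate(child, n_bits, rng)
        children.append(child)
    # Lower fitness is better here; the least-fit members are dropped.
    return sorted(pop + children, key=fitness)[:len(pop)]
```

Swapping adjacent bits keeps the controller count fixed, which may be why the slides chose it as the mutation operator. Bit-selection crossover does not have that property, hence the assumed repair loop.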

Genetic MC Placement
[Figure: parent bit vectors 0x00AA550000AA5500 and 0x0000FF0000FF0000 are combined by crossover into 0x00AAF00000F…, which is then mutated into 0x00AAF00000F25080]
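
To make the bit-vector encoding concrete, the sketch below renders a 64-bit vector as an 8x8 grid. The bit order is an assumption, since the slides do not state it: the most significant bit is taken to be the top-left tile, and bits run row by row.

```python
def render(vec, k):
    """Draw a k*k placement; 'M' marks a memory-controller tile."""
    rows = []
    for r in range(k):
        row = ""
        for c in range(k):
            # Assumed layout: most significant bit = tile (row 0, col 0).
            bit = (vec >> (k * k - 1 - (r * k + c))) & 1
            row += "M" if bit else "."
        rows.append(row)
    return "\n".join(rows)


print(render(0x0000FF0000FF0000, 8))
```

Under this assumed ordering, 0x0000FF0000FF0000 fills rows 2 and 5 completely, which would be consistent with the row2_5 label that appears in the open-loop results. Both parent vectors have 16 set bits, matching m = 2k for k = 8.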

Link Contention Results
k=8
[Table: Max Channel Load on Mesh and Torus for configurations row0_7, X, and Diamond]
GA Selected Diamond as most fit solution for 8x8
–Minimizes MCs in a single row/column
–Spreads DOR load
Sanity Check: GA also prefers Diamond for 4x4, 5x5, and 6x6
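
A link contention estimate of this kind can be sketched by routing one request and one reply between every tile and every memory controller, then counting how many paths cross each directed link. Everything below is an assumption made for illustration: the mesh-only model, X-then-Y dimension-order routing in both directions, unit traffic weights, and in particular the diamond coordinates, which are one plausible layout with two controllers per row and per column rather than the exact layout from the slides.

```python
from collections import Counter


def xy_route(src, dst):
    """Yield the directed links of an X-then-Y path on a mesh."""
    x, y = src
    dx, dy = dst
    while x != dx:
        nx = x + (1 if dx > x else -1)
        yield (x, y), (nx, y)
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        yield (x, y), (x, ny)
        y = ny


def max_channel_load(k, mcs):
    """Max number of request/reply paths sharing one directed link."""
    load = Counter()
    for tile in ((x, y) for x in range(k) for y in range(k)):
        for mc in mcs:
            for link in xy_route(tile, mc):   # request: tile -> MC
                load[link] += 1
            for link in xy_route(mc, tile):   # reply: MC -> tile
                load[link] += 1
    return max(load.values())


# Baseline: controllers along the top and bottom rows.
row0_7 = [(x, 0) for x in range(8)] + [(x, 7) for x in range(8)]
# Hypothetical diamond-like layout (assumed coordinates).
diamond = [(3, 0), (4, 0), (2, 1), (5, 1), (1, 2), (6, 2), (0, 3), (7, 3),
           (0, 4), (7, 4), (1, 5), (6, 5), (2, 6), (5, 6), (3, 7), (4, 7)]

print(max_channel_load(8, row0_7), max_channel_load(8, diamond))
```

In this simplified model the row0_7 bottleneck sits on the horizontal links in the middle of the controller rows, because every reply first travels in the X dimension along its controller's row.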

Network Simulation: Open-Loop Evaluation
Detailed simulation of all network events (buffers, links, etc.)
Cores are Bernoulli injection processes, uniform random traffic
Measure latency vs. offered load

Parameters           Values
Router latency       1 cycle (aggressive)
Inter-router delay   1 cycle
Buffers              32-flit sized per port
Packet size          Request: 1 flit; Reply: 4 flits
Virtual channels     4 (XY-YX routing)
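
A Bernoulli injection process with uniform random destinations can be sketched in a few lines. The function name and the packet tuple format are illustrative. The sketch only generates the offered traffic and does not model routers, buffers, or flits.

```python
import random


def bernoulli_traffic(num_cores, rate, cycles, rng):
    """Return (cycle, src, dst) packets for an open-loop run.

    Each core independently injects a packet with probability `rate`
    in every cycle; the destination is any other core, chosen uniformly.
    """
    packets = []
    for cycle in range(cycles):
        for src in range(num_cores):
            if rng.random() < rate:
                dst = rng.randrange(num_cores - 1)
                if dst >= src:
                    dst += 1  # skip the source itself
                packets.append((cycle, src, dst))
    return packets


rng = random.Random(1)
packets = bernoulli_traffic(num_cores=64, rate=0.1, cycles=1000, rng=rng)
offered = len(packets) / (64 * 1000)  # packets per core per cycle
```

Because the run is open-loop, injection does not react to network congestion. Sweeping `rate` and recording the latency at each point would produce a latency versus offered load curve.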

Open-Loop Results
[Figure: latency (cycles) vs. offered load (flits/cycle) for row0_7, row2_5, Diamond, and X]

Closed-Loop Evaluation
Each processor executes N memory operations
Up to r operations outstanding at a time
–Models MSHRs
Uniform Random requests, and real request streams with ‘hot spot’ behavior
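
The closed-loop model for a single processor can be sketched as a small event loop: at most r operations are in flight, and a new one is issued whenever one completes. The `latency` callable is a hypothetical stand-in for the network and memory round trip, which in the real evaluation comes from the detailed network simulator.

```python
import heapq


def completion_time(n_ops, r, latency):
    """Cycles needed to finish n_ops with at most r outstanding.

    latency(i) gives the round-trip time of operation i (an assumed
    interface; the slides do not describe the simulator's API).
    """
    in_flight = []  # min-heap of completion times
    now = 0
    issued = 0
    while issued < n_ops or in_flight:
        # Issue while there is a free MSHR-like slot.
        while issued < n_ops and len(in_flight) < r:
            heapq.heappush(in_flight, now + latency(issued))
            issued += 1
        # Advance time to the next completion.
        now = heapq.heappop(in_flight)
    return now


print(completion_time(8, 2, lambda i: 10))  # 40
```

With a constant 10-cycle latency, 8 operations and 2 outstanding slots finish in 40 cycles, and a single slot takes 80. In this model, higher or more variable round-trip latency translates directly into longer completion times.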

Closed-Loop Results
[Figure: completion time vs. number of processors for Diamond and row0_7]

Full System Results
[Figure: average network latency (cycles) for request to memory controller, and its standard deviation, for the JBB, WEB, TPC-W, TPC-W+H, and TPC-H workloads]
Diamond placement yields lower latency and lower latency variance.

Conclusion
MC Placement Matters!
–Diamond reduces contention, improves latency, and reduces latency/runtime variance
–X does fairly well