FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations 5/3/2011 Michael K. Papamichael, James C.

Slides:

Advertisements

Similar presentations

Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.

Advertisements

Presentation of Designing Efficient Irregular Networks for Heterogeneous Systems-on-Chip by Christian Neeb and Norbert Wehn and Workload Driven Synthesis.

REAL-TIME COMMUNICATION ANALYSIS FOR NOCS WITH WORMHOLE SWITCHING Presented by Sina Gholamian, 1 09/11/2011.

CSC457 Seminar YongKang Zhu December 6 th, 2001 About Network Processor.

Thin Servers with Smart Pipes: Designing SoC Accelerators for Memcached Bohua Kou Jing gao.

Allocator Implementations for Network-on-Chip Routers Daniel U. Becker and William J. Dally Concurrent VLSI Architecture Group Stanford University.

Packet-Switched vs. Time-Multiplexed FPGA Overlay Networks Kapre et. al RC Reading Group – 3/29/2006 Presenter: Ilya Tabakh.

L2 to Off-Chip Memory Interconnects for CMPs Presented by Allen Lee CS258 Spring 2008 May 14, 2008.

NoC Modeling Networks-on-Chips seminar May, 2008 Anton Lavro.

Lei Wang, Yuho Jin, Hyungjun Kim and Eun Jung Kim

Design of a High-Throughput Distributed Shared-Buffer NoC Router

Predictive Load Balancing Reconfigurable Computing Group.

Modern trends in computer architecture and semiconductor scaling are leading towards the design of chips with more and more processor cores. Highly concurrent.

Architecture and Routing for NoC-based FPGA Israel Cidon* *joint work with Roman Gindin and Idit Keidar.

Orion: A Power-Performance Simulator for Interconnection Networks Presented by: Ilya Tabakh RC Reading Group4/19/2006.

Presenter : Cheng-Ta Wu Antti Rasmus, Ari Kulmala, Erno Salminen, and Timo D. Hämäläinen Tampere University of Technology, Institute of Digital and Computer.

Performance and Power Efficient On-Chip Communication Using Adaptive Virtual Point-to-Point Connections M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol.

Diamonds are a Memory Controller’s Best Friend* *Also known as: Achieving Predictable Performance through Better Memory Controller Placement in Many-Core.

McRouter: Multicast within a Router for High Performance NoCs

Flexible Reference-Counting-Based Hardware Acceleration for Garbage Collection José A. Joao * Onur Mutlu ‡ Yale N. Patt * * HPS Research Group University.

TitleEfficient Timing Channel Protection for On-Chip Networks Yao Wang and G. Edward Suh Cornell University.

Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical.

On-Chip Networks and Testing

5/3/2011 International Symposium on Network-on-Chip 1 DART: A Programmable Architecture for NoC Simulation on FPGAs Danyao Wang*† Natalie Enright Jerger*

Déjà Vu Switching for Multiplane NoCs NOCS’12 University of Pittsburgh Ahmed Abousamra Rami MelhemAlex Jones.

High-Level Interconnect Architectures for FPGAs An investigation into network-based interconnect systems for existing and future FPGA architectures Nick.

Author : Jing Lin, Xiaola Lin, Liang Tang Publish Journal of parallel and Distributed Computing MAKING-A-STOP: A NEW BUFFERLESS ROUTING ALGORITHM FOR ON-CHIP.

Building Expressive, Area-Efficient Coherence Directories Michael C. Huang Guofan Jiang Zhejiang University University of Rochester IBM 1 Lei Fang, Peng.

High-Level Interconnect Architectures for FPGAs Nick Barrow-Williams.

Data and Computer Communications Chapter 10 – Circuit Switching and Packet Switching (Wide Area Networks)

Design and Evaluation of Hierarchical Rings with Deflection Routing Rachata Ausavarungnirun, Chris Fallin, Xiangyao Yu, Kevin Chang, Greg Nazario, Reetuparna.

J. Christiansen, CERN - EP/MIC

O1TURN : Near-Optimal Worst-Case Throughput Routing for 2D-Mesh Networks DaeHo Seo, Akif Ali, WonTaek Lim Nauman Rafique, Mithuna Thottethodi School of.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR FPGA Fabric n Elements of an FPGA fabric –Logic element –Placement –Wiring –I/O.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.

TEMPLATE DESIGN © Hardware Design, Synthesis, and Verification of a Multicore Communication API Ben Meakin, Ganesh Gopalakrishnan.

In-network cache coherence MICRO’2006 Noel Eisley et.al, Princeton Univ. Presented by PAK, EUNJI.

Department of Computer Science and Engineering The Pennsylvania State University Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das Addressing.

Express Cube Topologies for On-chip Interconnects Boris Grot J. Hestness, S. W. Keckler, O. Mutlu † The University of Texas at Austin † Carnegie Mellon.

CS 8501 Networks-on-Chip (NoCs) Lukasz Szafaryn 15 FEB 10.

Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Jason Bosko March 5 th, 2008 Based on “Managing Distributed, Shared L2 Caches through.

Interconnect simulation. Different levels for Evaluating an architecture Numerical models – Mathematic formulations to obtain performance characteristics.

Authors – Jeahyuk huh, Doug Burger, and Stephen W.Keckler Presenter – Sushma Myneni Exploring the Design Space of Future CMPs.

University of Michigan, Ann Arbor

Networks-on-Chip (NoC) Suleyman TOSUN Computer Engineering Deptartment Hacettepe University, Turkey.

Yu Cai Ken Mai Onur Mutlu

An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.

FPGA-Based System Design: Chapter 1 Copyright  2004 Prentice Hall PTR Moore’s Law n Gordon Moore: co-founder of Intel. n Predicted that number of transistors.

BarrierWatch: Characterizing Multithreaded Workloads across and within Program-Defined Epochs Socrates Demetriades and Sangyeun Cho Computer Frontiers.

AN ASYNCHRONOUS BUS BRIDGE FOR PARTITIONED MULTI-SOC ARCHITECTURES ON FPGAS REPORTER: HSUAN-JU LI 2014/04/09 Field Programmable Logic and Applications.

Team LDPC, SoC Lab. Graduate Institute of CSIE, NTU Implementing LDPC Decoding on Network-On-Chip T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin.

An Efficient Gigabit Ethernet Switch Model for Large-Scale Simulation Dong (Kevin) Jin.

Sunpyo Hong, Hyesoon Kim

Predictive High-Performance Architecture Research Mavens (PHARM), Department of ECE The NoX Router Mitchell Hayenga Mikko Lipasti.

Mohamed ABDELFATTAH Andrew BITAR Vaughn BETZ. 2 Module 1 Module 2 Module 3 Module 4 FPGAs are big! Design big systems High on-chip communication.

Network On Chip Cache Coherency Final presentation – Part A Students: Zemer Tzach Kalifon Ethan Kalifon Ethan Instructor: Walter Isaschar Instructor: Walter.

A Low-Area Interconnect Architecture for Chip Multiprocessors Zhiyi Yu and Bevan Baas VLSI Computation Lab ECE Department, UC Davis.

HAT: Heterogeneous Adaptive Throttling for On-Chip Networks Kevin Kai-Wei Chang Rachata Ausavarungnirun Chris Fallin Onur Mutlu.

FPGA-Based System Design: Chapter 3 Copyright  2004 Prentice Hall PTR Topics n FPGA fabric architecture concepts.

1 Scalability and Accuracy in a Large-Scale Network Emulator Nov. 12, 2003 Byung-Gon Chun.

Runtime Reconfigurable Network-on- chips for FPGA-based systems Mugdha Puranik Department of Electrical and Computer Engineering

ESE532: System-on-a-Chip Architecture

Pablo Abad, Pablo Prieto, Valentin Puente, Jose-Angel Gregorio

Exploring Concentration and Channel Slicing in On-chip Network Router

Cache Memory Presentation I

OpenSMART: Single-cycle Multi-hop NoC Generator in BSV and Chisel

Using Packet Information for Efficient Communication in NoCs

Combining Simulators and FPGAs “An Out-of-Body Experience”

CS 6290 Many-core & Interconnect

Presentation transcript:

FIST: A Fast, Lightweight, FPGA-Friendly Packet Latency Estimator for NoC Modeling in Full-System Simulations 5/3/2011 Michael K. Papamichael, James C. Hoe, Onur Mutlu Computer Architecture Lab at Our work has been supported by NSF. We thank Xilinx and Bluespec for their FPGA and tool donations.

CALCM Computer Architecture Lab at Carnegie Mellon Network-on-Chip Simulation 2 NoCs crucial component in chip multiprocessors FIST project explores fast NoC simulation techniques Practical, FPGA-friendly approach for full-system simulations L2 L1 P L2 L1 P L2 L1 P L2 L1 P R R R R Network model in a full-system simulator is simple Estimates packet latencies and delivers packets

CALCM Computer Architecture Lab at Carnegie Mellon FIST Network Simulation Goals * FPGA-friendly, but avoid implementing actual NoC Buffered NoC w/ multiple virtual channels complex Limited size of simulated network Stay within acceptable error margin Trade-off complexity vs. fidelity Simulate variety of network topologies Be fast to keep up with HW-based simulators FPGA (Xilinx LX110T) 3x3 4x4 5x5 6x6 32-bit Datapath LUT Requirements for a Mesh NoC on a modern FPGA* FPGA (Xilinx LX110T) 64-bit Datapath 3x3 4x4 5x5 Datapath Width3x34x45x56x67x78x8 32-bit43%76%120%172%234%306% 64-bit63%112%175%253%344%449% 128-bit101%180%282%405%552%721% 256-bit177%315%493%709%965%1261% Mesh NoC LUT Usage on Xilinx LX110T FPGA* *NoC Router RTL from 3

CALCM Computer Architecture Lab at Carnegie Mellon What is a Network-on-Chip? Abstract View of NoC Set of routers connected by links Buffers may exist at inputs/outputs Mesh Example R R R R R R R R R R R R R R R R R R N N N N N N N N N N N N N N N N N N N N R R R R R R R R R R R R R R 4

CALCM Computer Architecture Lab at Carnegie Mellon Key Observation Inputs Outputs Router has characteristic load-delay curves Known curves for given configuration & traffic pattern N N R R Delay-Load Curve Example for Buffered Crossbar 5

CALCM Computer Architecture Lab at Carnegie Mellon N N N N N N N N N N N N FIST Approach Treat each hop as a load-delay curve Trade-off between model complexity and fidelity Keep track of load at each node R R R R R R R R R R R R N N N N N N R R R R R R 6

CALCM Computer Architecture Lab at Carnegie Mellon N N N N N N N N N N N N N N N N N N FIST in Action Route packet from source to destination Determine routers that will be traversed Add up the delays for each traversed router Index load-delay curves using current load at each router R R R R R R R R R R R R R R R R R R S S D D packet delay 7

CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST Background FIST-based Network Models Applicability and Limitations Evaluation Related Work Conclusions Future Directions

CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST Background FIST-based Network Models Applicability and Limitations Evaluation Related Work Conclusions Future Directions

CALCM Computer Architecture Lab at Carnegie Mellon Simulation in Computer Architecture Simulation indispensable tool Slow for large-scale multiprocessor studies Full-system fidelity + many components Longer modern benchmarks How can we make it faster? Adjust trade-off between Speed, Accuracy, Flexibility Full-system simulators often sacrifice accuracy for speed and flexibility Speed Accuracy Flexibility Accelerate simulation using FPGAs Can provide orders of magnitude speedup Gaining more traction as FPGAs grow larger 10

CALCM Computer Architecture Lab at Carnegie Mellon Anatomy of a Simulator Collection of connected simulation components Each component models a different part of the system Example: Simulation Components Cache Model Network-on-Chip Model I/O Device Models Branch Predictor Model CMP Model FPGA-based simulators, can run components in parallel Each component can be mapped on a separate FPGA See ProtoFlex project for more info: 11

CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST Background FIST-based Network Models Applicability and Limitations Evaluation Related Work Conclusions Future Directions

CALCM Computer Architecture Lab at Carnegie Mellon Feedback Putting FIST Into Context Train Curves Use Curves Where does FIST fit in? Network models within full-system simulators Model network within a broader simulated system Assign delay to each packet traversing the network Traffic generated by real workloads Updated Curves Detailed network models Cycle-accurate network simulators (e.g. BookSim) Analytical network models Typically study networks under synthetic traffic patterns 13

CALCM Computer Architecture Lab at Carnegie Mellon Online and Offline FIST Offline FIST Detailed network simulator generates curves offline Can use synthetic or actual workload traffic Load curves into FIST and run experiment Detailed Network Model Online FIST Initialization of curves same as offline Periodically run detailed network simulator on the side Compare accuracy and, if necessary, update curves Detailed Network Model Provide feedback and receive updated curves 14

CALCM Computer Architecture Lab at Carnegie Mellon FIST Applicability and Limitations “FIST-Friendly” Networks Exhibit stable, predictable behavior as load fluctuates Actual traffic similar training traffic Online FIST models can tolerate more “unpredictability” FIST Limitations Depends on fidelity & representativeness of training models Higher loads and large buffers can limit FIST’s accuracy High network load  increased packet latency variance Large buffers  increased range of observed packet latencies Cannot capture fine-grain packet interactions Cannot replace cycle-accurate detailed network models FIST only as good as its training data 15

CALCM Computer Architecture Lab at Carnegie Mellon Using FIST to Model NoCs NoCs affected by on-chip limitations and scarce resources Employ simple routing algorithms Usually simple deterministic routing Operate at low loads NoCs usually over-provisioned to handle worst-case Have been observed to operate at low injection rates Small buffers On-chip abundance of wires reduces buffering requirements Amount of buffering in NoCs is limited or even eliminated NoCs are “FIST-Friendly” 16

CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST Background FIST-based Network Models Applicability and Limitations Evaluation Related Work Conclusions Future Directions

CALCM Computer Architecture Lab at Carnegie Mellon FIST Implementations Packet Descriptors Router Elements Src Dest Size Tree of Adders Pick routers Routing Logic Packet Delays Partial Delays BRAM Load Tracker Curve Software Implementation of FIST (written in C++) Implements online and offline FIST models Hardware Implementation (written in Bluespec) Precisely replicates software-based FIST Block diagram of architecture: 18

CALCM Computer Architecture Lab at Carnegie Mellon Methodology Examined online and offline FIST models Replaced cycle-accurate NoC model in tiled CMP Ysimulator Network and system configuration 4x4, 8x8, 16x16 wormhole-routed mesh with 4 virtual channels Each network node hosts core+coherent L1 and a slice of L2 Multiprogrammed and multithreaded workloads 26 SPEC CPU2006 benchmarks of varying network intensity 8 SPLASH-2 and 2 PARSEC workloads Traffic generated by cache misses Single-flit control packets for data requests and coherence 8-flit data packets for cache-block transfers Offline and Online FIST models with two curves per router Curves represent injection and traversal latency at each router Offline model trained using uniform random synthetic traffic Please see paper for more details! 19

CALCM Computer Architecture Lab at Carnegie Mellon Accuracy Results 8x8 mesh using FIST offline model Average Latency and Aggregate IPC Error Latency Error IPC Error MT (SPL/PAR) MP (High) MP (Med) MP (Low) Latency/IPC Error IPC Error < 4% Latency Error < 8% 20

CALCM Computer Architecture Lab at Carnegie Mellon Accuracy Results MT (SPL/PAR) MP (High) MP (Med) MP (Low) Latency/IPC Error 8x8 mesh using FIST online model Average Latency and Aggregate IPC Error Both Latency and IPC Error below 3% 21

CALCM Computer Architecture Lab at Carnegie Mellon Performance Results Getting a Sense of Performance 10K 100K 1M 10M 100M SW FIST HW FIST Complexity / Level of Detail Speed (packets/sec) 22 1K SW NoC Sims (e.g. BookSim) Software-based speedup results for 16x16 mesh Offline FIST: 43x Online FIST: 18x Hardware-based speedup: ~3-4 orders of magnitude

CALCM Computer Architecture Lab at Carnegie Mellon Hardware Implementation Results FPGA resource usage & post-synthesis clock frequency Xilinx Virtex-5 LX155T (medium size FPGA) Xilinx Virtex-6 LX760 (large FPGA) Virtex-5 LX155TVirtex-6 LX760 SizeBRAMsLUTsFreq.BRAMsLUTsFreq. 4x481%380 MHz80%448 MHz 8x8325%263 MHz321%443 MHz 12x127211%250 MHz722%375 MHz 16x %214 MHz1295%375 MHz 20x %200 MHz2018%319 MHz 24x %312 MHz

CALCM Computer Architecture Lab at Carnegie Mellon Outline Introduction to FIST Background FIST-based Network Models Applicability and Limitations Evaluation Related Work Conclusions Future Directions

CALCM Computer Architecture Lab at Carnegie Mellon Related Work Vast body of work on network modeling Analytical models, hardware prototyping, etc. Abstract network modeling Previous work has studied the performance-accuracy vs. performance Comparing five FPGAs for network modeling Cycle-accurate fidelity at the cost of limited scalability Virtualization techniques can help with scalability but can still suffer from high implementation complexity

CALCM Computer Architecture Lab at Carnegie Mellon Conclusions & Future Directions Conclusions Full-system simulators can tolerate small inaccuracies FIST can provide fast SW- or HW-based NoC models SW model provides 18x-43x average speedup w/ <2% error HW model can scale to 100s routers with >1000x speedup Not all networks are “FIST-friendly” But NoCs are good candidates for FIST-based modeling Future Directions Simple cycle-accurate, FPGA-friendly network models Configurable flexible NoC generation

CALCM Computer Architecture Lab at Carnegie Mellon Thanks! Questions?