A Study of Cyclops64 Crossbar Architecture and Performance Yingping Zhang April, 2005.

Overview
1. Background
2. Architecture Of C64 Crossbar
3. Performance Simulation
4. Test Result
5. Performance Analysis
6. Conclusion
7. Future Work

Background
1. What is Cyclops64?
- Cyclops64 (C64), also called Blue Gene/C, is part of the IBM Blue Gene project.
- It is a supercomputer based on a cellular architecture. Each chip consists of 75 to 80 custom-designed 64-bit processors. Each processor will have two thread units, two integer units, and a floating-point unit.
- C64 is expected to reach 1000 teraflops and to be one of the fastest supercomputers in the world.
- The architecture was conceived by Cray Award winner Monty Denneau. Verification testing and system software development are being done in our CAPSL group.
2. What is the project goal?
- Study the architecture and performance of the C64 interconnection network, the crossbar (part of verification testing).

[Figure: Cyclops64 chip block diagram. Labeled blocks: C64 processors (TU, FP, ICache), crossbar, DDR2 SDRAM controller with DDR2 SDRAM DIMMs, ASw (a part of the 3D cube network, linking to the other C64 chips), Host IF (FIFO 64-bit x 64), Mickey tree, Gbit Ethernet and disk (with DMA), FPGA. Crossbar ports: 0-79 for C64 processors; 84, 85 for Host IF; further ports for mpg ICache, DRAM controller, and ASw. Processor# 80, ICache# 16. Note: the configuration pins are connected to all modules except DDR and Crossbar.]

Architecture Of C64 Crossbar
1. On-chip crossbar: provides communication inside a single chip.
2. Crossbar size: 96 input ports and 96 output ports. Each port can connect to any other port and to itself. All communication among processors, ICaches, SRAM, DRAM, and ASwitches has to go through the crossbar.
3. Pipelined crossbar: 7 pipeline stages. When the pipeline is full, each port delivers one packet per cycle.
Bandwidth of the crossbar (per cycle) = number of ports * length of the packet
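The bandwidth relation above is simple arithmetic, and a small sketch makes the units explicit. This is illustrative only: the 96 ports come from the slide, but the packet length used here is a hypothetical value, since the slides do not state it.

```python
def crossbar_peak_bandwidth(num_ports: int, packet_bytes: int) -> int:
    """Peak bytes delivered per cycle when the pipeline is full.

    Assumes every port delivers exactly one packet per cycle,
    as stated on the slide.
    """
    if num_ports <= 0 or packet_bytes <= 0:
        raise ValueError("num_ports and packet_bytes must be positive")
    return num_ports * packet_bytes


PORTS = 96          # from the slide
PACKET_BYTES = 8    # hypothetical packet length, not given in the slides

peak = crossbar_peak_bandwidth(PORTS, PACKET_BYTES)
print(f"peak bandwidth: {peak} bytes per cycle")
```

Multiplying the result by the clock frequency would give bytes per second; the clock rate is left out because this presentation does not state it.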

[Figure: Crossbar architecture, 96 ports. Labeled blocks per port: SrcSplit, SrcCtl, TarCtl, TarCombine, Arbiter, LC, FIFO, MUX (Sel), TUnitA, TUnitB, C64|MP|CORE. Labeled signals: Ws, Wr, Rs, Req, Ack.]

[Figure: Crossbar architecture (continued). Same blocks and signals as the previous figure (SrcSplit, SrcCtl, TarCtl, TarCombine, Arbiter, LC, FIFO, MUX (Sel), TUnitA, TUnitB; Ws, Wr, Rs, Req, Ack), with port numbers marked.]

Performance Simulation
1. Performance measurement
- Latency: the time required for a packet to traverse the network from source to destination.
- Throughput: the rate at which packets are delivered by the network for a particular traffic pattern.
2. Workloads
- Synthetic: random distributed vs. Poisson distributed
- Application driven: Hello_World, Matrix_Cthread, Laplace_Cthread, Heat_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset
3. Simulators
- Csim_crossbar
- LAST
(Both designed by Fei Chen at CAPSL)

Parameters configuration
Workloads
- Synthetic
  - Temporal characteristics (1): Uniform Random, Poisson
  - Spatial distributions (2): Uniform Random, Permutation (Neighbor & Tornado)
- Application-Driven Benchmarks
Arbitration schemes
- Uniformly Random, Matrix, Circular, Segmented Matrix, Fixed Priority
(1) Describes the generation probability of messages over time.
(2) Determines the communication paths between the sources and destinations.
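The spatial distributions can be expressed as destination functions. The sketch below uses the common textbook definitions of the neighbor and tornado permutations; whether Csim_crossbar and LAST use exactly these offsets is an assumption, and the function names are made up for illustration.

```python
import math
import random

NUM_PORTS = 96  # crossbar size from the slides


def neighbor(src: int, n: int = NUM_PORTS) -> int:
    """Neighbor permutation: each source sends to the next port."""
    return (src + 1) % n


def tornado(src: int, n: int = NUM_PORTS) -> int:
    """Tornado permutation (textbook form): offset of ceil(n/2) - 1."""
    return (src + math.ceil(n / 2) - 1) % n


def uniform_random(src: int, n: int = NUM_PORTS, rng=random) -> int:
    """Uniform random: any port, including the source itself."""
    return rng.randrange(n)


def is_permutation(pattern, n: int = NUM_PORTS) -> bool:
    """True if every destination is targeted by exactly one source."""
    return sorted(pattern(s, n) for s in range(n)) == list(range(n))


print(is_permutation(neighbor), is_permutation(tornado))
```

Because a permutation maps each source to a distinct destination, no two packets injected in the same cycle ever compete for the same output port.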

Test Results: Latency - Synthetic Workloads
- Latency under the uniform random pattern grows without bound when the injection rate exceeds 0.6.
- Latency under permutation traffic stays constant at 7 cycles.

Test Results: Throughput - Synthetic Workloads (Cont)
- The uniform workload with the permutation traffic pattern has linear throughput.
- The network is stable.

Test Results: Contention - Synthetic Workloads (Cont)
- Permutation traffic has zero contention.
- The uniform distribution has more contention than the Poisson distribution.

Performance Analysis One - Synthetic Workloads
- The minimum latency of the crossbar is 7 cycles.
- The crossbar is a stable network because its throughput does not degrade beyond the saturation point.
- Contention at the output ports delays message transfer; permutation traffic has zero contention.
- The uniformly random workload with permutation traffic has the best performance: when the injection rate reaches 1.0, its throughput reaches 1.
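The zero-contention result for permutation traffic can be reproduced with a toy model. The sketch below is a deliberately simplified assumption, not the Csim_crossbar or LAST simulator: it has no queues and no pipeline, and simply counts how many packets per cycle target an output port that another packet already claimed.

```python
import random
from collections import Counter


def neighbor(src, n, rng):
    """Permutation traffic: source i always targets port i + 1."""
    return (src + 1) % n


def uniform(src, n, rng):
    """Uniform random traffic: any output port with equal probability."""
    return rng.randrange(n)


def count_conflicts(num_ports, rate, dest_fn, cycles, seed=0):
    """Count packets that lose an output port to another packet.

    Each cycle, every input injects a packet with probability `rate`.
    An output accepts one packet per cycle; the rest are conflicts.
    """
    rng = random.Random(seed)
    conflicts = 0
    for _ in range(cycles):
        requests = Counter()
        for src in range(num_ports):
            if rng.random() < rate:
                requests[dest_fn(src, num_ports, rng)] += 1
        conflicts += sum(count - 1 for count in requests.values())
    return conflicts


print("permutation:", count_conflicts(96, 1.0, neighbor, 100))
print("uniform random:", count_conflicts(96, 0.6, uniform, 100))
```

Even at an injection rate of 1.0 the permutation pattern produces no conflicts, while uniform random traffic produces conflicts at lower rates, which is consistent with the latency growth reported in the slides.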

Test Results: Latency - Arbitration Schemes
- The Fixed Priority scheme is the worst case: its latency grows without bound at an injection rate of 0.5.
- The other schemes have very similar latency behavior.

Test Results: Throughput - Arbitration Schemes (Cont)
- The Fixed Priority scheme is the worst case: the network saturates at an injection rate of 0.5.
- The other schemes have very similar throughput behavior.

Performance Analysis Two - Arbitration Schemes
- The SLRU, PLRU, CIRC and RAND arbitration schemes show very similar performance under the uniformly random traffic pattern.
- The Fixed Priority arbitration scheme shows the worst performance under the same conditions.
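One plausible reason for the gap is starvation: a fixed-priority arbiter keeps granting the same input while others wait. The sketch below contrasts it with a plain round-robin arbiter, used here as a stand-in for a rotating scheme; it is an assumption and not the actual CIRC, SLRU, PLRU or RAND logic of the C64 crossbar.

```python
def fixed_priority(requests, last_grant=None):
    """Always grant the lowest-numbered requesting input."""
    return min(requests) if requests else None


def round_robin(requests, last_grant=None):
    """Grant the first requester after the previous winner, wrapping."""
    if not requests:
        return None
    later = [r for r in requests if last_grant is not None and r > last_grant]
    return min(later) if later else min(requests)


def grant_counts(arbiter, requests, cycles):
    """Grants per input when the same inputs request every cycle."""
    counts = {r: 0 for r in requests}
    last = None
    for _ in range(cycles):
        last = arbiter(requests, last)
        counts[last] += 1
    return counts


contenders = {0, 1, 2}  # three inputs competing for one output port
print("fixed priority:", grant_counts(fixed_priority, contenders, 6))
print("round robin:   ", grant_counts(round_robin, contenders, 6))
```

Under sustained contention the fixed-priority arbiter serves only input 0, so packets from the other inputs wait indefinitely, while the rotating arbiter shares the output evenly.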

Test Results - Application-Driven Benchmarks
[Table: columns are Application, Number of Packets, Forward Latency (Avg), Reverse Latency (Avg), Forward Throughput (Avg), Reverse Throughput (Avg); rows are Hello_World, Heat_Cthread, Matrix_Cthread, Cnet_get_nb, Cnet_put_nb, Dev_Align, Dev_Reset.]
- Average reverse latency increases very quickly as the number of packets increases.
- Forward and reverse traffic have different latency behavior.

Performance Analysis - Application-Driven Benchmarks
- The C64 architecture classifies traffic into:
  - Class 0 (forward traffic): messages sent out from processors, such as load requests and stores.
  - Class 1 (reverse traffic): messages sent back to processors, such as load returns.
- Reverse transfer delay is much larger than forward transfer delay.
- Forward and reverse transfers have similar throughput.

Conclusion
For Synthetic Workloads
Verified:
- The C64 crossbar is a stable network.
- The minimum latency of the C64 crossbar is 7 cycles.
Discovered:
- The traffic pattern, including temporal characteristics and spatial distribution, strongly affects crossbar performance.
- Permutation spatial traffic has the best latency behavior: it stays at the minimum latency of 7 cycles because it has zero contention.
- The uniform random distributed workload has better throughput behavior.
- The fixed priority arbitration scheme has the worst performance; the others are very similar.
For Application-Driven Workloads
Discovered:
- Forward and reverse traffic have different latency behavior but similar throughput behavior.
- Reverse traffic has worse latency behavior than forward traffic.

Future Work
Synthetic Workloads
- Investigate arbitration schemes under different traffic patterns
Application-Driven Workloads
- Investigate performance behavior of the C64 crossbar under different configuration constraints:
  - Number of used thread units
  - Number of involved memory banks
- Investigate performance behavior of the C64 crossbar under different arbitration schemes
Summary of Performance Analyses
Documentation

Acknowledgments
Fei Chen, Yuhei, Dimitri, Joseph, Ted, Prof. Gao, and all people in the CAPSL group

Question? Thanks!!!