TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect
Heiner Litz, University of Heidelberg

Motivation
- Future trends
  - More cores: 2-fold increase per year [Asanovic, 2006]
  - More nodes needed for Exascale systems [Exascale Rep.]
- Consequence
  - Exploit fine-grained parallelism
  - Improve serialization/synchronization
- Requirement
  - Low-latency communication

Motivation
- Latency lags bandwidth [Patterson, 2004]
- Memory vs. network:
  - Bandwidth: memory 10 GB/s vs. network 5 GB/s (2x gap)
  - Latency: memory 50 ns vs. network 1 us (20x gap)

State of the Art
[Figure: existing approaches positioned along two axes, scalability and lower latency. Ethernet, Infiniband, and SW DSM clusters sit on the scalability side; QuickPath, HyperTransport, Larrabee, and Tilera SMPs sit on the low-latency side. TCCluster aims to combine both.]

Observation
- Today's CPUs already represent complete cluster nodes:
  - Processor cores
  - Switch
  - Links

Approach
- Use the host interface as the network interconnect
- Tightly Coupled Cluster (TCCluster)

Background
- Coherent HyperTransport (cHT)
  - Shared-memory SMPs
  - Cache coherency overhead
  - Max. 8 endpoints
  - Table-based routing (node ID)
- Non-coherent HyperTransport (ncHT)
  - Subset of cHT
  - I/O devices, Southbridge, ...
  - PCI-like protocol
  - "Unlimited" number of devices
  - Interval routing (memory address)

Approach
- Processors pretend to be I/O devices
- Partitioned global address space
- Communication via PIO writes to memory-mapped I/O (MMIO), as sketched below
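The following is a minimal user-space sketch of this idea, assuming a hypothetical /dev/tccluster device node whose mmap exposes the remote node's writable window; the device name, window size, and interface are illustrative assumptions, not the actual TCCluster software.

```c
/* Sketch: reach a remote node's memory with plain stores.
 * Assumes a hypothetical /dev/tccluster device whose mmap hands out the
 * MMIO window that is routed over the ncHT link to the other box. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE (1UL << 20)   /* assumed 1 MiB remote window */

int main(void)
{
    int fd = open("/dev/tccluster", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the remote node's writable region into our address space. */
    volatile uint64_t *remote = mmap(NULL, WINDOW_SIZE,
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);
    if ((void *)remote == MAP_FAILED) { perror("mmap"); return 1; }

    /* An ordinary store becomes a posted write that crosses the host
     * interface and lands in the other box's memory. */
    remote[0] = 0xdeadbeefULL;

    munmap((void *)remote, WINDOW_SIZE);
    close(fd);
    return 0;
}
```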

Routing
[Figure-only slide: address-based routing between the coupled nodes; details follow in the later routing slides.]

Programming Model
- Remote Store programming model (RSM)
  - Each process has local private memory
  - Each process exports remotely writable regions
  - Send by storing to remote locations
  - Receive by reading from local memory
  - Synchronization through serializing instructions
- No support for bulk transfers (DMA)
- No support for remote reads
- Emphasis on locality and low-latency reads (see the sketch below)
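Below is a hedged illustration of this remote-store style: the sender writes a payload and then a flag into the receiver's exported region, ordered by a store fence, while the receiver only polls its own local memory. The struct layout and the mapping setup are assumptions for illustration, not part of the original slides.

```c
/* Remote Store model sketch (x86): send = store into the peer's exported
 * region, receive = poll local memory. The mappings are assumed to have
 * been established elsewhere (e.g., as in the earlier mmap sketch). */
#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence, _mm_pause */

struct msg {
    volatile uint64_t payload;
    volatile uint64_t flag;      /* written last, signals arrival */
};

/* Sender side: 'remote' points into the receiver's writable window. */
static void send_msg(struct msg *remote, uint64_t value)
{
    remote->payload = value;
    _mm_sfence();                /* serialize: payload before flag */
    remote->flag = 1;
}

/* Receiver side: 'local' is ordinary local DRAM exported to the peer. */
static uint64_t recv_msg(struct msg *local)
{
    while (local->flag == 0)
        _mm_pause();             /* spin on local memory, no remote reads */
    local->flag = 0;
    return local->payload;
}
```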

Implementation
- Hardware: two two-socket quad-core Opteron (Shanghai) Tyan boxes
[Figure: BOX 0 and BOX 1, each containing node0, node1, a Southbridge (SB), and an HTX connector; the two boxes are coupled through the HTX connectors by a non-coherent HT (ncHT) link, with shared Reset/PWR wiring.]

Implementation
- Software-based approach (a driver mmap sketch follows below)
- Firmware
  - Coreboot (LinuxBIOS)
  - Link de-enumeration
  - Force the link non-coherent
  - Set link frequency & electrical parameters
- Driver
  - Linux based
  - Topology & routing
  - Manages remotely writable regions
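To make the "remotely writable regions" point concrete, here is a rough sketch of how such a Linux driver could hand a physical window to user space via mmap, choosing the caching attribute per region (write-combining for the outgoing window, or uncached via pgprot_noncached() for others). All names, the base address, and the size are hypothetical; this is not the actual TCCluster driver.

```c
/* Hypothetical fragment of a TCCluster-style driver: expose a physical
 * window (remote MMIO or exported local DRAM) to user space via mmap.
 * Base address, size and the write-combining choice are assumptions. */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/miscdevice.h>

#define TCC_WINDOW_PHYS 0x140000000ULL   /* assumed window base */
#define TCC_WINDOW_SIZE (1UL << 20)

static int tcc_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > TCC_WINDOW_SIZE)
        return -EINVAL;

    /* Outgoing stores to the remote box: map write-combining so the CPU
     * can burst posted writes; pgprot_noncached() would keep it UC. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           TCC_WINDOW_PHYS >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations tcc_fops = {
    .owner = THIS_MODULE,
    .mmap  = tcc_mmap,
};

static struct miscdevice tcc_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "tccluster",
    .fops  = &tcc_fops,
};

static int __init tcc_init(void) { return misc_register(&tcc_dev); }
static void __exit tcc_exit(void) { misc_deregister(&tcc_dev); }

module_init(tcc_init);
module_exit(tcc_exit);
MODULE_LICENSE("GPL");
```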

Memory Layout
[Figure: physical address maps of BOX 0 and BOX 1 with markers at 0 GB, 4 GB, 5 GB, and 6 GB. Each box maps its local DRAM of node 0 and node 1 write-back (WB), plus an MMIO window mapped write-combining (WC) and a remotely writable region (RW mem) mapped uncached (UC); a DRAM hole is also shown. The MMIO and RW regions appear in mirrored order on the two boxes, so each box's MMIO window corresponds to the peer's RW memory.]

Bandwidth – HT800 (16-bit link)
[Bandwidth plot not included in the transcript.]
- Single-thread message rate: 142 million messages/s (see the measurement sketch below)
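For context, a minimal sketch (not the benchmark used on the slide) of how a single-thread message rate can be measured: issue back-to-back small stores into the mapped remote window and divide by the elapsed time. The mapping is assumed to come from a setup like the sketches above.

```c
/* Message-rate microbenchmark sketch: N back-to-back 8-byte stores into
 * the mapped remote window, timed with clock_gettime(). */
#include <stdint.h>
#include <time.h>

#define N (100UL * 1000 * 1000)

double message_rate(volatile uint64_t *remote)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < N; i++)
        remote[i & 0xff] = i;          /* small posted writes, no reads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return N / secs;                   /* messages per second */
}
```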

Latency – HT800 (16-bit link)
[Latency plot not included in the transcript.]
- 227 ns software-to-software half round trip (a ping-pong sketch follows below)
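A hedged sketch of a software-to-software ping-pong that yields such a half-round-trip number: one node stores a sequence value into the peer's exported flag, the peer echoes it back, and the half round trip is the measured round trip divided by two. Pointer names and mappings are assumptions.

```c
/* Ping-pong latency sketch: half round trip = measured RTT / 2.
 * 'remote_flag' points into the peer's exported region; 'local_flag'
 * is local DRAM that the peer writes back to. */
#include <stdint.h>
#include <time.h>

#define ITERS 100000UL

double half_rtt_ns(volatile uint64_t *remote_flag,
                   volatile uint64_t *local_flag)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 1; i <= ITERS; i++) {
        *remote_flag = i;              /* ping: store into peer's memory */
        while (*local_flag != i)       /* pong: wait for the echo */
            ;                          /* spinning on local memory only */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ITERS / 2.0;           /* average half round trip */
}
```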

Conclusion
- Introduced a novel, tightly coupled interconnect
- "Virtually" moved the NIC into the CPU
- Order-of-magnitude latency improvement
- Scalable
- Next steps:
  - MPI support over the remote-store model (RSM)
  - Custom mainboard with multiple links

References
- [Asanovic, 2006] Asanovic, K., Bodik, R., Catanzaro, B., Gebis, J., et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Technical Report, 2006.
- [Exascale Rep.] ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems.
- [Patterson, 2004] Patterson, D. Latency Lags Bandwidth. Communications of the ACM, 47(10), October 2004.

Routing
- Traditional system: all nodes share the same view of memory
[Figure: nodes with DRAM address ranges x00-x0F, x10-x1F, x20-x2F, and x30-x3F and an I/O range x50-x5F; processors are connected by coherent HT (cHT) links, I/O by non-coherent HT (ncHT).]

Routing
- Our approach: each CPU has its own view of memory, with a single coherent node0 and four I/O links
[Figure: the same DRAM ranges (x00-x0F, x10-x1F, x20-x2F, x30-x3F), but the processors are connected by non-coherent HT (ncHT) links, so each processor addresses its neighbors' DRAM as I/O space.]

Routing in the Opteron Fabric
- The type of an HT packet (posted, non-posted, cHT, ncHT) is determined by the SRQ based on:
  - MTRRs
  - GART
  - Top-of-memory register
  - I/O and DRAM range registers
- Routing is determined by the northbridge (NB) based on:
  - Routing table registers
  - MMIO base/limit registers
  - Coherent link traffic distribution register
- A sketch of inspecting these registers follows below
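These registers sit in the PCI configuration space of each Opteron's northbridge (bus 0, devices 0x18-0x1B; function 1 holds the address maps). Below is a hedged user-space sketch that dumps a few of them through sysfs; the register offsets (DRAM base/limit at 0x40/0x44, MMIO base/limit at 0x80/0x84) are assumptions recalled from the Family 10h BKDG layout and should be verified there.

```c
/* Dump a few address-map registers of the first Opteron northbridge
 * (bus 0, device 0x18, function 1) via its sysfs config-space file.
 * Offsets are assumptions; check the BKDG for the exact layout. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

static uint32_t read_cfg(int fd, off_t off)
{
    uint32_t v = 0;
    pread(fd, &v, sizeof(v), off);
    return v;
}

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:00:18.1/config", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    printf("DRAM base 0: 0x%08x  limit 0: 0x%08x\n",
           read_cfg(fd, 0x40), read_cfg(fd, 0x44));
    printf("MMIO base 0: 0x%08x  limit 0: 0x%08x\n",
           read_cfg(fd, 0x80), read_cfg(fd, 0x84));

    close(fd);
    return 0;
}
```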

Transaction Example
1. Core 0 performs a write to an I/O address; it is forwarded to the X-Bar via the SRQ
2. The X-Bar forwards it to the I/O bridge, which converts it into a posted write
3. The X-Bar forwards it out over the I/O link
4. On the receiving node, the X-Bar forwards it to the I/O bridge, which converts it into a coherent sized write (sizedWr)
5. The X-Bar forwards it to the memory controller

Topology and Addressing
[Figure: per-node routing tables mapping address ranges to the Top, Left, Right, and Down links.]
- Limited possibilities, as the Opteron only supports 8 address range registers

Limitations
- Communication is PIO only: no DMA, no offloading
- No congestion management, no hardware barriers, no multicast, limited QoS, etc.
- Synchronous system: all Opterons require the same clock, so no COTS boxes
- Security issues: nodes can write directly to physical memory on any other node
- Posted writes to remote memory do not have the coherency bit set; no local caching possible?

How does it work?
- Minimalistic Linux kernel (MinLin)
  - 100 MB, runs in a ramdisk
  - Boots over Ethernet or FILO
  - Home directories mounted over SSH
  - PCI subsystem kept, to access the northbridge configuration
  - Multiple cores/processors supported
  - No hard disk, VGA, keyboard, ...
  - No module support, no device drivers
  - No swapping/paging