High Performance Computing using Portals over TNet
Adrian Riedo, Scalable Systems Lab, The University of New Mexico (© Summer 2000)

Presentation transcript:

Slide 1 - High Performance Computing using Portals over TNet, by Adrian Riedo, Summer 2000

Slide 2 - PoT Project: The Portals over TNet Project
  Introduction
  Analysis: Portals 3, TNet
  Design: Case study, Concepts
  Implementation: Development System, TNAL
  Conclusion

Slide 3 - PoT Introduction
  About High Performance Computing (HPC)
    Supercomputers → Superclusters
    Message passing (e.g. MPI)
  Data movement layer
    OS bypass (avoid kernel calls)
    Zero-copy (network bandwidth, memory bandwidth)
    Application bypass (large transfers without intervention by the application)
  High Performance Network
  Design rules on all levels
    Scalability
    Low latency, high bandwidth
    Portability, platform independence (host & network)
  Goal of the PoT project
    Evaluation of a first implementation of Portals on TNet

Slide 4 - PoT Analysis: Portals 3
  CPlant environment at Sandia National Laboratories, Albuquerque
  [Diagram: module stack, drawn twice]
    Application (MPI) on Portals
    IO (temp)
    IP (myrip.mod)
    Portals 2 (portals.mod)
    Portals 3 (p3.mod)
    RTS / CTS (rtscts.mod)
    Firmware (Myrinet): rtsmcp

Slide 5 - PoT Analysis: Portals 3
  Portals 3 Architecture, Network Abstraction Layer
  [Diagram: layers, drawn twice]
    Application (USER): API (api-p30/*)
    Driver (OS): Library (lib-p30/*)
    Firmware (NIC)
    nal.c, lib_nal.c ... to NIC / wire

Slide 6 - PoT Analysis: Portals 3
  Portals 3 Structures, Addressing
  [Diagram labels]
    Application, Library
    Network interfaces
    Access Control Lists
    Portal Table
    Match List (me)
    Memory Descriptor List (md)
    Memory Region
    Event Queue
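The addressing path named on this slide (portal table, then match list, then memory descriptor) can be illustrated with a small sketch. This is a simplified model, not the Portals 3 library code: the struct and function names (`portal_table`, `match_entry`, `mem_desc`, `portal_lookup`) and the table size are hypothetical, and process-ID matching, access control lists and event queues are left out. The matching rule used here compares the incoming match bits with each entry's match bits while masking out that entry's ignore bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the structures on the slide. */
typedef struct mem_desc {
    void  *start;    /* start of the memory region */
    size_t length;   /* size of the memory region in bytes */
} mem_desc;

typedef struct match_entry {
    uint64_t            match_bits;   /* bits an incoming message must carry */
    uint64_t            ignore_bits;  /* bit positions excluded from the comparison */
    mem_desc           *md;           /* memory descriptor attached to this entry */
    struct match_entry *next;         /* next entry in the match list */
} match_entry;

#define PORTAL_TABLE_SIZE 8           /* assumed size, for illustration only */

typedef struct portal_table {
    match_entry *match_list[PORTAL_TABLE_SIZE];
} portal_table;

/* Walk the match list at portal index `index` and return the memory
 * descriptor of the first entry that accepts `mbits`, or NULL if the
 * index is out of range or no entry matches. */
static mem_desc *portal_lookup(const portal_table *tbl, unsigned index,
                               uint64_t mbits)
{
    assert(tbl != NULL);
    if (index >= PORTAL_TABLE_SIZE)
        return NULL;
    for (match_entry *me = tbl->match_list[index]; me != NULL; me = me->next) {
        if (((mbits ^ me->match_bits) & ~me->ignore_bits) == 0)
            return me->md;
    }
    return NULL;
}
```

An incoming put would carry a portal index and match bits; the first matching entry decides which memory region receives the data.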

Slide 7 - PoT Analysis: TNet
  TNet environment, Swiss-Tx
  [Diagram: module stack, drawn twice]
    Application (MPI) on FCI
    FCI (tnet.mod): irq handler, kernel thread, tnet.c ...
    Firmware (TNet): cc_b35_lc_c35

Slide 8 - PoT Analysis: TNet
  TNet OSI Specification
  Layer specific Communication Types
    * corresponds to MPI Broadcast in 1 MPI Group
    ** specially for SMP Nodes (2, 4 Processors)
    BC is a particular case of MC
  TM Example (4 Processes)
    [Diagram labels: Sender Process, Receiving Node, Process 1, Process 2, Process 3, Process 4, TM, DM]

Slide 9 - PoT Analysis: TNet
  TNet Address Translation
    CMB: Contiguous Memory Block
    VCA: Virtual Communication Address
    pg: Page
  [Diagram labels: Network virtual communication address space, Host memory address space, offset, pagetable]
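As a rough illustration of the translation this slide depicts, the sketch below splits a virtual communication address (VCA) into a page number and an offset, then looks the page up in a page table to obtain a host memory address. The 4 KiB page size, the 32-bit VCA width and all names (`vca_pagetable`, `vca_to_host`) are assumptions made for the example; the slides do not give the actual TNet layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                    /* assumed 4 KiB pages */
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define PAGE_MASK  (PAGE_SIZE - 1u)

/* Hypothetical page table: entry i holds the host base address of VCA page i. */
typedef struct vca_pagetable {
    const uint64_t *host_page;
    size_t          n_pages;
} vca_pagetable;

/* Translate a VCA into a host memory address.
 * Returns 0 on success, -1 if the VCA lies outside the page table. */
static int vca_to_host(const vca_pagetable *pt, uint32_t vca, uint64_t *host)
{
    assert(pt != NULL && host != NULL);
    size_t   page   = vca >> PAGE_SHIFT;  /* index into the page table */
    uint32_t offset = vca & PAGE_MASK;    /* byte offset inside the page */
    if (page >= pt->n_pages)
        return -1;
    *host = pt->host_page[page] + offset;
    return 0;
}
```

Because every page has its own table entry, the host pages behind a contiguous VCA range do not have to be contiguous themselves.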

Slide 10 - PoT Analysis: TNet
  TNet PCI Adapter Specification
  [Block diagram: CC, PLX, LC, GBE, SDRAM, SRAM, TX-FIFO, RX-FIFO, PCI]
  CC: Communication Controller
    Lucent Orca MHz
    Tx, Rx Unit / CRC / DMW / Flags
    S(D)RAM Controllers
    Rx, Tx FIFO Interfaces
    PLX Controller
  LC: Link Controller
    Lucent Orca 62.5 MHz
    Handshake Process
    TNet retransmission protocol
    Buffer: Out 3 Packets, In 1 Packet
    CRC Check
  GBE: Gigabit Ethernet
    Vitesse VSC MHz
  SDRAM: Page Table
    Index to Address Translation Table (for TM mode)
    Std 16 MB or more
  SRAM: ID Validation Table
    128 x 18 Bit

Slide 11 - PoT Design: Case study
  Portals over TNet case study
  Hardware Solution
    Library in hardware on the NIC
    Big FPGA and fast RAM required for an optimal solution
    Special design tools
    Long implementation time
    In-depth knowledge of Portals 3 and TNet required
  Software Solution
    Library still in the OS
    Uses the TNet firmware & driver
    Requires knowledge of the Portals NAL and the TNet driver
    Performance workaround: pagetable as "memory descriptor"
  ⇒ First learn on the software level, then approach step by step

Slide 12 - PoT Design: Concepts
  Driver architecture & modules (Portals → Myrinet / FCI → TNet)
  [Diagram: module stacks, drawn twice]
    Application (MPI) on Portals
    Application (MPI) on FCI
    IO (temp)
    IP (myrip.mod)
    Portals 2 (portals.mod)
    Portals 3 (p3.mod): lib-p30, lib_myrnal, myrnal → forward, PTL_IFACE_MYR
    RTS / CTS (rtscts.mod)
    Firmware (Myrinet): rtsmcp
    FCI (tnet.mod): irq handler, kernel thread, tnet.c ...
    Firmware (TNet): cc_b35_lc_c35

Slide 13 - PoT Design: Concepts
  Driver architecture & modules (Portals, FCI → TNet)
  [Diagram: module stacks, drawn twice]
    Application (MPI) on Portals
    Application (MPI) on FCI
    IO (temp)
    IP (myrip.mod)
    P3oT (p3ot.mod): lib-p30, lib_tnal, tnal → forward, PTL_IFACE_T, FCI, tnet.c ...
    Firmware (TNet): cc_b35_lc_c35
    Portals 2 (portals.mod)
    RTS / CTS (rtscts.mod)
    Firmware (Myrinet): rtsmcp

Slide 14 - PoT Design: Concepts
  Dataflow in the P3oT module (CMB & IRQ - Large Msgs using DMA)
  [Diagram, drawn twice]
    Application (MPI) on Portals
    P3oT (p3ot.mod): lib-p30, lib_tnal, tnal → forward, PTL_IFACE_T, tnet.c ...
    CMB
    Firmware (TNet), cc_b35_lc_c35: Pagetable set up for virtual CMB
  Notes: no OS bypass, no zero-copy, DMA

Slide 15 - PoT Implementation: Development System
  System:      2 Alpha workstations; 164LX Alpha processor, 320 MB RAM; 100Base-T Ethernet
  Mini CPlant: 64 Bit / 33 MHz Myrinet; Myrinet 8 port switch
  TNet:        32 Bit TNet NIC, 16 MB RAM; no switch
  OS:          Tru64, RedHat Linux (dual boot)

Slide 16 - PoT Implementation: TNAL (network abstraction layer for Portals over TNet)
  [Diagram labels: lib-p30/*, lib_tnal.c ..., tnet.c, FCI, ioctl, CMB, gcw]

  TNET_ioctl in tnet.c
      case TNET_PTL_DISPATCH:
          copy_from_user(..);
          lib_dispatch(..);
          copy_to_user(..);
          ..
          break;

  from lib_dispatch: Do_PtlPut in wrap.c
      ..

  for the PtlPut: tnal_send in lib_tnal.c
      ..
      memcpy(..);          // for header
      copy_from_user(..);  // for data
      ..                   // send data using CMB, DMA
      ..                   // remote IRQ on last packet
      lib_finalize(..);

  Incoming message: TNET_Interrupt in tnet.c
      ..
      memcpy(..);          // for header
      lib_parse(.."header"..)

  from TNET_Interrupt: lib_parse in ~.c, then parse_put in ~.c

  from parse_put: tnal_rcv in tnal.c
      copy_to_user(..);
      lib_finalize(..);
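The send and receive paths on this slide each stage the message through the CMB, which is why the first version has neither OS bypass nor zero-copy. The following user-space sketch models only that copy pattern, under loud assumptions: the header layout (`msg_header`), the buffer size, and the names `sketch_send` and `sketch_receive` are invented for illustration, and plain `memcpy` stands in for the kernel's `copy_from_user` / `copy_to_user`. DMA, the remote interrupt, `lib_parse` and `lib_finalize` are not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CMB_SIZE 4096                  /* assumed staging buffer size */

/* Hypothetical message header; the real Portals header is richer. */
typedef struct msg_header {
    unsigned portal_index;
    size_t   length;                   /* payload length in bytes */
} msg_header;

/* Stand-in for a contiguous memory block shared with the NIC. */
typedef struct cmb {
    unsigned char bytes[CMB_SIZE];
} cmb;

/* Sender side, loosely following tnal_send: copy the header, then the
 * payload, into the CMB. Returns 0 on success, -1 if the payload is too big. */
static int sketch_send(cmb *c, unsigned portal_index,
                       const void *user_data, size_t len)
{
    assert(c != NULL && user_data != NULL);
    if (len > CMB_SIZE - sizeof(msg_header))
        return -1;
    msg_header hdr = { portal_index, len };
    memcpy(c->bytes, &hdr, sizeof hdr);             /* header copy */
    memcpy(c->bytes + sizeof hdr, user_data, len);  /* stands in for copy_from_user */
    return 0;
}

/* Receiver side, loosely following TNET_Interrupt and tnal_rcv: read the
 * header first, then copy the payload out to the destination buffer.
 * Returns 0 on success, -1 if the header is invalid or the buffer too small. */
static int sketch_receive(const cmb *c, void *user_buf, size_t buf_len,
                          msg_header *hdr_out)
{
    assert(c != NULL && user_buf != NULL && hdr_out != NULL);
    memcpy(hdr_out, c->bytes, sizeof *hdr_out);     /* header copy */
    if (hdr_out->length > CMB_SIZE - sizeof *hdr_out ||
        hdr_out->length > buf_len)
        return -1;
    memcpy(user_buf, c->bytes + sizeof *hdr_out,
           hdr_out->length);                        /* stands in for copy_to_user */
    return 0;
}
```

Counting the copies makes the cost visible: the payload is copied once on each side in addition to the transfer itself, which is the overhead a pagetable-based approach pointing directly into application space would aim to remove.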

Slide 17 - PoT Implementation: Milestones
  Setting up PC / Development System
  TNet Documentation / VHDL & C Sources → Presentation
  Getting familiar with Portals
  Experimenting with mpich
  Install Myrinet, TNet, FCI, Portals on Tru64 / Linux
  Experimenting with modules & test programs for TNet
  Presentations / Website
  PoT Design
  Writing hybrid module (P3oT)
  Debugging
  Benchmarking
  Report

Slide 18 - PoT Prospects
  Dataflow in the P3oT module (Pagetable - Small Msgs using PIO)
  [Diagram, drawn twice]
    Application (MPI) on Portals
    P3oT (p3ot.mod): lib-p30, lib_tnal, tnal → forward, PTL_IFACE_T, tnet.c ...
    CMB
    Firmware (TNet), cc_b35_lc_c35: Pagetable points to Appl. Space
  Notes: OS bypass, zero-copy, LIB, PIO, dynamic Pagetable

Slide 19 - PoT Conclusion
  Conclusions
    CMB Software Solution: approx. 80 µs latency (first version)
    Not the best solution, but learned a lot
    Software solution profits from CRC and retransmit on card
    TNAL lays basis for further research
    ⇒ Implementation using Pagetable & PIO for OS Bypass
  Experience
    Analysis and design take a lot of time (important)
    Wide knowledge needed
    Kernel programming is not trivial
    Long debugging time compared to applications

Slide 20 - PoT Project: The PoT website at