
1 Extensible Message Layers for Multimedia Cluster Computers Dr. Craig Ulmer Center for Experimental Research in Computer Systems

2 Outline
- Background
  - Evolution of cluster computers
  - Multimedia and "resource-rich" cluster computers
- Design of extensible message layers
  - GRIM: General-purpose Reliable In-order Messages
- Extensions
  - Integrating peripheral devices
  - Streaming computations
- Host-to-host performance
- Concluding remarks

3 Background
An Evolution of Cluster Computers

4 Cluster Computers
- Cost-effective alternative to supercomputers
  - A number of commodity workstations
  - Specialized network hardware and software
- Result: a large pool of host processors
(Diagram: hosts, each with a CPU, memory, and network interface on an I/O bus, connected by a system area network)

5 Improving Cluster Computers
- Adding more host CPUs
- Adding intelligent peripheral devices
(Diagram: host CPUs alongside peripheral devices)

6 Peripheral Device Trends
- Increasingly independent, intelligent peripheral devices
- Feature on-card processing and memory facilities
- Migration of computing power and bandwidth requirements to peripherals
(Diagram: host CPU with Ethernet, storage, SAN NI, and media-capture peripherals)

7 Resource-Rich Cluster Computers
- Inclusion of diverse peripheral devices
  - Ethernet server cards, multimedia capture devices, embedded storage, computational accelerators
- Processing takes place in host CPUs and peripherals
(Diagram: cluster hosts with SAN NIs, video capture, FPGA, and storage devices on a system area network)

8 Benefits of Resource-Rich Clusters
- Employ cluster computing in new applications
  - Real-time constraints
  - I/O intensive
  - Network intensive
- Example: digital libraries
  - Enormous amounts of data
  - Large number of network users
- Example: multimedia
  - Capture and process large streams of multimedia data
  - CAVE or visualization clusters

9 Extensible Message Layers
Supporting Resource-Rich Cluster Computers

10 Problem: Utilizing Distributed Cluster Resources
- How is efficient intra-cluster communication provided?
- How can applications make use of resources?
(Diagram: CPU, video capture, FPGA, RAID, and Ethernet devices with undefined interconnections)

11 Answer: Flexible "Message Layer" Communication Software
- Message layers are an enabling technology for clusters
  - They let a cluster function as a single-image multiprocessor system
- Current message layers
  - Optimized for transmissions between host CPUs
  - Peripheral devices only available in the context of the local host
- What is needed
  - Efficient communication with both host CPUs and peripherals
  - Ability to harness peripheral devices as a pool of resources

12 GRIM: An Implementation
A message layer for resource-rich clusters

13 GRIM Core
General-purpose Reliable In-order Message Layer (GRIM)
- Message layer for resource-rich clusters
  - Myrinet SAN backbone
  - Both host CPUs and peripheral devices are endpoints
  - Communication core implemented in the NI
(Diagram: CPU, FPGA card, and storage card attached to a network interface card on the system area network)

14 Per-Hop Flow Control
- End-to-end flow control is necessary for reliable delivery
  - Prevents buffer overflows in the communication path
- Endpoint-managed schemes
  - Impractical for peripheral devices
- Per-hop flow control scheme
  - Transfer data as soon as the next stage can accept it
  - Optimistic approach (see the sketch below)
(Diagram: sending and receiving endpoints exchanging DATA/ACK pairs at each hop across the PCI bus, the NI, and the SAN)
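
To make the optimistic per-hop scheme concrete, here is a minimal credit-based sketch in C. The queue depth, the hop_t structure, and transmit() are assumptions for illustration, not GRIM's NI firmware.

```c
/* A minimal sketch of per-hop, credit-based flow control in the spirit
 * of the slide's scheme. The queue depth, hop_t, and transmit() are
 * illustrative assumptions, not GRIM's implementation. */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS_PER_HOP 8u          /* assumed buffer depth at each stage */

extern void transmit(const void *pkt, size_t len);  /* PIO/DMA transfer */

typedef struct {
    uint32_t credits;             /* slots the next stage can still take */
} hop_t;

/* Sender side of one hop (endpoint->NI, NI->NI, or NI->endpoint):
 * optimistically forward as soon as the next stage has room. */
static bool hop_try_send(hop_t *hop, const void *pkt, size_t len)
{
    if (hop->credits == 0)
        return false;             /* next stage full: hold the packet */
    hop->credits--;
    transmit(pkt, len);
    return true;
}

/* The next stage ACKs each buffer it frees; one ACK returns one credit,
 * allowing another in-flight packet on this hop. */
static void hop_on_ack(hop_t *hop)
{
    hop->credits++;
}
```

Because each hop only tracks its immediate neighbor's buffer state, peripheral endpoints avoid the per-destination bookkeeping that makes endpoint-managed schemes impractical.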

15 Logical Channels
- Multiple endpoints in a host share the NI
- Employ multiple logical channels in the NI
  - Each endpoint owns one or more logical channels
  - A logical channel provides a virtual interface to the network (see the sketch below)
(Diagram: endpoints 1..n mapped to logical channels inside the NI, with a scheduler feeding the network)
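
A rough sketch of how an NI-resident scheduler might multiplex endpoints onto logical channels; the channel fields and the round-robin policy are assumptions for illustration, not GRIM's actual firmware structures.

```c
/* Sketch of NI-side logical channels: each endpoint owns one or more
 * channels, and a round-robin scheduler picks the next channel with
 * both work and flow-control credit. Names are assumptions. */
#include <stddef.h>
#include <stdint.h>

#define MAX_CHANNELS 16

typedef struct msg msg_t;         /* opaque queued message */

typedef struct {
    int      owner_endpoint;      /* endpoint that owns this channel */
    msg_t   *send_queue;          /* head of staged messages (NULL = idle) */
    uint32_t credits;             /* per-hop flow-control credits */
} channel_t;

extern void ni_transmit(channel_t *ch);   /* pop one message to the wire */

static channel_t channels[MAX_CHANNELS];

/* Round-robin over channels so one busy endpoint cannot starve the
 * others' virtual interfaces to the network. */
static void ni_schedule(void)
{
    static int next = 0;
    for (int i = 0; i < MAX_CHANNELS; i++) {
        channel_t *ch = &channels[(next + i) % MAX_CHANNELS];
        if (ch->send_queue != NULL && ch->credits > 0) {
            ni_transmit(ch);
            next = (next + i + 1) % MAX_CHANNELS;
            return;
        }
    }
}
```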

16 Programming Interfaces: Active Messages
- A message specifies a function to be executed at the receiver
  - Similar to remote procedure calls, but lightweight
  - Invoke operations at remote resources
- Useful for constructing device-specific APIs
- Example: interactions with a remote storage controller (see the sketch below)
(Diagram: CPU sends AM_fetch_file() across the SAN to a storage controller, which replies with AM_return_file())
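
The storage-controller example might look like the sketch below. The grim_am_send() signature, the handler table, and storage_read() are assumed for illustration; only the AM_fetch_file/AM_return_file names come from the slide.

```c
/* Hypothetical shape for the active-message interface used in the
 * storage example; not GRIM's published API. */
#include <stddef.h>

typedef void (*am_handler_t)(int src_endpoint, void *payload, size_t len);

/* Messages carry a handler index rather than a code pointer, so the
 * receiving endpoint controls which operations can be invoked. */
enum { AM_FETCH_FILE, AM_RETURN_FILE, AM_NUM_HANDLERS };
extern am_handler_t am_table[AM_NUM_HANDLERS];

extern void grim_am_send(int dst_endpoint, int handler_id,
                         const void *payload, size_t len);
extern void *storage_read(const char *name, size_t *len_out);

/* Runs on the storage controller when an AM_fetch_file message arrives;
 * it replies by sending the file back as another active message. */
static void am_fetch_file(int src_endpoint, void *payload, size_t len)
{
    (void)len;
    const char *name = payload;            /* requested file name */
    size_t n;
    void *data = storage_read(name, &n);   /* device-specific fetch */
    grim_am_send(src_endpoint, AM_RETURN_FILE, data, n);
}
```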

17 Programming Interfaces: Remote Memory
- Transfer blocks of data from one host to another
  - The receiving NI executes the transfer directly
- Read and write operations
  - The NI interacts with a kernel driver to translate virtual addresses
  - Optional notification mechanisms (see the sketch below)
(Diagram: CPUs and memory on both hosts, with the NIs moving data across the SAN)
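
A hedged sketch of what such a remote-memory API could look like; the signatures are assumptions for this illustration, not GRIM's actual interface.

```c
/* Illustrative remote-memory (put/get) API with an optional
 * completion notification. */
#include <stddef.h>

typedef void (*rm_notify_t)(void *arg);   /* completion callback */

/* Write 'len' bytes of 'local_buf' into a remote virtual address. The
 * receiving NI performs the transfer directly, asking its kernel driver
 * to translate and pin 'remote_va'; 'notify' may be NULL. */
extern int grim_rm_write(int dst_endpoint, void *remote_va,
                         const void *local_buf, size_t len,
                         rm_notify_t notify, void *notify_arg);

/* Read 'len' bytes from a remote virtual address into 'local_buf'. */
extern int grim_rm_read(int dst_endpoint, const void *remote_va,
                        void *local_buf, size_t len,
                        rm_notify_t notify, void *notify_arg);

/* Example use: push a captured video frame straight into a display
 * card's frame buffer with no receiver-side CPU involvement. */
static void push_frame(int display_ep, void *framebuf_va,
                       const void *frame, size_t bytes)
{
    grim_rm_write(display_ep, framebuf_va, frame, bytes, NULL, NULL);
}
```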

18 Integrating Peripheral Devices
Hardware Extensibility

19 Peripheral Device Overview
- In GRIM, peripherals are endpoints
- Intelligent peripherals
  - Operate autonomously
  - On-card message queues
  - Process incoming active messages
  - Eject outgoing active messages
- Legacy peripherals
  - Managed by a host application, or
  - Driven by remote memory operations
(Diagram: NI, CPU, an intelligent peripheral device, and a legacy peripheral device)

20 Peripheral Device Examples
- Video display card
  - Manipulate the frame buffer
  - Remote memory writes
- Server adaptor card
  - Networked host on a PCI card
  - AM handlers form a LAN-SAN bridge
- Video capture card
  - Specialized DMA engine
  - AM handlers capture data
(Diagrams: display card with AGP, D/A, and frame buffer; server adaptor with i960, Ethernet, and SCSI; capture card with A/D, frame buffer, and PCI DMA into host memory)

21 Celoxica RC-1000 FPGA Card
- FPGAs provide acceleration
  - Load with application-specific circuits
- Celoxica RC-1000 FPGA card
  - Xilinx Virtex-1000 FPGA
  - 8 MB SRAM
- Hardware implementation
  - Endpoint implemented as state machines
  - AM handlers are circuits
(Diagram: FPGA with control and switching logic connecting four SRAM banks to the PCI bus)

22 FPGA Endpoint Organization
(Diagram: input and output message queues and application data in FPGA card memory, exposed through communication-library and data-memory APIs to a circuit canvas holding user circuits 1..n)

23 Example FPGA Configuration
- Cryptography configuration
  - DES, RC6, MD5, and ALU
- Occupies 70% of the FPGA
  - Newer FPGAs are 8x the size
- Operates with a 20 MHz clock
  - Newer FPGAs are 6x faster
  - 4 KB payload => 55 µs (73 MB/s)

24 Expansion: Sharing the FPGA
- The FPGA has limited space for hardware circuits
  - The host reconfigures the FPGA on demand
  - "Function fault": an incoming message requests a circuit that is not in the loaded configuration, so the host swaps configurations (~150 ms), saving circuit state to SRAM 0 (see the sketch below)
(Diagram: host CPU holds configurations A, B, and C; a message requesting circuit F faults from configuration A to configuration C)
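
A sketch of how the host side of a function fault might work, under assumed table layouts and helper names; the ~150 ms reconfiguration figure is the only number taken from the slide.

```c
/* Host-side "function fault" handling: when a message names a circuit
 * absent from the loaded configuration, save state and load a
 * configuration that contains it. All names are illustrative. */
#include <stdbool.h>

#define NUM_CONFIGS  3
#define MAX_CIRCUITS 8

typedef struct {
    const char *bitstream;               /* configuration file */
    int circuits[MAX_CIRCUITS];          /* circuit IDs it provides */
    int ncircuits;
} fpga_config_t;

extern void save_circuit_state_to_sram(void);  /* preserve across reconfig */
extern void fpga_load_bitstream(const char *file);

static fpga_config_t configs[NUM_CONFIGS];
static int loaded = -1;                  /* configuration now on the FPGA */

static bool config_has(const fpga_config_t *c, int circuit)
{
    for (int i = 0; i < c->ncircuits; i++)
        if (c->circuits[i] == circuit)
            return true;
    return false;
}

/* Ensure 'circuit' is resident before delivering its message. A miss is
 * a function fault; at roughly 150 ms per reconfiguration, faults must
 * stay rare relative to message traffic. */
static void ensure_circuit(int circuit)
{
    if (loaded >= 0 && config_has(&configs[loaded], circuit))
        return;                          /* fast path: already loaded */
    save_circuit_state_to_sram();
    for (int i = 0; i < NUM_CONFIGS; i++) {
        if (config_has(&configs[i], circuit)) {
            fpga_load_bitstream(configs[i].bitstream);
            loaded = i;
            return;
        }
    }
}
```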

25 Extension: Streaming Computations
Software Extensibility

26 Streaming Computation Overview
- Programming method for distributed resources
  - Establish a pipeline of streaming operations
  - Example: multimedia processing
- Celoxica RC-1000 FPGA endpoint
(Diagram: a video-capture host feeds a chain of media-processor hosts across the system area network)

27 Streaming Fundamentals
- Computation: how is a computation performed?
  - Active message approach
- Forwarding: where are results transmitted?
  - Programmable forwarding directory (see the sketch below)
(Diagram: an incoming message "destination: FPGA, forward entry X, AM: perform FFT" runs on the FPGA's computational circuits; the forwarding directory maps entry X to an outgoing message "destination: host, AM: receive FFT")
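
A sketch of the forwarding-directory idea under assumed structures: the incoming message names a directory entry, and that entry decides where the finished result travels next.

```c
/* Programmable forwarding directory, sketched with illustrative
 * structures; grim_am_send() is the assumed AM primitive. */
#include <stddef.h>

#define DIR_SIZE 64

typedef struct {
    int dst_endpoint;    /* next pipeline stage (host, FPGA, NI, ...) */
    int handler_id;      /* active-message handler to invoke there */
} forward_entry_t;

extern void grim_am_send(int dst_endpoint, int handler_id,
                         const void *payload, size_t len);

static forward_entry_t forwarding_dir[DIR_SIZE];

/* Endpoints program entries before streaming begins, so the data path
 * never consults a host CPU while the stream is running. */
static void set_forwarding(int entry, int dst_endpoint, int handler_id)
{
    forwarding_dir[entry].dst_endpoint = dst_endpoint;
    forwarding_dir[entry].handler_id   = handler_id;
}

/* Called when a computation completes: forward the result as a new
 * active message according to the entry carried in the input message. */
static void forward_result(int entry, const void *result, size_t len)
{
    const forward_entry_t *fe = &forwarding_dir[entry];
    grim_am_send(fe->dst_endpoint, fe->handler_id, result, len);
}
```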

28 Host-to-Host Performance
Transferring data between two host-level endpoints

29 Host-to-Host Communication Performance
- Host-to-host transfers are the standard benchmark
- Three phases of data transfer
  - Injection is the most challenging
- Overall communication path
(Diagram: phases 1-3 from source CPU and memory, across the NIs and SAN, to the destination, for both active messages and remote memory operations)

30 Host-NI: Data Injections
- Host-to-NI transfers are challenging
  - The host lacks a DMA engine
- Multiple transfer methods
  - Programmed I/O
  - DMA
- Automatically select the method (see the sketch below)
- Result: Tunable PCI Injection Library (TPIL)
(Diagram: CPU, cache, memory controller, and main memory on the PCI bus, with a DMA-capable peripheral device)
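
The core selection idea behind TPIL can be sketched as follows; the crossover constant and helper names are assumptions, and the real library tunes the threshold for the specific host chipset and NI.

```c
/* TPIL-style injection-method selection: small transfers go via
 * programmed I/O, larger ones via the peripheral's DMA engine. */
#include <stddef.h>
#include <string.h>

#define PIO_DMA_CROSSOVER 1024      /* assumed crossover, in bytes */

extern volatile char *ni_window;    /* memory-mapped NI message buffer */
extern void ni_dma_pull(const void *pinned_host_buf, size_t len);

static void tpil_inject(const void *buf, size_t len)
{
    if (len <= PIO_DMA_CROSSOVER) {
        /* PIO: the CPU writes into NI memory itself; negligible
         * startup cost, so it wins for small transfers. */
        memcpy((void *)ni_window, buf, len);
    } else {
        /* DMA: the NI's engine pulls from pinned host memory; higher
         * startup cost, but better sustained bandwidth. */
        ni_dma_pull(buf, len);
    }
}
```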

31 TPIL Performance
(Plot: bandwidth in MB/s vs. injection size in bytes; LANai 9 NI with a Pentium III-550 MHz host)

32 Overall Communication Pipeline
- Three phases of transmission
  - Optimization: use fragmentation to increase pipeline utilization (see the sketch below)
  - Optimization: allow cut-through transmissions
(Diagram: timelines of the sending host-NI, NI-NI, and receiving NI-host stages, showing overall transmission time shrinking as messages 1-3 overlap)
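
A minimal sketch of sender-side fragmentation, assuming a fixed fragment size and a hypothetical send_fragment() primitive; GRIM applies the same idea to both AM and RM traffic.

```c
/* Carving a message into fragments lets the host-NI, NI-NI, and
 * NI-host stages work on different fragments at once. */
#include <stddef.h>

#define FRAG_SIZE 1024   /* assumed fragment payload, in bytes */

extern void send_fragment(int dst_endpoint, int msg_id, int seq,
                          int is_last, const char *data, size_t len);

static void send_fragmented(int dst_endpoint, int msg_id,
                            const char *buf, size_t len)
{
    int seq = 0;
    size_t off = 0;
    while (off < len) {
        size_t n = (len - off < FRAG_SIZE) ? (len - off) : FRAG_SIZE;
        /* Each fragment enters the pipeline immediately, so stage k can
         * carry fragment i while stage k-1 injects fragment i+1. */
        send_fragment(dst_endpoint, msg_id, seq++,
                      off + n == len, buf + off, n);
        off += n;
    }
}
```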

33 Overall Host-to-Host Performance

  Host        NI        Latency (µs)   Bandwidth (MB/s)
  P4-1.7GHz   LANai 9        8              146
  P4-1.7GHz   LANai 4       14.5            108
  P3-550MHz   LANai 9        9.5            116
  P3-550MHz   LANai 4       14               96

(Plot: bandwidth in MB/s vs. message size in bytes)

34 Comparison to Existing Message Layers
(Plots: latency in µs and bandwidth in MB/s compared against existing message layers)

35 Concluding Remarks

36 Key Contributions
- Framework for communication in resource-rich clusters
  - Reliable delivery mechanisms, virtualized network interface, and flexible programming interfaces
  - Performance comparable to state-of-the-art message layers
- Extensible for peripheral devices
  - Suitable for intelligent and legacy peripherals
  - Methods for managing card resources
- Extensible for higher-level programming abstractions
  - Endpoint-level: streaming computations and sockets emulation
  - NI-level: multicast support

37 Future Directions
- Continued work with GRIM
  - Video card vendors are opening their cards to developers
  - Myrinet-connected embedded devices
- Adaptation to other network substrates
  - Gigabit Ethernet is appealing because of its cost
  - Requires modification to the transmission protocols
  - InfiniBand technology is promising
- Active system area networks
  - FPGA chips are beginning to feature gigabit transceivers
  - Use FPGA chips as networked processing devices

38 Additional Research Projects

39 Wireless Sensor Networks
- NASA JPL research
  - In-situ wireless sensor networks
  - Exploration of Mars
- Communication
  - Self-organization
  - Routing
- SensorSim
  - Java simulator
  - Evaluate protocols

40 PeZ: Pole-Zero Editor for MATLAB

41 Related Publications
- A Tunable Communications Library for Data Injection, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2002.
- Active SANs: Hardware Support for Integrating Computation and Communication, C. Ulmer, C. Wood, and S. Yalamanchili, Proceedings of the Workshop on Novel Uses of System Area Networks at HPCA, 2002.
- A Messaging Layer for Heterogeneous Endpoints in Resource-Rich Clusters, C. Ulmer and S. Yalamanchili, Proceedings of the First Myrinet User Group Conference, 2000.
- An Extensible Message Layer for High-Performance Clusters, C. Ulmer and S. Yalamanchili, Proceedings of Parallel and Distributed Processing Techniques and Applications, 2000.

Papers and software available at http://www.CraigUlmer.com/research

42 Backup Slides

43 Performance: FPGA Computations
- Clock speed: 20 MHz; operation latency: 55 µs (4 KB at 73 MB/s)
(Diagram: per-stage clock counts for acquire SRAM, detect new message, fetch header, fetch payload (1024 clocks), computation, store results, store header, lookup forwarding, update queues, and release SRAM)

44 (Diagram: FPGA endpoint datapath; SRAM 0 holds incoming queues, SRAM 1 and SRAM 2 hold user pages 0 and 1, SRAM 3 holds outgoing queues; fetch/decode logic, built-in ALU ops, a message generator, a results cache, scratchpad controllers, and a control/status port connect them through ports A-C)

45 Expansion: Sharing On-Card Memory
- Limited on-card memory for storing application data
  - Construct a virtual memory system for the on-card memory
  - Swap space is host memory; a page fault swaps the needed user page into an SRAM page frame (see the sketch below)
(Diagram: user pages held in host CPU memory, with SRAM 1 and SRAM 2 serving as page frames for the FPGA's user-defined circuits)
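
A sketch of the swapping idea, assuming two SRAM page frames, a FIFO replacement policy, and host memory as backing store; the page size and names are illustrative, not the card's actual parameters.

```c
/* Tiny virtual-memory layer for on-card SRAM: user pages live in host
 * memory and are faulted into SRAM page frames on demand. */
#include <string.h>

#define PAGE_SIZE  (2 * 1024 * 1024)  /* assumed page = one SRAM bank */
#define NUM_FRAMES 2                  /* SRAM 1 and SRAM 2 */

extern char  sram_frames[NUM_FRAMES][PAGE_SIZE]; /* card memory */
extern char *host_pages[];                       /* backing store */

static int resident[NUM_FRAMES] = { -1, -1 };    /* page in each frame */
static int victim = 0;                           /* FIFO replacement */

/* Return a frame holding 'page', swapping it in from host memory if
 * needed (writing back the evicted page first). */
static char *page_in(int page)
{
    for (int f = 0; f < NUM_FRAMES; f++)
        if (resident[f] == page)
            return sram_frames[f];               /* hit: no transfer */

    int f = victim;
    victim = (victim + 1) % NUM_FRAMES;
    if (resident[f] >= 0)                        /* write back old page */
        memcpy(host_pages[resident[f]], sram_frames[f], PAGE_SIZE);
    memcpy(sram_frames[f], host_pages[page], PAGE_SIZE);
    resident[f] = page;
    return sram_frames[f];
}
```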

46 RC-1000 Challenges
- Hardware implementation
  - Queue state machines
- Memory locking
  - The SRAM is single-ported
  - Must arbitrate for its use
- CPU / NI contention
  - The NI manages the FPGA lock
(Diagram: FPGA user circuits and SRAM guarded by a memory lock shared between the CPU and the NI)

47 Example: Autonomous Spaceborne Clusters
- NASA Remote Exploration and Experimentation
  - Spaceborne vehicle processes data locally
  - "Clusters in the sky"
- A number of peripheral devices
  - Data sensors
  - FPGAs and DSPs
- Adaptive hardware
  - Modify functionality after deployment

48 Performance: Card Interactions
- Acquire the FPGA SRAM lock
  - CPU-NI: 20 µs
  - NI: 8 µs
- Inject a 4 KB message to the FPGA
  - CPU: 58 µs (70 MB/s)
  - NI: 32 µs (128 MB/s)
- Release the FPGA SRAM lock
  - CPU-NI: 8 µs
  - NI: 5 µs
(Diagram: FPGA user circuits and SRAM with a memory lock, accessed by the NI and CPU)

49 Example: Digital Libraries
- Enormous amounts of data and users
  - Intelligent LAN and storage cards manage the requests
(Diagram: three hosts on a SAN backbone, each with an intelligent LAN adaptor, a storage adaptor, and a slice of the files (A-H, I-R, S-Z), serving a client)

50 Cyclone Systems I2O Server Adaptor Card
- Networked host on a PCI card
- Integration with GRIM
  - Interacts directly with the NI
  - Ported the host-level endpoint software
- Utilized as a LAN-SAN bridge
(Diagram: daughter card with an i960 Rx processor, DRAM, DMA engines, 10/100 Ethernet, SCSI, and ROM on a local bus, bridging primary and secondary PCI interfaces to the host system)

51 GRIM Multicast Extensions
- Distribute the same message to multiple receivers
  - Tree-based distribution
  - Replicate the message at the NI
  - Messages are recycled back into the network
- Extensions to the NI's core communication operations
  - Recycled messages use a separate logical channel
  - Utilize per-hop flow control for reliable delivery (see the sketch below)
(Diagram: a five-node multicast tree rooted at endpoint A, with each NI replicating the message for its endpoint and its children)
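
A sketch of NI-level replication under an assumed tree table: each NI delivers a copy locally and recycles copies toward its children, keeping recycled traffic on its own logical channel.

```c
/* NI-level multicast sketch; tree layout and helper names are
 * illustrative assumptions, not GRIM's firmware. */
#include <stddef.h>

#define MAX_CHILDREN 4

typedef struct {
    int nchildren;
    int children[MAX_CHILDREN];  /* next NIs in the multicast tree */
} mcast_entry_t;

extern mcast_entry_t mcast_table[];  /* programmed per multicast group */
extern void deliver_local(const void *msg, size_t len);
extern void send_on_channel(int dst_ni, int channel,
                            const void *msg, size_t len);

#define MCAST_CHANNEL 1  /* separate logical channel for recycled copies */

static void ni_mcast_receive(int group, const void *msg, size_t len)
{
    deliver_local(msg, len);             /* hand a copy to this endpoint */
    const mcast_entry_t *e = &mcast_table[group];
    for (int i = 0; i < e->nchildren; i++)
        /* Recycled copies ride their own channel so they cannot block
         * ordinary traffic; per-hop flow control still guarantees
         * reliable delivery on each link. */
        send_on_channel(e->children[i], MCAST_CHANNEL, msg, len);
}
```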

52 Multicast Performance
(Plot: time in µs vs. multicast message size in bytes for 8 hosts; LANai 4 NIs, P4-1.7 GHz hosts)

53 Multicast Observations
- Beneficial: reduces sending overhead
- Performance loss for large messages
  - Dependent on the NI's memory-copy bandwidth
- On-card memory-copy benchmark:
  - LANai 4: 19 MB/s
  - LANai 9: 66 MB/s

54 Extension: Sockets Emulation
- Berkeley sockets is a communication standard
  - Utilized in numerous distributed applications
- GRIM provides sockets API emulation
  - Functions for intercepting socket calls
  - AM handler functions for buffering connection data (see the sketch below)
(Diagram: a write() at the sender is intercepted and generates an "append to socket X" active message; the receiver's AM handler appends the data to socket X's buffer, where a later read() extracts it)
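
The interception path might look like the following sketch; the mapping table, handler index, and helpers are assumptions, not GRIM's actual emulation layer.

```c
/* Socket-call interception sketch: a write() on an emulated socket
 * becomes an active message appending data at the peer, which a later
 * read() drains from the handler's buffer. */
#include <stddef.h>
#include <sys/types.h>

typedef struct {
    int dst_endpoint;    /* GRIM endpoint backing this connection */
    int remote_sock;     /* peer's socket identifier */
} sock_map_t;

extern sock_map_t *lookup_emulated(int fd);   /* NULL for real sockets */
extern ssize_t real_write(int fd, const void *buf, size_t len);
extern void grim_am_send(int dst_endpoint, int handler_id,
                         const void *payload, size_t len);

enum { AM_SOCKET_APPEND = 7 };   /* assumed handler index */

/* Interposed write(): emulated sockets ride the SAN as active messages;
 * anything else falls through to the ordinary system call. A fuller
 * version would prepend remote_sock as a small header on the payload. */
ssize_t write(int fd, const void *buf, size_t len)
{
    sock_map_t *m = lookup_emulated(fd);
    if (m == NULL)
        return real_write(fd, buf, len);
    grim_am_send(m->dst_endpoint, AM_SOCKET_APPEND, buf, len);
    return (ssize_t)len;
}
```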

55 Sockets Emulation Performance
(Plot: bandwidth in MB/s vs. transfer size in bytes; P4-1.7 GHz hosts)

56 Overall Performance: Store-and-Forward
- Approach: single message, no overlap
  - Three transmission stages
  - Expect roughly 1/3 of the bandwidth of an individual stage
(Diagram: one message crossing the sending host-NI (PCI: 132 MB/s), NI-NI (Myrinet: 160 MB/s), and receiving NI-host stages in sequence; plot of bandwidth in MB/s vs. message size in bytes, P3-550 MHz hosts)

57 Enhancement: Message Pipelining
- Allow overlap with multiple in-flight messages
  - GRIM uses AM and RM fragmentation/reassembly
  - Performance depends on the fragment size
(Diagram: messages 1-3 overlapping across the sending host-NI, NI-NI, and receiving NI-host stages; plot of bandwidth in MB/s vs. message size in bytes, LANai 9 NIs with P3-550 MHz hosts)

58 Enhancement: Cut-Through Transfers
- Forward data as soon as it begins to arrive
  - Cut-through at both the sending and receiving NIs
(Diagram: messages 1 and 2 flowing through the stages with cut-through; plot of bandwidth in MB/s vs. message size in bytes, LANai 9 NIs with P3-550 MHz hosts)

