Developing a Scalable Coherent Interface (SCI) device for MPJ Express
Guillermo López Taboada
14th October, 2005
Dept. of Electronics and Systems, University of A Coruña (Spain), http://www.des.udc.es
Visitor at Distributed Systems Group, http://dsg.port.ac.uk
Outline
- Introduction
- Design of scidev
- Implementation issues
- Benchmarking
- Future work
- Conclusions
November 24, 2018
Introduction
- The interconnection network and its associated software libraries play a key role in high-performance cluster technology
- Cluster interconnection technologies:
  - Gb & 10Gb Ethernet
  - Myrinet
  - SCI
  - InfiniBand
  - QsNet (Quadrics)
  - GSN - HIPPI
  - Giganet
- Latencies are small (usually under 10 us)
- Bandwidths are high (usually above 1 Gbps)
Introduction
- SCI (Scalable Coherent Interface)
  - Latency: 1.42 us (theoretical)
  - Bandwidth: 5333 Mbps (bi-directional)
  - Usually deployed without a switch (small clusters)
  - Topologies: 1D (ring) / 2D (torus)
Introduction
[Figure: example of a 2D torus SCI cluster with FE (admin)]
Introduction
- Software available from Dolphinics:
- Software available from Scali:
  - ScaIP: IP emulation
  - ScaSISCI: SISCI (Software Infrastructure for SCI)
  - ScaMPI: proprietary MPI implementation
Introduction
- Java's portability means that, for networking, the JDK supports only the widespread TCP/IP
- Previously, IP emulations (ScaIP & SCIP) were used, but their performance is similar to Fast Ethernet (FE)
- Now there is a high-performance socket implementation, SCI SOCKETS
- This is similar to other interconnection technologies, e.g. Myrinet (IPoGM -> GMSockets)
Introduction
- Several research projects have tried to add Java support for these System Area Networks, mainly for Myrinet:
  - KaRMI/GM (JavaParty, Univ. Karlsruhe)
  - Manta/LFC/Panda/Ibis (Vrije Universiteit, the Netherlands)
  - Java GM Sockets
  - RMIX Myrinet
  - mpiJava/MPICH-GM or MPICH-MX
  - …
- But nothing for SCI
Introduction
- My PhD project: “Designing Efficient Mechanisms for Java communications on SCI systems”
- The motivation is to fill the gap between Java and this high-speed interconnect, which lacks software support for Java
  - SCI Java Fast Sockets
  - An SCI communication device, the base of a messaging system
  - SCI Channel for Java NIO
  - Wrappers for some libraries
  - Optimized RMI for high-speed networks
  - Low-level Java buffering and communication system
Introduction
- MPJ Express, a reference implementation of the MPI bindings for the Java language, has been released
- The bindings for C, C++ and Fortran are already mature, while work on the Java binding is ongoing at DSG
- This is a good opportunity to provide SCI support to a messaging system
Design of scidev
- Use of the Java Native Interface (JNI) is unavoidable
  - To provide SCI support and good performance, we have to rely on specific low-level libraries
  - When SCI hardware is present, it should be used
  - Portability is lost in exchange for higher performance
- Differences between mpiJava and scidev:
  - mpiJava: thin wrapper providing a large number of Java MPI primitives
  - scidev: thicker layer providing a small API
Design of scidev
- Implementing the xdev API:
  - init()
  - finish()
  - id()
  - iprobe(ProcessID srcID, int tag, int context)
  - irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status)
  - isend(Buffer buf, ProcessID destID, int tag, int context)
- and the blocking counterparts of these functions: probe, recv, send
- plus issend & ssend
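As a rough illustration, the operations listed above could be written down as a Java interface. The method names and parameter lists are taken from the slide. Everything else is an assumption made for this sketch: the return types, the placeholder classes `ProcessID`, `Buffer`, `Status` and `Request`, and the toy `LoopbackDevice`, which delivers every message to the sending process and has nothing to do with SCI.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;

final class XdevSketch {
    // Placeholder types (assumptions): the real classes are not shown on the slides.
    static final class ProcessID { final int rank; ProcessID(int rank) { this.rank = rank; } }
    static final class Buffer { final byte[] data; Buffer(int capacity) { data = new byte[capacity]; } }
    static final class Status { int tag; int context; int length; }
    static final class Request { boolean completed; }

    // Non-blocking part of the API; return types are assumed for this sketch.
    interface Device {
        void init();
        void finish();
        ProcessID id();
        Status iprobe(ProcessID srcID, int tag, int context);
        Request irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status);
        Request isend(Buffer buf, ProcessID destID, int tag, int context);
    }

    // Hypothetical single-process device: every send lands in a local queue.
    static final class LoopbackDevice implements Device {
        private static final class Message {
            final int tag; final int context; final byte[] payload;
            Message(int tag, int context, byte[] payload) {
                this.tag = tag; this.context = context; this.payload = payload;
            }
        }

        private final ProcessID self = new ProcessID(0);
        private final Deque<Message> queue = new ArrayDeque<>();

        public void init() { queue.clear(); }
        public void finish() { queue.clear(); }
        public ProcessID id() { return self; }

        // Returns a Status if a matching message is queued, otherwise null.
        public Status iprobe(ProcessID srcID, int tag, int context) {
            for (Message m : queue) {
                if (m.tag == tag && m.context == context) {
                    Status s = new Status();
                    s.tag = tag; s.context = context; s.length = m.payload.length;
                    return s;
                }
            }
            return null;
        }

        // Completes immediately if a matching message is queued; otherwise stays pending.
        public Request irecv(Buffer buf, ProcessID srcID, int tag, int context, Status status) {
            Request r = new Request();
            for (Iterator<Message> it = queue.iterator(); it.hasNext();) {
                Message m = it.next();
                if (m.tag == tag && m.context == context) {
                    int n = Math.min(m.payload.length, buf.data.length);
                    System.arraycopy(m.payload, 0, buf.data, 0, n);
                    status.tag = tag; status.context = context; status.length = n;
                    it.remove();
                    r.completed = true;
                    break;
                }
            }
            return r;
        }

        // Copies the payload into the local queue and reports completion.
        public Request isend(Buffer buf, ProcessID destID, int tag, int context) {
            queue.add(new Message(tag, context, buf.data.clone()));
            Request r = new Request();
            r.completed = true;
            return r;
        }
    }

    public static void main(String[] args) {
        Device dev = new LoopbackDevice();
        dev.init();
        Buffer out = new Buffer(4);
        out.data[0] = 42;
        dev.isend(out, dev.id(), 7, 0);
        Buffer in = new Buffer(4);
        Status st = new Status();
        Request r = dev.irecv(in, dev.id(), 7, 0, st);
        System.out.println("completed=" + r.completed + " first byte=" + in.data[0]);
        dev.finish();
    }
}
```

The blocking calls (probe, recv, send) could be layered on top by waiting on the returned `Request`, which is one reason a small non-blocking core keeps the device API compact.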
Design of scidev
[Figure: layered architecture with the labels mpjdev, xdev, mxdev, scidev, JVM, JNI, Native Libraries, O.S.]
Design of scidev
- Native libraries: SCILib and SISCI
Implementation Issues
- Optimizations / initialization process:
  - JNI: caching field identifiers and references to objects
  - Sending 2 messages in the Long protocol: the first from a 4-byte-aligned address, and the second from a 128-byte-aligned address up to a 128-byte-aligned address (this can go beyond the end of the message; the raw Buffer has a 2^n length)
  - Algorithm to initialize the message queues of SCILib:
    - Connect (to nodes with lower rank)
    - Create (for all nodes, beginning with the following rank)
    - Connect (the remaining nodes)
    - The complexity is O(n)
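The three-step queue initialization can be sketched as a function that lists, for one node, the order of its connect and create operations. This is only one reading of the slide: it assumes that "beginning with the following rank" means walking the peers from rank + 1 and wrapping around, and the class name `QueueInitPlan` and the string labels are hypothetical. No SCILib call is made.

```java
import java.util.ArrayList;
import java.util.List;

final class QueueInitPlan {
    // Returns the ordered steps that the node with the given rank would perform.
    static List<String> plan(int rank, int size) {
        if (size < 1 || rank < 0 || rank >= size) {
            throw new IllegalArgumentException("rank must be in [0, size)");
        }
        List<String> steps = new ArrayList<>();
        // Step 1: connect to the queues of nodes with lower rank.
        for (int peer = 0; peer < rank; peer++) {
            steps.add("connect " + peer);
        }
        // Step 2: create local queues for all peers, starting at rank + 1 (assumed wrap-around).
        for (int i = 1; i < size; i++) {
            steps.add("create " + ((rank + i) % size));
        }
        // Step 3: connect to the remaining nodes, i.e. those with higher rank.
        for (int peer = rank + 1; peer < size; peer++) {
            steps.add("connect " + peer);
        }
        return steps;
    }

    public static void main(String[] args) {
        int size = 4;
        for (int rank = 0; rank < size; rank++) {
            System.out.println("rank " + rank + ": " + plan(rank, size));
        }
    }
}
```

Under this reading each node performs 2(n - 1) operations for n nodes, which is consistent with the O(n) complexity stated on the slide.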
Implementation Issues
- Transport protocols:
  - 3 native protocols:
    - Inline: 1-113 bytes
    - Short: 114 bytes - 64 KB
    - Long: 64 KB - 1 MB
  - scidev fragments messages > 1 MB and uses:
    - Inline for control messages and small messages (< 113 bytes)
    - Short with PIO (Programmed Input-Output) for messages < 8 KB
    - Short with DMA (Direct Memory Access) for messages of 8-64 KB
    - The Long protocol of the user-level libraries does not use DMA transfers, so it is replaced by scidev's own Long protocol with DMA transfers
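The size thresholds above amount to a small decision function. The following sketch encodes them; the class and enum names are hypothetical, and the exact treatment of the boundary values (113 bytes, 8 KB, 64 KB) is an assumption, since the slide does not say whether the limits are inclusive.

```java
final class ProtocolSelector {
    enum Protocol { INLINE, SHORT_PIO, SHORT_DMA, LONG_DMA }

    static final int INLINE_MAX = 113;            // largest inline message, in bytes
    static final int PIO_LIMIT = 8 * 1024;        // below this, Short uses PIO
    static final int SHORT_MAX = 64 * 1024;       // largest Short message
    static final int FRAGMENT_MAX = 1024 * 1024;  // larger messages are fragmented

    // Picks the protocol for one fragment of at most FRAGMENT_MAX bytes.
    static Protocol select(int size) {
        if (size < 0 || size > FRAGMENT_MAX) {
            throw new IllegalArgumentException("fragment size out of range: " + size);
        }
        if (size <= INLINE_MAX) return Protocol.INLINE;   // assumed inclusive
        if (size < PIO_LIMIT) return Protocol.SHORT_PIO;
        if (size <= SHORT_MAX) return Protocol.SHORT_DMA; // assumed inclusive
        return Protocol.LONG_DMA;
    }

    // Number of fragments needed for a message of the given total size.
    static int fragmentCount(long messageSize) {
        if (messageSize < 0) {
            throw new IllegalArgumentException("negative message size");
        }
        if (messageSize == 0) return 1; // an empty message still travels as one fragment
        return (int) ((messageSize + FRAGMENT_MAX - 1) / FRAGMENT_MAX);
    }

    public static void main(String[] args) {
        int[] sizes = { 64, 113, 114, 8 * 1024, 64 * 1024, 64 * 1024 + 1, 1024 * 1024 };
        for (int size : sizes) {
            System.out.println(size + " bytes -> " + select(size));
        }
        System.out.println("3 MB + 1 byte -> " + fragmentCount(3L * FRAGMENT_MAX + 1) + " fragments");
    }
}
```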
Implementation Issues
- Communications:
  - scidev is based on non-blocking communications
  - It is coded using niodev as a template
  - Asynchronous sends for message sizes > 1 MB
- Notification strategy:
  - Follows the approach of SCI SOCKET, using the mbox interrupt library
  - Created without transferring the references (SCI interrupt handlers)
  - Each interrupt (both user_interruptions and dma_interruptions) registers a callback method
Implementation Issues
- Sending/Receiving:
  - 2 threads: user thread and selector thread, synchronized to reduce latency
  - 1 message queue in which the control messages of pending communications are kept
  - Sending is done directly from the “Buffer” direct ByteBuffer
  - If the selector thread receives a message that has not been posted, it creates an intermediate buffer for temporary storage
  - If the message has been posted, it copies the message directly to the “Buffer” direct ByteBuffer
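The posted versus not-yet-posted receive logic can be illustrated with a small matcher shared by the two threads. This is a simplified sketch, not the scidev code: the class `ReceiveMatcher`, its method names and the (tag, context) matching key are assumptions, and the byte arrays stand in for data that would really arrive over SCI.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

final class ReceiveMatcher {
    // Receives posted by the user thread that have not been matched yet.
    private final Map<Long, Deque<ByteBuffer>> posted = new HashMap<>();
    // Intermediate storage for messages that arrived before a receive was posted.
    private final Map<Long, Deque<byte[]>> unexpected = new HashMap<>();

    private static long key(int tag, int context) {
        return ((long) context << 32) | (tag & 0xFFFFFFFFL);
    }

    // User thread: post a receive. Returns true if an early message completed it at once.
    synchronized boolean postReceive(int tag, int context, ByteBuffer dst) {
        Deque<byte[]> early = unexpected.get(key(tag, context));
        if (early != null && !early.isEmpty()) {
            dst.put(early.poll()); // copy out of the intermediate buffer
            return true;
        }
        posted.computeIfAbsent(key(tag, context), k -> new ArrayDeque<>()).add(dst);
        return false;
    }

    // Selector thread: a message arrived. Returns true if it went straight to a posted buffer.
    synchronized boolean onArrival(int tag, int context, byte[] payload) {
        Deque<ByteBuffer> waiting = posted.get(key(tag, context));
        if (waiting != null && !waiting.isEmpty()) {
            waiting.poll().put(payload); // direct copy, no intermediate buffer
            return true;
        }
        unexpected.computeIfAbsent(key(tag, context), k -> new ArrayDeque<>()).add(payload.clone());
        return false;
    }

    public static void main(String[] args) {
        ReceiveMatcher matcher = new ReceiveMatcher();
        ByteBuffer postedFirst = ByteBuffer.allocateDirect(16);
        matcher.postReceive(1, 0, postedFirst);
        System.out.println("direct delivery: " + matcher.onArrival(1, 0, new byte[] {1, 2, 3}));
        System.out.println("direct delivery: " + matcher.onArrival(2, 0, new byte[] {9}));
        ByteBuffer postedLate = ByteBuffer.allocateDirect(16);
        System.out.println("completed from intermediate buffer: " + matcher.postReceive(2, 0, postedLate));
    }
}
```

The early-arrival path costs one extra copy, which is why posting receives before the matching send arrives keeps the transfer on the direct path.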
Implementation Issues
- This scheme applies to each pair of nodes
[Figure: user threads and selector thread; SBUFFER and RBUFFER; ULL, LONG, SHORT and Inline queues on each side, with an intermediate buffer; connected through SCI]
Benchmarking
- JDK 1.5 on holly. Latency (us).
- Columns, in order: MPJE, mpiJava, C sockets, Java sockets
  - SCI: 51, 12, 5, 11
  - FE: 161, 145, 83, 109
  - GbE: 131, 101, 65, 86
- scidev latency is 33 us!
Benchmarking
- JDK 1.5 on holly. Asymptotic bandwidths (Mbps).
- Columns, in order: MPJE, mpiJava, C sockets, Java sockets
  - SCI: 1200, 1480, 400, 360
  - FE: 90, 92, 93 (three values only)
  - GbE: 680, 587, 900, 600*
- scidev throughput is 1280 Mbps!
Future work
- Immediately:
  - Testing of collective communications (so far only point-to-point has been tested)
  - A design with lower interdependence between xdev and mpjbuf
  - Reading information from the different formats of SCI configuration files
  - Benchmarking with MPJ applications, and developing MPJ and xdev applications
  - New buffering implementation
Future work
- Buffering system with SBUFFER and RBUFFER in ULL (still with an intermediate buffer)
[Figure: SBUFFER and RBUFFER placed in ULL; LONG, SHORT and Inline queues on each side, with an intermediate buffer; connected through SCI]
Conclusions
- Performance is still a problem
  - Try to avoid the control message, perhaps by integrating its data into the user-level library
  - Aim: latency of 30 us and bandwidth of 1350 Mbps
- Current development phase: testing
  - It is hard to do multiple initializations in a single thread (restarting the device)
- The design is somewhat coupled with MPJ (strong interdependence)
- Needs evaluation and implementation using a kernel-level library (threads and spawning processes natively)
Questions?
Appendix
- Visitor at the DSG during summer 05
- Pursuing a PhD at Univ. of A Coruña (Spain)
Appendix
- BS in Computing Tech. in 2002 at A Coruña Univ.
- Member of the Computer Architecture Group
- Areas of interest of the group:
  - High-performance compilers (automatic detection of parallelism)
  - Cluster computing
  - Grid applications
  - Management of parallel/distributed systems
  - Fault tolerance in MPI
  - Computer graphics (rendering, radiosity)
  - Geographical Information Systems
- 12 staff members, 8 PhD students
Appendix
- Computer Architecture Group
- Crossgrid (EU project within Gridstart)
Appendix
- The Computer Architecture Group is young, with an average age of 32 years
- Some achievements (2000-2004):
  - Papers in international conferences: 102
  - Papers in journals: 53 (41 in the JCR/SCI list)
  - Regional, national and European funded projects (+/- 1M € in 5 years)
Acknowledgements
- DSG, for providing full support for my work
  - Especially Aamir and Raz, for late, smoky and caffeinated DSG office hours
  - Mark, for hosting the visit and for his valuable support
- ICG and UoP, for the facilities and services
- Bryan Carpenter, for his rare but valuable comments, and his help with some JNI problems
- DXIDI – Xunta de Galicia, for funding the visit
A Coruña
You will always be welcome in A Coruña!