Application Mapping Over OFIWG SFI, by Sean Hefty. An example MPI implementation over SFI that demonstrates a possible usage model: initialization, send injection, send completions, polling, and RMA counters/completions.

Presentation transcript:

Application Mapping Over OFIWG SFI Sean Hefty

MPI Over SFI

Example MPI implementation over SFI; demonstrates a possible usage model:
– Initialization
– Send injection
– Send completions
– Polling
– RMA counters and completions

Query Interfaces: Tagged

/* Tagged provider */
hints.type = FID_RDM;
#ifdef MPIDI_USE_AV_MAP
hints.addr_format = FI_ADDR;
#else
hints.addr_format = FI_ADDR_INDEX;
#endif
hints.protocol = FI_PROTO_UNSPEC;
hints.ep_cap = FI_TAGGED | FI_BUFFERED_RECV | FI_REMOTE_COMPLETE | FI_CANCEL;
hints.op_flags = FI_REMOTE_COMPLETE;

Slide callouts:
– FID_RDM: reliable unconnected endpoint
– Address vector optimized for a minimal memory footprint and no internal lookups
– FI_PROTO_UNSPEC: transport agnostic
– op_flags: default flags to apply to data transfer operations
– ep_cap: behavior required by the endpoint
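
The query call that consumes these hints is not shown on the slide. A minimal sketch of how the tagged hints might be resolved into a provider and endpoint follows, modeled on the later libfabric API (fi_getinfo/fi_endpoint/fi_freeinfo, the FI_VERSION macro, and the struct fi_info type are assumptions here, not taken from the slides):

/* Illustrative only: resolve the tagged hints into a provider and endpoint. */
struct fi_info *prov_info;

/* Ask the framework for a provider that can satisfy the hints above. */
rc = fi_getinfo(FI_VERSION(1, 0), NULL, NULL, 0, &hints, &prov_info);
handle_errs(rc);

/* Open the reliable unconnected (RDM) endpoint on the selected provider. */
fi_endpoint(domainfd, prov_info, &tagged_epfd, NULL);
fi_freeinfo(prov_info);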

Query Interfaces: RMA/Atomics

/* RMA provider */
hints.type = FID_RDM;
#ifdef MPIDI_USE_AV_MAP
hints.addr_format = FI_ADDR;
#else
hints.addr_format = FI_ADDR_INDEX;
#endif
hints.protocol = FI_PROTO_UNSPEC;
hints.ep_cap = FI_RMA | FI_ATOMICS | FI_REMOTE_COMPLETE | FI_REMOTE_READ | FI_REMOTE_WRITE;
hints.op_flags = FI_REMOTE_COMPLETE;

Slide callouts:
– FI_RMA | FI_ATOMICS: support for RMA and atomic operations
– FI_REMOTE_READ | FI_REMOTE_WRITE: remote RMA read and write support
– A separate endpoint is used for RMA operations
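
The tagged and RMA hint blocks differ only in the capabilities they request, so an implementation could factor out the shared setup. A hypothetical sketch; the helper name, the two hint variables, and the struct fi_info type are assumptions, not from the slides:

/* Hypothetical helper: fill the fields common to the tagged and RMA hints. */
static void init_hints_common(struct fi_info *hints)
{
    hints->type = FID_RDM;               /* reliable unconnected endpoint */
#ifdef MPIDI_USE_AV_MAP
    hints->addr_format = FI_ADDR;
#else
    hints->addr_format = FI_ADDR_INDEX;
#endif
    hints->protocol = FI_PROTO_UNSPEC;   /* transport agnostic */
    hints->op_flags = FI_REMOTE_COMPLETE;
}

/* Only the requested endpoint capabilities differ between the two queries. */
init_hints_common(&tagged_hints);
tagged_hints.ep_cap = FI_TAGGED | FI_BUFFERED_RECV | FI_REMOTE_COMPLETE | FI_CANCEL;

init_hints_common(&rma_hints);
rma_hints.ep_cap = FI_RMA | FI_ATOMICS | FI_REMOTE_COMPLETE |
                   FI_REMOTE_READ | FI_REMOTE_WRITE;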

Query Interfaces: Message Queue

eq_attr.mask = FI_EQ_ATTR_MASK_V1;
eq_attr.domain = FI_EQ_DOMAIN_COMP;
eq_attr.format = FI_EQ_FORMAT_TAGGED;
fi_eq_open(domainfd, &eq_attr, &p2p_eqfd, NULL);

eq_attr.mask = FI_EQ_ATTR_MASK_V1;
eq_attr.domain = FI_EQ_DOMAIN_COMP;
eq_attr.format = FI_EQ_FORMAT_DATA;
fi_eq_open(domainfd, &eq_attr, &rma_eqfd, NULL);

fi_bind(tagged_epfd, p2p_eqfd, FI_SEND | FI_RECV);
fi_bind(rma_epfd, rma_eqfd, FI_READ | FI_WRITE);

Slide callouts:
– p2p_eqfd: event queue optimized to report tagged completions
– rma_eqfd: event queue optimized to report RMA completions
– fi_bind(): associates the endpoints with their event queues

Query Limits

optlen = sizeof(max_buffered_send);
fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_INJECTED_SEND,
          &max_buffered_send, &optlen);

optlen = sizeof(max_send);
fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_MSG_SIZE,
          &max_send, &optlen);

Slide callouts:
– fi_getopt() queries endpoint limits
– FI_OPT_MAX_INJECTED_SEND: maximum 'inject' data size; the buffer is reusable as soon as the call returns
– FI_OPT_MAX_MSG_SIZE: maximum application-level message size
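
The "Large Message Send" slide refers to MPIDI_Global.tagged_epfd, which suggests the endpoint handles and these queried limits are cached in a global structure at initialization. The layout of MPIDI_Global is not shown on the slides; a hypothetical sketch (all field names, and every type other than fid_eq_t, are assumptions):

/* Hypothetical global state filled in during initialization. */
static struct {
    fid_ep_t tagged_epfd;        /* tagged (two-sided) endpoint        */
    fid_ep_t rma_epfd;           /* RMA/atomics endpoint               */
    fid_eq_t p2p_eqfd;           /* event queue for tagged completions */
    fid_eq_t rma_eqfd;           /* event queue for RMA completions    */
    size_t   max_buffered_send;  /* FI_OPT_MAX_INJECTED_SEND           */
    size_t   max_send;           /* FI_OPT_MAX_MSG_SIZE                */
} MPIDI_Global;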

Short Send

int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request)
{
    data_sz = get_size(count, datatype);
    if (data_sz <= max_buffered_send) {
        match_bits = init_sendtag(comm->context_id + context_offset,
                                  comm->rank, tag, 0);
        fi_tinjectto(tagged_epfd, buf, data_sz,
                     COMM_TO_PHYS(comm, rank), match_bits);
    } else {
        ...
    }
}

Slide callouts:
– Small sends map directly to the tagged inject-to call
– The fabric address is provided directly to the provider
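
init_sendtag() is not defined on the slides; it packs the communicator context id, source rank, and MPI tag into the 64-bit match bits used by the tagged interface. A plausible sketch, with field widths chosen purely for illustration (the real layout is not given in the source):

/* Hypothetical match-bits packing: field widths are illustrative only. */
static uint64_t init_sendtag(uint32_t context_id, uint32_t rank,
                             uint32_t tag, uint32_t type)
{
    return ((uint64_t)(type       & 0xF)    << 60) |   /* message class      */
           ((uint64_t)(context_id & 0xFFF)  << 48) |   /* communicator id    */
           ((uint64_t)(rank       & 0xFFFF) << 32) |   /* source rank        */
            (uint64_t) tag;                            /* MPI tag (low bits) */
}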

Large Message Send

int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request)
{
    /* code for type calculations, tag creation, etc. */
    REQUEST_CREATE(sreq);
    fi_tsendto(MPIDI_Global.tagged_epfd, send_buf, data_sz, NULL,
               COMM_TO_PHYS(comm, rank), match_bits,
               &(REQ_OF2(sreq)->of2_context));
    *request = sreq;
}

Slide callouts:
– Large sends require request allocation
– The SFI completion context is embedded in the request object
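
The key point of this slide is that the SFI completion context lives inside the request, so a completion can be mapped straight back to its request without a lookup table. The request layout is not shown; a hypothetical sketch of how REQ_OF2()->of2_context and the context_to_request() call used on the next slide might fit together (the struct layout, the fi_context type, and the macro bodies are assumptions):

#include <stddef.h>   /* offsetof */

struct MPID_Request {
    /* ... generic MPICH request state ... */
    void (*callback)(struct MPID_Request *);
    struct {
        struct fi_context of2_context;   /* passed to fi_tsendto()   */
    } of2;                               /* netmod-private portion   */
};

#define REQ_OF2(req)  (&(req)->of2)

/* Recover the owning request from a completion's op_context. */
#define context_to_request(ctx) \
    ((struct MPID_Request *)((char *)(ctx) - \
        offsetof(struct MPID_Request, of2.of2_context)))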

Progress/Polling for Completions

int MPIDI_Progress()
{
    eq_tagged_entry_t wc;
    fid_eq_t fd[2] = {p2p_eqfd, rma_eqfd};

    for (i = 0; i < 2; i++) {
        MPID_Request *req;

        rc = fi_eq_read(fd[i], (void *) &wc, sizeof(wc));
        handle_errs(rc);
        req = context_to_request(wc.op_context);
        req->callback(req);
    }
}

Slide callout:
– The fields of the tagged entry align with those of the data entry, so one read loop with a single entry type can service both event queues
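
The loop above reads both queues into a single eq_tagged_entry_t because the tagged format is a superset of the data format. The exact SFI entry definitions are not on the slides; a sketch modeled on the later libfabric completion formats (field names and types are assumptions):

/* Illustrative entry layouts, modeled on later libfabric completion formats.
 * The leading fields match, so one tagged entry can service either queue. */
typedef struct {
    void     *op_context;   /* context supplied with the operation   */
    uint64_t  flags;
    size_t    len;
    void     *buf;
    uint64_t  data;         /* remote completion data                */
} eq_data_entry_t;

typedef struct {
    void     *op_context;   /* same leading fields as the data entry */
    uint64_t  flags;
    size_t    len;
    void     *buf;
    uint64_t  data;
    uint64_t  tag;          /* additional field for tagged messages  */
} eq_tagged_entry_t;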

RMA Completions (Counters and Completions)

int MPIDI_Win_fence(MPID_Win *win)
{
    /* Synchronize software counters via completions */
    PROGRESS_WHILE(win->started != win->completed);

    /* Synchronize hardware counters */
    fi_sync(WIN_OF2(win)->rma_epfd, FI_WRITE | FI_READ | FI_BLOCK, NULL);

    /* Notify any request-based objects that use counter completion */
    RequestQ->notify();
}
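
The slides never show where win->started and win->completed are updated. A plausible sketch under the assumption that each RMA operation issued on the window increments the started count and that the completion callback driven by MPIDI_Progress() increments the completed count, so the fence simply spins until the two match (the callback name and the win back-pointer are assumptions):

/* Hypothetical completion callback installed for RMA requests. */
static void rma_done_callback(MPID_Request *req)
{
    MPID_Win *win = req->win;   /* back-pointer assumed for illustration */
    win->completed++;
}

/* At the point an RMA operation is issued on the window: */
win->started++;
/* ... issue the RMA read/write on WIN_OF2(win)->rma_epfd ... */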