Application Mapping Over OFIWG SFI Sean Hefty
MPI Over SFI Example MPI Implementation over SFI Demonstrates possible usage model –Initialization –Send injection –Send Completions –Polling –RMA Counters Completions 2
Query Interfaces: Tagged /* Tagged provider */ hints.type = FID_RDM; #ifdef MPIDI_USE_AV_MAP hints.addr_format = FI_ADDR; #else hints.addr_format = FI_ADDR_INDEX; #endif hints.protocol = FI_PROTO_UNSPEC; hints.ep_cap = FI_TAGGED | FI_BUFFERED_RECV | FI_REMOTE_COMPLETE | FI_CANCEL; hints.op_flags = FI_REMOTE_COMPLETE; 3 Reliable unconnected endpoint Address vector optimized for minimal memory footprint and no internal lookups Transport agnostic Default flags to apply to data transfer operations Behavior required by endpoint
Query Interfaces: RMA/Atomics /* RMA provider */ hints.type = FID_RDM; #ifdef MPIDI_USE_AV_MAP hints.addr_format = FI_ADDR; #else hints.addr_format = FI_ADDR_INDEX; #endif hints.protocol = FI_PROTO_UNSPEC; hints.ep_cap = FI_RMA | FI_ATOMICS | FI_REMOTE_COMPLETE | FI_REMOTE_READ | FI_REMOTE_WRITE; hints.op_flags = FI_REMOTE_COMPLETE; 4 Support for RMA and atomic operations Remote RMA read and write support Separate endpoint for RMA operations
Query Interfaces: Message Queue eq_attr.mask = FI_EQ_ATTR_MASK_V1; eq_attr.domain = FI_EQ_DOMAIN_COMP; eq_attr.format = FI_EQ_FORMAT_TAGGED; fi_eq_open(domainfd, &eq_attr, &p2p_eqfd, NULL); eq_attr.mask = FI_EQ_ATTR_MASK_V1; eq_attr.domain = FI_EQ_DOMAIN_COMP; eq_attr.format = FI_EQ_FORMAT_DATA; fi_eq_open(domainfd, &eq_attr, rma_eqfd, NULL); fi_bind(tagged_epfd, p2p_eqfd, FI_SEND | FI_RECV); fi_bind(rma_epfd, rma_eqfd, FI_READ | FI_WRITE); 5 Event queue optimized to report tagged completions Event queue optimized to report RMA completions Associate endpoints with event queues
Query Limits optlen = sizeof(max_buffered_send); fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_INJECTED_SEND, &max_buffered_send, &optlen); optlen = sizeof(max_send); fi_getopt(tagged_epfd, FI_OPT_ENDPOINT, FI_OPT_MAX_MSG_SIZE, &max_send, &optlen); 6 Query endpoint limits Maximum ‘inject’ data size – buffer is reusable immediately after function call returns Maximum application level message size
Short Send int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request) { data_sz = get_size(count, datatype); if (data_sz <= max_buffered_send) { match_bits = init_sendtag(comm->context_id + context_offset, comm->rank, tag, 0); fi_tinjectto(tagged_epfd, buf, data_sz, COMM_TO_PHYS(comm, rank), match_bits); } else {... } 7 Small sends map directly to tagged-injectto call Fabric address provided directly to provider
Large Message Send int MPIDI_Send(buf, count, datatype, rank, tag, comm, context_offset, **request) { /* code for type calculations, tag creation, etc */ REQUEST_CREATE(sreq); fi_tsendto(MPIDI_Global.tagged_epfd,send_buf, data_sz, NULL, COMM_TO_PHYS(comm,rank), match_bits, &(REQ_OF2(sreq)->of2_context)); *request = sreq; } 8 Large sends require request allocation SFI completion context embedded in request object
Progress/Polling for Completions int MPIDI_Progress() { eq_tagged_entry_t wc; fid_eq_t fd[2] = {p2p_eqfd, rma_eqfd}; for(i=0;i<2;i++) { MPID_Request *req; rc = fi_eq_read(fd[i],(void *)&wc, sizeof(wc)); handle_errs(rc); req = context_to_request(wc.op_context); req->callback(req); } 9 Fields align on tagged entry to data_entry
RMA Completions (Counters and Completions) int MPIDI_Win_fence(MPID_Win *win) { /* synchronize software counters via completions */ PROGRESS_WHILE(win->started!=win->completed); /* Syncronize hardware counters */ fi_sync(WIN_OF2(win)->rma_epfd, FI_WRITE|FI_READ|FI_BLOCK, NULL); /* Notify any request based objects that use counter completion */ RequestQ->notify() } 10