Presentation is loading. Please wait.

Presentation is loading. Please wait.

General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley.

Similar presentations


Presentation on theme: "General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley."— Presentation transcript:

1 General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley

2 3/12/99CS258 S992 *T: Network Co-Processor

3 3/12/99CS258 S993 iWARP: Systolic Computation Nodes integrate communication with computation on systolic basis Msg data direct to register Stream into memory Interface unit Host

4 3/12/99CS258 S994 Dedicated Message Processor General Purpose processor performs arbitrary output processing (at system level) General Purpose processor interprets incoming network transactions (at system level) User Processor Msg Processor share memory Msg Processor Msg Processor via system network transaction Network ° ° ° dest Mem P M P NI UserSystem Mem P M P NI UserSystem

5 3/12/99CS258 S995 Levels of Network Transaction User Processor stores cmd / msg / data into shared output queue –must still check for output queue full (or make elastic) Communication assists make transaction happen –checking, translation, scheduling, transport, interpretation Effect observed on destination address space and/or events Protocol divided between two layers Network ° ° ° dest Mem P M P NI UserSystem Mem P M P NI

6 3/12/99CS258 S996 Example: Intel Paragon

7 3/12/99CS258 S997 User Level Abstraction (Lok Liu) Any user process can post a transaction for any other in protection domain –communication layer moves OQ src –> IQ dest –may involve indirection: VAS src –> VAS dest Proc OQ IQ VAS Proc OQ IQ VAS Proc OQ IQ VAS Proc OQ IQ VAS

8 3/12/99CS258 S998 Msg Processor Events Dispatcher User Output Queues Send FIFO ~Empty Rcv FIFO ~Full Send DMA Rcv DMA DMA done Compute Processor Kernel System Event

9 3/12/99CS258 S999 Basic Implementation Costs: Scalar Cache-to-cache transfer (two 32B lines, quad word ops) –producer: read(miss,S), chk, write(S,WT), write(I,WT),write(S,WT) –consumer: read(miss,S), chk, read(H), read(miss,S), read(H),write(S,WT) to NI FIFO: read status, chk, write,... from NI FIFO: read status, chk, dispatch, read, read,... CP User OQ MP Registers Cache Net FIFO User IQ MPCP Net 21.52 4.4 µs5.4 µs 10.5 µs 7 wds 222 250ns + H*40ns

10 3/12/99CS258 S9910 Virtual DMA -> Virtual DMA Send MP segments into 8K pages and does VA –> PA Recv MP reassembles, does dispatch and VA –> PA per page CP User OQ MP Registers Cache Net FIFO User IQ MP CP Net 21.52 7 wds 222 Memory sDMA hdr rDMA MP 2048 400 MB/s 175 MB/s 400 MB/s

11 3/12/99CS258 S9911 Single Page Transfer Rate Actual Buffer Size: 2048 Effective Buffer Size: 3232

12 3/12/99CS258 S9912 Msg Processor Assessment Concurrency Intensive –Need to keep inbound flows moving while outbound flows stalled –Large transfers segmented Reduces overhead but adds latency User Output Queues Send FIFO ~Empty Rcv FIFO ~Full Send DMA Rcv DMA DMA done Compute Processor Kernel System Event User Input Queues VAS Dispatcher

13 3/12/99CS258 S9913 Case Study: Meiko CS2 Concept Circuit-switched Network Transaction –source-dest circuit held open for request response –limited cmd set executed directly on NI Dedicated communication processor for each step in flow

14 3/12/99CS258 S9914 Case Study: Meiko CS2 Organization

15 3/12/99CS258 S9915 Shared Physical Address Space NI emulates memory controller at source NI emulates processor at dest –must be deadlock free

16 3/12/99CS258 S9916 Case Study: Cray T3D Build up info in ‘shell’ Remote memory operations encoded in address

17 3/12/99CS258 S9917 Case Study: NOW General purpose processor embedded in NIC

18 3/12/99CS258 S9918 Message Time Breakdown Communication pipeline

19 3/12/99CS258 S9919 Message Time Comparison

20 3/12/99CS258 S9920 SAS Time Comparison

21 3/12/99CS258 S9921 Message-Passing Time vs Size

22 3/12/99CS258 S9922 Message-Passing Bandwidth vs Size

23 3/12/99CS258 S9923 Application Performance on LU

24 3/12/99CS258 S9924 Working Sets Change with P 8-fold reduction in miss rate from 4 to 8 proc

25 3/12/99CS258 S9925 Application Performance on BT

26 3/12/99CS258 S9926 NAS Communication Scaling Normalized Msgs per ProcAverage Message Size

27 3/12/99CS258 S9927 NAS Communication Scaling: Volume

28 3/12/99CS258 S9928 Communication Characteristics: BT

29 3/12/99CS258 S9929 Beware Average BW analysis

30 3/12/99CS258 S9930 Reflective Memory Writes to local region reflected to remote

31 3/12/99CS258 S9931 Case Study: DEC Memory Channel See also Shrimp

32 3/12/99CS258 S9932 Scalable Synchronization Operations Messages: point-to-point synchronization Build all-to-all as trees Recall: sophisticated locks reduced contention by spinning on separate locations –caching brought them local –test&test&set, ticket-lock, array lock »O(p) space Problem: with array lock location determined by arrival order => not likely to be local Solution: queue-lock –build distributed linked-list, each spins on local node

33 3/12/99CS258 S9933 Queue Locks Head holds lock;Each points to next waiter Shared pointer to tail Acquire –swap (fetch&store) tail with node address, chain in prev Release –signal next –compare&swap plus check to reset tail

34 3/12/99CS258 S9934 Parallel Prefix: Upward Sweep generalization of barrier (reduce-broadcast) compute S i = X i  X i-1 +... + X 0, for i = 0, 1,... combine children, store least significant

35 3/12/99CS258 S9935 Downward Sweep of parallel Prefix Least branch send to most sig child when receive from above –send to least significant –combine with stored and send result to most sign


Download ppt "General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley."

Similar presentations


Ads by Google