Download presentation
Presentation is loading. Please wait.
1
General Purpose Node-to-Network Interface in Scalable Multiprocessors CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley
2
3/12/99CS258 S992 *T: Network Co-Processor
3
3/12/99CS258 S993 iWARP: Systolic Computation Nodes integrate communication with computation on systolic basis Msg data direct to register Stream into memory Interface unit Host
4
3/12/99CS258 S994 Dedicated Message Processor General Purpose processor performs arbitrary output processing (at system level) General Purpose processor interprets incoming network transactions (at system level) User Processor Msg Processor share memory Msg Processor Msg Processor via system network transaction Network ° ° ° dest Mem P M P NI UserSystem Mem P M P NI UserSystem
5
3/12/99CS258 S995 Levels of Network Transaction User Processor stores cmd / msg / data into shared output queue –must still check for output queue full (or make elastic) Communication assists make transaction happen –checking, translation, scheduling, transport, interpretation Effect observed on destination address space and/or events Protocol divided between two layers Network ° ° ° dest Mem P M P NI UserSystem Mem P M P NI
6
3/12/99CS258 S996 Example: Intel Paragon
7
3/12/99CS258 S997 User Level Abstraction (Lok Liu) Any user process can post a transaction for any other in protection domain –communication layer moves OQ src –> IQ dest –may involve indirection: VAS src –> VAS dest Proc OQ IQ VAS Proc OQ IQ VAS Proc OQ IQ VAS Proc OQ IQ VAS
8
3/12/99CS258 S998 Msg Processor Events Dispatcher User Output Queues Send FIFO ~Empty Rcv FIFO ~Full Send DMA Rcv DMA DMA done Compute Processor Kernel System Event
9
3/12/99CS258 S999 Basic Implementation Costs: Scalar Cache-to-cache transfer (two 32B lines, quad word ops) –producer: read(miss,S), chk, write(S,WT), write(I,WT),write(S,WT) –consumer: read(miss,S), chk, read(H), read(miss,S), read(H),write(S,WT) to NI FIFO: read status, chk, write,... from NI FIFO: read status, chk, dispatch, read, read,... CP User OQ MP Registers Cache Net FIFO User IQ MPCP Net 21.52 4.4 µs5.4 µs 10.5 µs 7 wds 222 250ns + H*40ns
10
3/12/99CS258 S9910 Virtual DMA -> Virtual DMA Send MP segments into 8K pages and does VA –> PA Recv MP reassembles, does dispatch and VA –> PA per page CP User OQ MP Registers Cache Net FIFO User IQ MP CP Net 21.52 7 wds 222 Memory sDMA hdr rDMA MP 2048 400 MB/s 175 MB/s 400 MB/s
11
3/12/99CS258 S9911 Single Page Transfer Rate Actual Buffer Size: 2048 Effective Buffer Size: 3232
12
3/12/99CS258 S9912 Msg Processor Assessment Concurrency Intensive –Need to keep inbound flows moving while outbound flows stalled –Large transfers segmented Reduces overhead but adds latency User Output Queues Send FIFO ~Empty Rcv FIFO ~Full Send DMA Rcv DMA DMA done Compute Processor Kernel System Event User Input Queues VAS Dispatcher
13
3/12/99CS258 S9913 Case Study: Meiko CS2 Concept Circuit-switched Network Transaction –source-dest circuit held open for request response –limited cmd set executed directly on NI Dedicated communication processor for each step in flow
14
3/12/99CS258 S9914 Case Study: Meiko CS2 Organization
15
3/12/99CS258 S9915 Shared Physical Address Space NI emulates memory controller at source NI emulates processor at dest –must be deadlock free
16
3/12/99CS258 S9916 Case Study: Cray T3D Build up info in ‘shell’ Remote memory operations encoded in address
17
3/12/99CS258 S9917 Case Study: NOW General purpose processor embedded in NIC
18
3/12/99CS258 S9918 Message Time Breakdown Communication pipeline
19
3/12/99CS258 S9919 Message Time Comparison
20
3/12/99CS258 S9920 SAS Time Comparison
21
3/12/99CS258 S9921 Message-Passing Time vs Size
22
3/12/99CS258 S9922 Message-Passing Bandwidth vs Size
23
3/12/99CS258 S9923 Application Performance on LU
24
3/12/99CS258 S9924 Working Sets Change with P 8-fold reduction in miss rate from 4 to 8 proc
25
3/12/99CS258 S9925 Application Performance on BT
26
3/12/99CS258 S9926 NAS Communication Scaling Normalized Msgs per ProcAverage Message Size
27
3/12/99CS258 S9927 NAS Communication Scaling: Volume
28
3/12/99CS258 S9928 Communication Characteristics: BT
29
3/12/99CS258 S9929 Beware Average BW analysis
30
3/12/99CS258 S9930 Reflective Memory Writes to local region reflected to remote
31
3/12/99CS258 S9931 Case Study: DEC Memory Channel See also Shrimp
32
3/12/99CS258 S9932 Scalable Synchronization Operations Messages: point-to-point synchronization Build all-to-all as trees Recall: sophisticated locks reduced contention by spinning on separate locations –caching brought them local –test&test&set, ticket-lock, array lock »O(p) space Problem: with array lock location determined by arrival order => not likely to be local Solution: queue-lock –build distributed linked-list, each spins on local node
33
3/12/99CS258 S9933 Queue Locks Head holds lock;Each points to next waiter Shared pointer to tail Acquire –swap (fetch&store) tail with node address, chain in prev Release –signal next –compare&swap plus check to reset tail
34
3/12/99CS258 S9934 Parallel Prefix: Upward Sweep generalization of barrier (reduce-broadcast) compute S i = X i X i-1 +... + X 0, for i = 0, 1,... combine children, store least significant
35
3/12/99CS258 S9935 Downward Sweep of parallel Prefix Least branch send to most sig child when receive from above –send to least significant –combine with stored and send result to most sign
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.