Computer architecture II


1 Computer architecture II
Introduction Computer Architecture II

2 Recap Programming for performance
Amdahl’s law Partitioning for performance Addressing decomposition and assignment Orchestration for performance Case studies Ocean Barnes-Hut Raytrace Computer Architecture II

3 Plan for today Programming for performance
Case studies Ocean Barnes-Hut Raytrace Scalable interconnection networks Basic concepts, definitions Topologies Switching Routing Performance Computer Architecture II

4 Case study 1: Simulating Ocean Currents
Figure: (a) cross sections; (b) spatial discretization of a cross section.
Ocean is modeled as several two-dimensional grids, static and regular.
Steps: set up the movement equations, solve them, update the grid values.
Multigrid method:
  Sweep the n x n grid first (finest level).
  Go coarser (n/2 x n/2, n/4 x n/4) or finer depending on the error in the current sweep.

5 Case Study 1: Ocean

6 Partitioning
Function parallelism: identify independent computations to reduce synchronization.
Data parallelism: static partitioning within a grid; similar issues to those for the kernel solver.
Block versus strip partitioning:
  Inherent communication: block is better.
  Artifactual communication: strip is better (spatial locality).
Load imbalance due to grid border elements: in the block case, internal blocks do not have border elements.
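The claim that block partitioning has less inherent communication than strip partitioning can be checked with a rough count. This is a minimal sketch, assuming an n x n grid, p processors with p a perfect square, a nearest-neighbor stencil, and counting only the elements an interior partition exchanges with its neighbors per sweep. The function names are illustrative and not from the lecture.

```python
import math

def strip_boundary_elements(n: int, p: int) -> int:
    """Elements an interior strip (n/p rows by n columns) exchanges per sweep."""
    # Two neighboring strips, one full row of n elements each (independent of p).
    return 2 * n

def block_boundary_elements(n: int, p: int) -> int:
    """Elements an interior block (n/sqrt(p) on a side) exchanges per sweep."""
    side = n // math.isqrt(p)  # assumes p is a perfect square
    # Four neighboring blocks, one edge of `side` elements each.
    return 4 * side

# Example: 1024 x 1024 grid on 16 processors.
print(strip_boundary_elements(1024, 16), block_boundary_elements(1024, 16))
```

The strip count stays at 2n as p grows, while the block count shrinks with sqrt(p), which is the sense in which block partitioning is better for inherent communication.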

7 Orchestration
Spatial locality: similar to the equation solver.
4D versus 2D arrays.
Block partitioning: poor spatial locality across rows, good across columns.
Except: lots of grids, so cache conflicts across grids.
Good spatial locality on nonlocal accesses at row-oriented boundaries.
Poor spatial locality at column-oriented boundaries.

8 Orchestration
Temporal locality:
  Complex working set hierarchy (six working sets, three important):
    a few points for near-neighbor reuse
    three sub-rows
    partition of one grid
Synchronization:
  Barriers between phases and solver sweeps.
  Locks for global variables.
  Lots of work between synchronization events.

9 Execution Time Breakdown
Figure: two panels, 4D arrays and 2D arrays.
1026 x 1026 grid with block partitioning on a 32-processor Origin2000 with 4 MB second-level cache.
4D grids are much better than 2D:
  Smaller access time (better locality).
  Less time waiting at barriers.

10 Case Study 2: Barnes-Hut
Simulate the interactions of many stars evolving over time.
Computing forces is expensive: $O(n^2)$ with the brute-force approach.
Barnes-Hut: hierarchical method taking advantage of the force law $G\, m_1 m_2 / r^2$.

11 Case Study 2: Barnes-Hut
Figure: space divided into cells, each leaf cell containing one body.
Sequential algorithm:
  For each body (n times), traverse the tree top-down and compute the total force acting on that body.
  If a cell is far enough away, compute the force from the cell as a whole; otherwise descend into its children.
Expected tree height: log n.
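The top-down traversal for one body can be sketched as follows. This is a minimal 2D sketch under stated assumptions: the tree is already built, each cell stores its total mass and center of mass, and "far enough" is decided by the common opening criterion size / distance < theta. The `Cell` class, the `theta` parameter and the unit gravitational constant are illustrative choices, not taken from the slides.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

G = 1.0  # gravitational constant in simulation units (assumption)

@dataclass
class Cell:
    mass: float                     # total mass of the bodies in the cell
    com: Tuple[float, float]        # center of mass of the cell
    size: float = 0.0               # side length; unused for leaves
    children: List["Cell"] = field(default_factory=list)

def force_on(pos, m, cell, theta=0.5):
    """Force (fx, fy) exerted by `cell` on a body of mass m at `pos`."""
    dx, dy = cell.com[0] - pos[0], cell.com[1] - pos[1]
    r = math.hypot(dx, dy)
    is_leaf = not cell.children
    if is_leaf and r == 0.0:
        return (0.0, 0.0)           # the body itself: no self-force
    if is_leaf or (r > 0.0 and cell.size / r < theta):
        # Far enough (or a single body): treat the cell as one point mass.
        f = G * m * cell.mass / (r * r)
        return (f * dx / r, f * dy / r)
    # Too close: open the cell and sum the contributions of its children.
    fx = fy = 0.0
    for child in cell.children:
        cx, cy = force_on(pos, m, child, theta)
        fx += cx
        fy += cy
    return (fx, fy)
```

Calling `force_on` once per body, starting at the root, gives the sequential algorithm; a smaller `theta` opens more cells and trades execution time for accuracy.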

12 Application Structure
Main data structures: array of bodies, of cells, and of pointers to them Each body/cell has several fields: mass, position, pointers to others Contiguous chunks of pointers to bodies and cells are assigned to processes Computer Architecture II

13 Partitioning
Decomposition: bodies in most phases, cells in computing moments.
Challenges for assignment:
  Non-uniform body distribution => non-uniform work and communication; cannot assign by inspection.
  Distribution changes dynamically across time-steps; cannot assign statically.
  Information needs fall off with distance from a body; partitions should be spatially contiguous for locality.
  Different phases have different work distributions across bodies; no single assignment is ideal for all.
  Communication: fine-grained and irregular.

14 Load Balancing
Particles are not equal: the number and mass of the bodies acting upon each one differ.
Solution: assign costs to particles based on the work.
  The work is unknown beforehand and changes with time-steps.
  But: the system evolves slowly.
  Solution: use the work per particle in the current phase as an estimate of the cost for the next time-step.

15 Load balancing: Orthogonal Recursive Bisection (ORB)
Recursively bisect space into subspaces with equal work.
  Work is associated with bodies, as computed in the previous phase.
Continue until there is one partition per processor.
Costly.
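The recursive bisection can be sketched in a few lines. This is a simplified 2D sketch, assuming each body is a tuple (x, y, cost), the number of partitions is a power of two, and the cut direction alternates between x and y at each level. The name `orb` and the splitting rule (cut at the first body where the running cost reaches half of the total) are illustrative assumptions.

```python
def orb(bodies, num_parts, axis=0):
    """Split `bodies` (tuples of x, y, cost) into num_parts groups of similar cost."""
    if num_parts == 1:
        return [bodies]
    ordered = sorted(bodies, key=lambda b: b[axis])
    total = sum(b[2] for b in ordered)
    running, split = 0.0, len(ordered)
    for i, b in enumerate(ordered):
        running += b[2]
        if running >= total / 2:
            split = i + 1           # cut right after the body that reaches half the work
            break
    left, right = ordered[:split], ordered[split:]
    next_axis = 1 - axis            # alternate between x and y cuts
    return (orb(left, num_parts // 2, next_axis)
            + orb(right, num_parts // 2, next_axis))

bodies = [(0, 0, 1), (1, 0, 1), (2, 0, 1), (3, 0, 1)]
print([sum(b[2] for b in part) for part in orb(bodies, 2)])
```

Each level sorts the bodies of its subspace, which hints at why the slide calls the method costly compared with reusing the existing tree.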

16 Another Approach: Cost-zones
Insight: the quad-tree already contains an encoding of spatial locality.
Cost-zones is low-overhead and very easy to program.
Store the cost in each node of the tree.
Compute the total work of the system (e.g. 1000 ops) and divide it by the number of processors (e.g. 1000 ops / 10 processors = 100 ops per processor).
Each processor traverses the tree and picks its range (0-100, ...).
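The cost-zones idea can be sketched as follows. This is a simplified, hedged illustration: a single traversal hands out bodies to processors, whereas in the scheme described above each processor would traverse the tree itself and use the stored subtree costs to skip ranges that are not its own. The `Node` class and function names are hypothetical, and the total cost is assumed to be greater than zero.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    cost: int = 0                   # leaf: work of its body; internal: set by total_cost
    body: Optional[str] = None      # a leaf holds one body
    children: List["Node"] = field(default_factory=list)

def total_cost(node: Node) -> int:
    """Store in each internal node the summed cost of its subtree."""
    if node.children:
        node.cost = sum(total_cost(c) for c in node.children)
    return node.cost

def costzones(root: Node, num_procs: int) -> List[List[str]]:
    """Assign bodies to processors by contiguous ranges of accumulated cost."""
    zone = total_cost(root) / num_procs   # work per processor (assumes total > 0)
    assignment: List[List[str]] = [[] for _ in range(num_procs)]
    seen = 0                              # cost accumulated so far in traversal order

    def visit(node: Node) -> None:
        nonlocal seen
        if not node.children:
            # The zone in which the body's cost range starts decides the owner.
            proc = min(int(seen // zone), num_procs - 1)
            assignment[proc].append(node.body)
            seen += node.cost
        else:
            for child in node.children:
                visit(child)

    visit(root)
    return assignment
```

Because the traversal order follows the tree, bodies that are close in space tend to land in the same zone, which is the locality the slide refers to.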

17 Orchestration and Mapping
Spatial locality: data distribution is much more difficult than in Ocean.
  Redistribution across time-steps.
  Logical granularity (body/cell) is much smaller than a page.
  Partitions that are contiguous in physical space are not necessarily contiguous in the array where the bodies are stored.
Temporal locality and working sets:
  First working set: body-to-body interaction.
  Second working set: computing forces on a body; good temporal locality because the system evolves slowly.
Synchronization:
  Barriers between phases.
  No synchronization within the force calculation: the data written is different from the data read.
  Locks in tree building; point-to-point event synchronization in the center-of-mass phase.
Mapping: ORB maps well to a hypercube, costzones to a linear array.

18 Execution Time Breakdown
512K bodies on a 32-processor Origin2000.
Static assignment of bodies versus costzones:
  Good load balance.
  Slow access for static assignment due to lack of locality.

19 Raytrace
Map a 3D scene onto a 2D display, pixel by pixel.
Rays shot through pixels in the image are called primary rays.
  They reflect and refract when they hit objects; color and opacity are computed.
  The recursive process generates a ray tree per primary ray.
A hierarchical spatial data structure keeps track of the primitives in the scene (similar to the tree of Barnes-Hut).
  Nodes are space cells; leaves have a linked list of jobs.
Tradeoffs between execution time and image quality.

20 Partitioning
Scene-oriented approach: partition scene cells, process rays while they are in an assigned cell.
Ray-oriented approach: partition primary rays (pixels), access scene data as needed.
  Simpler; used here.
Static assignment: bad load balance because ray bounces are unpredictable.
Dynamic assignment: use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing.
  A block is the unit of assignment; a tile is the unit of decomposition and stealing.
  Insert all tiles in a queue; steal one tile at a time.

21 Orchestration and Mapping
Spatial locality:
  Proper data distribution for the ray-oriented approach is very difficult: dynamically changing, unpredictable, fine-grained access.
  Distribute the memory pages round-robin to avoid contention.
  Poor spatial locality.
Temporal locality:
  Working sets are large and ill-defined due to unpredictability.
  Replication would do well (but capacity is limited).
Synchronization: one barrier at the end, locks on task queues.
Mapping: natural to a 2D mesh for the image, but likely not important.

22 Execution Time Breakdown
Balls arranged in bunch Task stealing clearly very important for load balance Computer Architecture II

23 Scalable Interconnection Networks
Computer Architecture II

24 Computer Architecture II
Outline Basic concepts, definitions Topologies Switching Routing Performance Computer Architecture II

25 Formalism
Graph $G = (V, E)$
  $V$: switches and nodes
  $E \subseteq V \times V$: communication channels (edges)
Route: $(v_0, \dots, v_k)$, a path of length $k$ between nodes $v_0$ and $v_k$, where $(v_i, v_{i+1}) \in E$
Routing distance
Diameter: the maximum, over all pairs of nodes, of the shortest route length between them
Average distance
Degree: number of input (output) channels of a node
Bisection width: minimal number of parallel connections that saturates the network; equivalently, the minimum number of channels that must be cut to split the network into two equal halves
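The definitions above can be made concrete on a small graph. The sketch below is an illustration under assumptions: the network is stored as an adjacency list, channels are bidirectional, and a ring (node i linked to i-1 and i+1 modulo n) serves as the example topology. All function names are illustrative.

```python
from collections import deque

def ring(n):
    """Adjacency list of a ring: node i is linked to i-1 and i+1 modulo n."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}

def distances_from(graph, src):
    """Shortest route length from src to every reachable node (breadth-first search)."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter(graph):
    """Longest shortest route between any two nodes."""
    return max(max(distances_from(graph, v).values()) for v in graph)

def max_degree(graph):
    """Largest number of distinct neighbors of any node."""
    return max(len(set(neighbors)) for neighbors in graph.values())

def average_distance(graph):
    """Mean shortest route length over all ordered pairs of distinct nodes."""
    n = len(graph)
    total = sum(sum(distances_from(graph, v).values()) for v in graph)
    return total / (n * (n - 1))

print(diameter(ring(8)), max_degree(ring(8)))
```

Bisection width is left out on purpose: computing it for an arbitrary graph is a hard combinatorial problem, so for regular topologies it is normally derived by hand.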

26 What characterizes a network?
Bandwidth (offered bandwidth): $b = w f$, with channel width $w$ (in bytes) and signaling rate $f = 1/t$ (in Hz)
Latency: the time a message takes to travel between two nodes
Throughput (delivered bandwidth): how much of the offered bandwidth is effectively used

27 What characterizes a network?
Topology: physical interconnection structure of the network graph
Routing algorithm: restricts the set of paths that messages may follow; many algorithms with different properties
Switching strategy: how data in a message traverses a route; circuit switching vs. packet switching
Flow control mechanism: when a message or portions of it traverse a route; what happens when traffic is encountered?

28 Goals
Latency as small as possible
High throughput: as many concurrent transfers as possible
  Bisection width gives the potential number of parallel connections
Cost as low as possible

29 Bus (e.g. Ethernet)
Figure: nodes 1 to 5 attached to a shared bus.
Degree = 1; diameter = 1; bisection width = 1
No routing necessary
CSMA/CD protocol; limited bus length
Simplest and cheapest dynamic network
Notes:
  Degree 1: each node has only one incoming/outgoing line.
  Diameter: there is a direct connection from every node to every other node, with no further node acting as an intermediate station.
  Connectivity 1: a single node can be cut off (e.g. taken off the Ethernet), which splits the network into two subnetworks (one of them with a single element). No fault tolerance. If the network is busy, there is no "detour".
  Bisection width: a single message from one half of the nodes to the other half is enough to saturate the network.
  CSMA/CD protocol

30 Complete graph
Figure: five fully connected nodes (1 to 5).
Degree = $n - 1$: too expensive for big networks
Diameter = 1
Bisection width = $\lfloor n/2 \rfloor \lceil n/2 \rceil$
Static network; connection between each pair of nodes
When the network is cut into two halves, each node has connections to n/2 nodes in the other half, and there are n/2 such nodes.
Notes:
  No switching/addressing needed: every node is directly connected to every other node. High degree.
  Diameter 1: because of the direct connection, a message reaches its destination in one step. No switching work is necessary.
  Connectivity: to cut a node off from the network, the n-1 connections to the n-1 other nodes must be severed.

31 Ring
Figure: nodes connected in a ring.
Degree = 2
Diameter = $\lfloor n/2 \rfloor$: slow for big networks
Bisection width = 2
Static network: node i is linked with nodes i+1 and i-1 modulo n.
Examples: FDDI, SCI, Fibre Channel Arbitrated Loop, KSR1

32 d-dimensional grid
Figure: 3 x 3 grid with nodes (1,1) to (3,3).
For d dimensions and n nodes:
  Degree = $2d$ (interior nodes)
  Diameter = $d(\sqrt[d]{n} - 1)$
  Bisection width = $(\sqrt[d]{n})^{d-1}$
Static network
Examples: Cray T3D and T3E
Notes:
  Degree: consider node (2,2). It has 2 neighbors in each dimension.
  Bisection width: a one-dimensional grid is a chain; cutting one edge splits the network into two equally sized halves. For a two-dimensional grid, imagine 16 nodes in a 4x4 grid: it can be split into two 2x4 parts by cutting 4 edges, and $\sqrt{16} = 4$.
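As a worked check of the diameter and bisection-width formulas, the sketch below evaluates them for a grid with n nodes in d dimensions. It assumes n is a perfect d-th power, so every dimension has the same side length; the function name and the returned dictionary are illustrative.

```python
def grid_metrics(n: int, d: int) -> dict:
    """Side length, diameter and bisection width of a d-dimensional grid with n nodes."""
    side = round(n ** (1.0 / d))
    if side ** d != n:
        raise ValueError("n must be a perfect d-th power")
    return {
        "side": side,
        "diameter": d * (side - 1),           # walk side-1 hops in each dimension
        "bisection_width": side ** (d - 1),   # edges crossing one cut plane
    }

print(grid_metrics(16, 2))
```

For the 4x4 example in the notes this gives a bisection width of 4, matching the four edges that are cut to obtain two 2x4 parts.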

33 Crossbar
Figure: crossbar with a switch at each crossing point.
Fast and expensive ($n^2$ switches)
Mostly: processor x memory
Degree = 1; diameter = 2; bisection width = n/2
Examples: 4x4, 8x8, 16x16
Dynamic network
Notes:
  Connectivity 1: a single node can be cut off. (The circle and the dashed circle in the figure are the same node.)
  Bisection width: half of the processors can communicate with the other half at the same time, so n/2 messages can be in flight in the network simultaneously. The bisection width is therefore optimal (n/2).

34 Hypercube (1)
Hamming distance = number of bits in which the binary representations of two numbers differ
Two nodes are connected if their Hamming distance is 1
Routing from x to y by decreasing the Hamming distance
Static network
Figure: hypercubes with nodes labeled 0000 to 0111.
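Routing by decreasing the Hamming distance can be sketched as below. The order in which differing bits are corrected (lowest bit first) is an assumption made for the example; any order reduces the distance by one per hop. Function names are illustrative.

```python
def hamming_distance(x: int, y: int) -> int:
    """Number of bit positions in which x and y differ."""
    return bin(x ^ y).count("1")

def hypercube_route(src: int, dst: int, k: int) -> list:
    """Nodes visited from src to dst in a k-dimensional hypercube."""
    path = [src]
    cur = src
    for bit in range(k):            # assumption: fix differing bits from lowest to highest
        mask = 1 << bit
        if (cur ^ dst) & mask:
            cur ^= mask             # move to the neighbor that differs only in this bit
            path.append(cur)
    return path

route = hypercube_route(0b0010, 0b0111, 4)
print([format(node, "04b") for node in route])   # ['0010', '0011', '0111']
```

The number of hops always equals the Hamming distance between source and destination, so no route is longer than k.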

35 Hypercube (2)
k dimensions, $n = 2^k$ nodes
Degree = k; diameter = k; bisection width = n/2
Two (k-1)-hypercubes are linked through n/2 edges to form a k-hypercube
Examples: Intel iPSC/860, SGI Origin 2000
Figure: hypercube with nodes labeled 0000 to 0111.

36 Omega-Network (1)
Building block: 2x2 shuffle
Perfect shuffle: target = cyclic left shift of the source address
Figure: perfect shuffle connecting inputs 000 to 111 to outputs 000 to 111.
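The perfect shuffle is a one-bit cyclic left shift of the address, which the following minimal sketch spells out for addresses of a given bit width. The function name and the `bits` parameter are illustrative.

```python
def perfect_shuffle(i: int, bits: int) -> int:
    """Cyclic left shift of a `bits`-wide address by one position."""
    msb = (i >> (bits - 1)) & 1                     # bit that wraps around
    return ((i << 1) & ((1 << bits) - 1)) | msb     # shift, drop overflow, reinsert msb

# Wiring of an 8-input shuffle stage: input i is connected to output perfect_shuffle(i, 3).
print([perfect_shuffle(i, 3) for i in range(8)])    # [0, 2, 4, 6, 1, 3, 5, 7]
```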

37 Omega-Network (2)
$\log_2 n$ levels of 2x2 shuffle building blocks
Dynamic network
Level i looks at bit i of the destination address:
  If 0, the message goes up.
  If 1, the message goes down.
See the example in the figure for 100 sending to 110 (nodes 000 to 111).
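The up/down rule can be traced in code. This is a sketch under assumptions: bits of the destination are consumed from the most significant to the least significant, each level consists of a perfect shuffle followed by a 2x2 block, and "up" and "down" correspond to the block's output 0 and output 1. The names are illustrative.

```python
def perfect_shuffle(i: int, bits: int) -> int:
    """Cyclic left shift of a `bits`-wide address by one position."""
    msb = (i >> (bits - 1)) & 1
    return ((i << 1) & ((1 << bits) - 1)) | msb

def omega_route(src: int, dst: int, bits: int) -> list:
    """Per level: the direction taken and the line the message is on afterwards."""
    pos = src
    hops = []
    for level in range(bits):
        pos = perfect_shuffle(pos, bits)            # wiring in front of this level
        bit = (dst >> (bits - 1 - level)) & 1       # destination bit examined here
        pos = (pos & ~1) | bit                      # 0 = upper output, 1 = lower output
        hops.append(("down" if bit else "up", pos))
    return hops

# The example from the slide: 100 sending to 110.
print(omega_route(0b100, 0b110, 3))   # [('down', 1), ('down', 3), ('up', 6)]
```

After the last level the message is on the line whose number equals the destination address, regardless of the source.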

38 Omega-Network (3)
n nodes, $(n/2) \log_2 n$ building blocks
Degree = 2 for nodes, 4 for building blocks
Diameter = $\log_2 n$
Bisection width = n/2: for a random permutation, n/2 messages are expected to cross the network in parallel
Extremes:
  If all nodes want to send to node 0, only one message gets through at a time.
  If each node sends a message to itself, n messages travel in parallel.

39 Fat Tree/Clos-Network (1)
Nodes = leaves of a tree
The tree has diameter $2 \log_2 n$ (from the farthest left, over the root, to the farthest right)
A simple tree has bisection width = 1: bottleneck
Fat tree: edges at level i have twice the capacity of edges at level i-1
  At level i, expensive switches with $2^i$ inputs and $2^i$ outputs
Known as Clos networks

40 Fat Tree/Clos-Network (2)
Routing: direct way over the lowest common parent. When alternatives exist, choose randomly.
Tolerance to node failure
Diameter = $2 \log_2 n$; bisection width = n
Example: CM-5
Notes:
  With 16 nodes, every node in the interior of the tree has 4 successors. To split the nodes into halves, at each root node one must cut half of the successor edges, namely those leading to the other half of the nodes. With 4 root nodes that makes 8 edges in total.

41 Switching
How a message traverses the network from one node to another
Circuit switching:
  One path from source to destination is established; all packets take that path.
  Like the telephone system.
Packet switching:
  A message is broken into a sequence of packets, which can be sent across different routes.
  Better utilization of network resources.

42 Packet Routing
There are two basic approaches to routing packets, based on what a switch does when the packet begins arriving:
  Store-and-forward
  Cut-through
    Virtual cut-through
    Wormhole

43 Packet routing: Store-and-Forward
A packet is completely stored at a switch before being forwarded.
  The packet is always on at most two nodes.
Problem: switches need lots of memory for storing the incoming packets.
Switching takes place step by step, so the danger of blocking is small.

44 Packet routing: Cut-through
A packet may enter a switch partially while its tail is still on other nodes.
  It may reside on more than 2 switches.
The decision to forward the packet may be taken right away.
What to do with the rest of the packet if the head blocks?
  Virtual cut-through: gather the tail where the head is. It degenerates into store-and-forward under high contention.
  Wormhole: if the head blocks, the whole "worm" blocks.

45 Store&Forward vs Cut-Through Routing
Store-and-forward: $h(n/b + D)$; cut-through: $n/b + hD$
  h: number of hops
  n: message size
  b: bandwidth
  D: routing delay per hop
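A quick numeric comparison shows how differently the two formulas scale with the number of hops. The sketch below only evaluates the two expressions; the parameter values in the example are made up for illustration and the units merely need to be consistent.

```python
def store_and_forward_latency(n: float, b: float, h: int, d: float) -> float:
    """h(n/b + D): the whole packet is received at every hop before moving on."""
    return h * (n / b + d)

def cut_through_latency(n: float, b: float, h: int, d: float) -> float:
    """n/b + hD: only the routing delay is paid per hop; the transfer time once."""
    return n / b + h * d

# Hypothetical values: 1000-byte message, 100 bytes per time unit, 5 hops, delay 2 per hop.
print(store_and_forward_latency(1000, 100, 5, 2))   # 60.0
print(cut_through_latency(1000, 100, 5, 2))         # 20.0
```

With a single hop the two expressions coincide; the gap grows with h because store-and-forward pays the transfer time n/b at every hop.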

46 Routing Algorithm
How do I know where a packet should go?
Topology does NOT determine routing.
Routing algorithms:
  Arithmetic
  Source-based
  Table lookup
  Adaptive: route based on network state (e.g., contention)

47 (1) Arithmetic Routing
For a regular topology, use simple arithmetic to determine the route.
E.g., 3D torus xy-routing:
  The packet header contains a signed offset to the destination (per dimension).
  At each hop, the switch increments or decrements to reduce the offset in one dimension.
  When x == 0 and y == 0, the packet is at the correct processor.
Drawbacks:
  Requires an ALU in the switch.
  Must recompute the CRC at each hop.
Figure: 2 x 2 x 2 cube with nodes (0,0,0) to (1,1,1).
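The offset-based scheme can be sketched as follows. This is a minimal illustration, assuming a torus with the same number of nodes in every dimension, offsets chosen to take the shorter way around each ring, and dimensions resolved one after another. The function names and the `size` parameter are hypothetical.

```python
def signed_offsets(src, dst, size):
    """Per-dimension signed offsets from src to dst, taking the shorter way around."""
    offsets = []
    for s, d in zip(src, dst):
        delta = (d - s) % size
        if delta > size // 2:
            delta -= size           # going the other way around the ring is shorter
        offsets.append(delta)
    return offsets

def arithmetic_route(src, dst, size):
    """Nodes visited when each hop reduces the offset in the current dimension by one."""
    pos = list(src)
    offsets = signed_offsets(src, dst, size)   # what the packet header would carry
    path = [tuple(pos)]
    for dim in range(len(pos)):
        while offsets[dim] != 0:
            step = 1 if offsets[dim] > 0 else -1
            pos[dim] = (pos[dim] + step) % size
            offsets[dim] -= step               # the switch updates the header
            path.append(tuple(pos))
    return path                                # all offsets zero: packet has arrived

print(arithmetic_route((0, 0, 0), (1, 1, 1), 2))
```

The header update in the inner loop is the reason the slide lists an ALU in the switch and a CRC recomputation per hop as drawbacks.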

48 (2) Source Based & (3) Table Lookup Routing
Source based: the source specifies the output port for each switch in the route.
  Very simple switches: no control state; strip the output port off the header.
  Myrinet uses this.
  Can't be made adaptive.
Table lookup:
  Very small header: contains a field that is an index into a table for the output port.
  Big tables, which must be kept up-to-date.

49 Deterministic vs. Adaptive Routing
Deterministic: follows a pre-specified route.
  k-ary d-cube: dimension-order routing from $(x_1, y_1)$ to $(x_2, y_2)$: first $\Delta x = x_2 - x_1$, then $\Delta y = y_2 - y_1$.
  Tree: common ancestor.
Adaptive: route determined by contention for the output port.
Figure: cube with nodes 000 to 111.

50 (4) Adaptive Routing
Essential for fault tolerance: at least multipath.
Can improve utilization of the network.
Simple deterministic algorithms easily run into bad permutations.

51 Contention
Two packets trying to use the same link at the same time.
  Limited buffering.
  Drop?
Most parallel machine networks block in place.
  Traffic may back up toward the source.
  Tree saturation: traffic backs up all the way along the paths toward the destination.
Discard packets and inform the source about that.

52 Communication Perf: Latency
$Time(n)_{s-d}$ = overhead + routing delay + channel occupancy + contention delay
Overhead: time necessary for initiating the sending and reception of a message
Channel occupancy = $(n + n_e) / b$
  n: data (payload) size
  $n_e$: packet envelope size
Routing delay
Contention delay
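The latency model is a plain sum of four terms, which the sketch below evaluates. Overhead, routing delay and contention delay are passed in as numbers because the slide gives no formula for them; the example values are invented and the units only need to be consistent.

```python
def channel_occupancy(n: float, n_e: float, b: float) -> float:
    """Time the channel is busy with payload n plus envelope n_e at bandwidth b."""
    return (n + n_e) / b

def message_latency(n, n_e, b, overhead, routing_delay, contention_delay=0.0):
    """overhead + routing delay + channel occupancy + contention delay."""
    return overhead + routing_delay + channel_occupancy(n, n_e, b) + contention_delay

# Hypothetical message: 96-byte payload, 4-byte envelope, 10 bytes per time unit.
print(message_latency(n=96, n_e=4, b=10, overhead=5, routing_delay=3))   # 18.0
```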

53 Bandwidth
What affects local bandwidth?
  Packet density: $b \times n / (n + n_e)$
  Routing delay: $b \times n / (n + n_e + wD)$
    D: number of cycles waiting for a routing decision
    w: width of the channel
  Contention: at the endpoints, within the network
Aggregate bandwidth
  Bisection bandwidth: sum of the bandwidths of the smallest set of links that partition the network
    Bad if communication is not uniformly distributed
  Total bandwidth of all the channels
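The two local-bandwidth expressions differ only in the extra term wD in the denominator, so one function covers both. This is a small sketch with invented example values; the function name is illustrative.

```python
def effective_bandwidth(b: float, n: float, n_e: float, w: float = 0, d: float = 0) -> float:
    """b * n / (n + n_e + w*D); with w*D = 0 this is the packet-density term alone."""
    return b * n / (n + n_e + w * d)

# Hypothetical link: raw bandwidth 100, 90-byte payload, 10-byte envelope.
print(effective_bandwidth(100, 90, 10))              # 90.0
# Same link with a 2-byte-wide channel waiting 5 cycles for the routing decision.
print(effective_bandwidth(100, 90, 10, w=2, d=5))
```

Larger payloads amortize both the envelope and the routing delay, which is why small packets see a much lower delivered bandwidth than the raw link rate.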

54 Interconnects
Name | Latency | Bandwidth | Topology | Comments
Gigabit | … us | 1 Gb/s | Star or Fat Tree | Cheap for small systems
Infiniband 4x | 3.5-7 us | 10-20 Gb/s | Fat Tree | Not as mature as Myrinet; smaller switches (128 port); cost ~$500/card+port
Myrinet | | 2-8 Gb/s | Clos | Mature, de facto standard; port switches; cost ~$500/card + port
NUMAlink4 | 1-2 us | 8-16 Gb/s | | SGI proprietary; special uproc for I/O; shmem
Quadrics | | 9 Gb/s | | Expensive; used in turn-key machines
SCI/Dolphin | | 4 Gb/s | 2D/3D Torus | Cabling nightmare!; costs more than Myrinet

55 Myrinet
Offered bandwidth 2+2 Gbit/s, full duplex
5-7 µs latency
Arbitrary topology; Fat Tree/Clos network preferable
Routing: wormhole, source routing
Cable (8+1 bit parallel) or fiber optics
Flow control on each link
Adaptor: programmable RISC processor, 333 MHz; PCI/PCI-X connection, up to 133 MHz, 64 bit, 8 Gb/s over the PCI-X bus (unidirectional); 2 MB

56 Myrinet Fat Tree (128 node)
16x16 crossbar
Only 8 lines are shown here. Each one is doubled because of duplex.

57 Myrinet PCI-Bus-Adaptor
Block diagram: cable connector, network interface, net DMA, 2 MB SRAM, host DMA, PCI bridge, LanAI CPU.
2 MB SRAM
PCI(-X) bridge, 64 bit, … MHz
LanAI RISC, 333 MHz
2 fiber-optic connectors, both duplex

58 Myrinet 16x16 crossbar
8 computers connected on the front side (2 channels)
On the back side, 8 outputs (2 channels) toward the next level of the Clos network
32x32, two

59 128-node Clos
Building block from earlier

60 Myrinet 256+256-Clos-Network
Routing network with bisection width 256
Front side: 256 computer connections
Back side: 256 connections to next-level routing units

61 Clos-Network with full bisection width: 64 nodes and 32 nodes

