Computer Architecture II
Introduction
Recap: Programming for performance
- Amdahl's law
- Partitioning for performance
- Addressing decomposition and assignment
- Orchestration for performance
- Case studies: Ocean, Barnes-Hut, Raytrace
Plan for today
- Programming for performance: case studies (Ocean, Barnes-Hut, Raytrace)
- Scalable interconnection networks: basic concepts and definitions, topologies, switching, routing, performance
Case study 1: Simulating Ocean Currents
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
- Ocean is modeled as several two-dimensional grids: static and regular
- Steps: set up the movement equations, solve them, update the grid values
- Multigrid method: sweep the n x n grid first (finest level), then go coarser (n/2 x n/2, n/4 x n/4) or finer depending on the error in the current sweep
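The sweep-and-coarsen loop above can be sketched as follows. This is a minimal illustration, not the actual Ocean solver: `jacobi_sweep` uses a plain four-point averaging stencil, and both function names are made up.

```python
def jacobi_sweep(grid):
    # One nearest-neighbor sweep over the interior of an n x n grid
    # (list of lists). Returns the updated grid and the largest change,
    # which serves as the error estimate for the multigrid decision.
    n = len(grid)
    new = [row[:] for row in grid]
    err = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])
            err = max(err, abs(new[i][j] - grid[i][j]))
    return new, err

def restrict(grid):
    # Go coarser: keep every other point, n x n -> n/2 x n/2.
    return [row[::2] for row in grid[::2]]
```

In the multigrid pattern, one sweeps the finest level until the error stalls, then restricts to the n/2 x n/2 level (or interpolates back to a finer one) and continues there.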
Case Study 1: Ocean
Partitioning
- Function parallelism: identify independent computations => reduces synchronization
- Data parallelism: static partitioning within a grid, with issues similar to those of the kernel solver
  - Block versus strip: block is better for inherent communication
  - Artifactual communication: strips are better (spatial locality)
  - Load imbalance due to grid border elements: in the block case, internal blocks have no border elements
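The block-versus-strip claim can be checked with a back-of-the-envelope count of the boundary elements each process exchanges for an n x n grid on p processes. This is an illustrative model only (it ignores edge processes, and the function names are made up):

```python
import math

def strip_comm(n, p):
    # An interior strip partition exchanges its two full boundary rows:
    # about 2n elements, independent of p.
    return 2 * n

def block_comm(n, p):
    # An interior block partition exchanges the four sides of an
    # (n/sqrt(p)) x (n/sqrt(p)) block: about 4n/sqrt(p) elements.
    return 4 * n / math.sqrt(p)
```

For n = 1024 and p = 64 a strip exchanges about 2048 elements but a block only about 512; the two models coincide at p = 4, and block wins for any larger p.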
Orchestration: spatial locality
- Similar to the equation solver: 4D versus 2D arrays
- Block partitioning: poor spatial locality across rows, good across columns
  - Except that there are lots of grids, so cache conflicts occur across grids
- Good spatial locality on nonlocal accesses at a row-oriented boundary
- Poor spatial locality on a column-oriented boundary
Orchestration: temporal locality
- Complex working set hierarchy (six working sets, three of them important): a few points for near-neighbor reuse, three sub-rows, a partition of one grid
- Synchronization
  - Barriers between phases and solver sweeps
  - Locks for global variables
  - Lots of work between synchronization events
Execution Time Breakdown
[Figure: execution time breakdown for 4D arrays vs. 2D arrays]
- 1026 x 1026 grid with block partitioning on a 32-processor Origin2000 (4 MB second-level cache)
- 4D grids do much better than 2D: smaller access time (better locality) and less time waiting at barriers
Case Study 2: Barnes-Hut
- Simulate the interactions of many stars evolving over time
- Computing the forces is expensive: O(n^2) for the brute-force approach
- Barnes-Hut: a hierarchical method taking advantage of the force law F = G m1 m2 / r^2
Case Study 2: Barnes-Hut
- Subdivide space recursively until each cell contains at most one body
- Sequential algorithm: for each body (n times), traverse the tree top-down and compute the total force acting on that body; if a cell is far enough away, compute the force against the cell as a whole
- Expected tree height: log n
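The top-down traversal can be sketched like this. It is a simplified 2D sketch, not the SPLASH-2 code: the opening criterion size/distance < theta, the `Node` layout, and G = 1 are all illustrative assumptions.

```python
import math

class Node:
    # A tree node: either a single body (leaf) or a cell summarizing
    # the bodies inside it by total mass and center of mass.
    def __init__(self, mass, x, y, size=0.0, children=()):
        self.mass, self.x, self.y = mass, x, y
        self.size = size          # side length of the space cell (0 for a body)
        self.children = children  # empty for a leaf

def force_on(body, node, theta=0.5, G=1.0):
    # Top-down traversal: if a cell is far enough away (size/distance
    # below theta), treat it as a single point mass; otherwise open it
    # and recurse into its children.
    dx, dy = node.x - body.x, node.y - body.y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)   # the body itself: no self-force
    if not node.children or node.size / dist < theta:
        f = G * body.mass * node.mass / dist ** 2   # F = G m1 m2 / r^2
        return (f * dx / dist, f * dy / dist)
    fx = fy = 0.0
    for child in node.children:
        cfx, cfy = force_on(body, child, theta, G)
        fx, fy = fx + cfx, fy + cfy
    return (fx, fy)
```

Far-away cells are summarized by one point-mass interaction, which is what turns the O(n^2) sum into roughly O(n log n) work.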
Application Structure
- Main data structures: arrays of bodies, of cells, and of pointers to them
- Each body/cell has several fields: mass, position, pointers to others
- Contiguous chunks of pointers to bodies and cells are assigned to processes
Partitioning
- Decomposition: over bodies in most phases, over cells when computing moments
- Challenges for assignment:
  - Non-uniform body distribution => non-uniform work and communication; cannot assign by inspection
  - The distribution changes dynamically across time-steps; cannot assign statically
  - Information needs fall off with distance from a body; partitions should be spatially contiguous for locality
  - Different phases have different work distributions across bodies; no single assignment is ideal for all
- Communication: fine-grained and irregular
Load Balancing
- Particles are not equal: the number and mass of the bodies acting on each one differ
- Solution: assign costs to particles based on their work
- The work is unknown beforehand and changes across time-steps
- But the system evolves slowly, so use the work per particle in the current time-step as an estimate of its cost for the next time-step
Load balancing: Orthogonal Recursive Bisection (ORB)
- Recursively bisect space into subspaces with equal work; the work is associated with bodies, as computed in the previous time-step
- Continue until there is one partition per processor
- Drawback: costly
Another Approach: Cost-zones
- Insight: the quad-tree already contains an encoding of spatial locality
- Cost-zones is low-overhead and very easy to program
- Store the cost in each node of the tree
- Compute the total work of the system (e.g. 1000 ops) and divide it by the number of processors (e.g. 1000 ops / 10 processors = 100 ops per processor)
- Each processor traverses the tree and picks its range (0-100, …)
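A minimal sketch of the cost-zones pick: each processor traverses the cost-annotated tree in order and keeps exactly the bodies whose running cost falls inside its zone. The tuple-based tree encoding is a hypothetical stand-in for the real quad-tree with per-node cost fields.

```python
def cost_zones(node, lo, hi, acc=0, picked=None):
    # In-order traversal of the cost-annotated tree: a processor whose
    # zone is [lo, hi) picks the bodies whose running cost falls inside
    # it. Nodes are ('body', cost, name) or ('cell', cost, children),
    # where a cell's cost is the sum of its children's costs.
    if picked is None:
        picked = []
    kind, cost, payload = node
    if acc + cost <= lo or acc >= hi:
        return acc + cost, picked     # whole subtree outside the zone: skip it
    if kind == 'body':
        picked.append(payload)
        return acc + cost, picked
    for child in payload:
        acc, picked = cost_zones(child, lo, hi, acc, picked)
    return acc, picked
```

Because whole subtrees outside a processor's zone are skipped using only the cost stored at their root, each processor touches little more of the tree than the part it ends up owning.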
Orchestration and Mapping
- Spatial locality: data distribution is much more difficult than in Ocean
  - Redistribution is needed across time-steps
  - The logical granularity (body/cell) is much smaller than a page
  - Partitions contiguous in physical space are not contiguous in the arrays where the bodies are stored
- Temporal locality and working sets
  - First working set: body-to-body interactions
  - Second working set: computing the forces on a body; good temporal locality because the system evolves slowly
- Synchronization
  - Barriers between phases
  - No synchronization within the force calculation: the data written differ from the data read
  - Locks in tree-building; point-to-point event synchronization in the center-of-mass phase
- Mapping: ORB maps well to a hypercube, cost-zones to a linear array
Execution Time Breakdown
- 512K bodies on a 32-processor Origin2000
- Static assignment of bodies versus cost-zones
- Good load balance
- Slow accesses for static assignment due to lack of locality
Raytrace
- Map a 3D scene onto a 2D display, pixel by pixel
- Rays shot through the pixels of the image are called primary rays
- Rays reflect and refract when they hit objects; color and opacity are computed along the way
- This recursive process generates a ray tree per primary ray
- A hierarchical spatial data structure keeps track of the primitives in the scene (similar to the Barnes-Hut tree): nodes are space cells, leaves hold a linked list of the primitives they contain
- Tradeoffs between execution time and image quality
Partitioning
- Scene-oriented approach: partition the scene cells; a process computes on rays while they are in its assigned cells
- Ray-oriented approach: partition the primary rays (pixels), accessing scene data as needed; simpler, and used here
- Static assignment: bad load balance due to the unpredictability of ray bounces
- Dynamic assignment: use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing
  - A block is the unit of assignment; a tile is the unit of decomposition and stealing
  - Insert all tiles in a queue; steal one tile at a time
Orchestration and Mapping
- Spatial locality
  - Proper data distribution for the ray-oriented approach is very difficult: accesses are dynamically changing, unpredictable, and fine-grained
  - Distribute the memory pages round-robin to avoid contention
  - Spatial locality remains poor
- Temporal locality
  - Working sets are large and ill-defined due to the unpredictability
  - Replication would do well (but capacity is limited)
- Synchronization: one barrier at the end, locks on the task queues
- Mapping: natural to a 2D mesh for the image, but likely not important
Execution Time Breakdown
- Scene: balls arranged in a bunch
- Task stealing is clearly very important for load balance
Scalable Interconnection Networks
Outline
- Basic concepts, definitions
- Topologies
- Switching
- Routing
- Performance
Formalism
- Graph G = (V, E): V is the set of switches and nodes, E ⊆ V x V the set of communication channels (edges)
- Route: a path (v0, ..., vk) of length k between nodes v0 and vk, where (vi, vi+1) ∈ E
- Routing distance: the length of the route between two nodes
- Diameter: the maximal route length between two nodes
- Average distance
- Degree: the number of input (output) channels of a node
- Bisection width: the minimal number of links that must be cut to separate the network into two equal halves
What characterizes a network?
- Bandwidth (offered bandwidth): b = w x f, where w is the width (in bytes) and f = 1/τ is the signaling rate (in Hz)
- Latency: the time a message travels between two nodes
- Throughput (delivered bandwidth): how much of the offered bandwidth is effectively used
What characterizes a network?
- Topology: the physical interconnection structure of the network graph
- Routing algorithm: restricts the set of paths that messages may follow; many algorithms with different properties
- Switching strategy: how the data in a message traverses its route; circuit switching vs. packet switching
- Flow control mechanism: determines when a message, or portions of it, traverses its route; what happens when traffic is encountered?
Goals
- Latency as small as possible
- Throughput as high as possible
  - As many concurrent transfers as possible; the bisection width gives the potential number of parallel connections
- Cost as low as possible
Bus (e.g. Ethernet)
[Figure: five nodes attached to a bus]
- Degree = 1: each node has only one incoming/outgoing link
- Diameter = 1: there is a direct connection from every node to every other node, with no intermediate node in between; no routing necessary
- Bisection width = 1: a single message from one half of the nodes to the other is enough to saturate the network
- Connectivity = 1: detaching a single node (e.g. unplugging it from the Ethernet) splits the network into two parts (one of them with a single element); no fault tolerance, and no detour when the network is busy
- CSMA/CD protocol; limited bus length
- Simplest and cheapest dynamic network
Complete graph
[Figure: complete graph on five nodes]
- Static network: a direct connection between each pair of nodes; no routing/addressing necessary
- Degree = n - 1: a high degree, too expensive for big networks; to detach one node, the n - 1 connections to the other nodes must all be cut (connectivity = n - 1)
- Diameter = 1: thanks to the direct connections, a message reaches its target in one step
- Bisection width = ⌊n/2⌋ x ⌈n/2⌉: when the network is cut into two halves, each node keeps a connection to every node of the other half
Ring
[Figure: ring of five nodes]
- Static network: node i is linked with nodes (i+1) mod n and (i-1) mod n
- Degree = 2
- Diameter = n/2: slow for big networks
- Bisection width = 2
- Examples: FDDI, SCI, FibreChannel Arbitrated Loop, KSR1
d-dimensional grid
[Figure: 3x3 two-dimensional grid, nodes (1,1) through (3,3)]
- Static network; examples: Cray T3D and T3E
- For d dimensions and n nodes (side length n^(1/d)):
  - Degree = 2d for interior nodes (e.g. node (2,2) has 2 neighbors in each dimension)
  - Diameter = d (n^(1/d) - 1)
  - Bisection width = (n^(1/d))^(d-1)
- Examples: a one-dimensional grid is a chain, which splits into two equal halves by cutting a single edge; a 4x4 grid of 16 nodes splits into two 2x4 halves by cutting 4 edges (sqrt(16) = 4)
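The grid formulas above can be sketched in code (assuming n is an exact d-th power; the degree is stated for interior nodes):

```python
def grid_metrics(n, d):
    # Degree, diameter, and bisection width of a d-dimensional grid of
    # n nodes with side length k = n**(1/d), assumed an exact integer.
    k = round(n ** (1.0 / d))
    assert k ** d == n, "n must be a perfect d-th power"
    return {
        "degree": 2 * d,          # interior nodes: 2 neighbors per dimension
        "diameter": d * (k - 1),  # walk k-1 steps in each of the d dimensions
        "bisection": k ** (d - 1) # one (d-1)-dimensional cut plane of links
    }
```

For the 4x4 example from the slide this gives diameter 6 and bisection width 4, matching the cut of 4 edges described above.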
Crossbar
[Figure: 3x3 crossbar of switches]
- Dynamic network: fast and expensive (n^2 switches)
- Mostly used as a processor x memory interconnect; typical sizes: 4x4, 8x8, 16x16
- Degree = 1, diameter = 2
- Connectivity = 1: detaching a single node cuts it off (the circle and the dashed circle in the figure are the same node)
- Bisection width = n/2: half of the processors can communicate with the other half simultaneously, so n/2 messages can be under way at the same time; the bisection width is therefore optimal
Hypercube (1)
- Hamming distance: the number of bits in which the binary representations of two numbers differ
- Two nodes are connected if their Hamming distance is 1
- Routing from x to y proceeds by repeatedly decreasing the Hamming distance
- Static network
[Figure: 2-, 3-, and 4-dimensional hypercubes with binary node labels]
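Hamming-distance routing can be sketched as: flip one differing address bit per hop, so each hop reduces the distance by one. The low-to-high bit order is an arbitrary choice here; real routers may fix dimensions in a different order.

```python
def hamming_route(src, dst, k):
    # Route in a k-dimensional hypercube: at each hop, flip one bit in
    # which the current node and the destination differ. The path length
    # equals the Hamming distance between src and dst.
    path = [src]
    node = src
    for bit in range(k):
        if (node ^ dst) & (1 << bit):
            node ^= 1 << bit
            path.append(node)
    return path
```

For example, routing 0000 -> 0101 flips bits 0 and 2, a two-hop path.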
Hypercube (2)
- k dimensions, n = 2^k nodes
- Degree = k, diameter = k, bisection width = n/2
- Two (k-1)-hypercubes are linked through n/2 edges to form a k-hypercube
- Examples: Intel iPSC/860, SGI Origin 2000
Omega-Network (1)
- Building block: a 2x2 shuffle switch
- Perfect shuffle: the target address is the cyclic left shift of the source address
[Figure: perfect shuffle wiring of addresses 000 through 111]
Omega-Network (2)
- log2(n) levels of the 2x2 shuffle building block; a dynamic network
- Level i looks at bit i of the destination address: if 0, go up; if 1, go down
- Example: node 100 sending to node 110
[Figure: 8x8 Omega network, addresses 000 through 111]
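The per-level rule (shuffle, then go up on 0 / down on 1) can be simulated directly. A sketch, tracking the port label as an integer; encoding "up" as low bit 0 follows the slide's convention:

```python
def omega_route(src, dst, k):
    # Destination-tag routing through a k-stage Omega network on 2**k
    # ports. Each stage first applies the perfect shuffle (cyclic left
    # shift of the address), then the 2x2 switch replaces the low bit
    # with the current destination bit (0 = upper output, 1 = lower).
    n = 1 << k
    node = src
    trace = [node]
    for stage in range(k):
        node = ((node << 1) | (node >> (k - 1))) & (n - 1)  # perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1                  # stage i examines bit i
        node = (node & ~1) | bit                            # switch goes up/down
        trace.append(node)
    return trace
```

After log2(n) stages every source bit has been shifted out and overwritten by a destination bit, so the message arrives at dst regardless of src, reproducing the slide's 100 -> 110 example.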
Omega-Network (3)
- n nodes, (n/2) log2(n) building blocks
- Degree = 2 for nodes, 4 for building blocks
- Diameter = log2(n)
- Bisection width = n/2: for a random permutation, n/2 messages are expected to cross the network in parallel
- Extremes: if all nodes want to send to node 0, only one message is in flight at a time; if each node sends a message to itself, n messages proceed in parallel
Fat Tree / Clos-Network (1)
- Nodes are the leaves of a tree
- A tree has diameter 2 log2(n): from the farthest left leaf over the root to the farthest right leaf
- A simple tree has bisection width 1: a bottleneck
- Fat tree: edges at level i have double the capacity of edges at level i - 1
- At level i, expensive switches with 2^i inputs and 2^i outputs
- Also known as Clos networks
Fat Tree/Clos-Network (2)
- Routing: take the direct way over the lowest common ancestor; when alternatives exist, choose randomly
- Tolerant to node failures
- Diameter = 2 log2(n); bisection width = n/2
- Example with 16 nodes: each inner node of the tree has 4 children; to split the nodes into two halves, half of the child edges of every root node (those leading to the other half) must be cut; with 4 root nodes this makes 8 edges in total
- Example machine: CM-5
Switching
- How a message traverses the network from one node to another
- Circuit switching: one path from source to destination is established and all packets take that path; like the telephone system
- Packet switching: a message is broken into a sequence of packets, which can be sent across different routes; better utilization of the network resources
Packet Routing
There are two basic approaches to routing packets, based on what a switch does when a packet begins arriving:
- Store-and-forward
- Cut-through (virtual cut-through and wormhole)
Packet routing: Store-and-Forward
- A packet is completely stored at a switch before being forwarded
- The packet is on at most two nodes at any time
- Problem: switches need lots of memory to store the incoming packets
- Switching takes place step by step, so the danger of blocking is small
Packet routing: Cut-through
- A packet may have entered the switch only partially, leaving its tail on other nodes; it may reside on more than two switches
- The decision to forward the packet can be taken right away
- What to do with the rest of the packet if the head blocks?
  - Virtual cut-through: gather the tail where the head is; degenerates into store-and-forward under high contention
  - Wormhole: if the head blocks, the whole "worm" blocks
Store&Forward vs Cut-Through Routing
h (n/b + D) vs. n/b + h D, where:
- h: number of hops
- n: message size
- b: bandwidth
- D: routing delay per hop
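The two formulas can be compared with a quick numeric check (the parameter values in the usage note below are made up for illustration):

```python
def store_and_forward(h, n, b, D):
    # Each of the h hops receives the whole n-byte packet (n/b time
    # units) before forwarding it, plus a routing delay D per hop.
    return h * (n / b + D)

def cut_through(h, n, b, D):
    # The head moves on right after each routing decision, so the
    # serialization time n/b is paid only once.
    return n / b + h * D
```

With h = 4 hops, n = 1000 bytes, b = 1 byte/cycle, and D = 10 cycles, store-and-forward takes 4040 cycles versus 1040 for cut-through; for a single hop the two are identical.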
Routing Algorithm
How do I know where a packet should go?
- Topology does NOT determine routing
- Routing algorithms:
  - Arithmetic
  - Source-based
  - Table lookup
  - Adaptive: route based on network state (e.g., contention)
(1) Arithmetic Routing
- For a regular topology, use simple arithmetic to determine the route, e.g. xy-routing on a 3D torus
- The packet header contains a signed offset to the destination (one per dimension)
- At each hop, the switch increments or decrements the header to reduce the offset in one dimension
- When x == 0 and y == 0, the packet is at the correct processor
- Drawbacks: requires an ALU in the switch, and the CRC must be re-computed at each hop
[Figure: 2x2x2 cube with node coordinates (0,0,0) through (1,1,1)]
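A sketch of the header-offset mechanism. Wraparound torus links and the CRC issue are ignored, and the names are illustrative:

```python
def offset_route(src, dst, dims):
    # Dimension-order arithmetic routing: the header carries a signed
    # offset per dimension; each hop increments/decrements one
    # coordinate until every offset has been driven to 0.
    offset = [d - s for s, d in zip(src, dst)]
    pos, hops = list(src), [tuple(src)]
    for dim in range(dims):
        step = 1 if offset[dim] > 0 else -1
        while offset[dim] != 0:
            pos[dim] += step          # the switch applies +/- in this dimension
            offset[dim] -= step       # header offset shrinks toward 0
            hops.append(tuple(pos))
    return hops
```

The packet finishes dimension x completely before starting y, which is what makes the route deterministic.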
(2) Source Based & (3) Table Lookup Routing
Source-based
- The source specifies the output port for each switch on the route
- Very simple switches: no control state, just strip the output port off the header
- Used by Myrinet
- Cannot be made adaptive
Table lookup
- Very small header: it contains a field that is an index into a table of output ports
- Big tables, which must be kept up to date
Deterministic vs. Adaptive Routing
- Deterministic: follows a pre-specified route
  - k-ary d-cube: dimension-order routing from (x1, y1) to (x2, y2): first Δx = x2 - x1, then Δy = y2 - y1
  - Tree: route through the lowest common ancestor
- Adaptive: the route is determined by contention for output ports
[Figure: 3-cube with binary node labels]
(4) Adaptive Routing
- Essential for fault tolerance: at least multipath routing is needed
- Can improve the utilization of the network: simple deterministic algorithms easily run into bad permutations
Contention
- Two packets trying to use the same link at the same time
- Buffering is limited: drop packets?
- Most parallel-machine networks block in place
  - Traffic may back up toward the source
  - Tree saturation: backing up all the way to the sources
- Alternative: discard packets and inform the source
Communication Perf: Latency
Time(n) from source to destination = overhead + routing delay + channel occupancy + contention delay
- Overhead: the time necessary for initiating the sending and the reception of a message
- Channel occupancy = (n + ne) / b, where n is the data (payload) size, ne the packet envelope size, and b the bandwidth
- Routing delay
- Contention delay
Bandwidth
- What affects local bandwidth?
  - Packet density: b x n / (n + ne)
  - Routing delay: b x n / (n + ne + wD), where D is the number of cycles spent waiting for a routing decision and w is the width of the channel
  - Contention, at the endpoints and within the network
- Aggregate bandwidth
  - Bisection bandwidth: the sum of the bandwidths of the smallest set of links that partition the network into two halves; a bad estimate if the distribution of communication is not uniform
  - Total bandwidth of all the channels
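The two "local bandwidth" expressions above, as code. The numbers in the usage note are made up, and `delta` stands in for the slide's D:

```python
def packet_density(b, n, ne):
    # Fraction of the raw bandwidth b that carries payload:
    # b * n / (n + ne), where ne is the packet envelope (header) size.
    return b * n / (n + ne)

def with_routing_delay(b, n, ne, w, delta):
    # Each routing decision idles the channel for delta cycles, i.e.
    # for w * delta bytes of a w-byte-wide channel:
    # b * n / (n + ne + w * delta).
    return b * n / (n + ne + w * delta)
```

For instance, with a 64-byte envelope on a 960-byte payload, only 93.75% of the raw bandwidth carries data, and an 8-cycle routing decision on an 8-byte channel lowers that further.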
Interconnects
- Gigabit Ethernet: latency in the µs range, 1 Gb/s, star or fat-tree topology; cheap for small systems
- Infiniband 4x: 3.5-7 µs, 10-20 Gb/s, fat tree; not as mature as Myrinet, smaller switches (128 ports), cost ~$500/card + port
- Myrinet: 2-8 Gb/s, Clos; mature, de facto standard; cost ~$500/card + port
- NUMAlink4: 1-2 µs, 8-16 Gb/s; SGI proprietary, special microprocessor for I/O, shmem
- Quadrics: 9 Gb/s; expensive, used in turn-key machines
- SCI/Dolphin: 4 Gb/s, 2D/3D torus; cabling nightmare, costs more than Myrinet
Myrinet
- Offered bandwidth: 2+2 Gbit/s, full duplex
- Latency: 5-7 µs
- Arbitrary topology; a Fat Tree/Clos network is preferable
- Routing: wormhole, source routing
- Cable (8+1 bits in parallel) or fiber optics
- Flow control on each link
- Programmable adaptor: 333 MHz RISC processor, 2 MB memory; PCI/PCI-X connection up to 133 MHz, 64 bit, 8 Gb/s unidirectional over the PCI-X bus
Myrinet Fat Tree (128 node)
- Built from 16x16 crossbars
- Only 8 lines are shown here; each is laid out twice, for duplex operation
Myrinet PCI-Bus-Adaptor
[Figure: adaptor block diagram — cable connector, network interface, net DMA, host DMA, PCI bridge, LanAI CPU, 2 MB SRAM]
- PCI(-X) bridge, 64 bit
- LanAI RISC processor, 333 MHz, 2 MB SRAM
- 2 fiber-optic connectors, both duplex
Myrinet 16x16 crossbar
- 8 computers connected on the front side (2 channels each)
- On the back side, 8 outputs (2 channels each) toward the next level of the Clos network
128-node Clos network, built from the crossbar building block shown earlier
Myrinet 256+256-Clos-Network
- Routing network with bisection width 256
- Front side: connections for 256 computers
- Back side: 256 connections to the next level of routing units
Clos-Network with full bisection width: 64 nodes and 32 nodes
Computer Architecture II