Computer Architecture II
Introduction
Recap: Programming for performance
- Amdahl's law
- Partitioning for performance
- Addressing decomposition and assignment
- Orchestration for performance
- Case studies: Ocean, Barnes-Hut, Raytrace
Plan for today
- Programming for performance: case studies (Ocean, Barnes-Hut, Raytrace)
- Scalable interconnection networks: basic concepts and definitions, topologies, switching, routing, performance
Case study 1: Simulating Ocean Currents
[Figure: (a) cross sections; (b) spatial discretization of a cross section]
- Ocean is modeled as several two-dimensional grids: static and regular
- Steps: set up the movement equations, solve them, update the grid values
- Multigrid method: sweep the n x n grid first (finest level), then go coarser (n/2 x n/2, n/4 x n/4) or finer depending on the error in the current sweep
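The sweep-and-coarsen loop above can be sketched as follows. This is a minimal illustration, not the actual Ocean solver: `jacobi_sweep` uses a plain four-point averaging stencil, and both function names are made up.

```python
def jacobi_sweep(grid):
    # One nearest-neighbor sweep over the interior of an n x n grid
    # (list of lists). Returns the updated grid and the largest change,
    # which serves as the error estimate for the multigrid decision.
    n = len(grid)
    new = [row[:] for row in grid]
    err = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                grid[i][j-1] + grid[i][j+1])
            err = max(err, abs(new[i][j] - grid[i][j]))
    return new, err

def restrict(grid):
    # Go coarser: keep every other point, n x n -> n/2 x n/2.
    return [row[::2] for row in grid[::2]]
```

In the multigrid pattern, one sweeps the finest level until the error stalls, then restricts to the n/2 x n/2 level (or interpolates back to a finer one) and continues there.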
Case Study 1: Ocean
Partitioning
- Function parallelism: identify independent computations => reduces synchronization
- Data parallelism: static partitioning within a grid, with issues similar to those of the kernel solver
  - Block versus strip: block is better for inherent communication
  - Artifactual communication: strips are better (spatial locality)
  - Load imbalance due to grid border elements: in the block case, internal blocks have no border elements
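The block-versus-strip claim can be checked with a back-of-the-envelope count of the boundary elements each process exchanges for an n x n grid on p processes. This is an illustrative model only (it ignores edge processes, and the function names are made up):

```python
import math

def strip_comm(n, p):
    # An interior strip partition exchanges its two full boundary rows:
    # about 2n elements, independent of p.
    return 2 * n

def block_comm(n, p):
    # An interior block partition exchanges the four sides of an
    # (n/sqrt(p)) x (n/sqrt(p)) block: about 4n/sqrt(p) elements.
    return 4 * n / math.sqrt(p)
```

For n = 1024 and p = 64 a strip exchanges about 2048 elements but a block only about 512; the two models coincide at p = 4, and block wins for any larger p.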
Orchestration: spatial locality
- Similar to the equation solver: 4D versus 2D arrays
- Block partitioning: poor spatial locality across rows, good across columns
  - Except that there are lots of grids, so cache conflicts occur across grids
- Good spatial locality on nonlocal accesses at a row-oriented boundary
- Poor spatial locality on a column-oriented boundary
Orchestration: temporal locality
- Complex working set hierarchy (six working sets, three of them important): a few points for near-neighbor reuse, three sub-rows, a partition of one grid
- Synchronization
  - Barriers between phases and solver sweeps
  - Locks for global variables
  - Lots of work between synchronization events
Execution Time Breakdown
[Figure: execution time breakdown for 4D arrays vs. 2D arrays]
- 1026 x 1026 grid with block partitioning on a 32-processor Origin2000 (4 MB second-level cache)
- 4D grids do much better than 2D: smaller access time (better locality) and less time waiting at barriers
Case Study 2: Barnes-Hut
- Simulate the interactions of many stars evolving over time
- Computing the forces is expensive: O(n^2) for the brute-force approach
- Barnes-Hut: a hierarchical method taking advantage of the force law F = G m1 m2 / r^2
Case Study 2: Barnes-Hut
- Subdivide space recursively until each cell contains at most one body
- Sequential algorithm: for each body (n times), traverse the tree top-down and compute the total force acting on that body; if a cell is far enough away, compute the force against the cell as a whole
- Expected tree height: log n
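The top-down traversal can be sketched like this. It is a simplified 2D sketch, not the SPLASH-2 code: the opening criterion size/distance < theta, the `Node` layout, and G = 1 are all illustrative assumptions.

```python
import math

class Node:
    # A tree node: either a single body (leaf) or a cell summarizing
    # the bodies inside it by total mass and center of mass.
    def __init__(self, mass, x, y, size=0.0, children=()):
        self.mass, self.x, self.y = mass, x, y
        self.size = size          # side length of the space cell (0 for a body)
        self.children = children  # empty for a leaf

def force_on(body, node, theta=0.5, G=1.0):
    # Top-down traversal: if a cell is far enough away (size/distance
    # below theta), treat it as a single point mass; otherwise open it
    # and recurse into its children.
    dx, dy = node.x - body.x, node.y - body.y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return (0.0, 0.0)   # the body itself: no self-force
    if not node.children or node.size / dist < theta:
        f = G * body.mass * node.mass / dist ** 2   # F = G m1 m2 / r^2
        return (f * dx / dist, f * dy / dist)
    fx = fy = 0.0
    for child in node.children:
        cfx, cfy = force_on(body, child, theta, G)
        fx, fy = fx + cfx, fy + cfy
    return (fx, fy)
```

Far-away cells are summarized by one point-mass interaction, which is what turns the O(n^2) sum into roughly O(n log n) work.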
Application Structure
- Main data structures: arrays of bodies, of cells, and of pointers to them
- Each body/cell has several fields: mass, position, pointers to others
- Contiguous chunks of pointers to bodies and cells are assigned to processes
Partitioning
- Decomposition: over bodies in most phases, over cells when computing moments
- Challenges for assignment:
  - Non-uniform body distribution => non-uniform work and communication; cannot assign by inspection
  - The distribution changes dynamically across time-steps; cannot assign statically
  - Information needs fall off with distance from a body; partitions should be spatially contiguous for locality
  - Different phases have different work distributions across bodies; no single assignment is ideal for all
- Communication: fine-grained and irregular
Load Balancing
- Particles are not equal: the number and mass of the bodies acting on each one differ
- Solution: assign costs to particles based on their work
- The work is unknown beforehand and changes across time-steps
- But the system evolves slowly, so use the work per particle in the current time-step as an estimate of its cost for the next time-step
Load balancing: Orthogonal Recursive Bisection (ORB)
- Recursively bisect space into subspaces with equal work; the work is associated with bodies, as computed in the previous time-step
- Continue until there is one partition per processor
- Drawback: costly
Another Approach: Cost-zones
- Insight: the quad-tree already contains an encoding of spatial locality
- Cost-zones is low-overhead and very easy to program
- Store the cost in each node of the tree
- Compute the total work of the system (e.g. 1000 ops) and divide it by the number of processors (e.g. 1000 ops / 10 processors = 100 ops per processor)
- Each processor traverses the tree and picks its range (0-100, …)
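A minimal sketch of the cost-zones pick: each processor traverses the cost-annotated tree in order and keeps exactly the bodies whose running cost falls inside its zone. The tuple-based tree encoding is a hypothetical stand-in for the real quad-tree with per-node cost fields.

```python
def cost_zones(node, lo, hi, acc=0, picked=None):
    # In-order traversal of the cost-annotated tree: a processor whose
    # zone is [lo, hi) picks the bodies whose running cost falls inside
    # it. Nodes are ('body', cost, name) or ('cell', cost, children),
    # where a cell's cost is the sum of its children's costs.
    if picked is None:
        picked = []
    kind, cost, payload = node
    if acc + cost <= lo or acc >= hi:
        return acc + cost, picked     # whole subtree outside the zone: skip it
    if kind == 'body':
        picked.append(payload)
        return acc + cost, picked
    for child in payload:
        acc, picked = cost_zones(child, lo, hi, acc, picked)
    return acc, picked
```

Because whole subtrees outside a processor's zone are skipped using only the cost stored at their root, each processor touches little more of the tree than the part it ends up owning.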
Orchestration and Mapping
- Spatial locality: data distribution is much more difficult than in Ocean
  - Redistribution is needed across time-steps
  - The logical granularity (body/cell) is much smaller than a page
  - Partitions contiguous in physical space are not contiguous in the arrays where the bodies are stored
- Temporal locality and working sets
  - First working set: body-to-body interactions
  - Second working set: computing the forces on a body; good temporal locality because the system evolves slowly
- Synchronization
  - Barriers between phases
  - No synchronization within the force calculation: the data written differ from the data read
  - Locks in tree-building; point-to-point event synchronization in the center-of-mass phase
- Mapping: ORB maps well to a hypercube, cost-zones to a linear array
Execution Time Breakdown
- 512K bodies on a 32-processor Origin2000
- Static assignment of bodies versus cost-zones
- Good load balance
- Slow accesses for static assignment due to lack of locality
Raytrace
- Map a 3D scene onto a 2D display, pixel by pixel
- Rays shot through the pixels of the image are called primary rays
- Rays reflect and refract when they hit objects; color and opacity are computed along the way
- This recursive process generates a ray tree per primary ray
- A hierarchical spatial data structure keeps track of the primitives in the scene (similar to the Barnes-Hut tree): nodes are space cells, leaves hold a linked list of the primitives they contain
- Tradeoffs between execution time and image quality
Partitioning
- Scene-oriented approach: partition the scene cells; a process computes on rays while they are in its assigned cells
- Ray-oriented approach: partition the primary rays (pixels), accessing scene data as needed; simpler, and used here
- Static assignment: bad load balance due to the unpredictability of ray bounces
- Dynamic assignment: use contiguous blocks to exploit spatial coherence among neighboring rays, plus tiles for task stealing
  - A block is the unit of assignment; a tile is the unit of decomposition and stealing
  - Insert all tiles in a queue; steal one tile at a time
Orchestration and Mapping
- Spatial locality
  - Proper data distribution for the ray-oriented approach is very difficult: accesses are dynamically changing, unpredictable, and fine-grained
  - Distribute the memory pages round-robin to avoid contention
  - Spatial locality remains poor
- Temporal locality
  - Working sets are large and ill-defined due to the unpredictability
  - Replication would do well (but capacity is limited)
- Synchronization: one barrier at the end, locks on the task queues
- Mapping: natural to a 2D mesh for the image, but likely not important
Execution Time Breakdown
- Scene: balls arranged in a bunch
- Task stealing is clearly very important for load balance
Scalable Interconnection Networks
Outline
- Basic concepts, definitions
- Topologies
- Switching
- Routing
- Performance
Formalism
- Graph G = (V, E): V is the set of switches and nodes, E ⊆ V x V the set of communication channels (edges)
- Route: a path (v0, ..., vk) of length k between nodes v0 and vk, where (vi, vi+1) ∈ E
- Routing distance: the length of the route between two nodes
- Diameter: the maximal route length between two nodes
- Average distance
- Degree: the number of input (output) channels of a node
- Bisection width: the minimal number of links that must be cut to separate the network into two equal halves
What characterizes a network?
- Bandwidth (offered bandwidth): b = w x f, where w is the width (in bytes) and f = 1/τ is the signaling rate (in Hz)
- Latency: the time a message travels between two nodes
- Throughput (delivered bandwidth): how much of the offered bandwidth is effectively used
What characterizes a network?
- Topology: the physical interconnection structure of the network graph
- Routing algorithm: restricts the set of paths that messages may follow; many algorithms with different properties
- Switching strategy: how the data in a message traverses its route; circuit switching vs. packet switching
- Flow control mechanism: determines when a message, or portions of it, traverses its route; what happens when traffic is encountered?
Goals
- Latency as small as possible
- Throughput as high as possible
  - As many concurrent transfers as possible; the bisection width gives the potential number of parallel connections
- Cost as low as possible
Bus (e.g. Ethernet)
[Figure: five nodes attached to a bus]
- Degree = 1: each node has only one incoming/outgoing link
- Diameter = 1: there is a direct connection from every node to every other node, with no intermediate node in between; no routing necessary
- Bisection width = 1: a single message from one half of the nodes to the other is enough to saturate the network
- Connectivity = 1: detaching a single node (e.g. unplugging it from the Ethernet) splits the network into two parts (one of them with a single element); no fault tolerance, and no detour when the network is busy
- CSMA/CD protocol; limited bus length
- Simplest and cheapest dynamic network
Complete graph
[Figure: complete graph on five nodes]
- Static network: a direct connection between each pair of nodes; no routing/addressing necessary
- Degree = n - 1: a high degree, too expensive for big networks; to detach one node, the n - 1 connections to the other nodes must all be cut (connectivity = n - 1)
- Diameter = 1: thanks to the direct connections, a message reaches its target in one step
- Bisection width = ⌊n/2⌋ x ⌈n/2⌉: when the network is cut into two halves, each node keeps a connection to every node of the other half
Ring
[Figure: ring of five nodes]
- Static network: node i is linked with nodes (i+1) mod n and (i-1) mod n
- Degree = 2
- Diameter = n/2: slow for big networks
- Bisection width = 2
- Examples: FDDI, SCI, FibreChannel Arbitrated Loop, KSR1
d-dimensional grid
[Figure: 3x3 two-dimensional grid, nodes (1,1) through (3,3)]
- Static network; examples: Cray T3D and T3E
- For d dimensions and n nodes (side length n^(1/d)):
  - Degree = 2d for interior nodes (e.g. node (2,2) has 2 neighbors in each dimension)
  - Diameter = d (n^(1/d) - 1)
  - Bisection width = (n^(1/d))^(d-1)
- Examples: a one-dimensional grid is a chain, which splits into two equal halves by cutting a single edge; a 4x4 grid of 16 nodes splits into two 2x4 halves by cutting 4 edges (sqrt(16) = 4)
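The grid formulas above can be sketched in code (assuming n is an exact d-th power; the degree is stated for interior nodes):

```python
def grid_metrics(n, d):
    # Degree, diameter, and bisection width of a d-dimensional grid of
    # n nodes with side length k = n**(1/d), assumed an exact integer.
    k = round(n ** (1.0 / d))
    assert k ** d == n, "n must be a perfect d-th power"
    return {
        "degree": 2 * d,          # interior nodes: 2 neighbors per dimension
        "diameter": d * (k - 1),  # walk k-1 steps in each of the d dimensions
        "bisection": k ** (d - 1) # one (d-1)-dimensional cut plane of links
    }
```

For the 4x4 example from the slide this gives diameter 6 and bisection width 4, matching the cut of 4 edges described above.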
Crossbar
[Figure: 3x3 crossbar of switches]
- Dynamic network: fast and expensive (n^2 switches)
- Mostly used as a processor x memory interconnect; typical sizes: 4x4, 8x8, 16x16
- Degree = 1, diameter = 2
- Connectivity = 1: detaching a single node cuts it off (the circle and the dashed circle in the figure are the same node)
- Bisection width = n/2: half of the processors can communicate with the other half simultaneously, so n/2 messages can be under way at the same time; the bisection width is therefore optimal
Hypercube (1)
- Hamming distance: the number of bits in which the binary representations of two numbers differ
- Two nodes are connected if their Hamming distance is 1
- Routing from x to y proceeds by repeatedly decreasing the Hamming distance
- Static network
[Figure: 2-, 3-, and 4-dimensional hypercubes with binary node labels]
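Hamming-distance routing can be sketched as: flip one differing address bit per hop, so each hop reduces the distance by one. The low-to-high bit order is an arbitrary choice here; real routers may fix dimensions in a different order.

```python
def hamming_route(src, dst, k):
    # Route in a k-dimensional hypercube: at each hop, flip one bit in
    # which the current node and the destination differ. The path length
    # equals the Hamming distance between src and dst.
    path = [src]
    node = src
    for bit in range(k):
        if (node ^ dst) & (1 << bit):
            node ^= 1 << bit
            path.append(node)
    return path
```

For example, routing 0000 -> 0101 flips bits 0 and 2, a two-hop path.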
Hypercube (2)
- k dimensions, n = 2^k nodes
- Degree = k, diameter = k, bisection width = n/2
- Two (k-1)-hypercubes are linked through n/2 edges to form a k-hypercube
- Examples: Intel iPSC/860, SGI Origin 2000
Omega-Network (1)
- Building block: a 2x2 shuffle switch
- Perfect shuffle: the target address is the cyclic left shift of the source address
[Figure: perfect shuffle wiring of addresses 000 through 111]
Omega-Network (2)
- log2(n) levels of the 2x2 shuffle building block; a dynamic network
- Level i looks at bit i of the destination address: if 0, go up; if 1, go down
- Example: node 100 sending to node 110
[Figure: 8x8 Omega network, addresses 000 through 111]
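The per-level rule (shuffle, then go up on 0 / down on 1) can be simulated directly. A sketch, tracking the port label as an integer; encoding "up" as low bit 0 follows the slide's convention:

```python
def omega_route(src, dst, k):
    # Destination-tag routing through a k-stage Omega network on 2**k
    # ports. Each stage first applies the perfect shuffle (cyclic left
    # shift of the address), then the 2x2 switch replaces the low bit
    # with the current destination bit (0 = upper output, 1 = lower).
    n = 1 << k
    node = src
    trace = [node]
    for stage in range(k):
        node = ((node << 1) | (node >> (k - 1))) & (n - 1)  # perfect shuffle
        bit = (dst >> (k - 1 - stage)) & 1                  # stage i examines bit i
        node = (node & ~1) | bit                            # switch goes up/down
        trace.append(node)
    return trace
```

After log2(n) stages every source bit has been shifted out and overwritten by a destination bit, so the message arrives at dst regardless of src, reproducing the slide's 100 -> 110 example.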
Omega-Network (3)
- n nodes, (n/2) log2(n) building blocks
- Degree = 2 for nodes, 4 for building blocks
- Diameter = log2(n)
- Bisection width = n/2: for a random permutation, n/2 messages are expected to cross the network in parallel
- Extremes: if all nodes want to send to node 0, only one message is in flight at a time; if each node sends a message to itself, n messages proceed in parallel
Fat Tree / Clos-Network (1)
- Nodes are the leaves of a tree
- A tree has diameter 2 log2(n): from the farthest left leaf over the root to the farthest right leaf
- A simple tree has bisection width 1: a bottleneck
- Fat tree: edges at level i have double the capacity of edges at level i - 1
- At level i, expensive switches with 2^i inputs and 2^i outputs
- Also known as Clos networks
Fat Tree/Clos-Network (2)
- Routing: take the direct way over the lowest common ancestor; when alternatives exist, choose randomly
- Tolerant to node failures
- Diameter = 2 log2(n); bisection width = n/2
- Example with 16 nodes: each inner node of the tree has 4 children; to split the nodes into two halves, half of the child edges of every root node (those leading to the other half) must be cut; with 4 root nodes this makes 8 edges in total
- Example machine: CM-5
Switching
- How a message traverses the network from one node to another
- Circuit switching: one path from source to destination is established and all packets take that path; like the telephone system
- Packet switching: a message is broken into a sequence of packets, which can be sent across different routes; better utilization of the network resources
Packet Routing
There are two basic approaches to routing packets, based on what a switch does when a packet begins arriving:
- Store-and-forward
- Cut-through (virtual cut-through and wormhole)
Packet routing: Store-and-Forward
- A packet is completely stored at a switch before being forwarded
- The packet is on at most two nodes at any time
- Problem: switches need lots of memory to store the incoming packets
- Switching takes place step by step, so the danger of blocking is small
Packet routing: Cut-through
- A packet may have entered the switch only partially, leaving its tail on other nodes; it may reside on more than two switches
- The decision to forward the packet can be taken right away
- What to do with the rest of the packet if the head blocks?
  - Virtual cut-through: gather the tail where the head is; degenerates into store-and-forward under high contention
  - Wormhole: if the head blocks, the whole "worm" blocks
Store&Forward vs Cut-Through Routing
h (n/b + D) vs. n/b + h D, where:
- h: number of hops
- n: message size
- b: bandwidth
- D: routing delay per hop
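The two formulas can be compared with a quick numeric check (the parameter values in the usage note below are made up for illustration):

```python
def store_and_forward(h, n, b, D):
    # Each of the h hops receives the whole n-byte packet (n/b time
    # units) before forwarding it, plus a routing delay D per hop.
    return h * (n / b + D)

def cut_through(h, n, b, D):
    # The head moves on right after each routing decision, so the
    # serialization time n/b is paid only once.
    return n / b + h * D
```

With h = 4 hops, n = 1000 bytes, b = 1 byte/cycle, and D = 10 cycles, store-and-forward takes 4040 cycles versus 1040 for cut-through; for a single hop the two are identical.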
Routing Algorithm
How do I know where a packet should go?
- Topology does NOT determine routing
- Routing algorithms:
  - Arithmetic
  - Source-based
  - Table lookup
  - Adaptive: route based on network state (e.g., contention)
(1) Arithmetic Routing
- For a regular topology, use simple arithmetic to determine the route, e.g. xy-routing on a 3D torus
- The packet header contains a signed offset to the destination (one per dimension)
- At each hop, the switch increments or decrements the header to reduce the offset in one dimension
- When x == 0 and y == 0, the packet is at the correct processor
- Drawbacks: requires an ALU in the switch, and the CRC must be re-computed at each hop
[Figure: 2x2x2 cube with node coordinates (0,0,0) through (1,1,1)]
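A sketch of the header-offset mechanism. Wraparound torus links and the CRC issue are ignored, and the names are illustrative:

```python
def offset_route(src, dst, dims):
    # Dimension-order arithmetic routing: the header carries a signed
    # offset per dimension; each hop increments/decrements one
    # coordinate until every offset has been driven to 0.
    offset = [d - s for s, d in zip(src, dst)]
    pos, hops = list(src), [tuple(src)]
    for dim in range(dims):
        step = 1 if offset[dim] > 0 else -1
        while offset[dim] != 0:
            pos[dim] += step          # the switch applies +/- in this dimension
            offset[dim] -= step       # header offset shrinks toward 0
            hops.append(tuple(pos))
    return hops
```

The packet finishes dimension x completely before starting y, which is what makes the route deterministic.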
(2) Source Based & (3) Table Lookup Routing
Source-based
- The source specifies the output port for each switch on the route
- Very simple switches: no control state, just strip the output port off the header
- Used by Myrinet
- Cannot be made adaptive
Table lookup
- Very small header: it contains a field that is an index into a table of output ports
- Big tables, which must be kept up to date
Deterministic vs. Adaptive Routing
- Deterministic: follows a pre-specified route
  - k-ary d-cube: dimension-order routing from (x1, y1) to (x2, y2): first Δx = x2 - x1, then Δy = y2 - y1
  - Tree: route through the lowest common ancestor
- Adaptive: the route is determined by contention for output ports
[Figure: 3-cube with binary node labels]
(4) Adaptive Routing
- Essential for fault tolerance: at least multipath routing is needed
- Can improve the utilization of the network: simple deterministic algorithms easily run into bad permutations
Contention
- Two packets trying to use the same link at the same time
- Buffering is limited: drop packets?
- Most parallel-machine networks block in place
  - Traffic may back up toward the source
  - Tree saturation: backing up all the way to the sources
- Alternative: discard packets and inform the source
Communication Perf: Latency
Time(n) from source to destination = overhead + routing delay + channel occupancy + contention delay
- Overhead: the time necessary for initiating the sending and the reception of a message
- Channel occupancy = (n + ne) / b, where n is the data (payload) size, ne the packet envelope size, and b the bandwidth
- Routing delay
- Contention delay
Bandwidth
- What affects local bandwidth?
  - Packet density: b x n / (n + ne)
  - Routing delay: b x n / (n + ne + wD), where D is the number of cycles spent waiting for a routing decision and w is the width of the channel
  - Contention, at the endpoints and within the network
- Aggregate bandwidth
  - Bisection bandwidth: the sum of the bandwidths of the smallest set of links that partition the network into two halves; a bad estimate if the distribution of communication is not uniform
  - Total bandwidth of all the channels
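The two "local bandwidth" expressions above, as code. The numbers in the usage note are made up, and `delta` stands in for the slide's D:

```python
def packet_density(b, n, ne):
    # Fraction of the raw bandwidth b that carries payload:
    # b * n / (n + ne), where ne is the packet envelope (header) size.
    return b * n / (n + ne)

def with_routing_delay(b, n, ne, w, delta):
    # Each routing decision idles the channel for delta cycles, i.e.
    # for w * delta bytes of a w-byte-wide channel:
    # b * n / (n + ne + w * delta).
    return b * n / (n + ne + w * delta)
```

For instance, with a 64-byte envelope on a 960-byte payload, only 93.75% of the raw bandwidth carries data, and an 8-cycle routing decision on an 8-byte channel lowers that further.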
Interconnects
- Gigabit Ethernet: latency in the µs range, 1 Gb/s, star or fat-tree topology; cheap for small systems
- Infiniband 4x: 3.5-7 µs, 10-20 Gb/s, fat tree; not as mature as Myrinet, smaller switches (128 ports), cost ~$500/card + port
- Myrinet: 2-8 Gb/s, Clos; mature, de facto standard; cost ~$500/card + port
- NUMAlink4: 1-2 µs, 8-16 Gb/s; SGI proprietary, special microprocessor for I/O, shmem
- Quadrics: 9 Gb/s; expensive, used in turn-key machines
- SCI/Dolphin: 4 Gb/s, 2D/3D torus; cabling nightmare, costs more than Myrinet
Myrinet
- Offered bandwidth: 2+2 Gbit/s, full duplex
- Latency: 5-7 µs
- Arbitrary topology; a Fat Tree/Clos network is preferable
- Routing: wormhole, source routing
- Cable (8+1 bits in parallel) or fiber optics
- Flow control on each link
- Programmable adaptor: 333 MHz RISC processor, 2 MB memory; PCI/PCI-X connection up to 133 MHz, 64 bit, 8 Gb/s unidirectional over the PCI-X bus
Myrinet Fat Tree (128 node)
- Built from 16x16 crossbars
- Only 8 lines are shown here; each is laid out twice, for duplex operation
Myrinet PCI-Bus-Adaptor
[Figure: adaptor block diagram — cable connector, network interface, net DMA, host DMA, PCI bridge, LanAI CPU, 2 MB SRAM]
- PCI(-X) bridge, 64 bit
- LanAI RISC processor, 333 MHz, 2 MB SRAM
- 2 fiber-optic connectors, both duplex
Myrinet 16x16 crossbar
- 8 computers connected on the front side (2 channels each)
- On the back side, 8 outputs (2 channels each) toward the next level of the Clos network
128-node Clos network, built from the crossbar building block shown earlier
Myrinet 256+256-Clos-Network
- Routing network with bisection width 256
- Front side: connections for 256 computers
- Back side: 256 connections to the next level of routing units
Clos-Network with full bisection width: 64 nodes and 32 nodes
Computer Architecture II