Introduction to Parallel Computing

Models of parallel computers: [slide figure: (1) the sequential model, a single PE connected to a memory M; (2) further models with multiple PE/memory modules]

[slide figure: models (3) and (4) - with pipelining, functional stages f1-f4 overlap the processing of successive items]

TAXONOMY - along four dimensions: Control, Address Space, Interconnection Network, Granularity.
Control:
- SISD: Single Instruction stream, Single Data stream - a single program, less memory needed.
- SIMD: Single Instruction stream, Multiple Data streams - dedicated hardware (expensive).
- MISD: Multiple Instruction streams, Single Data stream - rigid.
- MIMD: Multiple Instruction streams, Multiple Data streams - program and data replicated on all PEs; can be built from off-the-shelf hardware (inexpensive); flexible.

[slide figure: SIMD uses a single global control unit; in MIMD each PE has local control]

Address Space - two paradigms:
- Message passing: each node pairs a PE with its own memory.
- Shared memory: each PE can access any location in a global memory.
Message passing: [figure: PE/memory pairs P1/M, P2/M, ..., Pp/M connected by an interconnection network (INN)]

Shared memory: [figure: PEs P1, P2, P3, ... connected through an interconnection network (INN) to global memory banks M1 ... Mn]. Easier to program, but needs specialized hardware (expensive). To minimize the cost of communication: latency-hiding techniques.

[figure: NUMA organization - each PE has a local memory and reaches the global memory through the interconnection network (INN)]

Interconnection Networks:
- Static (direct): used for message passing.
- Dynamic (indirect): used for shared memory.
Processor granularity:
- Coarse grain: suited to algorithms that do not require frequent communication.
- Medium grain: suited to algorithms in which the ratio (time for a basic communication) / (time for a basic computation) is small.
- Fine grain: suited to algorithms that require frequent communication.
PRAM model (an idealized parallel computer):
- p processors sharing a common clock, each working on its own instruction stream;
- a global memory of m words, uniformly accessible to all PEs;
- a synchronous shared-memory MIMD machine in which interaction between PEs costs nothing.

How is memory accessed for reads (R) and writes (W)?
- EREW: exclusive read, exclusive write - the weakest model, minimum concurrency.
- CREW: concurrent read, exclusive write - write access is serialized, read access is concurrent.
- ERCW: exclusive read, concurrent write - write access is concurrent, multiple read accesses are serialized.
- CRCW: concurrent read, concurrent write - the most powerful model; it can be simulated on an EREW model.
Concurrent read (CR) does not require any modification of the program; concurrent write (CW) requires arbitration. Arbitration protocols:
- Common: the write succeeds only if all values written are identical.
- Arbitrary: an arbitrary PE succeeds; the other PEs fail.
- Priority: the PE with the highest priority writes.
- Sum: the sum of all written quantities is stored.
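To make the concurrent-write protocols concrete, here is a minimal sketch (the function names and the (priority, value) representation are my own assumptions, not from the slides) that resolves a set of simultaneous writes to one memory cell under each protocol:

```python
# Hypothetical sketch of CRCW concurrent-write resolution; each write is a
# (pe_priority, value) pair aimed at the same memory cell.

def resolve_common(writes):
    # COMMON: the write succeeds only if all PEs write the same value.
    values = {v for _, v in writes}
    if len(values) != 1:
        raise ValueError("COMMON protocol violated: written values differ")
    return values.pop()

def resolve_arbitrary(writes):
    # ARBITRARY: one arbitrarily chosen PE succeeds; the others fail silently.
    return writes[0][1]

def resolve_priority(writes):
    # PRIORITY: the PE with the highest priority (smallest id here) wins.
    return min(writes, key=lambda w: w[0])[1]

def resolve_sum(writes):
    # SUM: the sum of all written values is stored.
    return sum(v for _, v in writes)

if __name__ == "__main__":
    writes = [(3, 5), (1, 7), (2, 5)]      # (PE id used as priority, value)
    print(resolve_arbitrary(writes))        # 5
    print(resolve_priority(writes))         # 7 (PE 1 has the highest priority)
    print(resolve_sum(writes))              # 17
```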

Dynamic interconnection networks, e.g. for an EREW PRAM with p processors and m words of global memory: switching elements determine which memory word is accessed. If each PE can access any of the m words, the number of switching elements is Θ(mp); this gives UMA access and good performance, but is expensive. Solution: reduce the effective m by organizing the m words into b memory banks; each PE then switches among the b banks, and the total number of switching elements is Θ(bp), which is less expensive. Note: this is only a weak approximation of EREW, because no two PEs can access the same memory bank at the same time.

Crossbar switching networks: [figure: processors P0 ... Pp-1 connected to memory banks M0, M1, M2, ..., Mb-1 through a grid of switches]

Crossbar with m memory words, b banks, p processors: for m ≥ p, each PE can access a memory bank, and if b = m the crossbar simulates an EREW PRAM. The total number of switching elements is Θ(pb); since b must be at least p (otherwise some PEs would be unable to access any memory bank), the number of switching elements grows as Ω(p²) as p increases ⇒ not scalable.
Bus-based networks: [figure: PEs P1 ... Pp and the global memory M attached to a shared bus] - a UMA organization.

Each PE accesses data from the global memory M over the same bus (UMA). As p grows, each PE spends an increasing amount of time waiting for memory access while the bus is used by other PEs ⇒ not scalable. Solution: provide each PE with a local cache, which reduces the total number of accesses to global memory. This implies replicating data ⇒ cache coherency problems. [figure: global memory and PEs P1 ... Pp, each with its own cache, attached to the bus]

Multistage interconnection networks.
- Crossbar: good performance (+), but the large number of switching elements makes it expensive (-).
- Bus: cheap (+), but the shared bus is a performance bottleneck (-).
- Multistage network: the best trade-off - cost lower than a crossbar, performance better than a bus.
[figure: cost and performance of bus, multistage, and crossbar networks plotted against the number of processors p]

Multistage interconnection network - the Omega network: p = number of PEs, b = number of memory banks, with p = b; the network has log p stages.

A link connects input i of a stage to output j (the perfect shuffle) where:
j = 2i for 0 ≤ i ≤ p/2 - 1
j = 2i + 1 - p for p/2 ≤ i ≤ p - 1
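The connection rule above is the perfect-shuffle permutation applied between stages of the Omega network. A small sketch, assuming p is a power of two (the function names are mine); the rule is equivalent to a one-bit left rotation of the log p-bit input label:

```python
def shuffle(i, p):
    """Omega-network link: output j reached from input i, with p inputs per stage."""
    if i <= p // 2 - 1:
        return 2 * i
    return 2 * i + 1 - p

def shuffle_by_rotation(i, p):
    """The same mapping as a one-bit left rotation of the log2(p)-bit label."""
    bits = p.bit_length() - 1               # log2(p), p assumed a power of two
    return ((i << 1) | (i >> (bits - 1))) & (p - 1)

if __name__ == "__main__":
    p = 8
    assert all(shuffle(i, p) == shuffle_by_rotation(i, p) for i in range(p))
    print([shuffle(i, p) for i in range(p)])   # [0, 2, 4, 6, 1, 3, 5, 7]
```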

Switch configuration within a stage: each switch is set to pass-through or cross-over, selected by the bit of the destination address corresponding to that stage (most significant bit first).

When the paths of two PEs require the same link, one must wait ⇒ the Omega network is a blocking network.
Static interconnection networks - used in message passing.
Completely-connected network: every PE is connected directly to every other PE; like a crossbar it is non-blocking and can support many simultaneous transfers. Communication between any two PEs takes 1 step.

Star-connected network: a central PE connected to all others; the central PE is a bottleneck, similar to a bus-based network. Communication between two non-central PEs takes 2 steps.
Linear array and ring.
Mesh network (2-D, 3-D): a linear array extended to 2 or 3 dimensions; each interior PE is connected to 4 PEs (2-D) or 6 PEs (3-D). If the periphery PEs are also connected ⇒ wraparound mesh (torus).

Tree network: only one path exists between any pair of PEs. Linear arrays and star-connected networks are special cases of tree networks. Static tree: each node corresponds to a PE. Dynamic tree: only the leaf nodes are PEs; the intermediate nodes are switching elements that route a message from a source S up and back down to a destination D.

Fat tree: the number of communication links is increased for links closer to the root; in this way bottlenecks at the higher levels of the tree are alleviated.
Hypercube network: a multidimensional mesh with 2 PEs in each dimension. A d-dimensional hypercube has p = 2^d PEs and can be constructed recursively: 2^0 = 1 PE (0-d), 2^1 = 2 (1-d), 2^2 = 4 (2-d), 2^3 = 8 (3-d), and so on.

A (d+1)-dimensional hypercube is constructed by connecting the corresponding PEs of two d-dimensional hypercubes (the labels of the PEs of one cube are prefixed with 0 and those of the second cube with 1).

Figure HC2: the three distinct partitions of a three-dimensional hypercube into two two-dimensional subcubes; links connecting processors within a partition are drawn in bold.

Figure HC3: [figure: subcubes of a hypercube obtained by fixing bits of the PE labels; referenced below]

Properties:
- Two PEs are connected by a direct link iff the binary representations of their labels differ in exactly one bit position.
- In a d-dimensional hypercube, each PE is directly connected to d other PEs.
- A d-dimensional hypercube can be partitioned into two (d-1)-dimensional subcubes (see Figure HC2); since PE labels have d bits, d such partitions exist.
- Fixing any k bits of the d-bit labels: the PEs that differ only in the remaining (d-k) bit positions form a (d-k)-dimensional subcube of 2^(d-k) PEs, and there are 2^k such subcubes (see Figure HC3).
Example: k = 2, d = 4 ⇒ 4 subcubes are formed by fixing the 2 MSBs, 4 subcubes by fixing the 2 LSBs, and in general 4 subcubes for any choice of 2 fixed bits.
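A short sketch of the first and last properties (the helper names are my own): neighbours differ in exactly one bit, and fixing k bits of a d-bit label selects a (d-k)-dimensional subcube of 2^(d-k) PEs.

```python
def are_neighbors(a, b):
    """True iff hypercube labels a and b differ in exactly one bit position."""
    return bin(a ^ b).count("1") == 1

def neighbors(a, d):
    """The d PEs directly connected to PE a in a d-dimensional hypercube."""
    return [a ^ (1 << bit) for bit in range(d)]

def subcube(d, fixed_bits):
    """PEs obtained by fixing the given bit positions of a d-bit label.
    fixed_bits: {bit_position: 0 or 1}; the remaining d - k bits vary freely."""
    return [label for label in range(2 ** d)
            if all((label >> pos) & 1 == val for pos, val in fixed_bits.items())]

if __name__ == "__main__":
    print(neighbors(0b010, 3))           # [3, 0, 6], i.e. 011, 000, 110
    print(subcube(4, {3: 1, 2: 0}))      # the 2-d subcube 10xx: [8, 9, 10, 11]
```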

The total number of bit positions at which two labels differ is their Hamming distance: for source s and destination t it is the number of ones in s ⊕ t, where ⊕ denotes exclusive OR. The number of communication links on the shortest path between two PEs equals the Hamming distance between their labels, so the shortest path between any two PEs in a hypercube has at most d links (s ⊕ t cannot contain more than d ones).
k-ary d-cube networks: a d-dimensional hypercube (2 PEs along each dimension) is a binary d-cube, or 2-ary d-cube. In a k-ary d-cube, d is the dimension of the network and k (the radix) is the number of PEs in each dimension, so the number of PEs is p = k^d. Example: a 2-D mesh with p PEs is a k-ary 2-cube with p = k², i.e. k = √p.

Evaluating static interconnection networks (in terms of cost and performance).
Diameter: the maximum distance between any two PEs in the network, where the distance between two PEs is the length of the shortest path between them; it determines communication time.
- completely connected network: 1
- star-connected network: 2
- ring: ⌊p/2⌋
- 2-D mesh: 2(√p - 1)
- 2-D wraparound mesh: 2⌊√p/2⌋
- hypercube: log p
- complete binary tree: 2 log((p+1)/2); e.g. for p = 15 processors (= number of nodes), the height is h = log((p+1)/2) = 3 and the diameter is 2h = 6.
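The diameters listed above can be collected into a small lookup sketch (names and structure are mine; p is the number of PEs):

```python
import math

def diameter(topology, p):
    """Diameter of common static topologies with p PEs (p assumed suitable:
    a perfect square for meshes, a power of two for hypercubes, 2^h - 1 for trees)."""
    side = int(math.isqrt(p))
    return {
        "completely_connected": 1,
        "star": 2,
        "ring": p // 2,
        "2d_mesh": 2 * (side - 1),
        "2d_wraparound_mesh": 2 * (side // 2),
        "hypercube": int(math.log2(p)),
        "complete_binary_tree": 2 * int(math.log2((p + 1) // 2)),
    }[topology]

if __name__ == "__main__":
    print(diameter("2d_mesh", 16))               # 6
    print(diameter("hypercube", 16))             # 4
    print(diameter("complete_binary_tree", 15))  # 6
```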

Connectivity: a measure of the multiplicity of paths between PEs; high connectivity lowers contention for communication resources. Arc connectivity is one such measure: the minimum number of arcs that must be removed to break the network into two disconnected networks.
- 1: linear array, star, tree
- 2: ring, 2-D mesh without wraparound
- 4: 2-D mesh with wraparound
- d: d-dimensional hypercube
Bisection width: the minimum number of communication links that have to be removed to partition the network into two equal halves.
- ring: 2
- 2-D mesh without wraparound: √p
- 2-D mesh with wraparound: 2√p
- tree, star: 1 (BW_star = BW_tree = 1)
- fully connected network: p²/4
- hypercube: p/2
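And similarly for the arc connectivities and bisection widths just listed (same caveats as the diameter sketch above):

```python
import math

def arc_connectivity(topology, d=None):
    """Minimum number of arcs whose removal disconnects the network."""
    return {"linear_array": 1, "star": 1, "tree": 1,
            "ring": 2, "2d_mesh": 2,
            "2d_wraparound_mesh": 4,
            "hypercube": d}[topology]

def bisection_width(topology, p):
    """Minimum number of links whose removal splits the network into two equal halves."""
    side = int(math.isqrt(p))
    return {"star": 1, "tree": 1, "ring": 2,
            "2d_mesh": side,
            "2d_wraparound_mesh": 2 * side,
            "completely_connected": p * p // 4,
            "hypercube": p // 2}[topology]

if __name__ == "__main__":
    print(arc_connectivity("hypercube", d=4))   # 4
    print(bisection_width("hypercube", 16))     # 8
```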

Hypercube bisection: a d-dimensional hypercube of p processors consists of two (d-1)-dimensional hypercubes of p/2 PEs each, joined by the p/2 links that connect corresponding PEs; removing those p/2 links bisects the network.

Bisection width also measures cost, by providing a lower bound on the packaging area (in 2-D) or volume (in 3-D): if the bisection width of a network is w, the lower bound on the area of a 2-D packaging is Θ(w²) and the lower bound on the volume of a 3-D packaging is Θ(w^(3/2)).
Channel width: the number of bits that can be communicated simultaneously over a link connecting two PEs (= the number of wires).
Channel rate: the peak rate at which data can be communicated over a single wire.
Channel bandwidth: channel rate × channel width - the peak rate at which data can be communicated between the ends of a communication link.
Bisection bandwidth: bisection width × channel bandwidth - the minimum volume of communication allowed between any two halves of the network containing equal numbers of PEs.
Cost: measured in terms of (a) the number of communication links, or (b) the bisection bandwidth.

Cost as the number of communication links (the number of wires required): linear array and tree: p - 1 links; d-dimensional wraparound mesh: dp; hypercube: (p log p)/2.
Embedding other networks into a hypercube. Given two graphs G(V, E) and G'(V', E'), embedding G into G' maps each vertex of V onto a vertex (or a set of vertices) of V' and each edge of E onto an edge (or a set of edges) of E'. Nodes correspond to PEs and edges to communication links.
Why is embedding needed? It may be necessary to adapt one network to another, e.g. when an application is written for a specific network that is not currently available.
Parameters:
- congestion: the maximum number of edges of E mapped onto a single edge of E';
- dilation: the maximum number of edges of E' onto which a single edge of E is mapped;
- expansion: the ratio of the number of vertices of V' to the number of vertices of V;
- contraction: the reciprocal of expansion.
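As an illustration of these parameters, a minimal sketch (representing an embedding as a vertex list plus an edge-to-path map is my own assumption) that computes congestion, dilation, and expansion:

```python
from collections import Counter

def embedding_metrics(vertices_G, vertices_Gp, edge_paths):
    """edge_paths maps each edge of G to the list of G'-edges forming its image path.
    Edges are frozensets of two vertex labels, so orientation does not matter."""
    usage = Counter()
    for path in edge_paths.values():
        for e in path:
            usage[e] += 1
    congestion = max(usage.values())                            # G-edges per G'-edge
    dilation = max(len(path) for path in edge_paths.values())   # longest image path
    expansion = len(vertices_Gp) / len(vertices_G)              # |V'| / |V|
    return congestion, dilation, expansion

if __name__ == "__main__":
    # Embed the linear array 0-1-2 into a 2-d hypercube (vertices 0..3):
    # vertex map 0->0, 1->1, 2->3; each array edge maps onto one hypercube edge.
    edge_paths = {
        frozenset({0, 1}): [frozenset({0, 1})],
        frozenset({1, 2}): [frozenset({1, 3})],
    }
    print(embedding_metrics([0, 1, 2], [0, 1, 2, 3], edge_paths))  # (1, 1, 1.33...)
```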

Embedding a linear array into a hypercube: a linear array of 2^d processors (labeled 0 to 2^d - 1) can be embedded into a d-dimensional hypercube by mapping processor i of the linear array to processor G(i, d) of the hypercube, where
G(0, 1) = 0, G(1, 1) = 1,
G(i, x+1) = G(i, x) if i < 2^x,
G(i, x+1) = 2^x + G(2^(x+1) - 1 - i, x) if i ≥ 2^x.
The function G is the binary reflected Gray code (RGC).
Embedding a mesh into a hypercube: processors in the same column have identical least significant bits and processors in the same row have identical most significant bits; each row and each column of the mesh is mapped to a unique subcube.
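A sketch of the mapping G(i, d) defined above, i.e. the binary reflected Gray code; the closed form i XOR (i >> 1) is a well-known equivalent and is used here as a cross-check:

```python
def gray(i, x):
    """G(i, x): the i-th label of the x-bit binary reflected Gray code (recursive form)."""
    if x == 1:
        return i                                  # G(0,1) = 0, G(1,1) = 1
    if i < 2 ** (x - 1):
        return gray(i, x - 1)
    return 2 ** (x - 1) + gray(2 ** x - 1 - i, x - 1)

def gray_closed_form(i):
    return i ^ (i >> 1)

if __name__ == "__main__":
    d = 3
    codes = [gray(i, d) for i in range(2 ** d)]
    print([format(c, "03b") for c in codes])   # 000 001 011 010 110 111 101 100
    assert codes == [gray_closed_form(i) for i in range(2 ** d)]
    # Consecutive array elements land on hypercube PEs that differ in exactly one bit:
    assert all(bin(a ^ b).count("1") == 1 for a, b in zip(codes, codes[1:]))
```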

Embedding a binary tree into a hypercube: a tree of depth d is embedded into a 2^d-PE hypercube. The root of the tree is mapped onto some hypercube processor. For each node m at depth j in the tree that is mapped to hypercube PE i, the left child of m is mapped to the same PE i, and the right child of m is mapped to the hypercube processor whose label is obtained by inverting bit j of i. The expansion is 1; the trade-off is that several tree nodes share a hypercube PE (up to 4 tree nodes per hypercube node in the three-dimensional example below). Note: for the array and mesh embeddings above, expansion = congestion = dilation = 1.
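A sketch of the tree embedding described above, assuming "bit j" means the j-th least significant bit (which matches the example in the figure below, rooted at PE 011); tree nodes are identified by their path of left/right moves from the root:

```python
def embed_tree(root_pe, depth):
    """Map the nodes of a complete binary tree of the given depth onto hypercube PEs:
    the left child keeps the parent's PE, the right child inverts bit j of it,
    where j is the depth of the parent (bit 0 = least significant, an assumption)."""
    mapping = {(): root_pe}                 # a tree node is its path of 'L'/'R' moves
    frontier = [((), root_pe)]
    for j in range(depth):
        next_frontier = []
        for path, pe in frontier:
            for move, child_pe in (("L", pe), ("R", pe ^ (1 << j))):
                mapping[path + (move,)] = child_pe
                next_frontier.append((path + (move,), child_pe))
        frontier = next_frontier
    return mapping

if __name__ == "__main__":
    m = embed_tree(0b011, 3)                # the example below: root at PE 011
    leaves = sorted(pe for path, pe in m.items() if len(path) == 3)
    print(leaves)                           # [0, 1, ..., 7]: every PE hosts one leaf
```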

Fig: a tree rooted at processor 011 (= 3) and embedded into a three-dimensional hypercube: (a) the organization of the tree rooted at processor 011, and (b) the tree embedded into the three-dimensional hypercube.

Routing mechanisms for static networks: a routing mechanism determines the path a message takes through the network to get from the source to the destination processor. It considers the source, the destination, and information about the state of the network, and it returns one or more paths.
Classification criteria:
- Handling of congestion: minimal routing always uses a shortest path between source and destination and does not take congestion into consideration; non-minimal routing tries to avoid network congestion, so the path may not be the shortest.
- Use of network-state information: deterministic routing does not use network-state information and determines a unique path from the source and destination alone; adaptive routing uses network-state information to avoid congestion.

Dimension-ordered routing (deterministic and minimal): X-Y routing on a mesh and E-cube routing on a hypercube.
X-Y routing: the message is sent along the X dimension until it reaches the column of the destination PE, and then along the Y dimension until it reaches its destination. Path length: |Sx - Dx| + |Sy - Dy|.
E-cube routing: the minimum distance between two PEs Ps and Pd is the number of ones in Ps ⊕ Pd. Ps computes Ps ⊕ Pd and sends the message along dimension k, where k is the position of the least significant non-zero bit of Ps ⊕ Pd. At each intermediate step, the PE Pi that receives the message computes Pi ⊕ Pd and forwards the message along the dimension corresponding to the least significant non-zero bit of that result. The process continues until the destination is reached.

E-cube routing on a hypercube network: Ps = 010, Pd = 111.
Step 1: Ps ⊕ Pd = 010 ⊕ 111 = 101. Ps forwards the message along the dimension corresponding to the least significant non-zero bit: 010 → 011.
Step 2: Pi = 011. Pi ⊕ Pd = 011 ⊕ 111 = 100. Pi forwards the message along the dimension corresponding to the least significant non-zero bit: 011 → 111.
[figure: the three-dimensional hypercube (000 ... 111) with the route 010 → 011 → 111 highlighted]
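The example above can be reproduced with a short sketch of both dimension-ordered schemes (the function names are mine): E-cube routing on a hypercube and X-Y routing on a 2-D mesh:

```python
def e_cube_route(src, dst):
    """Sequence of hypercube PEs visited by E-cube routing from src to dst."""
    path, cur = [src], src
    while cur != dst:
        diff = cur ^ dst
        k = (diff & -diff).bit_length() - 1   # position of the least significant set bit
        cur ^= 1 << k                          # traverse the link along dimension k
        path.append(cur)
    return path

def xy_route(src, dst):
    """X-Y routing on a 2-D mesh: nodes are (x, y); move along X first, then Y."""
    (sx, sy), (dx, dy) = src, dst
    path, x, y = [(sx, sy)], sx, sy
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path

if __name__ == "__main__":
    print([format(p, "03b") for p in e_cube_route(0b010, 0b111)])  # ['010', '011', '111']
    print(xy_route((0, 0), (2, 1)))   # [(0, 0), (1, 0), (2, 0), (2, 1)]: 3 hops
```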

Communication costs in static interconnection networks. Communication latency: the time taken to communicate a message between two processors in the network. Parameters:
- Startup time t_s: the time to prepare the message (add header, trailer, and error-correction information), execute the routing algorithm, and establish the interface between the PE and the router.
- Per-hop time t_h (node latency): the time taken by the header to travel between two directly connected PEs.
- Per-word transfer time t_w: if the channel bandwidth is r words per second, each word takes t_w = 1/r to traverse the link.
Communication latency is influenced by (1) the network topology and (2) the switching technique.

Store-and-forward routing: each intermediate processor on the path forwards the message to the next processor only after it has received and stored the entire message. With m the message size in words, l the number of links to be traversed, t_h the cost for the header to traverse a link, and m·t_w the cost for the rest of the message to traverse a link:
t_comm = t_s + (t_h + m·t_w)·l.
Since t_h is small compared to m·t_w·l,
t_comm ≈ t_s + m·t_w·l,
which for constant t_s is Θ(ml) time.
[figure: time diagram of a message traversing p0, p1, p2, p3 under store-and-forward routing]

Cut-through routing. Store-and-forward routing has high communication time and poor utilization of communication resources. With cut-through routing, a message is advanced from the incoming link to the outgoing link as it arrives: the message travels in small units called flow-control digits, or flits, which are pipelined through the network. An intermediate processor does not wait for the entire message to arrive before forwarding it; as soon as a flit arrives, it is passed on to the next processor, and all flits are routed along the same path. Notes: there is no need for buffer space to store the entire message at intermediate PEs, so cut-through routing uses less memory and less memory bandwidth, and it is faster.

[figure: time diagrams for p0, p1, p2, p3 - store-and-forward routing versus cut-through routing with the message broken into 4 flits; the gap between the two completion times is the time saved]
t_comm(CT) = t_s + l·t_h + m·t_w ≤ t_s + (t_h + m·t_w)·l = t_comm(SF).
For constant t_s, cut-through routing takes Θ(m + l) time; if l = 1, then t_comm(CT) = t_comm(SF) = Θ(m).
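Numerically, the two cost models compare as in this small sketch (the parameter values are purely illustrative):

```python
def t_store_and_forward(t_s, t_h, t_w, m, l):
    """Store-and-forward: header plus the whole m-word message cross each of the l links."""
    return t_s + (t_h + m * t_w) * l

def t_cut_through(t_s, t_h, t_w, m, l):
    """Cut-through: the header pays the per-hop cost, the payload is pipelined behind it."""
    return t_s + l * t_h + m * t_w

if __name__ == "__main__":
    t_s, t_h, t_w, m, l = 100.0, 2.0, 1.0, 64, 8
    print(t_store_and_forward(t_s, t_h, t_w, m, l))   # 628.0
    print(t_cut_through(t_s, t_h, t_w, m, l))         # 180.0
```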

Cost-performance trade-offs: a performance analysis of a mesh and a hypercube of identical cost, based on two cost metrics: the cost of a network taken as proportional to (1) the number of wires or (2) the bisection bandwidth. Assumption: lightly loaded networks and cut-through routing.
(1) If the cost of a network is proportional to the number of wires: a 2-D wraparound mesh with (log p)/4 wires per channel costs as much as a p-processor hypercube with 1 wire per channel (see Table 2.1, page 38 of the text).
Average communication latencies (for a mesh and a hypercube of the same cost): the average distance l_av between two PEs is √p/2 in a 2-D wraparound mesh and (log p)/2 in a hypercube. With cut-through routing, the time to send a message of size m between PEs that are l_av hops apart is t_s + t_h·l_av + t_w·m.

In the mesh, the channel width is scaled up by a factor of (log p)/4, so the per-word transfer time is reduced by the same factor: if the per-word transfer time on the hypercube is t_w, the corresponding time on the mesh is t_w(mesh) = 4·t_w / log p.
Average communication latency:
hypercube: t_s + t_h·(log p)/2 + t_w·m
mesh: t_s + t_h·(√p)/2 + 4·t_w·m / log p
Analysis: for fixed p and growing m, the t_w terms dominate, and 4·m·t_w / log p < m·t_w whenever p > 16, so the mesh term is the smaller one.
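The comparison can be checked numerically with a small sketch that assumes the equal-wire-cost scaling just described (parameter values are illustrative only):

```python
import math

def latency_hypercube(t_s, t_h, t_w, m, p):
    """Average cut-through latency on a hypercube: l_av = (log p)/2, 1 wire per channel."""
    return t_s + t_h * math.log2(p) / 2 + t_w * m

def latency_mesh_equal_wires(t_s, t_h, t_w, m, p):
    """Average cut-through latency on a 2-D wraparound mesh whose channels are (log p)/4
    wires wide, so the per-word time drops to 4*t_w/log p; l_av = sqrt(p)/2."""
    return t_s + t_h * math.isqrt(p) / 2 + 4 * t_w * m / math.log2(p)

if __name__ == "__main__":
    t_s, t_h, t_w, m = 100.0, 2.0, 1.0, 512
    for p in (16, 64, 256):
        print(p, latency_hypercube(t_s, t_h, t_w, m, p),
              latency_mesh_equal_wires(t_s, t_h, t_w, m, p))
    # For p > 16 the mesh latency is the smaller of the two, matching the analysis above.
```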

Point-to-point communication of large messages between random pairs of PEs therefore takes less time on a wraparound mesh with cut-through routing than on a hypercube of the same cost.
Notes: first, if store-and-forward routing is used, the mesh is not more cost-efficient than the hypercube; second, the above analysis was performed under light load in the network; as the number of messages grows, contention appears, and the mesh is affected by contention more than the hypercube, so for heavily loaded networks the hypercube is better than the mesh.
(2) If the cost of a network is proportional to its bisection width: a 2-D wraparound mesh of p processors with √p/4 wires per channel has a cost equal to a p-processor hypercube with 1 wire per channel (see Table 2.1, page 38).

Average communication latencies (for a mesh and a hypercube of the same cost under this metric): the mesh channels are wider by a factor of √p/4, so the per-word transfer time is reduced by the same factor. The latencies are:
hypercube: t_s + t_h·(log p)/2 + t_w·m
mesh: t_s + t_h·(√p)/2 + 4·t_w·m / √p
Analysis: for fixed p and growing m, the t_w term dominates (for p > 16) and the mesh outperforms a hypercube of the same cost, provided the network is lightly loaded. For heavily loaded networks, the performance of the mesh is comparable to that of the hypercube.