Published by Jean Pierce. Modified over 9 years ago.
1
CA406 Computer Architecture Networks
2
Data Flow - Summary
- Fine-Grain Dataflow
  - Suffered from comms network overload!
- Coarse-Grain Dataflow
  - Monsoon...
  - Overtaken by commercial technology!!
- A sad “fact-of-life”
  - It’s almost impossible to generate the funds for non-“mainstream” computer architecture research
  - $n x 10^8 required
  - Non-mainstream = interesting!
3
Data Flow - Summary
- As a software model …
- Functional languages
  - Dataflow in a different guise!
  - Theoretically important
  - Practically? Inefficient (= slow!!) … Ask your CS colleagues!
- Cilk - based on C
  - Used on CIIPS Myrmidons
  - Uses a dataflow model
    - Threads become ready for execution when their data is generated
  - Message passing efficiency
    - Without explicit data transfer & synchronisation!
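The firing rule mentioned for Cilk (a thread becomes ready once its data has been generated) can be sketched in a few lines of Python. This is an illustrative toy, not Cilk itself: `run_dataflow`, the graph encoding, and the node names are all assumptions made for this example.

```python
from collections import deque

def run_dataflow(nodes, inputs):
    """Run a tiny dataflow graph.

    nodes:  {name: (function, [names of the values it consumes])}
    inputs: {name: value} for values that are available at the start.
    Returns (all computed values, order in which nodes fired).
    """
    values = dict(inputs)
    # Number of operands each node is still waiting for.
    missing = {name: sum(1 for d in deps if d not in values)
               for name, (_, deps) in nodes.items()}
    # Who consumes each value, so a producer can wake its consumers.
    consumers = {}
    for name, (_, deps) in nodes.items():
        for d in deps:
            consumers.setdefault(d, []).append(name)

    ready = deque(n for n, count in missing.items() if count == 0)
    fired = []
    while ready:
        name = ready.popleft()
        func, deps = nodes[name]
        values[name] = func(*(values[d] for d in deps))
        fired.append(name)
        # Producing a value may make waiting nodes ready.
        for consumer in consumers.get(name, []):
            missing[consumer] -= 1
            if missing[consumer] == 0:
                ready.append(consumer)
    return values, fired

# Hypothetical graph: prod = (x + y) * (x - y)
graph = {
    "sum":  (lambda a, b: a + b, ["x", "y"]),
    "diff": (lambda a, b: a - b, ["x", "y"]),
    "prod": (lambda s, d: s * d, ["sum", "diff"]),
}
values, fired = run_dataflow(graph, {"x": 5, "y": 3})
print(values["prod"], fired)   # 16 ['sum', 'diff', 'prod']
```

Note that no node is ever told when to run or where its operands come from; "prod" simply becomes ready when both of its inputs exist, which is the sense in which explicit data transfer and synchronisation disappear from the program.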
4
Networks
- Network Topology (or shape)
  - Vital to efficient parallel algorithms
  - Communication is the limiting factor!
- Ideal: Cross-bar
  - Any-to-any
  - Non-blocking
    - Except two sources to same receiver
  - Realisable
    - But only for limited order (number of ports)
5
Networks
- Cross-bars
  - Achilles
    - 8 x 8
    - Full duplex: simultaneous input and output at each port
    - 32 bit data-path
    - Target: 1 Gbyte/second total throughput
  - but we needed the 3-D arrangement to achieve bandwidth and high order
6
Networks
- Cross-bars
  - Achilles
    - Hardware almost trivial!
      - Single FPGA on each level
    - Programmable
      - VHDL models
      - Several topologies
      - Just by changing the software!
7
Networks - More than 8 PEs
- Simple: use 2 8x8 routers!
- but … the link joining the two routers gets a lot of traffic!
8
Networks - Fat tree
- Problem: High-traffic links between PEs can become a bottleneck
- Solution: Fat-tree
  - Links higher up the tree are “fatter”
  - Sustainable bandwidth between all PEs is the same
9
Networks - Performance Metrics
- Metrics for comparing network topologies
  - Diameter
    - Maximum distance between any pair of nodes
    - Determines latency
  - Bisection Bandwidth
    - Aggregate bandwidth over any “cut” which divides the network in half
    - Determines throughput
- Crossbar
  - Diameter: 1
    - Every PE is directly connected to the router, so a single “hop” suffices
  - Bisection Bandwidth: b bytes/sec
    - b is the bandwidth of a single link
10
Networks - Performance Metrics
- Metrics for comparing network topologies
  - To connect n PEs with m x m crossbars
  - Single link bandwidth: b bytes/s
- Simple: n = 14 (2 switches)
  - Diameter: 3
  - Bisection Bandwidth: b
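The n = 14 figure follows if each 8 x 8 switch gives up one port to the link that joins the two switches; that layout is an assumption about the slide's diagram. Under that assumption, and treating every PE pair as equally likely to communicate, the sketch below counts how many pairs must share the single joining link. The function name `two_switch_stats` is hypothetical.

```python
from itertools import combinations

def two_switch_stats(m):
    """Two m x m crossbars joined by one link (assumed layout)."""
    pes_per_switch = m - 1            # one port per switch feeds the joining link
    n = 2 * pes_per_switch
    # Label each PE with the switch (0 or 1) it hangs off.
    switch_of = [0] * pes_per_switch + [1] * pes_per_switch
    pairs = list(combinations(range(n), 2))
    crossing = sum(1 for a, b in pairs if switch_of[a] != switch_of[b])
    return n, crossing, len(pairs)

n, crossing, total = two_switch_stats(8)
print(n, crossing, total)   # 14 49 91
```

So 49 of the 91 PE pairs, just over half, compete for one link of bandwidth b, which is why the bisection bandwidth stays at b however many PEs are attached.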
11
Networks - Performance Metrics
- Fat-tree
  - Diameter: 2 log_m n
    - Height is log_m n
    - Worst case distance - up and down
  - Bisection Bandwidth: b n/2 bytes/sec
    - Links are fatter higher up the tree
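As a quick check of the fat-tree figures, the following sketch evaluates them for a concrete size. It assumes n is an exact power of m (a full tree), which the slide does not state, and `fat_tree_metrics` is a name invented for this example.

```python
def fat_tree_metrics(n, m, b):
    """Height, diameter and bisection bandwidth of a fat-tree of m x m switches.

    Assumes n is an exact power of m (a full tree).
    """
    height, leaves = 0, 1
    while leaves < n:
        leaves *= m
        height += 1
    if leaves != n:
        raise ValueError("n must be a power of m for this sketch")
    diameter = 2 * height          # worst case: up to the root and back down
    bisection = b * n / 2          # bytes/sec across the cut at the top of the tree
    return height, diameter, bisection

print(fat_tree_metrics(64, 4, 1.0))   # (3, 6, 32.0)
```

For 64 PEs and 4-way switches the worst-case path is 6 hops, while the bisection bandwidth grows linearly with n rather than staying at b as in the two-switch arrangement.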
12
Networks - Performance Metrics
- Mesh
  - Diameter: 2√n - 2
  - Bisection Bandwidth: b√n bytes/sec
  - Order: 4
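The mesh figures can be checked by brute force. The sketch below assumes a square √n x √n mesh without wrap-around links and reads the slide's diameter and bisection bandwidth as 2√n - 2 and b√n; the function names are hypothetical.

```python
from collections import deque

def mesh_neighbours(k, node):
    """Neighbours of (row, col) in a k x k mesh without wrap-around."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < k and 0 <= nc < k:
            yield (nr, nc)

def mesh_diameter(k):
    """Longest shortest path, found by breadth-first search from every node."""
    worst = 0
    for start in ((r, c) for r in range(k) for c in range(k)):
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in mesh_neighbours(k, node):
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        worst = max(worst, max(dist.values()))
    return worst

def mesh_bisection_links(k):
    """Links cut when an even-sided mesh is split into left and right halves."""
    half = k // 2
    return sum(1 for r in range(k) for c in range(k)
               for (nr, nc) in mesh_neighbours(k, (r, c))
               if c == half - 1 and nc == half)

k = 4                                   # n = 16 PEs
print(mesh_diameter(k), 2 * k - 2)      # 6 6
print(mesh_bisection_links(k))          # 4
```

With n = 16 the search gives a diameter of 6 (corner to opposite corner) and a cut of 4 links, so a bisection bandwidth of 4b.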
13
Networks - Performance Metrics
- Hypercube
  - Hypercube of order m
    - Link 2 order m-1 hypercubes with 2^(m-1) links
  - Number of PEs: n = 2^m
  - Order: log_2 n = m
[Figures: order 2 hypercube; order 3 hypercube]
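The recursive construction (two order m-1 cubes joined by 2^(m-1) links) translates directly into code. This is a minimal sketch; `hypercube_links` and the choice to represent a link as a pair of PE indices are assumptions for illustration.

```python
def hypercube_links(m):
    """Build an order-m hypercube by joining two order-(m-1) cubes."""
    if m == 0:
        return set()                      # a single PE, no links
    lower = hypercube_links(m - 1)
    top = 1 << (m - 1)                    # the new leading bit
    # Second copy of the smaller cube, with the new bit set in every index.
    upper = {(a | top, b | top) for a, b in lower}
    # Join corresponding PEs of the two copies: 2**(m-1) new links.
    joins = {(i, i | top) for i in range(top)}
    return lower | upper | joins

links = hypercube_links(3)
print(len(links))        # 12
print(sorted(links))
```

An order 3 cube has 8 PEs and 12 links, and each PE ends up with exactly m = 3 links.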
14
Networks - Hypercubes
- Embedding property
  - In an n PE hypercube, we have hypercubes of size n/2, n/4, …
  - Number PEs with binary numbers: 000, 001, 010, 011, 100, …
  - Joining two hypercubes adds one binary digit to the numbering
  - Each PE is connected to every PE whose index differs in only one bit
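Because linked PEs differ in exactly one bit, a PE's neighbours are found by flipping each bit of its index in turn. The sketch below also shows one simple way to route, correcting the differing bits from lowest to highest; that routing rule is a common textbook scheme added here for illustration and does not come from the slides.

```python
def neighbours(pe, m):
    """PEs directly linked to `pe` in an order-m hypercube."""
    return [pe ^ (1 << bit) for bit in range(m)]

def route(src, dst, m):
    """One shortest path: fix the differing bits from lowest to highest."""
    path, here = [src], src
    for bit in range(m):
        if (here ^ dst) & (1 << bit):
            here ^= 1 << bit
            path.append(here)
    return path

print([format(p, "03b") for p in neighbours(0b000, 3)])    # ['001', '010', '100']
print([format(p, "03b") for p in route(0b000, 0b101, 3)])  # ['000', '001', '101']
```

The number of hops equals the number of bits in which the two indices differ, so the worst case in an order m cube is m hops.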
15
Networks - Hypercubes
- Embedding property
  - Partitioning tasks
    - Allocate to sub-cubes
    - Sub-tasks allocated to sub-cubes of that cube, etc.
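One way to picture the partitioning is to identify a sub-cube by the leading bits its PEs share. The sketch below hands a task tree to nested sub-cubes on that basis. It assumes tasks split two ways at each level, and the names `members` and `allocate` and the task names are hypothetical.

```python
def members(prefix, free_bits):
    """PE indices in the sub-cube whose high bits equal `prefix`."""
    return [(prefix << free_bits) | low for low in range(1 << free_bits)]

def allocate(task, prefix, free_bits):
    """Recursively hand a task tree to nested sub-cubes.

    task is (name, [subtasks]); a task with two subtasks gives each one
    half of its own cube. Returns {task name: list of PE indices}.
    """
    name, subtasks = task
    placement = {name: members(prefix, free_bits)}
    if len(subtasks) == 2 and free_bits > 0:
        for bit, sub in enumerate(subtasks):
            placement.update(allocate(sub, (prefix << 1) | bit, free_bits - 1))
    return placement

tree = ("sort", [("left", [("ll", []), ("lr", [])]),
                 ("right", [])])
for name, pes in allocate(tree, 0, 3).items():
    print(name, pes)
# sort [0, 1, 2, 3, 4, 5, 6, 7]
# left [0, 1, 2, 3]
# ll [0, 1]
# lr [2, 3]
# right [4, 5, 6, 7]
```

Every group of PEs produced this way is itself a hypercube, so each sub-task keeps the same short internal distances as the whole machine.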
16
Futures
17
VLIW - Very Long Instruction Word
- Instruction word: multiple operations
  - n RISC-style instructions
- Architecture: fixed set of functional units
  - Each FU matched to a “slot” in the instruction
18
VLIW - Very Long Instruction Word
- Compiler responsible for allocating instructions to words
  - Burden squarely on compiler
  - Needs to produce near optimal schedule
  - Inevitable: large number of empty slots!
    → Lower code density
- Similar to superscalar
  - but instruction issue flexibility missing
  - VLIW simpler, faster?
- Re-compilation needed
  - Each new generation will have different functional unit mix
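The empty-slot problem is easy to see with a toy packer. The sketch below is a simple greedy scheduler, not the algorithm of any real VLIW compiler; it assumes every operation takes one cycle, and the slot mix, operation names, and `pack_vliw` itself are invented for illustration.

```python
def pack_vliw(ops, slots):
    """Greedy packing of operations into VLIW words.

    ops:   list of (name, unit, [names it depends on]) in program order.
    slots: functional-unit type of each slot in the instruction word.
    Assumes every operation takes one cycle and names are unique.
    Returns a list of words, each a list with one entry per slot
    (None marks an empty slot, i.e. a NOP).
    """
    done = set()
    pending = list(ops)
    words = []
    while pending:
        word, issued = [], []
        for unit in slots:
            # First pending op for this unit whose inputs came from earlier words.
            pick = next((op for op in pending
                         if op not in issued and op[1] == unit
                         and all(d in done for d in op[2])), None)
            word.append(pick[0] if pick else None)
            if pick:
                issued.append(pick)
        if not issued:
            raise ValueError("no operation can issue: bad unit or dependency")
        for op in issued:
            pending.remove(op)
            done.add(op[0])
        words.append(word)
    return words

ops = [
    ("load a", "mem", []),
    ("load b", "mem", []),
    ("add",    "alu", ["load a", "load b"]),
    ("mul",    "fpu", ["add"]),
    ("store",  "mem", ["mul"]),
]
slots = ["alu", "alu", "mem", "fpu"]
words = pack_vliw(ops, slots)
for w in words:
    print(w)
empty = sum(entry is None for w in words for entry in w)
print(len(words), "words,", empty, "of", len(words) * len(slots), "slots empty")
# 5 words, 15 of 20 slots empty
```

With a single memory slot and a chain of dependencies, only one slot per word does useful work here, which is the code-density cost the slide refers to. A different functional-unit mix would need the schedule to be rebuilt, hence the re-compilation point.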
19
Synchronous Logic Systems
- Clock distribution
  - Major problem for chip architect
  - Clock skews < 100-200ps over whole die
    - 10% of cycle time
  - Small changes → Re-engineer whole chip
    - Checking for data hazards & logic races
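Reading the skew figure as a budget of roughly 10% of the cycle time (an interpretation of the slide, not something it spells out), the implied clock rates work out as follows:

```latex
T_{\text{cycle}} \approx \frac{t_{\text{skew}}}{0.10}
\quad\Rightarrow\quad
\frac{100\ \text{ps}}{0.10} = 1\ \text{ns}\ \ (f = 1\ \text{GHz}),
\qquad
\frac{200\ \text{ps}}{0.10} = 2\ \text{ns}\ \ (f = 500\ \text{MHz})
```

So the quoted skew limits correspond to clocks in the 500 MHz to 1 GHz range, and a faster clock tightens the budget in proportion.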
20
Synchronous Logic Systems
- Clock distribution
- Power consumption
  - Major problem @ 30W+ per chip
  - CMOS logic consumes power only on switch
  - but synch systems clock a lot of logic on every cycle
  - Clock is distributed to every subsystem
    - Even if the logic of the subsystem is disabled!
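The point about switching can be made with the usual first-order model of CMOS dynamic power, which is added here for illustration and does not appear on the slide:

```latex
P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f
```

Here $\alpha$ is the fraction of the capacitance $C$ that switches per cycle, $V_{dd}$ is the supply voltage, and $f$ is the clock frequency. Idle logic has $\alpha$ near zero, but the clock network itself switches on every cycle, so its term does not shrink when a subsystem has nothing to do.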
21
Synchronous Logic Systems
- Clock distribution
- Power consumption
- Worst case propagation delay
  - Determines maximum clock speed
  - Clock edge must wait until all logic has settled
  - Temperature and process fabrication → Even slower clocks
- Design is simpler
  - Logic designers have experience
  - Good tools
22
Asynchronous Logic Systems
- Clock distribution
  - No longer a problem
  - Synchronisation bundled with data
- Circuits are composable
  - No global clock … → No need to re-engineer a whole chip to change one section!
  - Known correct circuits can be combined
- Power consumption
  - Circuits switch only when they’re computing
    → Potentially very low power consumption
  - May be the biggest attraction of asynch systems!
23
Asynchronous Logic Systems
- Clock distribution problem removed
- Circuits are composable
- Power consumption
- Average case propagation delay
  - Completion signal generated when result is available
  - Independent of temperature and process fabrication
- Design is harder
  - Experience will remove this?
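The difference between worst-case and average-case timing can be illustrated with a toy calculation. The delay values below are made up for the example, the function names are hypothetical, and the model ignores the cost of the completion-signal handshake itself.

```python
def sync_time(delays_ns, margin=1):
    """Clocked design: every operation is given the worst-case delay,
    scaled by a safety margin for temperature and process variation."""
    return len(delays_ns) * max(delays_ns) * margin

def async_time(delays_ns):
    """Self-timed design: each operation ends when its completion signal fires."""
    return sum(delays_ns)

delays = [2, 3, 2, 8, 2]          # illustrative per-operation delays in ns
print(sync_time(delays))          # 40
print(async_time(delays))         # 17
```

One slow operation sets the pace for everything in the clocked case, while the self-timed total tracks the average, which is the advantage the slide claims for asynchronous logic.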
24
Laboratory 1.51
Practical Examinations will be held in this laboratory every afternoon from 1:50pm to 5:30pm next week, June 1 to June 5. The laboratory will be closed to everyone except those in CT105/CLP110 actually taking the exams during these times. Please consider the students taking the exam by not disturbing them in any way.