1
EE382C Final Project: Crouching Tiger, Hidden Dragonfly
Alexander Neckar, Camilo Moreno, Matthew Murray, Ziyad Abdel Khaleq
2
Outline
Topology, considerations, and layout
Routing solution
Mirroring and simulation
Results and conclusion
3
Dragonfly Topology
Fully-connected local groups
Low hop count
Fast access to global links
4
Dragonfly Topology: Load balance
Endpoints per router >= global links per router
  ~All traffic is bound for other groups; the bandwidth should fit.
Local links per router >= endpoints + global links
  ~All traffic traverses a local link before and after its global hop.
Adaptive routing helps deal with adversarial traffic, as long as overall bandwidth is sufficient and we have good backpressure. (The balance rules are restated compactly below.)
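In the p/a/h notation used later in the deck ($p$ endpoints per router, $a$ routers per group, $h$ global links per router), and assuming fully connected groups so each router has $a-1$ local links, the two balance rules above read:

    $p \ge h$              (endpoint bandwidth can fill the global links)
    $a - 1 \ge p + h$      (local links carry each flow before and after its global hop)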
5
Considerations: Cost
Optical links drive cost
  Minimize their number; keep utilization good
Local links are much cheaper
  Overprovisioning them helps feed the global links
Physical layout limits fully-connected group size (5m electrical cables)
6
Considerations: Power and Traffic
Links dominate power
Traffic is mostly limited in throughput by the send window (RPC);
  some traffic (RDMA) uses very large packets; there are hotspots.
So... what?
7
Layout Considerations
Maybe as many as 60 racks per group!
8
Layout Considerations
Realistically, 34ish
9
Layout Considerations
Maximize racks per group? Routers in the bottom slots, wired diagonally
  Actually not a constraint
Balance/cost issues with very large groups
100m optical cables over a ~70m-square floor:
  147 x 50 racks: >200K rack slots
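Sanity-checking the floor arithmetic (pure arithmetic from the figures quoted above):

    $147 \times 50 = 7350$ racks, and $200{,}000 / 7350 \approx 27$

so the ">200K rack slots" figure implies roughly 27-28 usable slots per rack.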
10
Chips
Channels: 5 GB/s = 4 differential pairs
  1 optical cable, or 4 electrical cable pairs, in each direction
Chip size is perimeter-driven
  Buffers + crossbar are only a few mm²
  High radix requires a large perimeter for I/O
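A rough pin-count check of why the perimeter dominates (the radix-51 figure is taken from the 13x26x13 design on a later slide):

    $51 \times 4 \times 2 = 408$ differential pairs, i.e. 816 signal pins,

for channel I/O alone, while the buffers and crossbar occupy only a few mm² of die.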
11
Exploring options
Lots of guesstimation!
12
Basic: >114K nodes, balanced for uniform random
Topology: 13 x 26 x 13 (p x a x h)
Cost: 6.16M
Power: 68 kW
Router radix: 51
Optical links: 57,291
Electrical links: 110,175
Groups: 339
Endpoints/group: 338
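All of the derived counts on this and the next three slides follow from the fully-connected Dragonfly structure. A minimal sketch that reproduces them from (p, a, h); the struct and function names are ours, and cost/power are omitted because they depend on per-link price and power assumptions not restated here:

    #include <cstdio>

    // Derived quantities for a fully-connected Dragonfly with
    // p endpoints/router, a routers/group, h global links/router.
    struct DragonflyCounts {
        int radix;             // p + (a-1) local + h global ports
        long groups;           // a*h global ports per group => a*h + 1 groups
        long endpoints_group;  // p * a
        long nodes;            // groups * endpoints per group
        long optical_links;    // one global link per pair of groups (clique)
        long electrical_links; // a*(a-1)/2 local links per group
    };

    DragonflyCounts dragonfly_counts(int p, int a, int h) {
        DragonflyCounts c;
        c.radix = p + (a - 1) + h;
        c.groups = (long)a * h + 1;        // fully connected at group level
        c.endpoints_group = (long)p * a;
        c.nodes = c.groups * c.endpoints_group;
        c.optical_links = c.groups * (c.groups - 1) / 2;
        c.electrical_links = c.groups * ((long)a * (a - 1) / 2);
        return c;
    }

    int main() {
        // The four designs explored in this deck.
        int designs[4][3] = {{13, 26, 13}, {10, 32, 10}, {10, 34, 9}, {10, 45, 5}};
        for (auto& d : designs) {
            DragonflyCounts c = dragonfly_counts(d[0], d[1], d[2]);
            std::printf("%dx%dx%d: radix %d, groups %ld, endpts/group %ld, "
                        "nodes %ld, opt %ld, elec %ld\n",
                        d[0], d[1], d[2], c.radix, c.groups, c.endpoints_group,
                        c.nodes, c.optical_links, c.electrical_links);
        }
        return 0;
    }

Running this reproduces the router radix, group, endpoint, and optical/electrical link counts quoted for all four designs.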
13
Cheaper, better?
Fewer optical cables; overprovisioned in-group links
4% higher power
Topology: 10 x 32 x 10
Cost: 5.64M
Power: 70.7 kW
Router radix: 51
Optical links: 51,360
Electrical links: 159,216
Groups: 321
Endpoints/group: 320
14
A little more savings
90% of the normal global links; overprovisioned in-group links
Even cheaper. Any good?
Topology: 10 x 34 x 9
Cost: 5.22M
Power: 70.5 kW
Router radix: 52
Optical links: 46,971
Electrical links: 172,227
Groups: 307
Endpoints/group: 340
15
What if...?
Half the “necessary” global links; very overprovisioned in-group links
  (otherwise we couldn't reach 100K nodes)
Almost half the price!
Topology: 10 x 45 x 5
Cost: 3.11M
Power: 65.9 kW
Router radix: 59
Optical links: 25,425
Electrical links: 223,740
Groups: 226
Endpoints/group: 450
16
Improving Global Adaptive Routing
I feel the need…the need for speed.
17
Challenges
Quick congestion detection
Quick and accurate return to minimal routing
Tricks with credits, etc., can provide stiff backpressure
How do we avoid incorrectly taking the non-minimal route?
18
Solution idea
Use the rate of change of the queue to provide quick congestion detection and a quick return to minimal routing
Potential advantages:
  More accurate representation of network performance
  Rapid detection
Potential problems:
  Sensitivity to burstiness
19
Our Work
The queue's rate of change (ROC) is smoothed with an exponential moving average:
  ROC = 0.99 * prev_ROC + 0.01 * cur_ROC
Developed two new routing algorithms (sketched below):
  ROC:   take the minimal route if min_queue_rate < 2 * nonmin_queue_rate || min_queue_rate < 0
  Combo: the old algorithm's condition || min_queue_rate < 0
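A minimal sketch of these decisions as we read them from this slide; the names (RocState, route_minimal_roc, route_minimal_combo) are ours, and the baseline test in Combo stands in for whatever queue-length comparison the existing global adaptive router uses:

    // Exponentially smoothed rate of change (ROC) of an output queue's depth.
    struct RocState {
        double roc = 0.0;    // smoothed rate of change (negative = draining)
        int prev_depth = 0;  // queue depth at the last sample
    };

    // Sample the queue each cycle and update the EWMA per the rule above:
    // ROC = 0.99 * prev_ROC + 0.01 * cur_ROC.
    inline void update_roc(RocState& s, int cur_depth) {
        double cur_roc = cur_depth - s.prev_depth;  // instantaneous change
        s.roc = 0.99 * s.roc + 0.01 * cur_roc;
        s.prev_depth = cur_depth;
    }

    // ROC algorithm: stay minimal while the minimal queue grows slower than
    // twice the non-minimal one, or is outright draining.
    inline bool route_minimal_roc(double min_rate, double nonmin_rate) {
        return min_rate < 2.0 * nonmin_rate || min_rate < 0.0;
    }

    // Combo algorithm: the old algorithm's test, OR'd with the
    // "minimal queue is draining" escape hatch.
    inline bool route_minimal_combo(bool old_algo_says_minimal, double min_rate) {
        return old_algo_says_minimal || min_rate < 0.0;
    }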
20
Results
1024 nodes, 2*p = 2*h = a = 8
Uniform injection: 2% increase in average latency, 5% increase in max, for both ROC and Combo
Bad_dragon: ROC = 69% of the old algorithm's average latency, 82% of its max;
            Combo = 72% average, 90% max
21
Bad Dragon Results
22
Simulation Challenge
Booksim's cycle-accurate nature is at odds with simulating our very large system: std::bad_alloc...
23
Solution: Slicing
Do a fraction of the work and get all of the results!
How do we leave components out of our simulation and still effectively simulate the entire network?
24
Slicing idea 1: Scaledown
A = 8, H = 2
25
Idea: Relationships
26
Forget about hotspots for a minute...
27
Slicing Idea 2: Mirroring
28
Routing
29
Mirroring with Hotspots
30
Results for Different Topologies
p / a / h notation:
  p: endpoints per switch
  a: switches per group
  h: global links per switch
100,000 nodes with “Project Traffic”
Best from a million cycles
31
Simulation Results For 13 / 26 / 13
32
Simulation Results For 10 / 32 / 10
33
Simulation Results For 10 / 32 / 10 WITH 10 Hotspots
34
Other Simulation Results
16 / 28 / 8: runtime 4,130,224 cycles; average latency (too big)
10 / 45 / 5 (half global links): runtime 4,190,192 cycles; average latency
35
Conclusion
ROC always wins in average latency and runtime cycles.
At a small cost of additional power (4%) over the basic 13 / 26 / 13, we get higher performance more cheaply with the 10 / 32 / 10 topology.
The simulated hotspot scenario is pessimistic; the numbers are fine.
36
Questions