EE382C Final Project Crouching Tiger, Hidden Dragonfly

1 EE382C Final Project Crouching Tiger, Hidden Dragonfly
Alexander Neckar Camilo Moreno Matthew Murray Ziyad Abdel Khaleq

2 Outline
- Topology: considerations and layout
- Routing solution
- Mirroring and simulation
- Results and conclusion

3 Dragonfly Topology
- Fully-connected local groups
- Low hop count
- Fast access to global links

4 Dragonfly Topology: Load Balance
- Global links per router must keep up with endpoints per router: ~all traffic is bound for other groups, so global bandwidth should fit injection bandwidth.
- Local links per router >= endpoints + global links: ~all traffic traverses a local link before and after its global hop.
- Adaptive routing helps deal with adversarial traffic, as long as overall bandwidth is sufficient and we have good backpressure.
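As a sanity check, the two balance conditions can be sketched in a few lines. This is our framing, not the authors' code: it assumes all links carry equal bandwidth and counts a router's group-local connectivity as a, so the canonical dragonfly balance a = 2p = 2h sits exactly at equality.

```python
def is_balanced(p, a, h):
    """Dragonfly load-balance check (sketch; assumes equal link bandwidths).

    p: endpoints per router, a: routers per group, h: global links per router.
    """
    # ~All traffic is bound for other groups, so per-router global
    # bandwidth must cover per-router injection bandwidth.
    global_ok = h >= p
    # ~All traffic crosses a local link before and after its global hop,
    # so group-local connectivity must cover endpoint + global load.
    local_ok = a >= p + h
    return global_ok and local_ok

print(is_balanced(13, 26, 13))  # the "basic" balanced topology -> True
print(is_balanced(10, 45, 5))   # half the global links -> False
```

Under this check, the later 10x34x9 and 10x45x5 configurations deliberately fail the global-link condition: they trade global bandwidth for cost.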

5 Considerations: Cost
- Optical links drive cost: minimize their number and keep utilization high.
- Local links are much cheaper; overprovisioning them helps feed the global links.
- Physical layout limits fully-connected group size (5 m cables).

6 Considerations: Power and Traffic
- Links dominate power.
- Traffic is mostly throughput-limited by the RPC send window; some RDMA flows carry very large packets; hotspots occur.
- So... what does this mean for the design?

7 Layout Considerations
Maybe as many as 60 racks per group!

8 Layout Considerations
Realistically, about 34.

9 Layout Considerations
- Maximize racks per group? Put routers in the bottom slots and wire diagonally. In practice cable length is not the binding constraint; very large groups raise balance and cost issues instead.
- With 100 m optical cables, a ~70 m square floor holds 147 x 50 racks: >200K rack slots.

10 Chips
- Channels: 5 GB/s = 4 differential pairs; 1 optical cable or 4 electrical cable pairs in each direction.
- Chip size is perimeter-driven: buffers + crossbar take only a few mm^2, but high radix requires a large perimeter for I/O.

11 Exploring options Lots of guesstimation!

12 Basic
>114k nodes, balanced for uniform random.
Topology (p x a x h): 13 x 26 x 13
Cost: $6.16M
Power: 68 kW
Router radix: 51
Optical links: 57,291
Electrical links: 110,175
Groups: 339
Endpoints/group: 338

13 Cheaper, better?
Fewer optical cables; overprovisioned in-group links; 4% higher power.
Topology (p x a x h): 10 x 32 x 10
Cost: $5.64M
Power: 70.7 kW
Router radix: 51
Optical links: 51,360
Electrical links: 159,216
Groups: 321
Endpoints/group: 320

14 A little more savings
90% of the nominal global links; overprovisioned in-group links; even cheaper. Any good?
Topology (p x a x h): 10 x 34 x 9
Cost: $5.22M
Power: 70.5 kW
Router radix: 52
Optical links: 46,971
Electrical links: 172,227
Groups: 307
Endpoints/group: 340

15 What if...?
Half the "necessary" global links; very overprovisioned in-group links (needed to still reach ~100K nodes). Almost half the price!
Topology (p x a x h): 10 x 45 x 5
Cost: $3.11M
Power: 65.9 kW
Router radix: 59
Optical links: 25,425
Electrical links: 223,740
Groups: 226
Endpoints/group: 450
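The four candidate topologies are all derived from the three parameters p, a, h. A small sketch of ours (assuming the maximum-size dragonfly: g = a*h + 1 fully connected groups with exactly one global link between every pair of groups) reproduces the slides' figures:

```python
def dragonfly_stats(p, a, h):
    """Metrics for a p x a x h dragonfly at maximum size (a*h + 1 groups).

    p: endpoints per router, a: routers per group, h: global links per router.
    """
    groups = a * h + 1                            # one global link to every other group
    radix = p + (a - 1) + h                       # endpoint + local + global ports
    endpoints_per_group = p * a
    nodes = endpoints_per_group * groups
    optical_links = groups * a * h // 2           # each group pair shares one global link
    electrical_links = groups * a * (a - 1) // 2  # groups are fully connected internally
    return {"groups": groups, "radix": radix,
            "endpoints_per_group": endpoints_per_group, "nodes": nodes,
            "optical_links": optical_links, "electrical_links": electrical_links}

# The "basic" 13 x 26 x 13 slide: radix 51, 339 groups, 338 endpoints/group,
# 57,291 optical links, 110,175 electrical links, 114,582 (>114k) nodes.
print(dragonfly_stats(13, 26, 13))
```

The same function recovers the 10x32x10, 10x34x9, and 10x45x5 numbers, so only cost and power require extra per-link pricing assumptions.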

16 Improving Global Adaptive Routing
I feel the need…the need for speed.

17 Challenges
- Quick congestion detection.
- Quick and accurate return to minimal routing.
- Tricks with credits, etc., can provide stiff backpressure.
- How do we avoid incorrectly taking the non-minimal route?

18 Solution idea
Use the rate of change of the queue depth for quick congestion detection and a quick return to minimal routing.
Potential advantages: more accurate representation of network performance; rapid detection.
Potential problem: sensitivity to burstiness.

19 Our Work
Developed two new routing algorithms, both based on an exponentially weighted rate of change (ROC) of queue depth:
ROC = 0.99 * prev_ROC + 0.01 * cur_ROC
- ROC: route minimally if min_queue_rate < 2 * nonmin_queue_rate || min_queue_rate < 0
- Combo: old algorithm's minimal condition || min_queue_rate < 0
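The EWMA update and the routing condition come directly from this slide; everything else here (the class and function names, sampling queue depth once per cycle) is our hypothetical framing of how it might look, not BookSim's actual interface:

```python
class RocEstimator:
    """Exponentially weighted rate of change (ROC) of a queue's depth.

    Implements the slide's update rule: ROC = 0.99*prev_ROC + 0.01*cur_ROC.
    """

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.roc = 0.0
        self.prev_depth = 0

    def update(self, queue_depth):
        cur_roc = queue_depth - self.prev_depth   # instantaneous rate of change
        self.prev_depth = queue_depth
        self.roc = (1.0 - self.alpha) * self.roc + self.alpha * cur_roc
        return self.roc


def route_minimal(min_rate, nonmin_rate):
    """ROC algorithm's decision: stay minimal while the minimal queue grows
    less than twice as fast as the non-minimal one, or is already draining."""
    return min_rate < 2 * nonmin_rate or min_rate < 0
```

The combo variant keeps the old queue-depth comparison but likewise snaps back to minimal routing whenever the minimal queue's ROC goes negative.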

20 Results
1024 nodes, 2p = 2h = a = 8, swept over injection rate.
Uniform: 2% increase in average latency and 5% increase in max latency for both ROC and combo.
Bad_dragon: ROC = 69% of baseline average latency, 82% of max; combo = 72% average, 90% max.

21 Bad Dragon Results

22 Simulation Challenge
BookSim's cycle-accurate nature is at odds with simulating our very large system: std::bad_alloc...

23 Solution: Slicing
Do a fraction of the work and get all of the results! How can we leave components out of the simulation and still effectively simulate the entire network?

24 Slicing idea 1: Scaledown
A = 8, H = 2

25 Idea: Relationships

26 Forget about hotspots for a minute...

27 Slicing Idea 2: Mirroring

28 Routing

29 Mirroring with Hotspots

30 Results for Different Topologies
Notation p/a/h: p = endpoints per switch; a = switches per group; h = global links per switch.
100,000 nodes with "Project Traffic"; best results taken from million-cycle runs.

31 Simulation Results For 13 / 26 / 13

32 Simulation Results For 10 / 32 / 10

33 Simulation Results For 10 / 32 / 10 WITH 10 Hotspots

34 Other Simulation Results
16 / 28 / 8: runtime 4,130,224 cycles; average latency (too big).
10 / 45 / 5 (half the global links): runtime 4,190,192 cycles; average latency

35 Conclusion
- ROC always wins in average latency and runtime cycles.
- For a small additional power cost (4%) over the basic 13 / 26 / 13, the 10 / 32 / 10 topology delivers higher performance at lower cost.
- The simulated hotspot scenario is pessimistic, and even its numbers are fine.

36 Questions

