
1 University of Colorado at Boulder Core Research Lab
FastForward for Efficient Pipeline Parallelism: A Cache-Optimized Concurrent Lock-Free Queue
John Giacomoni, Tipp Moseley, and Manish Vachharajani
University of Colorado at Boulder
2008.02.21

2 Why? Why Pipelines?
Multicore systems are the future
Many apps can be pipelined if the granularity is fine enough
– < 1 µs
– 3.5 x interrupt handler

3 Fine-Grain Pipelining Examples
Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)

4 Network Processing Scenarios
Link       Mbps        fps           ns/frame
T-1        1.5         2,941         340,000
T-3        45.0        90,909        11,000
OC-3       155.0       333,333       3,000
OC-12      622.0       1,219,512     820
GigE       1,000.0     1,488,095     672
OC-48      2,500.0     5,000,000     200
10 GigE    10,000.0    14,925,373    67
OC-192     9,500.0     19,697,843    51
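The ns/frame column follows from the fps column: it is the time available to process one frame, 10^9 / fps. A minimal sketch of that arithmetic is below; the function names `ns_per_frame` and `print_frame_budgets` are illustrative and do not come from the slides.

```c
#include <stdio.h>

/* Time budget per frame, in nanoseconds, for a given frame rate.
 * ns_per_frame is an illustrative name, not taken from the slides. */
double ns_per_frame(double frames_per_second)
{
    return 1e9 / frames_per_second;
}

/* Print the budget for two rows of the table: GigE and OC-48. */
void print_frame_budgets(void)
{
    printf("GigE:  %.0f ns/frame\n", ns_per_frame(1488095.0));
    printf("OC-48: %.0f ns/frame\n", ns_per_frame(5000000.0));
}
```

At GigE rates the budget is about 672 ns per frame, so every nanosecond spent handing a frame from one pipeline stage to the next comes directly out of that budget.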

5 Core-Placements
4x4 NUMA Organization (ex: AMD Opteron Barcelona)
[Figure: placements of the pipeline stages IP, APP, OP, Dec, and Enc on cores]

6 Example 3 Stage Pipeline

7 Example 3 Stage Pipeline

8 Communication Overhead

9 Communication Overhead Locks 320ns GigE

10 Communication Overhead Locks 320ns GigE Lamport 160ns

11 Communication Overhead Locks 320ns Lamport 160ns Hardware 10ns GigE

12 Communication Overhead Locks 320ns Lamport 160ns Hardware 10ns FastForward 28ns GigE

13 More Fine-Grain Pipelining Examples
Network processing:
– Intrusion detection (NID)
– Traffic filtering (e.g., P2P filtering)
– Traffic shaping (e.g., packet prioritization)
Signal Processing
– Media transcoding/encoding/decoding
– Software Defined Radios
Encryption
– Counter-Mode AES
Other Domains
– Fine-grain kernels extracted from sequential applications

14 FastForward
Cache-optimized point-to-point CLF queue
1. Fast
2. Robust against unbalanced stages
3. Hides die-die communication
4. Works with strong to weak memory consistency models

15 Lamport's CLF Queue (1)
    lamp_enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};    /* spin while the queue is full */
        buf[head] = data;
        head = NH;
    }

    lamp_dequeue(*data) {
        while (head == tail) {}   /* spin while the queue is empty */
        *data = buf[tail];
        tail = NEXT(tail);
    }
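The slide's pseudocode can be sketched as compilable C11 under a few assumptions: a fixed capacity of 8 slots, C11 atomics for the shared indices, and non-blocking "try" variants that return false where the slide spins. The names `lamport_queue`, `lq_init`, `lq_try_enqueue`, and `lq_try_dequeue` are hypothetical and chosen for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define LQ_SIZE 8                         /* illustrative capacity */
#define LQ_NEXT(i) (((i) + 1) % LQ_SIZE)

typedef struct {
    _Atomic size_t head;   /* next slot to write; stored only by the producer */
    _Atomic size_t tail;   /* next slot to read; stored only by the consumer */
    void *buf[LQ_SIZE];
} lamport_queue;

void lq_init(lamport_queue *q)
{
    atomic_init(&q->head, 0);
    atomic_init(&q->tail, 0);
}

/* Returns false when the queue is full (the slide spins instead). */
bool lq_try_enqueue(lamport_queue *q, void *data)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t next = LQ_NEXT(head);
    /* Reading tail touches a variable the consumer keeps writing. */
    if (next == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;
    q->buf[head] = data;
    /* Release: the slot write becomes visible before the new head. */
    atomic_store_explicit(&q->head, next, memory_order_release);
    return true;
}

/* Returns false when the queue is empty (the slide spins instead). */
bool lq_try_dequeue(lamport_queue *q, void **data)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    /* Reading head touches a variable the producer keeps writing. */
    if (tail == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;
    *data = q->buf[tail];
    atomic_store_explicit(&q->tail, LQ_NEXT(tail), memory_order_release);
    return true;
}
```

Because one slot is always left unused to distinguish full from empty, this sketch holds at most LQ_SIZE - 1 items. Both operations read an index that the other thread writes, which is the source of the sharing discussed on the following slides.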

16 Lamport's CLF Queue (2)
    lamp_enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }
[Figure: head, tail, and buffer entries buf[0] ... buf[n]]

17 AMD Opteron Cache Example

18 Lamport's CLF Queue (2)
    lamp_enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }
[Figure: head, tail, and buffer entries buf[0] ... buf[n]]
Observe the mandatory cacheline ping-ponging for each enqueue and dequeue operation.

19 Lamport's CLF Queue (3)
    lamp_enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }
[Figure: head, tail, and buffer entries buf[0] ... buf[n]]
Observe how cachelines will still ping-pong.
What if the head/tail comparison was eliminated?

20 FastForward CLF Queue (1)
    lamp_enqueue(data) {
        NH = NEXT(head);
        while (NH == tail) {};
        buf[head] = data;
        head = NH;
    }

    ff_enqueue(data) {
        while(0 != buf[head]);    /* spin until the slot is empty */
        buf[head] = data;
        head = NEXT(head);
    }
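A compilable sketch of the FastForward idea follows, under stated assumptions. The slides show only `ff_enqueue`; the dequeue side here is inferred by symmetry (a NULL slot means "empty", and the consumer writes NULL back after reading). The capacity, the 64-byte cache line size, the C11 atomics, the non-blocking "try" form, and all names (`ff_queue`, `ff_init`, `ff_try_enqueue`, `ff_try_dequeue`) are hypothetical choices for illustration.

```c
#include <stdalign.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FF_SIZE 8                         /* illustrative capacity */
#define FF_CACHE_LINE 64                  /* assumed cache line size in bytes */
#define FF_NEXT(i) (((i) + 1) % FF_SIZE)

typedef struct {
    alignas(FF_CACHE_LINE) size_t head;   /* touched only by the producer */
    alignas(FF_CACHE_LINE) size_t tail;   /* touched only by the consumer */
    alignas(FF_CACHE_LINE) _Atomic(void *) buf[FF_SIZE]; /* NULL = empty slot */
} ff_queue;

void ff_init(ff_queue *q)
{
    q->head = 0;
    q->tail = 0;
    for (size_t i = 0; i < FF_SIZE; i++)
        atomic_init(&q->buf[i], NULL);
}

/* Returns false if the slot at head is still occupied (queue full).
 * NULL is reserved as the empty marker, so it cannot be enqueued. */
bool ff_try_enqueue(ff_queue *q, void *data)
{
    if (data == NULL)
        return false;
    if (atomic_load_explicit(&q->buf[q->head], memory_order_acquire) != NULL)
        return false;
    atomic_store_explicit(&q->buf[q->head], data, memory_order_release);
    q->head = FF_NEXT(q->head);           /* private index, never shared */
    return true;
}

/* Returns false if the slot at tail is empty (queue empty). */
bool ff_try_dequeue(ff_queue *q, void **data)
{
    void *item = atomic_load_explicit(&q->buf[q->tail], memory_order_acquire);
    if (item == NULL)
        return false;
    /* Writing NULL back hands the slot to the producer again. */
    atomic_store_explicit(&q->buf[q->tail], NULL, memory_order_release);
    q->tail = FF_NEXT(q->tail);           /* private index, never shared */
    *data = item;
    return true;
}
```

In this sketch the full/empty test is carried by the slot contents themselves, so neither side ever reads the other side's index, and all FF_SIZE slots are usable.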

21 FastForward CLF Queue (2)
    ff_enqueue(data) {
        while(0 != buf[head]);
        buf[head] = data;
        head = NEXT(head);
    }
[Figure: head, tail, and buffer entries buf[0] ... buf[n]]
Observe how head/tail cachelines will NOT ping-pong.
BUT, buf will still cause the cachelines to ping-pong.

22 FastForward CLF Queue (3)
    ff_enqueue(data) {
        while(0 != buf[head]);
        buf[head] = data;
        head = NEXT(head);
    }
[Figure: head, tail, and buffer entries buf[0] ... buf[n]]
Solution: Temporally slip stages by a cacheline.
N:1 reduction in coherence misses per stage.

23 Slip Timing

24 Slip Timing Lost

25 Maintaining Slip (Concepts)
Use distance as the quality metric
– Explicitly compare head/tail
– Causes cache ping-ponging
– Perform rarely

26 Maintaining Slip (Method)
    adjust_slip() {
        dist = distance(producer, consumer);
        if (dist < *Danger*) {
            dist_old = 0;
            do {
                dist_old = dist;
                spin_wait(avg_stage_time * (*OK* - dist));
                dist = distance(producer, consumer);
            } while (dist < *OK* && dist > dist_old);
        }
    }
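The slip-maintenance routine can be sketched in C11 as below. Everything beyond the slide's outline is an assumption: the threshold values, the `slip_state` struct that publishes both positions, the wall-clock busy-wait, and the names `ring_distance`, `spin_wait_ns`, and `adjust_slip`. The loop condition used here (keep waiting while the distance is still below OK and still growing) is also an assumed reading of the slide. For testability the sketch returns the number of waits performed, which the slide's version does not do.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <time.h>

#define SLIP_DANGER 2   /* hypothetical threshold: stages are too close */
#define SLIP_OK     6   /* hypothetical threshold: slip is restored */

typedef struct {
    _Atomic size_t head;   /* producer position */
    _Atomic size_t tail;   /* consumer position */
    size_t size;           /* number of slots in the ring */
} slip_state;

/* Entries the producer is ahead of the consumer, with wrap-around. */
size_t ring_distance(size_t head, size_t tail, size_t size)
{
    return (head + size - tail) % size;
}

/* Busy-wait for roughly ns nanoseconds using the C11 wall clock. */
static void spin_wait_ns(long long ns)
{
    struct timespec start, now;
    timespec_get(&start, TIME_UTC);
    do {
        timespec_get(&now, TIME_UTC);
    } while ((long long)(now.tv_sec - start.tv_sec) * 1000000000LL
             + (now.tv_nsec - start.tv_nsec) < ns);
}

/* Called rarely by the consumer: reading both positions is exactly
 * the sharing that the queue otherwise avoids. */
int adjust_slip(slip_state *s, long long avg_stage_time_ns)
{
    int waits = 0;
    size_t dist = ring_distance(atomic_load(&s->head),
                                atomic_load(&s->tail), s->size);
    if (dist < SLIP_DANGER) {
        size_t dist_old;
        do {
            dist_old = dist;
            /* Wait about as long as the producer needs to reach OK. */
            spin_wait_ns(avg_stage_time_ns * (long long)(SLIP_OK - dist));
            waits++;
            dist = ring_distance(atomic_load(&s->head),
                                 atomic_load(&s->tail), s->size);
        } while (dist < SLIP_OK && dist > dist_old);
    }
    return waits;
}
```

The loop gives up as soon as the distance stops growing, so a stalled producer cannot trap the consumer in the wait.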

27 Comparative Performance
[Panels: Lamport; FastForward]

28 Thrashing and Auto-Balancing
[Panels: FastForward (Thrashing); FastForward (Balanced)]

29 Cache Verification
[Panels: FastForward (Thrashing); FastForward (Balanced)]

30 On/Off Die Communications
[Figure labels: On-die communication; Off-die communication]

31 On/Off-die Performance
[Panels: FastForward (On-Die); FastForward (Off-Die)]

32 Proven Property
In the program order of the consumer, the consumer dequeues values in the same order that they were enqueued in the producer's program order.

33 Work in Progress
Operating Systems
– 27.5 ns/op, 3.1 % cost reduction vs. reported 28.5 ns
– Reduced jitter
Applications
– 128-bit AES encrypting filter
  Ethernet layer encryption at 1.45 mfps
  IP layer encryption at 1.51 mfps
  ~10 lines of code for each.

34 Gazing into the Crystal Ball Locks 320ns Lamport 160ns Hardware 10ns FastForward 28ns GigE

35 Shared Memory Accelerated Queues Now Available! http://ce.colorado.edu/core Questions? john.giacomoni@colorado.edu

