
1 Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Dr. Abdul Waheed

2 Course Outline
- Introduction
- Multi-threading on multi-core processors
- Applications for multi-core processors
- Application layer computing on multi-core
- Performance measurement and tuning
  - Architecture and application level performance evaluation methodology
  - Case study: sniffer performance tuning

3 Agenda for Today
- Multi-core system performance and scalability
  - Methodology
  - Profiling tools
  - Benchmarking
- Benchmarking and profiling: a case study
- Multi-threaded sniffer performance tuning
  - Single queue based design
  - Lock-free design

4 Multi-Core Performance Measurement Methodology Credits for slides: Dr. Rodric Rabbah, IBM

5 Keys to Parallel Performance
- Coverage: the extent of parallelism in the algorithm, bounded by Amdahl's law (formula below)
- Granularity: the size of the workload partition given to each core; affects communication cost and load balance
- Locality: where computation happens relative to the data it communicates with
  - Inter-processor communication is non-local
  - Processor-memory communication is local
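Amdahl's law, referenced above, bounds the achievable speedup by the serial fraction of the program. In its standard form, with p the parallelizable fraction of execution time and N the number of cores:

    speedup(N) = 1 / ((1 - p) + p / N)

Even with unlimited cores the speedup cannot exceed 1 / (1 - p); for example, p = 0.9 caps the speedup at 10x no matter how many cores are added.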

6 Communication Cost Model

7 Overlapping Communication with Computation

8 Limits in Pipelining Communication
- The computation-to-communication ratio limits the performance gains from pipelining
- Where else can we look for performance?

9 Lessons from Uniprocessors
- In uniprocessors, the CPU communicates with memory
- Loads and stores are to uniprocessors what GET and PUT are to distributed-memory multiprocessors
- How is communication overlapped with computation in uniprocessors?
  - Spatial locality
  - Temporal locality

10 Spatial Locality
- CPU asks for data at address 1000
- Memory sends the data at addresses 1000 ... 1064
- The amount of data sent depends on architecture parameters such as the cache block size
- Works well if the CPU actually ends up using the data at 1001, 1002, ..., 1064 (sketch below)
- Otherwise bandwidth and cache capacity are wasted
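To make the point concrete (this example is not from the original slides), a minimal C sketch contrasting a cache-line-friendly sequential walk with a strided walk; the 16-int stride assumes a 64-byte cache line.

    #include <stddef.h>

    #define N (1 << 20)

    /* Sequential walk: consecutive elements share cache lines, so most
     * accesses hit in cache after the first miss per line. */
    long sum_sequential(const int *a)
    {
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }

    /* Strided walk (stride = 16 ints = one 64-byte line): every access
     * touches a new cache line, so the fetched neighbours are wasted. */
    long sum_strided(const int *a)
    {
        long sum = 0;
        for (size_t s = 0; s < 16; s++)
            for (size_t i = s; i < N; i += 16)
                sum += a[i];
        return sum;
    }

Both functions do the same arithmetic, but the strided version pays roughly one cache miss per element instead of one per line.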

11 Temporal Locality
- Main memory access is expensive
- The memory hierarchy adds small but fast memories (caches) near the CPU
- Memories get bigger as the distance from the CPU increases
- CPU asks for data at address 1000
- The memory hierarchy anticipates more accesses to the same address and keeps a local copy
- Works well if the CPU actually ends up using the data at 1000 over and over (sketch below)
- Otherwise cache capacity is wasted
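A minimal C sketch of temporal locality (illustrative, not from the slides): the small weights[] array is re-read on every iteration of the outer loop, so after the first pass it is served from cache rather than main memory.

    #include <stddef.h>

    #define ROWS 1024
    #define COLS 1024

    /* weights[] (8 KB) is reused ROWS times; because it is small and hot,
     * it stays resident in cache (temporal locality), while each row of
     * data[] is streamed through once. */
    void weighted_rows(const double data[ROWS][COLS],
                       const double weights[COLS],
                       double out[ROWS])
    {
        for (size_t r = 0; r < ROWS; r++) {
            double acc = 0.0;
            for (size_t c = 0; c < COLS; c++)
                acc += data[r][c] * weights[c];
            out[r] = acc;
        }
    }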

12 Single Thread Performance
- Tasks are mapped to execution units (threads)
- Threads run on individual processors (cores)
- Two keys to faster execution:
  - Load balance the work among the processors
  - Make execution on each processor faster

13 Understanding Performance
- We need some way of measuring performance
- Coarse-grained measurement with the shell's time command:
  % gcc sample.c
  % time a.out
  2.312u 0.062s 0:02.50 94.8%
  % gcc sample.c -O3
  % time a.out
  1.921u 0.093s 0:02.03 99.0%
  (fields: user time, system time, wall-clock time, CPU utilization)
- ... but did we learn much about what's going on?

14 Measurements Using Counters
- It is increasingly possible to get accurate measurements using performance counters
  - Special hardware registers that count events
- Insert code to start, read, and stop a counter (example below)
  - Measure exactly what you want, anywhere you want
  - Can measure both communication and computation duration
  - But requires manual changes; monitoring nested scopes is an issue
  - Heisenberg effect: reading counters can perturb execution time
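As a hedged example of the "insert code to start, read, and stop a counter" step, a minimal sketch using PAPI's classic high-level counter API (PAPI is listed among the tools later in this talk; the calls below exist in PAPI 5.x, and which events are available depends on the CPU).

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
        long long values[2];

        /* Start counting cycles and retired instructions. */
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return EXIT_FAILURE;
        }

        /* ... region of code being measured ... */

        /* Stop and read both counters. */
        if (PAPI_stop_counters(values, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return EXIT_FAILURE;
        }

        printf("cycles: %lld  instructions: %lld  IPC: %.2f\n",
               values[0], values[1], (double)values[1] / values[0]);
        return EXIT_SUCCESS;
    }

Link with -lpapi; on newer PAPI releases the high-level interface has changed, so treat this as a sketch rather than a drop-in.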

15 Dynamic Profiling
- Event-based profiling: interrupt execution when an event counter reaches a threshold
- Time-based profiling: interrupt execution every t seconds
- Works without modifying your code
- Does not require knowing where the problem might be
- Supports multiple languages and programming models
- Quite efficient at appropriate sampling frequencies

16 Performance Counter Examples
- Cycles (clock ticks)
- Instructions retired
- Pipeline stalls
- Cache hits
- Cache misses
- Number of instructions
- Number of loads
- Number of stores
- Number of floating point operations
- ...

17 Useful Derived Metrics
- Processor utilization: Cycles / Wall clock time
- Instructions per cycle: Instructions / Cycles
- Instructions per memory operation: Instructions / (Loads + Stores)
- Average instructions per load miss: Instructions / L1 load misses
- Memory traffic: (Loads + Stores) x Lk cache line size
- Bandwidth consumed: (Loads + Stores) x Lk cache line size / Wall clock time
- Many others: cache miss rate, branch misprediction rate, ... (example below)
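A small illustrative helper (the struct layout and names are made up, not from the slides) showing how raw counter values combine into a few of the derived metrics above:

    #include <stdio.h>

    struct counters {
        double    wall_clock_s;   /* elapsed wall-clock time in seconds */
        double    clock_hz;       /* nominal core clock frequency       */
        long long cycles;         /* cycles consumed by the program     */
        long long instructions;   /* instructions retired               */
        long long loads, stores;  /* memory operations                  */
    };

    void report(const struct counters *c)
    {
        /* Fraction of the available cycles the program actually used. */
        double util = (double)c->cycles / (c->clock_hz * c->wall_clock_s);
        double ipc  = (double)c->instructions / (double)c->cycles;
        double inst_per_mem = (double)c->instructions /
                              (double)(c->loads + c->stores);

        printf("utilization: %.2f  IPC: %.2f  instr/mem-op: %.1f\n",
               util, ipc, inst_per_mem);
    }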

18 Common Profiling Workflow

19 Popular Runtime Profiling Tools
- GNU gprof: widely available with UNIX/Linux distributions
  % gcc -O2 -pg foo.c -o foo
  % ./foo
  % gprof foo
- OProfile: http://oprofile.sourceforge.net
- PAPI: http://icl.cs.utk.edu/papi/
- VTune: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/
- Many others

20 GNU gprof

21 Uniprocessor Performance
- time = compute + wait
- Instruction-level parallelism
  - Multiple functional units, deep pipelining, speculation, ...
- Data-level parallelism
  - SIMD: short vector instructions (multimedia extensions)
  - Hardware is simpler: no heavily ported register files
  - Instructions are more compact, reducing instruction fetch bandwidth
- Complex memory hierarchies
  - Multi-level caches, many outstanding misses, prefetching, ...

22 Instruction Locality

23 Instruction Locality (2)

24 Example: Cache/Memory Optimization

25 Example: Cache/Memory Optimization (2)

26 Example: Cache/Memory Optimization (3)

27 Results from Cache Optimizations

28 Summary: Programming for Performance
- Tune the parallelism first
- Then tune performance on the individual processors
  - Modern processors are complex
  - Instruction-level parallelism is needed for performance
  - Understanding performance requires a lot of probing
- Optimize for the memory hierarchy
  - Memory is much slower than processors
  - Multi-level memory hierarchies try to hide the speed gap
  - Data locality is essential for performance

29 Architecture and Application Level Performance Evaluation using Benchmarking and Profiling
Based on: "Dual Processor Performance Characterization for XML Application Oriented Networking" by Jason Ding and Abdul Waheed, in Proceedings of ICPP 2007

30 Problem Statement
- Application under consideration: XML message processing
  - Functions include XPath pattern matching and schema validation
- Two types of processors, in three configurations:
  - Single processor with two cores: Intel Pentium-M
  - Single processor with hyper-threading enabled: Intel Xeon
  - Two processors without hyper-threading: same Intel Xeon
- Need to understand the performance characteristics

31 Processors and System

32 Notations

33 Workload Characterization

34 Measurement Setup

35 Performance Counter Events

36 Performance Metrics
- High level
  - End-to-end throughput
- Architecture level
  - CPI
  - L2 cache misses
  - Bus transactions per instruction retired
  - Branch misprediction ratio

37 Baseline Measurements

38 Baseline Analysis

39 End-to-End Throughput

40 Comparison of CPIs

41 L2 Cache Misses

42 Bus Transactions Per Instruction

43 Branch Misprediction Ratios

44 Conclusions
- Hyper-threading in the Xeon processor scales poorly compared to dual Xeon or dual-core Pentium-M processors
  - Higher CPI
  - Lower bus traffic related to memory accesses
  - Higher branch misprediction ratio
- The dual-core Pentium-M exhibits better scalability for XML workloads
  - Wide dynamic execution, efficient memory access, and superior branch prediction
- Shared L2 cache in dual-core processors
  - May become a bottleneck for I/O intensive workloads
  - Simpler micro-architecture with respect to cache coherence

45 Case Study: Multi-Threaded Sniffer Performance Tuning
Single and Multiple Queue based Designs with and without Thread Locking (Labs 3-5)

46 Packet Sniffer
- Passive packet monitoring on a link
  - Each packet is captured from the link
  - Does not impact the actual packet flow
- Multiple layers of sniffer functionality
  - Layer 2: Ethernet frame capture
  - Layer 3: matching of the 5-tuple
  - Layer 7: payload string search (DPI)
- Compute intensity increases with the layer: layer 2 is the least and layer 7 the most compute intensive
- Throughput is limited by compute intensity
  - Multiple cores are expected to help scale throughput

47 Sniffer Setup (figure: sender on System 1 and receiver on System 2 connected by a 1 GigE link, with the packet sniffer mapping captured data packets to cores)

48 Single Queue Architecture (figure: a single dispatcher queue; the dispatcher puts packets in one direction while worker threads T0 ... TN get packets from their designated locations)

49 Single Queue Architecture (2)
- Single dispatcher thread
  - Captures each packet from the network
  - Copies it to the tail of the dispatcher queue
- Multiple worker threads
  - Each worker can access only its specified locations
  - Each location access is mutually exclusive, controlled by a per-location flag
  - No locking overhead
  - No copying from dispatcher space to worker space
- Location access function (see the sketch below):
  - Get packet if flag = 1 (workers)
  - Put packet if flag = 0 (dispatcher)
- Each worker processes the packet at the three layer levels
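A minimal sketch of the flag-per-location handoff described above, assuming C11 atomics for the flag; the slot layout, sizes, and function names are illustrative, not the original lab code.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define QUEUE_SLOTS 1024
    #define MAX_PKT     2048

    /* flag == 0: slot is empty, the dispatcher may fill it.
     * flag == 1: slot holds a packet, the assigned worker may process it. */
    struct slot {
        atomic_int flag;
        uint16_t   len;
        uint8_t    pkt[MAX_PKT];
    };

    static struct slot queue[QUEUE_SLOTS];

    /* Dispatcher: copy a captured packet into an empty slot. */
    int put_packet(int i, const uint8_t *data, uint16_t len)
    {
        if (atomic_load(&queue[i].flag) != 0)
            return -1;                        /* slot still owned by a worker */
        memcpy(queue[i].pkt, data, len);
        queue[i].len = len;
        atomic_store(&queue[i].flag, 1);      /* publish to the worker */
        return 0;
    }

    /* Worker: process a filled slot in place (no copy out), then release it. */
    int get_packet(int i, void (*process)(const uint8_t *pkt, uint16_t len))
    {
        if (atomic_load(&queue[i].flag) != 1)
            return -1;                        /* nothing to do yet */
        process(queue[i].pkt, queue[i].len);  /* in-situ processing */
        atomic_store(&queue[i].flag, 0);      /* hand the slot back */
        return 0;
    }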

50 Performance Results
- Throughput does not scale with the number of threads
  - Layer 2: limited by the 1 Gbps link speed
  - Layers 3 and 7: CPU bound
- Bottlenecks:
  - Scanning of the whole queue to check each location's flag
  - Large stride size
  - Increased cache coherence traffic

51 Multiple Queues Architecture (figure: the queue space is split into per-worker sub-queues; the dispatcher puts packets into each sub-queue and worker threads T0 ... TN get packets from their own sub-queue)

52 Multiple Queues Architecture (2)
- Queue space is distributed among the worker threads
  - The dispatcher can access the whole queue
  - Each worker thread can access only its dedicated sub-queue
- In-situ sniffing: no copying from dispatcher space to worker space
- Mutual exclusion is ensured by the get and set indices (get chases set)
  - No locking overhead
- Location access function (a sketch follows):
  - Get packet if get < set (workers)
  - Put packet if get <= set (dispatcher)
  - Wait otherwise
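A minimal sketch in the spirit of this design: one single-producer/single-consumer ring per worker, where the dispatcher advances a set index, the worker advances a get index, and get chases set. The names, sizes, and use of C11 atomics are assumptions, not the original lab code.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 256            /* per-worker sub-queue, power of two */

    struct pkt_ref {
        const uint8_t *data;
        uint16_t       len;
    };

    struct spsc_ring {
        _Atomic uint32_t set;         /* written by the dispatcher only */
        _Atomic uint32_t get;         /* written by the owning worker only */
        struct pkt_ref   slots[RING_SLOTS];
    };

    /* Dispatcher side: enqueue if this worker's ring is not full. */
    int ring_put(struct spsc_ring *r, struct pkt_ref p)
    {
        uint32_t set = atomic_load_explicit(&r->set, memory_order_relaxed);
        uint32_t get = atomic_load_explicit(&r->get, memory_order_acquire);
        if (set - get == RING_SLOTS)
            return -1;                              /* full: worker lagging */
        r->slots[set % RING_SLOTS] = p;
        atomic_store_explicit(&r->set, set + 1, memory_order_release);
        return 0;
    }

    /* Worker side: dequeue if 'get' is still behind 'set'. */
    int ring_get(struct spsc_ring *r, struct pkt_ref *out)
    {
        uint32_t get = atomic_load_explicit(&r->get, memory_order_relaxed);
        uint32_t set = atomic_load_explicit(&r->set, memory_order_acquire);
        if (get == set)
            return -1;                              /* empty */
        *out = r->slots[get % RING_SLOTS];
        atomic_store_explicit(&r->get, get + 1, memory_order_release);
        return 0;
    }

Because each index only increases and is written by exactly one thread, enqueue and dequeue are O(1) and need no locks; the optimized architecture described two slides below relies on the same property.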

53 Performance Results
- Throughput is better with 1-2 threads but drops beyond that
- Cause: inefficient O(n) enqueue and dequeue functions

54 Optimized Architecture
- Improved queuing
  - O(1) enqueue and dequeue instead of O(n)
  - One queue per worker thread
  - No locking in the design
- Other improvements
  - Removed duplication in packet capturing
  - Removed some redundant operations (e.g., layer 3 protocol type checks: IP, ARP, etc.)
  - Improved 5-tuple matching

55 Optimized Performance: Packet Capture
- Throughput scales from 1 to 2 threads
- Saturates at the 1 Gbps link speed limit

56 Optimized Performance: Matching 5-Tuple
- Performance is similar to packet capturing
- The 5-tuple comparison is not very expensive

57 Optimized Performance: String Matching in Payload
- String not in payload:
  - Throughput scales linearly from 1 to 4 threads/cores
  - Saturates at the 1 Gbps link bandwidth
- String in payload:
  - Saturates with 2 threads/cores
  - Significantly less compute intensive, as the string is found close to the start of the payload

58 Key Takeaways for Today's Session
- Multi-core application development requires
  - Identifying concurrent tasks
  - Identifying interactions among tasks
  - Mapping tasks onto multiple cores using threads
  - Fork-and-join based threading
- Initial development is fairly simple, but...
- The real challenge is to obtain scalability
  - Requires tuning at both the application and thread level
  - Locating bottlenecks through instrumentation, profiling, and measurement is a challenging task

