
1 Programming Multi-Core Processors based Embedded Systems A Hands-On Experience on Cavium Octeon based Platforms Dr. Abdul Waheed

2 Course Outline
- Introduction
- Multi-threading on multi-core processors
- Applications for multi-core processors
- Application layer computing on multi-core
- Performance measurement and tuning
  - Architecture and application level performance evaluation methodology
  - Case study: sniffer performance tuning

3 Agenda for Today
- Multi-core system performance and scalability
  - Methodology
  - Profiling tools
  - Benchmarking
- Benchmarking and profiling: a case study
- Multi-threaded sniffer performance tuning
  - Single queue based design
  - Lock-free design

4 Multi-Core Performance Measurement Methodology Credits for slides: Dr. Rodric Rabbah, IBM

5 Keys to Parallel Performance
- Coverage: the extent of parallelism in the algorithm, bounded by Amdahl's law (formula below)
- Granularity: the size of the workload partition given to each core; affects communication cost and load balance
- Locality: where computation happens relative to the data it communicates with
  - Inter-processor communication is non-local
  - Processor-memory communication is local
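Amdahl's law, referenced above, bounds the achievable speedup by the serial fraction of the program. In its standard form, with p the parallelizable fraction of execution time and N the number of cores:

    speedup(N) = 1 / ((1 - p) + p / N)

Even with unlimited cores the speedup cannot exceed 1 / (1 - p); for example, p = 0.9 caps the speedup at 10x no matter how many cores are added.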

6 Communication Cost Model

7 Overlapping Communication with Computation

8 Limits in Pipelining Communication
- The computation-to-communication ratio limits the performance gains from pipelining
- Where else can we look for performance?

9 Lessons from Uniprocessors
- In uniprocessors, the CPU communicates with memory
- Loads and stores are to uniprocessors what GET and PUT are to distributed-memory multiprocessors
- How is communication overlapped with computation in uniprocessors?
  - Spatial locality
  - Temporal locality

10 Spatial Locality
- CPU asks for data at address 1000
- Memory sends the data at addresses 1000 ... 1064
- The amount of data sent depends on architecture parameters such as the cache block size
- Works well if the CPU actually ends up using the data at 1001, 1002, ..., 1064 (sketch below)
- Otherwise bandwidth and cache capacity are wasted
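To make the point concrete (this example is not from the original slides), a minimal C sketch contrasting a cache-line-friendly sequential walk with a strided walk; the 16-int stride assumes a 64-byte cache line.

    #include <stddef.h>

    #define N (1 << 20)

    /* Sequential walk: consecutive elements share cache lines, so most
     * accesses hit in cache after the first miss per line. */
    long sum_sequential(const int *a)
    {
        long sum = 0;
        for (size_t i = 0; i < N; i++)
            sum += a[i];
        return sum;
    }

    /* Strided walk (stride = 16 ints = one 64-byte line): every access
     * touches a new cache line, so the fetched neighbours are wasted. */
    long sum_strided(const int *a)
    {
        long sum = 0;
        for (size_t s = 0; s < 16; s++)
            for (size_t i = s; i < N; i += 16)
                sum += a[i];
        return sum;
    }

Both functions do the same arithmetic, but the strided version pays roughly one cache miss per element instead of one per line.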

11 Temporal Locality
- Main memory access is expensive
- The memory hierarchy adds small but fast memories (caches) near the CPU
- Memories get bigger as the distance from the CPU increases
- CPU asks for data at address 1000
- The memory hierarchy anticipates more accesses to the same address and keeps a local copy
- Works well if the CPU actually ends up using the data at 1000 over and over (sketch below)
- Otherwise cache capacity is wasted
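A minimal C sketch of temporal locality (illustrative, not from the slides): the small weights[] array is re-read on every iteration of the outer loop, so after the first pass it is served from cache rather than main memory.

    #include <stddef.h>

    #define ROWS 1024
    #define COLS 1024

    /* weights[] (8 KB) is reused ROWS times; because it is small and hot,
     * it stays resident in cache (temporal locality), while each row of
     * data[] is streamed through once. */
    void weighted_rows(const double data[ROWS][COLS],
                       const double weights[COLS],
                       double out[ROWS])
    {
        for (size_t r = 0; r < ROWS; r++) {
            double acc = 0.0;
            for (size_t c = 0; c < COLS; c++)
                acc += data[r][c] * weights[c];
            out[r] = acc;
        }
    }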

12 Single Thread Performance
- Tasks are mapped to execution units (threads)
- Threads run on individual processors (cores)
- Two keys to faster execution:
  - Load balance the work among the processors
  - Make execution on each processor faster

13 Understanding Performance
- We need some way of measuring performance
- Coarse-grained measurement with the shell's time command:
  % gcc sample.c
  % time a.out
  2.312u 0.062s 0:02.50 94.8%
  % gcc sample.c -O3
  % time a.out
  1.921u 0.093s 0:02.03 99.0%
  (fields: user time, system time, wall-clock time, CPU utilization)
- ... but did we learn much about what's going on?

14 Measurements Using Counters
- It is increasingly possible to get accurate measurements using performance counters
  - Special hardware registers that count events
- Insert code to start, read, and stop a counter (example below)
  - Measure exactly what you want, anywhere you want
  - Can measure both communication and computation duration
  - But requires manual changes; monitoring nested scopes is an issue
  - Heisenberg effect: reading counters can perturb execution time
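As a hedged example of the "insert code to start, read, and stop a counter" step, a minimal sketch using PAPI's classic high-level counter API (PAPI is listed among the tools later in this talk; the calls below exist in PAPI 5.x, and which events are available depends on the CPU).

    #include <stdio.h>
    #include <stdlib.h>
    #include <papi.h>

    int main(void)
    {
        int events[2] = { PAPI_TOT_CYC, PAPI_TOT_INS };
        long long values[2];

        /* Start counting cycles and retired instructions. */
        if (PAPI_start_counters(events, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_start_counters failed\n");
            return EXIT_FAILURE;
        }

        /* ... region of code being measured ... */

        /* Stop and read both counters. */
        if (PAPI_stop_counters(values, 2) != PAPI_OK) {
            fprintf(stderr, "PAPI_stop_counters failed\n");
            return EXIT_FAILURE;
        }

        printf("cycles: %lld  instructions: %lld  IPC: %.2f\n",
               values[0], values[1], (double)values[1] / values[0]);
        return EXIT_SUCCESS;
    }

Link with -lpapi; on newer PAPI releases the high-level interface has changed, so treat this as a sketch rather than a drop-in.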

15 Dynamic Profiling
- Event-based profiling: interrupt execution when an event counter reaches a threshold
- Time-based profiling: interrupt execution every t seconds
- Works without modifying your code
- Does not require knowing where the problem might be
- Supports multiple languages and programming models
- Quite efficient at appropriate sampling frequencies

16 Performance Counter Examples
- Cycles (clock ticks)
- Instructions retired
- Pipeline stalls
- Cache hits
- Cache misses
- Number of instructions
- Number of loads
- Number of stores
- Number of floating point operations
- ...

17 Useful Derived Metrics
- Processor utilization: Cycles / Wall clock time
- Instructions per cycle: Instructions / Cycles
- Instructions per memory operation: Instructions / (Loads + Stores)
- Average instructions per load miss: Instructions / L1 load misses
- Memory traffic: (Loads + Stores) x Lk cache line size
- Bandwidth consumed: (Loads + Stores) x Lk cache line size / Wall clock time
- Many others: cache miss rate, branch misprediction rate, ... (example below)
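A small illustrative helper (the struct layout and names are made up, not from the slides) showing how raw counter values combine into a few of the derived metrics above:

    #include <stdio.h>

    struct counters {
        double    wall_clock_s;   /* elapsed wall-clock time in seconds */
        double    clock_hz;       /* nominal core clock frequency       */
        long long cycles;         /* cycles consumed by the program     */
        long long instructions;   /* instructions retired               */
        long long loads, stores;  /* memory operations                  */
    };

    void report(const struct counters *c)
    {
        /* Fraction of the available cycles the program actually used. */
        double util = (double)c->cycles / (c->clock_hz * c->wall_clock_s);
        double ipc  = (double)c->instructions / (double)c->cycles;
        double inst_per_mem = (double)c->instructions /
                              (double)(c->loads + c->stores);

        printf("utilization: %.2f  IPC: %.2f  instr/mem-op: %.1f\n",
               util, ipc, inst_per_mem);
    }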

18 Common Profiling Workflow

19 Popular Runtime Profiling Tools
- GNU gprof: widely available with UNIX/Linux distributions
  % gcc -O2 -pg foo.c -o foo
  % ./foo
  % gprof foo
- OProfile: http://oprofile.sourceforge.net
- PAPI: http://icl.cs.utk.edu/papi/
- VTune: http://www.intel.com/cd/software/products/asmo-na/eng/vtune/
- Many others

20 GNU gprof

21 Uniprocessor Performance
- time = compute + wait
- Instruction-level parallelism
  - Multiple functional units, deep pipelining, speculation, ...
- Data-level parallelism
  - SIMD: short vector instructions (multimedia extensions)
  - Hardware is simpler: no heavily ported register files
  - Instructions are more compact, reducing instruction fetch bandwidth
- Complex memory hierarchies
  - Multi-level caches, many outstanding misses, prefetching, ...

22 Instruction Locality

23 Instruction Locality (2)

24 Example: Cache/Memory Optimization

25 Example: Cache/Memory Optimization (2)

26 Example: Cache/Memory Optimization (3)

27 Results from Cache Optimizations

28 Summary: Programming for Performance
- Tune the parallelism first
- Then tune performance on the individual processors
  - Modern processors are complex
  - Instruction-level parallelism is needed for performance
  - Understanding performance requires a lot of probing
- Optimize for the memory hierarchy
  - Memory is much slower than processors
  - Multi-level memory hierarchies try to hide the speed gap
  - Data locality is essential for performance

29 Architecture and Application Level Performance Evaluation using Benchmarking and Profiling
Based on: "Dual Processor Performance Characterization for XML Application Oriented Networking" by Jason Ding and Abdul Waheed, in Proceedings of ICPP 2007

30 Problem Statement
- Application under consideration: XML message processing
  - Functions include XPath pattern matching and schema validation
- Two types of processors, in three configurations:
  - Single processor with two cores: Intel Pentium-M
  - Single processor with hyper-threading enabled: Intel Xeon
  - Two processors without hyper-threading: same Intel Xeon
- Need to understand the performance characteristics

31 Processors and System

32 Notations

33 Workload Characterization

34 Measurement Setup

35 Performance Counter Events

36 Performance Metrics
- High level
  - End-to-end throughput
- Architecture level
  - CPI
  - L2 cache misses
  - Bus transactions per instruction retired
  - Branch misprediction ratio

37 Baseline Measurements

38 Baseline Analysis

39 End-to-End Throughput

40 Comparison of CPIs

41 L2 Cache Misses

42 Bus Transactions Per Instruction

43 Branch Misprediction Ratios

44 Conclusions
- Hyper-threading in the Xeon processor scales poorly compared to dual Xeon or dual-core Pentium-M processors
  - Higher CPI
  - Lower bus traffic related to memory accesses
  - Higher branch misprediction ratio
- The dual-core Pentium-M exhibits better scalability for XML workloads
  - Wide dynamic execution, efficient memory access, and superior branch prediction
- Shared L2 cache in dual-core processors
  - May become a bottleneck for I/O intensive workloads
  - Simpler micro-architecture with respect to cache coherence

45 Case Study: Multi-Threaded Sniffer Performance Tuning
Single and Multiple Queue based Designs with and without Thread Locking (Labs 3-5)

46 Packet Sniffer
- Passive packet monitoring on a link
  - Each packet is captured from the link
  - Does not impact the actual packet flow
- Multiple layers of sniffer functionality
  - Layer 2: Ethernet frame capture
  - Layer 3: matching of the 5-tuple
  - Layer 7: payload string search (DPI)
- Compute intensity increases with the layer: layer 2 is the least and layer 7 the most compute intensive
- Throughput is limited by compute intensity
  - Multiple cores are expected to help scale throughput

47 Sniffer Setup (figure: sender on System 1 and receiver on System 2 connected by a 1 GigE link, with the packet sniffer mapping captured data packets to cores)

48 Single Queue Architecture (figure: a single dispatcher queue; the dispatcher puts packets in one direction while worker threads T0 ... TN get packets from their designated locations)

49 Single Queue Architecture (2)
- Single dispatcher thread
  - Captures each packet from the network
  - Copies it to the tail of the dispatcher queue
- Multiple worker threads
  - Each worker can access only its specified locations
  - Each location access is mutually exclusive, controlled by a per-location flag
  - No locking overhead
  - No copying from dispatcher space to worker space
- Location access function (see the sketch below):
  - Get packet if flag = 1 (workers)
  - Put packet if flag = 0 (dispatcher)
- Each worker processes the packet at the three layer levels
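A minimal sketch of the flag-per-location handoff described above, assuming C11 atomics for the flag; the slot layout, sizes, and function names are illustrative, not the original lab code.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <string.h>

    #define QUEUE_SLOTS 1024
    #define MAX_PKT     2048

    /* flag == 0: slot is empty, the dispatcher may fill it.
     * flag == 1: slot holds a packet, the assigned worker may process it. */
    struct slot {
        atomic_int flag;
        uint16_t   len;
        uint8_t    pkt[MAX_PKT];
    };

    static struct slot queue[QUEUE_SLOTS];

    /* Dispatcher: copy a captured packet into an empty slot. */
    int put_packet(int i, const uint8_t *data, uint16_t len)
    {
        if (atomic_load(&queue[i].flag) != 0)
            return -1;                        /* slot still owned by a worker */
        memcpy(queue[i].pkt, data, len);
        queue[i].len = len;
        atomic_store(&queue[i].flag, 1);      /* publish to the worker */
        return 0;
    }

    /* Worker: process a filled slot in place (no copy out), then release it. */
    int get_packet(int i, void (*process)(const uint8_t *pkt, uint16_t len))
    {
        if (atomic_load(&queue[i].flag) != 1)
            return -1;                        /* nothing to do yet */
        process(queue[i].pkt, queue[i].len);  /* in-situ processing */
        atomic_store(&queue[i].flag, 0);      /* hand the slot back */
        return 0;
    }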

50 Performance Results
- Throughput does not scale with the number of threads
  - Layer 2: limited by the 1 Gbps link speed
  - Layers 3 and 7: CPU bound
- Bottlenecks:
  - Scanning of the whole queue to check each location's flag
  - Large stride size
  - Increased cache coherence traffic

51 Multiple Queues Architecture (figure: the queue space is split into per-worker sub-queues; the dispatcher puts packets into each sub-queue and worker threads T0 ... TN get packets from their own sub-queue)

52 Multiple Queues Architecture (2)
- Queue space is distributed among the worker threads
  - The dispatcher can access the whole queue
  - Each worker thread can access only its dedicated sub-queue
- In-situ sniffing: no copying from dispatcher space to worker space
- Mutual exclusion is ensured by the get and set indices (get chases set)
  - No locking overhead
- Location access function (a sketch follows):
  - Get packet if get < set (workers)
  - Put packet if get <= set (dispatcher)
  - Wait otherwise
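A minimal sketch in the spirit of this design: one single-producer/single-consumer ring per worker, where the dispatcher advances a set index, the worker advances a get index, and get chases set. The names, sizes, and use of C11 atomics are assumptions, not the original lab code.

    #include <stdatomic.h>
    #include <stdint.h>

    #define RING_SLOTS 256            /* per-worker sub-queue, power of two */

    struct pkt_ref {
        const uint8_t *data;
        uint16_t       len;
    };

    struct spsc_ring {
        _Atomic uint32_t set;         /* written by the dispatcher only */
        _Atomic uint32_t get;         /* written by the owning worker only */
        struct pkt_ref   slots[RING_SLOTS];
    };

    /* Dispatcher side: enqueue if this worker's ring is not full. */
    int ring_put(struct spsc_ring *r, struct pkt_ref p)
    {
        uint32_t set = atomic_load_explicit(&r->set, memory_order_relaxed);
        uint32_t get = atomic_load_explicit(&r->get, memory_order_acquire);
        if (set - get == RING_SLOTS)
            return -1;                              /* full: worker lagging */
        r->slots[set % RING_SLOTS] = p;
        atomic_store_explicit(&r->set, set + 1, memory_order_release);
        return 0;
    }

    /* Worker side: dequeue if 'get' is still behind 'set'. */
    int ring_get(struct spsc_ring *r, struct pkt_ref *out)
    {
        uint32_t get = atomic_load_explicit(&r->get, memory_order_relaxed);
        uint32_t set = atomic_load_explicit(&r->set, memory_order_acquire);
        if (get == set)
            return -1;                              /* empty */
        *out = r->slots[get % RING_SLOTS];
        atomic_store_explicit(&r->get, get + 1, memory_order_release);
        return 0;
    }

Because each index only increases and is written by exactly one thread, enqueue and dequeue are O(1) and need no locks; the optimized architecture described two slides below relies on the same property.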

53 Performance Results
- Throughput is better with 1-2 threads but drops beyond that
- Cause: inefficient O(n) enqueue and dequeue functions

54 Optimized Architecture
- Improved queuing
  - O(1) enqueue and dequeue instead of O(n)
  - One queue per worker thread
  - No locking in the design
- Other improvements
  - Removed duplication in packet capturing
  - Removed some redundant operations (e.g., layer 3 protocol type checks: IP, ARP, etc.)
  - Improved 5-tuple matching

55 Optimized Performance: Packet Capture
- Throughput scales from 1 to 2 threads
- Saturates at the 1 Gbps link speed limit

56 Optimized Performance: Matching 5-Tuple
- Performance is similar to packet capturing
- The 5-tuple comparison is not very expensive

57 Optimized Performance: String Matching in Payload
- String not in payload:
  - Throughput scales linearly from 1 to 4 threads/cores
  - Saturates at the 1 Gbps link bandwidth
- String in payload:
  - Saturates with 2 threads/cores
  - Significantly less compute intensive, as the string is found close to the start of the payload

58 Key Takeaways for Today's Session
- Multi-core application development requires
  - Identifying concurrent tasks
  - Identifying interactions among tasks
  - Mapping tasks onto multiple cores using threads
  - Fork-and-join based threading
- Initial development is fairly simple, but...
- The real challenge is to obtain scalability
  - Requires tuning at both the application and thread level
  - Locating bottlenecks through instrumentation, profiling, and measurement is a challenging task

