January 7, 2007 QNX Multi-Core Solution Optimizing Software for Multi-core Kerry Johnson
2 QNX Confidential. All content copyright QNX Software Systems.
Transitioning to Multi-core: An Embedded Software Perspective
In the embedded space, multi-core processors offer well-established benefits
► Scalable performance
► Increased MIPS/watt
In the embedded market, true parallel processing capabilities have not been widely deployed
Transitioning software is the key concern of most embedded system vendors
► Some code may not operate correctly in a truly parallel environment
► Symmetric multiprocessing is not widely understood
► Parallel programming skills need to be developed
► Need development tools to help troubleshoot and optimize
Embedded system vendors are looking for ways to de-risk the move to multi-core
► Need for a full solution from an embedded software vendor that can support them during the transition
3 Migrating Legacy Software
Software that works properly on multi-core processors
► Single-threaded, or multi-threaded with good synchronization
Issues that cause improper software operation on multi-core
► Some result in program failures:
    Latent bugs that result in a lack of synchronization between threads
    Using priorities / run to completion as mutual exclusion
    Assuming interrupts provide exclusivity
► Some result in sub-optimal performance on multi-core processors:
    Sequential code that can be made parallel
    Cache thrashing / ping-ponging
    Contention for shared resources
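The "priorities / run to completion as mutual exclusion" failure listed on this slide can be illustrated with a minimal sketch. On a single core, a thread that never blocks and cannot be preempted by its peers can update shared state without a lock. Once the threads run on separate cores, that assumption no longer holds and the update needs an explicit lock. The names below (`shared_counter`, `increment_counter`, `run_counter_demo`) and the constants are hypothetical and not taken from the slides; `assert` stands in for real error handling.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define NUM_THREADS 2        /* hypothetical: one thread per core */
#define INCREMENTS  100000   /* hypothetical workload size */

static long shared_counter;  /* state shared by all threads */
static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each thread updates the shared counter under an explicit lock,
 * instead of relying on priorities to keep other threads out. */
static void *increment_counter(void *arg)
{
    (void)arg;
    for (int i = 0; i < INCREMENTS; i++) {
        pthread_mutex_lock(&counter_lock);
        shared_counter++;
        pthread_mutex_unlock(&counter_lock);
    }
    return NULL;
}

/* Runs all threads to completion and returns the final counter value. */
long run_counter_demo(void)
{
    pthread_t threads[NUM_THREADS];

    shared_counter = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int rc = pthread_create(&threads[t], NULL, increment_counter, NULL);
        assert(rc == 0);     /* sketch only: real code should handle the error */
    }
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(threads[t], NULL);
    }
    return shared_counter;
}
```

With the lock in place, the result is `NUM_THREADS * INCREMENTS` however the threads are distributed across cores. Without it, the increments can interleave and the total may come up short.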
4 Multi-core Readiness – Taking Inventory
How does your environment stack up?
Considerations:
► RTOS capable of handling all cores
► Libraries & middleware thread-safe, threaded for performance
► Tools for debugging and optimization on multi-core processors
[Diagram labels: Core 1, Core 2, Core 3, Core 4, RTOS, Libraries & Middleware, Applications, Tools]
5 QNX Multi-Core Solution
QNX® Neutrino® RTOS
► Proven QNX Neutrino RTOS SMP, introduced in 1997
► Support for symmetric, bound and asymmetric operation
► Simple migration of uniprocessor-based software
► Wide range of multi-core board support packages
QNX® Momentics® development suite
► Advanced visualization tools for multi-core systems
► Multi-core debugging and profiling tools
► Open, Eclipse-based IDE
January 7, 2007 Multi-Core RTOS
7 Multiprocessing Models
Symmetric Multiprocessing (SMP)
One OS handles all cores
► Scaling: Scalable performance by adding cores
► Performance: Maximum processor utilization through dynamic load balancing & lightweight OS primitives for threads running on any core
► Time to Market: Resource sharing handled by OS reduces design & code complexity
However:
► May have legacy code issues
[Diagram labels: OS, Applications, Core 1, Core 2, Cache, Interconnect]
8 Multiprocessing Models
QNX Bound Multiprocessing (BMP)
QNX innovation that extends SMP, providing the ability to “bind” processes and threads to a single core
Support legacy code base and multi-core capable applications
► Supports bound and symmetric operation, selectable by process / thread
Dedicate processing where needed
► Designer has full control over where applications run
► Applications and/or threads can be “bound” to a specific core
[Diagram labels: OS, A2, A1, A5, A3, A4, Core 1, Core 2, Cache, Interconnect]
The QNX Neutrino® RTOS supports asymmetric, symmetric and bound multiprocessing, offering the flexibility to choose.
January 7, 2007 Multi-Core Development Tools
10 The Role of Tools
The right toolset eases the transition to multi-core
Assess current software when moving to multi-core
► Should processes be separated between cores? Determine how closely coupled the current processes are
► Where can concurrent processing help? Show the current processing bottlenecks
Debugging in a multi-core environment
► Characterize and debug interaction between threads on multiple CPUs
Tuning and optimization in a multi-core environment
► Move processes and threads between cores
► Examine processing bottlenecks
► Examine inter-process communications
11 Visual system analysis
System Profiler
► Single step to capture, upload and view system trace
► Quickly visualize system interaction and behavior
► View interrupts, thread states, event timing, CPU usage, partitions, IPC, and much more
[Diagram labels: QNX Momentics on development host, Capture Kernel Event Trace, Target system running QNX Neutrino instrumented kernel, Upload Trace File, Display Trace Information]
12 Find Opportunities for Parallelism
Multi-core chips offer true hardware parallelism
► But how do you take advantage of it?
Determine where parallelism will have the biggest payoff
1. Use the Momentics System Profiler to pinpoint processes or threads that consume the most CPU
2. Once you’ve identified CPU-intensive processes, use the Momentics Application Profiler to:
    ► Analyze function-level performance within individual processes
    ► Determine which code inside a process or thread consumes the most CPU
13 Application Profiler – Finding Bottlenecks
► fill_array() function consumes 97% of CPU in this test run
► Application Profiler uses sampling
14 Execution Time – Single CPU
System Profiler – high resolution timestamps
Total Elapsed Time: s
15 Single CPU Approach
Updating Large Array: Non-threaded, Single CPU

float array[NUM_ROWS][NUM_COLUMNS];

/* Fill every element of the array on a single thread. */
void fill_array()
{
    int i, j;

    for ( i = 0; i < NUM_ROWS; i++ ) {
        for ( j = 0; j < NUM_COLUMNS; j++ ) {
            array[i][j] = ((i/2 * j) / 3.2) + 1.0;
        }
    }
}

[Diagram labels: Application, QNX Neutrino RTOS, CPU 1]
16 Making Software Parallel
Threads
► Pool of worker threads
► Dispatch “work” to worker threads
Scales very well & easily with SMP
► Simply adjust number of worker threads as you add or remove CPUs
► No code change required
► Very lightweight OS primitives to synchronize
[Diagram labels: Process, Main thread, Worker thread, CPU 1, CPU 2]
17 Increasing Speed with Multi-Core & SMP
Updating Large Array – Create Worker Threads
Uses POSIX pthread model to realize multi-core performance gains

void multi_thread_fill_array()
{
    int thread, rc;
    pthread_t worker_threads[NUM_CPUS];
    int thread_index[NUM_CPUS];

    /* Start one worker per CPU; each gets its own index */
    for (thread = 0; thread < NUM_CPUS; ++thread) {
        thread_index[thread] = thread;
        rc = pthread_create(&worker_threads[thread], NULL,
                            &fill_array_fragment,
                            (void *)&thread_index[thread]);
        if (rc) {
            // handle error
        }
    }
    return;
}

[Diagram labels: Application, Main Thread, Thread 1, Thread 2, QNX Neutrino RTOS, CPU 1, CPU 2]
18 But Wait… It’s Broken!
► Main thread (thread 1) references array before worker threads complete
► Worker threads busy filling array
19 Adding Synchronization – Main Thread
POSIX barrier synchronizes threads. Thread 1 does not continue until Threads 2 & 3 have finished their work.

pthread_barrier_t barrier;   /* shared by the main and worker threads */

void multi_thread_fill_array()
{
    int thread, rc;
    pthread_t worker_threads[NUM_CPUS];
    int thread_index[NUM_CPUS];

    // Sync threads, main + worker threads
    pthread_barrier_init(&barrier, NULL, NUM_CPUS+1);

    for (thread = 0; thread < NUM_CPUS; ++thread) {
        thread_index[thread] = thread;
        rc = pthread_create(&worker_threads[thread], NULL,
                            &fill_array_fragment,
                            (void *)&thread_index[thread]);
        if (rc) {
            // handle error
        }
    }

    // Wait here until every worker has reached the barrier
    pthread_barrier_wait(&barrier);
    pthread_barrier_destroy(&barrier);
    return;
}

[Diagram labels: Main thread (Thread 1), Thread 2, Thread 3]
20 Adding Synchronization – Worker Thread
POSIX barrier synchronizes threads. Thread 1 does not continue until Threads 2 & 3 have finished their work.

/* pthread start routine: takes void *, which points at this worker's index */
void *fill_array_fragment(void *arg)
{
    int *thread_index = arg;
    int col = 0;
    int start_row = 0;
    int end_row = 0;

    /* Each worker fills its own band of rows
       (assumes NUM_ROWS is a multiple of NUM_CPUS) */
    start_row = *thread_index * (NUM_ROWS/NUM_CPUS);
    end_row = start_row + (NUM_ROWS/NUM_CPUS) - 1;

    while (start_row <= end_row) {
        for (col = 0; col < NUM_COLUMNS; col++) {
            array[start_row][col] = ((start_row/2 * col) / 3.2) + 1.0;
        }
        ++start_row;
    }

    /* Signal completion by joining the main thread at the barrier */
    pthread_barrier_wait(&barrier);
    return NULL;
}

[Diagram labels: Main thread (Thread 1), Thread 2, Thread 3]
21 Finished!
► All threads waiting at barrier
► Thread 1 resumes execution
► Worker threads dynamically scheduled across both CPU cores. Color coding indicates different cores.
22 Speed Increase with 2 cores
Total Elapsed Time: sec
Linear region: sec
Parallel region (2 threads): – = sec
Total Elapsed Time for single CPU (Non-threaded): sec
Overall Performance Increase: 1.66x
fill_array() speed increase: 1.999x
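The two ratios on this slide can be related through Amdahl's law, which the slides do not invoke. This is a rough consistency check only. It assumes the run splits cleanly into a serial part and a part, of single-CPU time fraction p, that speeds up by s = 1.999, with S = 1.66 the overall speedup. The symbols p, s and S are introduced here for illustration.

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}}
\quad\Longrightarrow\quad
p = \frac{1 - 1/S}{1 - 1/s}
  = \frac{1 - 1/1.66}{1 - 1/1.999}
  \approx \frac{0.398}{0.500}
  \approx 0.80
```

Under these assumptions, roughly 80% of the single-CPU elapsed time would lie in the region that was made parallel. The remaining serial portion is what holds the overall gain below 2x.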
23 Reducing Unnecessary Thread Migration
Neutrino uses “soft” processor affinity
► Scheduler always tries to dispatch a thread to the core where the thread last ran
    Allows core to fetch thread’s instructions directly from L1 cache
    Eliminates need to reload instructions from L2 cache or main memory
    Enables optimal performance
Sometimes, however, threads will still migrate unnecessarily from core to core
► Threads overwrite one another’s cached instructions
► Forces L1 cache to be continuously reloaded
Can result from high-priority threads interrupting low-priority threads
► Every time an interrupt occurs, there is a possibility the OS will schedule the interrupt-handling thread on another core
► The more cores on the chip, the more likely this will occur
QNX Neutrino’s bound multiprocessing capability can be used as a cure
24 Reducing Unnecessary Thread Migration
System-tracing tool can help diagnose this condition
► Provides statistics on core-to-core migrations for a particular thread
► Provides a system-wide perspective, allowing the developer to “zoom in” on potential problem areas
[Screenshot annotation: Thread migration]
25 Use CPU affinity to bind threads

#include <sys/neutrino.h>   /* ThreadCtl(), RMSK_SET() */

void *fill_array_fragment(void *arg)
{
    int *thread_index = arg;
    int col = 0;
    int start_row = 0;
    int end_row = 0;
    unsigned runmask = 0;

    /* Bind this worker to the CPU whose number matches its index */
    RMSK_SET(*thread_index, &runmask);
    ThreadCtl( _NTO_TCTL_RUNMASK, (void *)runmask);

    start_row = *thread_index * (NUM_ROWS / NUM_CPUS);
    end_row = start_row + (NUM_ROWS / NUM_CPUS) - 1;

    while (start_row <= end_row) {
        for (col = 0; col < NUM_COLUMNS; col++) {
            array[start_row][col] = ((start_row/2 * col) / 3.2) + 1.0;
        }
        ++start_row;
    }

    pthread_barrier_wait(&barrier);
    return NULL;
}

► Bind Thread 1 to CPU 1
► Bind Thread 2 to CPU 2
[Diagram labels: Application, Main Thread, Thread 1 & CPU 1, Thread 2 & CPU 2, QNX Neutrino RTOS, CPU 1, CPU 2]
26 Reduce Thread Migration – Thread Affinity
► No thread migration of worker threads
► Worker threads stay on same CPU
Total Elapsed Time for dual core:
    Bound: sec
    Unbound: sec
27 The only multi-core development platform
QNX® Momentics® development suite
QNX® Neutrino® RTOS
[Diagram labels: Debug, Analyze, Optimize, Develop, Deploy]
January 7, 2007 Thank You Kerry Johnson