Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor
DAC2013 Designer/User Track
Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto
Center for Semiconductor Research and Development, Toshiba Corporation
Copyright 2013, Toshiba Corporation.
Background
Requirements for embedded processors
–Various types of processing
  Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
  Face detection/recognition, audio/video playback, mobile TV
–Wide range of required processing performance
  Must cover various types of products, from mobile phones to tablets and beyond
–Example: video decoding from QVGA at 15 fps to 1080p at 60 fps or more
–Low-cost, short-time development that meets market requirements
  Reuse existing software to reduce development cost
Challenges
What kind of hardware architecture should we employ?
–The number of cores should be easy to increase or decrease
How can we realize scalable performance?
–A parallelized application program that utilizes multiple cores efficiently
How can we realize transparency?
–Hide the number of cores from the application program
Multiple Core Architecture [xu2012low]
Our Proposed Scheduler
[xu2012low] H. Xu, et al., "A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications," VLSI Symposium 2012
Our approach
A simple multiple core architecture
+ An application program independent of the number of cores
+ An efficient parallel processing scheme
Achieving scalable performance
Strategy to realize our approach
Strategy
–Develop an application that is independent of the number of cores (transparency)
–Run the developed application on a multiple-core processor and achieve performance proportional to the number of cores (scalable performance)
Scheme
–Designed an efficient thread scheduler
  Efficient management of threads may achieve scalable performance
  The number of cores may be hidden if the thread scheduler abstracts the cores
Challenges
–Minimizing execution overheads
–Hiding the number of cores from the application program
How to minimize overheads
Defined unique properties for threads
–A thread never suspends to wait for data
  This eliminates the overhead of thread switching
–A thread becomes ready to run when all necessary data are available
Managed thread status using simple counters
–Simplify the dependencies into "the number of dependencies"
  This can be realized by simple operations
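The counter idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `Thread`, `pending`, and `notify` are hypothetical, and the sketch is single-threaded, so it shows only the bookkeeping and not the atomic update a real many-core scheduler would need.

```python
from collections import deque


class Thread:
    """A run-to-completion unit of work (hypothetical model, not the authors' API)."""

    def __init__(self, name, num_dependencies):
        self.name = name
        # Number of inputs still missing; the thread is ready when this reaches 0.
        self.pending = num_dependencies


def notify(thread, ready_queue):
    """Record that one input of `thread` became available."""
    thread.pending -= 1
    if thread.pending == 0:
        # All necessary data are available, so the thread can run without
        # ever suspending to wait for data.
        ready_queue.append(thread)


ready = deque()
decode = Thread("decode", num_dependencies=2)  # waits for two inputs
notify(decode, ready)                          # first input arrives: still waiting
ready_after_first = len(ready)
notify(decode, ready)                          # second input arrives: now ready
```

Because a thread enters the queue only once its counter reaches zero, the status check reduces to one decrement and one comparison per produced input.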
How to hide the number of cores
Designed a distributed scheduler with a shared queue
–ONLY ready threads are placed in a shared queue
–A thread dispatcher runs on each core
–The dispatcher fetches a thread from the shared queue and executes it
To reduce access conflicts on the shared queue
–We use the CAS (Compare And Swap) instruction
[Figure: the thread dispatcher on each core searches the shared queue, then fetches and executes a thread]
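The fetch step above can be sketched as a CAS retry loop on a shared head index. This is an assumption-laden illustration: Python has no CAS instruction, so the hypothetical `AtomicInt` class emulates one with a lock, and the queue layout (a list plus a head index) is chosen for brevity rather than taken from the slides.

```python
import threading


class AtomicInt:
    """Emulates a CAS-capable word; a lock stands in for the hardware instruction."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def fetch(queue, head):
    """Claim the next ready item, retrying when another core wins the race."""
    while True:
        index = head.load()
        if index >= len(queue):
            return None                      # no ready thread left
        if head.compare_and_swap(index, index + 1):
            return queue[index]              # this core now owns the item


def dispatcher(queue, head, executed):
    """Per-core loop: fetch a ready item and run it to completion."""
    while (item := fetch(queue, head)) is not None:
        executed.append(item)                # stands in for executing the thread


ready_queue = list(range(1000))              # ready threads, modeled as integers
head = AtomicInt(0)
executed = []
cores = [threading.Thread(target=dispatcher, args=(ready_queue, head, executed))
         for _ in range(4)]
for core in cores:
    core.start()
for core in cores:
    core.join()
```

A dispatcher whose CAS fails simply rereads the head and tries again, so no core ever holds the queue while another waits, and each item is claimed by exactly one core.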
Implemented thread scheduler
Our thread scheduler consists of three components
–Dependency Controller, Thread Pool, and Thread Dispatcher
Our thread scheduler...
–is low overhead, for scalable performance
–hides the number of cores from the application, for transparency
[Figure: thread scheduler block diagram. The application registers threads; the Dependency Controller, Thread Pool, and per-core Thread Dispatchers are shown with the labels "necessary", "available", "ready", and "fetch & execute"]
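One way the three components might fit together is sketched below. Everything here is a hypothetical toy model: the class `Scheduler`, its `register` and `run` methods, and the use of Python's thread-safe `queue.Queue` in place of a CAS-protected shared queue are assumptions made for illustration. The point it tries to show is that the application code never mentions the core count.

```python
import queue
import threading


class Scheduler:
    """Toy model: dependency controller + thread pool + per-core dispatchers."""

    def __init__(self):
        self.funcs = {}   # thread pool: name -> function to run
        self.deps = {}    # dependency information: name -> names it waits for

    def register(self, name, func, deps=()):
        """Called by the application; no core count appears in this interface."""
        self.funcs[name] = func
        self.deps[name] = tuple(deps)

    def run(self, num_cores):
        if not self.funcs:
            return
        pending = {n: len(d) for n, d in self.deps.items()}  # dependency counters
        successors = {n: [] for n in self.deps}
        for n, d in self.deps.items():
            for dep in d:
                successors[dep].append(n)
        ready = queue.Queue()            # shared queue holding ONLY ready threads
        lock = threading.Lock()
        remaining = [len(self.funcs)]

        def dispatcher():
            while True:
                name = ready.get()
                if name is None:         # shutdown marker
                    return
                self.funcs[name]()       # run to completion
                with lock:
                    for nxt in successors[name]:
                        pending[nxt] -= 1
                        if pending[nxt] == 0:
                            ready.put(nxt)
                    remaining[0] -= 1
                    if remaining[0] == 0:
                        for _ in range(num_cores):
                            ready.put(None)

        for n, count in pending.items():
            if count == 0:
                ready.put(n)
        cores = [threading.Thread(target=dispatcher) for _ in range(num_cores)]
        for core in cores:
            core.start()
        for core in cores:
            core.join()


def build_app(results):
    """The same application description is used for every core count."""
    app = Scheduler()
    data = {}
    app.register("a", lambda: data.update(a=2))
    app.register("b", lambda: data.update(b=3))
    app.register("sum", lambda: results.append(data["a"] + data["b"]),
                 deps=("a", "b"))
    return app


outputs = {}
for num_cores in (1, 2, 4):
    results = []
    build_app(results).run(num_cores)
    outputs[num_cores] = results
```

Only the call to `run` takes the core count, which mirrors the slide's claim that the scheduler, not the application, absorbs changes in the number of cores.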
Designing a many-core processor
Design goals for a many-core processor
–Achieve scalable performance
–Reuse existing software written for a multi-core processor
  A many-core processor has to execute existing software efficiently
  Knowledge of the software is absolutely necessary
Software engineers and hardware engineers collaborated closely to design the many-core processor
Design cycles
–Use a "Plan – Evaluate – Analyze – Improve" cycle
–Existing software is used throughout evaluation
–In the 1st cycle: detect issues in the existing architecture
–In the 2nd cycle: improve and optimize
Main design features from our development cycle
–CAS instruction, multi-bank L2 cache, tree-based network on chip
[Figure: design cycle: Plan, Evaluate using Simulation, Analyze, Improve]
Evaluation results
Used the SAME application binary even when the number of cores is changed
[Figure: evaluation plots for H.264 decoding (1080p) and super resolution (full HD to 4K2K). Annotations: "Scalable Performance"; "Lack of READY threads: # of ready threads < # of MPEs"]
These results confirm that the proposed thread scheduler achieves scalable performance with transparency!
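The "lack of READY threads" effect can be illustrated with a toy model. The workload below is hypothetical and is not the measured data from the slide: it assumes unit-time threads arranged in stages, where every thread of a stage must finish before the next stage becomes ready, so a stage with fewer ready threads than cores leaves some cores idle.

```python
import math


def steps_needed(ready_per_stage, num_cores):
    """Time steps to finish all stages when each core runs one thread per step."""
    return sum(math.ceil(ready / num_cores) for ready in ready_per_stage)


def speedup(ready_per_stage, num_cores):
    """Speedup over a single core for the same staged workload."""
    return steps_needed(ready_per_stage, 1) / steps_needed(ready_per_stage, num_cores)


# Hypothetical workload: 10 stages, each exposing 8 ready threads.
stages = [8] * 10
table = {n: speedup(stages, n) for n in (1, 2, 4, 8, 16, 32)}
# table == {1: 1.0, 2: 2.0, 4: 4.0, 8: 8.0, 16: 8.0, 32: 8.0}
```

In this model the speedup grows in proportion to the core count until the cores outnumber the ready threads, after which extra cores add nothing.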
Conclusions
Proposed a low-overhead thread scheduler
–It achieves scalable performance and transparency
–Reduces thread execution overheads
  Defined unique properties for a thread: a thread never suspends, and a thread becomes ready when all necessary data are available
  Managed thread status by the number of dependencies
–Hides the number of cores
  Designed a distributed scheduler with a shared queue
Confirmed performance scalability and transparency
–Evaluated on a real 32-core many-core processor
–Scalable performance is achieved without modifying the application program
Our scheduler contributes to reducing software development cost