Copyright 2013, Toshiba Corporation. DAC2013 Designer/User Track Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core.

Presentation transcript:

Copyright 2013, Toshiba Corporation. DAC2013 Designer/User Track
Scalability Achievement by Low-Overhead, Transparent Threads on an Embedded Many-Core Processor
Takeshi Kodaka, Akira Takeda, Shunsuke Sasaki, Akira Yokosawa, Toshiki Kizu, Takahiro Tokuyoshi, Hui Xu, Toru Sano, Hiroyuki Usui, Jun Tanabe, Takashi Miyamori and Nobu Matsumoto
Center for Semiconductor Research and Development, Toshiba Corporation

2 DAC2013 Background
Requirements for embedded processors
–Various types of processing
  Video codecs (HEVC, H.264, MPEG-2, WMV, ...)
  Face detection/recognition, audio/video playback, mobile TV
–Wide range of required processing performance
  Should deal with various types of products, from mobile phones to tablets or more
  Example: video decoding from QVGA 15fps to 1080p 60fps or more
–Low-cost, short-time development that meets market requirements
  Reuse existing software to reduce development cost

3 DAC2013 Challenges
What kind of hardware architecture to employ?
–The number of cores should be easily increased/decreased
How can we realize the scalable performance?
–Parallelized application program that utilizes multiple cores efficiently
How can we realize the transparency?
–Hiding the number of cores from the application program
Multiple Core Architecture [xu2012low]
Our Proposed Scheduler
[xu2012low] A low power many-core SoC with two 32-core clusters connected by tree based NoC for multimedia applications, H. Xu, et al. VLSI Symposium 2012

4 DAC2013 Our approach
A simple multiple core architecture
+ An application program independent of # of cores
+ An efficient parallel processing scheme
→ Achieving scalable performance

5 DAC2013 Strategy to realize our approach
Strategy
–Developing an application independent of # of cores → transparency
–Running the developed application on a multiple-core processor and achieving scalable performance proportional to # of cores → scalable performance
Scheme
–Designed an efficient thread scheduler
  Efficient management of threads may achieve scalable performance
  The number of cores may be hidden if a thread scheduler abstracts the cores
Challenges
–Minimizing overheads for execution
–Hiding the number of cores from the application program

6 DAC2013 How to minimize overheads
Defined unique properties for threads
–A thread never suspends to wait for data
  This eliminates the overhead of thread switching
–A thread becomes ready to run when all necessary data are available
Managed thread status using simple counters
–Simplify the dependency into "the number of dependencies"
  This can be realized by simple operations
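The two thread properties and the dependency counter on this slide can be illustrated with a minimal sketch. It is not the scheduler from the presentation: the `Thread` class, `connect`, `complete`, and `run_all` are hypothetical names, and a single simulated core is assumed so that only the counting logic is visible.

```python
class Thread:
    """A run-to-completion unit of work (hypothetical model, not the real API)."""

    def __init__(self, name, num_dependencies):
        self.name = name
        self.remaining = num_dependencies  # "the number of dependencies"
        self.successors = []               # threads that consume this thread's output


def connect(producer, consumer):
    """Record that `consumer` needs data produced by `producer`."""
    producer.successors.append(consumer)


def complete(thread, ready_queue):
    """When `thread` finishes, decrement the counter of each successor."""
    for succ in thread.successors:
        succ.remaining -= 1
        if succ.remaining == 0:      # all necessary data are now available
            ready_queue.append(succ)


def run_all(threads):
    """Run every thread on one simulated core and return the execution order."""
    ready_queue = [t for t in threads if t.remaining == 0]
    order = []
    while ready_queue:
        t = ready_queue.pop(0)       # a ready thread never has to wait for data
        order.append(t.name)
        complete(t, ready_queue)
    return order


# Example: C needs the outputs of both A and B; D needs the output of C.
a, b, c, d = Thread("A", 0), Thread("B", 0), Thread("C", 2), Thread("D", 1)
connect(a, c)
connect(b, c)
connect(c, d)
order = run_all([a, b, c, d])
print(order)  # ['A', 'B', 'C', 'D']
```

Because a thread enters the queue only once its counter reaches zero, it can always run to completion, so no context needs to be saved or restored mid-thread in this model.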

7 DAC2013 How to hide the number of cores
Designed a distributed scheduler with a shared queue
–ONLY ready threads are placed in a shared queue
–A Thread Dispatcher runs on each core
–The dispatcher fetches a thread from the shared queue and executes it
To reduce access conflicts on the shared queue
–We use the CAS (Compare And Swap) instruction
[Figure: the Thread Dispatcher on each core searches the shared queue, then fetches and executes a thread]
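A rough sketch of per-core dispatchers claiming work from one shared queue with a compare-and-swap retry loop follows. Several things here are assumptions made for illustration: Python has no CAS instruction, so the `AtomicInt` class emulates one with a lock; the ready queue is filled in advance rather than fed by a dependency controller; and all names are hypothetical.

```python
import threading


class AtomicInt:
    """Emulates a CAS-capable word; the lock only stands in for hardware atomicity."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the current value equals `expected`."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False


def dispatcher(core_id, ready_queue, head, executed):
    """Per-core loop: claim the next ready thread with CAS, then execute it."""
    while True:
        index = head.load()
        if index >= len(ready_queue):
            return                        # no ready thread left
        if head.compare_and_swap(index, index + 1):
            work = ready_queue[index]     # this core now owns slot `index`
            executed.append((core_id, work()))
        # On CAS failure another core claimed the slot; retry with a fresh index.


def run(num_cores, ready_queue):
    """Start one dispatcher per simulated core and collect what they executed."""
    head = AtomicInt(0)
    executed = []                         # list.append is thread-safe in CPython
    cores = [
        threading.Thread(target=dispatcher, args=(i, ready_queue, head, executed))
        for i in range(num_cores)
    ]
    for core in cores:
        core.start()
    for core in cores:
        core.join()
    return executed


# The same "application" (16 ready threads) runs unchanged on 1, 2, or 4 cores.
ready_queue = [lambda n=n: n * n for n in range(16)]
results = {cores: sorted(v for _, v in run(cores, ready_queue)) for cores in (1, 2, 4)}
```

The work items never refer to a core count; only `run` does, which is the sense in which the dispatcher hides the number of cores from the application in this sketch.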

8 DAC2013 Implemented thread scheduler
Our Thread Scheduler consists of three components
–Dependency Controller, Thread Pool, and Thread Dispatcher
Our Thread Scheduler...
–is low overhead, for Scalable Performance
–hides the number of cores from the application, for Transparency
[Figure: Thread Scheduler block diagram. Labels: Appl., register, Dependency Controller, Thread Pool (available/necessary counts per thread, ready), Thread Dispatcher on each core, fetch & execute]

9 DAC2013 Designing a many-core processor
Design goals for a many-core processor
–Achieve scalable performance
–Reuse existing software for a multi-core processor
  A many-core processor has to execute existing software efficiently
  Knowledge of the software is absolutely necessary
Software engineers and hardware engineers collaborated closely to design a many-core processor
Design cycles
–Use a "Plan – Evaluate – Analyze – Improve" cycle
–Existing software is used throughout evaluation
–At the 1st cycle: detect issues of the existing architecture
–At the 2nd cycle: improve and optimize
Main design features from our development cycle
–CAS instruction, multi-bank L2 cache, tree-based network on chip, ...
[Figure: design cycle. Plan, Evaluate using Simulation, Analyze, Improve]

10 DAC2013 Evaluation results
Used the SAME application binary even if the number of cores is changed
These results confirm that the proposed thread scheduler achieves scalable performance with transparency!
[Figure: evaluation plots. Panels: H.264 Decoding 1080p; Super resolution (full HD to 4K2K). Annotations: "Scalable Performance"; "Lack of READY threads: # of ready threads < # of MPEs"]
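The "lack of READY threads" annotation can be made concrete with a toy model. The sketch below does not reproduce the measured results; it assumes unit-time threads, a hypothetical workload of four independent chains of eight threads, and a made-up function name `steps_to_finish`.

```python
def steps_to_finish(num_deps, successors, num_cores):
    """Count time steps needed to run a unit-time task graph on `num_cores` cores."""
    remaining = dict(num_deps)
    ready = [t for t, n in remaining.items() if n == 0]
    steps = 0
    while ready:
        # Each step, at most `num_cores` ready threads execute.
        running, ready = ready[:num_cores], ready[num_cores:]
        steps += 1
        for t in running:
            for s in successors.get(t, []):
                remaining[s] -= 1
                if remaining[s] == 0:
                    ready.append(s)
    return steps


# Hypothetical workload: 4 independent chains, 8 threads per chain.
num_deps, successors = {}, {}
for chain in range(4):
    for i in range(8):
        num_deps[(chain, i)] = 0 if i == 0 else 1
        if i < 7:
            successors[(chain, i)] = [(chain, i + 1)]

steps = {cores: steps_to_finish(num_deps, successors, cores) for cores in (1, 2, 4, 8)}
speedup = {cores: steps[1] / steps[cores] for cores in steps}
print(steps)    # {1: 32, 2: 16, 4: 8, 8: 8}
print(speedup)  # {1: 1.0, 2: 2.0, 4: 4.0, 8: 4.0}
```

In this model the speedup stops at 4 because only four threads are ever ready at once, so the extra cores in the eight-core run have nothing to fetch.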

11 DAC2013 Conclusions
Proposed a low-overhead thread scheduler
–It achieves scalable performance and transparency
–Reduces thread execution overheads
  Defined unique properties for a thread: a thread never suspends; a thread becomes ready when all necessary data are available
  Managed thread status by the number of dependencies
–Hides the number of cores
  Designed a distributed scheduler with a shared queue
Confirmed performance scalability and transparency
–Evaluated on a real 32-core many-core processor
–Scalable performance is achieved without modification of the application program
Our scheduler contributes to the reduction of software development cost