IsoNet: Hardware-Based Job Queue Management for Many-Core Architectures

In this project, I present IsoNet, a hardware-based, conflict-free dynamic load distribution and balancing engine that provides balanced, job-queue-based workload distribution for many-core architectures.

• Complex and monolithic superscalar microprocessor designs have recently begun to give way to arrays of leaner and simpler processing cores working in unison to exploit thread-level parallelism. This paradigm shift has marked the genesis of the multicore era, the embodiment of which is the chip multiprocessor (CMP).
• The most popular programming model for multicore systems is multithreading, whereby a programmer can parallelize an application by spawning a separate thread for each parallel task.

• Due to the small size of each task, the overhead of spawning threads and switching between them becomes unwarranted. Here, a thread comprises a set of instructions and the execution state of a program, whereas a job is composed of a set of data to be processed by a thread.

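As a rough illustration of this thread-per-task model (my own sketch, not code from the IsoNet work; the task function and task count are made up), each parallel task is simply handed to its own thread:

    #include <pthread.h>
    #include <stdio.h>

    #define NUM_TASKS 8                /* hypothetical number of parallel tasks */

    /* Hypothetical per-task work: each thread processes exactly one task. */
    static void *task_worker(void *arg) {
        int task_id = *(int *)arg;
        printf("processing task %d\n", task_id);
        return NULL;
    }

    int main(void) {
        pthread_t threads[NUM_TASKS];
        int ids[NUM_TASKS];

        /* Spawn a separate thread for each parallel task ... */
        for (int i = 0; i < NUM_TASKS; i++) {
            ids[i] = i;
            pthread_create(&threads[i], NULL, task_worker, &ids[i]);
        }
        /* ... and wait for all of them to complete. */
        for (int i = 0; i < NUM_TASKS; i++)
            pthread_join(threads[i], NULL);
        return 0;
    }

For fine-grained tasks, the thread creation and join overhead in this pattern quickly outweighs the useful work, which is what motivates explicit job queues.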
With fine-grained parallelism, a programmer parallelizes an application by spawning a separate thread for each parallel task. The job queue technique that feeds these threads may be centralized or distributed, and it may be implemented in hardware or software.
SOFTWARE TECHNIQUE: All operations are orchestrated in software, with minimal reliance on hardware support; the atomicity of queue operations depends entirely on software.
HARDWARE TECHNIQUE: Dedicated hardware tackles the problem by substantially reducing the probability of conflicts, and scalability improves as a result.
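As a minimal sketch of the software technique (again my own illustration; names such as job_queue_t are assumptions, not the paper's code), a centralized job queue guarded by a single lock makes every push and pop a potential point of conflict among cores:

    #include <pthread.h>

    #define MAX_JOBS 1024

    /* Hypothetical centralized software job queue: a single lock guards it,
     * so every core contends on the same mutex when pushing or popping jobs. */
    typedef struct {
        int jobs[MAX_JOBS];
        int count;
        pthread_mutex_t lock;          /* the conflict point among cores */
    } job_queue_t;

    static void push_job(job_queue_t *q, int job) {
        pthread_mutex_lock(&q->lock);  /* serializes all cores */
        if (q->count < MAX_JOBS)
            q->jobs[q->count++] = job;
        pthread_mutex_unlock(&q->lock);
    }

    static int pop_job(job_queue_t *q, int *job) {
        int ok = 0;
        pthread_mutex_lock(&q->lock);  /* serializes all cores */
        if (q->count > 0) {
            *job = q->jobs[--q->count];
            ok = 1;
        }
        pthread_mutex_unlock(&q->lock);
        return ok;
    }

As the core count grows, contention on this one lock (or retry storms in lock-free variants built on atomic compare-and-swap) is exactly the conflict problem that a hardware technique such as IsoNet sets out to avoid.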

BLOCK DESCRIPTION: Three job queues are provided, and each queue is assigned a specific operation. Every queue is connected to the balancer, which gives the processor an instruction identifying which queue must be accessed first. The processor handles that queue according to the instruction, and the final output is obtained through the balancer.
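A rough software model of this block (purely illustrative; the selection rule and names are my assumptions, since the actual balancer is a hardware unit) could look like:

    #define NUM_QUEUES 3

    /* Hypothetical model of the balancer: tell the processor which of the
     * three job queues to service first, e.g., the one holding the most jobs. */
    static int balancer_select(const int job_count[NUM_QUEUES]) {
        int pick = 0;
        for (int q = 1; q < NUM_QUEUES; q++)
            if (job_count[q] > job_count[pick])
                pick = q;
        return pick;   /* index of the queue the processor should access first */
    }

In the real design this choice is made in hardware, so the processor only receives the index of the queue it should serve.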

• Thread creation comes at a cost, and the execution time of each thread is relatively short. Due to the small size of these tasks, the overhead of spawning new threads and switching between them becomes unwarranted.
• In software-based techniques the hardware's role is usually limited, so the fundamental impediment to scalability faced by these techniques is the occurrence of conflicts.
• Conflicts are still not completely eliminated, and they begin to dominate performance when the number of processing elements increases to many-core levels.
• Execution-driven simulators become prohibitively slow as the number of simulated processing cores increases beyond one hundred; therefore, it is practically impossible to simulate such systems with them.

The IsoNet architecture is fully implemented in a hardware description language (HDL) and is designed to complete any job transfer within a single IsoNet clock cycle. The design is then passed through a detailed application-specific integrated circuit (ASIC) design flow using commercial standard-cell libraries in 45-nm VLSI technology. IsoNet consists of load distribution and balancing modules that can swiftly transfer jobs between any two cores based on prevailing load conditions.

Overview of the enhanced IsoNet design that supports multiple job transfers per IsoNet cycle.

• To implement the local balancing approach, a local balancer must be added to the IsoNet logic, as shown.
• The local balancer collects the job counts from all of its nearest neighbors through a MUX.
• A comparator compares these job counts and selects the smallest one.
• If that smallest job count is smaller than the node's own job count, the node accepts the requestor's job count.
• The neighbor with the highest job count is selected through the MUX.

• The requestor issues its request to the neighbor MUX that has the highest job count; otherwise, the current node is selected as source or destination by the selected tree with the highest priority. If the neighbor replies with a grant, the requestor pops a job from the dual-clock stack and transfers it to the neighbor.
• The arbiter in the local balancer decides which request it should serve (a rough software sketch of these steps follows below).

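The local balancing and request/grant steps above could be sketched in software roughly as follows (an approximation under my own assumptions, e.g., a 2-D mesh with four neighbors and fixed-priority arbitration; the real logic is hardware inside the local balancer):

    #include <stdbool.h>

    #define NUM_NEIGHBORS 4   /* assumption: 2-D mesh neighbors (N, E, S, W) */

    /* Comparator stage: select the neighbor with the smallest job count and
     * report whether it is lighter than this node. */
    static int pick_lightest_neighbor(int own_count,
                                      const int neighbor_count[NUM_NEIGHBORS],
                                      bool *lighter_than_self) {
        int min_idx = 0;
        for (int n = 1; n < NUM_NEIGHBORS; n++)
            if (neighbor_count[n] < neighbor_count[min_idx])
                min_idx = n;
        *lighter_than_self = (neighbor_count[min_idx] < own_count);
        return min_idx;                /* index chosen by the MUX/comparator */
    }

    /* Arbiter stage: among pending requests from neighbors, grant exactly
     * one per cycle, here by fixed priority. */
    static int arbiter_grant(const bool request[NUM_NEIGHBORS]) {
        for (int n = 0; n < NUM_NEIGHBORS; n++)
            if (request[n])
                return n;              /* neighbor that receives the grant */
        return -1;                     /* nothing to serve this cycle */
    }

    /* Requestor side: once granted, pop one job from the dual-clock stack
     * and hand it to the granting neighbor. */
    static bool transfer_on_grant(bool granted, int *stack, int *top, int *job_out) {
        if (!granted || *top == 0)
            return false;
        *job_out = stack[--(*top)];    /* pop from this node's job stack */
        return true;                   /* the popped job moves to the neighbor */
    }

In the actual design these steps are carried out by the local balancer's hardware, so a job transfer completes within a single IsoNet cycle, as stated above.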
An IsoNet clock cycle is much longer than a CPU clock cycle. A trace-driven, cycle-accurate simulator is used to compare Carbon and IsoNet across different core counts. By exploiting a micro-network of load-balancing modules, the proposed mechanism is shown to effectively reinforce concurrent computation in many-core environments. The result is the realization of IsoNet, a lightweight on-chip micro-network of load distribution and balancing modules, whose single-cycle implementation can be achieved with minimal hardware cost.

APPLICATIONS:
• Processor-based applications, for example in the medical field.
• Technologies for vehicles such as electric cars.
• Back-up power supplies for alarms and smaller computer systems.
• Electric wheelchairs.
• Marine applications.
• Mining: calculating various skeleton parameters, including ash and calorific value, to determine the grades of coal.

• Xilinx ISE 9.1

References
[1] J. Lee, C. Nicopoulos, H. G. Lee, S. Panth, S. K. Lim, and J. Kim, "IsoNet: Hardware-based job queue management for many-core architectures," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 21, no. 6, June 2013.
[2] S. Kumar, C. Hughes, and A. Nguyen, "Carbon: Architectural support for fine-grained parallelism on chip multiprocessors," in Proc. 34th Annu. Int. Symp. Comput. Arch., 2007, pp. 162–173.
[3] L. Soares, C. Menier, B. Raffin, and J. L. Roch, "Work stealing for time-constrained octree exploration: Application to real-time 3D modeling," in Proc. Eurograph. Symp. Parallel Graph. Visualizat., 2007, pp. 1–9.
[4] D. Sanchez, R. M. Yoo, and C. Kozyrakis, "Flexible architectural support for fine-grain scheduling," in Proc. Int. Conf. Arch. Support Program. Lang. Operat. Syst., 2010, pp. 311–322.
[5] P. Dubey, "Recognition, mining and synthesis moves computers to the era of tera," Magazine, Feb.