Bottlenecks of SIMD. Haibin Wang, Wei Tong. Paper: Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements, IEEE.

Presentation transcript:

Bottlenecks of SIMD Haibin Wang, Wei Tong

Paper: Bottlenecks in Multimedia Processing with SIMD Style Extensions and Architectural Enhancements. IEEE TRANSACTIONS ON COMPUTERS, VOL. 52, NO. 8, AUGUST 2003. Deepu Talla, Member, IEEE, Lizy Kurian John, Senior Member, IEEE, and Doug Burger, Member, IEEE

Outline Introduction Bottlenecks Analysis MediaBreeze Architecture Summary

Introduction: Multimedia SIMD extensions are widely used to speed up media processing, but their efficiency is low: 75 to 85 percent of the dynamic instructions in the processor instruction stream are supporting instructions rather than core computation.
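A rough worked bound shows why the 75 to 85 percent figure matters. This is only a sketch: it assumes every dynamic instruction costs the same time, which the slides do not state, and the symbols f_o (fraction of supporting instructions) and s (additional speedup applied to the core computation only) are introduced here for illustration.

```latex
% Assumption (not from the slides): every dynamic instruction costs the same time.
% f_o = fraction of supporting (overhead) instructions,
% s   = additional speedup applied to the core computation only.
\[
  S(s) = \frac{1}{f_o + \dfrac{1 - f_o}{s}}, \qquad
  \lim_{s \to \infty} S(s) = \frac{1}{f_o}
\]
\[
  f_o = 0.75 \;\Rightarrow\; S < \frac{1}{0.75} \approx 1.33, \qquad
  f_o = 0.85 \;\Rightarrow\; S < \frac{1}{0.85} \approx 1.18
\]
```

Under that assumption, even infinitely wide SIMD units would gain at most about 1.2x to 1.3x unless the supporting instructions themselves are reduced.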

Introduction: The bottlenecks are caused by the loop structure and the access patterns of media programs. So instead of exploiting more data-level parallelism (DLP), the paper focuses on improving the efficiency of the instructions that support the core computation.

Introduction: This paper has two major contributions. First, it targets the supporting instructions, rather than the core computation, as the way to improve SIMD performance, which is a new angle. Second, it proposes the MediaBreeze architecture as a method to reduce or eliminate supporting instructions.

Nested Loop

The analysis of loop structure: The sub-blocks are very small, which limits the available DLP and requires many supporting instructions. There are 5 nested loops for every block, which wastes a lot of time on branches. The data must also be reorganized before SIMD instructions can be used.
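The branch cost of a deep loop nest over small sub-blocks can be illustrated with a small counting sketch. The function name, the 8x8 block size, the role given to each of the five loops, and the SIMD width are all assumptions made for illustration; they are not taken from the slides or the paper.

```python
def count_loop_overhead(height, width, block=8, simd_width=8):
    """Count core SIMD operations and loop-closing branches for a
    hypothetical five-deep loop nest over small sub-blocks."""
    simd_ops = 0
    branches = 0
    for by in range(0, height, block):              # loop 1: rows of blocks
        for bx in range(0, width, block):           # loop 2: columns of blocks
            for u in range(block):                  # loop 3: rows inside a block
                for v in range(block):              # loop 4: columns inside a block
                    for k in range(0, block, simd_width):  # loop 5: SIMD chunks
                        simd_ops += 1               # the only "useful" work
                        branches += 1               # branch closing loop 5
                    branches += 1                   # branch closing loop 4
                branches += 1                       # branch closing loop 3
            branches += 1                           # branch closing loop 2
        branches += 1                               # branch closing loop 1
    return simd_ops, branches


if __name__ == "__main__":
    ops, br = count_loop_overhead(16, 16)
    print(f"SIMD ops: {ops}, loop branches: {br}")
```

With an 8-element sub-block and 8-way SIMD, the innermost loop runs only once per entry, so in this toy model the loop branches alone outnumber the SIMD operations before any address or data reorganization work is counted.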

Access patterns

The addressing sequences are complex and make up a large part of the program, so many supporting instructions are needed to generate them. Using general-purpose instruction sets to generate multiple addressing sequences is not very efficient.

The overhead instructions: Address generation (address calculation); Address transformation (data movement, data reorganization); Loads and stores (memory); Branches (control transfer, for-loops).
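The address generation overhead described above can be pictured as a multi-level strided sequence, one level per loop. The sketch below is a hypothetical software model; `stream_addresses` and its parameters are illustrative names, not part of MediaBreeze or any real library. On a general-purpose instruction set, each address produced here would cost several add and multiply instructions.

```python
from itertools import product


def stream_addresses(base, counts, strides):
    """Yield the addresses of a nested-loop access pattern.

    counts[i] is the trip count of loop level i (outermost first) and
    strides[i] is the address step taken per iteration of that level.
    """
    if len(counts) != len(strides):
        raise ValueError("counts and strides must have the same length")
    for index in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(index, strides))


if __name__ == "__main__":
    # A 2x2 sub-block of an image whose rows are 16 elements apart.
    print(list(stream_addresses(0, (2, 2), (16, 1))))   # row-major walk
    print(list(stream_addresses(0, (2, 2), (1, 16))))   # column-major walk
```

Swapping the strides changes the walk from row-major to column-major without touching the loop code, which is the kind of pattern a dedicated address generator can produce without spending general-purpose instructions.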

Architecture

Instruction Structure

Breeze Instruction Mapping of 1D-DCT

Full Map: five branches; three loads and one store; four address value generations (one on each stream, with each address generation representing multiple RISC instructions); one SIMD operation (2-way to 16-way parallelism depending on each data element size); one accumulation of the SIMD result and one SIMD reduction operation; four SIMD data reorganization (pack/unpack, permute, etc.) operations; and shifting and saturation of SIMD results.
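The arithmetic items in this list (SIMD operation, accumulation, reduction, shifting and saturation) can be illustrated with a scalar model of one multiply-accumulate step, such as a single output of a 1D-DCT inner product. This is a sketch of the general technique only: the function name, the choice of multiplication as the SIMD operation, and the 16-bit signed result width are assumptions, not the actual Breeze instruction semantics.

```python
def mac_reduce_shift_saturate(a, b, shift, bits=16):
    """Model of multiply, reduce, shift and saturate on two vectors.

    a, b  : equal-length sequences of integers (the SIMD lanes)
    shift : arithmetic right shift applied to the reduced sum
    bits  : width of the signed result used for saturation (assumed 16)
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of lanes")
    products = [x * y for x, y in zip(a, b)]   # lane-wise SIMD multiply
    total = sum(products)                      # reduction across lanes
    shifted = total >> shift                   # arithmetic shift (rounds down)
    lo = -(1 << (bits - 1))                    # smallest representable value
    hi = (1 << (bits - 1)) - 1                 # largest representable value
    return max(lo, min(hi, shifted))           # saturate instead of wrapping


if __name__ == "__main__":
    print(mac_reduce_shift_saturate([1, 2, 3, 4], [5, 6, 7, 8], shift=1))
    print(mac_reduce_shift_saturate([32767] * 4, [32767] * 4, shift=0))
```

The second call overflows the assumed 16-bit range and is clamped to the largest representable value rather than wrapping around, which is the behavior saturation is meant to provide for media data.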

Performance Evaluation: cfa, dct, motest, scale, G711, decrypt, Aud, jpeg, ijpeg

Any improvement? Why not higher efficiency in cfa? Memory latency! Solution? Prefetch!

Evaluation: Advantages: eliminates or reduces overhead instructions; performs much better than a normal SIMD extension; costs 0.3% of processor area and less than 1% of total power consumption. Drawback: complicated instructions. Who will design a compiler for this?