Improvement of CT Slice Image Reconstruction Speed Using SIMD Technology Xingxing Wu Yi Zhang Instructor: Prof. Yu Hen Hu Department of Electrical & Computer Engineering University of Wisconsin, Madison

Motivation
CT slice image reconstruction is a critical stage: it determines both the quality of the reconstructed image and the overall scanning speed
CT slice image reconstruction is very time-consuming
Traditional approaches to speeding it up:
- Specially designed hardware
- Parallel algorithms running on supercomputers
This project explores a new approach: a SIMD implementation

Parallel-Beam FBP Image Reconstruction Algorithm
The algorithm consists of three parts: data rebinning, data filtering, and back-projection

Parallel-Beam FBP Image Reconstruction Algorithm
[Equations shown on the slide: projection, data rebinning, data filtering, data back-projection]
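
The slide's equations are not reproduced in the transcript. For reference, a standard textbook formulation of parallel-beam filtered back-projection, to which the three steps above correspond, looks roughly as follows; the exact discretization and the fan-to-parallel rebinning geometry used in this project are assumptions here.

% Rebinning: fan-beam measurement g(beta, gamma) -> parallel-beam sinogram p(theta, t),
% for a source at radius R:
\[ \theta = \beta + \gamma, \qquad t = R\sin\gamma, \qquad p(\theta, t) = g(\beta, \gamma) \]

% Filtering of each parallel projection with the ramp filter |omega|,
% where P_theta(omega) is the 1-D Fourier transform of p(theta, t):
\[ Q_\theta(t) = \int_{-\infty}^{\infty} P_\theta(\omega)\,\lvert\omega\rvert\, e^{\,j 2\pi \omega t}\, d\omega \]

% Back-projection of the filtered projections onto the image grid:
\[ f(x, y) = \int_{0}^{\pi} Q_\theta\!\left(x\cos\theta + y\sin\theta\right)\, d\theta \]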

CT Slice Image Reconstruction Is Very Time-Consuming
A whole-head spiral scan generates several GB of projection data

Function Profiling

Can FBP Algorithm Benefit from SIMD?
The algorithm has the following features (a sketch of such a loop follows below):
- Small, highly repetitive loops that operate on sequential arrays of integers and floating-point values
- Frequent multiplies and accumulates
- Computation-intensive algorithms
- Inherently parallel operations
- Wide dynamic range, hence floating-point based
- Regular memory access patterns
- Data-independent control flow
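
As an illustration (not code from the original slides), the inner loop of the filtering step is essentially a multiply-accumulate over sequential single-precision arrays, which is exactly the pattern SIMD units handle well. A minimal scalar C sketch with hypothetical names:

/* Scalar reference for the kind of loop SIMD targets: a convolution-style
 * multiply-accumulate over contiguous float arrays.
 * Names (rebinned, weight, filtered, nproj, ntaps) are illustrative;
 * rebinned is assumed to hold nproj + ntaps - 1 samples. */
void filter_projection_scalar(const float *rebinned, const float *weight,
                              float *filtered, int nproj, int ntaps)
{
    for (int i = 0; i < nproj; i++) {
        float acc = 0.0f;
        for (int k = 0; k < ntaps; k++) {
            acc += rebinned[i + k] * weight[k];   /* frequent multiply-accumulate */
        }
        filtered[i] = acc;
    }
}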

Analysis of Data Dynamic Range and Quantization Errors
- Wide dynamic range
- Relative error metric (one possible definition is sketched below)
- 32-bit single-precision floating point and SSE2
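
The slides do not spell out the relative error metric. A common choice, and presumably what is meant, is the element-wise relative difference between the single-precision (SSE2) result and a higher-precision reference:

\[ e_{\mathrm{rel}} = \frac{\lvert x_{\mathrm{float}} - x_{\mathrm{ref}} \rvert}{\lvert x_{\mathrm{ref}} \rvert} \]

with the maximum (or RMS) of e_rel over all reconstructed pixels used as the summary figure.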

Updated Algorithm to Fit SIMD
Update the algorithm to eliminate some conditional branches
Reduce the on-the-fly calculations that are not suitable for a SIMD implementation
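
As one hedged illustration of what eliminating a conditional branch can look like here (the slides do not show the actual transformation): a per-sample clamp written with if/else can be replaced by branch-free min/max operations, which keeps the loop body uniform across SIMD lanes. Names are hypothetical.

/* Branchy version: typically emits a data-dependent branch per sample. */
float clamp_branchy(float idx, float lo, float hi)
{
    if (idx < lo)      return lo;
    else if (idx > hi) return hi;
    else               return idx;
}

/* Branch-free version: compilers typically lower these ternaries to
 * MINSS/MAXSS (or MINPS/MAXPS once the loop is vectorized). */
float clamp_branchless(float idx, float lo, float hi)
{
    float t = (idx < lo) ? lo : idx;   /* min */
    return (t > hi) ? hi : t;          /* max */
}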

Parallel Implementation of Data Filtering in SIMD
[Slide diagram: eight rebinned-data samples A0..A7 are multiplied element-wise by eight filter weights B0..B7, and the products are summed pairwise (A0*B0+A4*B4, A1*B1+A5*B5, A2*B2+A6*B6, A3*B3+A7*B7) to produce the filtered data]
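
A minimal SSE intrinsics sketch of that pattern, assuming 16-byte-aligned float arrays and illustrative names (this reconstructs the diagram, it is not the authors' code):

#include <xmmintrin.h>   /* SSE: __m128, _mm_load_ps, _mm_mul_ps, _mm_add_ps */

/* Multiply eight rebinned samples by eight weights and reduce pairwise,
 * as in the slide: filtered[i] = A[i]*B[i] + A[i+4]*B[i+4], i = 0..3.
 * Both input pointers are assumed 16-byte aligned. */
void filter_step_sse(const float *rebinned /* A0..A7 */,
                     const float *weight   /* B0..B7 */,
                     float *filtered       /* 4 outputs */)
{
    __m128 a_lo = _mm_load_ps(rebinned);       /* A0 A1 A2 A3 */
    __m128 a_hi = _mm_load_ps(rebinned + 4);   /* A4 A5 A6 A7 */
    __m128 b_lo = _mm_load_ps(weight);         /* B0 B1 B2 B3 */
    __m128 b_hi = _mm_load_ps(weight + 4);     /* B4 B5 B6 B7 */

    __m128 prod_lo = _mm_mul_ps(a_lo, b_lo);   /* A0*B0 .. A3*B3 */
    __m128 prod_hi = _mm_mul_ps(a_hi, b_hi);   /* A4*B4 .. A7*B7 */

    _mm_store_ps(filtered, _mm_add_ps(prod_lo, prod_hi));
}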

Parallel Implementation of Backprojection in SIMD
[Slide diagram: for each group of four output pixels, the detector index is computed, split into floor(index) and ceil(index), the corresponding filtered-data samples are fetched, combined with interpolation weights, and accumulated into the reconstructed image]
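
A hedged SSE2 sketch of that step. SSE2 has no gather instruction, so the filtered-data fetches at the floor/ceil indices are done with scalar loads, while the index arithmetic and linear interpolation are vectorized. All names and the exact interpolation scheme are assumptions based on the diagram, not the authors' code.

#include <emmintrin.h>   /* SSE2 */

/* Back-project one view into 4 consecutive pixels.
 * t   : detector coordinate of pixel 0 for this view (hypothetical)
 * dt  : per-pixel increment of the detector coordinate
 * q   : filtered projection for this view (indices assumed in range)
 * img : 4 image accumulators, 16-byte aligned */
static void backproject4(float t, float dt, const float *q, float *img)
{
    /* index for each of the 4 pixels: t, t+dt, t+2dt, t+3dt */
    __m128 idx   = _mm_add_ps(_mm_set1_ps(t),
                              _mm_mul_ps(_mm_set1_ps(dt),
                                         _mm_setr_ps(0.0f, 1.0f, 2.0f, 3.0f)));
    __m128i lo_i = _mm_cvttps_epi32(idx);             /* floor(index) for idx >= 0 */
    __m128 lo    = _mm_cvtepi32_ps(lo_i);
    __m128 frac  = _mm_sub_ps(idx, lo);               /* interpolation weight */

    /* No gather in SSE2: fetch q[floor] and q[ceil] with scalar loads. */
    int i0 = _mm_cvtsi128_si32(lo_i);
    int i1 = _mm_cvtsi128_si32(_mm_srli_si128(lo_i, 4));
    int i2 = _mm_cvtsi128_si32(_mm_srli_si128(lo_i, 8));
    int i3 = _mm_cvtsi128_si32(_mm_srli_si128(lo_i, 12));
    __m128 qlo = _mm_setr_ps(q[i0],     q[i1],     q[i2],     q[i3]);
    __m128 qhi = _mm_setr_ps(q[i0 + 1], q[i1 + 1], q[i2 + 1], q[i3 + 1]);

    /* Linear interpolation (1-frac)*q[floor] + frac*q[ceil], then accumulate. */
    __m128 one    = _mm_set1_ps(1.0f);
    __m128 interp = _mm_add_ps(_mm_mul_ps(_mm_sub_ps(one, frac), qlo),
                               _mm_mul_ps(frac, qhi));
    _mm_store_ps(img, _mm_add_ps(_mm_load_ps(img), interp));
}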

Optimization of the Implementation
Optimize memory access:
- Ensure proper alignment to prevent data splitting across a cache-line boundary: data alignment, stack alignment, code alignment
- Observe store-forwarding constraints
- Optimize data-structure layout and data locality to make efficient use of the 64-byte cache line and to reduce the frequency of memory loads and stores
- Use prefetch and cacheability-control instructions appropriately
- Minimize bus latency by segmenting reads and writes into phases
Replace branches with logic operations
Optimize instruction scheduling
Optimize the parallelism:
- Loop unrolling
- Break dependence chains
(a sketch of a few of these techniques follows below)
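
A brief, hedged illustration of a few of the techniques listed above: aligned allocation, software prefetch, and 2x loop unrolling with two independent accumulators so consecutive adds do not form one long dependence chain. This is a generic sketch, not the project's actual code.

#include <xmmintrin.h>   /* _mm_malloc, _mm_free, _mm_prefetch, SSE intrinsics */

/* Sum a large, 16-byte-aligned float array with 4-wide SSE,
 * unrolled 2x with independent accumulators; n assumed a multiple of 8. */
float sum_sse(const float *data, int n)
{
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        _mm_prefetch((const char *)(data + i + 64), _MM_HINT_T0);  /* software prefetch */
        acc0 = _mm_add_ps(acc0, _mm_load_ps(data + i));
        acc1 = _mm_add_ps(acc1, _mm_load_ps(data + i + 4));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, _mm_add_ps(acc0, acc1));
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}

/* Aligned allocation so vector loads/stores never cross a cache-line split. */
float *alloc_aligned(int n)
{
    return (float *)_mm_malloc(n * sizeof(float), 16);   /* release with _mm_free() */
}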

Optimization of the Implementation (continued)
Optimize instruction selection:
- Avoid longer-latency instructions
- Avoid instructions that unnecessarily introduce dependence-related stalls
Optimize floating-point performance:
- Avoid exceeding the representable range
- Avoid changing the floating-point control/status register
- Enable flush-to-zero (FTZ) and denormals-are-zero (DAZ) modes (see the sketch below)
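
FTZ and DAZ are MXCSR bits; with SSE intrinsics they are typically enabled as shown below. This is a standard sketch that assumes discarding denormal values is acceptable, which the earlier quantization-error analysis would have to justify.

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (on processors that support DAZ) */

/* Enable flush-to-zero (denormal results become 0) and denormals-are-zero
 * (denormal inputs are treated as 0) so denormal handling does not trigger
 * slow microcode assists in the SSE units. */
static void enable_ftz_daz(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}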

Improvement of Performance
The differences between the reconstructed image pixel values of the C implementation and the SIMD implementation are less than 0.01