Massive Parallel LDPC Decoding on GPU. Gabriel Falcão, Leonel Sousa, Vitor Silva. Univ. of Coimbra and T. Univ. of Lisbon, Portugal.

Presentation transcript:

Salt Lake City, Feb 21st 2008, PPoPP08. MOTIVATION. LDPC decoding: intensive computation; irregular memory accesses. LDPC decoding using dedicated VLSI hardware: low area and low power consumption; high throughputs (Mbps) and low latency; fixed-point arithmetic. LDPC decoding on GPUs: GPU processing horsepower is available; CUDA programming interface; medium to high throughputs (Mbps); floating-point arithmetic; a flexible, software-based solution!

OUTLINE. Motivation; LDPC codes; Bit Node processing (BN); Check Node processing (CN); GPUs; CUDA interface; Experimental results; Conclusions and future work.

LDPC CODES. Advantages: linear block codes; perform close to the Shannon limit; high throughputs (Mbps); very low Bit Error Rate (BER). Disadvantages: good performance implies large H matrices; computationally intensive operations; large amounts of hardware; dedicated VLSI solutions are expensive. Bottom line: why not use the horsepower available on GPUs instead of developing expensive VLSI?

LDPC CODES. The parity-check matrix H defines the LDPC code. The Tanner graph represents the connections between bit nodes (BNs) and check nodes (CNs). [Figure: Tanner graph; labels CN1, BN1.]
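
To make the relation between the parity-check matrix and the Tanner graph concrete, here is a minimal Python sketch. The 3x6 matrix `H` and the helper names `tanner_graph` and `is_codeword` are assumptions made for illustration only; they do not come from the slides.

```python
# Hypothetical 3x6 parity-check matrix, chosen only for illustration.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def tanner_graph(H):
    """Return the adjacency lists (cn_to_bn, bn_to_cn) of the Tanner graph of H.

    An edge joins check node m and bit node n exactly when H[m][n] == 1.
    """
    num_cn, num_bn = len(H), len(H[0])
    cn_to_bn = [[n for n in range(num_bn) if H[m][n] == 1] for m in range(num_cn)]
    bn_to_cn = [[m for m in range(num_cn) if H[m][n] == 1] for n in range(num_bn)]
    return cn_to_bn, bn_to_cn


def is_codeword(H, c):
    """A word c belongs to the code when every parity check sums to 0 modulo 2."""
    return all(sum(h * b for h, b in zip(row, c)) % 2 == 0 for row in H)


cn_to_bn, bn_to_cn = tanner_graph(H)
print(cn_to_bn)  # bit nodes attached to each check node
print(bn_to_cn)  # check nodes attached to each bit node
```

Each row of `H` becomes one check node and each column one bit node, so the two adjacency lists are just the row-wise and column-wise views of the same set of edges.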

LDPC DECODER. BNs and CNs exchange messages (i.e., probabilities), allowing a reliable decision on each bit value.

CHECK NODE PROCESSING (CN). 1. Calculates the message r_mn going from CN m to BN n. [Figure: CN m receives q_im, q_jm and q_km from BN i, BN j and BN k, and sends r_mn to BN n.]
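
The slide's formula is not present in this transcript. For orientation, the textbook probability-domain sum-product update for this step is usually written as below; this is an assumption about what the slide showed, not a recovery of it. Here N(m) denotes the set of bit nodes connected to CN m, and q_{n'm}(1) is the incoming message from BN n' stating the probability that bit n' is 1.

```latex
\[
r_{mn}(0) = \frac{1}{2} + \frac{1}{2} \prod_{n' \in N(m) \setminus n} \bigl( 1 - 2\, q_{n'm}(1) \bigr),
\qquad
r_{mn}(1) = 1 - r_{mn}(0)
\]
```

The product runs over every bit node attached to CN m except the destination n, which is why the figure shows three incoming messages and one outgoing message.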

BIT NODE PROCESSING (BN). 2. Calculates the message q_nm sent from BN n to CN m, including the channel information P_n. 3. Then computes the a posteriori pseudo-probabilities and performs hard decoding. [Figure: BN n receives r_in, r_jn and r_kn from CN i, CN j and CN k, combines them with P_n, and sends q_nm to CN m.]
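
The formulas for steps 2 and 3 are also missing from this transcript. A common textbook form is sketched below under these assumptions: P_n is taken to be the channel probability that bit n is 1, M(n) is the set of check nodes connected to BN n, and k_{nm} and k_n are normalization constants chosen so that each pair of probabilities sums to 1.

```latex
\[
q_{nm}(0) = k_{nm}\,(1 - P_n) \prod_{m' \in M(n) \setminus m} r_{m'n}(0),
\qquad
q_{nm}(1) = k_{nm}\,P_n \prod_{m' \in M(n) \setminus m} r_{m'n}(1)
\]
\[
Q_n(0) = k_n\,(1 - P_n) \prod_{m \in M(n)} r_{mn}(0),
\qquad
Q_n(1) = k_n\,P_n \prod_{m \in M(n)} r_{mn}(1)
\]
\[
\hat{c}_n =
\begin{cases}
1 & \text{if } Q_n(1) > 0.5 \\
0 & \text{otherwise}
\end{cases}
\]
```

The only difference between the message q_{nm} and the pseudo-posterior Q_n is that the message leaves out the contribution of the destination check node m.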

INTENSIVE COMPUTING. "If you were plowing a field, which would you rather use? Two strong oxen or 1024 chickens?" -- Seymour Cray

GRAPHICS PROCESSING UNITS (GPUs). Raw compute power is increasing rapidly. Many-core architecture. Can be programmed outside the graphics framework: exposing parallelism; multi-threaded architecture using CUDA. Interest in GPP on GPUs: hard to program; needs an efficient interface. The GPU wins when arithmetic intensity is maximized… and loses with memory accesses!

SUM-PRODUCT ALGORITHM (SPA). Kernel 1: computes the messages sent from CN m to BN n (the probability of BN n being 0 or 1). Kernel 2: computes the messages sent from BN n to CN m.
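
The two-kernel structure can be illustrated with a small sequential Python sketch of the probability-domain SPA. This is not the authors' CUDA implementation: the function name `spa_decode`, the dictionary-based message storage, the example matrix and the channel probabilities are all assumptions chosen for illustration, and the code has no protection against numerical underflow or division by zero for extreme probabilities.

```python
def spa_decode(H, p1, max_iters=25):
    """Sequential sketch of the SPA; p1[n] is the channel probability that bit n is 1."""
    M, N = len(H), len(H[0])
    rows = [[n for n in range(N) if H[m][n]] for m in range(M)]  # BNs of each CN
    cols = [[m for m in range(M) if H[m][n]] for n in range(N)]  # CNs of each BN
    edges = [(m, n) for m in range(M) for n in rows[m]]
    q = {(m, n): p1[n] for (m, n) in edges}  # BN n -> CN m: P(bit n = 1)
    r = {}                                   # CN m -> BN n: P(bit n = 1)
    bits = [1 if p > 0.5 else 0 for p in p1]
    for _ in range(max_iters):
        # "Kernel 1": messages from CN m to BN n, using all other BNs of CN m.
        for m, n in edges:
            prod = 1.0
            for n2 in rows[m]:
                if n2 != n:
                    prod *= 1.0 - 2.0 * q[(m, n2)]
            r[(m, n)] = 0.5 - 0.5 * prod
        # "Kernel 2": messages from BN n to CN m, then the hard decision.
        for n in range(N):
            for m in cols[n]:
                a0, a1 = 1.0 - p1[n], p1[n]
                for m2 in cols[n]:
                    if m2 != m:
                        a0 *= 1.0 - r[(m2, n)]
                        a1 *= r[(m2, n)]
                q[(m, n)] = a1 / (a0 + a1)  # normalize so q(0) + q(1) = 1
            Q0, Q1 = 1.0 - p1[n], p1[n]
            for m in cols[n]:
                Q0 *= 1.0 - r[(m, n)]
                Q1 *= r[(m, n)]
            bits[n] = 1 if Q1 > Q0 else 0
        # Stop early once every parity check is satisfied.
        if all(sum(bits[n] for n in rows[m]) % 2 == 0 for m in range(M)):
            break
    return bits


# Hypothetical example: the all-zero codeword with one unreliable bit (index 2).
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
decoded = spa_decode(H, [0.1, 0.1, 0.6, 0.1, 0.1, 0.1])
print(decoded)
```

In this sketch every iteration of the two inner loops is independent of the others within the same phase, which is the property a GPU version would exploit by assigning the work to threads.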

COMPACT DATA STRUCTURES: H MATRIX. H is mapped into the compact data structures H_BN and H_CN:

    for all CN m do:  (rows in H)
        for all BN n do:  (columns in H)
            if H_mn == 1 then
                p_next = j : H_mj == 1,  // with n+1 < j < (n+N) mod N
                H_BN = p_next
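
One possible reading of the pseudocode is that each nonzero entry of H stores the index of the next nonzero entry in the same row, wrapping around circularly. The Python sketch below follows that reading. The name `build_compact`, the dictionaries keyed by edge, the helper `row_ring`, and the symmetric column-wise construction of `h_cn` are assumptions; the slide gives no memory layout and shows pseudocode only for H_BN.

```python
def build_compact(H):
    """Build circular 'next nonzero' links along rows (h_bn) and columns (h_cn)."""
    M, N = len(H), len(H[0])
    h_bn = {}  # (m, n) -> next column j in row m with H[m][j] == 1, circular
    h_cn = {}  # (m, n) -> next row i in column n with H[i][n] == 1, circular
    for m in range(M):
        for n in range(N):
            if H[m][n] != 1:
                continue
            # Scan columns n+1, n+2, ... modulo N until the next 1 in row m.
            for step in range(1, N + 1):
                j = (n + step) % N
                if H[m][j] == 1:
                    h_bn[(m, n)] = j
                    break
            # Scan rows m+1, m+2, ... modulo M until the next 1 in column n.
            for step in range(1, M + 1):
                i = (m + step) % M
                if H[i][n] == 1:
                    h_cn[(m, n)] = i
                    break
    return h_bn, h_cn


def row_ring(h_bn, m, start):
    """Follow the circular links of row m, beginning at column 'start'."""
    ring, n = [start], h_bn[(m, start)]
    while n != start:
        ring.append(n)
        n = h_bn[(m, n)]
    return ring


# Hypothetical 3x6 parity-check matrix, chosen only for illustration.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]
h_bn, h_cn = build_compact(H)
print(row_ring(h_bn, 0, 0))
```

Because the links are circular, a traversal can begin at any nonzero entry of a row and still visit every neighbor, and an entry that is alone in its column simply points back to itself.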

COMPUTING KERNELS ON THE GPU. A novel multi-thread SPA computing approach. The SPA is performed iteratively by several kernels on the GPU. Flow control and execution management of the kernels are performed by the CUDA programming interface.

CUDA INTERFACE FOR GPGPU. C-based programming interface for NVIDIA's 8x series and next generation. CUDA enables efficient use of their massive parallelism. Multi-threading hides latency problems. Allows transparent programming. Slow global memory and fast shared memory access. Avoid non-coalesced memory accesses. Significant speedups, depending on the algorithm. Hard challenge: irregular memory access patterns!

MULTI-THREAD COMPUTING APPROACH. Multi-thread strategy and architecture.

MULTI-THREAD COMPUTING APPROACH. The circular addressing mechanism allows an increase in parallelism.

MULTI-THREAD COMPUTING APPROACH.

EXPERIMENTAL RESULTS. [Table: CPU and GPU results for 25, 50 and 100 iterations, by matrix size (512x…); values not recoverable.] Main conclusions (… obtained from the matrices we considered using CUDA): much faster processing than on top-notch CPUs; supports floating-point operations; achieves medium to large throughputs; BUT MOST DEFINITELY NOT AS GREAT AS WE HOPED!

CONCLUSIONS AND FUTURE WORK. GPGPU approach for LDPC decoding: new compact data structures to represent the H matrix; multi-thread algorithm for LDPC decoding. Significant speedups achieved with the CUDA programming interface: up to 22. GPUs allow a software-based, scalable and low-cost solution. Trading task parallelism for data parallelism. Adoption/generalization of the proposed approach (algorithms and data structures) for irregular processing in graphs.

CONCLUSIONS. Gabriel Falcão, University of Coimbra and Technical University of Lisbon, Portugal.