Massively LDPC Decoding on Multicore Architectures. Presented by: fakewen.


Authors: Gabriel Falcão, Leonel Sousa, Vitor Silva

Outline: Introduction; Belief Propagation; Data Structures and Parallel Computing Models; Parallelizing the Kernels Execution; Experimental Results

Introduction: LDPC decoders were developed on recent multicore architectures, such as off-the-shelf general-purpose x86 processors, Graphics Processing Units (GPUs), and the CELL Broadband Engine (CELL/B.E.).

Outline: Introduction; Belief Propagation; Data Structures and Parallel Computing Models; Parallelizing the Kernels Execution; Experimental Results

Belief Propagation: belief propagation, also known as the sum-product algorithm (SPA), is an iterative algorithm for the computation of joint probabilities.

LDPC Decoding: exploits the probabilistic relationships between nodes imposed by the parity-check conditions, which allow inferring the most likely transmitted codeword.

LDPC Decoding (cont.): white Gaussian noise

LDPC Decoding (cont.)

Complexity

Forward and Backward Recursions: a reduction in memory access operations is registered, which contributes to increase the ratio of arithmetic operations per memory access.

Outline: Introduction; Belief Propagation; Data Structures and Parallel Computing Models; Parallelizing the Kernels Execution; Experimental Results

Data Structures and Parallel Computing Models: compact data structures to represent the H matrix.

Data Structures separately code the information about H in two independent data streams, and

Reminder: r_mn is the message from CN_m to BN_n; q_nm is the message from BN_n to CN_m.
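A minimal sketch of what a compact, stream-oriented representation of a sparse binary H matrix can look like: for each check node (row), only the indices of the connected bit nodes are stored, CSR-style. The struct and field names below are illustrative assumptions, not the paper's exact layout.

```c
/* Compact representation of a binary parity-check matrix H:
 * for each check node (row) we keep only the indices of the
 * bit nodes (columns) where H is 1. A second, analogous stream
 * (not shown) would encode the column-wise view for BN updates. */
typedef struct {
    int  num_checks;   /* number of check nodes (rows)              */
    int *row_start;    /* edges of CN m are row_start[m]..row_start[m+1]-1 */
    int *bn_index;     /* bit-node index of each nonzero edge       */
} HStreamCN;

/* Degree of check node m, i.e. how many bit nodes it updates. */
int cn_degree(const HStreamCN *h, int m) {
    return h->row_start[m + 1] - h->row_start[m];
}
```

For example, a 2x4 matrix with ones at columns {0,1,2} of row 0 and {1,2,3} of row 1 is encoded as `row_start = {0,3,6}` and `bn_index = {0,1,2, 1,2,3}`.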

Parallel Computational Models Parallel Features of the General-Purpose Multicores Parallel Features of the GPU Parallel Features of the CELL/B.E.

Parallel Features of the General-Purpose Multicores: #pragma omp parallel for

Parallel Features of the GPU

Throughput

Parallel Features of the CELL/B.E.

Throughput

Outline: Introduction; Belief Propagation; Data Structures and Parallel Computing Models; Parallelizing the Kernels Execution; Experimental Results

Parallelizing the Kernels Execution: the multicores using OpenMP; the GPU using CUDA; the CELL/B.E.

The Multicores Using OpenMP

The GPU Using CUDA: programming the grid using a thread-per-node approach.
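A serial C sketch of the thread-per-node mapping for the vertical (bit-node) step: in the CUDA kernel the outer loop disappears and each bit node n is handled by one thread, with `n = blockIdx.x * blockDim.x + threadIdx.x`. The `col_start` edge layout and the message formulas are illustrative assumptions, not the paper's exact kernel.

```c
/* Vertical (bit-node) processing: q_nm = Lch_n + sum of r over
 * all *other* checks connected to bit node n. Computing the full
 * sum once and subtracting the own edge avoids an inner loop per
 * edge. In CUDA, each iteration of the outer loop is one thread. */
void vertical_step(int num_bits, float *q, const float *r,
                   const float *lch, const int *col_start) {
    for (int n = 0; n < num_bits; ++n) {   /* one GPU thread each */
        float total = lch[n];
        for (int e = col_start[n]; e < col_start[n + 1]; ++e)
            total += r[e];
        for (int e = col_start[n]; e < col_start[n + 1]; ++e)
            q[e] = total - r[e];           /* exclude own edge */
    }
}
```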

The GPU Using CUDA (cont.): coalesced memory accesses.
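Coalescing means neighboring threads of a warp should read neighboring addresses. One common way to achieve this, sketched below under an assumed two-field edge record, is converting an array-of-structures (AoS) layout to structure-of-arrays (SoA), so that thread t's load of a field lands at `field_base[t]` and a warp's loads are contiguous.

```c
/* Illustrative edge record: interleaved (AoS) storage makes
 * thread t read in[t].r at a strided address; the SoA copies
 * below put consecutive threads on consecutive addresses. */
typedef struct { float r; float q; } EdgeAoS;

void aos_to_soa(const EdgeAoS *in, float *r_soa, float *q_soa, int n) {
    for (int i = 0; i < n; ++i) {  /* thread i then reads r_soa[i] */
        r_soa[i] = in[i].r;
        q_soa[i] = in[i].q;
    }
}
```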

The CELL/B.E.: Small Single-SPE Model (A, B, C); Large Single-SPE Model.

Why the Single-SPE Model: in the single-SPE model, the number of communications between the PPE and the SPEs is minimal, and the PPE is relieved from the costly task of reorganizing data (the sorting procedure in Algorithm 4) between data transfers to the SPE.

Outline: Introduction; Belief Propagation; Data Structures and Parallel Computing Models; Parallelizing the Kernels Execution; Experimental Results

Experimental Results: LDPC decoding on the general-purpose x86 multicores using OpenMP; LDPC decoding on the CELL/B.E. (Small Single-SPE Model, Large Single-SPE Model); LDPC decoding on the GPU using CUDA.

LDPC Decoding on the General-Purpose x86 Multicores Using OpenMP

LDPC Decoding on the CELL/B.E.

LDPC Decoding on the CELL/B.E. (cont.)

LDPC Decoding on the GPU Using CUDA

The end Thank you~

Forward and backward (author's explanation): I can do better than that. I can send you the MSc thesis of a former student of ours who graduated 5 years ago; she explains the basic concept in detail.

Basically, when you are performing the horizontal processing (the same applies to the vertical one) and you have a CN updating all the BNs connected to it, the F&B optimization exploits the fact that you only have to read all the BNs' information (probabilities, in the case of the SPA) once per CN. This gives you tremendous gains in computation time, since you save many memory accesses, which, as you know, are the main bottleneck in parallel computing.

Quite shortly, imagine you have one CN updating 6 BNs, BN0 to BN5 (horizontal processing), and that BN0 holds information A, BN1 = B, BN2 = C, ..., BN5 = F. Then, to update the corresponding r_mn elements, for each BN you have to calculate:

BN0 = B×C×D×E×F
BN1 = A×C×D×E×F
BN2 = A×B×D×E×F
...
BN5 = A×B×C×D×E

where each BN contributes to the update of its neighbors, but does not contribute to the update of itself. So, the F&B optimization allows you to read A, B, C, D, E, and F only once from memory and produce all the intermediate values necessary to update all the BNs connected to that CN. You save memory accesses (very important!) and processing too.
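The exclusion products described above (each BN updated with the product of all the other BNs' values) can be produced in linear time with one forward pass and one backward pass of running products, rather than recomputing a d-1 term product per edge. A minimal sketch, with the A..F values of the example mapped to v[0..5]:

```c
#define MAX_DEG 32   /* assumed bound on check-node degree */

/* Forward-and-backward exclusion products:
 *   fwd[i] = v[0]*...*v[i]   (forward recursion)
 *   bwd[i] = v[i]*...*v[d-1] (backward recursion)
 *   out[i] = fwd[i-1] * bwd[i+1]  -> product of all values except v[i].
 * Each v[i] is read from memory once instead of d-1 times. */
void exclusive_products(const float *v, float *out, int d) {
    float fwd[MAX_DEG], bwd[MAX_DEG];
    fwd[0] = v[0];
    for (int i = 1; i < d; ++i)
        fwd[i] = fwd[i - 1] * v[i];
    bwd[d - 1] = v[d - 1];
    for (int i = d - 2; i >= 0; --i)
        bwd[i] = bwd[i + 1] * v[i];
    for (int i = 0; i < d; ++i) {
        float left  = (i > 0)     ? fwd[i - 1] : 1.0f;
        float right = (i < d - 1) ? bwd[i + 1] : 1.0f;
        out[i] = left * right;
    }
}
```

This is three O(d) passes over the edge data instead of the O(d²) reads of the naive per-edge products, which is exactly where the memory-access savings come from.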