HPEC_GPU_DECODE-1 ADC 8/6/2015 MIT Lincoln Laboratory GPU Accelerated Decoding of High Performance Error Correcting Codes Andrew D. Copeland, Nicholas B. Chang, and Stephen Leung. This work is sponsored by the Department of the Air Force under Air Force contract FA C. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the United States Government.

MIT Lincoln Laboratory HPEC_GPU_DECODE-2 ADC 8/6/2015 Outline
Introduction
– Wireless Communications
– Performance vs. Complexity
– Candidate Implementations
Low Density Parity Check (LDPC) Codes
NVIDIA GPU Architecture
GPU Implementation
Results
Summary

MIT Lincoln Laboratory HPEC_GPU_DECODE-3–6 ADC 8/6/2015 Wireless Communications
[Block diagram, generic communication system: Data → Coding → Modulation → Channel (multipath, noise, interference) → Demodulation → Decoding. Our focus is the decoding block; the baseline uses a simple error correcting code, the alternative uses high performance codes.]
– High performance codes allow for larger link distance and/or higher data rate
– Superior performance comes with a significant increase in complexity

MIT Lincoln Laboratory HPEC_GPU_DECODE-7–8 ADC 8/6/2015 Performance vs. Complexity
[Plot: performance vs. computational complexity for candidate codes, spanning commercial interests to custom designs: 4-, 8-, 16-, 32-, and 64-state STTC, GF(2) LDPC, BICM-ID, and direct GF(256) LDPC. "Our Choice" marks the GF(256) LDPC code.]
– Increased performance requires larger computational cost
– Off-the-shelf systems require 5 dB more power to close the link than a custom system
– The best custom designs require only 0.5 dB more power than the theoretical minimum

MIT Lincoln Laboratory HPEC_GPU_DECODE-9–13 ADC 8/6/2015 Candidate Implementations
Half second burst of data
– Decode complexity: 37 TFLOPS
– Memory transferred during decode: 44 TB
Implementation options:
PC: implementation time (beyond development) -; performance (decode time) 9 days; hardware cost ≈ $2,000
9x FPGA: implementation time 1 person-year (est.); decode time 5.1 minutes (est.); hardware cost ≈ $90,000
ASIC: implementation time 1.5 person-years (est.) + 1 year to fab; decode time 0.5 s (if possible); hardware cost ≈ $1 million
4x GPU: implementation time ?; decode time ?; hardware cost ≈ $5,000 (includes PC cost)

MIT Lincoln Laboratory HPEC_GPU_DECODE-14 ADC 8/6/2015 Outline
Introduction
Low Density Parity Check (LDPC) Codes
– Decoding LDPC GF(256) Codes
– Algorithmic Demands of LDPC Decoder
NVIDIA GPU Architecture
GPU Implementation
Results
Summary

MIT Lincoln Laboratory HPEC_GPU_DECODE-15 ADC 8/6/2015 Low Density Parity Check (LDPC) Codes
[Diagram: the N Checks × N Symbols parity-check matrix multiplied by the codeword yields the syndrome; Tanner graph with symbol nodes connected to check nodes.]
– Proposed by Gallager (1962); the original codes were binary
– Not used until the mid 1990s
– GF(256) codes: Davey and MacKay (1998); symbols take values in [0, 1, 2, …, 255]
– The parity-check matrix defines a graph: symbols are connected to check nodes
– Error-state feedback: a syndrome of 0 means no errors; otherwise there is the potential to re-transmit
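That error-state feedback amounts to checking whether the parity-check matrix times the codeword is zero over GF(256). A minimal host-side sketch of that check, assuming a CSR-style sparse matrix and precomputed gf_log/gf_exp tables (these names and the storage layout are illustrative, not from the slides):

    /* Sketch: syndrome check for a sparse GF(256) parity-check matrix H stored
       per check as (symbol index, coefficient) pairs.  GF(256) addition is XOR;
       multiplication uses log/antilog tables.  Illustrative only. */
    #include <stdint.h>

    extern uint8_t gf_log[256];   /* assumed log table (gf_log[0] unused) */
    extern uint8_t gf_exp[255];   /* assumed antilog table                */

    static uint8_t gf_mul(uint8_t a, uint8_t b)
    {
        if (a == 0 || b == 0) return 0;
        return gf_exp[(gf_log[a] + gf_log[b]) % 255];
    }

    /* Returns 1 if every check is satisfied (syndrome == 0), else 0. */
    int syndrome_is_zero(int n_checks, const int *row_start, const int *col_idx,
                         const uint8_t *coeff, const uint8_t *codeword)
    {
        for (int c = 0; c < n_checks; c++) {
            uint8_t s = 0;
            for (int k = row_start[c]; k < row_start[c + 1]; k++)
                s ^= gf_mul(coeff[k], codeword[col_idx[k]]);   /* GF(256) add = XOR */
            if (s != 0) return 0;
        }
        return 1;
    }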

MIT Lincoln Laboratory HPEC_GPU_DECODE-16 ADC 8/6/2015 Decoding LDPC GF(256) Codes
[Diagram: Tanner graph of symbols and checks, and the decoder flow: ML probabilities feed alternating symbol updates and check updates; a hard decision produces the decoded symbols, the codeword, and the syndrome.]
– Solved using the LDPC sum-product decoding algorithm, a belief propagation algorithm
– Iterative algorithm: the number of iterations depends on SNR
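The iteration structure described above can be summarized in a short host-side sketch; the function names are placeholders for the steps in the diagram, not the original implementation:

    #include <stdint.h>

    /* Placeholder prototypes for the steps in the decoder flow above. */
    void init_messages_from_ml(const float *ml_prob);
    void check_update(void);
    void symbol_update(void);
    void hard_decision(uint8_t *decoded);
    int  syndrome_is_zero_for(const uint8_t *decoded);

    void ldpc_decode(const float *ml_prob, uint8_t *decoded, int max_iters)
    {
        init_messages_from_ml(ml_prob);            /* start from the ML probabilities         */
        for (int it = 0; it < max_iters; it++) {
            check_update();                        /* check-to-symbol messages (95% of cost)  */
            symbol_update();                       /* symbol-to-check messages                */
            hard_decision(decoded);                /* tentative decoded symbols               */
            if (syndrome_is_zero_for(decoded))     /* error-state feedback: stop when H*c = 0 */
                break;
        }
    }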

MIT Lincoln Laboratory HPEC_GPU_DECODE-17 ADC 8/6/2015 Algorithmic Demands of LDPC Decoder
Computational complexity
– Decoding of the entire sequence: 37 TFLOPS (single frame: 922 MFLOPS); the check update is 95% of the total
Memory requirements
– Data moved between device memory and multiprocessors: 44.2 TB (single codeword: 1.1 GB)
– Data moved within the Walsh transform and permutations (part of the check update): 177 TB (single codeword: 4.4 GB)
The decoder requires architectures with high computational and memory bandwidths
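Reading the sequence-level figures as the per-codeword numbers scaled by the 40,000-codeword half-second burst from the experimental setup (my arithmetic, consistent with the slides' totals):

    922 MFLOP/codeword × 40,000 codewords ≈ 36.9 TFLOP ≈ 37 TFLOP of computation
    1.1 GB/codeword × 40,000 codewords = 44 TB moved to/from device memory
    4.4 GB/codeword × 40,000 codewords = 176 TB ≈ 177 TB within the Walsh transform and permutations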

MIT Lincoln Laboratory HPEC_GPU_DECODE-18 ADC 8/6/2015 Outline
Introduction
Low Density Parity Check (LDPC) Codes
NVIDIA GPU Architecture
– Dispatching Parallel Jobs
GPU Implementation
Results
Summary

MIT Lincoln Laboratory HPEC_GPU_DECODE-19 ADC 8/6/2015 NVIDIA GPU Architecture
– The GPU contains N multiprocessors, each performing Single Instruction Multiple Thread (SIMT) operations
– Constant and texture caches speed up certain memory access patterns
– NVIDIA GTX 295 (2 GPUs): N = 30 multiprocessors, M = 8 scalar processors per multiprocessor, 16 KB shared memory, 896 MB device memory, up to 512 threads on each multiprocessor, capable of 930 GFLOPS and 112 GB/sec
[GPU architecture diagram from the NVIDIA CUDA Programming Guide, Version 2.1]
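These per-device limits can be read at run time from the CUDA runtime API; a small sketch (not part of the original decoder) that prints the figures quoted above:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        int n_devices = 0;
        cudaGetDeviceCount(&n_devices);                 // a GTX 295 appears as two devices
        for (int d = 0; d < n_devices; d++) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("Device %d: %s\n", d, prop.name);
            printf("  multiprocessors        : %d\n", prop.multiProcessorCount);
            printf("  shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
            printf("  device memory          : %zu MB\n", prop.totalGlobalMem / (1024 * 1024));
            printf("  max threads per block  : %d\n", prop.maxThreadsPerBlock);
        }
        return 0;
    }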

MIT Lincoln Laboratory HPEC_GPU_DECODE-20 ADC 8/6/2015 Dispatching Parallel Jobs
– The program is decomposed into a grid of thread blocks (Block 1 … Block M)
  – Thread blocks are executed on individual multiprocessors
  – Threads within blocks coordinate through shared memory
  – Threads are optimized for coalesced memory access
grid.x = M; threads.x = N; kernel_call<<<grid, threads>>>(variables);
– Calling parallel jobs (kernels) within a C function is simple
– Scheduling is transparent and managed by the GPU and API
– Thread blocks execute independently of one another
– Threads allow faster computation and better use of memory bandwidth; threads work together using shared memory
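A minimal, self-contained example of this launch pattern, assuming M blocks of N threads with each block reducing its own slice of an array through shared memory (illustrative only, not the decoder):

    #include <cstdio>
    #include <cuda_runtime.h>

    #define M 30      // number of thread blocks (one per multiprocessor on a GTX 295 GPU)
    #define N 256     // threads per block

    __global__ void block_sum(const float *in, float *out)
    {
        __shared__ float buf[N];                    // threads coordinate through shared memory
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * N + tid];        // coalesced read
        __syncthreads();
        for (int stride = N / 2; stride > 0; stride >>= 1) {   // tree reduction
            if (tid < stride) buf[tid] += buf[tid + stride];
            __syncthreads();
        }
        if (tid == 0) out[blockIdx.x] = buf[0];
    }

    int main()
    {
        float *in, *out;
        cudaMalloc(&in,  M * N * sizeof(float));
        cudaMalloc(&out, M * sizeof(float));
        cudaMemset(in, 0, M * N * sizeof(float));

        dim3 grid, threads;
        grid.x = M;  threads.x = N;                 // same pattern as the slide's snippet
        block_sum<<<grid, threads>>>(in, out);
        cudaDeviceSynchronize();

        cudaFree(in); cudaFree(out);
        return 0;
    }

In practice M and N come from the problem size; the check update later uses one block per check and 256 threads per block.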

MIT Lincoln Laboratory HPEC_GPU_DECODE-21 ADC 8/6/2015 Outline
Introduction
Low Density Parity Check (LDPC) Codes
NVIDIA GPU Architecture
GPU Implementation
– Check Update Implementation
Results
Summary

MIT Lincoln Laboratory HPEC_GPU_DECODE-22 ADC 8/6/2015 GPU Implementation of LDPC Decoder
– Software written using NVIDIA's Compute Unified Device Architecture (CUDA)
– Replaced Matlab functions with CUDA MEX functions
– Utilized static variables
  – GPU memory is allocated the first time the function is called
  – Constants used to decode many codewords are loaded only once
– Codewords distributed across the 4 GPUs
Quad GPU workstation: 2 NVIDIA GTX 295s, 3.20 GHz Intel Core i7 965, 12 GB of 1330 MHz DDR3 memory; costs $5,000
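A sketch of that static-variable CUDA MEX pattern, in which device buffers are allocated on the first call and reused on every later call (buffer names, sizes, and the omitted kernel launches are illustrative, not the original code):

    #include "mex.h"
    #include <cuda_runtime.h>

    static double *d_in  = NULL;     /* device copy of the input     */
    static double *d_out = NULL;     /* device buffer for the result */

    static void cleanup(void)        /* called by MATLAB when the MEX file is cleared */
    {
        cudaFree(d_in);
        cudaFree(d_out);
    }

    void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
    {
        size_t n = mxGetNumberOfElements(prhs[0]);

        if (d_in == NULL) {                              /* first call: allocate GPU memory once */
            cudaMalloc((void **)&d_in,  n * sizeof(double));
            cudaMalloc((void **)&d_out, n * sizeof(double));
            /* ...upload parity-check structure and other constants here... */
            mexAtExit(cleanup);
        }

        cudaMemcpy(d_in, mxGetPr(prhs[0]), n * sizeof(double), cudaMemcpyHostToDevice);

        /* ...launch the decoder kernels on d_in / d_out here... */

        plhs[0] = mxCreateDoubleMatrix(n, 1, mxREAL);
        cudaMemcpy(mxGetPr(plhs[0]), d_out, n * sizeof(double), cudaMemcpyDeviceToHost);
    }

From Matlab the function can then be called once per codeword (or batch of codewords) without paying the allocation and constant-upload cost again.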

MIT Lincoln Laboratory HPEC_GPU_DECODE-23 ADC 8/6/2015 Check Update Implementation
[Diagram: check-update dataflow on the Tanner graph: permutations, Walsh transforms, element-wise products, inverse Walsh transforms, and inverse permutations applied to the messages entering each check.]
grid.x = num_checks; threads.x = 256; check_update<<<grid, threads>>>(Q, R);
– A thread block is assigned to each check independently of all other checks; its messages are 256-element vectors
  – Each thread is assigned to one element of GF(256)
  – Coalesced memory reads
  – Threads coordinate using shared memory
– The structure of the algorithm is exploitable on the GPU
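The Walsh transform step maps naturally onto one 256-thread block per check, one thread per GF(256) element. A sketch of that shared-memory transform, assuming row-major num_checks × 256 message storage (illustrative, not the original check_update kernel):

    __global__ void walsh_256(float *msgs)            // msgs: num_checks x 256, row-major
    {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        buf[tid] = msgs[blockIdx.x * 256 + tid];      // coalesced read of this check's vector
        __syncthreads();

        // 8 butterfly stages of the 256-point Walsh-Hadamard transform
        for (int stride = 1; stride < 256; stride <<= 1) {
            float mine    = buf[tid];
            float partner = buf[tid ^ stride];
            __syncthreads();
            buf[tid] = (tid & stride) ? (partner - mine) : (mine + partner);
            __syncthreads();
        }

        msgs[blockIdx.x * 256 + tid] = buf[tid];      // coalesced write back
    }

    // launched in the same style as the slide: walsh_256<<<num_checks, 256>>>(d_msgs);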

MIT Lincoln Laboratory HPEC_GPU_DECODE-24 ADC 8/6/2015 Outline
Introduction
Low Density Parity Check (LDPC) Codes
NVIDIA GPU Architecture
GPU Implementation
Results
Summary

MIT Lincoln Laboratory HPEC_GPU_DECODE-25–26 ADC 8/6/2015 Candidate Implementations
PC: implementation time (beyond development) -; time to decode half second of data 9 days; hardware cost ≈ $2,000
9x FPGA: implementation time 1 person-year (est.); decode time 5.1 minutes; hardware cost ≈ $90,000
ASIC: implementation time 1.5 person-years (est.) + 1 year to fab; decode time 0.5 s (if possible); hardware cost ≈ $1 million
4x GPU: implementation time 6 person-weeks; decode time 5.6 minutes; hardware cost ≈ $5,000 (includes PC cost)

MIT Lincoln Laboratory HPEC_GPU_DECODE-27–28 ADC 8/6/2015 Implementation Results
[Plot: decode time in minutes vs. SNR in dB at 15 iterations, with curves for the Matlab, GPU, and hard-coded implementations.]
– Decoded the test data in 5.6 minutes instead of 9 days: a 2336x speedup (15 iterations)
– A hard-coded version runs a fixed number of iterations, as would likely be used in the FPGA or ASIC versions
– The GPU implementation supports analysis, testing, and demonstration activities

MIT Lincoln Laboratory HPEC_GPU_DECODE-29 ADC 8/6/2015 Summary
– The GPU architecture can provide significant performance improvement with limited hardware cost and implementation effort
– A GPU implementation is a good fit for algorithms with parallel steps that require systems with large computational and memory bandwidths; LDPC GF(256) decoding is a well-suited algorithm
– The implementation allows a reasonable algorithm development cycle: new algorithms can be tested and demonstrated in minutes instead of days

MIT Lincoln Laboratory HPEC_GPU_DECODE-30 ADC 8/6/2015 Backup

MIT Lincoln Laboratory HPEC_GPU_DECODE-31 ADC 8/6/2015 Experimental Setup: Test System Parameters
– Data rate: 1.3 Gb/s
– Half second burst: 40,000 codewords (6000 symbols each)
– Parity-check matrix: 4000 checks × 6000 symbols, corresponding to a rate 1/3 code
– Performance numbers are based on 15 iterations of the decoding algorithm
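As a consistency check (my arithmetic, not from the slides): with 4000 checks on 6000 GF(256) symbols, each codeword carries 6000 - 4000 = 2000 information symbols, i.e. 16,000 information bits, so the half-second burst carries

    40,000 codewords × 16,000 information bits = 640 Mbit per 0.5 s ≈ 1.3 Gb/s

which matches the stated data rate.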