1 1 © 2011 The MathWorks, Inc. Accelerating Bit Error Rate Simulation in MATLAB using Graphics Processors James Lebak Brian Fanous Nick Moore High-Performance.

Slides:



Advertisements
Similar presentations
Noise-Predictive Turbo Equalization for Partial Response Channels Sharon Aviran, Paul H. Siegel and Jack K. Wolf Department of Electrical and Computer.
Advertisements

Parallel Computing in Matlab
Scalable Multi-Cache Simulation Using GPUs Michael Moeng Sangyeun Cho Rami Melhem University of Pittsburgh.
SE263 Video Analytics Course Project Initial Report Presented by M. Aravind Krishnan, SERC, IISc X. Mei and H. Ling, ICCV’09.
Error Correction and LDPC decoding CMPE 691/491: DSP Hardware Implementation Tinoosh Mohsenin 1.
Inserting Turbo Code Technology into the DVB Satellite Broadcasting System Matthew Valenti Assistant Professor West Virginia University Morgantown, WV.
GPU Computing with Hartford Condor Week 2012 Bob Nordlund.
HPCC Mid-Morning Break High Performance Computing on a GPU cluster Dirk Colbry, Ph.D. Research Specialist Institute for Cyber Enabled Discovery.
Development of Parallel Simulator for Wireless WCDMA Network Hong Zhang Communication lab of HUT.
Improving BER Performance of LDPC Codes Based on Intermediate Decoding Results Esa Alghonaim, M. Adnan Landolsi, Aiman El-Maleh King Fahd University of.
1 Channel Coding in IEEE802.16e Student: Po-Sheng Wu Advisor: David W. Lin.
OpenFOAM on a GPU-based Heterogeneous Cluster
Empowering visual categorization with the GPU Present by 陳群元 我是強壯 !
Acceleration of the Smith– Waterman algorithm using single and multiple graphics processors Author : Ali Khajeh-Saeed, Stephen Poole, J. Blair Perot. Publisher:
Turbo Codes Azmat Ali Pasha.
NORM BASED APPROACHES FOR AUTOMATIC TUNING OF MODEL BASED PREDICTIVE CONTROL Pastora Vega, Mario Francisco, Eladio Sanz University of Salamanca – Spain.
Computer Architecture Project
Code and Decoder Design of LDPC Codes for Gbps Systems Jeremy Thorpe Presented to: Microsoft Research
EE436 Lecture Notes1 EEE436 DIGITAL COMMUNICATION Coding En. Mohd Nazri Mahmud MPhil (Cambridge, UK) BEng (Essex, UK) Room 2.14.
EEE377 Lecture Notes1 EEE436 DIGITAL COMMUNICATION Coding En. Mohd Nazri Mahmud MPhil (Cambridge, UK) BEng (Essex, UK) Room 2.14.
A Performance and Energy Comparison of FPGAs, GPUs, and Multicores for Sliding-Window Applications From J. Fowers, G. Brown, P. Cooke, and G. Stitt, University.
HPEC_GPU_DECODE-1 ADC 8/6/2015 MIT Lincoln Laboratory GPU Accelerated Decoding of High Performance Error Correcting Codes Andrew D. Copeland, Nicholas.
Computing Platform Benchmark By Boonyarit Changaival King Mongkut’s University of Technology Thonburi (KMUTT)
Why to Apply Digital Transmission?
OpenSSL acceleration using Graphics Processing Units
Accelerating SQL Database Operations on a GPU with CUDA Peter Bakkum & Kevin Skadron The University of Virginia GPGPU-3 Presentation March 14, 2010.
SAGE: Self-Tuning Approximation for Graphics Engines
AGENT SIMULATIONS ON GRAPHICS HARDWARE Timothy Johnson - Supervisor: Dr. John Rankin 1.
Motivation “Every three minutes a woman is diagnosed with Breast cancer” (American Cancer Society, “Detailed Guide: Breast Cancer,” 2006) Explore the use.
DIGITAL WATERMARKING OF AUDIO SIGNALS USING A PSYCHOACOUSTIC AUDITORY MODEL AND SPREAD SPECTRUM THEORY * By: Ricardo A. Garcia *Research done at: University.
DIGITAL WATERMARKING OF AUDIO SIGNALS USING A PSYCHOACOUSTIC AUDITORY MODEL AND SPREAD SPECTRUM THEORY By: Ricardo A. Garcia University of Miami School.
GPU-accelerated Evaluation Platform for High Fidelity Networking Modeling 11 December 2007 Alex Donkers Joost Schutte.
Low Density Parity Check (LDPC) Code Implementation Matthew Pregara & Zachary Saigh Advisors: Dr. In Soo Ahn & Dr. Yufeng Lu Dept. of Electrical and Computer.
Predictive Runtime Code Scheduling for Heterogeneous Architectures 1.
Daphne Koller Message Passing Loopy BP and Message Decoding Probabilistic Graphical Models Inference.
By Arun Bhandari Course: HPC Date: 01/28/12. GPU (Graphics Processing Unit) High performance many core processors Only used to accelerate certain parts.
A New Method For Developing IBIS-AMI Models
1 © 2012 The MathWorks, Inc. Parallel computing with MATLAB.
Developing a SDR Testbed Alex Dolan Mohammad Khan Ahmet Unsal Project Advisor Dr. Aditya Ramamoorthy.
Tinoosh Mohsenin and Bevan M. Baas VLSI Computation Lab, ECE Department University of California, Davis Split-Row: A Reduced Complexity, High Throughput.
Tracking with CACTuS on Jetson Running a Bayesian multi object tracker on a low power, embedded system School of Information Technology & Mathematical.
Fast Support Vector Machine Training and Classification on Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Parallel Computing Laboratory,
Distributed computing using Projective Geometry: Decoding of Error correcting codes Nachiket Gajare, Hrishikesh Sharma and Prof. Sachin Patkar IIT Bombay.
MIT Lincoln Laboratory Slides - 1 (HPEC) 11Aug10 NBL Nancy List, Ryan Shoup, Tommy Royster HPEC September 2010 Evaluating the Performance of DVB-S2.
Design of a High-Throughput Low-Power IS95 Viterbi Decoder Xun Liu Marios C. Papaefthymiou Advanced Computer Architecture Laboratory Electrical Engineering.
GPU-Accelerated Beat Detection for Dancing Monkeys Philip Peng, Yanjie Feng UPenn CIS 565 Spring 2012 Final Project – Final Presentation img src:
Experiences Accelerating MATLAB Systems Biology Applications Heart Wall Tracking Lukasz Szafaryn, Kevin Skadron University of Virginia.
Coding Theory. 2 Communication System Channel encoder Source encoder Modulator Demodulator Channel Voice Image Data CRC encoder Interleaver Deinterleaver.
JPEG-GPU: A GPGPU IMPLEMENTATION OF JPEG CORE CODING SYSTEMS Ang Li University of Wisconsin-Madison.
Compiler and Runtime Support for Enabling Generalized Reduction Computations on Heterogeneous Parallel Configurations Vignesh Ravi, Wenjing Ma, David Chiu.
1)Leverage raw computational power of GPU  Magnitude performance gains possible.
Real-Time Turbo Decoder Nasir Ahmed Mani Vaya Elec 434 Rice University.
Part 1: Overview of Low Density Parity Check(LDPC) codes.
SR: 599 report Channel Estimation for W-CDMA on DSPs Sridhar Rajagopal ECE Dept., Rice University Elec 599.
Space Charge with PyHEADTAIL and PyPIC on the GPU Stefan Hegglin and Adrian Oeftiger Space Charge Working Group meeting –
Sunpyo Hong, Hyesoon Kim
Canny Edge Detection Using an NVIDIA GPU and CUDA Alex Wade CAP6938 Final Project.
Tinoosh Mohsenin 2, Houshmand Shirani-mehr 1, Bevan Baas 1 1 University of California, Davis 2 University of Maryland Baltimore County Low Power LDPC Decoder.
Channel Coding and Error Control 1. Outline Introduction Linear Block Codes Cyclic Codes Cyclic Redundancy Check (CRC) Convolutional Codes Turbo Codes.
Two-Dimensional Phase Unwrapping On FPGAs And GPUs
A Dynamic Scheduling Framework for Emerging Heterogeneous Systems
OptiSystem applications: BER analysis of BPSK with RS encoding
VLSI Architectures For Low-Density Parity-Check (LDPC) Decoders
Q. Wang [USTB], B. Rolfe [BCA]
Rate 7/8 (1344,1176) LDPC code Date: Authors:
January 2004 Turbo Codes for IEEE n
An Improved Split-Row Threshold Decoding Algorithm for LDPC Codes
GPU Implementations for Finite Element Methods
High Throughput LDPC Decoders Using a Multiple Split-Row Method
Physical Layer Approach for n
Presentation transcript:

1 1 © 2011 The MathWorks, Inc. Accelerating Bit Error Rate Simulation in MATLAB using Graphics Processors James Lebak Brian Fanous Nick Moore High-Performance Embedded Computing Workshop (HPEC 2011) 22 September 2011

2 Outline  Application Area  MATLAB Tools  DVB-S.2 Subset benchmark  DVB-S Subset benchmark  Conclusions

3 Bit Error Rate Simulation  Begin with a model of a communications system  Vary the signal-to-noise ratio  Calculate bit error rate through Monte Carlo simulation  Obtain a curve like that at the right  Expensive (millions of trials, days of simulation)

4 Example System Model Digital Video Broadcast Standard, Second Generation (DVB-S.2) BCH Encoder Demodulator and Deinterleaver LDPC Decoder LDPC Encoder BCH Decoder Channel Model Modulator and Interleaver Data In Data Out Transmitter Model Receiver Model LDPC = Low-Density Parity Check Simulation Goal: Verify transmitter and receiver achieve a given bit error rate in the presence of noise introduced by the channel model

5 Outline  Application Area  MATLAB Tools  DVB-S.2 Subset benchmark  DVB-S Subset benchmark  Conclusions

6 System Objects Represent dynamic systems  Algorithms with input values and states that change over time Optimized for iterative execution  One-time checking of parameter values, input arguments  Data streaming between objects Many algorithms implemented in C++ for speed Support for automatic C code generation System objects allow design of dynamic systems in MATLAB

7 System Object Receiver Model %***Receiver Object Creation *** hDemod = comm.PSKDemodulator(demodulatorArgs{:}); hDeintrlv = comm.BlockDeinterleaver(dvb.InterleaveOrder); hDec = comm.LDPCDecoder(decoderArgs{:}); {'ModulationOrder', 2^dvb.BitsPerSymbol,... 'BitOutput', true,... 'PhaseOffset', dvb.PhaseOffset,... 'SymbolMapping', 'Custom',... 'CustomSymbolMapping', dvb.SymbolMapping,... 'DecisionMethod', 'Approximate log-likelihood ratio',... 'Variance', dvb.NoiseVar} %***Receiver Model Execution***% demodOut = step(hDemod, chanOut); dexlvrOut = step(hDeintrlv, demodOut); ldpcdecOut = step(hDec, dexlvrOut);

8 Graphics Processing Unit (GPU) Acceleration in MATLAB  Requirements –Parallel Computing Toolbox –NVIDIA GPU with Compute Capability 1.3 or greater  Requirements –Parallel Computing Toolbox –NVIDIA GPU with Compute Capability 1.3 or greater Level of control Minimal Some Extensive Use GPU arrays with MATLAB built-in functions Execute custom functions on elements of GPU arrays Create kernels from existing CUDA code and PTX files Parallel optionsRequired effort None Straightforward Involved

9 LDPC Decoder c0c1c2c3c4c5c6c7c8 f0f1f2f3f4 c9 fjfj cici r ji q ij LDPC Decoder Algorithm for ii=1:maxIterations, Update all r messages (in parallel) Update all q messages (in parallel) end form hard/soft decision (in parallel) LDPC Decoder was implemented using custom CUDA kernels in MATLAB release R2011a Received word Check nodes Iteratively determine message bits that best satisfy system constraints (64800 nodes) (32400 nodes) ~226,000 edges

10 GPU-Accelerated Receiver Model %***Receiver Object Creation *** hDemod = comm.PSKDemodulator(demodulatorArgs{:}); hDeintrlv = comm.BlockDeinterleaver(dvb.InterleaveOrder); hDec = comm.gpu.LDPCDecoder(decoderArgs{:}); %***Receiver Model Execution***% demodOut = step(hDemod, chanOut); dexlvrOut = step(hDeintrlv, demodOut); ldpcdecOut = step(hDec, dexlvrOut); Minimal code changes to use GPU System Object GPU System object is 20x faster than original System object Overall system simulation is 5x faster

11 Outline  Application Area  MATLAB Tools  DVB-S.2 Subset benchmark  DVB-S Subset benchmark  Conclusions

12 DVB-S.2 Subset Benchmark Digital Video Broadcast Standard, Second Generation (DVB-S.2) BCH Encoder Demodulator and Deinterleaver LDPC Decoder LDPC Encoder BCH Decoder Channel Model Modulator and Interleaver Data In Data Out Transmitter Model Receiver Model Subset benchmark (can execute entirely on GPU) New objects implemented using MATLAB for faster development

13 Using GPU System objects  GPU System objects are drop-in replacements for standard System objects –Normal MATLAB arrays in  normal MATLAB arrays out –Requires two data transfers %***Receiver Object Creation *** hDemod = comm.gpu.PSKDemodulator(demodulatorArgs{:}); hDeintrlv = comm.gpu.BlockDeinterleaver(dvb.InterleaveOrder); hDec = comm.gpu.LDPCDecoder(decoderArgs{:}); %***Receiver Model Execution***% demodOut = step(hDemod, chanOut); dexlvrOut = step(hDeintrlv, demodOut); ldpcdecOut = step(hDec, dexlvrOut);

14 Using GPU System objects  GPU System objects are drop-in replacements for standard System objects –Normal MATLAB arrays in  normal MATLAB arrays out –Requires two data transfers  GPU System objects also accept GPUArrays as inputs –In this case the output is also a GPU Array –Avoids data transfer overhead %***Receiver Object Creation *** hDemod = comm.gpu.PSKDemodulator(demodulatorArgs{:}); hDeintrlv = comm.gpu.BlockDeinterleaver(dvb.InterleaveOrder); hDec = comm.gpu.LDPCDecoder(decoderArgs{:}); %***Receiver Model Execution***% demodOut = step(hDemod, gpuArray(chanOut)); dexlvrOut = step(hDeintrlv, demodOut); ldpcdecOut = gather(step(hDec, dexlvrOut));

15 DVB-S.2 Subset Benchmarks Using GPUArray  Baseline: speed of subset with all objects on the CPU  Using GPUArray objects as inputs accelerates both versions of the benchmark  Using the GPU-accelerated LDPC provides a large performance boost  Adding GPU-accelerated versions of other objects slows down the simulation Benchmark Machine Specs CPU: 2.5 GHz Intel Core 2 Quad GPU: NVIDIA Tesla C1060 Windows 7 MATLAB R2011b

16 Performance vs. Input Size for GPU System Objects  Computational density is important for obtaining good GPU performance  Many GPU System objects are slower for small data sizes  In R2011b, crossover point is generally around 10 5  Frame size is elements System Frame Size

17 DVB-S.2 Subset Multi-Frame Benchmarks  Process 8 frames of data on the GPU simultaneously to get better performance  LDPC performance begins to saturate, but other objects become faster

18 Outline  Application Area  MATLAB Tools  DVB-S.2 Subset benchmark  DVB-S Subset benchmark  Conclusions

19 Digital Video Broadcast – Satellite (DVB-S) benchmark Demodulator Viterbi Decoder Convolutional Encoder Channel Model Modulator Data In Data Out Transmitter Model Receiver Model  DVB-S is a similar system –uses different encoder/decoder pair than DVB-S.2  The entire chain executes on the GPU –including message generation and validation  Viterbi Decoder is implemented as a custom CUDA kernel –all others implemented using MATLAB GPU support

20 DVB-S Subset Benchmarks  DVB-S frame size is 1632 elements –single-frame performance on the GPU is slower than the CPU –multi-frame performance is much faster

21 Conclusions  The GPU can be a powerful tool for accelerating Bit Error Rate simulation in MATLAB  Parallel Computing Toolbox allows a trade-off between work and performance –Implement the kernels with highest computational requirements as custom CUDA kernels –Implement kernels with lower computational requirements using MATLAB  For best performance using the GPU, process multiple data frames simultaneously

22 Acknowledgments The following people at MathWorks contributed to this work or to the presentation. Andy Grace Witek Jachimczyk Amit Kansal Darel Linebarger Mike Longfritz Roy Lurie Mike McLernon Don Orofino Paul Pacheco Narfi Stefansson Bharath Venkataraman