A Scalable Architecture for LDPC Decoding


A Scalable Architecture for LDPC Decoding. Cocco, M.; Dielissen, J.; Heijligers, M.; Hekstra, A.; Huisken, J. Design, Automation and Test in Europe (DATE) Conference and Exhibition, 2004, Proceedings, Volume 3, Feb. 16-20, 2004, pages 88-93.

Outline: Introduction; Serial approach; UMP algorithm; Dataset in check nodes; Check operation; Computation trick; Memory reduction; Computation per iteration.

Introduction. The target is high-code-rate (R = 0.9) LDPC codes with an average row weight of K = 30. For high code rate, long codeword lengths and high SNR, the architecture achieves roughly a 10x (1/10) memory reduction. MacKay [3] has shown that for high-rate applications and intermediate or longer codeword lengths this brings no advantage; in fact, the error performance becomes worse at higher SNR values.

Serial Approach. The target is storage-media applications (optical or magnetic), where the delay requirement is relaxed. Bit nodes are processed serially, from the first to the last, with the messages held in memory between node updates.

UMP Algorithm

FOR 40 ITERATIONS DO
  FOR ALL BIT NODES DO
    FOR EACH INCOMING ARC X
      SUM ALL INCOMING LLRs EXCEPT OVER X
      SEND THE RESULT BACK OVER X
    NEXT ARC
  NEXT BIT NODE
  FOR ALL CHECK NODES DO
    FOR EACH INCOMING ARC X
      TAKE THE ABS MINIMUM OF THE INCOMING LLRs EXCEPT OVER X
      TAKE THE XOR OF THE INCOMING HARD BITS (LLR SIGNS) EXCEPT OVER X
      SEND THE RESULT BACK OVER X
    NEXT ARC
  NEXT CHECK NODE
NEXT ITERATION
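A minimal Python sketch of the UMP (min-sum) schedule above, run on a tiny illustrative code. The parity-check matrix H and channel LLRs are invented for the demo and do not come from the paper:

```python
# Toy UMP / min-sum decoder sketch (illustrative values, not the paper's).
H = [[1, 1, 0, 1, 0, 0],      # parity-check matrix, one row per check node
     [0, 1, 1, 0, 1, 0],
     [1, 0, 0, 0, 1, 1]]
llr_ch = [-1.2, 0.8, -0.3, 2.1, -0.9, 1.5]   # channel LLR per bit node
M, N = len(H), len(H[0])
c2b = [[0.0] * N for _ in range(M)]          # check-to-bit messages

for _ in range(40):                          # "FOR 40 ITERATIONS DO"
    # bit-node update: sum of incoming LLRs except over arc x
    total = [llr_ch[n] + sum(c2b[m][n] for m in range(M)) for n in range(N)]
    b2c = [[total[n] - c2b[m][n] if H[m][n] else 0.0 for n in range(N)]
           for m in range(M)]
    # check-node update: sign from XOR of hard bits, magnitude from the
    # minimum |LLR|, both excluding arc x
    for m in range(M):
        arcs = [n for n in range(N) if H[m][n]]
        for n in arcs:
            others = [b2c[m][k] for k in arcs if k != n]
            sign = -1 if sum(v < 0 for v in others) % 2 else 1
            c2b[m][n] = sign * min(abs(v) for v in others)

# hard decision per bit after the final iteration
hard = [int(llr_ch[n] + sum(c2b[m][n] for m in range(M)) < 0)
        for n in range(N)]
```

The XOR of hard bits is realised here as a product of message signs, which is the usual software form of the same operation.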

UMP algorithm: no knowledge of the channel SNR is needed, which gives robust performance; no complex mathematical functions (such as tanh x) are needed, which saves area.

Dataset in check nodes: the overall minimum |LLR| value, the one-but-minimum (second-smallest) value, and the index of the minimum. (Slide figure: example dataset for check node 4.)

Check operation: for the j-th connected bit node, compute the exclusive-or of all hard bits output by the connected bit nodes except the j-th, and compute the minimum of the K absolute LLR values of the connected bit nodes except the j-th.
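The excluded-j check operation can be sketched directly; the function name and the example message values are assumptions for illustration:

```python
def check_update(llrs, j):
    """Check-node output over arc j: XOR of hard bits and minimum |LLR|
    over all connected bit nodes except the j-th."""
    others = [v for i, v in enumerate(llrs) if i != j]
    sign = 1
    for v in others:          # XOR of hard bits == product of LLR signs
        if v < 0:
            sign = -sign
    mag = min(abs(v) for v in others)   # minimum absolute LLR
    return sign * mag

print(check_update([2.0, -0.5, 1.5, -3.0], j=1))  # -> -1.5
```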

Computation trick for the minimum: if LLRj is not the minimum, the excluded minimum equals the overall minimum; otherwise it equals the one-but-minimum (second-smallest) value.

Memory reduction: the full per-arc message storage (original size) is replaced by the compact check-node dataset (reduced size); the stored index of the minimum serves as the address.
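A back-of-envelope sizing of this reduction, assuming invented word widths (the exact factor, including the paper's roughly 1/10 figure, depends on the actual widths and on what the baseline stores):

```python
import math

K = 30                             # average row weight from the introduction
w = 6                              # assumed bits per LLR magnitude
idx_bits = math.ceil(math.log2(K)) # bits to address the minimum's index

naive = K * w                      # one stored magnitude per connected arc
reduced = w + w + idx_bits + K     # min, one-but-min, index, K sign bits
print(naive, reduced)              # 180 vs. 47 bits per check node
```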

Memory unit inside Check node

Computation for Iteration

FOR 40 ITERATIONS DO
  FOR ALL BIT NODES DO
    CALCULATE THE OUTPUT MESSAGES FROM THE 3 CONNECTED CHECK NODES
    DO RUNNING CHECK NODE UPDATES ON THE 3 CHECK NODES
  NEXT BIT NODE
NEXT ITERATION
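A running check-node update can be modelled as folding one incoming message at a time into the (minimum, one-but-minimum, index, parity) dataset. The structure below is inferred from the slides, not taken verbatim from the paper:

```python
def running_update(state, j, llr):
    """Fold the message on arc j into the compact check-node dataset."""
    min1, min2, idx, parity = state
    mag = abs(llr)
    if llr < 0:
        parity ^= 1                  # running XOR of hard bits
    if mag < min1:
        min1, min2, idx = mag, min1, j
    elif mag < min2:
        min2 = mag
    return (min1, min2, idx, parity)

state = (float("inf"), float("inf"), -1, 0)   # empty dataset
for j, llr in enumerate([2.0, -0.5, 1.5, -3.0]):
    state = running_update(state, j, llr)
# state is now (0.5, 1.5, 1, 0)
```

Because each step touches only this small dataset, a check node never needs the full list of incoming messages at once, which matches the serial bit-node schedule.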

Computation for Iteration. (Slide figure: four check-node datasets, each split into NEW | OLD halves.)

Time folded architecture. (Slide block diagram: FSM & PC with μROM providing R/W and address control; serial input and serial output; computational kernel; prefetcher; memory.)

Prefetch: every dataset is used statically for 30 consecutive cycles, while every clock cycle requires on average 2 read and 2 write operations; the prefetcher hides this through delayed writeback and caching of datasets.

Tiled architecture. (Slide block diagram: FSM & PC, μROM, computational kernel, prefetcher, memory.)

Result and area distribution: N = 1020, R = 0.5, 57 tiles; 36 mm² in 0.13 μm technology at 1 GHz; throughput 300 Mb/s.

Conclusion: prefetching provides speedup and simultaneous multiple accesses; the memory hierarchy reduces memory access latency; the N-tiled architecture increases performance; a modified version can be pipelined.