An Efficient FPGA Implementation of IEEE 802.16e LDPC Encoder. Speaker: Chau-Yuan-Yu. Advisor: Mong-Kai Ku
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Better Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
Low-Density Parity-Check Code. Benefits of LDPC codes: near-Shannon-limit performance and a low error floor. LDPC codes are adopted by various standards (e.g., DVB-S2, IEEE 802.11n, IEEE 802.16e).
Low-Density Parity-Check Code. The parity-check matrix H is sparse: very few 1's in each row and column. The null space of H is the codeword space, so a valid codeword c satisfies H * c^T = 0.
Low-Density Parity-Check Code. In an (n, k) block code, k information bits are encoded into an n-bit codeword. In a systematic block code, the information bits appear directly in the codeword: c = [s | p], with a systematic part s and a parity part p.
Low-Density Parity-Check Code. General encoding of systematic linear block codes: derive the generator matrix G from H, then c = sG = [s | p]. Issues with LDPC codes: G is very large and generally not sparse, so the encoding complexity is very high.
Structured LDPC Codes: Quasi-Cyclic LDPC Codes. In QC-LDPC, H can be partitioned into square sub-blocks of size z x z. Each sub-block is either the z x z zero matrix or a cyclically shifted z x z identity matrix.
Structured LDPC Codes: QC Codes with Dual-Diagonal Structure. In the IEEE standards, the QC-LDPC codes have a dual-diagonal parity structure. We take the 802.16e code rate 1/2 matrix as an example; a '0' entry represents the identity matrix (cyclic shift of 0).
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Better Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
General Encoding for LDPC Codes: the Richardson and Urbanke (RU) algorithm. Partition the H matrix into several sub-matrices; the part T is a lower triangular matrix.
General Encoding for LDPC Codes: Richardson and Urbanke (RU) algorithm. The parity bits are split into two parts (p0, p1); the resulting encoding complexity is O(n + g^2).
Efficient Encoding for Dual-Diagonal LDPC Codes
A valid codeword c = [s | p] (information bits s, parity bits p) must satisfy H * c^T = 0.
Write H = [Hs | Hp] and replace Hp by the dual-diagonal parity matrix; define the lambda values as lambda = Hs * s^T.
From the equation we obtain Hp * p^T = lambda, which can be solved efficiently thanks to the dual-diagonal structure.
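To make these relations concrete, here is a worked sketch of my own (not from the slides), written at the block-row level and simplified by treating all three nonzero entries of the weight-3 parity column as identity sub-blocks (in the standard matrices the middle entry carries a nonzero shift, which a real encoder must still account for):

\lambda_i = H_{s,i}\, s^T, \qquad H_p\, p^T = \lambda
p_0 = \lambda_0 \oplus \lambda_1 \oplus \cdots \oplus \lambda_{m_b-1} \quad \text{(summing all block rows; every other parity block appears in exactly two rows and cancels)}
p_{i+1} = p_i \oplus \lambda_i \quad \text{(for rows without a weight-3 entry)}

These two relations are the basis of the derivation schemes discussed next.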
Related Work (1): Sequential Encoding
Encoding scheme:
Step 1: Compute the lambda values by the matrix operation x = Hs * s.
Step 2: Determine the parity vector P_0 by adding all the lambda values.
Step 3: Obtain the rest of the parity vectors by exploiting the dual-diagonal matrix T (one-way derivation).
Related Work (2): Arbitrary Bit-Generation and Correction Encoding
In [1], an alternative encoding for the standard matrix was presented: the shift in the weight-3 column set is replaced with a zero cyclic shift, so the parity portion of the matrix is modified.
H is sectorized into three sub-matrices: the information bit region A, the parity bit region Q for the bit-flipping operation, and the parity bit region U for non bit-flipping.
[1] C. Yoon, E. Choi, M. Cheong, and S.-K. Lee, "Arbitrary bit generation and correction technique for encoding QC-LDPC codes with dual-diagonal parity structure," IEEE Wireless Communications and Networking Conference (WCNC 2007), March 2007.
Related Work (2): Arbitrary Bit-Generation and Correction Encoding
Encoding scheme:
Step 1: Compute the lambda values by the matrix operation x = A * s.
Step 2: Set P_0 to arbitrary binary values and solve the unknown parity bits.
Step 3: Compute the correction vector f from P_0.
Step 4: Add the correction vector to the parity bits in region Q to correct them (one-way derivation).
Related Work (2): Arbitrary Bit-Generation and Correction Encoding
Advantages: low-complexity encoding; the number of additions required is less than in the RU scheme.
Drawbacks: not directly applicable to the standard code; modifying the matrix degrades the code performance.
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Better Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
Better Encoding Scheme
Advantages of the encoding scheme proposed in [2]: low-complexity encoding; directly applicable to the matrices defined in the IEEE standards without any modification; achieves a higher level of parallelism.
[2] C.-Y. Lin, C.-C. Wei, and M.-K. Ku, "Efficient Encoding for Dual-Diagonal Structured LDPC Code Based on Parity Bits Prediction and Correction," IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Dec.
Better Encoding Scheme
Step 1: Set P_0' to any binary vector.
Step 2: Compute the lambda values by the matrix operation Hs * s.
Step 3: Forward derivation.
Step 4: Backward derivation.
Step 5: Compute P_0 by adding the prediction parity vectors.
Step 6: Compute the correction vector f.
Step 7: Correct the prediction parities by adding f.
The two-way derivation (Steps 3 and 4 running in parallel) reduces the encoding delay.
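For reference, a worked form of the prediction-and-correction idea (my own sketch, in the notation of the hardware stages shown later, for the rate-1/2 matrix with m_b = 12; the arbitrary start value is taken as the all-zero vector, so the correction vector f reduces to P_0):

P_i' = \lambda_0 \oplus \lambda_1 \oplus \cdots \oplus \lambda_i \quad \text{(forward derivation, } i = 0, \dots, 5)
P_j' = \lambda_{11} \oplus \lambda_{10} \oplus \cdots \oplus \lambda_j \quad \text{(backward derivation, } j = 11, \dots, 6)
P_0 = P_5' \oplus P_6' = \lambda_0 \oplus \cdots \oplus \lambda_{11}
P_i = P_i' \oplus f = P_i' \oplus P_0

Because the forward and backward chains run concurrently, the derivation needs roughly half as many steps as a one-way derivation.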
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Better Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
LDPC Encoder Architecture
Based on the encoding scheme described before, we design both a parallel and a serial architecture.
Parallel architecture: achieves a higher level of parallelism, high speed.
Serial architecture: uses fewer barrel shifters (two instead of six).
Parallel architecture (block diagram): matrix memory, divider, input data register, barrel shifters #1 to #6, lambda position, accumulators, prediction parity memory, and correct stage.
Parallel architecture (Stage 1: matrix). In this stage, the matrix module selects the shift values and multiplies them by a factor that depends on the code length.
Benefits: 1. Encoding can start as soon as input data arrive, without waiting for all of the input data. 2. The number of barrel shifters is reduced.
Shifter Value Computation
Equations for computing the shift values in IEEE 802.16e: for the rate-2/3A code, p(f,i,j) = mod(p(i,j), z_f); for the other code rates, p(f,i,j) = floor(p(i,j) * z_f / z_0), with z_0 = 96.
Two ways of implementing the matrices for multiple rates and lengths:
Implementation | Slices | FFs | LUTs | CLK (MHz) | Total gate count
One matrix + shift-value calculation IP | 14,179 | 4,071 | 26,… | … | …,076
Storing the shift values of all matrices | 41,409 | 12,078 | 76,… | … | …,691
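As an illustration of the scaling above (and of why a divider appears in the block diagram), here is a minimal behavioral sketch; it is my own, not the thesis RTL, and the module and port names are hypothetical:

// Scale a base-matrix shift value (defined for z0 = 96) to the target
// expansion factor zf, following the IEEE 802.16e rule recalled above.
module shift_scale (
  input  signed [8:0]     p_base,   // base-matrix entry (-1 denotes a zero sub-block)
  input         [7:0]     zf,       // expansion factor of the target code length
  input                   rate_23A, // 1 when the rate-2/3A base matrix is selected
  output reg signed [8:0] p_out     // scaled shift value
);
  always @* begin
    if (p_base <= 0)
      p_out = p_base;               // -1 and 0 entries are used unchanged
    else if (rate_23A)
      p_out = p_base % zf;          // mod(p(i,j), zf) for rate 2/3A
    else
      p_out = (p_base * zf) / 96;   // floor(p(i,j) * zf / z0), z0 = 96
  end
endmodule

The division by 96 is presumably what the divider block in the diagrams performs.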
Parallel architecture (Stage 2: divider and input data register). The divider divides the data coming from the matrix module (the division in the shift-value equation); the input data register stores the incoming input data, which are then used by the barrel shifters.
Parallel architecture (Stage 3: barrel shifters and lambda position). The barrel shifters cyclically shift the input data by the shift values; the lambda position module records the row positions of the shift values (e.g., lambda positions 3, 8, and 11).
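For reference, a minimal sketch of one barrel shifter (assumed, not the thesis RTL; multiplying a z-bit group by a cyclically shifted identity sub-block is just a cyclic rotation of that group, and the rotation direction below is one possible convention):

module barrel_shifter #(
  parameter Z_MAX = 96                  // largest expansion factor in 802.16e
)(
  input      [Z_MAX-1:0] data_in,       // one z-bit group of input data
  input      [7:0]       z,             // active expansion factor (z <= Z_MAX)
  input      [7:0]       shift,         // cyclic shift value from the matrix module
  output reg [Z_MAX-1:0] data_out
);
  integer i;
  always @* begin
    data_out = {Z_MAX{1'b0}};
    for (i = 0; i < Z_MAX; i = i + 1)
      if (i < z)
        // output bit i takes input bit (i + shift) mod z
        data_out[i] = data_in[(i + shift) % z];
  end
endmodule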
Parallel architecture (Stage 4: accumulators). The lambda values are computed by accumulating the shifted data over K_b clock cycles. According to the lambda positions, in the clock cycle shown, lambda_1, lambda_2, lambda_5, lambda_8, lambda_9, and lambda_11 need to be accumulated.
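A minimal sketch of one accumulator (assumed, hypothetical names): each accumulator XOR-accumulates the shifted z-bit groups flagged by its lambda-position entry over the K_b input cycles:

module lambda_acc #(
  parameter Z = 96
)(
  input              clk,
  input              clear,    // asserted at the start of a codeword
  input              enable,   // from the lambda-position module: this row has a nonzero block in the current column
  input  [Z-1:0]     shifted,  // output of the assigned barrel shifter
  output reg [Z-1:0] lambda    // accumulated lambda value of this block row
);
  always @(posedge clk) begin
    if (clear)
      lambda <= {Z{1'b0}};
    else if (enable)
      lambda <= lambda ^ shifted;   // GF(2) addition of the shifted group
  end
endmodule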
Parallel architecture (Stage 5: prediction parity memory). The prediction vectors P_i' are computed by the following equations.
Parallel architecture (Stage 5). To save hardware area, one XOR network computes the prediction vectors for all four code rates:

// Forward derivation: P_i accumulates acc_out0 .. acc_out(i)
P_0  <= acc_out0;
P_1  <= acc_out0 ^ acc_out1;
P_2  <= acc_out0 ^ acc_out1 ^ acc_out2;
P_3  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3;
P_4  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4;
P_5  <= acc_out0 ^ acc_out1 ^ acc_out2 ^ acc_out3 ^ acc_out4 ^ acc_out5;
// Backward derivation: P_j accumulates acc_out11 .. acc_out(j)
P_6  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7 ^ acc_out6;
P_7  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8 ^ acc_out7;
P_8  <= acc_out11 ^ acc_out10 ^ acc_out9 ^ acc_out8;
P_9  <= acc_out11 ^ acc_out10 ^ acc_out9;
P_10 <= acc_out11 ^ acc_out10;
P_11 <= acc_out11;

Prediction vectors used per code rate:
Rate 1/2: P_0 to P_11
Rate 2/3: P_0 to P_3 and P_8 to P_11
Rate 3/4: P_0 to P_2 and P_9 to P_11
Rate 5/6: P_0 to P_1 and P_10 to P_11
Parallel architecture (Stage 6: correct).
Step 1: Compute P_0; at code rate 1/2, P_0 = P_5 ^ P_6.
Step 2: Correct the other P_i using the equation P_i = P_i' ^ P_0.
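A minimal sketch of this correction step for rate 1/2 (assumed, hypothetical module name), with the twelve z-bit prediction vectors packed into one bus:

module correct_stage #(
  parameter Z = 96
)(
  input  [12*Z-1:0] p_pred,   // {P_11', ..., P_1', P_0'} prediction vectors
  output [12*Z-1:0] p_out     // corrected parity vectors {P_11, ..., P_0}
);
  // Step 1: recover P_0 from the two middle predictions (P_0 = P_5' ^ P_6')
  wire [Z-1:0] p0 = p_pred[5*Z +: Z] ^ p_pred[6*Z +: Z];
  // Step 2: correct every prediction with P_0 (P_i = P_i' ^ P_0)
  genvar i;
  generate
    for (i = 0; i < 12; i = i + 1) begin : corr
      assign p_out[i*Z +: Z] = (i == 0) ? p0 : (p_pred[i*Z +: Z] ^ p0);
    end
  endgenerate
endmodule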
Serial architecture (Stage 1: matrix). Same as Stage 1 of the parallel architecture. During the first K_b clock cycles, the encoding order runs from the top toward the middle and from the bottom toward the middle, column by column.
(Block diagram: matrix memory, divider, input control, input data register, barrel shifters #1 and #2, accumulator & prediction memory, and correct stage.)
Serial architecture (Stage 1, continued). In the last clock cycles, the encoding order runs from left to right, row by row.
Reasons for this ordering: 1. the input data can be prepared; 2. the number of slices is reduced.
Serial architecture (Stage 2: divider and input control). The divider divides the data coming from the matrix module; the input control chooses the corresponding input value for the barrel shifters (take clock cycle #2 as an example).
Serial architecture (Stage 3: barrel shifters). The input data are cyclically shifted according to the shift value chosen from the multiplexer.
Serial architecture (Stage 4: accumulator & prediction memory). This module performs three tasks: 1. compute lambda_i; 2. compute P_i'; 3. compute P_0.
Normally, it accumulates the shifted data to compute lambda_i; when the data is the last value in its row, it also computes P_i'.
Serial architecture (Stage 4, continued). When all P_i' have been computed, P_0 is obtained by XORing P_x' and P_(x+1)', the two middle prediction vectors of the matrix.
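A minimal behavioral sketch of the accumulate-and-predict behaviour for one derivation direction (assumed, not the thesis RTL; names are hypothetical, and the arbitrary start value is again taken as zero):

module acc_predict #(
  parameter Z = 96
)(
  input              clk,
  input              clear,     // start of a codeword: restart lambda and prediction
  input              valid,     // a shifted z-bit group belonging to the current row
  input              row_done,  // asserted together with the last group of the row
  input  [Z-1:0]     shifted,   // output of the barrel shifter
  output reg [Z-1:0] lambda,    // running lambda of the current row
  output reg [Z-1:0] pred       // running prediction P_i' for this direction
);
  always @(posedge clk) begin
    if (clear) begin
      lambda <= {Z{1'b0}};
      pred   <= {Z{1'b0}};
    end else if (valid) begin
      if (row_done) begin
        // close the row: P_i' = P_(i-1)' ^ lambda_i, then start the next row
        pred   <= pred ^ lambda ^ shifted;
        lambda <= {Z{1'b0}};
      end else begin
        lambda <= lambda ^ shifted;   // accumulate the current row's lambda
      end
    end
  end
endmodule

Running one such unit from the top rows and another from the bottom rows yields the two middle predictions P_x' and P_(x+1)', from which P_0 is obtained as described above.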
Serial architecture (Stage 5: correct). Correct the other P_i using the equation P_i = P_i' ^ P_0.
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Better Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
Implementation Results
The proposed encoder supports the IEEE 802.16e LDPC codes with code rates 1/2, 2/3, 3/4, and 5/6 and code lengths ranging from 576 to 2304 bits. The hardware implementation and verification were performed on Xilinx Virtex-4 and Altera Stratix FPGA devices.
Implementation Results: parallel architecture
Information throughput ranging from … to … Gbps. The encoder area is constant for any code rate or code length. For a given code rate, an increase in the code length increases the throughput.
(Table: Z, N, slices, FFs, LUTs, CLK (MHz), and IT (Gbps) for rates 1/2, 2/3, 3/4, and 5/6; the encoder uses 14,179 slices, 4,071 FFs, and 26,… LUTs.)
Implementation Results: serial architecture
Information throughput ranging from … to … Gbps. For a given code rate, an increase in the code length increases the throughput.
Implementation Results: area comparison with a parallel architecture using row-by-row encoding order (chart).
Implementation Results: information throughput (IT) comparison and IT/area comparison (charts).
Compare to Related Work
We compare our implementation with [3].
(Table: code length, area (LEs), clock (MHz), IT (Gbps), and IT/total area (Mbps per LE) at code rate 1/2 for [3] and the proposed encoder.)
Observations: better throughput for longer code lengths; less area is needed to support multiple code lengths and code rates; the clock cycle is shorter than in [3].
[3] S. Kopparthi and D. M. Gruenbacher, "Implementation of a flexible encoder for structured low-density parity-check codes," IEEE Pacific Rim Conference on Communications, Computers and Signal Processing (PacRim 2007), Aug. 2007.
Compare to Related Work: throughput comparison
The proposed encoder outperforms the work in [3] in terms of throughput when the code length is longer than 1200. The proposed architecture provides higher throughput for longer code lengths, while the work in [3] does not show this kind of speed-up.
Compare to Related Work: throughput/area comparison
The proposed encoder outperforms the work in [3] in terms of the throughput/area ratio by … to … times, showing that it utilizes hardware resources more efficiently.
Compare to Related Work
We compare our implementation with [2].
Compare to Related Work: throughput comparison
The throughput of the proposed encoder is higher than that of [2] for all code rates and code lengths; the proposed encoder outperforms the work in [2] in terms of throughput by … to … times.
Compare to Related Work: throughput/area comparison
The proposed encoder outperforms the work in [2] in terms of the throughput/area ratio by … to … times. The result shows that the proposed encoder utilizes hardware resources efficiently.
Compare to Related Work (Serial)
We compare our implementation with [4].
Design | Slices | FFs | LUTs | Block RAMs | CLK | IT
[4] | 4,724 | 1,807 | 8,… | … | … | …
Proposed | 12,567 | 3,885 | 22,… | … | … | …
Our proposed encoder achieves a higher IT at a lower clock frequency. In the proposed encoder, the matrix information is built in, so no additional block RAMs are needed. The IT/area of our serial encoder is … Mbps per slice, while the IT/area of [4] is ….
[4] Jeong Ki Kim, Hyunseuk Yoo, and Moon Ho Lee, "Efficient Encoding Architecture for IEEE 802.16e LDPC Codes," IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, 2008.
Outline Introduction Low-Density Parity-Check Codes Related work General encoding for LDPC codes Efficient encoding for Dual-Diagonal matrix Proposed Encoding scheme LDPC Encoder Architecture Parallel Encoder Serial Encoder Result Conclusion
Conclusion
An efficient encoding architecture for IEEE 802.16e LDPC codes with multiple code lengths and code rates is implemented. In our design, switching between code rates or code lengths only requires changing the type of the information data. The architecture is also suitable for the IEEE 802.11n standard. Our encoder achieves higher throughput and a better throughput/area ratio than the conventional encoding schemes when the code length is longer than 1200.
Thank you!!