Suman Das Rice University

Advanced Wireless Receivers: Algorithmic and Architectural Optimizations
Suman Das Rice University Department of Electrical and Computer Engineering & Center for Multimedia Communication

Introduction
Wireless is one of the fastest growing industries
[Figure: millions of cell-phone users by year, 1993-2001. Source: Ericsson]
“By 2002, a lot more cellular phones are going to have internet access than PCs.” Larry Ellison, CEO, Oracle.
The wireless industry is in the midst of phenomenal growth. There are currently over 400 million wireless users worldwide, and this number is expected to quadruple in the next four years. Larry Ellison, the CEO of Oracle, has predicted that by 2002 there will be more cellular phones with internet access than PCs.

Ubiquitous wireless connectivity
[Figure: overlapping wireless layers: Ad-hoc Network, Wireless Cellular, Bluetooth/Home Networks, Wireless LAN]
But the popularity of cell phones is only part of the wireless story. In the future we will be covered by multiple overlapping layers of wireless connectivity. At the office we will connect to a high-speed wireless LAN, at home our refrigerator will talk to our coffee machine using Bluetooth, and at remote locations wireless technology will allow devices to form ad-hoc networks with one another. Connectivity will be “anywhere, anytime”. To achieve this we will have to overcome several challenges. In today’s presentation I will talk about some of these challenges, especially in the context of cellular systems, and about our solutions.

Why advanced receiver algorithms?
The number of wireless subscribers is growing
Multimedia data replacing voice traffic
Higher and varied data rates (144 Kbps - 2 Mbps)
Stricter quality of service (QoS)
Wireless bandwidth remains a critical resource
Current generation receivers are suboptimal
As already noted, the number of cell-phone subscribers is growing at an amazing rate. In addition, media-rich data is slowly replacing traditional voice traffic. For this multimedia data, next generation cellular systems have to support data rates ranging from 144 Kbps on highways up to 2 Mbps in office environments. The quality of service also needs to improve many fold. Unfortunately, wireless bandwidth is not growing at the same pace, and current generation wireless solutions will be inadequate for such a high volume of data. Researchers have been looking for alternative receiver algorithms for these new systems. The good news is that more complex algorithms have been shown to achieve higher performance.

Performance of advanced receivers
[Figure: bit error rate vs. SNR (dB), from -4 to 16 dB, for the current receiver, an advanced receiver, and the theoretical limit]
Huge performance improvement
This viewgraph shows how much performance improvement we can expect from these advanced algorithms. The X-axis is the signal-to-noise ratio, which is a rough measure of the transmitter’s signal power, and the Y-axis is the bit error rate. The lower the bit error rate, the better the performance. The black dotted curve at the top is the performance of currently deployed algorithms, the red curve at the bottom is the theoretical limit of the system (when there is only one user), and the intermediate green curve is the performance of a representative advanced receiver algorithm. It is evident that several orders of magnitude of improvement can be obtained with these new algorithms. So the question is: why aren’t we using them?

Computational requirements of advanced receivers
A 15 user system transmitting at 0.5 Mbps needs
~20 billion additions per second
~15 billion multiplications per second
Requires 32 bit floating point precision
Need fifty 200 MHz floating point DSPs!
The caveat is that these proposed algorithms are also several orders of magnitude more complex. For example, to support 15 users transmitting data at 0.5 Mbps, these algorithms require about 20 billion additions and 15 billion multiplications per second. Moreover, all these operations need to be performed with 32 bit floating point arithmetic units. In short, we would need 50 floating point DSPs running at 200 MHz to sustain such a data rate.
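As a rough sanity check of the figures above, the short calculation below relates the stated operation counts to the stated DSP budget. It only uses the numbers quoted on the slide; the variable names are illustrative, and the implied operations-per-cycle figure is a derived quantity, not one given in the talk.

```python
# Back-of-the-envelope check of the stated workload (numbers from the slide).
additions_per_s = 20_000_000_000        # ~20 billion additions per second
multiplications_per_s = 15_000_000_000  # ~15 billion multiplications per second
num_dsps = 50                           # "fifty 200 MHz floating point DSPs"
clock_hz = 200_000_000                  # 200 MHz

total_ops_per_s = additions_per_s + multiplications_per_s
total_cycles_per_s = num_dsps * clock_hz

# Average number of arithmetic operations each DSP must retire per clock cycle.
ops_per_cycle_per_dsp = total_ops_per_s / total_cycles_per_s
print(ops_per_cycle_per_dsp)
```

The estimate therefore assumes that each DSP sustains several floating point operations per cycle, which is only plausible for processors with multiple functional units working in parallel.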

My research
Receiver design
High performance
Low complexity
Approach
Algorithmic simplification
Efficient architectural mapping
This presents a dilemma. On one hand, we need to deploy some form of advanced receiver algorithm to support multimedia traffic. On the other hand, the solution needs to be feasible for physical implementation in terms of power and hardware requirements. Our approach is to re-examine this class of advanced algorithms and to propose new, computationally efficient algorithms together with their corresponding hardware implementations.

Wireless channel model
Channel effects
Background noise
Fading
Multiple paths
Multiple users: Multiple Access Interference (MAI)
[Figure: User 1 and User 2 transmitting to a base station over a direct path and reflected paths, with noise]
Before we propose our solution, let’s take a closer look at the wireless channel. The characteristics of the wireless channel are quite different from those of a wired channel. The transmitted signal is affected by background noise. Moreover, the wireless channel exhibits fading: the attenuation experienced by the transmitted signal changes over time and frequency. Wireless transmitters are also not directional, so multiple copies of the same signal arrive at the receiver after reflections. This is called the multipath effect. To complicate matters further, multiple users share the same channel, so each user’s signal is affected by the presence of the other users. This is known as multiple access interference (MAI).

Code Division Multiple Access (CDMA)
Wideband CDMA - technology of choice
Users distinguished by spreading sequence
[Figure: spreading waveform S(t) over time, showing one bit made up of chips; spreading gain = 7]
Received signal [equation], with K: number of users, P: number of paths, w: attenuation, t: delay, b: data bits
Since multiple users share a common channel, there needs to be a multiplexing scheme to distinguish between them. For next generation wireless systems, CDMA has been chosen as the multiple access scheme. In this method the users are distinguished by a unique signature waveform. Each user modulates the transmitted bit by this spreading sequence. In this particular example, in order to transmit the data bit 1, one would transmit the spreading waveform given by [ ]. The smallest unit of transmitted data is called a chip. The number of chips transmitted per bit is called the spreading gain of the system.
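The spreading idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the system in the talk: the length-7 code below is a hypothetical sequence chosen only to match the spreading gain of 7, bits are represented as +1/-1, and the channel is ideal (single user, no noise, no delay).

```python
# Hypothetical length-7 spreading sequence (spreading gain = 7); not from the slides.
CODE = [1, 1, 1, -1, -1, 1, -1]

def spread(bits, code):
    """Multiply every +1/-1 data bit by the whole spreading sequence (one chip per entry)."""
    return [b * c for b in bits for c in code]

def despread(chips, code):
    """Correlate each block of len(code) chips with the code and take the sign."""
    n = len(code)
    bits = []
    for start in range(0, len(chips), n):
        corr = sum(x * c for x, c in zip(chips[start:start + n], code))
        bits.append(1 if corr >= 0 else -1)
    return bits

data = [1, -1, 1]
chips = spread(data, CODE)        # 3 bits become 21 chips
recovered = despread(chips, CODE)
print(recovered)
```

Correlating with the same sequence used at the transmitter is what lets the receiver pick one user out of the shared channel; with several users and multipath, the correlations are no longer clean, which is the interference problem the rest of the talk addresses.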

CDMA system
[Figure: TRANSMITTER: data, ENCODING, SPREADING, MODULATION. CHANNEL shared with OTHER USERS. RECEIVER: DEMODULATION, CHANNEL ESTIMATION, DETECTION, DECODING, producing the detected bits of all K users]
Proposed advanced/multiuser receiver modules
Designed in isolation
Suboptimal design
A CDMA transceiver has two components. At the transmitter, the data bits are first encoded with an error-control code and then spread using the spreading code. The result is modulated onto the carrier frequency for transmission. The receiver consists of three modules. Channel estimation, or synchronization, estimates the delays and amplitudes of each path of each user. The detection module uses this information to detect the transmitted bits. The decoder is used to recover from any errors incurred during the estimation process. The advanced receivers employ multiuser algorithms, meaning that the data bits of all the users are estimated simultaneously. So far, however, the research community has designed the three modules in isolation. I will show that this approach is suboptimal not only in terms of performance but also in terms of computational requirements.

Integrated receiver design
[Figure: RECEIVER: DEMODULATION, CHANNEL ESTIMATION, DETECTION, DECODING, producing the detected bits of all K users]
Joint channel estimation and detection
Joint detection and decoding
My proposed solution can be broadly described as an integrated receiver design approach. I will present my results in two steps. First I will present solutions for joint channel estimation and detection, and then I will extend the results to incorporate decoding.

Why separate channel estimation and detection?
[Figure: the received signal r_i on a time axis, showing bits b_i and b_{i+1}, the path delay, and the processing window for channel estimation; the received signal feeds a chip-matched filter followed by channel estimation, and a code-matched filter followed by detection]
Operate on different statistics
The natural question is why people have tried to compartmentalize the design. The reason is that the multiuser algorithms used for channel estimation and detection use two different statistics. For both processes the received signal needs to be discretized. But while the detector works on the output of the code-matched filter at the bit level, the channel estimator takes as input the output of the chip-level matched filter. The processing window for channel estimation starts at an arbitrary point, and the delay of a path is essentially the distance between the start of the bit and the start of this processing window. To compute the code-matched filter output, however, one has to start the filtering at the bit boundary of each user. Since the delays are unknown before channel estimation, and each user has an independent delay for each path, the arbitrary processing window cannot be expected to line up with all the bit boundaries. Hence, even though chip-matched filtering and code-matched filtering share some operations, as do the back-end channel estimation and detection algorithms, one cannot take advantage of that, because the two front ends process different data.

Towards an integrated solution
Reuse computation from the channel estimation step
Use the same discretized filter output
Avoid alignment to the bit interval of each user
Reduce computation
Save hardware
Our aim is to come up with a framework that increases the efficiency of the two modules through better sharing of both data and computation.

Components of the observation vector
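[Figure: contribution of one path of user k to the observation vector when bit i = +1 and bit i+1 = -1; the path is shifted by its delay and scaled by its attenuation w_{k,p}, giving segments weighted by +w_{k,p} and -w_{k,p} that are summed]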

Matrix representation
[Figure: observation vector for bit i = +1 and bit i+1 = +1, written as U_k Z_k b_k(i) + other users]
$r = U Z b_{preamble}$

Efficient statistics
Parametric approach
Build channel model (number of paths)
Estimate delay, attenuation
Produce the code-matched filter output
Our approach
Estimate effective spreading code (UZ)
Code-matched filter $y = (UZ)^T r$
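The code-matched filter $y = (UZ)^T r$ is just one inner product per user between that user's effective spreading code and the received vector. The sketch below is a toy instance under assumptions that are not from the slides: two users, four chips, real-valued and mutually orthogonal effective codes, and no noise.

```python
def code_matched_filter(eff_codes, r):
    """Compute y = (UZ)^T r, where eff_codes is an N x K matrix whose
    columns are the effective spreading codes of the K users."""
    n_chips = len(eff_codes)
    n_users = len(eff_codes[0])
    return [sum(eff_codes[i][k] * r[i] for i in range(n_chips))
            for k in range(n_users)]

# Hypothetical effective codes (columns): user 1 = [1,1,1,1], user 2 = [1,-1,1,-1].
UZ = [[1,  1],
      [1, -1],
      [1,  1],
      [1, -1]]
b = [1, -1]                                   # transmitted bits of the two users
r = [sum(UZ[i][k] * b[k] for k in range(2))   # noiseless received vector r = (UZ) b
     for i in range(4)]
y = code_matched_filter(UZ, r)
print(y)
```

Because the filter works directly with the estimated product UZ, it never needs the individual delays and attenuations, which is the point of the approach listed above.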

Simulation parameters
System parameters
15 users
3 paths
Spreading gain: 31
Hardware platform
TI C62 and C67 EVM boards
64 KB each of internal program and data memory
256 KB SBSRAM, 8 MB SDRAM (external)
Code Composer 1.0 to profile code

Effectiveness of integrated design
[Figure: bit error rate vs. SNR (dB), from -4 to 16 dB. Curves: single user, and multiuser algorithms using the parametric approach, the UZ approach, and the actual parameters]
2 dB gain in performance

Computational savings
Avoid extraction of actual channel parameters
Avoid realignment of data for code-matched filtering
Reduce intermediate storage requirement
Avoid divisions (28 cycles) and square roots (38 cycles) in the DSP

Fixed point behavior
Fixed point advantages
Speed
Power
Cost
Fixed point analysis
12 bits of precision required instead of 32 bits!
Pack two 16 bit operations in 32 bit registers
More packing with saturation arithmetic
User power control!
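To make the packing idea concrete, the sketch below emulates two signed 16 bit lanes held in one 32 bit word and adds them lane by lane with saturation. It is an illustration in plain Python, not DSP code; the helper names are hypothetical, and a real fixed-point DSP would perform the packed add in a single instruction.

```python
INT16_MIN, INT16_MAX = -32768, 32767

def saturate16(x):
    """Clamp to the signed 16 bit range instead of wrapping around."""
    return max(INT16_MIN, min(INT16_MAX, x))

def pack(hi, lo):
    """Store two signed 16 bit values in one 32 bit word."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)

def unpack(word):
    """Recover the two signed 16 bit lanes from a 32 bit word."""
    def signed(v):
        return v - 0x10000 if v & 0x8000 else v
    return signed((word >> 16) & 0xFFFF), signed(word & 0xFFFF)

def packed_add_sat(a, b):
    """Add two packed words lane by lane, saturating each lane."""
    a_hi, a_lo = unpack(a)
    b_hi, b_lo = unpack(b)
    return pack(saturate16(a_hi + b_hi), saturate16(a_lo + b_lo))

result = unpack(packed_add_sat(pack(30000, -5), pack(10000, -7)))
print(result)
```

Saturation matters here because an overflow in one lane would otherwise wrap around and turn a large positive value into a large negative one.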

Time requirement
[Figure: bar chart of normalized time (bar values 100, 68.5 and 41.8) for Original, Unified Synch + Detect, and 16 bit fixed-point]
2.39 X speedup

[Figure: RECEIVER: DEMODULATION, CHANNEL ESTIMATION, DETECTION, DECODING, producing the detected bits of all K users]
Joint channel estimation and detection
Effective spreading code approach
Optimized detector design
Joint detection and decoding

Linear multiuser detector
Received signal: $r = (UZ) b + n$
Channel estimation: $(UZ)$
Matched filter output: $y = (UZ)^T r$
Linear detector: solve $R b + n = y$, where $R = (UZ)^T (UZ)$
Size of the linear system: $NK$
Direct inverse takes $O((NK)^3)$ operations
N: block length, K: number of users

Outline of the Kronecker algorithm
Correlation matrix is block-Toeplitz Approximate it as a block-circulant system

Outline of the Kronecker algorithm
Kronecker representation
Isolates structure and the matrix blocks
Fourier transform converts it to a block-diagonal system
Computationally optimal
Solve N independent order-K systems iteratively

Speedup in detector
Complexity $O(N^2 K^3)$ vs. $O(NK^2 + KN \log N)$
[Figure: bar chart of achievable data rate (Kbps): Decorrelator 10.4 Kbps, Kronecker 83.1 Kbps]

Pipelining and parallelization
Mostly matrix based operations
Detector: iterative algorithm
Pipeline various iterations
Parallelize operations
Add more functional units
Distribute data across functional units
Distribute computations

Projected computation time
[Figure: bar chart of achievable data rate (Kbps): Base Multiuser Algorithm 20.75 Kbps, Hardware Pipelining 154.3 Kbps, Pipelining + Parallelization 564.5 Kbps (30 adders and multipliers). Annotations: DSP only; DSP + Coprocessor support]

[Figure: RECEIVER: DEMODULATION, CHANNEL ESTIMATION, DETECTION, DECODING, producing the detected bits of all K users]
Joint channel estimation and detection
Effective spreading code approach
Optimized detector design
Joint detection and decoding

Maximum a-posteriori (MAP) decoding
[Figure: TRANSMITTER: data bits b, ENCODING, coded bits d, SPREADING, MODULATION, with OTHER USERS]
Received signal: $r = UZ d + n$
Optimum decoding rule
Constrained optimization problem
Decode all users simultaneously
Exponential complexity in number of users

Single-user detection and decoding
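Suboptimum alternatives
Isolate detection and decoding
[Figure: the received signal r feeds matched filters MF 1 ... MF K; the outputs y_1 ... y_K go to Decoder 1 ... Decoder K, which produce the estimates $\hat{b}_1 ... \hat{b}_K$]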

Decoding matched filter outputs
[Figure: BER vs. SNR (dB), from 1 to 8 dB, for MF+Viterbi and Optimal]
Huge performance loss!

Iterative detection and decoding
User of concern c, interfering users I
$r = (UZ)_c d_c + (UZ)_I d_I + z$
Estimate $d_I$
Eliminate interference: $y_c = (UZ)_c^T (r - (UZ)_I \hat{d}_I)$
Estimate $d_c$ for the next step
Complexity linear in number of users

Reduction in decoding complexity
Convolutional code
Coded bits depend on past data bits
Performance improves with memory length
Viterbi algorithm for decoding
Complexity exponential in memory length
Our suboptimal approach
Maximal weight basis decoding
Complexity quadratic in memory length

Joint detection and decoding performance
[Figure: BER vs. SNR (dB), from 1 to 8 dB, for a rate 1/2, k = 7 code. Curves: MF + Viterbi, Iter1 + Subopt, Iter1 + Viterbi, Iter3 + Subopt, Optimal]

Joint detection and decoding
Huge performance gain
Suboptimal approximation
Insignificant performance loss
Significant computational gain
Architecture for suboptimal decoding?
Viterbi algorithm: butterfly architecture
Have a sliding window implementation

Summary of contributions
Integrated channel estimation and detection model [WCNC]
Optimized detection algorithm [PIMRC, Tr. Com]
Fixed point implementation [ICASSP, SPIE]
Parallel architecture [Asilomar]
Joint detection and decoding [Globecom, Tr. Com]
Suboptimal decoding algorithm [Asilomar, Tr. Inf. Th.]

Future research
[Figure: overlapping wireless layers: Wireless Cellular, Ad-hoc Network, Wireless LAN, Bluetooth/Home Networks]

Future research
Universal wireless receiver
Reconfigurable solution
Power efficient
Automate design?
Network level interaction
Resource allocation
Quality of service guarantee
Application level interaction

Further details
http://www.ece.rice.edu/~suman

UZ Method

System Model - Received Signal
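Chip asynchronous
$t_{k,p} = q + g$ (integer and fractional part of the delay)
[Figure: bit i = +1 and bit i+1 = +1, offset by q and g; Right and Left parts of the shifted spreading code]
$r_i(m) = w_{k,p} r_{k,p}(m) g$ (Right) $+ w_{k,p} r_{k,p}(m) (1-g) + w_{k,p} r_{k,p}(m) g$ (Left) $+ w_{k,p} r_{k,p}(m) (1-g)$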

$A(m) = [a_{1R} \; a_{1L} \; \dots \; a_{kR} \; a_{kL} \; \dots \; a_{KR} \; a_{KL}]$
$T_c$: chip period; $t/T_c = q + g$ (integer and fractional part)
Columns of A(m): linear combinations of the $c_k[q]$

Channel Response Model
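Hence, the columns of A(m) for each user k are:
$a_{kR} = U_{kR} z_k(m)$, $a_{kL} = U_{kL} z_k(m)$
where
$U_{kR} = [c_{kR}[0] \; \dots \; c_{kR}[q] \; \dots \; c_{kR}[N-1]]$
$U_{kL} = [c_{kL}[0] \; \dots \; c_{kL}[q] \; \dots \; c_{kL}[N-1]]$
$z_k(m) = h_k \circ p_k(m)$
$U_{kR}$, $U_{kL}$: spreading code shifted by delay
$h_k$: multipath attenuation and delay components
$p_k(m)$: array response components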

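Structure of $p_k(m)$ and $h_k$ (all other elements are zero):
Path 1, $q_{k,1}$th element: $h_k$ entry $w_{k,1}(1-g_{k,1})$, $p_k(m)$ entry $r_{k,1}(m)$
Path 1, $(q_{k,1}+1)$th element: $h_k$ entry $w_{k,1} g_{k,1}$, $p_k(m)$ entry $r_{k,1}(m)$
...
Path P, $q_{k,P}$th element: $h_k$ entry $w_{k,P}(1-g_{k,P})$, $p_k(m)$ entry $r_{k,P}(m)$
Path P, $(q_{k,P}+1)$th element: $h_k$ entry $w_{k,P} g_{k,P}$, $p_k(m)$ entry $r_{k,P}(m)$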
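The structure above says that a path with delay q + g contributes the spreading code shifted by q chips with weight w(1-g) and shifted by q+1 chips with weight w g. The sketch below builds the resulting effective code for one user under simplifying assumptions that are not from the slides: a single receive antenna (so the array response is 1), real attenuations, a single transmitted bit, and an observation window of two bit periods. The function and variable names are hypothetical.

```python
def effective_code(code, paths):
    """Sum of delayed, attenuated copies of `code`.

    paths: list of (attenuation w, delay in chips), with 0 <= delay < len(code).
    A fractional delay q + g is split between shifts q and q + 1
    with weights (1 - g) and g, as in the h_k structure.
    """
    n = len(code)
    out = [0.0] * (2 * n)              # window of two bit periods
    for w, delay in paths:
        q = int(delay)                 # integer part of the delay
        g = delay - q                  # fractional part of the delay
        for i, chip in enumerate(code):
            out[i + q] += w * (1 - g) * chip
            out[i + q + 1] += w * g * chip
    return out

code = [1, -1, 1]                      # toy spreading code
z = effective_code(code, [(0.5, 1.25)])   # one path: w = 0.5, delay = 1.25 chips
print(z)
```

In the integrated design this combined vector is what gets estimated directly, so the individual w, q and g never have to be extracted.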

Channel Estimation - Maximum Likelihood Algorithm
U: spreading codes of all the users (known)
Z: all unknown parameters of all paths of all users (time delay, attenuation, array response)
Goal: estimate Z using the spreading codes and the preamble

Channel estimation - Maximum Likelihood Algorithm
Given: L observations $r_1, r_2, \dots, r_L$
Joint conditional probability density function of $r_1, r_2, \dots, r_L$
Goal: maximize the above with respect to the channel parameters (Z)

Multi-step Optimization Process
Define $y = UZ$.
Form the ML estimate $\hat{y}$ of $y$.
Estimate $\hat{z}_k$ by a least squares fit between $y = UZ$ and $\hat{y}$.
Extract the individual parameters from the $\hat{z}_k$.

Kronecker algorithm

Iterative methods
To solve $Rb = y$
Solution at step k is $b_k$
Calculate error $e_k = R b_k - y$
Modify solution from error and earlier estimate
Cost: a matrix-vector product takes $O(n^2)$ operations
Each iteration takes $O(n^2)$ steps
Can we do better?
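The slide does not say which iterative method is used, so the sketch below picks the simplest scheme that fits the description: at each step compute the error $e_k = R b_k - y$ and correct the estimate by a fixed multiple of it (a Richardson iteration). The step size and the 2x2 test system are assumptions chosen so that the iteration converges.

```python
def matvec(R, b):
    """Dense matrix-vector product, O(n^2) operations."""
    return [sum(r_ij * b_j for r_ij, b_j in zip(row, b)) for row in R]

def iterative_solve(R, y, step, iters=100):
    """Solve R b = y by repeatedly subtracting step * (R b_k - y).

    Converges for symmetric positive definite R when
    0 < step < 2 / (largest eigenvalue of R).
    """
    b = [0.0] * len(y)
    for _ in range(iters):
        error = [rb - yi for rb, yi in zip(matvec(R, b), y)]   # e_k = R b_k - y
        b = [bi - step * ei for bi, ei in zip(b, error)]       # update from error
    return b

R = [[2.0, 0.5],
     [0.5, 2.0]]      # toy correlation matrix (eigenvalues 1.5 and 2.5)
y = [1.0, -1.0]
b = iterative_solve(R, y, step=0.4)
print(b)
```

Each pass costs one matrix-vector product, which is why the cost of that product is the quantity worth reducing.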

Matrix vector product
A circulant matrix is diagonalized by the Fourier matrix
$C x = (F^H \Lambda F) x$
Matrix-vector product through Fourier transforms
Takes $O(n \log n)$ steps
To calculate $Ax$ when A is Toeplitz:
Embed the Toeplitz matrix in a circulant system
Compute the product by Fourier transforms
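A minimal sketch of the embedding trick follows, assuming a real-valued Toeplitz matrix whose size n is a power of two (so that the length-2n transforms can use a plain radix-2 FFT). The FFT is written out by hand to keep the example dependency-free; the function names are illustrative.

```python
import cmath

def fft(a, inverse=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (unscaled inverse)."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out

def circulant_matvec(first_col, x):
    """C x for circulant C, computed as a circular convolution via FFTs."""
    n = len(x)
    spectrum = [c * v for c, v in zip(fft(first_col), fft(x))]
    return [v.real / n for v in fft(spectrum, inverse=True)]

def toeplitz_matvec(first_col, first_row, x):
    """T x for Toeplitz T, by embedding T in a circulant matrix of size 2n."""
    n = len(x)
    # First column of the circulant: [t_0..t_{n-1}, 0, row_{n-1}..row_1]
    c = list(first_col) + [0.0] + list(reversed(first_row[1:]))
    padded = list(x) + [0.0] * n
    return circulant_matvec(c, padded)[:n]

col = [4.0, 1.0, 0.5, 0.25]   # first column of a toy Toeplitz matrix
row = [4.0, 2.0, 1.0, 0.5]    # first row (row[0] must equal col[0])
x = [1.0, 2.0, 3.0, 4.0]
fast = toeplitz_matvec(col, row, x)
print(fast)
```

The product costs three FFTs of length 2n instead of n^2 multiplications, at the price of working with a padded vector.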

Example

Outline of the algorithm
Correlation matrix is block-Toeplitz
Approximate it as a block-circulant system
Shift matrix as a building block
Kronecker representation of the system
Fourier transform to convert it to a block-diagonal system

Shift matrix
The Fourier matrix diagonalizes the shift matrix:
$F^H S F = D$
$F^H S^2 F = D^2$
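The claim that the Fourier matrix diagonalizes the shift matrix is easy to check numerically. The sketch below assumes a unitary DFT matrix and a cyclic down-shift matrix of size 4; both conventions are choices made for this example and may differ from the ones used in the talk.

```python
import cmath

def dft_matrix(n):
    """Unitary DFT matrix with entries exp(-2*pi*i*j*k/n) / sqrt(n)."""
    w = cmath.exp(-2j * cmath.pi / n)
    scale = n ** 0.5
    return [[w ** (j * k) / scale for k in range(n)] for j in range(n)]

def shift_matrix(n):
    """Cyclic shift: (S x)[i] = x[(i - 1) mod n]."""
    return [[1 if i == (j + 1) % n else 0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def conj_transpose(A):
    return [[A[j][i].conjugate() for j in range(len(A))] for i in range(len(A[0]))]

n = 4
F = dft_matrix(n)
S = shift_matrix(n)
FH = conj_transpose(F)
D = matmul(matmul(FH, S), F)                 # F^H S F
D2 = matmul(matmul(FH, matmul(S, S)), F)     # F^H S^2 F
off_diagonal = max(abs(D[i][j]) for i in range(n) for j in range(n) if i != j)
print(off_diagonal, [D[k][k] for k in range(n)])
```

With these conventions the diagonal entries are the n-th roots of unity, and the second product has their squares on the diagonal, so powers of the shift are diagonalized by the same transform.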

In search of efficient solvers
Correlation matrix is block-Toeplitz Rewrite correlation matrix as RN1 - block-circulant

Kronecker product representation
Kronecker product is defined by Thus

Kronecker algorithm The system is a sum of Kronecker products
Property of Kronecker product Thus is a block diagonal system
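The equations on these slides were not preserved, so the following is only a sketch of how the argument plausibly goes, not a transcription. It assumes the block-circulant approximation has a diagonal block $R_0$ and a single off-diagonal block $R_1$ (the symbols $R_0$, $R_1$, $I_N$, $I_K$ and $d_i$ are introduced here for illustration), and it uses the standard mixed-product property of the Kronecker product together with the diagonalization of the shift matrix $S$.

```latex
% Kronecker product: A \otimes B = [a_{ij} B]
% Mixed-product property: (A \otimes B)(C \otimes D) = (AC) \otimes (BD)
\begin{align*}
R &\approx I_N \otimes R_0 + S \otimes R_1 + S^T \otimes R_1^T
   && \text{(assumed block-circulant form)} \\
(F^H \otimes I_K)\, R\, (F \otimes I_K)
  &= (F^H F) \otimes R_0 + (F^H S F) \otimes R_1 + (F^H S^T F) \otimes R_1^T \\
  &= I_N \otimes R_0 + D \otimes R_1 + D^H \otimes R_1^T
   && \text{(since } S^T = S^{-1} \text{ for a cyclic shift)}
\end{align*}
% D is diagonal, so the result is block diagonal with N blocks of size K x K:
%   R_0 + d_i R_1 + \bar{d}_i R_1^T,  i = 1, ..., N
```

Under these assumptions the order-NK system splits into N independent order-K systems, which matches the "solve N independent order K system" step in the outline of the algorithm.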

Architectural mapping

Computations Involved
[Figure: received signal r_i on a time axis, showing bits b_i and b_{i-1} and the delay]
Model
Compute Correlation Matrices
Bits of K asynchronous users aligned at times i and i-1
Received bits of spreading length N for K users

Solve for the channel estimate, $A_i$
Multishot Detection

Differencing Multistage Detection
Successive stages
$S = \mathrm{diag}(A^H A)$
y: soft decision
d: detected bits (hard decision)

Structure of $A^H A$
Block Bi-Diagonal Matrix

Task Decomposition
[Figure: data-flow diagram with Pilot, Data, b and d routed through multiplexers (MUX) into Blocks I to IV]
Task A: Channel Estimation (Blocks I, II, III)
Correlation Matrices (Per Bit): $R_{bb}$ $O(K^2)$, $R_{br}[R]$ $O(KN)$, $R_{br}[I]$ $O(KN)$
Inverse: $R_{bb} A^H = R_{br}[R]$ $O(K^2 N)$, $R_{bb} A^H = R_{br}[I]$ $O(K^2 N)$
Matrix Products: $A_0^H A_1$ $O(K^2 N)$, $A_0^H A_0$ $O(K^2 N)$, $A_1^H A_1$ $O(K^2 N)$
Task B: Multistage Detection (Per Window) (Block IV)
$A^H r$ $O(KND)$, d $O(D K^2 M_e)$

Sequential / Pipeline A B
[Figure: Task A followed by Task B. Block IV: Data, $A^H r$ $O(KND)$, d $O(D K^2 M_e)$]

(Parallel A) B
[Figure: Task A and Task B with cycle counts 3367*Me cycles and 885 cycles. Block IV split across users 1 ... K: Data, $A^H r$ $O(ND)$ per user, d $O(D K^2 M_e)$]

At this step
[Figure: Task A (Blocks I and II) and Task B (Multistage Detection: Block III, Block IV, successive stages) operating on Data for users 1 ... K]

Joint detection and decoding

Convolutional codes
Rate: 1/2, memory (k): 2
[Figure: encoder mapping data bits b to $d_{odd}$ and $d_{even}$]
$d_2 = d_1$
$d_4 = d_1 + d_3$
$d_6 = d_1 + d_3 + d_5$
$d_8 = d_3 + d_5 + d_7$
$d_{10} = d_5 + d_7 + d_9$
$d_{odd}$: systematic bits
$d_{even}$: parity bits
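The parity equations above describe a systematic rate 1/2 encoder in which each parity bit is the modulo-2 sum of the current data bit and the two previous ones. A minimal sketch of such an encoder follows; it assumes the shift register starts at zero and that "+" on the slide means addition modulo 2.

```python
def encode(bits):
    """Systematic rate-1/2 convolutional encoder with memory 2.

    Output is interleaved as d1, d2, d3, d4, ...: odd positions carry the
    data bits, even positions carry the parity b_i + b_{i-1} + b_{i-2} (mod 2).
    """
    coded = []
    for i, b in enumerate(bits):
        prev1 = bits[i - 1] if i >= 1 else 0   # register assumed to start at 0
        prev2 = bits[i - 2] if i >= 2 else 0
        coded.append(b)                        # systematic bit (odd position)
        coded.append(b ^ prev1 ^ prev2)        # parity bit (even position)
    return coded

codeword = encode([1, 0, 1, 1])
print(codeword)
```

For example, the fourth coded bit is d4 = d1 + d3 = 1 + 0 = 1, in agreement with the equations listed on the slide.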

Iterative Interference Cancellation
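[Figure: block diagram. For each user k = 1 ... K, the spread estimates of the other users’ coded bits are subtracted from r, the result is passed through the matched filter (MF) for user k to give $y_k$, and the coded bits $\hat{d}_k$ are estimated]
Iterate the process among users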

Interference Cancellation
User of concern c, interfering users I
$r = S_c A_c d_c + S_I A_I d_I + z$
Estimate $d_I$
Eliminate interference: $\hat{y}_c = S_c^T (r - S_I A_I \hat{d}_I)$
Estimate $d_c$

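Prior Updates
[Figure: interference cancellation block diagram for user 1: the spread estimates of users 2 ... K are subtracted from r, the result goes through the MF for user 1 to give $\hat{y}_1$, and the coded bits are estimated using the updated prior $p(d_1)$ from the previous step; the outputs $p(d_1|y_1)$ and $\hat{d}_1$ provide the prior $p(d_1)$ for the next step]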

Algorithm
Start with a uniform prior $p^0(d_c)$ for all users
Obtain an initial estimate $\hat{d}^0$ from the matched filter output
In the ith iteration, for all users, compute: $\hat{y}^i_c = S_c^T (r - S_I A_I \hat{d}^{i-1}_I)$
Calculate $\hat{d}^i_c$ and $p^i(d_c | \hat{y}^i_c)$
Use $p^i(d_c | \hat{y}^i_c)$ as the prior $p^{i+1}(d_c)$
Stop iterating when there is no further change in $\hat{d}$
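The loop structure of this algorithm can be sketched with hard decisions only. This is a deliberately reduced version: it drops the prior updates and the channel decoder, assumes the amplitudes are known, and uses a hypothetical two-user example with correlated codes and a strong interferer, so it shows the cancel-and-redetect iteration rather than the full method.

```python
def sign(v):
    return 1 if v >= 0 else -1

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matched_filter_detect(codes, r):
    """Initial estimate: sign of each user's matched filter output."""
    return [sign(dot(s, r)) for s in codes]

def iterative_cancellation(codes, amps, r, max_iters=10):
    """Repeat: subtract the other users' estimated signals, then re-detect."""
    d = matched_filter_detect(codes, r)
    for _ in range(max_iters):
        new_d = []
        for c, s_c in enumerate(codes):
            residual = list(r)
            for k, s_k in enumerate(codes):
                if k != c:   # remove interferer k using the previous estimate
                    residual = [x - amps[k] * d[k] * chip
                                for x, chip in zip(residual, s_k)]
            new_d.append(sign(dot(s_c, residual)))
        if new_d == d:       # stop when the decisions no longer change
            break
        d = new_d
    return d

codes = [[1, 1, 1, 1], [1, 1, 1, -1]]   # toy, non-orthogonal spreading codes
amps = [4.0, 1.0]                        # user 1 is much stronger than user 2
true_bits = [1, -1]
r = [sum(amps[k] * true_bits[k] * codes[k][i] for k in range(2)) for i in range(4)]
initial = matched_filter_detect(codes, r)
final = iterative_cancellation(codes, amps, r)
print(initial, final)
```

In this example the plain matched filter gets the weak user's bit wrong because of the strong interferer, and one round of cancellation corrects it. The cost per iteration grows linearly with the number of users.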

Convergence Study Optimal performance achieved after a few iterations

Maximum weight basis decoding

Suboptimal single user channel decoder
$y = (y_1, \dots, y_N)$, $d = (d_1, \dots, d_N)$
Viterbi algorithm: complexity grows exponentially with k
If there were no codeword constraint: $d = \mathrm{sgn}(y)$
The estimated d may not be a codeword!

Maximum weight basis decoding
More variables than equations
NR independent variables (N: block length, R: rate)
Choice depends on $y_i$
[Figure: example values of y with decisions d = 1, d = -1, and d = ? for an unreliable value]
$d_2 = d_1$
$d_4 = d_1 + d_3$
$d_6 = d_1 + d_3 + d_5$
$d_8 = d_3 + d_5 + d_7$
$d_{10} = d_5 + d_7 + d_9$
Want to choose a maximally independent subset with the largest total weight

Selection of maximally independent subset
Set $I = \emptyset$
Given y, sort the weights $|y_i|$, $i \in \{1..N\}$
While $|I| < NR$:
Choose the location e from $\{1..N\}$ with the largest weight such that $I \cup \{e\}$ is still an independent subset of $\{1..N\}$
Set $I = I \cup \{e\}$
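The greedy selection can be sketched as follows. The slides do not spell out how independence is tested, so this sketch assumes a linear block code and treats a set of bit positions as independent when the corresponding columns of the generator matrix are linearly independent over GF(2). Columns are stored as integer bitmasks; the function name and the (3,2) even parity example are illustrative.

```python
def select_basis(y, columns, rank):
    """Greedily pick `rank` positions with the largest |y_i| whose
    generator-matrix columns are linearly independent over GF(2).

    columns[i] is column i of the generator matrix packed into an int.
    """
    order = sorted(range(len(y)), key=lambda i: -abs(y[i]))   # most reliable first
    basis, chosen = [], []
    for pos in order:
        v = columns[pos]
        for b in basis:
            v = min(v, v ^ b)     # clear b's leading bit from v if it is set
        if v:                     # nonzero remainder: independent of earlier picks
            basis.append(v)
            chosen.append(pos)
        if len(chosen) == rank:
            break
    return chosen

# (3,2) even parity code with generator rows [1,0,1] and [0,1,1]:
# column 0 = (1,0), column 1 = (0,1), column 2 = (1,1).
columns = [0b01, 0b10, 0b11]
y = [7.5, 0.5, 7.5]               # hypothetical soft values
picked = select_basis(y, columns, rank=2)
print(picked)
```

Sorting dominates the cost unless a linear-time sort is used, and each candidate is reduced against at most `rank` previously chosen columns.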

Suboptimal decoding algorithm
Choose M maximum independent subsets
For each independent subset I:
Set $d_e = \mathrm{sgn}(y_e)$ for the chosen locations e
Compute the codeword $d_I$
Compute the likelihood $p(y | d_I)$
Choose the codeword with the largest likelihood
Decoding complexity reduced from $O(2^k)$ to $O(k^2)$

Complexity Issues
Iterative detection and decoding
Linear complexity in the number of users
Viterbi decoding for each user
Exponential complexity in the constraint length k
Cannot use large constraint length codes for real time systems

Suboptimal Channel Decoder
Received signal r
Select the codeword d which has the largest likelihood
If there were no codeword constraint: $d = \mathrm{sgn}(r)$
Complexity independent of k
Systematic code: $d = (d_s, d_p)$
$d_s$ can be chosen in an unconstrained manner
The choice of $d_s$ uniquely determines $d_p$

Suboptimal Channel Decoder (cont.)
(2,3) even parity code
[Figure: code bits, transmitted bits, and received signal]
Pr(d=1 | r=7.5) = 1
Pr(d=1 | r=0.5) = 0.6
ML decoder: likelihood = (1) x (0.4) x (1) = 0.4
Unconstrained detection: likelihood = (1) x (0.6) x (1) = high!

[Figure: code bits and received signal]
Decide based on info bits: likelihood = (1) x (0.6) x (0) = 0, possible error!
Choose top 2 paths: likelihood = (1) x (0.4) x (1) = 0.4
Final decision based on likelihood value

Any 2 of the 3 bits can be chosen independently
That uniquely determines the third bit
[Figure: code bits and received signal, marking the bits that should be detected first]
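For the three-bit even parity example, the idea reduces to a few lines: trust the two received values with the largest magnitude and let the parity constraint fix the remaining bit. The received vector below is an assumption chosen to be consistent with the probabilities quoted in the example (two strong values and one weak one); the mapping "positive value means bit 1" is also an assumption.

```python
def decode_even_parity(r):
    """Decode a 3-bit even parity codeword from soft values r.

    The two most reliable positions are decided by sign; the third bit is
    then forced so that the total number of ones is even.
    """
    order = sorted(range(3), key=lambda i: -abs(r[i]))   # most reliable first
    trusted, forced = order[:2], order[2]
    d = [None, None, None]
    for i in trusted:
        d[i] = 1 if r[i] > 0 else 0
    d[forced] = d[trusted[0]] ^ d[trusted[1]]            # even parity constraint
    return d

received = [7.5, 0.5, 7.5]     # hypothetical soft values: the middle one is unreliable
decoded = decode_even_parity(received)
hard = [1 if v > 0 else 0 for v in received]             # unconstrained sign decisions
print(decoded, hard)
```

Here the unconstrained sign decisions give three ones, which is not a codeword, while deciding the two strong bits first overrides the weak middle bit and yields the same answer as the ML decoder in the example.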

Suboptimal Convolutional Decoder
Convolutional code of rate R and block length N
NR bits can be chosen independently
The choice of the NR bits is not as simple
Choose the ones we believe in most
Choose the bits $b_i$ with the largest received $|r_i|$

Algorithm Analysis
The algorithm gives a maximal independent subset with the largest weight
Complexity
Radix sort: $O(N)$
At most k equations for each unknown
Total algorithm complexity: $O(NKk)$
