1 Enabling Protocol Coexistence: High-Level Hardware-Software Co-design of Flexible Modern Wireless Transceivers
Benjamin Drozdenko, Graduate Research Assistant & Ph.D. Candidate
Advisors: Prof. Leeser (RCL) & Prof. Chowdhury (GENESYS)
Northeastern University, Boston, MA
Northeastern ECE Ph.D. Student Seminar Series (NEPSSS), March 30, 2016
2 So What? Who Cares?
Wireless transceivers: y'all got 'em!
- Surge in wireless devices: 10B devices today, 50B by 2050
- $14 trillion business over the next 10 years
Challenges: times are changing
- C1: Adapt to changing protocols to handle contention
- C2: Maintain/increase bit rates
- C3: Decrease energy consumption and error rates
(Figure labels: LTE, Wi-Fi)
3 Another Challenge: Spectrum Scarcity
- C4: Change center frequency to use new bandwidths
- MHz: af TV whitespace reuse
- GHz: military radar reuse
- 2.4, 5.8 GHz: a/b designated ISM bands
4 Barriers: Why Making Such Wireless Transceivers Is Hard
Comms protocols evolve; transceiver HW/SW must evolve too!
- B1: HW-SW modeling environment
- B2: HW & SW must be reconfigurable
- B3: Each processing block (PB) must be the same on HW & SW
- B4: Map behaviors to HW or SW
(Diagram: a modeling environment spanning SW and HW (FPGA), with radio parameters f_c = 5.8 GHz and f_s = 20 Msps; processing blocks ProcBlk1, ProcBlk2, and ProcBlk3 connected over an effective bus; equivalence of y=FFT(x), module FFT(x,y), and function FFT(x,y))
5 Top Down-Bottom Up Approach to Modeling Wireless Transceivers for Protocol Coexistence
Layers: System goals and challenges; Enabling Technologies; Fundamental Research
- (C1) Adapt to changing protocols, contention
- (C2) Time synchronization
- (C3) Low energy/error rate
- (C4) Spectrum scarcity: changing bandwidths
- (B1) HW & SW joint modeling environment
- (B2) SW control of radio parameters
- (B2) Partially reconfigurable HW (FPGA)
- (B1) Signal processing for wireless comms understanding
- (B1) Time & energy optimization techniques
- (B3) Implementation portability: equivalent functionality on HW & SW
- (B4) Identify ideal mapping of wireless behaviors to HW & SW
(Diagram labels: R1, R2, R3, R4, T1, T2, T3)
6 T2: HW & SW Joint Modeling Environment: Testbed Requirements
What Does a HW-SW Modeling Environment Need?
1. Provide a HW-SW Prototyping Platform
   1. Modeling for Wireless Processing Blocks (PBs)
   2. Hardware (HW) Components
   3. Software (SW) Tools
2. Model a HW-SW Divide Point
3. Enact HW-SW Interfacing
4. Exhibit Reusability & Adaptability to Modern Standards
7 HW-SW Prototyping Platform: Modeling for Wireless Processing Blocks

| PB | Transmitter (Tx)     | Receiver (Rx)         |
|----|----------------------|-----------------------|
| 1  | Scrambling           | Preamble Detection    |
| 2  | Convolutional Coding | OFDM Demodulation     |
| 3  | Block Interleaving   | BPSK Demodulation     |
| 4  | BPSK Modulation      | Block De-interleaving |
| 5  | OFDM Modulation      | Viterbi Decoding      |
| 6  | Preamble Insertion   | Descrambling          |

- Tx path: data bits in, samples out; Rx path: samples in, data bits out
- One Simulink model for the Tx or Rx path
- Simulink: design Synchronous Dataflow (SDF) models
- Integrated profiling: look at an entire PHY layer processing chain
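As one concrete instance of a processing block from the table above, here is a minimal sketch of a frame-synchronous scrambler of the kind used for Tx PB 1 and Rx PB 6. It assumes the IEEE 802.11a generator polynomial x^7 + x^4 + 1. The function name `scramble`, the default seed, and the seed's bit ordering are illustrative choices and are not taken from the Simulink model.

```python
def scramble(bits, seed=0b1011101):
    """XOR a bit list with the x^7 + x^4 + 1 LFSR keystream.

    Assumption: bit i of `seed` initializes register stage x(i+1).
    The seed value itself is a hypothetical placeholder.
    """
    # state[0] is stage x1 (newest), state[6] is stage x7 (oldest)
    state = [(seed >> i) & 1 for i in range(7)]
    out = []
    for b in bits:
        feedback = state[3] ^ state[6]   # x4 XOR x7
        out.append(b ^ feedback)         # scrambled (or descrambled) bit
        state = [feedback] + state[:6]   # shift the feedback bit into x1
    return out


if __name__ == "__main__":
    data = [1, 0, 1, 1, 0, 0, 1, 0] * 3      # 24 data bits, as in one frame
    tx = scramble(data)
    rx = scramble(tx)                        # descrambling is the same operation
    print(rx == data)
```

Because the keystream depends only on the seed, applying the function twice with the same seed restores the input. This is why the Tx scrambler and Rx descrambler can share one block design.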
8 HW-SW Prototyping Platform: Hardware Components: Xilinx Zynq
Zynq-Based Heterogeneous Computing System
- Zynq-7000 series System-on-Chip (SoC)
- Processing System: ARM Cortex-A9 CPU
- Programmable Logic: FPGA with DSPs & BRAM
- We prototype on 2 varieties: ZC706 & ZedBoard
(Diagram: Zynq SoC containing CPU and FPGA)
9 HW-SW Prototyping Platform: Hardware Components
- Zynq-Based Heterogeneous Computing System: Zynq SoC (CPU and FPGA) with receive path and transmit path
- Radio Frequency (RF) Front End: ADI FMComms3 (AD9361, 2 Tx, 2 Rx) in the FMC slot
- Host Personal Computer (PC): runs SW tools
- Links: JTAG (to FPGA), Ethernet (to CPU)
10 HW-SW Prototyping Platform: Software Tools
Host PC: runs SW tools; linked to the Zynq-Based Heterogeneous Computing System by JTAG (to FPGA) and Ethernet (to CPU)
- Embedded Coder™: generate C code for the ARM processor
- HDL Coder™: create HW Description Language (HDL) code
- Xilinx Vivado®: synthesize, implement, and generate the FPGA bitstream
Flow: MathWorks Simulink™ model → C code → ARM executable (CPU); MathWorks Simulink™ model → HDL code → FPGA bitstream (FPGA)
11 Modeling a HW-SW Divide Point
- V1: SW-only model
- V2: Adds Tx F6 & Rx F1 to HW
- V3: Adds Tx F5 & Rx F2 to HW
- V4: Adds Tx F4 & Rx F3 to HW
- V5: Adds Tx F3 & Rx F4 to HW
- V6: Adds Tx F2 & Rx F5 to HW
- V7: HW-only model
(Diagram: Zynq-Based Heterogeneous Computing System, with the SW/HW boundary on the receive and transmit paths shifting from V1 to V7)
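The variant scheme above can be sketched as a small partitioning function. It is a minimal illustration that assumes F1 to F6 correspond to processing blocks 1 to 6 of the Tx and Rx chains. The names `TX_BLOCKS`, `RX_BLOCKS`, and `partition` are hypothetical and do not come from the modeling environment.

```python
TX_BLOCKS = ["Scrambling", "Convolutional Coding", "Block Interleaving",
             "BPSK Modulation", "OFDM Modulation", "Preamble Insertion"]
RX_BLOCKS = ["Preamble Detection", "OFDM Demodulation", "BPSK Demodulation",
             "Block De-interleaving", "Viterbi Decoding", "Descrambling"]


def partition(variant):
    """Return which blocks run in SW and HW for model variant V1..V7."""
    if not 1 <= variant <= 7:
        raise ValueError("variant must be between 1 and 7")
    n_hw = variant - 1                    # V1 has 0 HW blocks, V7 has all 6
    tx_split = len(TX_BLOCKS) - n_hw      # Tx moves blocks to HW from the end
    return {
        "tx_sw": TX_BLOCKS[:tx_split],
        "tx_hw": TX_BLOCKS[tx_split:],
        "rx_hw": RX_BLOCKS[:n_hw],        # Rx moves blocks to HW from the start
        "rx_sw": RX_BLOCKS[n_hw:],
    }


if __name__ == "__main__":
    for v in range(1, 8):
        p = partition(v)
        print(f"V{v}: Tx HW = {p['tx_hw']}, Rx HW = {p['rx_hw']}")
```

In both directions the HW side grows outward from the RF front end, so each variant has exactly one CPU-FPGA crossing per path.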
12 HW-SW Interfacing: Bus Details
- Advanced eXtensible Interface (AXI): bus to connect CPU & FPGA
- Direct Memory Access (DMA): holds data sent between CPU & FPGA
- First-In First-Out (FIFO): queue to buffer bits in transit
Note: the data to transfer between CPU & FPGA has a different size and class for each model variant!
(Diagram: Zynq-Based Heterogeneous Computing System with CPU, AXI DMA Controller, and FPGA; FIFO unpack, FIFO slice, FIFO concat, and FIFO pack blocks on the receive and transmit paths; RF Front End: ADI FMComms3 with AD9361, Tx, 2Rx, DAC: I 1,2, Q 1,2 and ADC: I 1,2, Q 1,2)
13 HW-SW Interfacing: Data Transfer Types & Sizes

| Variant | Data to Send | Data Type          | Size of 1 Element | # Elements |
|---------|--------------|--------------------|-------------------|------------|
| V1      | Samples      | Signed Fixed Point | 16 bits           | 80         |
| V2      | Samples      | Signed Fixed Point | 16 bits           | 64         |
| V3      | Symbols      | Signed Integer     | 1-8 bits          | 64         |
| V4      | Coded Bits   | Boolean            | 1 bit             | 48         |
| V5      | Coded Bits   | Boolean            | 1 bit             | 48         |
| V6      | Data Bits    | Boolean            | 1 bit             | 24         |
| V7      | Data Bits    | Boolean            | 1 bit             | 24         |

- Before sending data between CPU & FPGA, we translate it to a 32-bit unsigned integer format for transfer on the AXI interconnect
- We build a library of bundling blocks to facilitate this transfer
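The bundling step can be illustrated with a short sketch that packs Boolean bits and signed 16-bit samples into 32-bit unsigned words and unpacks them again. The bit layout used here (least significant bit first, low half-word first) is an assumption made for illustration. The actual bundling blocks are Simulink library blocks whose layout is not given in these slides, and all function names below are hypothetical.

```python
def pack_bits(bits):
    """Pack Booleans into 32-bit words, LSB first (assumed layout)."""
    words = []
    for start in range(0, len(bits), 32):
        word = 0
        for offset, bit in enumerate(bits[start:start + 32]):
            word |= (1 if bit else 0) << offset
        words.append(word)
    return words


def unpack_bits(words, count):
    """Recover the first `count` Booleans from packed 32-bit words."""
    return [bool((words[i // 32] >> (i % 32)) & 1) for i in range(count)]


def pack_int16_pairs(samples):
    """Pack signed 16-bit values two per word, low half first (assumed)."""
    words = []
    for i in range(0, len(samples), 2):
        low = samples[i] & 0xFFFF
        high = (samples[i + 1] & 0xFFFF) if i + 1 < len(samples) else 0
        words.append((high << 16) | low)
    return words


def unpack_int16_pairs(words, count):
    """Recover the first `count` signed 16-bit values from packed words."""
    values = []
    for word in words:
        for half in (word & 0xFFFF, (word >> 16) & 0xFFFF):
            # Undo two's complement for negative values
            values.append(half - 0x10000 if half & 0x8000 else half)
    return values[:count]


if __name__ == "__main__":
    data_bits = [True, False] * 12            # 24 data bits (V6/V7 row)
    print(len(pack_bits(data_bits)))          # words needed for 24 bits
    samples = list(range(-40, 40))            # 80 placeholder samples (V1 row)
    print(len(pack_int16_pairs(samples)))     # words needed for 80 samples
```

Under this layout, the 24-bit and 48-bit Boolean transfers fit in one and two words respectively, while 80 samples of 16 bits need 40 words. This is one way to see why the transfer cost differs between model variants.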
14 Results: CPU Execution Time: Transmitter on Zynq
- Moving one processing block from SW to HW does not necessarily cause a speedup
- The increase in Tx frame time on the ZC706 from V1 to V2 demonstrates this
- V1 is SW-only: it keeps all operations in SW and requires no AXI communication
- V2 adds a small component to HW: time saved < time spent on CPU-FPGA data transfer
- Our modeling environment can identify the location at which the HW-SW interface is best placed
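The trade-off on this slide reduces to a simple comparison, sketched below. The function names are hypothetical, and the numbers in the usage section are made-up placeholders in arbitrary units, not measurements from the ZC706 or ZedBoard.

```python
def offload_pays_off(cpu_time_saved, transfer_time):
    """True when moving a block to the FPGA shortens the CPU frame time.

    cpu_time_saved: CPU time no longer spent computing the block in SW.
    transfer_time:  CPU time added for CPU-FPGA data transfer.
    """
    return cpu_time_saved > transfer_time


def best_variant(frame_times):
    """Return the variant name with the smallest CPU frame time."""
    return min(frame_times, key=frame_times.get)


if __name__ == "__main__":
    # Placeholder values only (arbitrary units), chosen to mimic the
    # V1 -> V2 situation where the transfer costs more than it saves.
    print(offload_pays_off(cpu_time_saved=0.1, transfer_time=0.3))
    placeholder_times = {"V1": 1.0, "V2": 1.2, "V3": 0.7}
    print(best_variant(placeholder_times))
```

With profiled frame times for every variant, the same comparison can be applied across V1 to V7 to pick the divide point.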
15 Results: CPU Execution Time: Receiver on ZC706
- Rx maximum CPU frame time decreases as more blocks are moved onto the FPGA
- Preamble detection is revealed to be the biggest bottleneck in the Rx model; moving it to HW in V2 results in the largest drop in frame time
- Frame time also drops when the FFT moves in V3 and the Viterbi decoder moves in V6
- Moving the descrambler in V7 shows no decrease, suggesting we can keep it in SW
16 Results: FPGA Resource Utilization and Power Usage
(Figure panels: Transmitter Resource Utilization, Receiver Resource Utilization, and Power, per processing block (PB) for Tx and Rx)
17 Variants of Processing Blocks: Preamble Detection
(Table: matched filter (MF) variants Default, HDL Long, and HDL Training, compared on Data Path Delay (ns), % LUTs, % Registers, % DSPs, and Total Power (W))
- Block uses a matched filter to correlate 2 frames with a fixed set of coefficients
- 1st MF: manually assembled from adders & multipliers; not ideal, as it uses 99% of DSPs
- 2nd MF: correlates with the full long preamble, but the long preamble is composed of repetitions of the training sequence
- 3rd MF: correlates with only the training sequence
- Result: 2.38X reduction in path delay, 1.12X reduction in power
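The matched-filter idea behind this block can be sketched as a sliding correlation against a fixed reference sequence. This is a minimal software illustration and not the HDL design. The function names, the threshold, and the reference sequence in the usage section are hypothetical placeholders (the real coefficients come from the 802.11a preamble).

```python
def matched_filter(samples, reference):
    """Return correlation magnitudes of `samples` against `reference`."""
    taps = [c.conjugate() for c in reference]   # fixed filter coefficients
    n = len(taps)
    magnitudes = []
    for start in range(len(samples) - n + 1):
        acc = sum(samples[start + k] * taps[k] for k in range(n))
        magnitudes.append(abs(acc))
    return magnitudes


def detect_preamble(samples, reference, threshold):
    """Return the offset of the strongest correlation peak, or None."""
    magnitudes = matched_filter(samples, reference)
    if not magnitudes:
        return None
    peak = max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return peak if magnitudes[peak] >= threshold else None


if __name__ == "__main__":
    # Placeholder reference sequence, NOT the 802.11a training sequence.
    ref = [1 + 1j, -1 - 1j, 1 - 1j, -1 + 1j, 1 + 1j, 1 + 1j, -1 - 1j, 1 - 1j]
    rx = [0j] * 5 + ref + [0j] * 3            # reference embedded at offset 5
    print(detect_preamble(rx, ref, threshold=12.0))
```

Each coefficient costs one complex multiply per output sample, so correlating against a shorter reference (as in the third variant) directly reduces the multiplier count.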
18 Variants of Processing Blocks: Viterbi Decoder
(Table: Viterbi decoder (VD) variants Delay-Based and BRAM-Based, compared on Data Path Delay (ns), % LUTs, % Registers, BRAM Tiles (0 and 2), Total Power (W) (2.36), and VD Power (W))
- Block reverses the effects of the convolutional encoder
- Requires memory to hold intermediate state values
- 1st VD: uses delay blocks to hold state memory; exhibits lower path delay
- 2nd VD: uses BRAM tiles to hold state memory; uses fewer LUTs and registers, with slightly lower power
- Illustrates the tradeoff between time and power
- Can dynamically tune the design to target either objective
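To make the "state memory" requirement concrete, here is a minimal hard-decision Viterbi decoder paired with its convolutional encoder. It assumes the rate-1/2, constraint-length-7 code with generators 133 and 171 (octal) from IEEE 802.11a. It keeps full survivor paths in Python lists, which is a simplification of the traceback memory that the delay-based and BRAM-based variants implement in hardware. All names are illustrative.

```python
K = 7                         # constraint length
GENERATORS = (0o133, 0o171)   # rate-1/2 generator polynomials (802.11a)
N_STATES = 1 << (K - 1)       # 64 encoder states


def parity(x):
    return bin(x).count("1") & 1


def encode(bits):
    """Convolutionally encode bits; returns 2 coded bits per input bit."""
    state, coded = 0, []
    for b in bits:
        reg = (b << (K - 1)) | state          # newest bit in the MSB
        coded.extend(parity(reg & g) for g in GENERATORS)
        state = reg >> 1                      # drop the oldest bit
    return coded


def viterbi_decode(coded, n_bits):
    """Hard-decision Viterbi decoding, starting from the all-zero state."""
    inf = float("inf")
    metrics = [0] + [inf] * (N_STATES - 1)    # path metric per state
    paths = [[] for _ in range(N_STATES)]     # survivor path per state
    for t in range(n_bits):
        received = coded[2 * t:2 * t + 2]
        new_metrics = [inf] * N_STATES
        new_paths = [None] * N_STATES
        for state in range(N_STATES):
            if metrics[state] == inf:
                continue                      # state not reachable yet
            for b in (0, 1):
                reg = (b << (K - 1)) | state
                expected = [parity(reg & g) for g in GENERATORS]
                # Branch metric: Hamming distance to the received pair
                cost = metrics[state] + sum(
                    e != r for e, r in zip(expected, received))
                nxt = reg >> 1
                if cost < new_metrics[nxt]:   # keep the better survivor
                    new_metrics[nxt] = cost
                    new_paths[nxt] = paths[state] + [b]
        metrics, paths = new_metrics, new_paths
    best = min(range(N_STATES), key=metrics.__getitem__)
    return paths[best]


if __name__ == "__main__":
    message = [1, 0, 1, 1, 0, 0, 1, 0] * 3 + [0] * 6   # 24 bits + 6 tail zeros
    noisy = encode(message)
    noisy[5] ^= 1                                      # inject two bit errors
    noisy[30] ^= 1
    print(viterbi_decode(noisy, len(message)) == message)
```

The per-state metrics and survivor paths are the intermediate values that must be stored every step. Where that storage lives (registers and delays versus BRAM) is the design choice the two variants explore.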
19 Reusability & Adaptability to Modern Wireless Standards
(Table columns: Processing Block, 802.11a, Wi-Fi (802.11g), Mobile (LTE))
1. Scrambling (1)
2. Convolutional Coding (1)
3. PSK Modulation (B) (DB) (Q)
4. Block Interleaving (1)
5. OFDM (1) (DL, )
6. Preamble Insert/Detect (1) (2)
Legend:
- (1): Equivalent, reusable
- (2): Not yet implemented, but a variant can be reused
20 Variants of Processing Blocks: OFDM IFFT
(Table: IFFT Size versus Data Path Delay (ns), % LUTs, % Registers, % DSPs, and Total Power (W))
- In LTE, OFDM modulation uses different IFFT sizes to spread symbols onto a larger number of subcarriers
- We vary the IFFT size to identify its impact on FPGA metrics
- Delay, resources, and power rise for higher IFFT sizes
- Limiting factor: #LUTs for multiple IFFTs on the FPGA
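As a software reference for what the IFFT block computes, the sketch below performs OFDM modulation with a parameterizable power-of-two IFFT size and a cyclic prefix. It is a plain radix-2 implementation for illustration, not the HDL IFFT. The names `ofdm_modulate` and `cp_len` are hypothetical, and the 64-point size with a 16-sample prefix in the usage section follows 802.11a.

```python
import cmath


def _ifft_unscaled(x):
    """Recursive radix-2 inverse FFT without the 1/N scaling."""
    n = len(x)
    if n == 1:
        return list(x)
    even = _ifft_unscaled(x[0::2])
    odd = _ifft_unscaled(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ofdm_modulate(symbols, cp_len):
    """Map one symbol per subcarrier to time samples and prepend a prefix."""
    n = len(symbols)
    if n == 0 or n & (n - 1):
        raise ValueError("IFFT size must be a power of two")
    if not 0 <= cp_len <= n:
        raise ValueError("cyclic prefix length must be between 0 and N")
    time = [v / n for v in _ifft_unscaled(symbols)]
    return time[n - cp_len:] + time           # cyclic prefix + OFDM symbol


if __name__ == "__main__":
    for size in (64, 128, 256):               # example sizes to compare
        symbols = [0j] * size
        symbols[1] = 1 + 0j                   # single active subcarrier
        cp = size // 4
        print(size, len(ofdm_modulate(symbols, cp_len=cp)))
```

A radix-2 FFT of size N has log2(N) butterfly stages, so each doubling of the IFFT size adds a stage. This is consistent with delay, resources, and power growing with size.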
21 Conclusions
- Introduces a method for modeling HW-SW co-designs for wireless transceivers
- Enables profiling of all processing blocks
- Identifies bottlenecks such as preamble detection
- Explores various HW-SW divide points
- Identifies which model variants are most desirable
- Details the interfacing needed at the divide point
- Shows when variants use more power from data transfer
- Shows added FPGA power is a fraction of CPU power
- Improves preamble detection by using fewer MF coefficients
- Customizes the Viterbi decoder to use different resources
22 Future Work
- Perform live tests with online radio transmissions
  - Measure link latency and error rates
- Develop rules to automate HW-SW co-designs
  - Make decisions about the HW-SW divide point
  - Automate bundling for data transfer between HW & SW
- Switch out platform to test newest HW
  - Altera Arria 10®
  - Xilinx Ultrascale+ MPSoC
- Explore co-existence with modern protocols ( & LTE)
  - OFDM IFFT study is a first look at this
23 Publications & Acknowledgments
Extended Abstracts & Posters:
- BARC 2016, Boston, MA, January 29,
- IEEE INFOCOM 2016, San Francisco, CA, April 11-14,
Submitted, Pending:
- IEEE Transactions on Emerging Topics in Computing, Special Issue on Next Generation Wireless Computing Systems. Submitted Mar 1,
- IEEE Field Programmable Logic & Applications (FPL). Submitted Mar 27,
Plans:
- ACM Wireless Network Testbeds, Experimental evaluation, and Characterization (WiNTECH) 2016, October 3,
Acknowledgments:
24 References
[1] Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) specifications: High-speed Physical Layer in the 5 GHz Band, IEEE Std a-1999,
[2] J. Pendlum, M. Leeser, and K. Chowdhury, “Reducing processing latency with a heterogeneous FPGA-processor framework,” in 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, FCCM 2014, Boston, MA, USA, May 11-13, IEEE Computer Society, 2014, pp. 17–20. [Online]. Available:
[3] B. Drozdenko, R. Subramanian, K. Chowdhury, and M. Leeser, Cognitive Radio Oriented Wireless Networks: 10th International Conference, CROWNCOM 2015, Doha, Qatar, April 21-23, 2015, Revised Selected Papers. Cham: Springer International Publishing, 2015, ch. Implementing a MATLAB-Based Self-configurable Software Defined Radio Transceiver, pp. 164–175. [Online]. Available: http: //dx.doi.org/ /
[4] National Instruments, Inc. (2016) Real-time LTE/Wi-Fi coexistence testbed. [Online]. Available:
[5] MathWorks, Inc. (2016) Zynq SDR support from Communications System Toolbox. [Online]. Available:
[6] Xilinx, Inc. (2016) Vivado Design Suite - HLx editions. [Online]. Available:
[7] Analog Devices, Inc. (2015) Integrated transceivers, transmitters, and receivers. [Online]. Available: transceivers-transmitters-receivers.html