Asynchronous and Synchronous Design Techniques for Communication Systems Applications Faculty of Electronic Engineering, Nis, Serbia Miloš Krstić, Dr.-Ing.

Presentation transcript:

Overview: Motivation; Synchronous design solutions; GALS state of the art; Introduction to the request-driven GALS technique; Asynchronous wrapper for request-driven GALS blocks; GALSification of the baseband processor for the IEEE 802.11a standard; Testing our GALS baseband design; Design flow and implementation; Experimental results; Conclusions 2

Motivation – Key Design Issues for Wireless Systems A system integration framework for the complex digital blocks is needed in order to avoid clock-skew and timing-closure problems. Lowering of the EMI has great importance in the mixed-signal environment. Minimization of the power consumption is a key issue in mobile systems. We are aiming to achieve high data throughput with low latency. 3

Challenges with Synchronous Design Most digital systems today operate synchronously. However, the complexity of wireless communication systems grows enormously. 4

Synchronous solutions There are synchronous solutions for the integration, power and EMI problems. System integration: use of deskewing circuits, hybrid networks, DLLs, PLLs… Reduction of power consumption: clock gating, voltage scaling… Reduction of EMI: clock modulation, clock jittering… 5

System Integration – Synchronous Solutions Distributing low-jitter clocks in the presence of power-supply noise is increasingly challenging. Power consumption and complexity are very high. Justified only for high-performance ASICs. Figure: clock distribution network with deskewing circuit (Geannopoulos and Dai 1998). 6

Power Reduction – Synchronous Solutions $P_{dyn} = A \cdot C_{eff} \cdot V_{dd}^{2} \cdot f$. Some power-saving techniques are based on activity reduction; an example is clock gating. Others try to reduce the supply voltage and/or the frequency; examples are voltage scaling and dynamic voltage scaling. 7
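
The dynamic power relation on this slide can be tried out numerically. The sketch below is only an illustration: the function name and all numeric values are assumptions chosen for the example, not figures from the presentation. It shows why activity reduction scales power linearly while supply-voltage reduction scales it quadratically.

```python
def dynamic_power(activity: float, c_eff: float, v_dd: float, freq: float) -> float:
    """Dynamic power P_dyn = A * C_eff * V_dd^2 * f (in watts for SI inputs)."""
    return activity * c_eff * v_dd ** 2 * freq


# Illustrative (assumed) operating point: 1 nF effective capacitance,
# 2.5 V supply, 80 MHz clock, switching activity 0.2.
baseline = dynamic_power(0.2, 1e-9, 2.5, 80e6)

# Activity reduction (e.g. by clock gating): halving A halves the power.
gated = dynamic_power(0.1, 1e-9, 2.5, 80e6)

# Voltage scaling: running at 80% of V_dd leaves 0.8^2 = 64% of the power.
scaled = dynamic_power(0.2, 1e-9, 0.8 * 2.5, 80e6)

print(f"baseline: {baseline * 1e3:.1f} mW")
print(f"gated / baseline:  {gated / baseline:.2f}")
print(f"scaled / baseline: {scaled / baseline:.2f}")
```

The quadratic dependence on the supply voltage is the reason voltage scaling is listed as a separate and very effective technique.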

Power Reduction – Clock Gating Figure labels: FUB clock, FUB clock enable. Pros: significant power reduction. Cons: increases gate count and needs additional control logic; not effective when used for less than several clock periods; clock-tree design becomes even harder. 8

Power Reduction – Voltage Scaling Figure labels: slow, fast, high supply voltage, low supply voltage. Pros: saves power very effectively. Cons: requires an additional power delivery network; the interfaces between power domains need special care. 9

EMI Reduction One powerful method to reduce EMI is the spread-spectrum technique, which modulates the signal and spreads the energy over a wider frequency range. Another possibility is the introduction of clock jitter. Finally, asynchronous circuit design reduces EMI very effectively. 10

GALS as a design technique We mentioned several methods and tools for managing each design challenge separately. Almost no technique addresses all of these issues at the same time. However, GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems. 11

What is GALS? GALS is an abbreviation for Globally-Asynchronous Locally-Synchronous systems. Figure labels: Req, Ack, Data. 12

GALS as a Powerful Design Technique In wireless communication systems, GALS can address the main design challenges. GALS makes data transfer between the blocks very easy. Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks. Decoupling the local blocks from a central clock source reduces spectral noise considerably. Power saving is automatically integrated in the asynchronous wrapper. 13

Power reduction with GALS The clock signal is the dominant source of power consumption. First estimations showed that about 30% power savings could be expected in the clock net due to the application of GALS. Recently, some more pessimistic power estimation figures were presented. GALS techniques offer independent setting of frequency and voltage levels for each locally synchronous module. When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached. Figure: power distribution in a high-performance CPU. 14

Potential for reducing EMI with GALS We have simulated the noise generated on the power supply line in the synchronous and in the request-driven GALS system. GALS introduces a reduction of about 20 dB. Figure: two noise spectra, level in dB versus frequency (0.5 to 4.5 GHz). 15

Classical GALS approach Published in Jens Muttersbach et al., Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, In Proc. of ASIC/SOC Conference, pp. 317-321, Sept. 1999. Figure: two locally synchronous modules, each with its own asynchronous wrapper and local clock generator (stretch signals stretch1 and stretch2); the output port of module 1 is connected to the input port of module 2 by data and handshake lines. 16

Pausable Clock Generator 17

Main challenges of the typical GALS methods In many solutions, data transfer and throughput are critical problems. Most of them can perform a data transfer only every second cycle of the local clock. Some described circuits can theoretically transfer data every clock cycle. However, intensive stretching of the pausable clock generator will significantly diminish the practical performance. The latency of the transferred data is not known in advance and may vary significantly from one data transfer to the next. It is not very practical to use ring oscillators for local clock generation. All solutions are oriented towards very general applications. They are not optimised for specific systems and environmental demands. 18

Basic concept of the request-driven operation This approach covers point-to-point communication with very intensive but bursty data transfer. When receiving input burst, GALS block can operate in a request-driven mode. When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out. Then a local clock generator drives the GALS blocks. A Time-out function controls the transition from request driven operation to local clock generation mode. 19
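
The mode switching described on this slide can be sketched as a small behavioural model. This is a simplified illustration, not the wrapper circuit itself: the class name, the discrete "tick" time base, and the parameters `timeout_ticks` and `pipeline_depth` are assumptions made for the example, and the transitional mode of the real controller is not modelled.

```python
from enum import Enum


class Mode(Enum):
    IDLE = "idle"
    REQUEST_DRIVEN = "request-driven"
    LOCAL_CLOCK = "local clock generation"


class WrapperModel:
    """Toy model: requests clock the block; a time-out starts the local clock."""

    def __init__(self, timeout_ticks: int, pipeline_depth: int) -> None:
        self.timeout_ticks = timeout_ticks    # assumed time-out length
        self.pipeline_depth = pipeline_depth  # assumed pipeline capacity
        self.mode = Mode.IDLE
        self.quiet_ticks = 0   # ticks since the last input request
        self.in_flight = 0     # data items still inside the pipeline
        self.clock_pulses = 0  # clock pulses delivered to the block

    def tick(self, request: bool) -> Mode:
        if request:
            # Each input request directly produces one clock pulse.
            self.mode = Mode.REQUEST_DRIVEN
            self.quiet_ticks = 0
            self.in_flight = min(self.in_flight + 1, self.pipeline_depth)
            self.clock_pulses += 1
        elif self.in_flight == 0:
            # Nothing to flush: no clock pulses are generated at all.
            self.mode = Mode.IDLE
        else:
            self.quiet_ticks += 1
            if self.quiet_ticks >= self.timeout_ticks:
                # Time-out expired: the local clock flushes one item per tick.
                self.mode = Mode.LOCAL_CLOCK
                self.in_flight -= 1
                self.clock_pulses += 1
        return self.mode


wrapper = WrapperModel(timeout_ticks=2, pipeline_depth=3)
trace = [wrapper.tick(r) for r in (True, True, False, False, False, False)]
print([m.value for m in trace], wrapper.clock_pulses)
```

In this model the block receives clock pulses only while data arrives or is being flushed, which mirrors the clock-gating-like power saving attributed to the technique.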

Request-driven asynchronous wrapper Local clock can be generated either internally or externally. 20

What can we gain from this GALS technique? Reliable and fast transfer of large bursts of data is achieved. Data transfer is possible at every clock cycle of the synchronous block. In request-driven operation there is no arbitration at the input port; consequently, the circuit responds immediately to input requests. The clock speed is determined by the master and not by the slower participant in the communication. The local clock can be generated internally or externally. The proposed architecture offers an efficient power-saving mechanism, similar to clock gating. EMI should be reduced due to varying delays and frequencies in different asynchronous wrappers. 21

Building the wrapper components - input port The input port has to control the dataflow according to a ‘broad’ 4-phase handshake protocol. The input port consists of a speed-independent (SI) input controller along with a few additional gates that provide glitch-free transitions of the input signals. 22
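
As an illustration of the 4-phase handshake mentioned on this slide, the following sketch steps through the four events of one transfer. It is a software walk-through under stated assumptions, not a model of the input port: the function name and trace format are hypothetical, and the `data_valid` flag assumes the usual reading of the "broad" data-validity scheme, in which data is held valid from the rising request until the falling acknowledge.

```python
def four_phase_transfer(word: int):
    """Walk through one 4-phase (return-to-zero) handshake for a single word.

    Returns the word latched by the receiver and a trace of
    (event, req, ack, data_valid) tuples.
    """
    trace = []
    req, ack = 0, 0

    # Phase 1: sender places data on the bus and raises the request.
    req = 1
    trace.append(("req+", req, ack, True))

    # Phase 2: receiver latches the data and raises the acknowledge.
    latched = word
    ack = 1
    trace.append(("ack+", req, ack, True))

    # Phase 3: sender withdraws the request.
    req = 0
    trace.append(("req-", req, ack, True))

    # Phase 4: receiver withdraws the acknowledge; the channel is idle again
    # and (under the assumed 'broad' scheme) the data may now change.
    ack = 0
    trace.append(("ack-", req, ack, False))

    return latched, trace


latched, trace = four_phase_transfer(0xA5)
for event in trace:
    print(event)
```

Both control wires return to zero at the end of every transfer, which is what distinguishes a 4-phase protocol from a 2-phase (transition-signalling) one.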

Input controller specification The input controller is modeled as an AFSM (asynchronous finite state machine). The controller is specified according to burst-mode requirements. The burst-mode AFSM is implemented as a ‘Huffman machine’ without explicit latches. Figure: state graph of the input controller with idle mode, request-driven mode, local clock generation mode and transitional mode. Figure: Huffman machine consisting of a hazard-free combinational network with inputs A, B, C, outputs X, Y, Z, and a fed-back state (several bits). 23

Input controller implementation The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment:
REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
The logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool. Formal analysis of the asynchronous wrapper is performed. 24
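
To show how such sum-of-products equations are read, the sketch below evaluates two of them in software. It assumes the usual notation (juxtaposition is AND, `+` is OR, a trailing `'` is NOT) and treats a signal that appears on both sides as feedback, so the right-hand side uses its current value. The function names are hypothetical, and the code only evaluates the Boolean expressions; it says nothing about timing or hazards.

```python
def next_acken(acki1: bool, req_a1: bool, st: bool, acken: bool) -> bool:
    """ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN (ACKEN on the right is fed back)."""
    return acki1 or (req_a1 and acken) or (st and acken)


def req_i1(req_a1: bool, st: bool, acki1: bool, acken: bool) -> bool:
    """REQ_I1 = REQ_A1 ST ACKI1' ACKEN' (purely combinational)."""
    return req_a1 and st and (not acki1) and (not acken)


# ACKEN is set by ACKI1 ...
acken = next_acken(acki1=True, req_a1=True, st=True, acken=False)
# ... holds its value while REQ_A1 or ST is still high ...
held = next_acken(acki1=False, req_a1=True, st=False, acken=acken)
# ... and is released once ACKI1, REQ_A1 and ST are all low.
released = next_acken(acki1=False, req_a1=False, st=False, acken=held)
print(acken, held, released)
```

The feedback terms are what make the combinational network behave as a state machine without explicit latches.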

Externally-driven GALS Wrapper

Clock Management Unit

Baseband processor for WLAN The goal of our project is to develop a single-chip wireless broadband communication system in the 5 GHz band. The modem is compliant with the IEEE 802.11a WLAN standard. The system uses Orthogonal Frequency Division Multiplexing (OFDM) with data rates ranging from 6 to 54 Mbit/s. The synchronous baseband processor was implemented as an ASIC (700k gates). 27

Structure of the synchronous baseband processor The baseband processor includes receiver and transmitter datapath structures. Very complex blocks are implemented, such as the Viterbi decoder, FFT, IFFT, CORDIC processors, ... Figure (block diagram). Transmitter: input buffer, scrambler, signal field generator, encoder, interleaver, mapper, pilot insertion, pilot scrambler, IFFT, guard interval insertion, preamble insertion. Receiver: synchronizer datapath, channel estimator, demapper, deinterleaver, Viterbi decoder, descrambler, parallel converter, FFT, tracking, buffer 20 - 80, buffer 80 - 20. Legend: 80 Msps block, 20 Msps block. 28

Design challenges in the baseband processor The design of the baseband processor involves challenges such as: several clock domains, global clock tree generation, a large number of clock leaves (36 k flip-flops), clock skew handling, timing closure between the different modules, clock gating, power consumption, and EMI. Our request-driven GALS architecture was developed as a possible solution for those problems. 29

GALS partitioning The partitioning process has to take into account possible power saving. Figure (block diagram): the baseband processor is partitioned into GALS blocks Tx_1, Tx_2, Tx_3 on the transmitter side and Rx_1, Rx_2, Rx_3, Rx_TRA on the receiver side, with Tx_int and Rx_int labelled as async-sync interfaces. Besides the functional blocks of the synchronous design, the diagram shows an activation interface, token rate adaptation and a FIFO. Legend: 80 Msps block, 20 Msps block, rate adaption block, interface block. 30

Test strategy We are using a hardware tester that is strictly cycle-based and cannot react to asynchronous output signals of the circuit. The GALS arbitration processes preclude cycle-level determinism. We want to be able to run very complex functional tests internally. The applied test technique should support system diagnosis. A test strategy based on Built-In Self-Test (BIST) is proposed. BIST reduces the effort for generating a test program and enables us to use a synchronous tester. 31

Design for Testability in GALS TPG and TDE are based on a linear feedback shift register structure with additional embedded logic. A central BIST controller controls the test procedure. We can run hierarchical tests. This BIST technique can be used as a method for prototype verification. In combination with the scan approach, BIST can even be used as a basis for the manufacturing test. 32
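
The linear feedback shift register (LFSR) structure mentioned on this slide can be illustrated with a minimal software model. The sketch below is a generic 4-bit Fibonacci LFSR, not the TPG used in the design: the register width, the tap positions (feedback from the two lowest bits) and the function name are assumptions chosen so that the example is small and has the maximal period of 15.

```python
def lfsr_patterns(seed: int, count: int) -> list[int]:
    """Generate `count` pseudo-random 4-bit patterns from a Fibonacci LFSR.

    The register shifts right; the new top bit is the XOR of the two
    lowest bits of the previous state.
    """
    state = seed & 0xF
    if state == 0:
        raise ValueError("the all-zero state never changes; choose a non-zero seed")
    patterns = []
    for _ in range(count):
        patterns.append(state)
        feedback = (state ^ (state >> 1)) & 1    # XOR of bit 0 and bit 1
        state = (state >> 1) | (feedback << 3)   # shift right, insert at bit 3
    return patterns


patterns = lfsr_patterns(seed=0b0001, count=16)
print([f"{p:04b}" for p in patterns])
print("distinct states before repeating:", len(set(patterns[:15])))
```

A few flip-flops and XOR gates thus produce a long, repeatable stimulus sequence on chip, which is why LFSRs are a common basis for BIST pattern generation and response compaction.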

Design flow We have used our in-house 0.25 µm CMOS process. The asynchronous wrapper is equivalent to about 1.3 k inverter gates; the tunable clock generation alone accounts for 0.9 k gates. The asynchronous wrapper has a throughput of up to 150 Msps in request-driven mode and 100 Msps in local mode. This application needs 80 Msps. Figure (design flow). Asynchronous wrappers: AFSM specification, 3D (logic synthesis), 3DC tool (translation from 3D to structural VHDL). Synchronous blocks: functional specification, VHDL description. Common steps: LoLA (formal analysis), ModelSim (abstract behavioural simulation), Synopsys DC (gate mapping), ModelSim (realistic behavioural simulation), Synopsys DC (timing-driven synthesis), ModelSim (post-synthesis simulation), Prime Power (power estimation), Cadence Silicon Encounter (layout), ModelSim (back annotation), Prime Power (power estimation), tape-out. 33

Area and power distribution Area and power statistics are based on the synthesized netlist data. Locally synchronous blocks occupy around 90% of the total area. The BIST circuitry requires around 3.5%, interface blocks 2.9%, and asynchronous wrappers 2%. Based on the switching activities in a realistic transceiver scenario, power estimation has been performed with the Prime Power tool. Synchronous datapath logic uses most of the power (around 52.4%), local synchronous clock trees use 34.5%, async-to-sync interfaces 7%, and asynchronous wrappers 2.9%. After layout, the estimated power consumption is 324.6 mW. 34
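
The percentages on this slide can be cross-checked with a few lines of arithmetic. The sketch below sums the quoted shares and converts the power shares into approximate milliwatt figures. Note the assumption: the shares come from the netlist-based estimate, while 324.6 mW is the post-layout estimate, so applying one to the other is only a rough illustration. All names are hypothetical.

```python
# Shares quoted on the slide, in percent.
AREA_SHARES = {
    "locally synchronous blocks": 90.0,
    "BIST circuitry": 3.5,
    "interface blocks": 2.9,
    "asynchronous wrappers": 2.0,
}
POWER_SHARES = {
    "synchronous datapath logic": 52.4,
    "local clock trees": 34.5,
    "async-to-sync interfaces": 7.0,
    "asynchronous wrappers": 2.9,
}
TOTAL_POWER_MW = 324.6  # post-layout estimate from the slide


def unattributed(shares: dict[str, float]) -> float:
    """Percentage not covered by the listed categories."""
    return 100.0 - sum(shares.values())


def share_to_mw(share_percent: float, total_mw: float = TOTAL_POWER_MW) -> float:
    """Convert a percentage share into milliwatts (assumption: shares apply to total_mw)."""
    return total_mw * share_percent / 100.0


for name, share in POWER_SHARES.items():
    print(f"{name}: {share_to_mw(share):.1f} mW")
print(f"power not itemised: {unattributed(POWER_SHARES):.1f} %")
print(f"area not itemised:  {unattributed(AREA_SHARES):.1f} %")
```

The listed categories do not add up to 100%, so a small remainder of both area and power is not itemised on the slide.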

Implementation results Our GALS baseband processor has been fabricated and tested. The total number of pins is 120 and the silicon area including pads is 45.1 mm². The measured dynamic power dissipation of the purely synchronous baseband processor was 332 mW; for the GALS baseband processor it was slightly lower, at 328 mW. 35

Improving System Integration with GALS Synchronous baseband processor challenges: several clock domains, global clock tree generation, large number of clock leaves, clock skew handling, timing closure between blocks, clock gating. Solved by the GALS architecture: no global clock in GALS; clock leaves are distributed over the GALS blocks; clock skew is reduced from 660 ps to 486 ps; communication between the blocks takes place through handshaking; clock gating is embedded in the asynchronous wrapper. 36

EMI measurement (I) The supply voltage variation spectrum of the inner processor core is measured. Figure annotation: ~5 dB. 37

EMI measurement (II) Additionally, instantaneous cycle-to-cycle supply voltage peaks are reduced from 140 mV (synchronous design) to less than 100 mV (GALS). This reduction can be very important for mixed-signal designs and for secure systems. An application with fine-grained GALS partitioning can lead to results closer to the theoretical maximum reduction. 38
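
The measured figures on the result slides can be put into relative terms with simple arithmetic. The sketch below uses only numbers quoted in the presentation (power 332 mW versus 328 mW, clock skew 660 ps versus 486 ps, supply peaks 140 mV versus less than 100 mV). The decibel conversion assumes that the spectra are amplitude (voltage) quantities, so that 20 dB corresponds to a factor of 10; that reading is an assumption, as are the function names.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction when going from `before` to `after`."""
    return (before - after) / before


def db_to_amplitude_ratio(db: float) -> float:
    """Amplitude ratio for a level difference in dB (assumes 20*log10 convention)."""
    return 10 ** (db / 20)


power = relative_reduction(332, 328)   # mW, synchronous vs. GALS
skew = relative_reduction(660, 486)    # ps
peaks = relative_reduction(140, 100)   # mV; a lower bound, GALS is "less than 100 mV"

print(f"dynamic power reduction: {power:.1%}")
print(f"clock skew reduction:    {skew:.1%}")
print(f"supply peak reduction:   at least {peaks:.1%}")
print(f"~5 dB measured  -> amplitude factor {db_to_amplitude_ratio(5):.2f}")
print(f"20 dB simulated -> amplitude factor {db_to_amplitude_ratio(20):.0f}")
```

Seen this way, the power saving is marginal, while the improvements in clock skew and supply noise are the more substantial gains.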

Conclusions GALS can be used successfully as a design technique in wireless communication systems. The main goal of simplifying the system integration was achieved. Furthermore, we achieved a significant reduction of supply noise and a slightly lower dynamic power consumption. 39

Future activities in GALS area Automation of the design flow for GALS systems. Further activities in reducing EMI. Figure (topic areas): modelling and verification, GALS synthesis, GALS libraries, clock jitter, FPGA. Items listed in the figure: flexible building of models, system description, netlist, abstract verification, rules for partitioning, layout, high-level datapath model, circuit synthesis, layout scripts, gates, jitter generators, asynchronous basic components, EMI analysis, parameterized wrapper, FPGA synthesis, asynchronous wrapper, desynchronisation. Thank you very much for your attention. 40