Asynchronous and Synchronous Design Techniques for Communication Systems Applications Faculty of Electronic Engineering, Nis, Serbia Miloš Krstić, Dr.-Ing.
Overview Motivation Synchronous design solutions GALS - State of the Art Introduction to request-driven GALS technique Asynchronous wrapper for request-driven GALS blocks GALSification of the baseband processor for IEEE 802.11a standard Testing our GALS baseband design Design-flow and implementation Experimental results Conclusions 2
Motivation – Key Design Issues for Wireless Systems A system integration framework for the complex digital blocks is needed in order to avoid clock-skew and timing-closure problems. Lowering of the EMI has great importance in the mixed-signal environment. Minimization of the power consumption is a key issue in mobile systems. We are aiming to achieve high data throughput with low latency. 3
Challenges with Synchronous Design Most digital systems today operate synchronously. However, the complexity of wireless communication systems grows enormously. 4
Synchronous solutions There are synchronous solutions for the integration, power and EMI problems. System integration: use of deskewing circuits, hybrid networks, DLLs, PLLs… Reduction of the power consumption: clock gating, voltage scaling… Reduction of EMI: clock modulation, clock jittering… 5
System Integration – Synchronous Solutions Distributing low-jitter clocks in the presence of power-supply noise is increasingly challenging. Power consumption and complexity are very high. Justified only for high-performance ASICs. Figure: clock distribution network with deskewing circuit (Geannopoulos and Dai 1998). 6
Power Reduction – Synchronous Solutions $P_{dyn} = A \cdot C_{eff} \cdot V_{dd}^{2} \cdot f$ Some power-saving techniques are based on activity reduction; clock gating is an example. Others try to reduce the supply voltage and/or the frequency; examples are voltage scaling and dynamic voltage scaling. 7
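As a rough illustration of the dynamic power formula on this slide, the Python sketch below compares a baseline operating point with a reduced-activity case (the effect clock gating aims for) and a reduced-voltage case (voltage scaling). Here A is read as the switching activity factor, C_eff as the effective switched capacitance, V_dd as the supply voltage and f as the clock frequency. All numeric values are hypothetical and are not taken from the slides.

```python
def dynamic_power(activity, c_eff, v_dd, freq):
    """Dynamic power P_dyn = A * C_eff * V_dd^2 * f, in watts."""
    return activity * c_eff * v_dd ** 2 * freq

# Hypothetical operating points (illustrative values only).
base = dynamic_power(activity=0.2, c_eff=1e-9, v_dd=2.5, freq=80e6)
gated = dynamic_power(activity=0.1, c_eff=1e-9, v_dd=2.5, freq=80e6)
scaled = dynamic_power(activity=0.2, c_eff=1e-9, v_dd=2.0, freq=80e6)

print(f"baseline:        {base * 1e3:.1f} mW")
print(f"half activity:   {gated * 1e3:.1f} mW")
print(f"V_dd 2.5 -> 2.0: {scaled * 1e3:.1f} mW")
```

Halving the activity halves the dynamic power, while lowering the supply voltage pays off quadratically, which is why the two families of techniques are listed separately.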
Power Reduction – Clock Gating Figure labels: FUB clock, FUB clock enable. Pros: significant power reduction. Cons: increases gate count and needs additional control logic; not effective when used for less than several clock periods; clock-tree design becomes even harder. 8
Power Reduction – Voltage Scaling Figure labels: Slow, Fast, High Supply Voltage, Low Supply Voltage. Pros: saves power very effectively. Cons: needs an additional power delivery network; the interfaces between power domains need special care. 9
EMI Reduction One powerful method to reduce EMI is the spread-spectrum technique, which modulates the signal and spreads the energy over a wider frequency range. Another possibility is the introduction of clock jitter. Finally, asynchronous circuit design reduces EMI very effectively. 10
GALS as a design technique We mentioned several methods and tools for managing each design challenge separately. There is almost no technique that addresses all these issues at the same time. However, GALS techniques have the potential to solve some of the most challenging design issues of SoC integration of communication systems. 11
What is GALS? GALS is the abbreviation for Globally-Asynchronous Locally-Synchronous systems. Figure labels: Req, Ack, Data. 12
GALS as a Powerful Design Technique In wireless communication systems, GALS can address the main design challenges. GALS makes data transfer between the blocks very easy. Design problems such as timing closure or clock-tree generation are limited to the level of much smaller local blocks. Decoupling the local blocks from a central clock source reduces spectral noise considerably. Power saving is automatically integrated in the asynchronous wrapper. 13
Power reduction with GALS The clock signal is the dominant source of power consumption. First estimations showed that power savings of about 30% could be expected in the clock net due to the application of GALS. Recently, some more pessimistic power estimation figures were presented. GALS techniques offer independent setting of frequency and voltage levels for each locally synchronous module. When using dynamic voltage scaling (DVS), an average energy reduction of up to 30% can be reached. Figure: power distribution in a high-performance CPU. 14
Potential for reducing EMI with GALS We have simulated the noise generated on the power supply line in the synchronous and in the request-driven GALS system. GALS introduces a reduction of about 20 dB. Figure: two supply-noise spectra, level in dB versus frequency in GHz (0.5 to 4.5 GHz). 15
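To put the simulated 20 dB figure in linear terms, the short Python sketch below converts a decibel difference into a ratio. The slide does not say whether the spectrum is plotted as an amplitude or a power quantity, so both conventions are shown; which one applies here is an assumption left open.

```python
def db_to_amplitude_ratio(db):
    """Ratio of amplitudes (e.g. voltages) for a level difference in dB."""
    return 10 ** (db / 20)

def db_to_power_ratio(db):
    """Ratio of powers for a level difference in dB."""
    return 10 ** (db / 10)

reduction_db = 20.0  # simulated reduction quoted on the slide
print(f"amplitude ratio: {db_to_amplitude_ratio(reduction_db):.1f}x")
print(f"power ratio:     {db_to_power_ratio(reduction_db):.1f}x")
```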
Classical GALS approach Published in Jens Muttersbach et al., Globally-Asynchronous Locally-Synchronous Architectures to Simplify the Design of On-Chip Systems, in Proc. of ASIC/SOC Conference, pp. 317-321, Sept. 1999. Figure: two locally synchronous modules, each with its own asynchronous wrapper and local clock generator (stretch1, stretch2); an output port and an input port connect the modules through data and handshake signals. 16
Pausable Clock Generator 17
Main challenges of the typical GALS methods In many solutions, data transfer and throughput are critical problems. Most of them can perform a data transfer only every second cycle of the local clock. Some described circuits can theoretically transfer data every clock cycle; however, intensive stretching of the pausable clock generator will significantly diminish the practical performance. The latency of the transferred data is not known in advance and may vary significantly from one data transfer to the next. Ring oscillators are not very practical for local clock generation. All solutions are oriented towards very general applications; they are not optimised for specific systems and environmental demands. 18
Basic concept of the request-driven operation This approach covers point-to-point communication with very intensive but bursty data transfer. When receiving an input burst, a GALS block can operate in request-driven mode. When there is no input activity, the data stored inside the locally synchronous pipeline has to be flushed out; a local clock generator then drives the GALS block. A time-out function controls the transition from request-driven operation to local clock generation mode. 19
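The request-driven concept can be sketched as a small discrete-time model in Python. This is an illustrative abstraction, not the wrapper described in the slides: time advances in unit steps, each incoming request clocks the pipeline once, and after a hypothetical `timeout` of idle steps a local clock takes over until the pipeline is empty. The names `simulate`, `pipeline_depth`, `timeout` and `horizon` are assumptions made for this example.

```python
from collections import deque

def simulate(request_times, pipeline_depth=3, timeout=4, horizon=20):
    """Return (time, mode, output) for every clock event of one block."""
    requests = set(request_times)
    # Fixed-length pipeline; None marks an empty stage (a bubble).
    pipeline = deque([None] * pipeline_depth, maxlen=pipeline_depth)
    idle = 0
    events = []
    for t in range(horizon):
        if t in requests:
            # Request-driven mode: the request itself clocks the block.
            out = pipeline[-1]
            pipeline.appendleft(f"d{t}")
            idle = 0
            events.append((t, "request", out))
        else:
            idle += 1
            has_data = any(stage is not None for stage in pipeline)
            if idle >= timeout and has_data:
                # Time-out expired: the local clock flushes the pipeline.
                out = pipeline[-1]
                pipeline.appendleft(None)
                events.append((t, "local", out))
            # Otherwise the clock stays paused (no event).
    return events

events = simulate([0, 1, 2, 3, 4])
for event in events:
    print(event)
```

In this model the block produces no clock events at all once the pipeline is empty, which mirrors the clock-gating-like power saving the technique aims for.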
Request-driven asynchronous wrapper Local clock can be generated either internally or externally. 20
What can we gain from this GALS technique? Reliable and fast transfer of large bursts of data is achieved. Data transfer is possible at every clock cycle of the synchronous block. In request-driven mode there is no arbitration at the input port; consequently, the circuit responds immediately to input requests. The clock speed is determined by the master and not by the slower participant in the communication. The local clock can be generated internally or externally. The proposed architecture offers an efficient power-saving mechanism, similar to clock gating. EMI should be reduced due to varying delays and frequencies in different asynchronous wrappers. 21
Building the wrapper components - input port The input port has to control the dataflow according to a ‘broad’ 4-phase handshake protocol. The input port consists of a speed-independent (SI) input controller along with a few additional gates that provide glitch-free transitions of the input signals. 22
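The signal ordering of a 4-phase (return-to-zero) handshake can be sketched in Python as follows. The sketch assumes that ‘broad’ refers to the data being held valid from the rising request until the falling acknowledge, which is the usual meaning in the asynchronous-design literature; the slides do not spell this out. The function name `transfer_burst` and the sequential sender/receiver model are illustrative only.

```python
def transfer_burst(words):
    """Send words over a 4-phase channel; return (received, trace).

    trace records the (req, ack) pair after every signal transition.
    """
    req = ack = 0
    received, trace = [], []
    for word in words:
        data = word              # data is stable before REQ rises
        req = 1                  # phase 1: sender raises REQ
        trace.append((req, ack))
        received.append(data)    # receiver captures the data ...
        ack = 1                  # phase 2: ... and raises ACK
        trace.append((req, ack))
        req = 0                  # phase 3: sender lowers REQ
        trace.append((req, ack))
        ack = 0                  # phase 4: receiver lowers ACK;
        trace.append((req, ack)) # only now may the data change
    return received, trace

received, trace = transfer_burst([0x1A, 0x2B])
print(received)
print(trace)
```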
Input controller specification The input controller is modeled as an AFSM (asynchronous finite state machine). The controller is specified according to burst-mode requirements. The burst-mode AFSM is implemented as a ‘Huffman machine’ without explicit latches. Figure: state graph of the input controller, with the states idle mode, request-driven mode, local clock generation mode and transitional mode. Figure: Huffman machine, a hazard-free combinational network with inputs (A, B, C), outputs (X, Y, Z) and a fed-back state (several bits). 23
Input controller implementation The burst-mode input controller is synthesized using the 3D tool, which supports 2-level hazard-free logic minimization and achieves optimal state assignment. In the equations, ' denotes complement, juxtaposition denotes AND, and + denotes OR:
REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
ACK_A = ACKC' REQ_INT + REQ_A1 RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
RST = STOP + ACKC' REQ_INT + REQ_A1 RST + ST RST + ACKC' ST ACKI1' ACKEN Z0' + REQ_A1 ACKC' ST' ACKEN'
REQ_I1 = REQ_A1 ST ACKI1' ACKEN'
Z0 = ACKI1 + REQ_A1' ACKC + REQ_A1' ST' Z0 + ACKC' ACKEN Z0 + ACKC ACKEN' Z0
The logic equations are automatically converted into synthesizable structural VHDL code with our 3DC tool. Formal analysis of the asynchronous wrapper is performed. 24
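To show how a sum-of-products equation with feedback behaves like a state-holding element, the Python sketch below evaluates two of the equations from this slide, REQ_INT and ACKEN, as plain Boolean functions. It is only a functional illustration: gate delays and hazards are ignored, and the input sequences are hypothetical, since the slides do not describe the signal roles or the allowed input bursts.

```python
def next_req_int(req_a1, ackc, st, acken, req_int):
    # REQ_INT = REQ_A1 REQ_INT + ACKC' REQ_INT + REQ_A1 ACKC' ST' ACKEN'
    return bool((req_a1 and req_int)
                or (not ackc and req_int)
                or (req_a1 and not ackc and not st and not acken))

def next_acken(req_a1, st, acki1, acken):
    # ACKEN = ACKI1 + REQ_A1 ACKEN + ST ACKEN
    return bool(acki1 or (req_a1 and acken) or (st and acken))

# Hypothetical input sequence: (REQ_A1, ACKC) with ST = ACKEN = 0.
req_int = False
req_int_trace = []
for req_a1, ackc in [(1, 0), (1, 1), (0, 1)]:
    req_int = next_req_int(req_a1, ackc, st=0, acken=0, req_int=req_int)
    req_int_trace.append(req_int)

# Hypothetical input sequence: (REQ_A1, ACKI1) with ST = 0.
acken = False
acken_trace = []
for req_a1, acki1 in [(1, 1), (1, 0), (0, 0)]:
    acken = next_acken(req_a1, st=0, acki1=acki1, acken=acken)
    acken_trace.append(acken)

print(req_int_trace)
print(acken_trace)
```

In both equations the last product term (or the single-literal term) acts as the set condition, while the terms containing the output itself keep it asserted until all holding conditions are gone.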
Externally-driven GALS Wrapper
Clock Management Unit
Baseband processor for WLAN The goal of our project is to develop a single-chip wireless broadband communication system in the 5 GHz band. The modem is compliant with the IEEE 802.11a WLAN standard. The system uses Orthogonal Frequency Division Multiplexing (OFDM) with data rates ranging from 6 to 54 Mbit/s. The synchronous baseband processor was implemented as an ASIC (700k gates). 27
Structure of the synchronous baseband processor The baseband processor includes receiver and transmitter datapath structures. Very complex blocks are implemented, such as the Viterbi decoder, FFT, IFFT, CORDIC processors, ... Figure: block diagram with 80 Msps and 20 Msps blocks. Transmitter: input buffer, scrambler, signal field generator, encoder, interleaver, mapper, pilot insertion, pilot scrambler, IFFT, guard interval insertion, preamble insertion. Receiver: synchronizer datapath, synchronizer tracking, FFT, channel estimator, demapper, deinterleaver, Viterbi decoder, descrambler, parallel converter. Buffers: Buffer 20-80, Buffer 80-20. 28
Design challenges in the baseband processor The design of the baseband processor involves challenges such as: several clock domains, global clock tree generation, a large number of clock leaves (36 k flip-flops), clock skew handling, timing closure between the different modules, clock gating, power consumption, and EMI. Our request-driven GALS architecture was developed as a possible solution to those problems. 29
GALS partitioning The partitioning process has to take into account possible power saving. Figure: the baseband processor partitioned into the GALS blocks Tx_1, Tx_2, Tx_3, Rx_1, Rx_2, Rx_3 and Rx_TRA, with the async-sync interface blocks Tx_int and Rx_int. Functional units shown: input buffer, scrambler, signal field generator, encoder, interleaver, mapper, pilot insertion, pilot scrambler, IFFT, guard interval insertion, preamble insertion, synchronizer datapath, synchronizer tracking, FFT, channel estimator, demapper, deinterleaver, Viterbi decoder, descrambler, parallel converter, token rate adaptation, FIFO TA, activation interface, Buffer 20-80, Buffer 80-20. Legend: 80 Msps block, 20 Msps block, rate adaption block, interface block. 30
Test strategy We are using a hardware tester which is strictly cycle based and cannot react to asynchronous output signals of the circuit. The GALS arbitration processes preclude cycle-level determinism. We want to be able to run very complex functional tests internally. The applied test technique should support system diagnosis. A test strategy based on Built-In Self-Test (BIST) is proposed. BIST reduces the effort for generating a test program and enables us to use a synchronous tester. 31
Design for Testability in GALS TPG and TDE are based on the linear feedback shift register structure with embedded additional logic. A central BIST controller controls the test procedure. We can run hierarchical tests. This BIST technique can be used as a method for prototype verification. In combination with the scan approach, BIST can even be used as a basis for the manufacturing test. 32
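The linear feedback shift register idea behind the TPG and TDE can be sketched in Python as below. This is a generic textbook-style model, not the circuitry of the presented design: a 4-bit Fibonacci LFSR generates pseudo-random patterns, and the same shift structure compacts the responses into a signature (MISR style). The block under test, the injected fault, the register width and the tap positions are all hypothetical.

```python
def lfsr_step(state, taps=(3, 2), width=4):
    """Advance a Fibonacci LFSR by one clock; taps are bit positions."""
    feedback = 0
    for tap in taps:
        feedback ^= (state >> tap) & 1
    return ((state << 1) | feedback) & ((1 << width) - 1)

def generate_patterns(seed=1, count=15):
    """Pattern-generator role: list `count` successive LFSR states."""
    patterns, state = [], seed
    for _ in range(count):
        patterns.append(state)
        state = lfsr_step(state)
    return patterns

def signature(responses, seed=0):
    """Evaluator role: compact a response stream into one signature."""
    sig = seed
    for response in responses:
        sig = lfsr_step(sig) ^ response
    return sig

def good_block(x):
    # Hypothetical 4-bit block under test.
    return (3 * x + 1) & 0xF

def faulty_block(x):
    # Same block with a hypothetical fault corrupting one response.
    return good_block(x) ^ (0b0100 if x == 5 else 0)

patterns = generate_patterns()
golden = signature(good_block(p) for p in patterns)
observed = signature(faulty_block(p) for p in patterns)
print("distinct patterns:", len(set(patterns)))
print("fault detected:", golden != observed)
```

Only the final signature has to be compared with a known-good value, which is what allows a strictly cycle-based tester to check a block whose internal timing it cannot follow.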
Design flow We have used our in-house 0.25 µm CMOS process. The asynchronous wrapper is equivalent to about 1.3 k inverter gates. The tunable clock generation alone is 0.9 k gates. The asynchronous wrapper has a throughput of up to 150 Msps in request-driven mode and 100 Msps in local mode. This application needs 80 Msps. Figure: design flow. Asynchronous wrappers: AFSM specification, 3D (logic synthesis), 3DC tool (translation from 3D to structural VHDL). Synchronous blocks: functional specification, VHDL description. Flow steps: LoLA (formal analysis), Model Sim (abstract behavioural simulation), Synopsys DC (gate mapping), Model Sim (realistic behavioural simulation), Synopsys DC (timing-driven synthesis), Model Sim (post-synthesis simulation), Prime Power (power estimation), Cadence Silicon Encounter (layout), Model Sim (back annotation), Prime Power (power estimation), tape-out. 33
Area and power distribution Area and power statistics are based on the synthesized netlist data. Locally synchronous blocks occupy around 90% of the total area. The BIST circuitry requires around 3.5%, interface blocks 2.9%, and asynchronous wrappers 2%. Power estimation with the Prime Power tool has been performed, based on the switching activities in a realistic transceiver scenario. Synchronous datapath logic uses most of the power (around 52.4%), followed by the local synchronous clock trees with 34.5%, async-to-sync interfaces with 7%, and asynchronous wrappers with 2.9%. After layout, the estimated power consumption is 324.6 mW. 34
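For orientation, the Python sketch below turns the quoted power shares into approximate absolute values. It assumes that the percentage shares can be applied to the post-layout estimate of 324.6 mW; the slide does not confirm that both figures come from the same estimation run, so the resulting numbers are indicative only.

```python
total_mw = 324.6  # estimated power after layout, from the slide

shares_percent = {
    "synchronous datapath logic": 52.4,
    "local clock trees": 34.5,
    "async-to-sync interfaces": 7.0,
    "asynchronous wrappers": 2.9,
}

# Assumption: the shares apply to the post-layout total.
breakdown_mw = {name: total_mw * pct / 100
                for name, pct in shares_percent.items()}
unlisted_percent = 100 - sum(shares_percent.values())

for name, mw in breakdown_mw.items():
    print(f"{name:28s} {mw:6.1f} mW")
print(f"share not itemised on the slide: {unlisted_percent:.1f}%")
```

The itemised shares add up to 96.8%, so about 3.2% of the power is not attributed to any of the four listed categories.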
Implementation results Our GALS baseband processor has been fabricated and tested. The total number of pins is 120 and the silicon area including pads is 45.1 mm². The measured dynamic power dissipated in the purely synchronous baseband processor was 332 mW; for the GALS baseband processor it was slightly lower, at 328 mW. 35
Improving System Integration with GALS Synchronous baseband processor challenges: several clock domains, global clock tree generation, large number of clock leaves, clock skew handling, timing closure between blocks, clock gating. Solved by the GALS architecture: no global clock in GALS; clock leaves distributed over GALS blocks; clock skew reduced from 660 ps to 486 ps; communication between the blocks through handshaking; clock gating embedded in the asynchronous wrapper. 36
EMI measurement (I) The supply voltage variation spectrum of the inner processor core is measured. Figure annotation: ~ 5 dB. 37
EMI measurement (II) Additionally, the instantaneous cycle-to-cycle supply voltage peaks are reduced from 140 mV (synchronous design) to less than 100 mV (GALS). This reduction can be very important for mixed-signal designs and for secure systems. An application with fine-grained GALS partitioning can lead to results closer to the theoretical maximum reduction. 38
Conclusions GALS can be successfully used as a design technique in wireless communication systems. The main goal of simplifying the system integration was achieved. Furthermore, we achieved a significant reduction of supply noise and a slightly lower dynamic power consumption. 39
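The measured peak values on this slide can be put into relative terms with a few lines of Python. Because the GALS figure is given as "less than 100 mV", the sketch uses 100 mV as an upper bound, so the computed reduction is a lower bound; expressing it in dB with the 20·log10 voltage convention is a choice made here for comparison, not something stated on the slide.

```python
import math

sync_peak_mv = 140.0   # synchronous design, from the slide
gals_peak_mv = 100.0   # upper bound for GALS ("less than 100 mV")

relative_reduction = (sync_peak_mv - gals_peak_mv) / sync_peak_mv
reduction_db = 20 * math.log10(sync_peak_mv / gals_peak_mv)

print(f"peak reduction: at least {relative_reduction:.1%}")
print(f"in dB (voltage ratio): at least {reduction_db:.2f} dB")
```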
Future activities in GALS area Automation of the design flow for GALS systems. Further activities in reducing EMI. Figure: planned work areas (Modelling & Verification, GALS synthesis, GALS libraries, clock jitter, FPGA synthesis), with items including flexible building of models, system description, abstract verification, rules for partitioning, high-level datapath model, circuit synthesis, netlist, layout, layout scripts, gates, jitter generators, asynchronous basic components, EMI analysis, parameterized wrapper, asynchronous wrapper for FPGA, desynchronisation. Thank you very much for your attention. 40