The following foils are for a presentation in Munich for Siemens.

1 The following foils are for a presentation in Munich for Siemens.
Asynchronous Wave Pipelines for Giga-Hertz VLSI
Oliver Hauck, Atul Katoch
Integrated Circuits and Systems Lab, Departments of CS & EE, Darmstadt University of Technology
Department of Microelectronics, Indian Institute of Technology Bombay

2 Outline
Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
Comparison: AWPs vs. sync, async, and sync wave pipes
AWP Circuit Design
Application Example: EC Public Key Crypto Processor
  Cryptography background
  Chip architecture and implementation
Conclusion

3 Pipelining
Pipelining is the premier technique to better exploit hardware and boost the performance of VLSI chips
Clocking overhead presents a serious threat to deeply pipelined systems built on sub-micron CMOS processes running at GHz frequencies

4 General Framework for Pipelines
[Figure: Logic between two Latch/Reg stages; signals: Data, Clk]

5 Some Notations...

6 General Relations

7 Synchronous Pipeline
[Figure: Logic between two Latch/Reg stages; signals: Data, Clk]
Negative side-effects of gate-level pipelining:
  Increased latency, clock load/skew, power, area, design time
  More area for clocking and registers than for logic
Implementation options:
  Register- vs. latch-based, explicit latches vs. latchless
  TSPC vs. local clocks derived from global clock
  Static vs. dynamic, single-ended vs. dual-rail
Throughput determined by longest logic path + clock/register overhead
Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead
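As a rough illustration of the throughput and overhead points on this slide, the Python sketch below models the cycle time as the slowest stage plus a fixed clock/register overhead. The function names and all numbers (a 10 ns logic block, 0.3 ns overhead, perfectly even stage splitting) are assumptions chosen for illustration, not values from the slides.

```python
def sync_cycle_time(stage_delays_ns, overhead_ns):
    """Cycle time of a synchronous pipeline: slowest stage plus clock/register overhead."""
    return max(stage_delays_ns) + overhead_ns


def overhead_fraction(total_logic_ns, n_stages, overhead_ns):
    """Share of each cycle spent on clocking when the logic splits evenly (idealized)."""
    stages = [total_logic_ns / n_stages] * n_stages
    return overhead_ns / sync_cycle_time(stages, overhead_ns)


if __name__ == "__main__":
    # Hypothetical 10 ns logic block with 0.3 ns clock/register overhead per stage.
    for n in (1, 5, 20):
        cycle = sync_cycle_time([10.0 / n] * n, 0.3)
        share = overhead_fraction(10.0, n, 0.3)
        print(f"{n:2d} stages: cycle {cycle:.2f} ns, overhead share {share:.0%}")
```

In this idealized model, deeper pipelining shortens the cycle, but the overhead share of each cycle grows, which is the trade-off the slide names.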

8 Asynchronous Pipeline
[Figure: Logic between two Handshake stages; signals: Data, req_in, req_out, ack_in, ack_out]
Micropipeline (Sutherland 1989)
Synchronous clock replaced by asynchronous handshaking
Elastic operation: input and output rate may differ momentarily, and pipeline will buffer
Implementation options:
  4-phase (level) vs. 2-phase (event) protocol
  Bundled data (matched delay) vs. completion detection
Operation is data dependent, saves power during idle
As with fine-grain sync pipelines, throughput can be high; handshake causes high latency and backward stall
Plug & Play composability
Load on req and ack lines distributed
Used by Furber's group at Manchester U for AMULET1/2/3

9 Synchronous Wave Pipeline
[Figure: Wave Logic between two Latch/Reg stages; signals: Data, Clk]
Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power
However, tuning the logic and the delay elements is difficult
Several data waves simultaneously active in the logic
Logic has to minimize delay variations over P, T, V corners
Global clock used with constructive skew to adjust phases

10 Wave Pipelining: A Short Outline
Wave pipelining occurs when combinational logic is clocked faster than latency would allow
Several data waves are then active in the logic without being separated by storage elements
Latency remains constant and throughput is determined by delay differences rather than absolute delay
Requirement for delay balanced logic and complicated timing are the main hurdles
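The claim that throughput depends on delay differences rather than absolute delay can be sketched with a simplified model: a new wave may be launched once it can no longer catch up with the previous one. The Python below is a minimal sketch under that assumption. The names are hypothetical, and setup, hold, and skew are lumped into a single `overhead_ns` term.

```python
import math


def wave_min_period(d_min_ns, d_max_ns, overhead_ns=0.0):
    """Smallest launch interval in the simplified model.

    Bounded by the delay spread (d_max - d_min), not by d_max itself.
    """
    if d_min_ns > d_max_ns:
        raise ValueError("d_min_ns must not exceed d_max_ns")
    return (d_max_ns - d_min_ns) + overhead_ns


def waves_in_flight(d_max_ns, period_ns):
    """Number of data waves simultaneously inside the logic block."""
    return math.ceil(d_max_ns / period_ns)


if __name__ == "__main__":
    # Hypothetical block: slowest path 10 ns, fastest path 9 ns, 0.2 ns overhead.
    period = wave_min_period(9.0, 10.0, 0.2)
    print(f"min period {period:.1f} ns, waves in flight {waves_in_flight(10.0, period)}")
```

With these made-up numbers the latency stays at 10 ns while the launch interval drops to about 1.2 ns, so tightening the delay spread, not the absolute delay, is what raises throughput.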

11 Wave Pipelining: A Little History
Technique stems from the 60s and has had a reputation for being exotic ever since
Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU
Some working academic chips exist, mainly datapath
Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

12 Asynchronous Wave Pipeline (AWP)
[Figure: Wave Logic between two Wave Latch stages; signals: Data, req_in, req_out; matched delay on the request path]
AWP is a special case of the sync wave pipeline with the constructive skew set to worst-case logic delay
It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
Data words associated with events on request line
Several data waves and protocol events simultaneously active in the logic and the matched delay element, respectively

13 AWPs vs. Synchronous Pipelines
No global clock, instead a local clock (request) that is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with event on request
Many pipeline registers removed, thus requirements on the clock (request) relaxed
Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power and latency

14 AWPs vs. Asynchronous Pipelines
AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
AWPs not elastic: data at output has to be consumed
AWPs eliminate hazards as side-effect of delay balancing
AWPs have in common with other async methodologies: data dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

15 AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
  Replacing the constructive skew by worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate
  Using dynamic self-resetting logic controls delay variation and doesn't impact latency much
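The double-sided constraint can be made concrete with a textbook-style timing model, which is an assumption here and not taken from the slides. In a sync wave pipe the output register samples at `n_waves * T + skew` after launch. That instant must be late enough for the slowest path and early enough that the next wave has not yet arrived. In an AWP the request travels with the data through the matched delay, so only the wave-separation bound remains. All names and numbers in this Python sketch are hypothetical.

```python
import math


def sync_wave_period_window(d_min, d_max, n_waves, setup=0.0, hold=0.0, skew=0.0):
    """Valid clock periods (lower, upper) for a sync wave pipe with n_waves in flight.

    Late enough:  n_waves*T + skew >= d_max + setup
    Early enough: n_waves*T + skew <= T + d_min - hold
    """
    lower = (d_max + setup - skew) / n_waves
    upper = math.inf if n_waves == 1 else (d_min - hold - skew) / (n_waves - 1)
    return lower, upper


def sync_period_ok(period, d_min, d_max, max_waves=16, **timing):
    """True if the period falls inside the window for some number of waves."""
    for n in range(1, max_waves + 1):
        lower, upper = sync_wave_period_window(d_min, d_max, n, **timing)
        if lower <= period <= upper:
            return True
    return False


def awp_min_period(d_min, d_max, overhead=0.0):
    """Single lower bound for an AWP: waves must stay separated."""
    return (d_max - d_min) + overhead


if __name__ == "__main__":
    # Hypothetical logic block: fastest path 9 ns, slowest path 10 ns.
    for period in (4.0, 4.75, 6.0):
        print(period, "sync ok:", sync_period_ok(period, 9.0, 10.0),
              "AWP ok:", period >= awp_min_period(9.0, 10.0))
```

In this model the sync wave pipe has forbidden gaps between its windows (for example 4.75 ns with the numbers above), whereas the AWP accepts every period above its single lower bound.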

16 Wave Pipelining Combinational Logic
Overall goal: keep data wave coherent under all possible conditions (data, PTV)
Desirable architecture features:
  most logic paths have same depth
  fanin/fanout the same everywhere
First step: pad all short paths to maximum length
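The padding step can be sketched as a levelization pass over the gate netlist: compute each gate's depth, then insert buffers wherever a connection skips levels or an output sits below the maximum depth. The Python below is a minimal sketch that assumes every gate and buffer counts as one unit of depth. The netlist encoding and all names are hypothetical.

```python
def padding_plan(fanin, outputs):
    """Return (depth, buffers) for a combinational netlist.

    fanin maps each gate to the list of nodes driving it; nodes without an
    entry are primary inputs (depth 0). buffers maps an edge (driver, sink)
    to the number of buffers needed so that every path has the same depth.
    """
    depth = {}

    def level(node):
        if node not in depth:
            drivers = fanin.get(node, [])
            depth[node] = 1 + max(map(level, drivers)) if drivers else 0
        return depth[node]

    max_depth = max(level(out) for out in outputs)
    for gate in fanin:
        level(gate)

    buffers = {}
    for gate, drivers in fanin.items():
        for driver in drivers:
            slack = depth[gate] - depth[driver] - 1  # levels skipped by this edge
            if slack > 0:
                buffers[(driver, gate)] = slack
    for out in outputs:
        slack = max_depth - depth[out]  # pad shallow outputs to the deepest one
        if slack > 0:
            buffers[(out, "OUT")] = slack
    return depth, buffers


if __name__ == "__main__":
    netlist = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "a"]}
    print(padding_plan(netlist, ["g3", "g1"]))
```

Equal depth is only the first step. Delay balancing inside each gate and equal loading still have to be handled at the circuit level.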

17 Example: 64-b Brent-Kung Parallel Adder
[Figure: adder structure built from pg, PG, G, and xor cells]
Buffers provide for same depth on every logic path
All gates in the same column must have the same delay

18 Circuits
Logic style used has to minimize delay variation
Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
Pass transistor logic gives sloppy edges, thereby introducing delay variation
Dynamic logic is attractive as only the output high transition is data-dependent; output pulldown is done by precharge

19 Circuits (cont.)
Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge
What is needed is a dynamic logic family without precharge overhead: SRCMOS
Work done at IBM: classic paper by Chappell et al.: "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC (26), 11, 1991; or, more recently: "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC (34), 8, 1999, by Hwang et al.

20 SRCMOS
Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced
[Figure: SRCMOS gate with NMOS tree (N); labels: inputs, output]

21 Operation of a 2-AND

22 Delay Balancing at Transistor Level
NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
Short paths are padded with dummy devices
Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use
Every output has to see the same load
Lightly loaded outputs are given dummy cap

23 Example: Carry tree in a 64-bit adder

24 Gim Layout

25 Simulation of Gim cell
Pulses of the 4 possible input situations giving '1' at the output are tightly matched
Note: in this case Pxy=Gxy=1 never occurs

26 First Pulse Problem

27 Miller Effect

28 64-bit Adder Output Waveforms
[Figure: adder output waveforms; annotation: latching window]

29 Transistor Sizing
[Figure: SRCMOS gate with NMOS tree (N), inputs, and output; labels: Wpd, Wprecharge, Wkeeper, Cfeedback, Cload, Cdrive]
LINEAR SIZING:
  Wpd / Cdrive = const
  Cdrive / (Cload + Cfeedback + Wkeeper) = const
  Cfeedback / Wprecharge = const
  Wprecharge / Cdrive = const

30 Interconnect: Resistive Effects
0.9µm x 900µm MET2 parasitics: C=116fF, R=70 Ohms
[Figure: waveforms for wire models labelled "C only", "R/3, R/3, R/3", "R/2, R/2", and "RC only"]
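To get a feel for the size of the resistive effect, the Python sketch below estimates the Elmore delay of the quoted wire (R = 70 Ohms, C = 116 fF) when it is modelled as a ladder of n equal RC sections. The ladder topology is an assumption and need not match the exact models in the figure. Driver resistance and load capacitance are ignored, and the function name is hypothetical.

```python
def elmore_delay_ladder(r_total_ohm, c_total_f, n_segments):
    """Elmore delay at the far end of a wire split into n equal RC sections.

    Each section is a series resistor R/n followed by a capacitor C/n to ground.
    """
    r_seg = r_total_ohm / n_segments
    c_seg = c_total_f / n_segments
    # Each capacitor is charged through all the resistance between it and the driver.
    return sum((i * r_seg) * c_seg for i in range(1, n_segments + 1))


if __name__ == "__main__":
    R, C = 70.0, 116e-15  # wire parasitics quoted on the slide
    for n in (1, 2, 3, 100):
        print(f"{n:3d} sections: {elmore_delay_ladder(R, C, n) * 1e12:.2f} ps")
```

The single-section value equals R*C (about 8.1 ps), and the estimate approaches R*C/2 (about 4.1 ps) as the number of sections grows, following the closed form R*C*(n+1)/(2n).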

31 Interconnect: Coupling Effects
2 adjacent MET2 lines coupled by C=54fF

32 PTV Variations
SRCMOS provides some robustness by generating fresh pulses at every gate output
Pulsed operation reduces data dependency, coupling
PTV noise is not critical when drift is in the same direction across die
Critical are: temperature gradient, supply drop, and local variations
What is needed: Rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"

33 Cryptography Background
Cryptography - science of keeping communication private
Symmetric schemes - Private key (DES)
Asymmetric schemes - Public key (RSA & ECC)
Private key schemes are quite fast; public key schemes are safer

34 Security
For comparison: ECC using 261 bits is regarded as safer than RSA using 2048 bits
For secure data transmission one combines both public and private key schemes. Data is encrypted using the private key scheme and the key with the public key scheme
The frequency with which the key can be changed depends upon the speed of the public key cryptosystem

35 CISCO Data Encryption Service Adapter
[Cisco Systems]

36 Key Exchange Using Public Key Cryptosystem
For better security it pays to improve both schemes
If ECC scheme is fast then DES session keys can be changed more frequently
[Figure: Source and Sink; labels: ECC, Keys, DES]

37 DES Key Exchange using Public-Key Cryptosystem based on Elliptic Curves

38 Why is this secure?
Security based upon DLP: in a finite Abelian group we can easily compute kP given k and P
However, k is hard to compute out of kP and P
DLP extraordinarily hard for point group of elliptic curve:
Set of solutions of cubic equation over any field is an abelian group

39 Elliptic Curve Mathematics and Algorithm
Two types - supersingular and non-supersingular
Non-supersingular have the highest security
EC equation -

40 Choice of the Field
The field is of the type F_{2^m}
Having 2 as characteristic of a field helps in hardware implementation
Our choice m=261:
  Existence of Optimal Normal Basis
  Determines the data path width and security
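Two standard properties show why characteristic 2 and a normal basis suit hardware: addition in F_{2^m} is a carry-free bitwise XOR, and squaring in a normal basis is a cyclic shift of the coefficient vector. The Python sketch below models field elements as m-bit integers, with bit i holding the coefficient of the i-th basis element. That encoding and the function names are assumptions made here, since the slides do not show the representation.

```python
M = 261  # field degree chosen on the slide


def gf2m_add(a, b):
    """Addition in F_{2^m}: coefficient-wise XOR, no carry chain."""
    return a ^ b


def normal_basis_square(a, m=M):
    """Squaring in a normal basis: rotate the m-bit coefficient vector by one position."""
    mask = (1 << m) - 1
    return ((a << 1) & mask) | (a >> (m - 1))


if __name__ == "__main__":
    x = 0b1011
    print(bin(gf2m_add(x, x)))            # every element is its own additive inverse
    print(bin(normal_basis_square(x)))    # coefficients moved up by one position
```

Both operations need only wiring or a single XOR gate per bit in hardware, which leaves multiplication as the costly field operation.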

41 Adding Two Points Over Elliptic Curves

42 Switching to Projective Coordinates
The inversions are quite costly in terms of multiplications
Projective coordinates have no inversions
For m=261: [Table: cost of Double + Add in Normal vs. Projective coordinates]

43 Projective Coordinates

44 Optimal Normal Basis

45 Multiplication over ONBs

46 The Final Formula

47 Architecture of Multiplier
[Figure: multiplier built from abx cells feeding trees of 3_Xor gates, with wave latches between stages and a matched delay on the request line; circuit styles labelled Pseudo NMOS and SRCMOS]

48 Circuit Style Followed
Dual-rail cross-coupled SRCMOS circuit
NMOS trees are designed such that there is only one conducting path to ground

49 Pulses after First Stage

50 Delay Variations at various stages

51 The Total Latency

52 Architecture of the Cryptochip

53 Hierarchy of Control
[Figure: three levels of control]
Double-and-Add: key generation rate R; left shift k, x is the shifted-out bit; always EC double, EC add if x=1; Hamming weight = 40
EC arithmetic: EC double = 7 MUL, EC add = 13 MUL; R * (261*7 + 40*13) = R * 2347 MUL/s
Finite field arithmetic: ADD, MUL, LOAD/STORE; * 261; R * bit/s
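The multiplication count on this slide can be checked with a small model of the double-and-add loop: one EC double per key bit and one EC add per set bit. The Python sketch below is a minimal illustration that uses plain integers as a stand-in group so the result is easy to verify. The function names are hypothetical, the costs of 7 and 13 field multiplications are read from the slide, and a real implementation would operate on curve points instead.

```python
MUL_PER_DOUBLE = 7   # field multiplications per EC double (from the slide)
MUL_PER_ADD = 13     # field multiplications per EC add (from the slide)


def double_and_add(k, point, bits, double, add, zero):
    """MSB-first double-and-add: always double, add when the shifted-out bit is 1."""
    acc = zero
    doubles = adds = 0
    for i in reversed(range(bits)):
        acc = double(acc)
        doubles += 1
        if (k >> i) & 1:
            acc = add(acc, point)
            adds += 1
    return acc, doubles, adds


def field_multiplications(doubles, adds):
    """Total field multiplications for one scalar multiplication."""
    return doubles * MUL_PER_DOUBLE + adds * MUL_PER_ADD


if __name__ == "__main__":
    k = (1 << 40) - 1  # hypothetical 261-bit key with Hamming weight 40
    # Stand-in group: integers under addition, so k*P is ordinary multiplication.
    result, doubles, adds = double_and_add(
        k, 5, 261, double=lambda a: 2 * a, add=lambda a, b: a + b, zero=0)
    assert result == 5 * k
    print(doubles, adds, field_multiplications(doubles, adds))
```

For a 261-bit key of Hamming weight 40 this gives 261 doubles and 40 adds, which reproduces the 2347 multiplications per key quoted on the slide.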

54 Control Unit Architecture
[Figure: Logic block with registers (REG); signals: IN1, IN2, OUT, reset, X, req1 ... reqn, Req_out; annotation: for static operation]
AWP request signals trigger the state transitions.
Autonomous state transitions are triggered by signal X

55 Highest Level Control
Level-based control
[State diagram with states 1-8; transitions on X=0 / X=1; labels: Start/LoadX, ResetZ; LoadY; ShiftK; If Stop=1/KP_Done; If K=0; If K=1; ShiftK, Double; K=0, DoubleDone; K=1, DoubleDone/Add; AddDone]

56 The Request Signal Generation

57 Middle Level Control : Double Algo
Pulse-based control
[State diagram with states 1-5 and 58-63; transitions on X=0 / X=1; labels: Start, OPAX, OPBZ, MULT, MD, OPAA, Shift, OPBA, MULT, MD]

58 Request Signal Generation

59 Various States in a Pulse based Control

60 Architecture and Implementation

61 Conclusion
AWPs presented as an alternative approach to high-speed design, showing potential for GHz throughput without clocks
AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
64b adder + test circuit and EC crypto layout in the making
Feasibility of having totally asynchronous control
To do: support transistor sizing, quantify PTV impact

