The following foils are for a presentation in Munich for Siemens.

Uploaded: 2017-12-26T18:13:55+00:00
Duration: PTM23S44
Channel: Steven McKenzie

Asynchronous Wave Pipelines for Giga-Hertz VLSI
Oliver Hauck, Atul Katoch
Integrated Circuits and Systems Lab, Departments of CS & EE, Darmstadt University of Technology
Department of Microelectronics, Indian Institute of Technology Bombay

Outline
Pipelines: synchronous, asynchronous, wave pipelined, and asynchronous wave pipelined (AWP)
Comparison: AWPs vs. sync, async, and sync wave pipes
AWP circuit design
Application example: EC public key crypto processor
  Cryptography background
  Chip architecture and implementation
Conclusion

Pipelining
Pipelining is the premier technique to better exploit hardware and boost the performance of VLSI chips
Clocking overhead presents a serious threat for deeply pipelined systems built on sub-micron CMOS processes running at GHz frequencies

General Framework for Pipelines
[Figure: Latch/Reg, Logic, Latch/Reg; signals Data, Clk]

Some Notations...

General Relations

Synchronous Pipeline
[Figure: Latch/Reg, Logic, Latch/Reg; signals Data, Clk]
Negative side-effects of gate-level pipelining: increased latency, clock load/skew, power, area, design time
More area for clocking and registers than for logic
Implementation options:
  Register- vs. latch-based, explicit latches vs. latchless
  TSPC vs. local clocks derived from global clock
  Static vs. dynamic, single-ended vs. dual-rail
Throughput determined by longest logic path + clock/register overhead
Fine-grain pipelining allows high throughput at the cost of increased clock/register overhead

Asynchronous Pipeline
[Figure: Handshake, Logic, Handshake; signals Data, req_in, req_out, ack_in, ack_out]
Micropipeline (Sutherland 1989)
Synchronous clock replaced by asynchronous handshaking
Elastic operation: input and output rates may differ momentarily, and the pipeline will buffer
Implementation options:
  4-phase (level) vs. 2-phase (event) protocol
  Bundled data (matched delay) vs. completion detection
Operation is data dependent, saves power during idle
As with fine-grain sync pipelines, throughput can be high; handshake causes high latency and backward stall
Plug & Play composability
Load on req and ack lines distributed
Used by Furber's group at Manchester U for AMULET1/2/3

Synchronous Wave Pipeline
[Figure: Latch/Reg, Wave Logic, Latch/Reg; signals Data, Clk]
Wave pipelining potentially gives higher throughput than conventional pipelines at decreased latency and reduced clock load, area and power
However, tuning the logic and the delay elements is difficult
Several data waves are simultaneously active in the logic
Logic has to minimize delay variations over P, T, V corners
Global clock used with constructive skew to adjust phases

Wave Pipelining: A Short Outline
Wave pipelining occurs when combinational logic is clocked faster than latency would allow
Several data waves are then active in the logic without being separated by storage elements
Latency remains constant and throughput is determined by delay differences rather than absolute delay
Requirement for delay balanced logic and complicated timing are the main hurdles
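
To make the throughput claim concrete, the sketch below compares the minimum clock period of a conventional stage with that of a wave-pipelined stage. It assumes the usual textbook simplification: a conventional stage needs T >= D_max + overhead, while a wave pipeline needs T >= (D_max - D_min) + overhead. The function names and the example delays are illustrative assumptions, not values from the slides.

```python
import math


def conventional_min_period(d_max, overhead):
    """Conventional stage: the period must cover the slowest path plus overhead."""
    return d_max + overhead


def wave_min_period(d_max, d_min, overhead):
    """Wave pipeline (simplified): the period must cover the delay spread plus overhead."""
    return (d_max - d_min) + overhead


def waves_in_flight(d_max, period):
    """Number of data waves simultaneously travelling through the logic."""
    return math.ceil(d_max / period)


# Hypothetical delays in ns, chosen only for illustration.
D_MAX, D_MIN, OVERHEAD = 10.0, 9.0, 0.5
t_conv = conventional_min_period(D_MAX, OVERHEAD)
t_wave = wave_min_period(D_MAX, D_MIN, OVERHEAD)
n_waves = waves_in_flight(D_MAX, t_wave)
```

With a tightly balanced block (spread of 1 ns on a 10 ns path), the period shrinks while the latency stays at D_max; the price is that several waves share the logic at once.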

Wave Pipelining: A Little History
Technique stems from the 60s and has had a reputation for being exotic ever since
Wave pipelining was long dead before being revived by W. Burleson (U. Mass.), M. Flynn (Stanford U., PhDs by Wong, Klass, and Nowka), and C. Gray at NCSU
Some working academic chips exist, mainly datapath
Some commercial memory is wave pipelined (e.g. ULTRA-III cache), but no logic, as far as we know

Asynchronous Wave Pipeline (AWP)
[Figure: Wave Latch, Wave Logic, Wave Latch; signals Data, req_in, req_out; matched delay on the request path]
AWP is a special case of the sync wave pipeline with the constructive skew set to the worst-case logic delay
It is crucial that the delay element accurately tracks the delay behaviour of the logic over P, T, V corners
Data words are associated with events on the request line
Several data waves and protocol events are simultaneously active in the logic and the matched delay element, respectively

AWPs vs. Synchronous Pipelines
No global clock, instead a local clock (request) that is fed through the pipeline and obeys a simple asynchronous protocol, i.e. data is associated with an event on request
Many pipeline registers removed, thus requirements on the clock (request) relaxed
Synchronous pipelines can reach the throughput of AWPs only with excessive cost in area, power and latency

AWPs vs. Asynchronous Pipelines
AWPs deliberately sacrifice the ack and keep only the req to avoid protocol overhead
AWPs are not elastic: data at the output has to be consumed
AWPs eliminate hazards as a side-effect of delay balancing
AWPs have in common with other async methodologies: data dependent operation (avoids redundant transitions), composability (though inelastic), no global clock

AWPs vs. Synchronous Wave Pipelines
AWPs tackle two main difficulties in sync wave pipes:
Replacing the constructive skew by worst-case delay removes the double-sided timing constraint, i.e. in contrast to sync wave pipes, AWPs operate at any rate up to the maximum
Using dynamic self-resetting logic controls delay variation and doesn't impact latency much
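
The double-sided constraint can be illustrated with a simplified timing model. The sketch assumes that in a sync wave pipe with N waves in flight the capturing clock edge arrives at N*T + skew, must come after the slowest path (D_max) and before the next wave's fastest path (T + D_min); setup, hold and skew uncertainty are ignored. For the AWP it assumes the request is delayed by exactly D_max, leaving only the lower bound T > D_max - D_min. All names and numbers are hypothetical.

```python
def sync_wave_windows(d_max, d_min, skew=0.0, max_waves=4):
    """List (n, t_low, t_high): valid open intervals of clock period per wave count n."""
    windows = []
    for n in range(1, max_waves + 1):
        t_low = (d_max - skew) / n
        # With a single wave there is no following wave to race against.
        t_high = float("inf") if n == 1 else (d_min - skew) / (n - 1)
        if t_low < t_high:
            windows.append((n, t_low, t_high))
    return windows


def sync_period_ok(period, d_max, d_min, skew=0.0, max_waves=4):
    """True if the period falls inside any valid window of the sync wave pipe."""
    return any(lo < period < hi
               for _, lo, hi in sync_wave_windows(d_max, d_min, skew, max_waves))


def awp_min_period(d_max, d_min):
    """AWP under the stated assumption: a single lower bound on the request period."""
    return d_max - d_min


# Hypothetical delays in ns.
windows = sync_wave_windows(10.0, 8.0)
```

In this example the sync wave pipe works at a 6 ns period but fails at 9 ns, because 9 ns lies in the gap between two windows; the AWP model accepts every period above 2 ns.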

Wave Pipelining Combinational Logic
Overall goal: keep data wave coherent under all possible conditions (data, PTV)
Desirable architecture features:
  most logic paths have the same depth
  fanin/fanout the same everywhere
First step: pad all short paths to maximum length
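
The padding step can be sketched as a levelization of the gate network. The code below assumes the logic is given as a hypothetical `fanin` dictionary (gate name to list of driving nodes, primary inputs map to an empty list) and that one buffer costs one logic level; neither the representation nor the example netlist comes from the slides.

```python
from functools import lru_cache


def buffers_needed(fanin, outputs):
    """Return {(source, sink): buffer_count} so that every path has the same depth."""

    @lru_cache(maxsize=None)
    def depth(node):
        # Depth = longest distance from any primary input, counted in gates.
        ins = fanin[node]
        return 0 if not ins else 1 + max(depth(i) for i in ins)

    pads = {}
    # Every edge must span exactly one level; fill larger gaps with buffers.
    for gate, ins in fanin.items():
        for src in ins:
            gap = depth(gate) - depth(src) - 1
            if gap > 0:
                pads[(src, gate)] = gap
    # Outputs that finish early are padded up to the deepest output.
    target = max(depth(o) for o in outputs)
    for out in outputs:
        gap = target - depth(out)
        if gap > 0:
            pads[(out, "OUT")] = gap
    return pads


# Hypothetical netlist: g2 = f(g1, c), g1 = f(a, b), g3 = f(a).
netlist = {"a": [], "b": [], "c": [], "g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["a"]}
pads = buffers_needed(netlist, outputs=["g2", "g3"])
```

Here input `c` skips a level on its way to `g2` and output `g3` is one level short, so each gets one buffer.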

Example: 64-b Brent-Kung Parallel Adder
[Figure: adder stages labeled pg, PG, PG, G, xor]
Buffers provide for the same depth on every logic path
All gates in the same column must have the same delay

Circuits
Logic style used has to minimize delay variation
Earlier work focused on bipolar logic (ECL, CML), but CMOS is mainstream
Static CMOS is not well suited for wave pipelining; fixing the problem results in more power and slower speed
Pass transistor logic gives sloppy edges, thereby introducing delay variation
Dynamic logic is attractive as only the output high transition is data dependent; output pulldown is done by precharge

Circuits (cont.)
Using dynamic logic as in Burleson's Wave Domino jeopardizes the concept as it needs fine-grain precharge
What is needed is a dynamic logic family without precharge overhead: SRCMOS
Work done at IBM: classic paper by Chappell et al., "A 2-ns Cycle, 3.8-ns Access 512-kb CMOS ECL SRAM with a Fully Pipelined Architecture," JSSC 26(11), 1991; or, more recently, Hwang et al., "Implementation of a Self-Resetting CMOS 64-Bit Parallel Adder with Enhanced Testability," JSSC 34(8), 1999

SRCMOS
Distinguishing property of our SRCMOS circuits: precharge feedback is fully local, and NMOS trees are delay balanced
[Figure: SRCMOS gate with NMOS tree N; labels inputs, output]

Operation of a 2-AND

Delay Balancing at Transistor Level
NMOS tree is designed so that the precharge node is pulled down by a constant number of series devices
Short paths are padded with dummy devices
Delay variation is minimal when exactly one path is on, i.e. wide fanin ORs are hard to use
Every output has to see the same load
Lightly loaded outputs are given dummy cap

Example: Carry tree in a 64-bit adder

Gim Layout

Simulation of Gim cell
Pulses of the 4 possible input situations giving '1' at the output are tightly matched
Note: Pxy = Gxy = 1 never occurs in this case

First Pulse Problem

Miller Effect

64-bit Adder Output Waveforms
latching window

Transistor Sizing
[Figure: SRCMOS gate with NMOS tree N, inputs, output; labels Wprecharge, Wkeeper, Wpd, Cfeedback, Cload, Cdrive]
Linear sizing:
  Wpd / Cdrive = const
  Cdrive / (Cload + Cfeedback + Wkeeper) = const
  Cfeedback / Wprecharge = const
  Wprecharge / Cdrive = const

Interconnect: Resistive Effects
0.9 µm x 900 µm MET2 parasitics: C = 116 fF, R = 70 Ohms
[Figure: waveforms for the wire models: C only; R/3, R/3, R/3; R/2, R/2; RC only]
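
As a rough cross-check of how much the wire model matters, the sketch below computes the Elmore delay of a uniform RC ladder for the quoted parasitics. It is an assumption-laden estimate: the exact topologies of the models compared on the slide are not recoverable from the transcript, so a generic ladder with the resistor of each segment ahead of its capacitor is used, and the function name is hypothetical.

```python
def elmore_delay_ladder(r_total, c_total, segments):
    """Elmore delay at the far end of a wire split into equal R-then-C segments."""
    r_seg = r_total / segments
    c_seg = c_total / segments
    # Capacitor i (counted from the driver) sees the resistance of i segments.
    return sum((i * r_seg) * c_seg for i in range(1, segments + 1))


R_WIRE = 70.0        # Ohms, from the slide
C_WIRE = 116e-15     # Farads, from the slide

lumped = elmore_delay_ladder(R_WIRE, C_WIRE, 1)         # single lumped RC
three_seg = elmore_delay_ladder(R_WIRE, C_WIRE, 3)      # three segments
many_seg = elmore_delay_ladder(R_WIRE, C_WIRE, 1000)    # approaches R*C/2
```

The single lumped RC gives about 8.1 ps, while finer segmentation converges towards R*C/2, about 4.1 ps, which is why a lumped model overestimates the wire delay.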

Interconnect: Coupling Effects
2 adjacent MET2 lines coupled by C=54fF

PTV Variations
SRCMOS provides some robustness by generating fresh pulses at every gate output
Pulsed operation reduces data dependency and coupling
PTV noise is not critical when drift is in the same direction across the die
Critical are: temperature gradient, supply drop, and local variations
What is needed: a rule of thumb like "For process X, to be on the safe side, keep area between two latches < Y sqmm"

Cryptography Background
Cryptography: the science of keeping communication private
Symmetric schemes: private key (DES)
Asymmetric schemes: public key (RSA & ECC)
Private key schemes are quite fast; public key schemes are more secure

Security
For comparison: ECC using 261 bits is regarded as more secure than RSA using 2048 bits
For secure data transmission one combines both public and private key schemes: data is encrypted using the private key scheme, and the key using the public key scheme
The frequency with which the key can be changed depends upon the speed of the public key cryptosystem

CISCO Data Encryption Service Adapter
[Cisco Systems]

Key Exchange Using Public Key Cryptosystem
For better security it pays to improve both schemes
If the ECC scheme is fast, then DES session keys can be changed more frequently
[Figure: Source and Sink; labels DES, ECC, Keys]

DES Key Exchange using Public-Key Cryptosystem based on Elliptic Curves

Why is this secure?
Security is based upon the discrete logarithm problem (DLP): in a finite Abelian group we can easily compute Q = kP given k and P
However, k is hard to compute out of Q and P
DLP is extraordinarily hard for the point group of an elliptic curve: the set of solutions of a cubic equation over any field is an Abelian group
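
The asymmetry behind the DLP can be shown with a toy group. The sketch uses the multiplicative group of integers modulo a small prime as a stand-in for the elliptic curve point group, so it illustrates the principle only; the prime 101, the generator 2 and the function names are assumptions chosen for illustration.

```python
def exponentiate(g, k, p):
    """Easy direction: square-and-multiply, cost grows with the bit length of k."""
    return pow(g, k, p)


def brute_force_dlog(g, h, p):
    """Hard direction (naively): try every exponent until g**k == h (mod p)."""
    value = 1
    for k in range(p - 1):
        if value == h:
            return k
        value = (value * g) % p
    return None


P_TOY = 101      # small prime, toy size only
G_TOY = 2        # generates the full group modulo 101
secret = 57
public = exponentiate(G_TOY, secret, P_TOY)
recovered = brute_force_dlog(G_TOY, public, P_TOY)
```

The forward computation takes a handful of multiplications, whereas the naive search walks the group element by element; for a group of cryptographic size that search is infeasible.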

Elliptic Curve Mathematics and Algorithm
Two types: supersingular and non-supersingular
Non-supersingular curves have the highest security
[Equation: EC equation]

Choice of the Field
The field is of the type F_{2^m}
Having 2 as characteristic of a field helps in hardware implementation
Our choice m = 261:
  Existence of Optimal Normal Basis
  Determines the data path width and security

Adding Two Points Over Elliptic Curves

Switching to Projective Coordinates
The inversions are quite costly in terms of multiplications
Projective coordinates have no inversions
[Table: cost of Double + Add for m = 261, normal vs. projective coordinates]

Projective Coordinates

Optimal Normal Basis

Multiplication over ONBs

The Final Formula

Architecture of Multiplier
[Figure: multiplier datapath; labels: Pseudo NMOS, SRCMOS, abx cells 1 to 783, 3_Xor stages, wave latches, delay elements, request]

Circuit Style Followed
Dual-rail cross-coupled SRCMOS circuit
NMOS trees are designed such that there is only one conducting path to ground

Pulses after First Stage

Delay Variations at various stages

The Total Latency

Architecture of the Cryptochip

Hierarchy of Control
Double-and-Add: key generation rate R, left shift of k, Hamming weight = 40
EC arithmetic: EC double always, EC add if x = 1; 7 and 13 MUL respectively; R * (261*7 + 40*13) = R * 2347 MUL/s
Finite field arithmetic: ADD, MUL, LOAD/STORE; rate in bit/s is the MUL rate * 261
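
The multiplication count on this slide follows from a left-to-right double-and-add loop. The sketch below counts operations for a 261-bit multiplier of Hamming weight 40, using 7 MUL per EC double and 13 MUL per EC add as quoted. Plain integer addition stands in for the elliptic curve group so the result is easy to check; the function and variable names are hypothetical.

```python
MUL_PER_DOUBLE = 7    # field multiplications per EC double (from the slide)
MUL_PER_ADD = 13      # field multiplications per EC add (from the slide)


def double_and_add(k, point, bits, add, double, identity):
    """Left-to-right double-and-add: double on every bit, add when the bit is 1."""
    acc = identity
    doubles = adds = 0
    for i in reversed(range(bits)):
        acc = double(acc)
        doubles += 1
        if (k >> i) & 1:
            acc = add(acc, point)
            adds += 1
    muls = doubles * MUL_PER_DOUBLE + adds * MUL_PER_ADD
    return acc, doubles, adds, muls


# Example multiplier: 261 bits long with exactly 40 one-bits.
K_EXAMPLE = (1 << 260) | ((1 << 39) - 1)

# Stand-in group: integers under addition, so k*P is ordinary multiplication.
result, n_double, n_add, n_mul = double_and_add(
    K_EXAMPLE, 5, bits=261,
    add=lambda a, b: a + b, double=lambda a: 2 * a, identity=0,
)
```

The loop performs 261 doubles and 40 adds, i.e. 261*7 + 40*13 = 2347 field multiplications per key, matching the factor quoted for the MUL rate.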

Control Unit Architecture
[Figure: control unit with Logic block and two REG registers; signals IN1, IN2, OUT, X, reset, req1 to reqn, Req_out; AWP; note "For static operation"]
Request signals trigger the state transitions
Autonomous state transitions are triggered by signal X

Highest Level Control
[Figure: level-based control state diagram with states 1 to 8; labels: Start/LoadX, ResetZ; LoadY; Shift K; Stop=1/KP_Done; K=0; K=1; ShiftK, Double; K=0, DoubleDone; K=1, DoubleDone/Add; AddDone; transitions marked X=0 or X=1]

The Request Signal Generation

Middle Level Control : Double Algo
[Figure: pulse-based control state diagram with states 1 to 5 and 58 to 63; labels: Start, OPAX, OPBZ, MULT, MD, OPAA, Shift, OPBA; transitions marked X=0 or X=1]

Request Signal Generation

Various States in a Pulse based Control

Architecture and Implementation

Conclusion
AWPs presented as an alternative approach to high-speed design, showing potential for GHz throughput without clocks
AWPs avoid some problems of conventional wave pipes and (a)synchronous systems
64b adder + test circuit and EC crypto layout in the making
Feasibility of having totally asynchronous control
To do: support transistor sizing, quantify PTV impact
