1
Random Number Generator
Dmitriy Solmonov W1-1
David Levitt W1-2
Jesse Guss W1-3
Sirisha Pillalamarri W1-4
Matt Russo W1-5
Design Manager – Thiago Hersan
May 1, 2006
2
2 Why Random Numbers?
Real-Time Simulations
Encryption
Gambling
3
3 Encryption
  Need random numbers for authentication
  Key generation
  Software vs. Hardware
    –Less power/time per number
    –Portable
Gambling
  ePoker Rooms
  SoC Deck Generation
  Other future casino games
4
4 Business Plan
Potential markets
  Defense and Intelligence Organizations
  E-Gambling / Casinos
  Game Consoles
  Mobile Communication
License the IP
  Our design will be part of a larger ASIC or GPP design
5
5 IBAA Algorithm
Uses RC4 encryption algorithm
  –Cryptographically secure
  –Deterministic
1024-bit number generated
Internally Updated Seed
  –not user visible = secure
6
6 The IBAA Algorithm
#define ALPHA (8)
#define SIZE (1<<ALPHA)
#define ind(x) ((x)&(0x1F))              /* index into the 32-word memories */
#define barrel(a) (((a)<<19)^((a)>>13))  /* rotate a 32-bit value left by 19 bits */
uint32 A, B, Y, X;
uint32 M[32], R[32];
…
for ( i=0; i<SIZE; i++ ) {
    X = M[ind(i)];
    A = barrel(A) + M[ind(i+16)];
    M[ind(i)] = Y = M[ind(X)] + A + B;
    R[ind(i)] = B = M[ind(Y>>ALPHA)] + X;
}
7
Architecture
8
8 IBAA Algorithm to Architecture
for ( i=0; i<SIZE; i++ ) {
    X = M[ind(i)];
    A = barrel(A) + M[ind(i+16)];
    M[ind(i)] = Y = M[ind(X)] + A + B;
    R[ind(i)] = B = M[ind(Y>>ALPHA)] + X;
}
Per iteration:
  4 Reads from M
  1 Write to M
  1 Write to R
Dependencies, feedback, and RAW hazards
9
9 Algorithm to Architecture
Hardware Limits
  –Max. of 2 simultaneous reads from memory
Can’t do better than two stages
Each stage must take multiple cycles to complete
10
10 Algorithm to Architecture
Chosen Timing
  –Addition = 1 cycle
  –Memory Read = 0.5 cycles
  –Memory is clocked ½ period off phase
  –Set address and receive data in 1 cycle
When forwarding is applied, need 4 cycles per stage
11
11 [Block diagram: SRAM (M), SRAM (R), FSM, counters, control logic, adders, and registers, including Reg (B), Reg (Y), Reg (Y1), Reg (A), and Reg (M1) to Reg (M4)]
Stage 1
  1. M1 = M[i+16]
  2. X = M[i] | A = M1 + barrel(A)
  3. M3 = M[X] | C1 = (X == i-1)
  4. Y1 = A + (C1 ? Y : M3)
Stage 2
  1. Y = B + Y1
  2. M4 = M[Yaddr] | C2 = (i == Yaddr)
  3. B = X + (C2 ? Y : M4)
  4. M[i] = Y | R[i] = B
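The two-stage schedule can be checked in software against the sequential loop. The following is a minimal C sketch, not the group's hardware description. It assumes that stage 1 of iteration i overlaps stage 2 of iteration i-1, so M[i-1] has not yet been written when stage 1 reads M, which is why the forwarding conditions C1 and C2 are needed. The names `ibaa_state`, `ibaa_reference`, `ibaa_two_stage`, and `ibaa_models_agree`, and the seeding pattern, are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define ALPHA 8
#define SIZE (1u << ALPHA)
#define IND(x) ((x) & 0x1Fu)
#define BARREL(a) (((a) << 19) ^ ((a) >> 13))

/* Hypothetical container for the slide's M, R, A, B. */
typedef struct {
    uint32_t M[32], R[32], A, B;
} ibaa_state;

/* Sequential loop, as on the IBAA slide. */
void ibaa_reference(ibaa_state *s)
{
    for (uint32_t i = 0; i < SIZE; i++) {
        uint32_t X = s->M[IND(i)];
        s->A = BARREL(s->A) + s->M[IND(i + 16)];
        uint32_t Y = s->M[IND(X)] + s->A + s->B;
        s->M[IND(i)] = Y;
        s->B = s->M[IND(Y >> ALPHA)] + X;
        s->R[IND(i)] = s->B;
    }
}

/* Two-stage model: in time slot t, stage 1 handles iteration t while
 * stage 2 finishes iteration t-1. Stage 1 reads M before M[t-1] is
 * written, so C1 forwards Y when the read would hit that entry. */
void ibaa_two_stage(ibaa_state *s)
{
    uint32_t X1 = 0, Y1 = 0;          /* pipeline registers between stages */
    for (uint32_t t = 0; t <= SIZE; t++) {
        int s2_active = (t > 0);
        uint32_t Xs2 = X1;            /* X of iteration t-1 */
        uint32_t Y = s->B + Y1;       /* stage 2, step 1 (unused when t == 0) */

        if (t < SIZE) {               /* stage 1 for iteration t */
            uint32_t M1 = s->M[IND(t + 16)];
            uint32_t X = s->M[IND(t)];
            s->A = M1 + BARREL(s->A);
            uint32_t M3 = s->M[IND(X)];
            int C1 = s2_active && IND(X) == IND(t - 1);
            Y1 = s->A + (C1 ? Y : M3);
            X1 = X;
        }
        if (s2_active) {              /* rest of stage 2 for iteration t-1 */
            uint32_t i = t - 1;
            uint32_t yaddr = IND(Y >> ALPHA);
            uint32_t M4 = s->M[yaddr];
            int C2 = (yaddr == IND(i));
            s->B = Xs2 + (C2 ? Y : M4);
            s->M[IND(i)] = Y;
            s->R[IND(i)] = s->B;
        }
    }
}

/* Returns 1 when both models end with identical M, R, A, B. */
int ibaa_models_agree(uint32_t seed)
{
    ibaa_state ref, pipe;
    memset(&ref, 0, sizeof ref);
    for (uint32_t k = 0; k < 32; k++) {
        seed = seed * 1664525u + 1013904223u; /* arbitrary fill, not the chip's seeding */
        ref.M[k] = seed;
    }
    pipe = ref;
    ibaa_reference(&ref);
    ibaa_two_stage(&pipe);
    return memcmp(&ref, &pipe, sizeof ref) == 0;
}
```

Calling `ibaa_models_agree` with a few different seeds is a quick way to see that the forwarding terms keep the overlapped schedule equivalent to the original loop; removing either the C1 or the C2 forwarding path in the sketch is expected to break that equivalence whenever the corresponding hazard occurs.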
12
Design For Manufacture Regular Fabrics
16
16 Why DFM?
Ability to print on smaller processes
Robust Manufacturability
Sacrifice area, speed and metal layers for a regular design
17
17 Sample Layout: Regular Fabrics
18
18 Lithography Simulations
19
Hardware
20
20 Adder
Four adders execute 256 times.
Hybrid adder
Fast and low power.
Block layout (inputs -> sum, carry out):
  CS4: A[3:0], B[3:0] -> S[3:0], C’[4]
  CS6: A[9:4], B[9:4] -> S[9:4], C[10]
  CS18: A[27:10], B[27:10] -> S[27:10], C’[28]
  CS4: A[31:28], B[31:28] -> S[31:28], C[32]
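The block partition can be illustrated with a small behavioral model. This is a sketch only: the slide does not spell out what "CS" stands for or how the blocks are built internally, so the code below models just the 4/6/18/4-bit split and the carry hand-off at bits 4, 10, 28, and 32, not the skip or select logic or the inverted carries. The name `blocked_add` is hypothetical.

```c
#include <stdint.h>

/* Block widths from the slide, least significant block first. */
static const unsigned block_width[4] = {4, 6, 18, 4};

/* Adds a and b block by block, passing the carry between blocks.
 * The final carry (C[32] on the slide) is returned through carry_out. */
uint32_t blocked_add(uint32_t a, uint32_t b, unsigned *carry_out)
{
    uint32_t sum = 0;
    unsigned carry = 0;
    unsigned lo = 0;                          /* lowest bit of the current block */

    for (int k = 0; k < 4; k++) {
        unsigned w = block_width[k];
        uint32_t mask = (1u << w) - 1u;       /* w is at most 18, so no overflow */
        uint32_t ak = (a >> lo) & mask;
        uint32_t bk = (b >> lo) & mask;
        uint32_t sk = ak + bk + carry;        /* fits in w + 1 bits */

        sum |= (sk & mask) << lo;
        carry = (sk >> w) & 1u;               /* carry into the next block */
        lo += w;
    }
    if (carry_out) {
        *carry_out = carry;
    }
    return sum;
}
```

For any inputs the result matches ordinary 32-bit addition; the partition changes only where the carry chain is cut, which is what a hybrid design exploits to shorten the critical path.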
21
21 32-Bit Adder: First 4 Bits
22
22 32-Bit Adder: CS6 Block
23
23 32-Bit Adder: CS18 Block
24
24 32 Bit Fast Adder
25
25 Adder Performance
Delay: 1.56 ns
Energy Consumption
  –(worst case switching): 12.4 pJ
Power Dissipation
  –(estimating with our switch factor): 148 μW
26
26 SRAM
Single Bus Cell
Double Bus Cell
27
27 SRAM
28
28 Functional Verification
Structural Verilog vs. C Code:
  –Generate numbers under equal load conditions
  –Compare Numbers
Schematic vs. Structural Verilog
  –Under equal inputs, check if port outputs match
LVS
29
29 Verification
Schematic and Extracted Parasitic SPICE simulations of major blocks
  –Check for clean signals
  –Check delays and rise/fall times
Extracted Parasitic simulation of critical Register-Register Path
  –Signals are clean
  –Delay = 2.1 ns
Extracted Parasitic simulation of chip clock distribution
30
30 Critical Delay
31
31 Final Layout
32
32 Poly Density 7.52%
Metal1 Density 20.85%
33
33 Metal2 Density 19.89%
Metal3 Density 18.76%
34
34 Metal5 Density 6.8%
Metal4 Density 9.36%
35
Analysis
36
36 Specifications
Pins
  –36 input pins: 32-bit seed input, gen, read, rst, clk
  –34 output pins: 32-bit random output, rdy, done
  –2 input/output pins: vdd, gnd
475 MHz chip speed
436 kHz throughput
37
37 Putting it All Together
Where two values are given, they are listed as Schematic / ExtractRC.

Part | Trans Count | Area (um^2) | Density | Prop Delay (ns) | Power (1x) (mW) at 500 MHz | Power (Avg) (mW) at 475 MHz
Adders (4) | 5,856 (1,464 ea.) | 25,200 (6,300 ea.) | 0.232 | 1.45 / 1.56 | 0.60 / 0.62 | 0.14 / 0.148
SRAM (M&R) | 17,736 (M=10,458 R=7,278) | 51,000 (M=35,000 R=16,000) | 0.348 (M=0.293 R=0.456) | 0.735 / 0.845 | W: 0.51 / 3.25, R: 0.19 / 1.40 | 0.27 / 1.86
Regs (10) | 6,400 (640 ea.) | 38,400 (3,840 ea.) | 0.167 | 0.220 / 0.275 | 0.53 / 0.59 | 0.13 / 0.145
Total | 33,371 | 182,000 | 0.194 | 2.1 ns (475 MHz) | ----- | 4.1 mW
38
38 Performance Comparison
Operation time for ~4,000,000 runs

Processor | Clock | Process | Time (ms)
Intel P4 | 3.20 GHz | 90 nm | 5000
W1-2006 | 475 MHz | 180 nm | 9000
AMD Opteron Blade | 1.005 GHz | not listed | 14000
ARM Intel XScale | 700 MHz | not listed | 125000
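The W1-2006 entry can be cross-checked against the Specifications slide with simple arithmetic. The sketch below assumes that one "run" corresponds to one generated output at the specified 436 kHz throughput; the slides do not define a run, so this is an assumption. The function names `cycles_per_output` and `time_ms` are hypothetical.

```c
/* Clock cycles per generated output implied by a clock rate and a throughput. */
double cycles_per_output(double clock_hz, double outputs_per_s)
{
    return clock_hz / outputs_per_s;
}

/* Time in milliseconds to complete `runs` outputs at a given throughput. */
double time_ms(double runs, double outputs_per_s)
{
    return 1000.0 * runs / outputs_per_s;
}
```

With the slide's figures, `cycles_per_output(475e6, 436e3)` comes to roughly 1089 cycles per output, and `time_ms(4e6, 436e3)` to roughly 9174 ms, which is in line with the 9000 ms listed for about 4,000,000 runs.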
39
39 Where to Now?
ERC, tapeout, etc.
Thermal noise unit to use as input seed
On-Chip Bus Interface
HyperTransport™ Interface
40
40 References
Jenkins, Robert J. “ISAAC”. http://burtleburtle.net/bob/rand/isaac.html
Chirca, Schulte, Glossner, et al. “A Static Low-Power, High-Performance 32-bit Carry Skip Adder”. http://mesa.ece.wisc.edu/publications/cp_2004-12.pdf
“CLA and Ling Adders”. http://umunhum.stanford.edu/~farland/notes.html
41
41 Questions