At-Speed Test Considering Deep Submicron Effects

D. M. H. Walker Dept. of Computer Science Texas A&M University

Life as a DFT Engineer Test Cost Quality Yield

Outline Introduction KLPG Results on Silicon Supply Noise & Power
Model Conclusions

Test Cost Must Fall Fast
Test cost/transistor must follow Moore’s Law Need 100x transistor test cost reduction! ITRS2005, Cost-Perf MPU

But … Test cell cost not following Moore’s Law
Already using DFT tester or old ATE Handler and probe card cost not scaling High-speed I/Os cost more Must reduce test time Parallel testing running out of gas Must reduce test time per transistor Constant test time per chip

Reducing Test Time Per Transistor
Less time, more transistors → must wiggle more wires in less time
Higher power dissipation
But… tests/transistor rising to cope with DSM
Even higher power dissipation

But … Max Power/Transistor is Falling
ITRS2005, Cost-Perf MPU

Future Digital Test All About Power?
Fraction of chip that can be fired up at one time is decreasing Mission-mode power constraints Test supply noise Test thermal limits Limits intra-die test parallelism How to screen most defects per Joule?

Eliminate Wasted Energy
Useless transitions Scan power Unnecessary capture power Much research/commercial activity Low-odds test patterns Luck – tails of BIST, WRP Shotgun blasts – N-detect, DOREME, TARO, …

Squeeze Chip Harder Instead
IDDQ MINVDD … Small Delay Defect KLPG

Our Delay Test Research
Defect-Based Delay test ATPG considering: Resistive shorts and opens Process variation Capacitive crosstalk Temperature gradients Power supply noise Power dissipation

Kitchen Sink Fault Model
Have we forgotten anything? Crosstalk Die-to-Die Supply Temp Litho Intra-Die Spot Defect Noise Process Variation Functional Failure Local Delay Fault Reliability Hazard Combined Delay Fault Global Delay Fault

Target Realistic Defects
Resistive Short Stanojevic et al Resistive Open Madge et al

But… Fault population too large Limited fault model accuracy
Limited fab data Limited calibration time, cost Fault model must be abstract enough for fortuitous detection of unmodeled faults “The vectors do the work, not the fault model” – J. H. Patel

Our Approach Test K longest rising/falling paths through each gate/line (KLPG) Targets resistive opens Targets resistive shorts Sensitize opposing lines Few bridges per line with largest critical area [Tripp] Larger K deals with delay uncertainty Supply noise Process variation Delay modeling errors Crosstalk Analogy to N-detect


K Longest Paths Per Gate (KLPG)
CodGen ATPG developed at Texas A&M CodSim fault simulator Tests K longest paths through each gate/line Detects small defects on each gate/line Covers all transition faults Needs SDF May produce more patterns than TF test

Test Generation Algorithm
Search space Scan cells Scan cells Constraints from outside search space

KLPG Test Generation Flow
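Start
Extend the partial path with longest potential delay
Apply side inputs and perform direct implications
Conflict? Y: drop the extended partial path and continue with the next partial path
Conflict? N: check whether the path is complete
Complete path? N: apply heuristics to avoid false paths, then insert into the partial path store
Complete path? Y: final justification
End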
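The flow above is a best-first search over partial paths. The following is a minimal sketch of that idea, not the CodGen implementation: it finds the K longest source-to-sink paths in a gate-level DAG, always extending the partial path with the largest potential delay. The names `k_longest_paths`, `fanout`, `delay` and the `conflict` hook (a stand-in for side-input implication and false-path checks) are assumptions made for illustration, and final justification is omitted.

```python
import heapq
from functools import lru_cache


def k_longest_paths(fanout, delay, sources, k, conflict=lambda path: False):
    """Best-first search for the k longest source-to-sink paths in a DAG.

    fanout:   dict node -> list of successor nodes (hypothetical netlist view)
    delay:    dict node -> delay contributed by that node
    conflict: hypothetical hook; returns True if a partial path is untestable
    """
    @lru_cache(maxsize=None)
    def remaining(node):
        # Longest delay from `node` (inclusive) to any sink.
        successors = fanout.get(node, ())
        return delay[node] + max((remaining(n) for n in successors), default=0)

    # Partial path store: a max-heap keyed on potential delay (negated for heapq).
    store = [(-remaining(s), (s,)) for s in sources]
    heapq.heapify(store)
    complete = []
    while store and len(complete) < k:
        neg_potential, path = heapq.heappop(store)
        successors = fanout.get(path[-1], ())
        if not successors:
            # Complete path: its potential delay is now its exact delay.
            complete.append((-neg_potential, path))
            continue
        so_far = sum(delay[n] for n in path)
        for nxt in successors:
            extended = path + (nxt,)
            if conflict(extended):
                continue  # drop extensions that cannot be sensitized
            heapq.heappush(store, (-(so_far + remaining(nxt)), extended))
    return complete
```

Because the potential delay never underestimates the delay of any completion of a partial path, complete paths leave the store in order of decreasing delay, so the search can stop as soon as K of them have been found.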

Apply KLPG to Industrial Designs
TetraMAX/FastScan dofile/procfile/library Hierarchical Verilog Design Same inputs as TF test generation CPU Time: ~3x TF test generation Memory: 400 MB/1M gates KLPG Test Generator sdf K Test Data … Only 0’s & 1’s are different from TF ATPG outputs Test Sequence …… Load pattern Pulse clock ...... Tester

Chips are Slower Using KLPG Test
Transition fault test 10 ns KLPG-1 test 11 ns 180 nm / 40k gates / full scan / 2.3k scan cells

Cleaner Shmoo Transition Fault KLPG

KLPG Silicon Experiment
TI ASIC design 738K gates (597K gates in 250 MHz clock domain) 130 nm technology 5 clock domains (highest 250 MHz) 8 scan chains, muxed D flip-flops in 250 MHz domain 24 devices marginally pass regular TF test

Test Size Comparison
Test              # Patterns  Comments
Path Delay Test   744         Tests critical paths
Regular TF        1 445       Dynamic compaction
Randomized TF     1 471
KLPG-1            12 579      Static compaction

KLPG Test Results
Up to 3% delay decrease seen in KLPG-1
KLPG unique detects

KLPG with Bridge Faults
KLPG-1 targets resistive opens SAF, N-detect, KLPG-1 tests have good coverage of resistive shorts Sar-Dessai and Walker, ITC99 Qiu, Walker et al, TECHCON03, VTS04 Sensitization much easier than propagation Propagate first, then sensitize Ignore input-dependent gate strength Ignore opposing transition

Bridge Fault ATPG Approach
Generate longest path through bridge site Set DC bits to sensitize opposing value on bridged line (e.g. 0 opposing ) No extra uncompacted patterns, since need to test resistive opens Else, set opposing value first, then generate path “Top-off” patterns, but may compact

Bridge Fault Robust LOC Results
                              Robust KLPG-1 with Shorts     Robust KLPG-1 w/o Shorts
Circuit  # Lines  # Shorts    # Test Patterns  CPU (m:s)    # Test Patterns  CPU (m:s)
s13207   13 207   26 414      1 006            2:30         909              2:25
s15850   15 850   31 700      520              2:45         472              2:35
s35932   35 932   71 864      43               15:51        36               14:31
s38417   38 417   76 834      1 061            15:03        949              14:21
s38584   38 584   77 168      589              12:00        526              11:20
2 random non-feedback bridges to each line = TF count. Shorts between gate inputs and to power/GND are excluded.

Bridge ATPG Results Modest cost increase
Pattern count increases % ATPG time increases <9.2% Expect less impact on large designs, due to lower care bit density

KLPG Improvements Compaction Coverage metric Crosstalk

Dynamic Compaction
Test Set Compaction
Static Compaction: performed after test generation
Dynamic Compaction: performed during test generation
Classic method good for stuck-at tests but not suitable for path delay tests
Develop dynamic compaction for KLPG tests

Dynamic Compaction Approach
[Figure: a circuit with inputs I1-I6 and outputs O1-O6, shown three times: the vector pairs V1 and V2 with the necessary assignments (NAs, circles) for Path1; the vector pair V3 with the NAs (Xs) for Path2; and the vector pair V4 with the NAs for Path1 and Path2 together.]
The basic idea of our dynamic compaction algorithm is as follows. When a path is generated, the set of necessary assignments needed to sensitize the path and propagate the transition along it is identified. The first picture shows Path1, with a falling transition through line A. The circles are the necessary assignments for Path1. V1 is the vector pair generated by a PODEM-like final justification procedure. V2 can detect Path1 too, but because of the limitations of the PODEM-like final justification algorithm, it cannot be generated on its own.
The second picture shows Path2, with a rising transition through line B. Suppose V3 is the only vector pair that can test Path2. If we generate V1 and V3 first, we cannot compact them later, because the first bit of input I2 conflicts. But Path1 and Path2 can potentially be compacted, since V2 is compatible with V3. The solution is to combine the two sets of necessary assignments and then call the final justification procedure. V4 can then be generated to test Path1 and Path2 at the same time.

Dynamic Compaction Algorithm
Definitions:
vector: output for ATE
pattern: a set of necessary assignments associated with one or more paths
POOL: a data structure to save patterns
Check the compatibility between the necessary assignments of a new path and a pattern in the POOL
Generation of the final test vector is postponed until test generation is finished

Dynamic Compaction Flow
Start with new pattern F
POOL empty? Y: insert F into POOL, End
POOL empty? N: set P to the first pattern in POOL
Conflict check between F and P
Conflict? N: combine necessary assignments of F and P and do final justification
Pass justification? Y: update P with F, reorder P in POOL, End
Conflict? Y, or justification failed: check End of POOL?
End of POOL? N: set P to the next pattern in POOL and repeat the conflict check
End of POOL? Y: insert F into POOL, End
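The pool-based flow can be sketched in a few lines. This is an illustrative sketch only: necessary assignments are modeled as a dict from signal name to required value, `justify` is a hypothetical stand-in for the final justification procedure, and the "reorder P in POOL" step is left out because the slides do not state the ordering criterion.

```python
def compatible(new_nas, pattern):
    """True if no signal is required to take two different values."""
    return all(pattern.get(sig, val) == val for sig, val in new_nas.items())


def add_to_pool(pool, new_nas, justify=lambda nas: True):
    """Merge the necessary assignments of a new path into the first pattern
    in the pool that is conflict-free and still passes justification.

    pool:    list of dicts (signal -> value), one dict per pattern
    new_nas: dict of necessary assignments for the new path
    justify: hypothetical callback standing in for final justification
    Returns the index of the pattern that now covers the new path.
    """
    for index, pattern in enumerate(pool):
        if not compatible(new_nas, pattern):
            continue                      # conflict: try the next pattern
        merged = {**pattern, **new_nas}   # combine necessary assignments
        if justify(merged):
            pool[index] = merged          # update P with F
            return index
    pool.append(dict(new_nas))            # end of pool: start a new pattern
    return len(pool) - 1
```

Final vectors would be produced from the pool only after all paths have been processed, which matches the postponed vector generation described in the definitions.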

Dynamic Compaction Experiments
K Longest robustly testable path generation through each line (K=1) Launch-on-shift/capture Compare to static compaction POOL size influence on vector count KLPG-1 vs Transition Fault Test


Circuits ISCAS 89 benchmark circuits Full scan Unit delay model
Chip1 (44K Gates) Partial scan Embedded memories Chip2a (22K Gates) SDF delay

Robust Test (launch-on-capture)
[Figure: vector count and % vector reduction rate per circuit. Reduction rates shown: 60%, 60%, 48%, 21%, 55%, 43%, 39%, 53%, 37%, 23%, 26%]

POOL Size Influence (LOC robust)
[Figure: # of vectors vs. POOL size]

KLPG-1 Test Set Construction
Non-robust test
Robust test
Long transition fault test
A long transition fault test tests longer paths than a regular transition fault test

Test Size (KLPG-1 vs. Transition)
Vector counts, one row per circuit (chip1, chip2a and chip3 are among the circuits shown; chip3 is the last row):
Robust  Non-robust  Long TF  Total  Comm. TF
289     6           7        302    231
24      4           -        28     68
425     41          1        467    365
249     134         70       453    528
1192    452         103      1747   1900
619     687         493      1799   2537
4406    1688        550      6644   1445
For comparison, this figure shows the dynamically compacted KLPG-1 test size and the transition fault test size. The Robust, Non-robust and Long TF columns give the vector counts of the robust test, the top-off non-robust test and the top-off long transition fault test. Note that many non-robust paths are also compacted into previously generated robust test vectors, so the Non-robust column counts only the newly generated vectors. The same holds for Long TF. The Total column gives the total vector count of the KLPG-1 test. The number of transition fault test vectors generated by the commercial tool is listed in the last column.
The KLPG-1 test size is at the same level as the transition fault test, and in several cases it is even smaller. For chip3, the gate count is around 600K and the KLPG-1 test size is around 5 times that of the commercial tool, but in this case the commercial tool generated a very small vector set for this design. This indicates that many transition faults in chip3 are easy to detect, but testing them through the longest paths results in many more necessary assignments and a lower compaction rate. Intuitively, the KLPG-1 test size should be several times bigger than the commercial tool's, since we add more constraints to the circuit. Considering the higher quality of the KLPG-1 test, this is very promising.

Dynamic Compaction Results
Dynamic Compaction for KLPG tests Up to 3x reduction in vector count ~2x CPU time increase Small additional memory consumption KLPG-1 test size comparable to commercial transition fault test

Dynamic Compaction Future Work
Heuristics to accelerate dynamic compaction Advanced algorithms for more optimal results Dynamic compaction for more complicated industrial designs Constraints for power supply noise and temperature

Delay Fault Coverage Metric
VTS04 metric not constructive for delay test quality Need longest path through each line to accurately compute it – must run KLPG SDQM has same problem Die-to-die and intra-die process variation Die-to-die now done as post-process – wasteful Simple bounds on when to stop path generation – coverage vs. pattern count

Fault Coverage vs. K c7552 Drop fault when UB/LB coverage falloff
Most sites need only a few paths

Ideal K in C5315 with Die-to-Die
Most sites need 1 or 2 paths Most paths in many-path sites are ~same length Can drop most w/o much coverage loss

Capacitive Crosstalk Crosstalk affects near-critical paths
Don’t worry about near-critical due to spot defect – probability dominated by defect Consider case (b)

Capacitive Crosstalk Filter out couplings based on arrival time
Use simple greedy algorithm Couplings in order of delay increase Sensitize opposing transition one at a time May miss many little coupling case Worry about timing alignment? Probabilistic Compaction impact? – More care bits
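The greedy selection described above can be sketched as follows. This is an assumption-laden illustration rather than the authors' tool: each coupling is a dict with hypothetical keys `t_min`, `t_max` (aggressor arrival window) and `delay_increase`, and `try_sensitize` is a hypothetical callback that reports whether the opposing transition on that aggressor can be set given what has already been chosen.

```python
def select_aggressors(couplings, victim_window, try_sensitize):
    """Greedy crosstalk target selection.

    1. Filter out couplings whose aggressor arrival window cannot overlap
       the victim's transition window.
    2. Visit the rest in order of decreasing delay increase.
    3. Try to sensitize the opposing transition one coupling at a time.
    """
    lo, hi = victim_window
    candidates = [c for c in couplings if c["t_min"] <= hi and c["t_max"] >= lo]
    candidates.sort(key=lambda c: c["delay_increase"], reverse=True)
    chosen = []
    for coupling in candidates:
        if try_sensitize(coupling, chosen):
            chosen.append(coupling)
    return chosen
```

Handling the largest couplings first is what makes the method cheap, and it is also why a victim with many small couplings may be underestimated, as the slide notes.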

Crosstalk Alignment Need path from PI to crosstalk site to have correct timing KLPG ATPG algorithm uses min/max delay constraints Targets are opposing transition in timing window Constraints narrow as path is built If potential alignment/transition is not realized, drop target Update timing with each crosstalk site, since could set other crosstalk sites to help or oppose


Supply Noise
Supply noise significantly impacts the timing performance of DSM designs
Trends: frequency, gate density, power density, supply voltage, delay sensitivity to voltage
Excessive supply noise may come from:
Random fill of don’t-care bits
Test pattern compaction
Noise → longer delay → overkill
As technology advances into the DSM regime, designs have become more and more sensitive to supply noise. Consider the following trends. Operating frequency and gate density are increasing, which means there is more simultaneous switching activity per unit area, so power density increases. Meanwhile, the supply voltage is decreasing and gate delay is becoming more sensitive to voltage variation. These trends lead to a more significant impact of power supply noise on delay.
Why is there excessive supply noise in delay testing compared with functional mode? It may come from two sources. The first is random filling of don’t-care bits. Delay test patterns generated by ATPG usually have a low fill rate, and most bits are don’t-care bits. In industry, these don’t-care bits are usually filled randomly to increase fortuitous detection of non-target defects. Unfortunately, industry data shows that random fill can produce excessive supply noise. The second source is compaction: a highly compacted test pattern may generate excessive noise as well. As mentioned in the motivation slide, excessive noise causes unexpected extra delay and finally results in noise-induced overkill. That is the background of the whole problem.

Concept: Effective Region
Circuit extracted as RC network
Effective Region for a device: RC time constant < clock cycle
Assumption: all caps in region are equally effective
No action in current cycle: irrelevant
Discharge in current cycle: effective
First, I want to introduce several key concepts for the model. The first concept is the effective region. We first extract the circuit as an RC network. Assume a current impulse occurs somewhere in the network. Capacitors around this impulse will begin to discharge, from nearby to far away, resulting in a localized voltage drop. If a capacitor is far enough away, it may not discharge within the current clock cycle. Such capacitors are considered irrelevant to the noise analysis in the current cycle. Therefore, we define the effective region of a switching device as the largest area centered on the device whose RC time constant is less than the clock cycle time. This means that, in the current clock cycle, the switching current for a device comes only from capacitors in its effective region.
To make things easier, we further assume that all capacitors in the region are equally effective regardless of where they are. For instance, take two capacitors A and B, where A is very close to the center of the effective region and B is close to the border. As long as A and B have the same capacitance, they supply the same amount of charge to the switching device.

Find Effective Region for a Device
Current algorithm: search region radius r from small to maximum
Practical improvement: binary search
Perform only once for one design
To find the effective region for a device, we can look at circular regions centered on this device, search from a small radius to the maximum, and stop once the RC time constant of the area exceeds the clock cycle time. A simple and practical improvement is to use binary search, starting at half of the maximum radius. The effective region is quite static. This is because resistance is static and decoupling capacitance is also static. Parasitic circuit capacitance is usually much smaller than decoupling capacitance, and it varies little from pattern to pattern. Therefore, we only need to perform this search once.
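The binary-search variant can be sketched as below. It rests on an assumption the slides imply but do not state outright: the RC time constant of a circular region grows monotonically with its radius. `rc_time_constant` is a hypothetical callback that would come from the extracted RC network, and `tol` is an illustrative stopping tolerance.

```python
def effective_radius(rc_time_constant, clock_period, r_max, tol=1.0):
    """Largest radius whose region RC time constant stays below the clock period.

    rc_time_constant: hypothetical function radius -> RC time constant,
                      assumed to increase monotonically with radius
    clock_period:     clock cycle time, in the same unit as the time constant
    r_max:            largest radius to consider (for example, the die size)
    tol:              stop when the search bracket is narrower than this
    """
    if rc_time_constant(r_max) < clock_period:
        return r_max                      # the whole search area is effective
    lo, hi = 0.0, r_max                   # invariant: lo passes, hi fails
    while hi - lo > tol:
        mid = (lo + hi) / 2.0             # first probe is half of r_max
        if rc_time_constant(mid) < clock_period:
            lo = mid
        else:
            hi = mid
    return lo
```

Since the result depends only on resistance and (mostly decoupling) capacitance, it can be cached per device or per grid and reused for every test pattern.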

Concept: Grid Grid is the smallest unit for analysis
RC time small enough compared with clock
Uniform voltage level
Each grid contains:
Decoupling capacitance
Parasitic capacitance
Switching devices
Each grid → an Effective Region
Each Effective Region consists of a set of grids
The second key concept is the grid. The effective region is a concept that defines locality. If two switching devices are close enough, they will have the same effective region and the same voltage level, so it saves time to analyze them together. We therefore define the grid, which is the smallest unit for analysis. We divide the whole circuit into n × m grids. Each grid contains 1) decoupling capacitance, 2) parasitic capacitance, and 3) a number of switching devices. The RC time constant of a grid is small compared with the clock cycle time, so we can safely approximate that all switching devices in a grid have the same voltage and the same effective region. Therefore, one grid is associated with one effective region. We also view an effective region as a set of grids, instead of an area of devices and capacitors. The grid concept makes our computation much simpler, with only a slight impact on accuracy.

Grid Noise Model
[Figure: model of one grid, with decoupling capacitance Cd, parasitic capacitance Cp, and switching devices drawing currents Iswitching_1 ... Iswitching_n]
Assumptions:
Off-chip current is ignored during the launch cycle
Switching charge is provided equally by all grids in its effective region
The basic idea is this: during the beginning of the clock cycle, when most switching activity occurs, the power pads cannot immediately provide current to satisfy the switching current demand, because off-chip inductance prevents the supply current from rising immediately. Therefore, most of the charge demanded by the switching devices comes from on-chip capacitance in the effective region. The figure shows the model for one grid. Note that grids are not independent of each other. Each grid gets charge from the grids in its effective region, and it is also discharged by other grids if it belongs to their effective regions.
We make the assumption that off-chip current can simply be ignored. The reason is that it has little impact on propagation delay, since most transitions have completed before the off-chip current rises appreciably. For the effective region, we assumed that all capacitors in the region are equally discharged. Now that the unit of analysis is the grid, we likewise assume that all grids in the effective region are equally discharged.

Grid Noise Model
$V_{max} = \frac{\sum_i \alpha_i \cdot Q_i}{C_d + C_p}$
Grid $i$: a grid whose effective region covers the current grid
$Q_i$: switching charge of Grid $i$
$\alpha_i$: fraction of $Q_i$ provided by the current grid
Based on this model, we analyze every grid to calculate its worst-case voltage drop. As mentioned in the previous slide, a grid provides charge to every grid whose effective region it belongs to. So the worst-case voltage drop is the total charge provided by this grid, divided by the grid capacitance $C_d + C_p$. Here Grid $i$ is a grid whose effective region covers the current grid, and $Q_i$ is the total switching charge demanded by Grid $i$. $Q_i$ is shared by all grids in Grid $i$'s effective region, so $\alpha_i$ is the fraction of $Q_i$ that comes from the current grid. Using this formula, we can find the maximum voltage drop for all the grids on the circuit.
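The per-grid voltage drop can be computed in one pass over the switching grids. The sketch below assumes, as the previous slide does, that the charge demanded by a grid is shared equally by all grids in its effective region, so the fraction supplied by each member is one over the region size. The dict-based data layout and the names `switch_charge`, `region` and `cap` are hypothetical.

```python
def worst_case_drops(switch_charge, region, cap):
    """Worst-case voltage drop per grid under the equal-sharing assumption.

    switch_charge: dict grid -> charge demanded by its switching devices (C)
    region:        dict grid -> list of grids in its effective region
    cap:           dict grid -> Cd + Cp of that grid (F)
    """
    supplied = {g: 0.0 for g in cap}
    for i, q_i in switch_charge.items():
        members = region[i]
        alpha = 1.0 / len(members)        # equal share from every member grid
        for g in members:
            supplied[g] += alpha * q_i    # accumulates the sum of alpha_i * Q_i
    return {g: supplied[g] / cap[g] for g in cap}
```

If every effective region holds a bounded number of grids, the cost is proportional to the number of switching grids times the region size, which stays within the grid_count squared bound quoted later in the analysis flow.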

Switching Current Model
[Figure: triangular dynamic charging current waveform, rising from tbegin to Ipeak and falling back at tend]
Dynamic charging current: look-up table by simulation
Charge: Q = 0.5 · Ipeak · (tend – tbegin)
Short-circuit current: empirical function (saturation current, wire and device capacitance)
Switching charge needs to be calculated for each device, so we need a switching current model. The switching current drawn from the supply network in CMOS circuits consists mainly of two parts: the dynamic charging current into the output capacitive load, and the short-circuit current. In most cases the dynamic charging current is the more significant of the two. We model its waveform as a triangle. A table is built by simulation for each cell, so that one can determine its peak current and output transition time for different values of output load and input slope. Once we have the peak current and transition time from the table, we get the total charge as the area of this triangle. Short-circuit current is usually insignificant compared with dynamic charging current, so we simply use an empirical function to calculate the short-circuit charge.
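As a small worked sketch of the triangular charge model: the function names are illustrative, and because the slides do not give the empirical short-circuit function, the short-circuit charge is simply passed in as a number.

```python
def dynamic_charge(i_peak, t_begin, t_end):
    """Area of the triangular current waveform: Q = 0.5 * Ipeak * (tend - tbegin)."""
    return 0.5 * i_peak * (t_end - t_begin)


def switching_charge(i_peak, t_begin, t_end, short_circuit_charge=0.0):
    """Total charge drawn by one switching device.

    short_circuit_charge is a placeholder for the empirical short-circuit
    model, which is not specified here.
    """
    return dynamic_charge(i_peak, t_begin, t_end) + short_circuit_charge


# Example with made-up numbers: a 1 mA peak over a 100 ps transition.
q = switching_charge(1e-3, 0.0, 100e-12)
```

In a fuller model, `i_peak` and the transition time would be read from the per-cell look-up table indexed by output load and input slope.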

Delay Model
Look-up table at nominal voltage, built by simulation
Delay = f(tin, Cout)
Out_slew = g(tin, Cout)
Delay/slew is linear in supply voltage; linear factor by simulation
In our work, we first model both nominal delay and transition time as functions of input slope and output capacitive load. A look-up table is built for each library cell using simulation. We then use a linear model to calculate the actual delay and slew as a function of voltage. Again, the linear factor comes from library cell simulation. In practice, we take the voltage drop as half of the worst case and apply the delay model to calculate the noise-aware delay.
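A minimal sketch of the noise-aware delay step follows. The slides state only that delay is linear in supply voltage with a simulated factor, so the exact form used here, a relative delay increase per volt of drop, is an assumption, as are the names `noise_aware_delay` and `sensitivity_per_volt`.

```python
def noise_aware_delay(nominal_delay, worst_case_drop, sensitivity_per_volt):
    """Scale a nominal cell delay for supply noise.

    nominal_delay:        delay from the look-up table at nominal voltage
    worst_case_drop:      worst-case voltage drop of the cell's grid (V)
    sensitivity_per_volt: assumed linear factor, relative delay increase
                          per volt of supply drop (from cell simulation)
    """
    applied_drop = 0.5 * worst_case_drop   # half of worst case, per the notes
    return nominal_delay * (1.0 + sensitivity_per_volt * applied_drop)


# Example with made-up numbers: 100 ps nominal, 0.1 V worst-case drop,
# delay grows by 200% per volt of drop.
delay_ps = noise_aware_delay(100.0, 0.1, 2.0)
```

The same scaling would apply to output slew, which then feeds the table look-up of the next cell on the path.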

Supply Noise Analysis Flow
[Flow: Start → Find Effective Region (skipped after the first pattern) → Load Vector → Logic Simulation → Switching charge assigned to grids → Get Voltage Drop → Calculate Delay → End]
Complexity: O(cell_count + grid_count^2)
Typically grid_count^2 < cell_count
Here is the complete supply noise analysis flow. For each circuit, we need to associate each grid with an effective region. This only needs to be done once per design, before the first test pattern is applied, and it can be skipped for the following patterns. For each test pattern, we do logic simulation, assign switching charge to the related grids, calculate the voltage drop for each grid, and then calculate the noise-aware delay. The complexity of this procedure is O(cell_count + grid_count^2). In practice, we keep grid_count below the square root of cell_count, since that is enough for accuracy, so the actual complexity is linear in cell_count, the same as logic simulation.

Experimental Design Experiments on NXP design
130nm DSP-like design (1M+ transistors)
LOC path delay patterns with “X” bits
Statically sensitized paths → ensures transitions propagate on the target path
Filling strategy: randomly set “X” bits to 1 with a specified rate
Generate filled patterns using various fill rates
We perform experiments on the same design that I introduced earlier. It is a 130nm DSP-like core with over 1M transistors. We use LOC path delay patterns with many don’t-care bits. The paths are statically sensitized, which ensures that transitions propagate on the target path. We then fill these patterns: each don’t-care bit is randomly set to 1 with a specified rate, and the remaining don’t-care bits are set to 0. We generate several batches of filled patterns using various fill rates.
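The filling strategy is simple enough to show directly. This sketch assumes patterns are strings over the characters 0, 1 and X, which is an illustrative format rather than the one used in the experiment; `fill_pattern` and `one_rate` are hypothetical names.

```python
import random


def fill_pattern(pattern, one_rate, rng):
    """Fill don't-care bits: each X becomes 1 with probability one_rate, else 0.

    pattern:  string of '0', '1' and 'X' (care bits are left untouched)
    one_rate: probability in [0, 1] that a don't-care bit is set to 1
    rng:      a random.Random instance, so that batches are reproducible
    """
    filled = []
    for bit in pattern:
        if bit in "Xx":
            filled.append("1" if rng.random() < one_rate else "0")
        else:
            filled.append(bit)
    return "".join(filled)


# One batch per fill rate, all derived from the same unfilled pattern.
rng = random.Random(2024)
batches = {rate: fill_pattern("1XX0XXXX", rate, rng) for rate in (0.0, 0.25, 0.5)}
```

Sweeping `one_rate` changes how many scan cells toggle at launch, which is what lets the experiment order patterns by expected supply noise.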

Experimental Measurements
Path delay by analysis is correlated with measurement We then apply our supply noise analysis approach to these test patterns and show correlation with tester measurement. The correlation here is 0.83, which is pretty good. We’ll discuss the offset in the next slide.

Experimental Measurements
Noisy patterns cause significant delay increase
Measured offset due to delay model characterization
[Figure: path delay per pattern, ordered by fill rate]
In this figure, the patterns are ordered by fill rate on the x axis. The bottom blue line is the nominal delay from the delay model, the yellow dots are our noise-aware analysis based on the delay model, and the top purple dots are tester measurements. The figure shows clearly that our noise analysis predicts a trend similar to the tester measurements. The offset comes from delay model characterization, since there is a large mismatch between nominal delay and tester measurement even when noise is small.

Supply Noise Future Work
Supply noise model refinement Off-chip dI/dt current Array-bond chips Ground bounce Better activity estimation Focus effort on noisy patterns Incremental estimation for ATPG Avoid logic simulation

Constant Power Dissipation
Constant power → linear temperature rise
Easy to characterize
Know temperature for each pattern
Adjust capture clock timing
Longer delay as temperature rises: 35-55% delay increase for 100C rise in 65 nm
Reorder patterns for constant power dissipation
Consider groups of 10 patterns
Takes 1-10 ms for ~1C rise
200-bit scan at 100 MHz → 2 μs/pattern
10 patterns = 20 μs << 1 ms
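The timing argument for using groups of 10 patterns can be written out as a short worked calculation, using only the numbers on the slide (the symbols $t_{pattern}$ and $t_{group}$ are introduced here for convenience):

```latex
t_{pattern} = \frac{200\ \text{bits}}{100\ \text{MHz}} = 2\ \mu\text{s}
\qquad
t_{group} = 10 \times t_{pattern} = 20\ \mu\text{s} \ll 1\ \text{ms}
```

A group therefore finishes at least 50 times faster than the 1-10 ms it takes the die to warm by about 1C, so only the average power of a group matters thermally, not the order of patterns inside it.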

Constant Power Flow Dynamic Compaction Issues Need fast power model Patterns not independent Power due to both scan-in and scan-out switching Mentor Preferred Fill Reduce capture power Adjacent Fill Reduce average power Reorder Patterns Minimize power variation

Power Modeling Prior work by Touba et al indicated WSA proportional to scan chain switching Improved using scan chain WSA Scan cell feeding more gates likely to cause more circuit switching Most circuit switching during scan happens in first few levels of logic Experiments showed almost no difference in pattern reordering results using model vs. simulation (exact) results

Power Model Results

Constant Power Algorithm
Compute shift power of each pattern;  /* power model */
Group patterns in order using specified group size;
Compute total power P[i] of each group i;
Compute average power ave of all groups;
while iteration count not exceeded, do
  for each group i, do
    if P[i] > (1+pvb)*ave  /* pvb = power variance bound = 0.05 here */
      Find pattern Pn with lowest power in group j with lowest total power
      Select pattern Pm with highest power in group i and swap with Pn
    else if P[i] < (1-pvb)*ave
      Find pattern Pn with highest power in group j with highest total power
      Select pattern Pm with lowest power in group i and swap with Pn
    else continue to next group;
    Re-compute shift power for Pn-1, Pn, Pm-1, Pm  /* power model */
    Re-compute total shift power for group i, j
    Update ave;
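The reordering loop can be sketched in Python under two simplifying assumptions that are not in the slides: each pattern's shift power is treated as independent of its neighbors (so the re-compute steps for Pn-1 and Pm-1 disappear), and a swap is applied only if it narrows the gap between the two groups, which guarantees termination. The function name `balance_groups` and its parameters are illustrative.

```python
def balance_groups(powers, group_size, pvb=0.05, max_iter=100):
    """Swap patterns between groups until group powers are within pvb of average.

    powers:     list of per-pattern shift power (assumed order-independent)
    group_size: number of consecutive patterns per group
    pvb:        power variance bound (0.05 on the slide)
    Returns a list of groups, each a list of pattern indices.
    """
    groups = [list(range(s, min(s + group_size, len(powers))))
              for s in range(0, len(powers), group_size)]
    if not groups:
        return groups

    def total(group):
        return sum(powers[p] for p in group)

    for _ in range(max_iter):
        ave = sum(total(g) for g in groups) / len(groups)
        swapped = False
        for i, group in enumerate(groups):
            p_i = total(group)
            if p_i > (1 + pvb) * ave:      # too hot: trade with the coolest group
                j = min(range(len(groups)), key=lambda x: total(groups[x]))
                high, low = group, groups[j]
            elif p_i < (1 - pvb) * ave:    # too cool: trade with the hottest group
                j = max(range(len(groups)), key=lambda x: total(groups[x]))
                high, low = groups[j], group
            else:
                continue
            if j == i:
                continue
            a = max(range(len(high)), key=lambda x: powers[high[x]])
            b = min(range(len(low)), key=lambda x: powers[low[x]])
            gain = powers[high[a]] - powers[low[b]]
            if 0 < gain < total(high) - total(low):   # added guard, see lead-in
                high[a], low[b] = low[b], high[a]
                swapped = True
        if not swapped:
            break
    return groups
```

With a neighbor-dependent power model, the swap would instead be followed by re-evaluating the affected patterns, as the pseudocode above does.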

s38417 Results

Constant Power Results
Fast - < 1 minute on ISCAS89 Std. Dev./Average drops by 2.5-6x ~3% on ISCAS89 circuits Remaining variation mostly due to high-power patterns Solution: veto high-power patterns during compaction

Conclusions Demonstrated KLPG on industrial designs
Modest test data volume increase Affordable ATPG time increase Demonstrated noise model on industrial design Demonstrated constant power reordering

Future Work Demonstrate on industrial data Fault Coverage Metric
Drop faults detected with high probability Exploit spatial and structural correlation Maximize coupling capacitance Use supply noise model in compaction and filling dI/dt model and multi-cycle launch

Acknowledgements Current Students Zheng Wang Zhongwei Jiang
Shiva Ganesan Former Students Jing Wang (AMD) Lei Wu (TI) Wangqi Qiu (Pextra) Colleagues at TI, NXP Sponsors – NSF, SRC

Needs SRC task 1618 liaison Design and test data

More Information http://faculty.cs.tamu.edu/walker

Questions?
