Mi Zhou, Li-Hong Shang Yu Hu, Jing Zhang


1 Fault Recovery of Reconfigurable FPGA Based on Tiling and Domain-partition
Mi Zhou, Li-Hong Shang, Yu Hu, Jing Zhang
Presented by Yen-Yi Hsu

2 Contents
1 Introduction
2 Domain Partition Model
3 Creating Alternative Configurations
4 Recovery Approach
5 Experiments Results
6 Conclusion

3 INTRODUCTION
Field Programmable Gate Arrays (FPGAs) have been widely used in embedded systems for their flexibility and functionality.
Unfortunately, current technology trends tend to make FPGAs less reliable.
The “Domain-partition model” can recover from both transient errors and permanent faults.
FPGAs have become widely used in embedded systems because of their flexibility and functionality. However, FPGA vendors are moving toward larger dies, with more CLBs and more interconnection links, which raises the probability of errors. To address the reliability problem, many fault-tolerance techniques have been applied to FPGA-based systems. This paper proposes a fault recovery method based on domain partition.

4 Related Work
Fault recovery in reconfigurable hardware can be broadly classified into two categories:
Run-time replacing and rerouting techniques that dynamically generate alternative configurations after error detection and defect location
Long latency for recovery (takes minutes to complete)
Increases the downtime and decreases the system availability
Approaches based on precompiled configurations created during the design phase
The related work divides fault recovery mechanisms for FPGA hardware into two kinds. The first detects the error at run time and then dynamically generates an alternative configuration by replacing or rerouting. The second is based on precompiled configurations that are fully prepared in the design phase. The drawback of the first kind is its long recovery time, which can reach several minutes; this increases system downtime and lowers system availability.

5 Related Work
Precompiled configuration
The entire design is partitioned into a set of tiles
Requires extra storage space
Can be classified into two categories:
Non-overlapping: each configuration is mapped into programmable resources distinct from those of the basis configuration
Overlapping
The precompiled approach partitions the whole design into a set of tiles. Because the configurations are prepared in the design phase, extra storage space is needed. Precompiled schemes are either non-overlapping or overlapping, depending on whether the configurations share any resources.

6 Related Work
Overlapping schemes
Each FPGA is partitioned into tiles
Each tile contains (m+k) columns of configurable resources
… configurations for a certain tile
Can tolerate more defects than non-overlapping schemes with the same spare resources
Domain Partition model: improvement of reliability
For overlapping schemes, one method in the references is the m+k approach. It needs m+k columns in total: k of the m+k columns are used for the implementation and the remaining m columns are spares. It can tolerate more faults than a non-overlapping scheme. For the area assignment of overlapping configurations, this paper adopts the DP model to improve reliability.

7 Extended Domain Partition Model
The key feature of our fault recovery is assigning the available area of the FPGA to all possible alternative configurations using a DP-model-based approach.
In the DP model, every configuration is considered a candidate to be mapped into any available resource.
In contrast, in a tiled system every configuration is limited to its corresponding tile.
The original DP model cannot directly formulate a tiled system, so the authors extend the model.

8 Extended Domain Partition Model
S: a tiled fault-tolerant design implemented on an FPGA
A: the set of all configurable resources of S
Each element of T is called a tile of S
S is good if and only if every element of T is good
Each element of Cn is called a configuration of Tn

9 Extended Domain Partition Model
[Figure: a tiled design with T = {T1, T2, T3 … T6}. Each tile has three configurations, labelled C11 to C16, C21 to C26 and C31 to C36; for example, C1 = {C11, C21, C31} is the configuration set of T1.]

10 Extended Domain Partition Model

11 Extended Domain Partition Model

12 Extended Domain Partition Model
[Figure: the same six-tile layout (C11 to C16, C21 to C26, C31 to C36) with a resource a marked. Annotations: dn(a) = (110), k = 110, Nn(k) = 2/3; In(k) is shown as a sum whose terms are not preserved.]
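The annotation dn(a) = (110) can be read as a membership vector: one flag per configuration of the tile, set to 1 when that configuration uses resource a. This reading is an assumption based on the figure labels, and the resource names and sets below are hypothetical. A minimal sketch:

```python
def domain_vector(resource, configurations):
    """Return one 0/1 flag per configuration: 1 if it uses the resource.

    Assumed reading of dn(a); the slides do not spell out the definition.
    """
    return tuple(1 if resource in cfg else 0 for cfg in configurations)


# Hypothetical resource sets for the three configurations of one tile.
c1 = {"a", "b"}
c2 = {"a", "c"}
c3 = {"d"}

print(domain_vector("a", [c1, c2, c3]))  # (1, 1, 0)
```

Resources that share the same vector form one domain, which is what lets the model reason about areas rather than individual resources.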

13 Extended Domain Partition Model

14 Extended Domain Partition Model
Partition the system symmetrically.
r denotes the redundancy of the design, and p denotes the number of configurations for a certain tile.

15 Creating Alternative Configurations
a. parameter assignment
r depends on the ratio of the size of the FPGA to the size of the original circuit
p can be determined according to the application's reliability requirement
Larger r and p give a better chance of tolerating defects
An improperly large p may increase the system downtime for repair
The model is then used to create the alternative configurations in four steps. The first step computes the required parameters p, q and r. Larger r and p give a better chance of tolerating faults, but an inappropriately large p has the opposite effect.

16 Creating Alternative Configurations
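b. resource partition
To maximize the reliability, every q configurations and every q-1 configurations need to overlap in a prescribed area (…)
Tn is divided into C(p,q) + C(p,q-1) disjoint subsets, each of which is a subarea of Tn
Subareas 1 … C(p,q): each has an area Nq·H(Tn)
Subareas C(p,q)+1 … C(p,q)+C(p,q-1): each has an area Nq-1·H(Tn)
The second step is the resource partition. Once r and p are known, Nq and Nq-1 can be computed. To maximize reliability, Tn is divided into C(p,q) + C(p,q-1) disjoint subsets, where C(p,q) is the binomial coefficient. Each subset is a subarea of Tn. The first C(p,q) subareas each have area Nq·H(Tn), and the remaining C(p,q-1) subareas each have area Nq-1·H(Tn).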
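The subarea count is a sum of two binomial coefficients, so it is easy to check by hand or in code. The sketch below only counts subareas; the function name is made up for illustration, and the values of p and q are the ones that reproduce the totals 6 and 10 quoted in the case study.

```python
from math import comb


def subarea_counts(p, q):
    """Return (number of q-way subareas, number of (q-1)-way subareas)."""
    return comb(p, q), comb(p, q - 1)


for p, q in [(3, 2), (4, 2)]:
    n_q, n_q1 = subarea_counts(p, q)
    print(f"p={p}, q={q}: {n_q} + {n_q1} = {n_q + n_q1} subareas")
```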

17 Creating Alternative Configurations
c. area assignment
A configuration management matrix M
d. configuration generation
The third step, area assignment, computes a configuration management matrix M. The last step uses M to generate the configurations.
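The slides do not show the layout of M. One plausible layout, assumed here purely for illustration, has one row per configuration and one column per subarea, with a 1 where the configuration is allowed to use that subarea. The sketch assigns each of the first C(p,q) subareas to a distinct group of q configurations and each remaining subarea to a distinct group of q-1 configurations; the function name and this assignment rule are assumptions, not the authors' MATLAB procedure.

```python
from itertools import combinations


def management_matrix(p, q):
    """Build an assumed p-by-(C(p,q)+C(p,q-1)) 0/1 assignment matrix.

    Column j belongs to the configurations listed in groups[j].
    """
    groups = list(combinations(range(p), q)) + list(combinations(range(p), q - 1))
    matrix = [[1 if cfg in group else 0 for group in groups] for cfg in range(p)]
    return matrix, groups


matrix, groups = management_matrix(3, 2)
for row in matrix:
    print(row)
```

With this layout every row has the same number of ones, which matches the idea of partitioning the tile symmetrically.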

18 Case Study
To avoid confusion, we did not partition the circuit into tiles in this section. That is to say, the entire circuit was treated as a single tile.
The following CAD tools were used:
XST for the synthesis
Xilinx PACE for creating area constraints
Xilinx ISE for the place and route

19 Case Study
a. parameter assignment
b. resources partition: q = 2, number of subareas = 3 + 3 = 6
The case study uses the benchmark circuit b12 as its example.

20 Case Study c. area assignment (MATLAB) d. configuration generation

21 Case Study
Some interconnection resources in the prohibited areas between allowed subareas may be used to route the design.
As a result, the ability to tolerate interconnection defects is lost.
The same problem is present in the column-based “m+k” overlapping scheme.

22 Case Study
r = 2, p = 4, q = 2; number of subareas = 6 + 4 = 10

23 Case Study
Although there are more configurations, the overlapping area between different configurations is much smaller.

24 Recovery Approach
Blind reconfiguration: one simple way to repair a defective tile is to try all possible configurations in turn until a configuration is loaded that successfully avoids the fault.
Several methods can be used to determine whether an attempt is successful.
Each candidate configuration has an equal probability of avoiding the fault.
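Blind reconfiguration amounts to a simple retry loop. The sketch below is a minimal illustration, assuming a caller-supplied `load_and_test` callback (hypothetical) that loads a configuration and reports whether the tile now works; how that check is done is left open, as on the slide.

```python
def blind_reconfigure(configurations, load_and_test, rng=None):
    """Try candidate configurations until one avoids the fault.

    Returns (working configuration or None, number of attempts made).
    If rng (e.g. a random.Random instance) is given, the order is shuffled.
    """
    order = list(configurations)
    if rng is not None:
        rng.shuffle(order)
    for attempts, cfg in enumerate(order, start=1):
        if load_and_test(cfg):  # hypothetical callback: True if tile is repaired
            return cfg, attempts
    return None, len(order)


# Hypothetical example: configurations as sets of resources, defect at "x".
configs = [{"x", "y"}, {"x", "z"}, {"y", "z"}]
working, tries = blind_reconfigure(configs, lambda cfg: "x" not in cfg)
print(working, tries)
```

Because no defect location is needed, the loop is cheap to implement, at the cost of possibly loading several configurations before one works.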

25 Reliability Computation

26 Recovery Attempts Considerations
Assume that there is a single defect in the current configuration and that there is no defect in the unused resources.
Quantities considered: the area of each configuration, and the overlapping area of two arbitrary configurations.

27 Recovery Attempts Considerations
Since the area of each configuration is 1/r, the probability that a single attempt cannot repair the tile is …
That is to say, the probability that a single attempt is needed to recover is …

28 Recovery Attempts Considerations
The overlapping area between 3 arbitrary configurations is …
The probability Pi that i attempts are needed to recover from an error is …

29 Recovery Attempts Considerations
The mean number of attempts to recover from a single defect can be formulated as the sum of i·Pi over i = 1, …, q.
In the worst case, q attempts can repair the tile.
When defective spare resources are considered, latent defects become possible. It is then difficult to derive an analytical expression for the mean number of attempts to recovery. However, it can be evaluated by simulation.
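A Monte Carlo estimate of the mean number of attempts can be sketched as follows for the single-defect case. The model is an assumption made for illustration: each subarea is shared by a group of q or q-1 configurations, the defect lands in the current configuration with probability proportional to subarea area, candidates are tried in random order, and q < p. The function name, the areas and the trial count are hypothetical.

```python
import random
from itertools import combinations


def simulate_mean_attempts(p, q, area_q, area_q1, trials=20000, seed=0):
    """Estimate mean and worst-case attempts for one defect (assumed model)."""
    rng = random.Random(seed)
    groups = list(combinations(range(p), q)) + list(combinations(range(p), q - 1))
    areas = [area_q if len(g) == q else area_q1 for g in groups]
    # By symmetry, take configuration 0 as the current (defective) one.
    own = [i for i, g in enumerate(groups) if 0 in g]
    weights = [areas[i] for i in own]
    total, worst = 0, 0
    for _ in range(trials):
        hit = rng.choices(own, weights=weights)[0]  # subarea holding the defect
        candidates = list(range(1, p))
        rng.shuffle(candidates)
        attempts = 0
        for cfg in candidates:
            attempts += 1
            if cfg not in groups[hit]:  # this configuration avoids the defect
                break
        total += attempts
        worst = max(worst, attempts)
    return total / trials, worst


mean, worst = simulate_mean_attempts(3, 2, 1 / 6, 1 / 6)
print(round(mean, 2), worst)
```

Under this model with p = 3, q = 2 and equal subarea areas, the exact mean is 4/3, and no run needs more than q attempts. Latent defects in spare resources could be added by sampling extra defective subareas before the retry loop.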

30 Experiments Results Proposed approach v.s. “m+k” approach
m and k are chosen so that (m+k)/k is approximately r.
p is set to the number of configurations needed to apply the “m+k” approach.

31 Experiments Results

32 Experiments Results Mean Time to Failure (MTTF)

33 Experiments Results Mean Defects to Failure (MDTF) is defined as the mean number of defects needed to fail the system

34 Conclusion
The proposed fault recovery technique helps to improve reliability.
Redundant resources can be fully utilized, which improves reliability.
It exceeds the previous work in MDTF.
Drawback: large storage overhead; DP results in less similarity among different configurations.

