1 Error-Tolerant Algorithms in Bioinformatics Wen-Lian Hsu Institute of Information Science Academia Sinica.


1 1 Error-Tolerant Algorithms in Bioinformatics Wen-Lian Hsu Institute of Information Science Academia Sinica

2 2/55 Discrete Algorithms ‧ Discrete mathematics lies at the foundation of modern computer science ‧ Most algorithms taught in computer science are discrete ‧ Discrete algorithms emphasize “worst case analysis” ‧ Many sequence manipulation algorithms in bioinformatics are discrete

3 Error-Tolerant Algorithms ‧ Many recognition problems in nature need algorithms that remove noise automatically to recover the correct information: –Optical character recognition (OCR) –Human face recognition –Voice recognition –Style checker

4 Design of Algorithms ‧ Optimization problems –can define “approximation algorithms” ‧ Decision problems (isomorphism, recognition, etc.) –one can consider the “least # of changes” needed to yield a “yes” answer, but this often makes the problem much harder –even if one can find such a solution, it might not make any practical sense –no easy way to measure the “deviation”

5 5/55 A New Paradigm Error-Tolerant Algorithms ‧ Real life data always contain some errors (say 5%) ‧ The Challenge: Discover the 95% “correct” information versus the 5% “incorrect” information automatically ‧ Robustness (difficult to define) ‧ Similar in nature to voice recognition and character recognition algorithms

6 6/55 Natural Problems (1) ‧ Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data are collected accurately. –But because of noise in the sampled data, such solutions are hard to come by. ‧ To tackle these problems one should focus on real data rather than worst case analysis.

7 7/55 Natural Problems (2) ‧ Techniques taking advantage of the natural constraints of these problems do not necessarily work for general data (especially the worst case), but could perform very well for those well-structured problems. Constraints → Structures → Knowledge

8 An Error-Tolerant Algorithm for the Consecutive Ones Property Wen-Lian Hsu Academia Sinica

9 Human Genome Project ‧ DNA sequencing (could be over 10 million bp): sequences of the 4 letters A, G, C, T ‧ Topics of the human genome project: –Cutting and reassembling DNA sequences –Sequence comparison –Gene finding –Transcription mechanism of DNA sequences –Prediction of the structure of proteins –Phylogenetic trees

10 Cutting and reassembling for DNA sequence ‧ Cut a DNA sequence into small pieces in different ways and reassemble them ‧ The “small” pieces (called clones) are still too large to sequence completely ‧ Biologically, use “probes” to mark the clones –each probe could mark several clones –each clone could contain several probes

11 Probe-Clone (0,1)-Matrix ‧ Each probe can be regarded as a column; each clone can be regarded as a row of probes ‧ If each probe hits the DNA sequence only once (unique probe) and there is no error in the probe-clone matrix, then one can use the consecutive ones test to order the clones
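As a small illustration of the consecutive ones test on an error-free probe-clone matrix, here is a minimal sketch. It is a brute-force check over all column orders, not the linear-time algorithms cited in this talk, and it is only practical for tiny matrices. The function names and the example matrices are made up for illustration.

```python
from itertools import permutations

def rows_consecutive(matrix, order):
    """True if, with columns arranged in `order`, the 1s of every row are contiguous."""
    for row in matrix:
        positions = [i for i, col in enumerate(order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True

def brute_force_c1p(matrix):
    """Return a column (probe) order with consecutive ones, or None if none exists.

    Exponential in the number of columns; for toy examples only.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_consecutive(matrix, order):
            return list(order)
    return None

# Hypothetical data: rows are clones, columns are probes.
clones = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Three clones that pairwise share exactly one probe cannot be ordered.
no_c1p = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
]

order = brute_force_c1p(clones)
print(order)
print(brute_force_c1p(no_c1p))
```

A single flipped entry can turn the first matrix into one with no valid order, which is why exact tests break down on lab data.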

12 Consecutive Ones Property (C1P) ‧ Booth & Lueker [1976] linear time, on-line –made use of a data structure called PQ-trees ‧ Hsu [1992] decomposition, off-line –did not use PQ-trees ‧ However, these algorithms do not work on data that contain errors

13 13/55 C1P Testing with Good Row Ordering

14 14/55 Exact Algorithm for Consecutive Ones Testing 1. Construct G’’, a spanning tree of G’ (the strictly overlapping graph). Each connected component corresponds to a prime submatrix. (matrix decomposition) 2. Decide the topological ordering of the prime matrices. 3. For each prime matrix, determine the ordering of columns, using the set partition strategy, according to the preorder traversal of the corresponding connected component of G’’. (good ordering)
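Step 1 above can be sketched as follows. The sketch assumes the usual definition of strict overlap (two rows share at least one column and neither row's column set contains the other's); the slide does not spell the definition out. It builds the graph naively in quadratic time and returns only the connected components, found by breadth-first search. The function names and the example matrix are illustrative.

```python
from collections import deque

def strictly_overlap(a, b):
    """a and b are sets of column indices. Assumed definition: they intersect
    and neither set contains the other."""
    return bool(a & b) and not a <= b and not b <= a

def overlap_components(matrix):
    """Connected components of the strictly overlapping graph on the rows.

    Each component lists row indices in BFS order; under the decomposition
    described above, each component would correspond to one prime submatrix.
    """
    rows = [{j for j, x in enumerate(r) if x == 1} for r in matrix]
    n = len(rows)
    adj = {
        i: [j for j in range(n) if j != i and strictly_overlap(rows[i], rows[j])]
        for i in range(n)
    }
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        seen.add(start)
        comp = []
        queue = deque([start])
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(comp)
    return components

# Hypothetical matrix: rows 0 and 1 strictly overlap; row 2 is disjoint from
# the rest; row 3 contains rows 0 and 1, so it does not strictly overlap them.
m = [
    [1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0],
]
components = overlap_components(m)
print(components)
```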

15 Problems in Lab Data ‧ False positives: a “1” should actually be a “0” ‧ False negatives: a “0” should actually be a “1” ‧ The probes are not necessarily unique –there are a lot of repeating subsequences in a DNA sequence ‧ Chimeric clones: two clones stick together at the end ‧ At STOC, Karp [1993] posed this as a problem that needs a major breakthrough in computational biology ‧ How to deal with it? -- neighborhood consensus
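The first two error types above are easy to simulate on a clean matrix. The sketch below flips each 0 to 1 with one probability and each 1 to 0 with another, independently per entry. This independent-flip noise model, the function name and the parameters are assumptions made for illustration; the talk does not state how its test matrices were generated. Non-unique probes and chimeric clones are not modeled here.

```python
import random

def inject_errors(matrix, fp_rate, fn_rate, seed=0):
    """Return a noisy copy of a (0,1)-matrix.

    fp_rate: probability that a 0 is reported as 1 (false positive).
    fn_rate: probability that a 1 is reported as 0 (false negative).
    Assumed model: every entry is flipped independently.
    """
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        new_row = []
        for x in row:
            if x == 0 and rng.random() < fp_rate:
                new_row.append(1)   # false positive
            elif x == 1 and rng.random() < fn_rate:
                new_row.append(0)   # false negative
            else:
                new_row.append(x)
        noisy.append(new_row)
    return noisy

clean = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1],
]
noisy = inject_errors(clean, fp_rate=0.05, fn_rate=0.05, seed=1)
for row in noisy:
    print(row)
```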

16 False Positives and False Negatives false positive false negative

17 17/55 Non-unique Probe

18 18/55 Non-unique Probe

19 19/55 Remote False Positives

20 20/55 Chimeric Clone

21 21 An Error-Tolerant Algorithm for the C1P test The idea is derived from the off-line C1P test based on Good row ordering

22 22/55 Strategy of Fault-tolerant Algorithm for Consecutive Ones Testing 1. Detect and correct the four types of errors to construct G’’. 2. Decide the topological ordering of the prime matrices. 3. Use a heuristic set partition strategy to determine the ordering of columns. There will be bad rows and lost columns, which indicate that the corresponding clones and probes are bad and that additional lab work is needed.

23 23/55 A Matrix Satisfying the C1P

24 24/55 A Matrix Mixed with All Four Types of Errors

25 25/55 Monotone Property in a Consecutive Ones Matrix

26 26/55 Processing row u (I) - Errorless case [Figure: row u with the row sets E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u); labels LL, RR]

27 27/55 Processing row u (II) -Errorless case ‧ At the end, row u is shrunk to 2 columns, representing the left and right parts ‧ At the end of the algorithm, we can rewind the rows to restore all the shrunk rows

28 28/55 u False Negatives of Row u

29 29/55 u False Negatives of C(u)

30 30/55 A General Error-Tolerant Algorithm for constructing G’’ (I) 1. Determine, for each probe, whether it is unique, and remove the remote false positives. 2. Determine, for each clone, whether it is chimeric, and remove the remote false positives. 3. Detect certain false negatives using a global technique. 4. Partition STA’(u) (= STA(u) – E(u)), C(u) and D(u) based on the containment relationship, and partition A(u) and B(u) from STA’(u).

31 31/55 A General Error-Tolerant Algorithm for constructing G’’ (II) 5. For each row u, detect the local false negatives and false positives. 6. Make u adjacent to every row in A(u) and B(u). 7. Delete row u, construct a special row [u] such that CL([u]) = {v1, v2}, and proceed to the next regular row.

32 32/55 Neighborhood Clustering [Figure: row u with the row sets E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u); uncertain entries marked “?”; labels LL, RR]

33 Non-Unique Probes

34 34/55 Chimeric Clones

35 35/55 Remote False Positive Remote False positive

36 36/55 ? ? ? False Negative (Global Method) Rows “close” to the above rows

37 37/55 u ? Avoid False Negatives of Row u Where would the false negatives go -to the left or right?

38 38/55 u ? ? Avoid False Negatives of C(u)

39 39/55 Monotone Property in a Consecutive Ones Matrix

40 40/55 Local False Positives and False Negatives false positive false negative

41 41/55 A Heuristic for Local False Positives and Negatives Fill-in Try the columns one by one to see which has the minimum fill-ins
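One way to read "try the columns one by one" is as a greedy step: append each remaining column to the current partial order, count how many 0s would have to be filled in to make every row's 1s consecutive, and keep the cheapest column. The sketch below follows that reading. It is an interpretation for illustration, not the procedure from the talk, and all names in it are hypothetical.

```python
def fill_ins(row_bits):
    """Number of 0s lying between the leftmost and rightmost 1 of a row."""
    ones = [i for i, x in enumerate(row_bits) if x == 1]
    if not ones:
        return 0
    return (ones[-1] - ones[0] + 1) - len(ones)

def total_fill_ins(matrix, order):
    """Total fill-ins over all rows when only the columns in `order` are shown."""
    return sum(fill_ins([row[c] for c in order]) for row in matrix)

def best_next_column(matrix, placed, candidates):
    """Try each candidate column to the right of `placed`; return the one that
    needs the fewest fill-ins (ties go to the earliest candidate)."""
    return min(candidates, key=lambda c: total_fill_ins(matrix, placed + [c]))

# Hypothetical matrix: rows are clones, columns are probes.
m = [
    [1, 0, 0, 1],
    [0, 1, 1, 0],
    [1, 1, 0, 0],
]
choice = best_next_column(m, placed=[0, 1], candidates=[2, 3])
print(choice)
print(total_fill_ins(m, [0, 1, 2, 3]), total_fill_ins(m, [3, 0, 1, 2]))
```

A greedy choice like this only sees the columns placed so far, so it can miss a better overall order; the fill-in count itself is a simple way to score how far a column order is from having consecutive ones.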

42 42/55 Ordering Probes False negatives False positives

43 43/55 Bad Row for Partition

44 44/55 Islands of probes Island 1 Island 2 Bad row

45 45/55 Order of Islands Island 1 Island 2 Island 3

46 Jump Column of Result Matrix 147235689

47 47/55 Simulation Results (I) 100x100(total 50matrices)

48 48/55 Simulation Results (II) 200x200(total 50matrices)

49 49/55 Simulation Results (III) 400x400(total 50matrices)

50 50/55 A 50x50 matrix with error rate 5% [Figure: 50x50 matrix printouts; entries are marked 1, N, P and F]

51 51/55 A 50x50 matrix with error rate 10% [Figure: 50x50 matrix printouts; entries are marked 1, N, P and F]

