1 Error-Tolerant Algorithms in Bioinformatics Wen-Lian Hsu Institute of Information Science Academia Sinica
2/55 Discrete Algorithms
‧ Discrete mathematics lies at the foundation of modern computer science
‧ Most algorithms we learn in computer science are discrete
‧ Discrete algorithms emphasize “worst-case analysis”
‧ Many sequence manipulation algorithms in bioinformatics are discrete
Error-Tolerant Algorithms
‧ Many recognition problems in nature need algorithms that automatically remove noise to recover the correct information:
–Optical character recognition (OCR)
–Human face recognition
–Voice recognition
–Style checker
Design of Algorithms
‧ Optimization problems
–one can define “approximation algorithms”
‧ Decision problems (isomorphism, recognition, etc.)
‧ One can consider the least number of changes needed to yield a “yes” answer, but this often makes the problem much harder
‧ Even if such a solution can be found, it might not make any practical sense
‧ There is no easy way to measure the “deviation”
5/55 A New Paradigm: Error-Tolerant Algorithms
‧ Real-life data always contain some errors (say 5%)
‧ The challenge: automatically separate the 95% “correct” information from the 5% “incorrect” information
‧ Robustness (difficult to define)
‧ Similar in nature to voice recognition and character recognition algorithms
6/55 Natural Problems (1)
‧ Natural problems: problems arising from nature, which are guaranteed to have feasible solutions if the data are collected accurately.
– But because of noise in the sampled data, such solutions are hard to come by.
‧ To tackle these problems, one should focus on real data rather than worst-case analysis.
7/55 Natural Problems (2)
‧ Techniques that take advantage of the natural constraints of these problems do not necessarily work for general data (especially the worst case), but they can perform very well on such well-structured problems.
[Figure labels: Constraints, Structures, Knowledge]
An Error-Tolerant Algorithm for the Consecutive Ones Property Wen-Lian Hsu Academia Sinica
Human Genome Project
‧ DNA sequencing (could be over 10 million bp): sequences over the 4 letters A, G, C, T
‧ Topics of the human genome project:
–Cutting and reassembling DNA sequences
–Sequence comparison
–Gene finding
–Transcription mechanism of DNA sequences
–Prediction of the structure of proteins
–Phylogenetic trees
Cutting and Reassembling DNA Sequences
‧ Cut a DNA sequence into small pieces in different ways and reassemble them
‧ The “small” pieces (called clones) are still too large to be sequenced completely
‧ Biologically, “probes” are used to mark the clones
–each probe could mark several clones
–each clone could contain several probes
Probe-Clone (0,1)-Matrix ‧ Each probe can be regarded as a column; each clone can be regarded as a row of probes ‧ If each probe hits the DNA sequence only once (unique probe) and there is no error in the probe-clone matrix, then one can use the consecutive ones test to order the clones
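To make the consecutive ones test concrete, here is a minimal brute-force sketch in Python. It is not the linear-time method of the algorithms cited in these slides; it simply tries every probe (column) order on a tiny error-free matrix. The function names `rows_consecutive`, `brute_force_c1p`, and `clone_order` are hypothetical, and ordering clones by their leftmost probe is an illustrative assumption.

```python
from itertools import permutations


def rows_consecutive(matrix, order):
    """Return True if, with columns arranged as `order`, the 1s of every row are contiguous."""
    for row in matrix:
        positions = [i for i, col in enumerate(order) if row[col] == 1]
        if positions and positions[-1] - positions[0] + 1 != len(positions):
            return False
    return True


def brute_force_c1p(matrix):
    """Return a column order witnessing the C1P, or None if no order exists.

    Exponential in the number of columns; for illustration on tiny inputs only.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for order in permutations(range(n_cols)):
        if rows_consecutive(matrix, order):
            return list(order)
    return None


def clone_order(matrix, order):
    """Order clones (rows) by the position of their leftmost probe under `order`.

    Assumption for illustration: sorting by leftmost probe is one simple way
    to read a clone order off a valid probe order.
    """
    pos = {col: i for i, col in enumerate(order)}

    def leftmost(r):
        hits = [pos[c] for c, v in enumerate(matrix[r]) if v == 1]
        return min(hits) if hits else len(order)

    return sorted(range(len(matrix)), key=leftmost)


# Rows are clones, columns are probes.
example = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
probe_order = brute_force_c1p(example)
```

For the example matrix a valid probe order exists, whereas a matrix whose three rows hit the probe pairs {0,1}, {1,2}, and {0,2} has no such order.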
Consecutive Ones Property (C1P) ‧ Booth & Lueker [1976] linear time, on-line –made use of a data structure called PQ-trees ‧ Hsu [1992] decomposition, off-line –did not use PQ-trees ‧ However, these algorithms do not work on data that contain errors
13/55 C1P Testing with Good Row Ordering
14/55 Exact Algorithm for Consecutive Ones Testing
1. Construct G’’, a spanning tree of G’ (the strictly overlapping graph). Each connected component corresponds to a prime submatrix. (matrix decomposition)
2. Decide the topological ordering of the prime matrices.
3. For each prime matrix, determine the ordering of its columns using the set partition strategy, following the preorder traversal of the corresponding connected component of G’’. (good ordering)
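Step 1 can be sketched as follows. The slides do not define "strictly overlapping", so this sketch assumes the usual meaning: two rows overlap strictly when their column sets intersect and neither contains the other. The spanning tree G’’ is represented only implicitly, through the depth-first preorder of each connected component; the names `strictly_overlap` and `overlap_components` are hypothetical.

```python
def strictly_overlap(a, b):
    """Assumed definition: the sets intersect and neither contains the other."""
    return bool(a & b) and not a <= b and not b <= a


def overlap_components(matrix):
    """Return the connected components of the strictly overlapping graph.

    Each component is a list of row indices in depth-first preorder, which
    stands in for a traversal of a spanning tree of that component.
    """
    rows = [frozenset(c for c, v in enumerate(r) if v == 1) for r in matrix]
    n = len(rows)
    adj = {
        u: [v for v in range(n) if v != u and strictly_overlap(rows[u], rows[v])]
        for u in range(n)
    }
    seen = set()
    components = []
    for start in range(n):
        if start in seen:
            continue
        preorder, stack = [], [start]
        while stack:
            u = stack.pop()
            if u in seen:
                continue
            seen.add(u)
            preorder.append(u)
            # Push neighbors in reverse so the smallest index is visited first.
            stack.extend(reversed(adj[u]))
        components.append(preorder)
    return components


example = [
    [1, 1, 0, 0, 0, 0],  # row 0 overlaps row 1
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 0],  # row 2 overlaps row 3
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],  # row 4 contains rows 0 and 1, so no strict overlap
]
components = overlap_components(example)
```

This quadratic pairwise comparison is only meant to show the decomposition idea; it makes no attempt at the efficiency of the exact algorithm.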
Problems in Lab Data
‧ False positives: a “1” that should actually be a “0”
‧ False negatives: a “0” that should actually be a “1”
‧ The probes are not necessarily unique
–there are many repeated subsequences in a DNA sequence
‧ Chimeric clones: two clones stuck together at the ends
‧ At STOC, Karp [1993] posed this as a problem needing a major breakthrough in computational biology
‧ How to deal with it? Neighborhood consensus
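The error types above can be made concrete with a small simulation sketch. It is not the procedure used for the experiments in these slides; the function names, the independent per-entry flip model, and the modeling of a chimeric clone as the entrywise OR of two clone rows are all assumptions for illustration.

```python
import random


def inject_errors(matrix, fp_rate, fn_rate, seed=0):
    """Return a noisy copy of a 0/1 matrix.

    Assumed error model: each 0 becomes 1 with probability `fp_rate`
    (false positive) and each 1 becomes 0 with probability `fn_rate`
    (false negative), independently per entry.
    """
    rng = random.Random(seed)
    noisy = []
    for row in matrix:
        new_row = []
        for v in row:
            if v == 0 and rng.random() < fp_rate:
                new_row.append(1)  # false positive
            elif v == 1 and rng.random() < fn_rate:
                new_row.append(0)  # false negative
            else:
                new_row.append(v)
        noisy.append(new_row)
    return noisy


def count_errors(clean, noisy):
    """Return (false positives, false negatives) of `noisy` relative to `clean`."""
    fp = fn = 0
    for clean_row, noisy_row in zip(clean, noisy):
        for c, n in zip(clean_row, noisy_row):
            if c == 0 and n == 1:
                fp += 1
            elif c == 1 and n == 0:
                fn += 1
    return fp, fn


def make_chimeric(row_a, row_b):
    """Assumed model of a chimeric clone: it contains the probes of both clones."""
    return [a | b for a, b in zip(row_a, row_b)]


clean = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
noisy = inject_errors(clean, fp_rate=0.05, fn_rate=0.05, seed=1)
```

Comparing `clean` with `noisy` through `count_errors` gives the ground truth against which an error-tolerant method could be scored.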
False Positives and False Negatives
[Figure labels: false positive, false negative]
17/55 Non-unique Probe
19/55 Remote False Positives
20/55 Chimeric Clone
21 An Error-Tolerant Algorithm for the C1P Test
‧ The idea is derived from the off-line C1P test based on good row ordering
22/55 Strategy of the Fault-Tolerant Algorithm for Consecutive Ones Testing
1. Detect and correct the four types of errors to construct G’’.
2. Decide the topological ordering of the prime matrices.
3. Use a heuristic set partition strategy to determine the ordering of columns.
Some bad rows and lost columns will remain; they indicate that the corresponding clones and probes are bad and that additional lab work is needed.
23/55 A Matrix Satisfying the C1P
24/55 A Matrix Mixed with All Four Types of Errors
25/55 Monotone Property in a Consecutive Ones Matrix
26/55 Processing row u (I) - Errorless case
[Figure labels: u, E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u), LL, RR]
27/55 Processing row u (II) -Errorless case ‧ At the end, row u is shrunk to 2 columns, representing the left and right parts ‧ At the end of the algorithm, we can rewind the rows to restore all the shrunk rows
28/55 False Negatives of Row u
29/55 False Negatives of C(u)
30/55 A General Error-Tolerant Algorithm for Constructing G’’ (I)
1. Determine, for each probe, whether it is unique, and remove the remote false positives.
2. Determine, for each clone, whether it is chimeric, and remove the remote false positives.
3. Detect certain false negatives using a global technique.
4. Partition STA’(u) (STA(u) – E(u)), C(u) and D(u) based on the containment relationship, and partition A(u) and B(u) from STA’(u).
31/55 A General Error-Tolerant Algorithm for Constructing G’’ (II)
5. For each row u, detect the local false negatives and false positives.
6. Make u adjacent to every row in A(u) and B(u).
7. Delete row u, construct a special row [u] such that CL([u]) = {v1, v2}, and proceed to the next regular row.
32/55 Neighborhood Clustering
[Figure labels: u, E(u), A(u), B(u), C(u), D(u), STA(u), STA’(u), LL, RR; uncertain entries marked “?”]
Non-Unique Probes
34/55 Chimeric Clones
35/55 Remote False Positive
36/55 False Negative (Global Method)
[Figure labels: rows “close” to the above rows; uncertain entries marked “?”]
37/55 Avoid False Negatives of Row u
‧ Where would the false negatives go: to the left or to the right?
38/55 Avoid False Negatives of C(u)
39/55 Monotone Property in a Consecutive Ones Matrix
40/55 Local False Positives and False Negatives
[Figure labels: false positive, false negative]
41/55 A Heuristic for Local False Positives and Negatives
‧ Fill-in: try the columns one by one to see which has the minimum number of fill-ins
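The slide does not spell out the heuristic, so the sketch below is only one possible reading. It assumes a fill-in is a 0 lying between the first and last 1 of a row under a given column order, and it tries a new column at each position of the current order, keeping the position with the fewest total fill-ins. The names `fill_ins`, `total_fill_ins`, and `best_insertion` are hypothetical.

```python
def fill_ins(row, order):
    """Count the 0s between the first and last 1 of `row` under `order`.

    Assumed meaning of "fill-in": an entry that would have to be changed to 1
    to make the row's 1s consecutive. Only columns listed in `order` count.
    """
    ones = [i for i, col in enumerate(order) if row[col] == 1]
    if not ones:
        return 0
    return (ones[-1] - ones[0] + 1) - len(ones)


def total_fill_ins(matrix, order):
    """Sum the fill-ins over all rows for the given column order."""
    return sum(fill_ins(row, order) for row in matrix)


def best_insertion(matrix, order, new_col):
    """Try `new_col` at every position of `order`; keep the cheapest placement.

    Returns (new order, total fill-ins). Ties go to the leftmost position.
    """
    candidates = [
        order[:i] + [new_col] + order[i:] for i in range(len(order) + 1)
    ]
    best = min(candidates, key=lambda cand: total_fill_ins(matrix, cand))
    return best, total_fill_ins(matrix, best)


matrix = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 1],
]
# Columns 0, 2, 3 are already ordered; find the cheapest place for column 1.
order, cost = best_insertion(matrix, [0, 2, 3], 1)
```

A greedy placement like this is not guaranteed to find the order with the globally fewest fill-ins, which is consistent with the slide calling it a heuristic.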
42/55 Ordering Probes
[Figure labels: false negatives, false positives]
43/55 Bad Row for Partition
44/55 Islands of Probes
[Figure labels: Island 1, Island 2, bad row]
45/55 Order of Islands
[Figure labels: Island 1, Island 2, Island 3]
Jump Column of Result Matrix
47/55 Simulation Results (I): 100x100 (50 matrices in total)
48/55 Simulation Results (II): 200x200 (50 matrices in total)
49/55 Simulation Results (III): 400x400 (50 matrices in total)
50/55 A 50x50 matrix with error rate 5%
[Figure: result matrix with entries marked 1, N, P, and F]
51/55 A 50x50 matrix with error rate 10%
[Figure: result matrix with entries marked 1, N, P, and F]