Efficient Algorithms for Locating the Length-Constrained Heaviest Segments, with Applications to Biomolecular Sequence Analysis
Yaw-Ling Lin, Dept. of CS & Information Management, Providence University, Taiwan
Tao Jiang, Dept. of CS & Engineering, UC Riverside, USA
Kun-Mao Chao, Dept. of CS & Information Engineering, National Taiwan University, Taiwan
Yaw-Ling Lin, Providence, Taiwan
Outline
–Introduction
–Applications to Biomolecular Sequence Analysis
–Maximum Sum Consecutive Subsequence
–Maximum Average Consecutive Subsequence
–Implementation and Preliminary Experiments
–Concluding Remarks
Introduction
Two fundamental algorithms in searching for interesting regions in sequences:
–Given a sequence of real numbers of length n and an upper bound U, find a consecutive subsequence of length at most U with the maximum sum: an O(n)-time algorithm.
–Given a sequence of real numbers of length n and a lower bound L, find a consecutive subsequence of length at least L with the maximum average: an O(n log L)-time algorithm.
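To pin down the two problem statements, the following brute-force Python sketch enumerates every admissible segment. It is only a reference for checking faster methods, not the O(n) or O(n log L) algorithms of the talk. The function names and the half-open (start, stop) index convention are assumptions made for illustration.

```python
from itertools import accumulate


def max_sum_at_most(a, U):
    """Best (sum, start, stop) over nonempty a[start:stop] with stop - start <= U."""
    prefix = [0, *accumulate(a)]  # prefix[k] = a[0] + ... + a[k-1]
    best = None
    for i in range(len(a)):
        for j in range(i + 1, min(i + U, len(a)) + 1):
            s = prefix[j] - prefix[i]
            if best is None or s > best[0]:
                best = (s, i, j)
    return best


def max_avg_at_least(a, L):
    """Best (average, start, stop) over a[start:stop] with stop - start >= L."""
    prefix = [0, *accumulate(a)]
    best = None
    for i in range(len(a) - L + 1):
        for j in range(i + L, len(a) + 1):
            avg = (prefix[j] - prefix[i]) / (j - i)
            if best is None or avg > best[0]:
                best = (avg, i, j)
    return best
```

On the illustrative input `[2, -3, 4, -1, 2, -6]`, the heaviest segment of length at most 3 and the densest segment of length at least 2 are both `[4, -1, 2]`.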
Applications to Biomolecular Sequence Analysis (I)
Locating GC-Rich Regions
–Finding GC-rich regions: an important problem in gene recognition and comparative genomics.
–CpG islands (200 to 1400 bp).
–[Huang’94]: O(nL)-time algorithm.
Post-Processing Sequence Alignments
–Comparative analysis of human and mouse DNA: useful for gene prediction in the human genome.
–Mosaic effect: an alignment can contain a poorly conserved inner region.
–Normalized local alignment.
–Post-processing locally aligned subsequences.
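A simple way to reach an O(nL) bound like the one cited above is to note that some maximum-average segment of length at least L has length at most 2L-1: any longer segment splits into two pieces of length at least L, and one of them has an average no smaller than the whole. The sketch below scans only those lengths over a 0/1 indicator of G and C bases. Whether this is exactly the procedure of [Huang’94] is an assumption, and the function names are hypothetical.

```python
from itertools import accumulate


def gc_indicator(dna):
    """Map a DNA string to 1 for G/C and 0 for every other character."""
    return [1 if base in "GCgc" else 0 for base in dna]


def max_avg_at_least_naive(a, L):
    """O(n*L) scan: only lengths L .. 2L-1 need to be examined.

    Returns (average, start, stop) for the half-open slice a[start:stop].
    """
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0, *accumulate(a)]
    best = None
    for i in range(n - L + 1):
        for j in range(i + L, min(i + 2 * L - 1, n) + 1):
            avg = (prefix[j] - prefix[i]) / (j - i)
            if best is None or avg > best[0]:
                best = (avg, i, j)
    return best
```

For example, `max_avg_at_least_naive(gc_indicator("ATGCGCAT"), 4)` picks out the window `GCGC`.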
Applications to Biomolecular Sequence Analysis (II)
Annotating Multiple Sequence Alignments
–[Stojanovic’99]: conserved regions in biomolecular sequences.
–Numerical scores for columns of a multiple alignment; each column score is adjusted by subtracting an anchor value.
Ungapped Local Alignments with Length Constraints
–Computing the length-constrained segment of each diagonal in the matrix with the largest sum (or average) of scores.
–Applications in motif identification.
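Subtracting an anchor value turns "columns scoring above the anchor on average" into "segments with positive sum", so a maximum-sum scan can then pick out a conserved region. The sketch below illustrates this with the classic unconstrained maximum-sum scan (Kadane's algorithm), which is a stand-in here and not the annotation method of [Stojanovic’99]. The scores and the anchor are made-up values, and the function names are hypothetical.

```python
def adjust(scores, anchor):
    """Shift every column score down by the anchor value."""
    return [s - anchor for s in scores]


def heaviest_segment(values):
    """Kadane's scan: (best_sum, start, stop) over nonempty values[start:stop]."""
    best = (values[0], 0, 1)
    run_sum, run_start = 0, 0
    for k, v in enumerate(values):
        if run_sum <= 0:
            # A non-positive running sum never helps; restart at position k.
            run_sum, run_start = v, k
        else:
            run_sum += v
        if run_sum > best[0]:
            best = (run_sum, run_start, k + 1)
    return best


column_scores = [2, 9, 8, 1, 7]  # hypothetical per-column scores
adjusted = adjust(column_scores, anchor=5)
region = heaviest_segment(adjusted)
```

Here columns 1 and 2 (scores 9 and 8) form the region whose adjusted sum is largest.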
Maximum Sum Consecutive Subsequence
[sequence not recovered] is left-negative; [sequence not recovered] is not.
[sequence not recovered] is minimal left-negative partitioned.
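The example sequences on this slide did not survive, so the following sketch spells out the definitions it assumes: a sequence is left-negative when every proper prefix has sum at most 0, and a partition is minimal left-negative when every part is left-negative and every part except possibly the last has positive sum. These definitions are an assumption based on the published version of this work, and the sequences used to exercise them are illustrative.

```python
from itertools import accumulate


def is_left_negative(seq):
    """True if every proper prefix of seq has sum <= 0."""
    prefix_sums = list(accumulate(seq))
    return all(s <= 0 for s in prefix_sums[:-1])


def is_mln_partition(parts):
    """True if parts (a list of nonempty lists) is minimal left-negative."""
    if not parts or any(len(part) == 0 for part in parts):
        return False
    every_part_left_negative = all(is_left_negative(part) for part in parts)
    all_but_last_positive = all(sum(part) > 0 for part in parts[:-1])
    return every_part_left_negative and all_but_last_positive
```

Under these definitions, `[-4, 1, -2, 3]` is left-negative, `[5, -3, 4, -1, 2, -6]` is not, and `[5] [-3, 4] [-1, 2] [-6]` is a minimal left-negative partition of the latter.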
Minimal left-negative partition
MLN-partition: linear time
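One way to obtain the partition in linear time is a right-to-left scan that keeps, for each position, a pointer to the end of its segment and the segment's sum, merging with the following segment while the sum is not positive. The sketch below assumes the definition "every proper prefix of a segment has sum at most 0, and every segment but the last has positive sum"; the pseudocode on the slide was not recoverable, so this is a reconstruction under that assumption rather than the authors' exact procedure.

```python
def mln_pointers(a):
    """For each i, p[i] is the last index of the first segment in the
    minimal left-negative partition of a[i:], and w[i] is that segment's sum."""
    n = len(a)
    p = list(range(n))
    w = list(a)
    for i in range(n - 1, -1, -1):
        # Absorb whole following segments while the current sum is <= 0.
        while p[i] < n - 1 and w[i] <= 0:
            nxt = p[i] + 1
            w[i] += w[nxt]
            p[i] = p[nxt]
    return p, w


def mln_partition(a):
    """Split a into its minimal left-negative segments."""
    p, _ = mln_pointers(a)
    parts, i = [], 0
    while i < len(a):
        parts.append(a[i:p[i] + 1])
        i = p[i] + 1
    return parts
```

Each merge removes one segment from the partition of the current suffix and each new position adds one, so the total number of merges is at most n.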
Max-Sum with LC
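The slide's algorithm was an image and is not reproduced here. As a point of comparison, the same O(n) bound for the length-at-most-U problem can be reached with a different, widely used technique: a sliding-window minimum over prefix sums kept in a monotone deque. The sketch below uses that technique, not the MLN-based method of the talk, and its names are hypothetical.

```python
from collections import deque
from itertools import accumulate


def max_sum_at_most(a, U):
    """Return (best_sum, start, stop) over nonempty a[start:stop] with
    stop - start <= U, in O(n) time."""
    if not a or U < 1:
        raise ValueError("need a nonempty sequence and U >= 1")
    prefix = [0, *accumulate(a)]  # prefix[k] = a[0] + ... + a[k-1]
    window = deque()              # candidate starts, prefix values increasing
    best = None
    for j in range(1, len(prefix)):
        start = j - 1
        # Drop starts that can never beat the new one.
        while window and prefix[window[-1]] >= prefix[start]:
            window.pop()
        window.append(start)
        # Drop starts that would make the segment longer than U.
        while window[0] < j - U:
            window.popleft()
        s = prefix[j] - prefix[window[0]]
        if best is None or s > best[0]:
            best = (s, window[0], j)
    return best
```

Every index enters and leaves the deque at most once, which gives the linear running time.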
Analysis of MSLC
Max Average Subsequence
[sequence not recovered] is right-skew; [sequence not recovered] is not.
[sequence not recovered] is decreasing right-skew partitioned.
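The example sequences on this slide did not survive, so the following sketch states the definitions it assumes: a sequence is right-skew when the average of every proper prefix is at most the average of the remaining suffix, and a partition is decreasing right-skew when every part is right-skew and the part averages strictly decrease from left to right. These definitions are an assumption based on the published version of this work. Averages are compared by cross-multiplication to avoid rounding.

```python
def is_right_skew(seq):
    """True if avg(prefix) <= avg(rest) for every proper prefix of seq."""
    n, total, run = len(seq), sum(seq), 0
    for k in range(1, n):
        run += seq[k - 1]
        # avg(seq[:k]) <= avg(seq[k:])  <=>  run * (n - k) <= (total - run) * k
        if run * (n - k) > (total - run) * k:
            return False
    return True


def is_drs_partition(parts):
    """True if parts (a list of nonempty lists) is decreasing right-skew."""
    if not parts or any(len(part) == 0 for part in parts):
        return False
    if not all(is_right_skew(part) for part in parts):
        return False
    # avg(x) > avg(y) for each pair of neighbouring parts x, y.
    return all(sum(x) * len(y) > sum(y) * len(x)
               for x, y in zip(parts, parts[1:]))
```

Under these definitions, `[1, 4, 1, 5]` is right-skew, `[3, 1]` is not, and `[3] [1, 4, 1, 5] [2]` is a decreasing right-skew partition with averages 3, 2.75 and 2.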
Decreasing right-skew partition
DRS-partition: linear time
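A linear-time construction can follow the same right-to-left pattern as for sums: each position starts as its own segment and absorbs the following segment while its average is no larger than that segment's average. The sketch below assumes the definition "each part is right-skew and part averages strictly decrease"; the pseudocode on the slide was not recoverable, so this is a reconstruction under that assumption, with hypothetical names.

```python
def right_skew_pointers(a):
    """p[i] = last index of the first segment in the decreasing right-skew
    partition of the suffix a[i:]."""
    n = len(a)
    p = list(range(n))  # end of the segment starting at i
    w = list(a)         # sum of that segment
    d = [1] * n         # length of that segment
    for i in range(n - 1, -1, -1):
        while p[i] < n - 1:
            nxt = p[i] + 1
            # Stop once avg(current) > avg(next), compared without division.
            if w[i] * d[nxt] > w[nxt] * d[i]:
                break
            w[i] += w[nxt]
            d[i] += d[nxt]
            p[i] = p[nxt]
    return p


def drs_partition(a):
    """Split a into its decreasing right-skew segments."""
    p = right_skew_pointers(a)
    parts, i = [], 0
    while i < len(a):
        parts.append(a[i:p[i] + 1])
        i = p[i] + 1
    return parts
```

As with the sum-based partition, every merge consumes one existing segment, so the scan performs at most n merges in total.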
Max-Avg-Seq with LC
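The idea of combining a mandatory window of length L with right-skew segments can be sketched as follows: for each start, take the first L elements and keep appending whole right-skew segments of the remaining suffix while the next segment's average exceeds the current average. This sketch is an assumption about how the pieces fit together, since the slide's pseudocode was not recoverable. It also replaces the good-partner search with a plain linear scan, so it illustrates correctness of the extension rule only and does not achieve the O(n log L) bound.

```python
from itertools import accumulate


def right_skew_pointers(a):
    """p[i] = end of the first decreasing right-skew segment of a[i:]."""
    n = len(a)
    p, w, d = list(range(n)), list(a), [1] * n
    for i in range(n - 1, -1, -1):
        while p[i] < n - 1:
            nxt = p[i] + 1
            if w[i] * d[nxt] > w[nxt] * d[i]:  # avg(current) > avg(next)
                break
            w[i] += w[nxt]
            d[i] += d[nxt]
            p[i] = p[nxt]
    return p


def max_avg_at_least(a, L):
    """Return (average, start, stop) over a[start:stop] with length >= L."""
    n = len(a)
    if not 1 <= L <= n:
        raise ValueError("need 1 <= L <= len(a)")
    prefix = [0, *accumulate(a)]
    p = right_skew_pointers(a)
    best = None
    for i in range(n - L + 1):
        end = i + L - 1  # inclusive end of the current segment
        while end < n - 1:
            nxt_end = p[end + 1]
            cur_sum, cur_len = prefix[end + 1] - prefix[i], end - i + 1
            seg_sum = prefix[nxt_end + 1] - prefix[end + 1]
            seg_len = nxt_end - end
            # Stop when avg(current) >= avg(next right-skew segment).
            if cur_sum * seg_len >= seg_sum * cur_len:
                break
            end = nxt_end
        avg = (prefix[end + 1] - prefix[i]) / (end - i + 1)
        if best is None or avg > best[0]:
            best = (avg, i, end + 1)
    return best
```

Stopping at the first segment that does not raise the average is safe because segment averages only decrease further to the right.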
Locate good-partner
Analysis of MaxAvgSeq
Implementation and Preliminary Experiments
Conclusion
–Finding a max-sum subsequence of length at most U can be done in O(n) time.
–Finding a max-avg subsequence of length at least L can be done in O(n log L) time.
Recent Progress
–Lu (CMCT’2002): finding the max-avg subsequence of length at least L on binary (0,1) sequences in O(n) time.
–Goldwasser, Kao, Lu (2002, manuscript): finding the max-avg subsequence of length at least L and at most U on real sequences in O(n) time.
–Tools: finding CpG islands using MAVG (joint work with Huang, X., Jiang, T. and Chao, K.-M.).
Future Research
–Best k (nonintersecting) subsequences?
–Normalized local alignment?
–Measurement of goodness?