1
An analysis of “Using sequence compression to speed up probabilistic profile matching” by Valerio Freschi and Alessandro Bogliolo
Cory Tobin
2
Probabilistic Profiles
● Ex: Hydrophobicity sliding window program
● Scores for all characters within the window are added together and assigned to the character in question
● Characters in the window further away from the character in question are weighted less
Ex: A T T C G G C T C
(figure values: 5 2.5 2 1)
3
Overview
● Time complexity of the brute-force method is O(NP)
● N = length of sequence, P = length of profile
● The authors look for a more efficient way to score sequences with probabilistic profiles
● They developed algorithms that work directly on compressed sequences
● The algorithms use run-length encoding and LZ78 compression
● Decompressing sequences prior to scoring is not necessary
● The algorithms are tested on real sequences
4
Run-Length Encoding
● Lossless compression method
● Sequential repeats are saved as a single character and an integer representing the number of repeats
● Only works well when there are lots of repetitive characters
● Better compression ratio with nucleotides than with amino acids
Ex:
A T T T G C G C A A A A A T A T T C T C T C T G T G G A A A A A A C G
A (T,3) G C G C (A,5) T A (T,2) C T C T C T G T (G,2) (A,6) C G
5
LZ78 Compression
● Lossless compression method
● Lempel-Ziv, 1978
● Stores the data in a tree structure
● Uses repeated patterns rather than sequentially repeated characters
● Better compression ratio than run-length
● Compression algorithm is more complex than run-length
Ex: A T A A T C A A C T C G
(figure: LZ78 tree, labels AT AC C G)
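The LZ78 parsing step can be sketched as follows. This is the textbook dictionary-building scheme, written here with a flat dictionary instead of an explicit tree; the function name `lz78_parse` and its return format are assumptions made for illustration, not taken from the paper.

```python
def lz78_parse(seq):
    """Split seq into LZ78 phrases.

    Each new phrase is the shortest prefix of the remaining input that is
    not yet in the dictionary. Returns the phrases and the usual
    (index of longest known prefix, next character) tokens.
    """
    dictionary = {"": 0}  # phrase -> index; index 0 is the empty root
    phrases = []
    tokens = []
    current = ""
    for char in seq:
        if current + char in dictionary:
            current += char  # keep extending a phrase we already know
        else:
            tokens.append((dictionary[current], char))
            dictionary[current + char] = len(dictionary)
            phrases.append(current + char)
            current = ""
    if current:
        # Input ended inside a known phrase: emit it with no new character.
        tokens.append((dictionary[current], ""))
        phrases.append(current)
    return phrases, tokens


phrases, tokens = lz78_parse("ATAATCAACTCG")
print(phrases)
print(tokens)
```

For the slide's example sequence, the 12 characters are parsed into 6 phrases, each of which extends an earlier phrase by one character. That parent-plus-character relationship is what the tree structure stores.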
6
Brute-Force Scoring Algorithm
Ex: A T A A T C A A C T C G
36 steps
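A brute-force scorer can be sketched as a double loop over sequence positions and profile positions, which is where the O(NP) cost comes from. The sketch below assumes an additive score and a profile stored as one character-to-score table per position; the profile values are made up for illustration, and the paper may combine the per-position values differently.

```python
def brute_force_scores(seq, profile):
    """Score every full window of seq against the profile.

    profile[j] maps a character to its score at profile position j;
    characters missing from the table score 0.
    """
    p = len(profile)
    scores = []
    for start in range(len(seq) - p + 1):  # one window per start position
        total = 0.0
        for j in range(p):  # one lookup per profile position
            total += profile[j].get(seq[start + j], 0.0)
        scores.append(total)
    return scores


# Hypothetical 3-position profile; the numbers are illustrative only.
profile = [
    {"A": 1.0, "T": 0.0},
    {"A": 0.0, "T": 1.0},
    {"A": 1.0, "C": 0.5},
]
scores = brute_force_scores("ATAATCAACTCG", profile)
print(scores)
```

This version scores only windows that fit entirely inside the sequence, so it may count slightly fewer steps than the figure on the slide, which appears to visit every position.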
7
Run-Length Scoring Algorithm
Ex: A T A A T C A A C T C G
30 steps
8
LZ78 Scoring Algorithm
Ex: A T A A T C A A C T C G
21 steps
9
Complexities
Brute force: O(NP)
Run-length: O((N / l_avg) P)
LZ78: O((N / log N) P)
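The brute-force and run-length bounds can be compared on a concrete sequence with a rough step estimate. The sketch below assumes that l_avg is the average run length, so that N / l_avg is simply the number of runs, and it assumes a profile length of P = 3; the slides state neither. The function names are illustrative.

```python
from itertools import groupby


def count_runs(seq):
    """Number of maximal runs of identical characters in seq."""
    return sum(1 for _ in groupby(seq))


def estimated_steps(seq, profile_length):
    """Rough step counts implied by the O(NP) and O((N / l_avg) P) bounds."""
    n = len(seq)
    runs = count_runs(seq)
    return {
        "brute_force": n * profile_length,
        "run_length": runs * profile_length,  # N / l_avg equals the run count
        "avg_run_length": n / runs,
    }


# P = 3 is an assumption; it is not given on the slides.
estimate = estimated_steps("ATAATCAACTCG", 3)
print(estimate)
```

With these assumptions the example sequence has 12 characters in 10 runs, giving 36 and 30 steps, which agrees with the step counts shown on the brute-force and run-length scoring slides. The LZ78 count is not estimated here because it depends on details of the tree traversal that the slides do not spell out.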
10
Implications of Complexities
● The complexities depend on the compression ratios of the sequences
● If the compression ratio is 1:1, there is no reward for using the non-brute-force algorithms
● For sequences of equal length, a higher compression ratio gives the compressed-sequence algorithms a lower running time
11
64 Dollar Question
How do these algorithms stack up against real sequences?
12
Methods
● Randomly pick human DNA and protein sequences of varying lengths
● Calculate the compression ratio for run-length and LZ78 (brute force works on the uncompressed sequence)
● Run the algorithms on those sequences
Compression ratio: characters in original sequence vs. characters in compressed sequence
13
Results
● Run-length did not provide much advantage over brute force
● LZ78 provided a great advantage over both brute force and run-length
● Longer sequences yield better LZ78 performance compared to brute force
● Both run-length and LZ78 have lower complexities, and therefore better performance, on DNA sequences than on protein sequences
14
Pros and Cons
● Less time is needed to perform probabilistic profile matching
● Databases such as GenBank do not store their sequences in LZ78 or run-length format
● One would need to retrieve the sequence, compress it, then run the algorithm
● This is probably worse than just using brute force on an uncompressed sequence
15
End