Published by Hester Walters. Modified over 9 years ago.
1
Genome-Scale Disk-Based Suffix Tree Indexing
  Phoophakdee and Zaki
2
Outline
  Suffix tree introduction
  Application in bioinformatics
  Trellis
  Trellis performance
  Conclusion
3
Example Suffix Tree
  Sequence: ACGACG$
  What are suffix links?
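The tree drawing for this slide did not survive extraction, so the following is a minimal brute-force sketch of what it would contain for ACGACG$. It lists the suffixes (one leaf each), finds the labels of the internal nodes (substrings followed by at least two different characters), and derives suffix links using the standard definition: the node for a string xα links to the node for α, where x is a single character. The function names are illustrative and are not taken from the presentation, and the enumeration is cubic, so it is only suitable for toy inputs.

```python
from collections import defaultdict

def suffixes(text):
    """Return every suffix of text, longest first (one leaf per suffix)."""
    return [text[i:] for i in range(len(text))]

def internal_node_labels(text):
    """Brute force: substrings followed by >= 2 distinct characters.

    These are the path labels of the branching (internal) nodes,
    excluding the root. Toy inputs only.
    """
    next_chars = defaultdict(set)
    n = len(text)
    for i in range(n):
        for j in range(i + 1, n):
            # text[i:j] is a non-empty substring; text[j] is what follows it.
            next_chars[text[i:j]].add(text[j])
    labels = [s for s, chars in next_chars.items() if len(chars) >= 2]
    return sorted(labels, key=len, reverse=True)

def suffix_links(text):
    """Map each internal node label to its label minus the first character.

    The empty string stands for the root.
    """
    return {label: label[1:] for label in internal_node_labels(text)}

if __name__ == "__main__":
    for i, s in enumerate(suffixes("ACGACG$")):
        print(i, s)
    print(suffix_links("ACGACG$"))
```

For ACGACG$ the internal nodes are ACG, CG and G, and the links chain ACG to CG to G to the root.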
4
Suffix Tree Runtime
  Time complexity
    Construction of suffix tree: O(n) time and space, where n is the size of the text being searched
    Substring search: O(m) time, where m is the size of the substring/search pattern
  Comparison with the Knuth-Morris-Pratt and Boyer-Moore algorithms
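To illustrate why a lookup costs O(m), here is a small sketch that uses an uncompressed suffix trie built naively in O(n^2) time and space. It is deliberately not a linear-time suffix tree construction; it only shows that a search follows one edge per pattern character, independent of the text length. The names build_suffix_trie and contains are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a nested-dict trie (naive, O(n^2))."""
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    """Walk one edge per pattern character: O(m) steps for a length-m pattern."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

if __name__ == "__main__":
    trie = build_suffix_trie("ACGACG$")
    print(contains(trie, "GAC"))
    print(contains(trie, "GG"))
```

By contrast, Knuth-Morris-Pratt and Boyer-Moore preprocess the pattern rather than the text, so each query still has to scan the text.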
5
Application in Bioinformatics
  Database search
  Exact matching
  Approximate matching*
  Longest common substring
  Genome alignment*
  Structural motifs*
  Tandem repeats*
  Sequence comparison
6
Problems with Genome-Scale Suffix Trees
  Efficient O(n) suffix tree generating algorithms, e.g. Ukkonen’s algorithm
    Tree must fit entirely in main memory
  Genomes are very large
    Human genome is 3 Gbp (0.75 GB)
    Data structure no longer able to fit in memory
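The 0.75 GB figure follows from packing each of the four DNA bases into 2 bits. The short calculation below reproduces it. The per-base cost of the tree itself is not given in the slides, so ASSUMED_TREE_BYTES_PER_BASE is a purely hypothetical placeholder used to show how the index, rather than the sequence, becomes the memory problem.

```python
def packed_sequence_bytes(num_bases, bits_per_base=2):
    """Bytes needed to store a sequence at a fixed number of bits per base."""
    return num_bases * bits_per_base // 8

HUMAN_GENOME_BASES = 3_000_000_000  # 3 Gbp, as stated on the slide

seq_bytes = packed_sequence_bytes(HUMAN_GENOME_BASES)
print(seq_bytes / 1e9, "GB for the 2-bit packed sequence")

# Hypothetical placeholder, NOT a measured figure from the presentation:
# a suffix tree needs several bytes of nodes and pointers per indexed base.
ASSUMED_TREE_BYTES_PER_BASE = 10
tree_bytes = HUMAN_GENOME_BASES * ASSUMED_TREE_BYTES_PER_BASE
print(tree_bytes / 1e9, "GB for the tree under the assumed per-base cost")
```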
7
What Trellis Solves
  Prevents data skew in prefix partitioning
    Bad data skew with prefix partitioning leads to prefix partitions that may not fit into memory
    Skew comes from the non-uniform distribution of the alphabet (DNA)
  Efficient disk-based implementation
    Functions under low memory constraints
    Efficient disk I/O usage
  Able to recover suffix links
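The skew problem can be seen by grouping suffixes by a fixed-length prefix and counting the size of each group. The sketch below does this for a made-up, A-rich toy sequence; both the sequence and the function name prefix_partition_sizes are assumptions for illustration, not data or code from the presentation.

```python
from collections import Counter

def prefix_partition_sizes(text, k):
    """Count how many suffixes start with each length-k prefix.

    Suffixes shorter than k are ignored in this simplified version.
    """
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

if __name__ == "__main__":
    # Hypothetical skewed sequence: mostly A, with a short CGT island.
    toy = "A" * 12 + "CGT" + "A" * 5
    for k in (1, 2):
        sizes = prefix_partition_sizes(toy, k)
        print(k, sizes.most_common())
```

With fixed-length prefixes, the partition for the dominant prefix holds most of the suffixes while the others are nearly empty, so a partition sized for the average case may not fit in memory.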
8
Trellis Steps
  Prefix creation phase
  Partitioning phase
  Merging phase
  Suffix link recovery phase (optional)
9
Trellis Overview
10
Merging Phase
11
Threshold (t)
  Determines partition of sequence
    Suffix subtree fits into memory during partitioning phase
  Determines cutoff for prefix set inclusion
    Recombined prefixed suffix subtree will fit entirely into memory during merging phase
  Allows input string and two sets of internal nodes to fit entirely into memory during suffix link recovery phase
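One way to picture a threshold-driven prefix set is to keep extending a prefix until the number of suffixes that start with it is at most t. The sketch below does that by brute force. It is an illustration under stated assumptions, not the authors' implementation: t is expressed here as a suffix count rather than a memory size, the text is assumed to end with a unique terminator such as $, t must be at least 1, and all names are hypothetical.

```python
def count_occurrences(text, prefix):
    """Number of suffixes of text that start with prefix (overlaps counted)."""
    return sum(
        1
        for i in range(len(text) - len(prefix) + 1)
        if text.startswith(prefix, i)
    )

def variable_length_prefixes(text, t):
    """Extend prefixes until each one covers at most t suffixes.

    Assumes text ends with a unique terminator and t >= 1, so every
    suffix ends up under exactly one returned prefix.
    """
    alphabet = sorted(set(text))
    accepted = {}
    frontier = [""]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for ch in alphabet:
                candidate = prefix + ch
                count = count_occurrences(text, candidate)
                if count == 0:
                    continue  # prefix never occurs; nothing to partition
                if count <= t:
                    accepted[candidate] = count
                else:
                    next_frontier.append(candidate)  # too big: extend further
        frontier = next_frontier
    return accepted

if __name__ == "__main__":
    toy = "A" * 12 + "CGT" + "A" * 5 + "$"  # hypothetical skewed input
    print(variable_length_prefixes(toy, 6))
```

On the skewed toy input, the A-rich region receives long prefixes while rare characters keep length-1 prefixes, so no partition exceeds the threshold.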
12
Trellis Overview
13
Performance
  O(n^2) time and O(n) space (where n is sequence length)
  Comparison to TDD
    Currently the only other algorithm that scales up to genome level
    Same time complexity
    Does not calculate suffix links
14
Suffix Tree Construction
15
Query Times
17
Conclusion
  Efficient disk-based suffix tree generation that works well with limited memory
  Suffix links are recoverable
  Future work
    Extend to larger alphabets
    Buffer input sequence
    Parallelize partitioning and merging