Parallel EST Clustering
by Kalyanaraman, Aluru, and Kothari
Nargess Memarsadeghi
CMSC 838 Presentation
CMSC 838T – Presentation
Talk Overview
- Overview of talk
  - Motivation
  - Background
  - Techniques
  - Evaluation
  - Related work
  - Observations
Motivation: EST Clustering
- Problem: EST clustering
  - Cluster fragments of cDNA
- Related to the ‘fragment assembly’ problem
  - Detecting overlapping fragments
- Overlaps can be computed by:
  - Pairwise alignment algorithm
  - Dynamic programming
- Alternative:
  - Approximate overlap detection algorithms
  - Dynamic programming
Motivation
- Common tools:
  - Take too long
    - Days for 100,000 ESTs
  - Run out of memory
- This paper:
  - PaCE:
    - Parallel Clustering of ESTs
  - Efficient parallel EST clustering
    - Space-efficient algorithm
    - Reduce total work
    - Reduce run-time
Background: EST Clustering Tools
- Three traditional software packages:
  - Originally designed for fragment assembly:
    - TIGR Assembler
    - Phrap
    - CAP3
- One parallel software package:
  - UICLUSTER: assumes ESTs come from the 3’ end
EST Clustering Tools
- Basic approach
  - Find pairs of similar sequences
  - Align similar pairs
    - Dynamic programming
- Quality of EST clustering
  - Phrap: fastest
    - Avoids dynamic programming
    - Relies on approximation, lower quality
  - CAP: least # of erroneous clusters
EST Clustering Tools’ Performance
- With 50,000 maize ESTs
  - Using a PC with dual Pentium 450 MHz processors and 512 MB RAM:
    - TIGR: ran out of memory
    - Phrap: 40 min
    - CAP: > 24 hours
- With 100,000 maize ESTs
  - All ran out of memory
  - CAP would require 4 days
Goal
- Space-efficient algorithm
  - Space requirement linear in the size of the input data set
- Reduce total work
  - Without sacrificing quality of clustering
- Reduce run-time and facilitate the clustering of large data sets
  - Through parallel processing
  - Scale memory with # of processors
Approach
- Expense: pairwise alignment (time + memory)
  - Promising pairs ≈
    - Common string: |s| = w
    - Cost: if the common string has length |s| = l > w, the pair repeats l - w + 1 times
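The l - w + 1 count can be checked with a small sketch. This is only an illustration, not code from PaCE: the function name `shared_wmer_hits` and the two toy sequences are assumptions. It lists every position pair at which two sequences share an identical window of length w.

```python
from collections import defaultdict

def shared_wmer_hits(a, b, w):
    """Return all (i, j) such that a[i:i+w] == b[j:j+w]."""
    # Index every length-w window of b by its content.
    index = defaultdict(list)
    for j in range(len(b) - w + 1):
        index[b[j:j + w]].append(j)
    # Look up every length-w window of a in that index.
    hits = []
    for i in range(len(a) - w + 1):
        for j in index.get(a[i:i + w], []):
            hits.append((i, j))
    return hits

# Toy input: the two sequences share the substring "ACGTGCA" (l = 7).
hits = shared_wmer_hits("GGACGTGCAGG", "TTACGTGCATT", w=4)
print(len(hits))  # 4, which equals l - w + 1 = 7 - 4 + 1
```

A naive generator that reports one pair per shared window would therefore report this pair four times, which is the redundancy the slide refers to.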
Approach (Cont.)
- Approach:
  - Use a trie structure
  - Identify promising pairs
    - Merge clusters with strong overlaps
    - Avoid storing/testing all similar pairs
  - Parallel EST clustering software:
    - Generalized Suffix Tree (GST)
    - Multiple processors:
      - One maintains and updates the EST clusters
      - Others generate batches of promising pairs and perform alignment
Approach (Cont.)
Tries
1) Index for each char
2) N leaves
3) Height N
Suffix Tries (Cont.)
1) Trim the suffix trie
Suffix Tries (Cont.)
1) Indices
2) Storage O(n), though the constant is high
3) Common string
4) Longest common substring
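A minimal sketch of the longest-common-substring use follows, under stated assumptions. It builds an uncompacted generalized suffix trie over two strings, so it uses quadratic space rather than the O(n) of a compacted suffix tree. The function name and the dictionary layout are illustrative choices, not taken from the paper. The deepest node reached by suffixes of both strings spells a longest common substring.

```python
def longest_common_substring(a, b):
    """Longest substring shared by a and b, via an uncompacted suffix trie."""
    root = {"ids": set(), "kids": {}}
    # Insert every suffix of both strings, tagging each node with the
    # ids (0 for a, 1 for b) of the strings whose suffixes pass through it.
    for sid, s in enumerate((a, b)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"ids": set(), "kids": {}})
                node["ids"].add(sid)
    # Walk only through nodes reached by both strings; keep the deepest path.
    best = ""
    stack = [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["kids"].items():
            if len(child["ids"]) == 2:
                path = prefix + ch
                if len(path) > len(best):
                    best = path
                stack.append((child, path))
    return best

print(longest_common_substring("GATTACA", "ATTACK"))  # ATTAC
```

When several common substrings tie for the longest length, this sketch returns one of them.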
Suffix Tries (Cont.)
[Figure: suffix tree with leaves numbered 1 to 5 and edge labels drawn from a, b, and $]
Given a pattern P = ab, we traverse the tree according to the pattern.
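The traversal can be sketched as follows. The text "abab$" is an assumption: the slide does not name the indexed string, and "abab$" is chosen only because it has five suffixes and uses the same alphabet. The `Node` class and function names are likewise illustrative, and the trie is uncompacted for simplicity.

```python
class Node:
    def __init__(self):
        self.children = {}  # char -> Node
        self.starts = []    # 1-based start positions of suffixes passing through

def build_suffix_trie(text):
    """Insert every suffix of text into an uncompacted trie."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start + 1)
    return root

def find_occurrences(root, pattern):
    """Follow pattern from the root; return where it occurs in the text."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return []  # fell off the trie: the pattern does not occur
    return node.starts

root = build_suffix_trie("abab$")
print(find_occurrences(root, "ab"))  # [1, 3]
```

The search cost depends on the pattern length, not on the text length, which is why the suffix structure suits repeated substring queries.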
Parallel Generation of GST
- GST: Generalized Suffix Tree
  - Compacted trie
  - Longest common prefix found in constant time
  - Used for on-demand pair generation
  - Sequential: O(nl)
  - Parallel: O(nl/p)
Parallel Generation of GST (Cont.)
- Previous implementations:
  - CRCW/CREW PRAM model
  - Work-optimal
    - Involves alphabetical ordering of characters
  - Unrealistic assumptions
    - Synchronous operation of processors
    - Infinite network bandwidth
    - No memory contention
  - Not practically efficient
Parallel Generation of GST (Cont.)
- Paper’s approach:
  - ESTs equally distributed among processors
  - Each processor:
    - Partitions suffixes of its ESTs into buckets
  - Distribute buckets to the processors:
    - All suffixes in a bucket allocated to the same processor
    - Total # of suffixes allocated to a processor ≈ O( )
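The bucketing step might look like the sketch below, which runs on a single machine and makes several assumptions: suffixes are bucketed by their first w characters, suffixes shorter than w are ignored, and buckets are assigned largest-first to the least-loaded processor. The slide does not give the bucketing key or the balancing rule, and the function names are hypothetical.

```python
from collections import defaultdict
import heapq

def bucket_suffixes(ests, w):
    """Group suffixes, given as (est_id, position), by their first w characters."""
    buckets = defaultdict(list)
    for est_id, seq in enumerate(ests):
        for pos in range(len(seq) - w + 1):  # suffixes shorter than w are skipped
            buckets[seq[pos:pos + w]].append((est_id, pos))
    return buckets

def assign_buckets(buckets, p):
    """Give each whole bucket to one of p processors, balancing suffix counts."""
    loads = [(0, rank) for rank in range(p)]  # (suffixes assigned, processor rank)
    heapq.heapify(loads)
    owner = {}
    # Largest buckets first, each going to the currently least-loaded processor.
    for key in sorted(buckets, key=lambda k: -len(buckets[k])):
        load, rank = heapq.heappop(loads)
        owner[key] = rank
        heapq.heappush(loads, (load + len(buckets[key]), rank))
    return owner

buckets = bucket_suffixes(["ACGTAC", "CGTACG"], w=2)
owner = assign_buckets(buckets, p=2)
```

Because a bucket is never split, all suffixes that share a prefix end up on the same processor, which is the property the slide requires.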
Parallel Generation of GST (Cont.)
- Each bucket’s processor:
  - Computes the compacted trie of all its suffixes
  - Cannot use sequential construction
    - Suffixes of a string are not all in the same bucket
- Each bucket:
  - Subtree in the GST
- Nodes:
  - Depth-first search traversal of the trie
  - Pointer to the rightmost child
On-demand Pair Generation
- A pair should be generated if:
  - The ESTs share a substring of length ≥ threshold
  - The shared substring is maximal
  - Leaves in a common node
    - Share a substring of length = depth of node
- Parallel algorithm
  - Each processor works with its trie if
    - Depth of its root in GST < threshold
On-demand Pair Generation
- To process:
  - Sort internal nodes
    - Decreasing order of depth
  - Lists of a node
    - Generated after the node is processed
    - Removed after its parent is processed
    - Limits space to O(nl)
    - Run-time ≈ # pairs generated + cost of sorting
    - Rejected pairs increase run-time by a factor of 2
    - Eliminating duplicates reduces run-time
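The ordering idea (deepest shared substrings first, duplicates suppressed, pairs produced lazily) can be shown with a brute-force stand-in. This is not the GST traversal from the paper: it regroups substrings at each depth instead of visiting sorted tree nodes, and it keeps a `seen` set, which the real algorithm avoids. The function name is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def promising_pairs(ests, threshold):
    """Lazily yield (length, i, j) for EST pairs sharing a substring of at
    least `threshold` characters, longest shared substring first."""
    longest = max(len(s) for s in ests)
    seen = set()  # pairs already reported (duplicate elimination)
    for depth in range(longest, threshold - 1, -1):
        # Stand-in for the GST nodes at this depth: group ESTs by substring.
        groups = defaultdict(set)
        for est_id, seq in enumerate(ests):
            for pos in range(len(seq) - depth + 1):
                groups[seq[pos:pos + depth]].add(est_id)
        for ids in groups.values():
            for pair in combinations(sorted(ids), 2):
                if pair not in seen:
                    seen.add(pair)
                    yield depth, pair[0], pair[1]

pairs = list(promising_pairs(["AACGTT", "CCACGTGG", "TTTTGG"], threshold=2))
print(pairs)  # [(4, 0, 1), (3, 1, 2), (2, 0, 2)]
```

Because it is a generator, a caller can stop asking for pairs as soon as the clusters stop changing, which mirrors the on-demand behavior described above.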
Parallel Clustering
- Master-slave paradigm:
  - Master processor:
    - Maintains and updates clusters
      - Using a union-find data structure
      - Receives messages from slave processors
        - A batch of next promising pairs generated by the slave
        - Results of the pairwise alignment
      - Determines which pairs to explore
      - Determines if merging should occur
  - Slave processors:
    - Generate pairs on demand
    - Perform pairwise alignments of pairs dispatched by the master processor
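A minimal sketch of the master's bookkeeping follows, assuming a standard union-find with path halving and union by size. The helper `select_pairs_to_align` is a hypothetical name. It illustrates one plausible selection rule: a pair whose ESTs already share a cluster needs no alignment.

```python
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))  # each EST starts as its own cluster
        self.size = [1] * n

    def find(self, x):
        """Return the representative of x's cluster (with path halving)."""
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        """Merge the clusters of a and b; return False if already merged."""
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True

def select_pairs_to_align(clusters, batch):
    """Keep only the pairs whose ESTs are still in different clusters."""
    return [(i, j) for i, j in batch if clusters.find(i) != clusters.find(j)]

clusters = UnionFind(5)
clusters.union(0, 1)  # an alignment result justified merging ESTs 0 and 1
clusters.union(1, 2)
print(select_pairs_to_align(clusters, [(0, 2), (2, 3), (3, 4)]))  # [(2, 3), (3, 4)]
```

Skipping pairs such as (0, 2) is where the reduction in total work comes from: the alignment would not change the clustering.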
Parallel Clustering (Cont.)
[Figure: Organization of the parallel clustering software. One master processor exchanges messages with several slave processors. Message labels: "Batch of promising pairs generated + results of pairwise alignment"; "Batchsize or fewer # of pairs + results of pairwise alignment on each pair".]
Parallel Clustering (Cont.)
- To start:
  - Slave P starts with 3 × batchsize pairs
    - Sends the 3rd batch to Master P
    - Starts alignment on the 1st batch
    - Sends results for the 1st batch + a newly generated batch
    - While waiting to receive results from Master P, aligns the 2nd batch
      - The processor always has a next batch to work on between:
        - Submitting the results of the previous batch
        - Receiving another set of pairs
Parallel Clustering (Cont.)
- Improve and control quality
  - Parameters:
    - Match and mismatch scores
    - Gap penalties
  - Post-processing:
    - Detection of alternative splicing
    - Consulting protein databases
    - Organism-specific
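To show how these parameters enter the pairwise alignment, here is a sketch of a suffix-prefix overlap score computed by dynamic programming. The default scores and the linear gap penalty are placeholders chosen for illustration; they are not PaCE's settings, and the function name is hypothetical.

```python
def overlap_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best score for aligning some suffix of a with some prefix of b."""
    cols = len(b) + 1
    # Row 0: no character of a used yet, so b[:j] is aligned against gaps.
    prev = [j * gap for j in range(cols)]
    for i in range(1, len(a) + 1):
        cur = [0] * cols  # cur[0] = 0: skipping a prefix of a costs nothing
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    # The last row covers all of a; any prefix length of b may end the overlap.
    return max(prev)

print(overlap_score("TTTACGT", "ACGTGGG"))  # 4: suffix ACGT meets prefix ACGT
```

Raising the mismatch and gap penalties makes the merge test more conservative, which trades false positives for false negatives.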
Experimental Environment
- Used C and MPI
- Tested:
  - Quality of software:
    - Arabidopsis thaliana (due to availability of its genome)
  - Run-time behavior:
    - 50,000 maize ESTs on a 32-processor IBM SP
    - # of processors
    - Data size
    - (# of promising pairs) vs. data size
    - Batchsize vs. (# of processors)
    - # of clusters
    - Master processor’s time
Quality Assessment
- To assess quality:
  - A data set and its correct clustering
  - ESTs from the plant Arabidopsis thaliana
  - Splice program
    - Align ESTs to the genome
    - Discard ESTs that
      - Don’t align
      - Align in multiple spots
Quality Assessment (Cont.)
- False negative:
  - A pair in the correct clustering is not paired in the output
  - 5%
- False positive:
  - A pair not in the correct clustering appears in the results
  - Negligible (< 0.04%)
  - Due to the conservative nature of the algorithm
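These pair-level definitions can be made concrete with a short sketch. The function names and the toy labelings are assumptions. The sketch returns raw counts only, because the slide does not state which denominators produce the reported percentages.

```python
from itertools import combinations

def co_clustered_pairs(labels):
    """All index pairs (i, j), i < j, that carry the same cluster label."""
    return {(i, j)
            for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}

def pair_errors(truth, predicted):
    """Return (false_negatives, false_positives) counted over EST pairs."""
    true_pairs = co_clustered_pairs(truth)
    pred_pairs = co_clustered_pairs(predicted)
    false_negatives = len(true_pairs - pred_pairs)  # paired in truth, not in output
    false_positives = len(pred_pairs - true_pairs)  # paired in output, not in truth
    return false_negatives, false_positives

# Toy example: EST 2 is placed in the wrong cluster by the output.
print(pair_errors(truth=[0, 0, 0, 1, 1], predicted=[0, 0, 1, 1, 1]))  # (2, 2)
```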
Quality Assessment (Cont.)

Cluster results   Number of singleton clusters   Number of non-singleton clusters
Benchmark         10,803                         18,727
CAP3              17,930                         17,556
PaCE              14,802                         19,536

Distribution of the number of singleton and non-singleton clusters for the benchmark set of 168,200 Arabidopsis ESTs.
Quality Assessment (Cont.)
Run-time Assessment
- Experiment with 50,000 maize ESTs:
- 32-processor IBM SP
[Figure: units in minutes]
Run-time Assessment (Cont.)
[Table: columns p, Preprocessing, Clustering, Total]
Run-time (in seconds) spent in various components of PaCE for 20,000 ESTs. p: number of processors.
Run-time Assessment (Cont.)
- Run-time as a function of batchsize
  - Small batchsize
    - Increase in communication overhead
  - Large batchsize
    - Slaves less responsive to the need to generate pairs
    - Slave does not use the latest clustering results
  - Optimal batchsize
    - Determined by experiment
- Master processor’s time
  - Fixed batchsize, increase in # of processors
    - Gradual increase in Master P’s time
  - With 32 processors, increase < 1%
  - Using 1 master processor is not a bottleneck
Results
- Space
  - Linear in the size of the input data set
- Reduced total work without sacrificing quality
- Reduced run-time
  - Parallel processors
  - Eliminating pairs
- Facilitate clustering
  - Scale memory with # of processors
Observations
- PaCE:
  - Approaches the EST clustering problem directly
  - Better than
    - CAP3
    - Phrap
    - TIGR Assembler
  - Compare time/quality
    - TGICL (TIGR Gene Indices Clustering Tools)
      - Support for PVM
    - MegaBlast
    - STACK
  - Large data sets
    - Lots of processors
  - Can clustering time be improved?
    - Clustering algorithm
References
- S02/lectures/eval10-logp.pdf
- A. Apostolico, C. Iliopoulos, G. M. Landau, B. Schieber, and U. Vishkin. Parallel construction of a suffix tree with applications. Algorithmica, 3:347–365, 1988.