PARALLEL TREE MANIPULATION Islam Atta
Sources
- Islam Atta, Hisham El-Shishiny. "System and method for parallel processing" (patent, TX, US).
- Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H).
Copyright © Islam Atta
This talk is about…
- A manipulation algorithm for tree data structures
- The parallel programming domain
- A novel tree representation using linear arrays
Trees…?
Widely used:
- Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
- Indexing/hashing (search engines…)
Tree manipulation context:
- Full traversal of a tree (or sub-tree) for read or write accesses, e.g. CBIR, DNA sequence alignment
Problem
- Trees are categorized as non-uniform random access structures:
  - Bad spatial locality, incurring high miss rates
  - Worse for multiprocessing (Berkeley, 2006)
- They require high on-/off-chip bandwidth
Tree Representation
[Figure: a tree of nodes A through V and its memory layout, with the nodes packed into the array with no gaps.]
Multiprocessing Platforms
- Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC): explicit message passing; many small messages
- Shared memory architectures (Intel quad-core): coherent cache banks; cache blocks are grabbed when referenced
Optimal solution: reallocate tree elements in memory to form contiguous memory regions.
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Organization & Representation
GOAL: Allocate tree elements to promote spatial locality
Polish Notation
[Figure: arithmetic example of an expression tree written in Polish (prefix) notation.]
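The idea behind the slide's arithmetic example can be illustrated with a short sketch: in Polish (prefix) notation the operator precedes its operands, which is exactly a depth-first reading of the expression tree. The expression and operator set below are illustrative, not taken from the slide.

```python
def eval_prefix(tokens):
    """Evaluate an arithmetic expression in Polish (prefix) notation.

    The token stream is consumed in depth-first order: an operator is
    followed immediately by its two sub-expressions."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}

    def helper(it):
        tok = next(it)
        if tok in ops:
            left = helper(it)
            right = helper(it)
            return ops[tok](left, right)
        return float(tok)

    return helper(iter(tokens))

# (3 + 4) * 5 in infix becomes "* + 3 4 5" in prefix:
print(eval_prefix("* + 3 4 5".split()))  # 35.0
```

Because the prefix string is a depth-first serialization, it is the same principle the deck later applies to whole trees: a traversal order turns a pointer structure into a flat sequence.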
Linear Tree: Depth-First Ordering
[Figure: the tree of nodes A through V laid out in a linear array in depth-first order.]
Recursive Contiguity
[Figure: in the depth-first linear layout, every subtree occupies a contiguous range of the array, e.g. the subtree rooted at N is stored as N S T, and the subtree rooted at C is stored as C G H I O P Q.]
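The depth-first layout and its recursive-contiguity property can be sketched as follows. This is a minimal illustration with a small made-up tree, not the deck's A-to-V example: a pre-order walk appends each node and then its whole subtree, so every subtree ends up as one contiguous slice of the array.

```python
def flatten_depth_first(root):
    """Lay a tree out in a linear array in depth-first (pre-order) order.

    Returns (data, span): data[i] is a node label, and span maps each
    label to the (start, end) half-open range holding its entire subtree,
    so every subtree is one contiguous memory region."""
    data, span = [], {}

    def visit(node):
        label, children = node
        start = len(data)
        data.append(label)
        for child in children:
            visit(child)
        span[label] = (start, len(data))

    visit(root)
    return data, span

# A small tree: A has children B and C; C has children D and E.
tree = ("A", [("B", []), ("C", [("D", []), ("E", [])])])
data, span = flatten_depth_first(tree)
print(data)        # ['A', 'B', 'C', 'D', 'E']
s, e = span["C"]
print(data[s:e])   # ['C', 'D', 'E']  -- the subtree of C is contiguous
```

Contiguity is what makes both platforms on the next slide happy: a whole subtree can be shipped as one large message, or streamed through the cache as sequential blocks.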
With the linear layout:
- Non-shared memory architectures (explicit message passing): few, large messages
- Shared memory architectures: minimal false sharing; spatial locality
Representation
- Data array
- Parents reference array
- Children reference array
First-Child/Sibling
- Data array
- Parents reference array
- First-child reference array
- Siblings' reference array
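A minimal sketch of building these four arrays, assuming the same nested-tuple tree shape as before (the example tree and index convention of -1 for "none" are illustrative choices, not specified by the slide). Nodes are appended in depth-first order so the arrays line up with the linear layout.

```python
def build_first_child_sibling(root):
    """Build the four arrays of the first-child/sibling representation:
    data, parent, first_child, and next_sibling (-1 means 'none').
    Nodes are numbered in depth-first order, matching the linear layout."""
    data, parent, first_child, sibling = [], [], [], []

    def visit(node, parent_idx):
        label, children = node
        idx = len(data)
        data.append(label)
        parent.append(parent_idx)
        first_child.append(-1)
        sibling.append(-1)
        prev_child = -1
        for child in children:
            child_idx = visit(child, idx)
            if prev_child == -1:
                first_child[idx] = child_idx   # first child of this node
            else:
                sibling[prev_child] = child_idx  # chain the siblings
            prev_child = child_idx
        return idx

    visit(root, -1)
    return data, parent, first_child, sibling

tree = ("A", [("B", []), ("C", [("D", [])])])
data, parent, first_child, sibling = build_first_child_sibling(tree)
print(data)         # ['A', 'B', 'C', 'D']
print(parent)       # [-1, 0, 0, 2]
print(first_child)  # [1, -1, 3, -1]
print(sibling)      # [-1, 2, -1, -1]
```

The fixed per-node cost (one data slot plus three indices) is what later allows the compression tricks on the Discussion slide.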
Scheduling Algorithm
Designed for Cell BE and Blue Gene/L:
- Message passing (DMA, MPI, mailboxes)
Challenges:
- Unbalanced trees with varying computation complexity
- Limited local storage
- Larger data chunks, 128-byte aligned
Algorithm properties:
- Master-slave
- Dynamic scheduling of sub-workloads
- Work stealing, coordinated by the master
- Double buffering
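The master-slave dynamic-scheduling idea can be sketched with shared-memory threads (illustrative only: the slide's actual algorithm targets message-passing platforms and adds master-coordinated work stealing and double buffering, which are omitted here). The master splits the linear tree array into contiguous chunks, and each slave pulls the next chunk whenever it finishes, which load-balances unbalanced subtrees without a static partition.

```python
import threading
import queue

def master_slave_process(elements, work, n_workers=4, chunk=2):
    """Minimal master-slave dynamic scheduler: the master enqueues
    contiguous chunks of the linear tree array; slaves pull chunks
    on demand, so faster workers naturally take on more of the load."""
    tasks = queue.Queue()
    for i in range(0, len(elements), chunk):
        tasks.put(elements[i:i + chunk])   # one contiguous sub-workload
    results, lock = [], threading.Lock()

    def slave():
        while True:
            try:
                block = tasks.get_nowait()
            except queue.Empty:
                return                      # no work left: terminate
            out = [work(x) for x in block]  # process the chunk locally
            with lock:
                results.extend(out)

    workers = [threading.Thread(target=slave) for _ in range(n_workers)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results

print(sorted(master_slave_process(list(range(8)), lambda x: x * x)))
# [0, 1, 4, 9, 16, 25, 36, 49]
```

On a message-passing platform the queue would be replaced by master-to-slave messages (DMA/MPI/mailboxes), and double buffering would overlap the transfer of the next chunk with computation on the current one.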
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Evaluation
Implementation and comparison
Methodology
- Application: sequence alignment problem (DNA, RNA, protein, NLP, financial data)
- Implementation: pthreads on x86 Intel machines
  - UG: quad-core
  - Kodos: 2-socket quad-core
  - Kang: 4-socket dual-core
- Data cache simulation
- In-memory trees
Memory Access Time
Naïve sequence alignment consists of only read/write operations.
- Random: sub-linear increase up to 4 threads; saturates after 4 threads
- Linear: 2.7X gain for sequential execution; hits the memory wall
Quad-core vs. 2-socket Quad-core
- Kodos-Random: sub-linear increase before 8 threads; saturates after 8 threads
- Kodos-Linear: similar to UG-Linear
Interconnect Effect
Assume that UG has a perfect interconnect and compare against it.
Effective Bandwidth
[Figure: effective bandwidth on Kang, Kodos, and UG.]
Modeling Real Computation
- Computation scaling modeled as sqrt(number of threads)
- 1.6-3.05X for 1-4 threads
Other Experimental Results

Metric                        | Random | Linear
Miss rate (L2)                | 14%    | 1.6%
Sequential/parallel fractions | Sequential is 10%, with minor improvement for Linear
Load balancing                | Maximum 4% deviation (no work stealing required)
Stalling on locks             | No difference
Memory size ratio             | 1      | 0.32
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Discussion & Conclusion
Practical considerations & potential work
Discussion
Limitations:
- Only shared memory architectures
- Max tree size: 4 GB, 47M nodes
Compression:
- First-child references can be reduced to 1 bit per node.
- Use delta distance instead of a full address.
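The first compression point follows directly from the depth-first layout: a node's first child, when it exists, is always the very next array element, so the full first-child reference collapses to a one-bit "has a child" flag. A minimal sketch (the array values and -1 convention are illustrative):

```python
def compress_first_child(first_child):
    """Compress first-child references for a depth-first linear layout.

    In depth-first order the first child of node i, when it exists,
    always sits at index i + 1, so one bit per node is enough."""
    return [1 if fc != -1 else 0 for fc in first_child]

def first_child_of(i, has_child):
    """Recover the first-child index from the 1-bit-per-node form."""
    return i + 1 if has_child[i] else -1

# first_child array for the tree A(B, C(D)) in depth-first order:
first_child = [1, -1, 3, -1]
bits = compress_first_child(first_child)
print(bits)                      # [1, 0, 1, 0]
print(first_child_of(2, bits))   # 3
print(first_child_of(1, bits))   # -1
```

The second point, delta distances, is the analogous trick for the remaining references (parent and sibling): since related nodes sit close together in the array, the difference between two indices needs far fewer bits than a full address.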
Next…
- Path #1: Implement and evaluate a commercial/scientific workload; develop a library/framework for parallel tree manipulation
- Path #2: Algorithm evaluation for non-shared memory architectures (e.g. Blue Gene, Intel SCC)
- Both
Conclusion
- Tree manipulation using the typical data representation is not well suited for parallel processing.
- We propose and evaluate a technique for parallel tree manipulation:
  - Performance gain for sequential and parallel processing
  - Saves memory and bandwidth
  - Scalable
- For our experiments, on-chip communication with fewer cores is superior to off-chip communication.
Acknowledgment
Special thanks go to:
- Prof. Natalie Enright Jerger
- Prof. Greg Steffan
QUESTIONS
Fact: 42,270 runs were executed during experimentation, using 91 TB of data.
Thank you. Please send me your comments.