Published by Ashley Gibbs. Modified over 9 years ago.
1
PARALLEL TREE MANIPULATION
Islam Atta
2
Sources
Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010.
Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H).
Copyright © 2010-2012 Islam Atta
3
This talk is about …
A manipulation algorithm for tree data structures
The parallel programming domain
A novel tree representation using linear arrays
4
Trees…?
Widely used
  Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
  Indexing/hashing (search engines…)
Tree manipulation context
  Full traversal of a tree (or sub-tree) for read or write accesses
  E.g. CBIR, DNA sequence alignment
5
Problem
Trees are categorized as non-uniform random access structures
  Bad spatial locality
  Incurring high miss rates
Worse for multiprocessing (Berkeley, 2006)
  Requires high on-/off-chip bandwidth
6
Tree Representation
[Figure: a 22-node tree labeled A to V level by level (A; B C D; E F G H I J K; L M N O P Q R; S T U V) and its memory layout as the array ABCDEFGHIJKLMNOPQRSTUV, with no gaps.]
7
Multiprocessing Platforms
Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC)
  Explicit message passing
  Many small messages
Shared memory architectures (Intel quad-core)
  Coherent cache banks
  Cache blocks grabbed when referenced
[Figure: the memory layout ABCDEFGHIJKLMNOPQRSTUV shown for each architecture.]
Optimal solution
  Reallocate tree elements in memory to form contiguous memory regions
8
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Organization & Representation
GOAL: Allocate tree elements to promote spatial locality
9
Polish Notation
[Figure: arithmetic example]
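The slide's arithmetic example did not survive extraction. As a stand-in, here is a minimal sketch of how Polish (prefix) notation flattens an expression tree into a linear sequence with no pointers. The tuple encoding, the function names `to_prefix` and `eval_prefix`, and the sample expression are illustrative assumptions, not taken from the talk.

```python
import operator

# Hypothetical operator table for this sketch.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def to_prefix(node):
    """Flatten an expression tree into prefix order.

    A node is either a number (leaf) or a tuple (op, left, right).
    """
    if not isinstance(node, tuple):
        return [node]
    op, left, right = node
    return [op] + to_prefix(left) + to_prefix(right)

def eval_prefix(tokens):
    """Evaluate a prefix token list by consuming it left to right."""
    it = iter(tokens)

    def parse():
        tok = next(it)
        if tok in OPS:
            left = parse()
            right = parse()
            return OPS[tok](left, right)
        return tok

    return parse()

# (1 + 2) * (5 - 3) as a tree, then as a flat prefix sequence.
expr = ("*", ("+", 1, 2), ("-", 5, 3))
prefix = to_prefix(expr)
value = eval_prefix(prefix)
```

The flat list carries the whole tree structure implicitly in its ordering, which is the property the linear tree layout builds on.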
10
Linear Tree: Depth-First Ordering
[Figure: the example tree (levels A; B C D; E F G H I J K; L M N O P Q R; S T U V) laid out in depth-first order.]
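A depth-first (pre-order) linearization can be sketched as follows. The adjacency-dictionary input, the name `linearize`, and the small six-node tree are assumptions made for illustration; the exact parent/child edges of the slide's 22-node tree are not recoverable from the transcript.

```python
def linearize(tree, root):
    """Return the nodes of `tree` in depth-first (pre-order) order.

    `tree` maps each node to an ordered list of its children;
    leaves may be omitted from the mapping.
    """
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the first child is visited first.
        stack.extend(reversed(tree.get(node, [])))
    return order

# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
layout = linearize(tree, "A")
```

Storing node payloads in this order places each parent immediately before its first child in memory.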
11
Recursive Contiguity
[Figure: the example tree and its linear array. The sub-tree rooted at N (N, S, T) occupies the contiguous block NST, and the sub-tree rooted at C (C, G, H, I, O, P, Q) occupies the contiguous block CGHIOPQ.]
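Recursive contiguity means that, in a depth-first layout, every sub-tree occupies one contiguous slice of the array. A minimal sketch, assuming a hypothetical adjacency-dictionary tree and helper names (`linearize_with_sizes`, `subtree_slice`) that are not from the talk:

```python
def linearize_with_sizes(tree, root):
    """Pre-order layout plus the node count of every sub-tree."""
    order, size = [], {}

    def visit(node):
        start = len(order)
        order.append(node)
        for child in tree.get(node, []):
            visit(child)
        # Everything appended since `start` belongs to this sub-tree.
        size[node] = len(order) - start

    visit(root)
    return order, size

def subtree_slice(order, size, node):
    """Return the contiguous slice of the layout holding node's sub-tree."""
    start = order.index(node)
    return order[start:start + size[node]]

# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
order, size = linearize_with_sizes(tree, "A")
b_block = subtree_slice(order, size, "B")
c_block = subtree_slice(order, size, "C")
```

Because a sub-tree is a single (offset, length) range, handing it to another processor is one bulk copy rather than one transfer per node.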
12
Non-shared memory architectures
  Explicit message passing
  Few large messages
Shared memory architectures
  Minimal false sharing
  Spatial locality
[Figure: the linear array ABCDEFGHIJKLMNOPQRSTUV shown for each architecture.]
13
Representation
Data array
Parents reference array
Children reference array
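The slide names three parallel arrays but the transcript does not show their contents. The sketch below is one plausible reading, assuming index-based references: `parent[i]` holds the index of node i's parent (-1 for the root) and `children[i]` holds the indices of its children. The per-node list used for `children` is a simplification chosen for readability, not the talk's actual encoding.

```python
def build_arrays(tree, root):
    """Build data, parent-reference, and children-reference arrays
    for a tree laid out in depth-first order."""
    data, parent, children = [], [], []

    def visit(node, parent_idx):
        idx = len(data)
        data.append(node)
        parent.append(parent_idx)
        children.append([])
        if parent_idx >= 0:
            children[parent_idx].append(idx)
        for child in tree.get(node, []):
            visit(child, idx)

    visit(root, -1)
    return data, parent, children

# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
data, parent, children = build_arrays(tree, "A")
```

All three arrays share the same indexing, so a node is identified by one integer instead of a pointer.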
14
First-Child/Sibling
Data array
Parents reference array
First-child reference array
Siblings' reference array
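With a first-child/sibling encoding, each node needs only two fixed-size references regardless of how many children it has. A minimal sketch, assuming index-based references with -1 meaning "none"; the function names and the sample tree are hypothetical:

```python
def first_child_sibling(tree, root):
    """Build data, first-child, and next-sibling arrays in depth-first order."""
    data, first_child, next_sibling = [], [], []

    def visit(node):
        idx = len(data)
        data.append(node)
        first_child.append(-1)
        next_sibling.append(-1)
        prev = -1
        for child in tree.get(node, []):
            child_idx = visit(child)
            if prev < 0:
                first_child[idx] = child_idx      # first child of this node
            else:
                next_sibling[prev] = child_idx    # link previous child to this one
            prev = child_idx
        return idx

    visit(root)
    return data, first_child, next_sibling

def children_of(i, first_child, next_sibling):
    """Enumerate the children of node i by walking the sibling chain."""
    out = []
    c = first_child[i]
    while c != -1:
        out.append(c)
        c = next_sibling[c]
    return out

# Hypothetical tree: A has children B and C; B has D and E; C has F.
tree = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"]}
data, first_child, next_sibling = first_child_sibling(tree, "A")
root_children = children_of(0, first_child, next_sibling)
```

Note that in a depth-first layout a node's first child, when it exists, always sits at index i + 1.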
15
Scheduling Algorithm
Designed for Cell BE and Blue Gene/L
  Message passing (DMA, MPI, mailboxes)
Challenges
  Unbalanced trees with varying computation complexity
  Limited local storage
  Larger data chunks, 128-byte aligned
Algorithm properties
  Master-slave
  Dynamic scheduling of sub-workloads
  Work-stealing: coordinated by the master
  Double buffering
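The dynamic-scheduling part of this design can be sketched with a shared task queue: the master enqueues contiguous sub-tree ranges and idle workers pull the next one. This is a simplified illustration using Python threads, not the Cell BE or Blue Gene/L implementation; work-stealing, double buffering, and 128-byte alignment are deliberately left out, and `run_master_worker` and its parameters are hypothetical names.

```python
import queue
import threading

def run_master_worker(data, chunks, work_fn, num_workers=4):
    """Master fills a queue with (start, end) ranges over the linear tree;
    workers repeatedly take a range and process that contiguous slice."""
    tasks = queue.Queue()
    for chunk in chunks:
        tasks.put(chunk)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                start, end = tasks.get_nowait()
            except queue.Empty:
                return  # no work left
            value = work_fn(data[start:end])
            with lock:
                results[(start, end)] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

# Unbalanced sub-workloads over a 10-element linear array.
data = list(range(10))
chunks = [(0, 3), (3, 4), (4, 10)]
results = run_master_worker(data, chunks, sum)
```

Because each task is just an (offset, length) pair into the linear array, uneven sub-trees are balanced by whichever worker becomes free first.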
16
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Evaluation
Implementation and comparison
17
Methodology
Application: sequence alignment problem
  DNA, RNA, protein, NLP, financial data
Implementation: pthreads on x86 Intel machines
  UG: quad-core
  Kodos: 2-socket quad-core
  Kang: 4-socket dual-core
Data cache simulation
In-memory trees
18
Memory Access Time
Naïve sequence alignment consists of only read/write operations.
Random:
  Sub-linear increase up to 4 threads
  Saturates after 4 threads
Linear:
  Sequential: 2.7X gain
  Hits the memory wall
19
Quad-core vs. 2-socket Quad-core
Kodos-Random:
  Sub-linear increase before 8 threads
  Saturates after 8 threads
Kodos-Linear:
  Similar to UG-Linear
20
Interconnect Effect
Assume that UG has a perfect interconnect and compare against it.
21
Effective Bandwidth
[Figure: effective bandwidth for Kang, Kodos, and UG.]
22
Modeling Real Computation
Computation scaling modeled as SQRT(number of threads)
1.6X to 3.05X for 1 to 4 threads
23
Other Experimental Results
Metric | Random | Linear
Miss rate (L2) | 14% | 1.6%
Sequential/parallel fractions | Sequential is 10%, with minor improvement for Linear
Load balancing | Maximum 4% deviation (no work-stealing required)
Stalling on locks | No difference
Memory size ratio | 1 | 0.32
24
Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Discussion & Conclusion
Practical considerations & potential work
25
Discussion
Limitations:
  Only shared memory architectures
  Max tree size: 4G bytes, 47M nodes
Compression
  First-child references can be reduced to 1 bit per node.
  Use delta distance instead of full address.
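Both compression ideas follow from the depth-first layout: a node's first child, if any, is always the next array element, so one bit suffices, and a sibling can be stored as a forward distance rather than an absolute index. The sketch below assumes index-based reference arrays with -1 meaning "none"; the encoding of "no sibling" as delta 0 and the function names are illustrative choices, not the talk's format.

```python
def compress(first_child, next_sibling):
    """Replace first-child indices with 1-bit flags and
    next-sibling indices with forward deltas (0 means no sibling)."""
    has_child = [1 if fc != -1 else 0 for fc in first_child]
    sibling_delta = [ns - i if ns != -1 else 0
                     for i, ns in enumerate(next_sibling)]
    return has_child, sibling_delta

def decompress(has_child, sibling_delta):
    """Recover the full index arrays from the compressed form."""
    first_child = [i + 1 if bit else -1 for i, bit in enumerate(has_child)]
    next_sibling = [i + d if d else -1 for i, d in enumerate(sibling_delta)]
    return first_child, next_sibling

# Reference arrays for a hypothetical six-node tree in depth-first order
# (root 0 has children 1 and 4; node 1 has children 2 and 3; node 4 has child 5).
first_child = [1, 2, -1, -1, 5, -1]
next_sibling = [-1, 4, 3, -1, -1, -1]

has_child, sibling_delta = compress(first_child, next_sibling)
restored = decompress(has_child, sibling_delta)
```

Deltas are bounded by the size of the skipped sub-tree, so small sub-trees can use narrower integer fields than a full address would need.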
26
Discussion
27
Next…
Path #1:
  Implement and evaluate a commercial/scientific workload
  Develop a library/framework for parallel tree manipulation
Path #2:
  Algorithm evaluation for non-shared memory architectures
  E.g. Blue Gene, Intel SCC
Both
28
Conclusion
Tree manipulation using typical data representation is not well suited for parallel processing.
We propose and evaluate a technique for parallel tree manipulation:
  Performance gain for sequential and parallel processing
  Saves memory and bandwidth
  Scalable
For our experiments, on-chip communication with fewer cores is superior to off-chip communication.
29
Acknowledgment
Special thanks go to:
  Prof. Natalie Enright Jerger
  Prof. Greg Steffan
30
QUESTIONS
Fact: 42,270 runs were executed in the experimentation using 91 TBs of data.
Thank you. Please send me your comments: iatta@eecg.toronto.edu