
1 PARALLEL TREE MANIPULATION Islam Atta

2 Sources: Islam Atta, Hisham El-Shishiny. System and method for parallel processing. 20110016153 TX, US, 2010. Experimental evaluation was done as part of course work at UofT (ECE1749H, ECE1755H). Copyright © 2010-2012 Islam Atta

3 This talk is about: a manipulation algorithm for tree data structures; the parallel programming domain; a novel tree representation using linear arrays.

4 Trees…? Widely used: hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…); indexing/hashing (search engines…). Tree manipulation context: full traversal of a tree (or sub-tree) for read or write accesses, e.g. CBIR, DNA sequence alignment.

5 Problem: Trees are categorized as non-uniform random access structures. Bad spatial locality, incurring high miss rates. Worse for multiprocessing (Berkeley, 2006). Requires high on-/off-chip bandwidth.

6 Tree Representation [Figure: example tree with node rows A / BCD / EFGHIJK / LMNOPQR / STUV, and its memory layout as the array ABCDEFGHIJKLMNOPQRSTUV. No gaps.]

7 Multiprocessing Platforms: Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC): explicit message passing, resulting in many small messages. Shared memory architectures (Intel Quad-core): coherent cache banks, with cache blocks grabbed when referenced. [Figure: the array ABCDEFGHIJKLMNOPQRSTUV shown for each case.] Optimal solution: reallocate tree elements in memory to form contiguous memory regions.

8 Organization & Representation / Evaluation / Discussion & Conclusion. Section: Organization & Representation. GOAL: allocate tree elements to promote spatial locality.

9 Polish Notation: arithmetic example.
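
The arithmetic example from this slide is not preserved in the transcript. As a hypothetical stand-in, the sketch below evaluates the prefix (Polish) expression `* + 1 2 3`, i.e. (1 + 2) * 3. It is meant to illustrate why the notation matters here: a prefix expression is a depth-first listing of its expression tree, and every sub-expression occupies one contiguous run of tokens. The function name `eval_prefix` and the token list are illustrative assumptions, not taken from the talk.

```python
import operator

# Binary operators supported by this illustrative evaluator.
OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}

def eval_prefix(tokens, pos=0):
    """Evaluate the prefix expression that starts at tokens[pos].

    Returns (value, next_pos), where next_pos is the index one past the
    last token of this sub-expression.
    """
    tok = tokens[pos]
    if tok in OPS:
        # An operator is followed directly by its two operand sub-expressions.
        left, pos = eval_prefix(tokens, pos + 1)
        right, pos = eval_prefix(tokens, pos)
        return OPS[tok](left, right), pos
    return float(tok), pos + 1

tokens = "* + 1 2 3".split()
value, end = eval_prefix(tokens)
print(value, end)  # value of the expression, and number of tokens consumed
```

Note that the evaluator never follows a pointer: it only moves forward through the token array, which is the access pattern the linear tree layout aims for.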

10 Linear Tree: Depth-First Ordering [Figure: the example tree with nodes A to V.]
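
A minimal sketch of depth-first linearization, assuming the tree is given as a mapping from each node to its ordered list of children. The small six-node tree below is hypothetical (the exact shape of the A to V tree on the slide cannot be recovered from the transcript), and `linearize` is an illustrative name.

```python
def linearize(children, root):
    """Return the node labels in depth-first (pre-order) order."""
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children[node]))
    return order

# Hypothetical tree: A has children B and C; B has D and E; C has F.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
            "D": [], "E": [], "F": []}

linear = linearize(children, "A")
print(linear)
```

An explicit stack is used instead of recursion so that deep, unbalanced trees do not hit Python's recursion limit.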

11 Recursive Contiguity [Figure: the example tree and its linear array ABCDEFGHIJKLMNOPQRSTUV; the sub-tree rooted at N (N, S, T) is stored as the block NST, and the sub-tree rooted at C (C, G, H, I, O, P, Q) as the block CGHIOPQ.]
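
The contiguity property can be sketched as follows, assuming a depth-first (pre-order) array plus a parent-index array. Under that assumption the sub-tree rooted at index `i` occupies exactly the slice `[i, i + size[i])`, so any sub-tree can be handed to a worker as a single block. The six-node tree and the helper names are hypothetical.

```python
# Hypothetical tree in depth-first order: A(B(D, E), C(F)).
data = ["A", "B", "D", "E", "C", "F"]
parent = [-1, 0, 1, 1, 0, 4]  # index of each node's parent, -1 for the root

def subtree_sizes(parent):
    """Number of nodes in the sub-tree rooted at each index."""
    size = [1] * len(parent)
    # In pre-order every child is stored after its parent, so a single
    # backward pass adds each finished sub-tree to its parent's count.
    for i in range(len(parent) - 1, 0, -1):
        size[parent[i]] += size[i]
    return size

def subtree_slice(data, size, i):
    """The sub-tree rooted at index i, as one contiguous slice."""
    return data[i:i + size[i]]

size = subtree_sizes(parent)
print(size)
print(subtree_slice(data, size, 1))  # sub-tree rooted at B
print(subtree_slice(data, size, 4))  # sub-tree rooted at C
```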

12 Non-shared memory architectures: explicit message passing, now with few large messages. Shared memory architectures: minimal false sharing, spatial locality.

13 Representation: Data Array, Parents Reference Array, Children Reference Array.
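
The slide names three arrays but the transcript does not show their contents. The sketch below is one plausible reading, stated as an assumption: `data[i]` holds the payload of node `i`, `parent[i]` the index of its parent (-1 for the root), and `kids[i]` the indices of its children, with nodes numbered in depth-first order. The input tree and all names are hypothetical.

```python
def build_arrays(children, root):
    """Flatten a tree into data / parent / children reference arrays."""
    data, parent, kids = [], [], []

    def visit(node, parent_index):
        index = len(data)          # next free slot = depth-first position
        data.append(node)
        parent.append(parent_index)
        kids.append([])
        if parent_index >= 0:
            kids[parent_index].append(index)
        for child in children[node]:
            visit(child, index)

    visit(root, -1)
    return data, parent, kids

# Hypothetical tree: A has children B and C; B has D and E; C has F.
children = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
            "D": [], "E": [], "F": []}

data, parent, kids = build_arrays(children, "A")
print(data)
print(parent)
print(kids)
```

The recursive helper keeps the sketch short; a production version for very deep trees would use an explicit stack.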

14 First-Child/Sibling: Data Array, Parents Reference Array, First-Child Reference Array, Siblings' Reference Array.
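
A sketch of the first-child/sibling variant, assuming nodes are stored in depth-first order with the root at index 0 and that -1 marks "no reference". Each node then needs only two fixed-size references, however many children it has. The function names and the example parent array are hypothetical.

```python
def first_child_sibling(parent):
    """Derive first-child and next-sibling arrays from a parent array."""
    n = len(parent)
    first_child = [-1] * n
    next_sibling = [-1] * n
    last_child = [-1] * n  # most recently seen child of each node
    for i in range(1, n):
        p = parent[i]
        if first_child[p] == -1:
            first_child[p] = i
        else:
            next_sibling[last_child[p]] = i
        last_child[p] = i
    return first_child, next_sibling

def children_of(i, first_child, next_sibling):
    """Walk the sibling chain to list the children of node i."""
    result = []
    child = first_child[i]
    while child != -1:
        result.append(child)
        child = next_sibling[child]
    return result

# Hypothetical tree in depth-first order: A(B(D, E), C(F)).
parent = [-1, 0, 1, 1, 0, 4]
first_child, next_sibling = first_child_sibling(parent)
print(first_child)
print(next_sibling)
print(children_of(0, first_child, next_sibling))
```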

15 Scheduling Algorithm: Designed for Cell BE and Blue Gene/L. Message passing (DMA, MPI, mailboxes). Challenges: unbalanced trees with varying computation complexity; limited local storage; larger data chunks, 128-byte aligned. Algorithm properties: master-slave; dynamic scheduling of sub-workloads; work-stealing, coordinated by the master; double buffering.
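
The scheduling algorithm itself is not spelled out in the transcript, so the following is only a simplified stand-in for the "dynamic scheduling of sub-workloads" idea: a master puts contiguous index ranges of the linear array on a shared queue, and each worker pulls the next range whenever it becomes free. Work-stealing, double buffering, and 128-byte alignment are deliberately omitted, and Python threads replace the DMA/MPI transport. The names `run_master_worker`, `chunks`, and `work` are hypothetical.

```python
import queue
import threading

def run_master_worker(data, chunks, work, num_workers=4):
    """Process contiguous (start, end) ranges of data with a worker pool.

    Returns a dict mapping each (start, end) range to work(data[start:end]).
    """
    tasks = queue.Queue()
    for chunk in chunks:          # master: publish all sub-workloads
        tasks.put(chunk)

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                start, end = tasks.get_nowait()
            except queue.Empty:
                return            # no work left for this worker
            value = work(data[start:end])  # one contiguous block per task
            with lock:
                results[(start, end)] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

data = list(range(10))
chunks = [(0, 4), (4, 6), (6, 10)]   # unevenly sized sub-workloads
results = run_master_worker(data, chunks, sum)
print(results)
```

Because workers take a new range only when they finish the previous one, uneven sub-tree sizes balance out without a static assignment.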

16 Organization & Representation / Evaluation / Discussion & Conclusion. Section: Evaluation. Implementation and comparison.

17 Methodology: Application: sequence alignment problem (DNA, RNA, protein, NLP, financial data). Implementation: pthreads on x86 Intel machines. UG: quad-core. Kodos: 2-socket quad-core. Kang: 4-socket dual-core. Data cache simulation. In-memory trees.

18 Memory Access Time: Naïve sequence alignment consists of only read/write operations. Random: sub-linear increase up to 4 threads; saturates after 4 threads. Linear: sequential, 2.7X gain; hits the memory wall.

19 Quad-core vs. 2-socket Quad-core: Kodos-Random: sub-linear increase before 8 threads; saturates after 8 threads. Kodos-Linear: similar to UG-Linear.

20 Interconnect Effect: Assume that UG has a perfect interconnect and compare against it.

21 Effective Bandwidth [Figure panels: Kang, Kodos, UG.]

22 Modeling Real Computation: Computation scaling modeled as SQRT(number of threads). 1.6X to 3.05X for 1 to 4 threads.

23 Other Experimental Results
Metric                          Random    Linear
Miss rate (L2)                  14%       1.6%
Sequential/parallel fractions   Sequential is 10%, with minor improvement for Linear
Load balancing                  Maximum 4% deviation (no work-stealing required)
Stalling on locks               No difference
Memory size ratio               1         0.32

24 Organization & Representation / Evaluation / Discussion & Conclusion. Section: Discussion & Conclusion. Practical considerations & potential work.

25 Discussion: Limitations: only shared memory architectures; max tree size: 4G bytes, 47M nodes. Compression: first-child references can be reduced to 1 bit per node; use delta distance instead of full address.
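
Both compression ideas can be sketched under the assumption that nodes are stored in depth-first (pre-order) order. In that layout a node's first child, if it has one, always sits at the next index, so the reference reduces to a single has-child bit. A sibling always sits at a higher index, so it can be stored as a small positive distance, with 0 meaning "no sibling". The array contents and function names below are hypothetical.

```python
def compress(first_child, next_sibling):
    """Replace full indices with a has-child bit and sibling deltas."""
    has_child = [fc != -1 for fc in first_child]
    sibling_delta = [ns - i if ns != -1 else 0
                     for i, ns in enumerate(next_sibling)]
    return has_child, sibling_delta

def decompress(has_child, sibling_delta):
    """Recover the full first-child and next-sibling index arrays."""
    # In pre-order the first child is stored directly after its parent.
    first_child = [i + 1 if bit else -1 for i, bit in enumerate(has_child)]
    next_sibling = [i + d if d else -1 for i, d in enumerate(sibling_delta)]
    return first_child, next_sibling

# Hypothetical tree in depth-first order: A(B(D, E), C(F)).
first_child = [1, 2, -1, -1, 5, -1]
next_sibling = [-1, 4, 3, -1, -1, -1]

has_child, sibling_delta = compress(first_child, next_sibling)
print(has_child)
print(sibling_delta)
print(decompress(has_child, sibling_delta) == (first_child, next_sibling))
```

The deltas are bounded by sub-tree size rather than by total tree size, so they typically fit in fewer bits than a full address.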

26 Discussion

27 Next… Path #1: implement and evaluate a commercial/scientific workload; develop a library/framework for parallel tree manipulation. Path #2: algorithm evaluation for non-shared memory architectures, e.g. Blue Gene, Intel SCC. Or both.

28 Conclusion: Tree manipulation using the typical data representation is not well suited for parallel processing. We propose and evaluate a technique for parallel tree manipulation: performance gain for sequential and parallel processing; saves memory and bandwidth; scalable. In our experiments, on-chip communication with fewer cores is superior to off-chip communication.

29 Acknowledgment: Special thanks go to Prof. Natalie Enright Jerger and Prof. Greg Steffan.

30 QUESTIONS. Fact: 42,270 runs were executed in the experiments, using 91 TB of data. Thank you. Please send me your comments: iatta@eecg.toronto.edu

