PARALLEL TREE MANIPULATION. Islam Atta.
Source: Islam Atta, Hisham El-Shishiny. System and method for parallel processing. US Patent Application 20110016153, TX, US, 2010.

Presentation transcript:

PARALLEL TREE MANIPULATION Islam Atta

Sources: Islam Atta, Hisham El-Shishiny. System and method for parallel processing. US Patent Application 20110016153, TX, US, 2010. Experimental evaluation was done as part of course work at the University of Toronto (ECE1749H, ECE1755H).

This talk is about:
- A manipulation algorithm for tree data structures
- The parallel programming domain
- A novel tree representation using linear arrays

Trees…? Widely used for:
- Hierarchical data (financial data, NLP, machine vision, GIS, DNA, protein sequences…)
- Indexing/hashing (search engines…)
Tree manipulation context: full traversal of a tree (or sub-tree) for read or write accesses, e.g. CBIR or DNA sequence alignment.

Problem. Trees are categorized as non-uniform random access structures:
- Bad spatial locality, incurring high miss rates
- Worse for multiprocessing (Berkeley, 2006)
- Requires high on-/off-chip bandwidth

Tree Representation
[Figure: a 22-node tree (A through V) and its memory layout as the linear array ABCDEFGHIJKLMNOPQRSTUV, with no gaps.]

Multiprocessing Platforms
- Non-shared memory architectures (Cell BE, Blue Gene, Intel SCC): explicit message passing, which with a scattered tree means many small messages.
- Shared memory architectures (Intel quad-core): coherent cache banks, so cache blocks are grabbed when referenced.
Optimal solution: reallocate tree elements in memory to form contiguous memory regions.

Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Organization & Representation
GOAL: Allocate tree elements to promote spatial locality.

Polish Notation
[Figure: an arithmetic expression written in Polish (prefix) notation.]
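To make the arithmetic example concrete (this particular instance is mine, not from the slide): Polish, or prefix, notation writes each operator before its operands, so an expression tree flattens into a linear sequence with no parentheses. The same parent-before-children idea drives the depth-first tree layout on the next slide.

```
        *
       / \        infix:   (3 + 4) * 5
      +   5       prefix:  * + 3 4 5
     / \
    3   4
```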

Linear Tree: Depth-First Ordering
[Figure: the A through V tree linearized by a depth-first walk, parent before children, into ABCDEFGHIJKLMNOPQRSTUV.]
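A minimal C sketch of the depth-first linearization (the pointer-based source node and all names are my assumptions; the deck shows only the picture):

```c
#include <stddef.h>

/* Hypothetical pointer-based source tree. */
typedef struct Node {
    int           data;
    size_t        nchildren;
    struct Node **children;
} Node;

/* Copy a tree into a flat array in depth-first (pre-)order: parent first,
 * then each child subtree in turn.  Returns the next free slot in out[]. */
static size_t linearize(const Node *n, int *out, size_t pos) {
    out[pos++] = n->data;
    for (size_t i = 0; i < n->nchildren; i++)
        pos = linearize(n->children[i], out, pos);
    return pos;
}
```

Because each recursive call finishes an entire subtree before returning, every subtree lands in one contiguous run of out[].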

Recursive Contiguity
[Figure: contiguity holds recursively. In the linear array ABCDEFGHIJKLMNOPQRSTUV, the subtree rooted at N occupies the contiguous run NST, and the subtree rooted at C occupies CGHIOPQ.]
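The payoff of that recursive contiguity, under the same assumptions as the sketch above: any subtree can be handed to a worker as a single (base, length) slice, one streamed cache region or one bulk message instead of a pointer chase.

```c
#include <stddef.h>

extern void visit(int node);  /* hypothetical per-node operation */

/* If subtree_size[i] holds the node count of the subtree rooted at array
 * index i, that subtree is exactly the slice out[i .. i+subtree_size[i]-1]. */
static void process_subtree(const int *out, const size_t *subtree_size,
                            size_t i)
{
    for (size_t k = i; k < i + subtree_size[i]; k++)
        visit(out[k]);
}
```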

With the linear layout:
- Non-shared memory architectures: explicit message passing becomes few, large messages.
- Shared memory architectures: minimal false sharing and good spatial locality.

Representation
Three parallel arrays: a Data Array, a Parents Reference Array, and a Children Reference Array.
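One way the three arrays could look in C (field names are assumptions; the slide only names the arrays):

```c
#include <stddef.h>

typedef struct {
    size_t  n;            /* number of nodes                            */
    int    *data;         /* data array: payload of node i              */
    size_t *parent;       /* parents reference array: parent of node i  */
    size_t *child_first;  /* children reference array: first child of i */
    size_t *child_count;  /* ...and how many children node i has        */
} LinearTree;
```

Under the depth-first layout a node's children are themselves contiguous, so a start index plus a count is enough to enumerate them.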

First-Child/Sibling
Four parallel arrays: a Data Array, a Parents Reference Array, a First-Child Reference Array, and a Siblings' Reference Array.
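The first-child/sibling variant trades the variable-length child list for two fixed-width references per node; a sketch under the same naming assumptions:

```c
#include <stddef.h>

#define NIL ((size_t)-1)   /* "no such node" sentinel (an assumption) */

typedef struct {
    size_t  n;
    int    *data;          /* data array                   */
    size_t *parent;        /* parents reference array      */
    size_t *first_child;   /* first-child reference array  */
    size_t *next_sibling;  /* siblings' reference array    */
} FCSTree;

/* Visit every child of node i by following sibling links. */
static void for_each_child(const FCSTree *t, size_t i, void (*f)(size_t)) {
    for (size_t c = t->first_child[i]; c != NIL; c = t->next_sibling[c])
        f(c);
}
```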

Scheduling Algorithm
Designed for Cell BE and Blue Gene/L: message passing (DMA, MPI, mailboxes).
Challenges:
- Unbalanced trees with varying computation complexity
- Limited local storage
- Larger data chunks, 128-byte aligned
Algorithm properties (see the sketch below):
- Master-slave
- Dynamic scheduling of sub-workloads
- Work-stealing, coordinated by the master
- Double buffering
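The deck's scheduler targets message-passing hardware; as a rough shared-memory analogue of its dynamic scheduling of sub-workloads, workers can claim contiguous chunks of the linear tree from a shared atomic counter. Everything below is an illustrative assumption, not the patented algorithm:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

enum { CHUNK = 1024 };              /* nodes claimed per grab (tunable) */

typedef struct {
    const int     *nodes;           /* depth-first linear tree          */
    size_t         n;               /* total node count                 */
    atomic_size_t  next;            /* next unclaimed index             */
} WorkPool;

extern void process_node(int node); /* hypothetical per-node work       */

/* Worker body (pass to pthread_create): repeatedly grab the next
 * contiguous chunk and process it.  Faster threads naturally claim more
 * chunks, absorbing unbalanced subtrees without explicit stealing. */
static void *worker(void *arg) {
    WorkPool *w = arg;
    for (;;) {
        size_t base = atomic_fetch_add(&w->next, (size_t)CHUNK);
        if (base >= w->n)
            break;                  /* pool drained */
        size_t end = base + CHUNK < w->n ? base + CHUNK : w->n;
        for (size_t i = base; i < end; i++)
            process_node(w->nodes[i]);
    }
    return NULL;
}
```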

Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Evaluation
Implementation and comparison.

Methodology
Application: the sequence alignment problem (DNA, RNA, protein, NLP, financial data).
Implementation: pthreads on x86 Intel machines:
- UG: quad-core
- Kodos: 2-socket quad-core
- Kang: 4-socket dual-core
Also: data cache simulation; in-memory trees.

Memory Access Time
Naïve sequence alignment consists of only read/write operations.
- Random layout: sub-linear speedup up to 4 threads; saturates after 4 threads.
- Linear layout: a 2.7x gain even for sequential execution, then hits the memory wall.

Quad-core vs. 2-socket Quad-core
- Kodos-Random: sub-linear speedup before 8 threads; saturates after 8 threads.
- Kodos-Linear: similar to UG-Linear.

Interconnect Effect
Assume that UG has a perfect interconnect and compare against it.

Effective Bandwidth
[Figure: effective bandwidth measured on Kang, Kodos, and UG.]

Modeling Real Computation
- Computation scaling is modeled as SQRT(number of threads).
- Gain of 1.6 to 3.05x for 1 to 4 threads.
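One plausible reading of that model (my interpretation; the slide gives only the square-root phrase): achieved speedup grows roughly as $S(p) \approx \sqrt{p}$, so $S(4) \approx 2$, which sits inside the reported 1.6 to 3.05x range for 1 to 4 threads.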

Other Experimental Results

Metric                        | Random | Linear
Miss rate (L2)                | 14%    | 1.6%
Sequential/parallel fractions | Sequential is 10%, with minor improvement for Linear
Load balancing                | Maximum 4% deviation (no work-stealing required)
Stalling on locks             | No difference
Memory size ratio             | 1      | 0.32

Outline: Organization & Representation | Evaluation | Discussion & Conclusion
Discussion & Conclusion
Practical considerations and potential work.

Discussion
Limitations:
- Shared memory architectures only
- Max tree size: 4 GB, 47M nodes
Compression (sketched below):
- First-child references can be reduced to 1 bit per node.
- Use a delta distance instead of a full address.
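A sketch of how the two compressions could be encoded (the encoding details are my assumptions; the deck only names the ideas). In depth-first order a node's first child, when present, is the very next array slot, so the first-child reference degenerates to a has-child bit; sibling references span bounded distances, so a small signed delta can replace a full address:

```c
#include <stddef.h>
#include <stdint.h>

/* 1 bit per node: does node i have any children?  If yes, its first
 * child is simply node i + 1 in depth-first order. */
static inline int has_child(const uint8_t *flags, size_t i) {
    return (flags[i >> 3] >> (i & 7)) & 1;
}

/* Sibling reference stored as a delta from the current index rather
 * than as a full address/index. */
static inline size_t next_sibling(const int32_t *sibling_delta, size_t i) {
    return i + (size_t)sibling_delta[i];
}
```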

Next…
- Path #1: implement and evaluate a commercial/scientific workload; develop a library/framework for parallel tree manipulation.
- Path #2: evaluate the algorithm on non-shared memory architectures (e.g., Blue Gene, Intel SCC).
- Or both.

Conclusion
- Tree manipulation using the typical data representation is not well suited to parallel processing.
- We propose and evaluate a technique for parallel tree manipulation that:
  - gains performance for both sequential and parallel processing,
  - saves memory and bandwidth,
  - and is scalable.
- In our experiments, on-chip communication with fewer cores is superior to off-chip communication.

Acknowledgment
Special thanks go to Prof. Natalie Enright Jerger and Prof. Greg Steffan.

QUESTIONS
Fact: 42,270 runs were executed during the experimentation, using 91 TB of data.
Thank you. Please send me your comments.