Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh

Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012

2 © Phillip B. Gibbons Multi-core Computing Lectures: Progress-to-date on Key Open Questions How to formally model multi-core hierarchies? What is the Algorithm Designer’s model? What runtime task scheduler should be used? What are the new algorithmic techniques? How do the algorithms perform in practice?

3 © Phillip B. Gibbons Lecture 1 Summary Multi-cores: today, future trends, challenges Computations & Schedulers –Modeling computations in work-depth framework –Schedulers: Work Stealing & PDF Cache miss analysis on 2-level parallel hierarchy –Private caches OR Shared cache Low-depth, cache-oblivious parallel algorithms –Sorting & Graph algorithms

4 © Phillip B. Gibbons Lecture 2 Outline Modeling the Multicore Hierarchy –PMH model Algorithm Designer’s model exposing Hierarchy –Multi-BSP model Quest for a Simplified Hierarchy Abstraction Algorithm Designer’s model abstracting Hierarchy –Parallel Cache-Oblivious (PCO) model Space-Bounded Schedulers –Revisit PCO model

5 © Phillip B. Gibbons 32-core Xeon 7500 Multi-core [hierarchy diagram]: up to 1 TB Main Memory; 4 sockets; each socket has a 24MB Shared L3 Cache and 8 cores; each core has a 256KB L2 cache, a 32KB L1 cache, and 2 HW threads.

6 © Phillip B. Gibbons 48-core AMD Opteron [hierarchy diagram]: up to 0.5 TB Main Memory; 4 sockets; each socket has a 12MB Shared L3 Cache and 12 cores (P); each core has a 512KB L2 cache and a 64KB L1 cache.

7 © Phillip B. Gibbons How to Model the Hierarchy (?) [tree-of-caches diagram] The "Tree of Caches" abstraction captures existing multi-core hierarchies: the Parallel Memory Hierarchy (PMH) model [Alpern, Carter, Ferrante '93].

8 © Phillip B. Gibbons (Symmetric) PMH Model [diagram]: each level-i cache has capacity M_i, block size B_i, miss cost C_i, and fanout f_i.
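To make these parameters concrete, here is a minimal C++ sketch of a symmetric PMH description as an array of levels. The struct layout and the miss-cost numbers are assumptions for illustration only; the capacities and fanouts loosely follow the Xeon 7500 slide above (the 1 TB main-memory level and the 4-socket fanout are omitted for brevity).

```cpp
// A minimal sketch of a symmetric PMH description (hypothetical layout).
// Miss costs (C_i) are placeholder values, not measured numbers.
#include <cstdint>
#include <vector>

struct PMHLevel {
    std::uint64_t capacity_bytes;  // M_i: cache capacity at level i
    std::uint64_t block_bytes;     // B_i: block (line) size at level i
    double        miss_cost;       // C_i: cost of a miss at level i
    unsigned      fanout;          // f_i: number of level-(i-1) children per level-i cache
};

// Example: a 3-level tree loosely modeled on the 32-core Xeon 7500 slide.
// Capacities and fanouts follow the slide; miss costs are hypothetical.
std::vector<PMHLevel> xeon7500_like() {
    return {
        {32u * 1024,          64,  10.0, 2},  // L1: 32KB, 2 HW threads below it
        {256u * 1024,         64,  30.0, 1},  // L2: 256KB, one L1 below it
        {24ull * 1024 * 1024, 64, 200.0, 8},  // L3: 24MB, shared by 8 cores
    };
}
```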

9 © Phillip B. Gibbons PMH captures: the PEM model [Arge, Goodrich, Nelson, Sitchinava '08], a p-processor machine with private caches; the Shared Cache model discussed in Lecture 1; and the Multicore Cache model [Blelloch et al. '08]. [diagram: Memory over a Shared L2 Cache over per-CPU L1 caches; h = 3]

10 © Phillip B. Gibbons Lecture 2 Outline Modeling the Multicore Hierarchy –PMH model Algorithm Designer’s model exposing Hierarchy –Multi-BSP model Quest for a Simplified Hierarchy Abstraction Algorithm Designer’s model abstracting Hierarchy –Parallel Cache-Oblivious (PCO) model Space-Bounded Schedulers –Revisit PCO model

11 © Phillip B. Gibbons How to Design Algorithms (?) Design to the Tree-of-Caches abstraction: Multi-BSP Model [Valiant '08] – 4 parameters/level: cache size, fanout, latency/sync cost, transfer bandwidth – Bulk-synchronous. [tree-of-caches diagram]

12 © Phillip B. Gibbons Bridging Models. [diagram: Hardware – Multi-BSP (p_1, L_1, g_1, m_1, …) – Software, where the arrows mean "can efficiently simulate on"] (Slides from Les Valiant)

13 © Phillip B. Gibbons Multi-BSP: Level j component [diagram]: p_j level-(j−1) components together with a level-j memory of size m_j; g_{j−1} is the data rate between the level-(j−1) components and the level-j memory, g_j is the data rate to the level above, and L_j is the synchronization cost.

14 © Phillip B. Gibbons Multi-BSP: Level 1 component [diagram]: p_1 processors (level 0 = processor) together with a level-1 memory of size m_1; g_0 = 1 is the data rate between processors and the level-1 memory, g_1 is the data rate to the level above, and L_1 = 0 is the synchronization cost.

15 © Phillip B. Gibbons Multi-BSP: like BSP except 1. not 1 level, but a d-level tree, and 2. memory (cache) size m is a further parameter at each level. I.e., machine H has 4d+1 parameters; e.g., d = 3 and (p_1, g_1, L_1, m_1), (p_2, g_2, L_2, m_2), (p_3, g_3, L_3, m_3).
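A minimal sketch of how the 4d+1 Multi-BSP parameters might be written down in code, assuming a simple per-level struct; the d = 3 numbers below are hypothetical, not taken from any real machine.

```cpp
// A minimal sketch of a Multi-BSP machine description (hypothetical values).
#include <cstddef>
#include <vector>

struct MultiBSPLevel {
    unsigned    p;  // p_i: number of level-(i-1) components per level-i component
    double      g;  // g_i: data rate (cost per word) to/from level-i memory
    double      L;  // L_i: synchronization cost at level i
    std::size_t m;  // m_i: memory (cache) size at level i, in words
};

// d = 3 example; the machine has 4d + 1 parameters (d itself plus 4 per level).
const std::vector<MultiBSPLevel> machine = {
    {2,  1.0,   0.0, 1u << 15},  // level 1: L_1 = 0 by convention
    {8,  4.0,  50.0, 1u << 21},  // level 2
    {4, 16.0, 500.0, 1u << 28},  // level 3
};
```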

16 © Phillip B. Gibbons Optimal Multi-BSP Algorithms. A Multi-BSP algorithm A* is optimal with respect to algorithm A if (i) Comp(A*) = Comp(A) + low-order terms, (ii) Comm(A*) = O(Comm(A)), and (iii) Synch(A*) = O(Synch(A)), where Comm(A) and Synch(A) are optimal among Multi-BSP implementations, Comp is the total computational cost, and the O() constants are independent of the model parameters. [Valiant '08] presents optimal algorithms for Matrix Multiply, FFT, Sorting, etc. (simple variants of known algorithms and lower bounds).

17 © Phillip B. Gibbons Lecture 2 Outline Modeling the Multicore Hierarchy –PMH model Algorithm Designer’s model exposing Hierarchy –Multi-BSP model Quest for a Simplified Hierarchy Abstraction Algorithm Designer’s model abstracting Hierarchy –Parallel Cache-Oblivious (PCO) model Space-Bounded Schedulers –Revisit PCO model

18 © Phillip B. Gibbons How to Design Algorithms (?) Design to the Tree-of-Caches abstraction: Multi-BSP Model – 4 parameters/level: cache size, fanout, latency/sync cost, transfer bandwidth – Bulk-synchronous. Our Goal: be hierarchy-savvy ~ simplicity of the Cache-Oblivious Model – Handles dynamic, irregular parallelism – Co-design with smart thread schedulers. [tree-of-caches diagram]

19 © Phillip B. Gibbons Abstract Hierarchy: Simplified View What yields good hierarchy performance? Spatial locality: use what’s brought in –Popular sizes: Cache lines 64B; Pages 4KB Temporal locality: reuse it Constructive sharing: don’t step on others’ toes How might one simplify the view? Approach 1: Design to a 2 or 3 level hierarchy (?) Approach 2: Design to a sequential hierarchy (?) Approach 3: Do both (??)

20 © Phillip B. Gibbons Sequential Hierarchies: Simplified View. External Memory Model (see [Vitter '01]): a simple model with only 2 levels and only 1 "cache"; the goal is to minimize I/Os. [diagram: Main Memory (size M) and External Memory, transferred in blocks of size B] Can be a good choice if the bottleneck is the last level.
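As a hedged illustration of what "minimize I/Os" buys (the numbers are hypothetical), counting block transfers for a simple contiguous scan:

```latex
% Hypothetical numbers: scanning N contiguously stored items costs ~N/B I/Os.
\[
  N = 10^{9} \text{ items},\quad B = 10^{3} \text{ items per block}
  \;\Longrightarrow\;
  \#\text{I/Os} \approx \lceil N/B \rceil = 10^{6},
\]
\[
  \text{versus } 10^{9} \text{ transfers if every item access went to external memory.}
\]
```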

21 © Phillip B. Gibbons Sequential Hierarchies: Simplified View. Cache-Oblivious (Ideal Cache) Model [Frigo et al. '99]: a twist on the EM model in which M and B are unknown to the algorithm; still a simple model, single CPU only (all caches shared). Key algorithm goal: good performance for any M and B. This yields guaranteed good cache performance at all levels of the hierarchy and encourages hierarchical locality. [diagram: Main Memory (size M) and External Memory, block size B]

22 © Phillip B. Gibbons Example Paradigms Achieving the Key Goal. Scan, e.g., computing the sum of N items: N/B misses, for any B (optimal). Divide-and-Conquer, e.g., matrix multiply C = A*B: partition each matrix into quadrants, so that C11 = A11*B11 + A12*B21, C12 = A11*B12 + A12*B22, C21 = A21*B11 + A22*B21, C22 = A21*B12 + A22*B22. Divide: recursively compute the eight quadrant products A11*B11, …, A22*B22. Conquer: compute the 4 quadrant sums. Uses a recursive Z-order layout. O(N²/B + N³/(B√M)) misses (optimal).
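Below is a minimal sketch of the divide-and-conquer multiply from the slide, computing C += A*B for square power-of-two n (so C must be zero-initialized to obtain C = A*B). For readability it uses a plain row-major layout with a leading dimension rather than the slide's recursive Z-order layout, and the base-case cutoff of 32 is an arbitrary choice; the recursive structure is what gives the cache-oblivious miss bound.

```cpp
// Recursive (cache-oblivious) matrix multiply sketch: C += A * B.
#include <cstddef>

// Multiply the n x n blocks A and B and add into C; ld is the leading
// dimension (full matrix width) of the underlying row-major arrays.
void mm(const double* A, const double* B, double* C,
        std::size_t n, std::size_t ld) {
    if (n <= 32) {                       // small base case: ordinary triple loop
        for (std::size_t i = 0; i < n; ++i)
            for (std::size_t k = 0; k < n; ++k)
                for (std::size_t j = 0; j < n; ++j)
                    C[i * ld + j] += A[i * ld + k] * B[k * ld + j];
        return;
    }
    std::size_t h = n / 2;               // split each matrix into 4 quadrants
    const double *A11 = A,          *A12 = A + h,
                 *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B,          *B12 = B + h,
                 *B21 = B + h * ld, *B22 = B + h * ld + h;
    double       *C11 = C,          *C12 = C + h,
                 *C21 = C + h * ld, *C22 = C + h * ld + h;
    // Divide: the 8 recursive quadrant products; the quadrant sums of the
    // "conquer" step are folded in because each call accumulates into C.
    mm(A11, B11, C11, h, ld);  mm(A12, B21, C11, h, ld);
    mm(A11, B12, C12, h, ld);  mm(A12, B22, C12, h, ld);
    mm(A21, B11, C21, h, ld);  mm(A22, B21, C21, h, ld);
    mm(A21, B12, C22, h, ld);  mm(A22, B22, C22, h, ld);
}
```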

23 © Phillip B. Gibbons Multicore Hierarchies: Key Challenge. The theory underlying the Ideal Cache Model falls apart once parallelism is introduced: good performance for any M & B on 2 levels DOES NOT imply good performance at all levels of the hierarchy. Key reason: caches are not fully shared. [diagram: a Shared L2 Cache over private L1 caches for CPU1, CPU2, CPU3] What's good for CPU1 is often bad for CPU2 & CPU3, e.g., all want to write block B at ≈ the same time.

24 © Phillip B. Gibbons Multicore Hierarchies, Key New Dimension: Scheduling. The scheduling of parallel threads has a LARGE impact on cache performance. Key reason: caches are not fully shared. Recall our problem scenario: all CPUs want to write block B at ≈ the same time. We can mitigate (but not solve) this if we can schedule the writes to be far apart in time. [diagram: a Shared L2 Cache over private L1 caches for CPU1, CPU2, CPU3]

25 © Phillip B. Gibbons Constructive Sharing. Destructive sharing: threads compete for the limited on-chip cache and "flood" the off-chip pins. Constructive sharing: threads share a largely overlapping working set. [diagram: processors with private L1s over a Shared L2 Cache and interconnect, for each case]

26 © Phillip B. Gibbons Recall: Low-Span + Cache-Oblivious. Guarantees on a scheduler's cache performance depend on the computation's depth D. E.g., work stealing on a single level of private caches – Thrm: for any computation with fork-join parallelism, O(M P D / B) more misses on P cores than on 1 core. Approach: design parallel algorithms with – low span, and – good performance in the Cache-Oblivious Model. Thrm: for any computation with fork-join parallelism, only O(M_i P D / B_i) more misses than on 1 core at each level i of a hierarchy of private caches. But: no such guarantees for a general tree-of-caches.
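A hedged numerical illustration of the first theorem's bound, with hypothetical parameters and ignoring the constant hidden in the O(·):

```latex
% Hypothetical parameters; the constant hidden in the O() is ignored.
\[
  M = 256\,\mathrm{KB},\quad B = 64\,\mathrm{B},\quad P = 32,\quad D = 10^{3}
  \;\Longrightarrow\;
  \frac{M \cdot P \cdot D}{B} = \frac{2^{18}\cdot 32 \cdot 10^{3}}{2^{6}}
  \approx 1.3\times 10^{8} \text{ additional misses.}
\]
```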

27 © Phillip B. Gibbons Lecture 2 Outline Modeling the Multicore Hierarchy –PMH model Algorithm Designer’s model exposing Hierarchy –Multi-BSP model Quest for a Simplified Hierarchy Abstraction Algorithm Designer’s model abstracting Hierarchy –Parallel Cache-Oblivious (PCO) model Space-Bounded Schedulers –Revisit PCO model

28 © Phillip B. Gibbons Handling the Tree-of-Caches. To obtain guarantees for a general tree-of-caches, we define a Parallel Cache-Oblivious (PCO) Model and a corresponding Space-Bounded Scheduler.

29 © Phillip B. Gibbons A Problem with Using the CO Model. Consider P subtasks, each reading the same M/B blocks (a_1, a_2, …) in the same order. Misses charged in the CO model: M/B. But under any greedy parallel schedule on a shared cache of size M_p = M, all P processors suffer all misses in parallel, for P · M/B misses in total. The carry-forward rule is too optimistic. [diagram: one CPU with a size-M cache vs. P CPUs sharing a size-M_p cache]

30 © Phillip B. Gibbons Parallel Cache-Oblivious (PCO) Model [Blelloch, Fineman, G, Simhadri '11]. Differs from the cache-oblivious model in how cache state is carried forward: cache state is carried forward according to some sequential order. Case 1: the task fits in the cache (size M, block size B). All subtasks start with the same state; at the join, their states are merged and carried forward. [diagram]

31 © Phillip B. Gibbons Parallel Cache-Oblivious Model (2). Case 2: the task does not fit in the cache. All subtasks start with an empty state, and the cache state is set to empty at the join. [diagram]
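The two cases can be summarized as a small recursion over a fork-join task tree. The sketch below only illustrates how cache state is carried forward (Case 1 vs. Case 2); the CacheState and Task types are hypothetical, and the leaf-level miss accounting of the PCO model is omitted.

```cpp
// A minimal sketch of the PCO carry-forward rule (case analysis only).
#include <cstddef>
#include <set>
#include <vector>

using Block = std::size_t;
using CacheState = std::set<Block>;   // blocks assumed resident

struct Task {
    std::size_t footprint_bytes;      // s(t;B): the task's working-set size
    std::vector<Task> children;       // parallel subtasks (fork-join)
};

// Returns the cache state carried past task `t`, given the state `kappa`
// on entry and the cache size M.
CacheState carry_forward(const Task& t, CacheState kappa, std::size_t M) {
    if (t.children.empty()) {
        // Leaf: processed sequentially; which blocks are added or evicted
        // is omitted in this sketch.
        return kappa;
    }
    if (t.footprint_bytes <= M) {
        // Case 1: the task fits in cache. Every subtask starts with the same
        // state kappa; at the join the subtasks' states are merged (union)
        // and carried forward according to some fixed sequential order.
        CacheState merged = kappa;
        for (const Task& c : t.children) {
            CacheState after_c = carry_forward(c, kappa, M);
            merged.insert(after_c.begin(), after_c.end());
        }
        return merged;
    } else {
        // Case 2: the task does not fit in cache. Every subtask starts with
        // an empty state, and the state is set to empty again at the join.
        for (const Task& c : t.children) {
            carry_forward(c, CacheState{}, M);
        }
        return CacheState{};
    }
}
```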

32 © Phillip B. Gibbons PCO Cache Complexity Q*. Bounds assume M = Ω(B²). All algorithms are work optimal, and the Q* bounds match both the CO bounds and the best sequential algorithm bounds. See [Blelloch, Fineman, G, Simhadri '11] for details.

33 © Phillip B. Gibbons Lecture 2 Outline Modeling the Multicore Hierarchy –PMH model Algorithm Designer’s model exposing Hierarchy –Multi-BSP model Quest for a Simplified Hierarchy Abstraction Algorithm Designer’s model abstracting Hierarchy –Parallel Cache-Oblivious (PCO) model Space-Bounded Schedulers –Revisit PCO model

34 © Phillip B. Gibbons Space-Bounded Scheduler [Chowdhury, Silvestri, Blakeley, Ramachandran '10]. Key ideas: Schedules a dynamically unfolding parallel computation on a tree-of-caches hierarchy; the computation exposes lots of parallelism. Assumes the space use (working-set size) of each task is known (or can be suitably estimated). Assigns a task to a cache C that fits the task's working set, and reserves the space in C. Recurses on the subtasks, using the CPUs and caches that share C (below C in the diagram). [tree-of-caches diagram]
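A minimal sketch of the assignment rule just described, assuming tasks carry an (estimated) working-set size and caches form a tree; the types are hypothetical, and real space-bounded schedulers also handle task queuing, anchoring of strands, and releasing reservations when tasks complete, all omitted here.

```cpp
// Space-bounded assignment sketch: place a task in the smallest cache
// whose remaining capacity fits its working set, then recurse below it.
#include <cstddef>
#include <vector>

struct Cache {
    std::size_t capacity;            // bytes in this cache
    std::size_t reserved = 0;        // bytes currently reserved for tasks
    std::vector<Cache> children;     // caches sharing this cache (the CPUs sit below the leaves)
};

struct Task {
    std::size_t space;               // (estimated) working-set size in bytes
    std::vector<Task> subtasks;      // tasks exposed when this task unfolds
};

// Try to place `t` in the subtree rooted at `c`, preferring the smallest
// (deepest) cache that still fits the task's working set.
bool assign(Task& t, Cache& c) {
    if (t.space + c.reserved > c.capacity)
        return false;                          // does not fit here
    for (Cache& child : c.children)
        if (assign(t, child))                  // fits in a smaller cache below
            return true;
    c.reserved += t.space;                     // reserve the space in this cache
    // Recurse: subtasks are scheduled on the CPUs and caches below `c`,
    // so their shared working set stays resident in `c`.
    for (Task& sub : t.subtasks)
        assign(sub, c);
    // (Releasing the reservation when `t` finishes is omitted in this sketch.)
    return true;
}
```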

35 © Phillip B. Gibbons Space-Bounded Scheduler: advantages over the WS scheduler. Avoids cache overloading for shared caches. Exploits cache affinity for private caches.

36 © Phillip B. Gibbons Problem with the WS Scheduler: Cache Overloading. [figure: Hierarchy (focus on one cache): a 10MB shared cache over CPU 1 and CPU 2; Computation: parallel subtasks sharing read data, with 10MB and 8MB working sets, shown on a timeline] The overloaded cache introduces more cache (capacity) misses.

37 © Phillip B. Gibbons Space-Bounded Scheduler Avoids Cache Overloading. [figure: the same 10MB shared cache over CPU 1 and CPU 2, and the same parallel subtasks sharing read data (10MB, 8MB working sets), as scheduled by the space-bounded scheduler] The cache is not overloaded, so there are fewer cache misses.

38 © Phillip B. Gibbons Problem with WS Scheduler (2): Ignoring cache affinity Parallel tasks reading same data Shared memory 4MB 5MB 1MB each 5MB CPU Schedules any available task when a processor is idle All experience all cache misses and run slowly time CPU 1 CPU 2 CPU 3 CPU 4 Computation Hierarchy

39 © Phillip B. Gibbons Parallel tasks reading same data Shared memory 4MB 5MB 1MB each 5MB CPU Schedules any available task when a processor is idle All experience all cache misses and run slowly time CPU 1 CPU 2 CPU 3 CPU 4 Problem with WS Scheduler: Ignoring cache affinity Hierarchy Computation

40 © Phillip B. Gibbons Space-Bounded Scheduler Exploits Cache Affinity. [figure: the same hierarchy (shared memory over 5MB caches, 1MB per CPU) and computation (parallel tasks reading the same 4MB of data); timeline for CPU 1 through CPU 4] The scheduler pins the task to a 5MB cache to exploit affinity among its subtasks.

41 © Phillip B. Gibbons Analysis Approach. Goal: algorithm analysis should remain lightweight and agnostic of machine specifics. Analyze for a single cache level using the PCO model: unroll the algorithm into tasks that fit in a size-M cache (backed by an infinite-size main memory), and analyze each such task separately, starting from an empty cache. Cache complexity Q*(M) = total number of misses, summed across all tasks.

42 © Phillip B. Gibbons Analytical Bounds. Guarantees provided by our Space-Bounded Scheduler [Blelloch, Fineman, G, Simhadri '11]: Cache costs: optimal, ∑_i Q*(M_i) × C_i (summed over levels i), where C_i is the miss cost for level-i caches. Running time: for "sufficiently balanced" computations, optimal O((∑_i Q*(M_i) × C_i) / P) time on P cores. Our theorem on running time also allows arbitrary imbalance, with the performance depending on an imbalance penalty.
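A hedged two-level illustration of these bounds (all numbers are hypothetical):

```latex
% All numbers hypothetical: a 2-level hierarchy with P = 32 cores.
\[
  Q^*(M_1) = 10^{8},\; C_1 = 10 \text{ cycles};\qquad
  Q^*(M_2) = 10^{6},\; C_2 = 200 \text{ cycles}
\]
\[
  \text{cache cost} = 10^{8}\cdot 10 + 10^{6}\cdot 200 = 1.2\times 10^{9}
  \;\Longrightarrow\;
  \text{running time} = O\!\Big(\tfrac{1.2\times 10^{9}}{32}\Big)
  \approx 3.8\times 10^{7} \text{ cycles (if sufficiently balanced).}
\]
```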

43 © Phillip B. Gibbons Motivation for Imbalance Penalty. In a tree-of-caches, each subtree has a given amount of compute & cache resources. To avoid cache misses from migrating tasks, we would like to assign/pin each task to a subtree, but a given program task may not match both – e.g., it may need a large cache but few processors. We extend PCO with a cost metric that charges for such space-parallelism imbalance – an attribute of the algorithm, not the hierarchy – requiring only a minor additional assumption on the hierarchy.

44 © Phillip B. Gibbons Multi-core Computing Lectures: Progress-to-date on Key Open Questions How to formally model multi-core hierarchies? What is the Algorithm Designer’s model? What runtime task scheduler should be used? What are the new algorithmic techniques? How do the algorithms perform in practice? NEXT UP Lecture #3: Extensions

45 © Phillip B. Gibbons References
[Alpern, Carter, Ferrante '93] B. Alpern, L. Carter, and J. Ferrante. Modeling parallel computers as memory hierarchies. Programming Models for Massively Parallel Computers, 1993.
[Arge, Goodrich, Nelson, Sitchinava '08] L. Arge, M. T. Goodrich, M. Nelson, and N. Sitchinava. Fundamental parallel algorithms for private-cache chip multiprocessors. ACM SPAA, 2008.
[Blelloch et al. '08] G. E. Blelloch, R. A. Chowdhury, P. B. Gibbons, V. Ramachandran, S. Chen, and M. Kozuch. Provably good multicore cache performance for divide-and-conquer algorithms. ACM-SIAM SODA, 2008.
[Blelloch, Fineman, G, Simhadri '11] G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. V. Simhadri. Scheduling irregular parallel computations on hierarchical caches. ACM SPAA, 2011.
[Chowdhury, Silvestri, Blakeley, Ramachandran '10] R. A. Chowdhury, F. Silvestri, B. Blakeley, and V. Ramachandran. Oblivious algorithms for multicores and network of processors. IPDPS, 2010.
[Frigo et al. '99] M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. IEEE FOCS, 1999.
[Valiant '08] L. G. Valiant. A bridging model for multi-core computing. ESA, 2008.
[Vitter '01] J. S. Vitter. External memory algorithms and data structures. ACM Computing Surveys 33:2, 2001.