Divide & Conquer: Efficient Java for the multi-core world
Velmurugan Periasamy, Sunil Mundluri
VeriSign, Inc.


Slide 2: About us. Java performance engineers working at Verisign. (Image: Formula 1 pit stop, Wikimedia Commons File:Alonso_Renault_Pitstop_Chinese_GP_2008.jpg)

Slide 3: Agenda
- Problem statement
- Efficiently leveraging hardware
- Fine-grained parallelism
- The Fork-Join framework in Java 7
- Implementation examples / demo

Slide 4: Where should we look for performance improvements going forward?

Slide 5: Current trend
- No more increases in clock speed
- An increasing number of cores

Slide 6: We need to build applications that can efficiently utilize hardware resources (multiple cores). Hint: more parallelism.

Slide 7: Background
Improving application efficiency with a single CPU: improve end-user-perceived performance (i.e., hide latency) by performing some work asynchronously, using threads.

Slide 8: Background
Keeping up application efficiency with multiple CPUs: concurrent user requests (coarse-grained parallelism), e.g. in web servers and application servers, with thread pools for scheduling tasks.

Slide 9: Background
How do we continue this efficient utilization with many cores? Concurrent user requests may not be enough to keep the CPUs busy; we need more parallelism.

Slide 10: Let's clarify some terms
- Concurrency ≠ Parallelism
- Multithreading ≠ Parallelism

Slide 11: A few definitions
- Concurrency: a condition that exists when at least two threads are making progress
- Parallelism: a condition that arises when at least two threads are executing simultaneously
- Multithreading: execution occurs in more than one thread of control within a process, using parallel or concurrent processing
Source: Sun's Multithreaded Programming Guide

Slide 12: Concurrency vs. parallelism (diagram).

Slide 13: We need to build applications that can efficiently utilize hardware resources (multiple cores).

Slide 14: Possibilities for efficient utilization: server virtualization.

Slide 15: Possibilities for efficient utilization: other approaches to concurrency
- Message-passing concurrency, as implemented by languages like Erlang, Scala, and Haskell (thousands of lightweight processes)
- Software transactional memory (e.g., Clojure)

Slide 16: Possibilities for efficient utilization: fine-grained parallelism.

Slide 17: Fine-grained parallelism
- Focus on tasks rather than threading
- Find appropriate tasks that can be efficiently parallelized
- The finer the granularity, the better the potential performance boost
- Proper functional decomposition is key

Slide 18: Fine-grained parallelism at work.

Slide 19: Where to look for fine-grained parallelism, and where not
- CPU-intensive operations are good candidates
- Sorting, searching, filtering, and aggregation operations are especially well suited to parallelization
- Network servers that spend a great deal of time waiting on network I/O may gain little from parallelization

Slide 20: Amdahl's Law
- P = the proportion of a program that can be made parallel
- (1 − P) = the proportion that cannot be parallelized (remains serial)
- The maximum speedup achievable with N processors is S(N) = 1 / ((1 − P) + P/N)
- Example: if P = 90% and (1 − P) = 10%, the program can be sped up by at most a factor of 10, no matter how large N (the number of processors/cores) gets
The formula and the worked example are written out below.
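Spelled out from the slide's definitions, with its example worked through:

```latex
S(N) = \frac{1}{(1 - P) + \frac{P}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1 - P},
\qquad
P = 0.9 \;\Rightarrow\; S_{\max} = \frac{1}{0.1} = 10.
```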

Slide 21: Fine-grained parallelism challenges
- How do we decompose an application into units of work that can be executed in parallel?
- How do we properly scale the parallelism?
- Per Amdahl's Law, look to maximize the parallelizable portion of the application
- We need new techniques, frameworks, and libraries

Slide 22: Divide-and-conquer: in use long before computers. Basic example: how mail gets delivered.

Slide 23: The divide-and-conquer algorithm: recursively break the problem down into subproblems, then combine the solutions to the subproblems to arrive at the final result.

Slide 24: How does this help us achieve fine-grained parallelism?

Slide 25: Fork-Join framework (diagram).

Slide 26: Fork-Join framework
The parallel version of divide-and-conquer:
1. Recursively break the problem down into subproblems
2. Solve the subproblems in parallel
3. Combine the solutions to the subproblems to arrive at the final result
The general form is sketched below.
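A compilable sketch of that general form, assuming a two-way split (the class and method names here are ours, not the framework's; the concrete tasks later in the deck follow this shape):

```java
import java.util.concurrent.RecursiveTask;

// Skeleton of the fork-join divide-and-conquer pattern.
public abstract class DivideAndConquerTask<R> extends RecursiveTask<R> {
    @Override
    protected R compute() {
        if (isSmall()) {
            return solveSequentially();               // base case: solve directly
        }
        DivideAndConquerTask<R> left = split(true);   // 1. break into subproblems
        DivideAndConquerTask<R> right = split(false);
        left.fork();                                  // 2. solve subproblems in parallel:
        R rightResult = right.compute();              //    run one half in this thread,
        R leftResult = left.join();                   //    wait for the forked half
        return combine(leftResult, rightResult);      // 3. combine the sub-results
    }

    protected abstract boolean isSmall();
    protected abstract R solveSequentially();
    protected abstract DivideAndConquerTask<R> split(boolean firstHalf);
    protected abstract R combine(R a, R b);
}
```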

Slide 27: Fork-Join framework (diagram, continued).

Slide 28: Java library for the Fork-Join framework
- Started with the JSR 166 effort
- JDK 7 includes it in java.util.concurrent
- Can be used with JDK 6: download the jsr166y package maintained by Doug Lea

Slide 29: Java library for the Fork-Join framework
- ForkJoinPool: an executor service for running fork-join tasks
- Several styles of ForkJoinTask: RecursiveAction (no result) and RecursiveTask (returns a result)

Slide 30: Fork-Join framework
Executing a task means:
- Creating two or more new subtasks (fork)
- Suspending the current task until the new tasks complete (join)
- The invoke-in-parallel operation is key
- A portable way to express a parallelizable algorithm

Slide 31: Selecting the maximum number: the sequential algorithm (a sketch follows).
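The slide's code did not survive the transcript; a minimal sequential sketch (the long[] input is our assumption):

```java
// Sequential max: a single pass over the array.
public class SequentialMax {
    public static long max(long[] data) {
        long max = Long.MIN_VALUE;
        for (long value : data) {
            if (value > max) max = value;
        }
        return max;
    }
}
```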

Slide 32: Selecting the max with the Fork-Join framework
A fork-join task:
- Solve sequentially if the problem is small
- Otherwise, create subtasks and fork
- Join the results
A sketch of such a task follows.
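A minimal RecursiveTask sketch following those rules (the class name, long[] input, and threshold are our assumptions, since the slide's original code was not preserved):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Find the maximum of a long[] with fork-join.
public class MaxTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10_000; // tune per workload
    private final long[] data;
    private final int lo, hi;                    // half-open range [lo, hi)

    public MaxTask(long[] data, int lo, int hi) {
        this.data = data; this.lo = lo; this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {              // small enough: solve sequentially
            long max = Long.MIN_VALUE;
            for (int i = lo; i < hi; i++) max = Math.max(max, data[i]);
            return max;
        }
        int mid = (lo + hi) >>> 1;
        MaxTask left = new MaxTask(data, lo, mid);
        MaxTask right = new MaxTask(data, mid, hi);
        left.fork();                             // run the left half asynchronously
        long rightMax = right.compute();         // reuse this thread for the right half
        long leftMax = left.join();              // wait for the forked subtask
        return Math.max(leftMax, rightMax);
    }

    public static void main(String[] args) {
        long[] data = new long[1_000_000];
        java.util.Random rnd = new java.util.Random(42);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextLong();
        ForkJoinPool pool = new ForkJoinPool(); // defaults to one worker per core
        System.out.println(pool.invoke(new MaxTask(data, 0, data.length)));
    }
}
```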

Slide 33: Another example: merge sort, a recursive algorithm (a sketch follows).
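Again the slide's code was lost; a compact sketch of the textbook recursive algorithm (the names are ours):

```java
import java.util.Arrays;

// Sequential recursive merge sort.
public class MergeSort {
    public static void sort(int[] a) {
        if (a.length > 1) {
            int mid = a.length / 2;
            int[] left = Arrays.copyOfRange(a, 0, mid);
            int[] right = Arrays.copyOfRange(a, mid, a.length);
            sort(left);            // divide: sort each half recursively
            sort(right);
            merge(a, left, right); // conquer: combine the sorted halves
        }
    }

    static void merge(int[] dest, int[] left, int[] right) {
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            dest[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length)  dest[k++] = left[i++];
        while (j < right.length) dest[k++] = right[j++];
    }
}
```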

Slide 34: Merge sort with the Fork-Join framework: a task class replaces the recursive method, forking the two halves and joining them before the merge (a sketch follows).
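A sketch of what such a class can look like (threshold and names are our choices, not the slide's exact code): a RecursiveAction that falls back to a sequential sort for small arrays and otherwise invokes the two halves in parallel.

```java
import java.util.Arrays;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveAction;

// Parallel merge sort: the RecursiveAction replaces the recursive method.
public class ParallelMergeSort extends RecursiveAction {
    private static final int THRESHOLD = 8_192; // below this, sort sequentially
    private final int[] a;

    public ParallelMergeSort(int[] a) { this.a = a; }

    @Override
    protected void compute() {
        if (a.length <= THRESHOLD) {
            Arrays.sort(a);                       // small: sequential fallback
            return;
        }
        int mid = a.length / 2;
        int[] left = Arrays.copyOfRange(a, 0, mid);
        int[] right = Arrays.copyOfRange(a, mid, a.length);
        invokeAll(new ParallelMergeSort(left),    // fork both halves and join
                  new ParallelMergeSort(right));  // them (invoke-in-parallel)
        merge(a, left, right);
    }

    private static void merge(int[] dest, int[] left, int[] right) {
        int i = 0, j = 0, k = 0;
        while (i < left.length && j < right.length)
            dest[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
        while (i < left.length)  dest[k++] = left[i++];
        while (j < right.length) dest[k++] = right[j++];
    }

    public static void main(String[] args) {
        int[] data = new int[1_000_000];
        java.util.Random rnd = new java.util.Random(7);
        for (int i = 0; i < data.length; i++) data[i] = rnd.nextInt();
        new ForkJoinPool().invoke(new ParallelMergeSort(data));
        System.out.println(data[0] + " ... " + data[data.length - 1]);
    }
}
```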

Slide 35: Fork-Join task demo.

Slide 36: Fork-Join framework implementation
- Raw threads (Thread.start(), Thread.join()): may require more threads than the VM can support
- Conventional thread pools: can run into thread-starvation deadlock, because fork-join tasks spend much of their time waiting on other tasks
- ForkJoinPool: a thread-pool executor optimized for fork-join tasks

Slide 37: Fork-Join framework implementation
- A limited number of worker threads
- Each worker has a private double-ended work queue (deque)
- On fork, a worker pushes the new task onto the head of its own deque
- While waiting or idle, a worker pops a task off the head of its own deque and executes it (instead of sleeping)
- If its own deque is empty, it steals a task off the tail of the deque of another, randomly chosen worker
Source: Doug Lea, "A Java Fork/Join Framework"
A toy model of this discipline appears below.
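For illustration only, a toy model of that scheduling discipline (NOT ForkJoinPool's real implementation, which uses far more carefully engineered internal deques): each worker prefers the head of its own deque and steals from the tail of a random victim when it runs dry.

```java
import java.util.concurrent.ConcurrentLinkedDeque;
import java.util.concurrent.ThreadLocalRandom;

public class WorkStealingToy {
    static ConcurrentLinkedDeque<Runnable>[] deques;

    @SuppressWarnings("unchecked")
    public static void main(String[] args) throws InterruptedException {
        int n = Runtime.getRuntime().availableProcessors();
        deques = new ConcurrentLinkedDeque[n];
        for (int i = 0; i < n; i++) deques[i] = new ConcurrentLinkedDeque<>();
        // Seed worker 0 with all the work so stealing is visible.
        for (int t = 0; t < 100; t++) {
            final int id = t;
            deques[0].addFirst(() -> System.out.println(
                Thread.currentThread().getName() + " ran task " + id));
        }
        Thread[] workers = new Thread[n];
        for (int w = 0; w < n; w++) {
            final int me = w;
            workers[w] = new Thread(() -> {
                Runnable task;
                // Prefer our own head (LIFO); otherwise steal a random
                // victim's tail (FIFO). A worker gives up when both are
                // empty, a simplification a real scheduler would not make.
                while ((task = deques[me].pollFirst()) != null
                        || (task = deques[ThreadLocalRandom.current().nextInt(n)].pollLast()) != null) {
                    task.run();
                }
            }, "worker-" + w);
            workers[w].start();
        }
        for (Thread t : workers) t.join();
    }
}
```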

Slide 38: ForkJoinTask as-is can improve performance, but some boilerplate work is still needed. Is there a better way, i.e., a declarative specification of parallelization? There is.

Slide 39: Options for transparent parallelism: ParallelArray
- A ForkJoinPool paired with an array
- Versions for primitives and for objects
- Provides parallel aggregate operations
- Can be used on JDK 6+ with the extra166y package

Slide 40: Example: ParallelArray operations
- SQL-like: batches the operations into a single parallel step
- Selecting the max is a single step
- A filter operation and a map operation
- The ParallelArray is created with a ForkJoinPool
A sketch of this pipeline follows.
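A sketch in the spirit of Brian Goetz's published ParallelArray example, not the slide's exact code (the Student class, THIS_YEAR constant, and field names are illustrative assumptions; requires the jsr166y and extra166y jars):

```java
import jsr166y.ForkJoinPool;
import extra166y.Ops;
import extra166y.ParallelArray;

public class BestGpa {
    static class Student {
        int graduationYear;
        double gpa;
    }

    static final int THIS_YEAR = 2013;

    // Filter: keep only seniors.
    static final Ops.Predicate<Student> isSenior = new Ops.Predicate<Student>() {
        public boolean op(Student s) { return s.graduationYear == THIS_YEAR; }
    };

    // Map: project each student to a double (the GPA).
    static final Ops.ObjectToDouble<Student> gpaField = new Ops.ObjectToDouble<Student>() {
        public double op(Student s) { return s.gpa; }
    };

    static double bestGpa(Student[] data, ForkJoinPool pool) {
        // Filter, map, and max are batched into a single parallel step.
        ParallelArray<Student> students = ParallelArray.createUsingHandoff(data, pool);
        return students.withFilter(isSenior)
                       .withMapping(gpaField)
                       .max();
    }
}
```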

Slide 41: Options for transparent parallelism: features in JDK 8
- Parallel utility methods in the java.util.Arrays class: several variants of parallelSort, plus parallel set/prefix operations (parallelSetAll, parallelPrefix)
- Bulk data operations for collections
- A Stream interface supporting the "pipes and filters" pattern, with a parallel option: a.k.a. "filter/map/reduce for Java"
- Lambda expressions (JSR 335): closures ("anonymous function expressions") for concise, functional-style code
A short sketch of these features appears after this list.
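A minimal JDK 8 sketch of two of the features listed above: parallelSort, and a parallel filter/map/reduce pipeline over a stream.

```java
import java.util.Arrays;
import java.util.Random;

public class Jdk8Parallel {
    public static void main(String[] args) {
        int[] data = new Random(1).ints(1_000_000).toArray();
        Arrays.parallelSort(data);          // fork-join based parallel sort

        // filter/map/reduce, run in parallel across the common pool
        long sumOfEvenSquares = Arrays.stream(data)
                                      .parallel()
                                      .filter(x -> x % 2 == 0)
                                      .mapToLong(x -> (long) x * x)
                                      .sum();
        System.out.println(sumOfEvenSquares);
    }
}
```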

Slide 42: Lambdas make life easier. The slide contrasted the same code with and without lambdas; an illustrative pair follows.
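The slide's before/after code was lost; an assumed but representative pair, showing the same sort written with an anonymous inner class and then with a lambda:

```java
import java.util.Arrays;
import java.util.Comparator;

public class LambdaComparison {
    public static void main(String[] args) {
        String[] words = { "pear", "fig", "banana" };

        // Before: anonymous inner class boilerplate
        Arrays.sort(words, new Comparator<String>() {
            @Override
            public int compare(String a, String b) {
                return Integer.compare(a.length(), b.length());
            }
        });

        // After: a lambda expression says the same thing concisely
        Arrays.sort(words, (a, b) -> Integer.compare(a.length(), b.length()));

        System.out.println(Arrays.toString(words));
    }
}
```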

Slide 43: Fork-Join and Map-Reduce

|                          | Fork-Join                                                            | Map-Reduce                               |
|--------------------------|----------------------------------------------------------------------|------------------------------------------|
| Processing on            | Cores of a single compute node                                       | Independent compute nodes                |
| Division of tasks        | Dynamic                                                              | Decided at start-up                      |
| Inter-task communication | Allowed                                                              | No                                       |
| Task redistribution      | Work stealing                                                        | Speculative execution                    |
| When to use              | Data sizes that can be handled in one node                           | Big data                                 |
| Speed-up                 | Good speed-up with the number of cores; works with decent-size data  | Scales extremely well for huge data sets |

Slide 44: Demo: transparent parallelism.

Slide 45: Recap
- To leverage multiple cores, find finer-grained sources of parallelism
- Aggregate data operations are best suited
- Fork-Join (in JDK 7) offers a nice way to "portably express" a parallelizable algorithm
- Many exciting features are coming in JDK 8 to enable parallelism transparently

Slide 46: References
- Brian Goetz's JavaOne presentation
- Doug Lea, "A Java Fork/Join Framework"
- Sun's Multithreaded Programming Guide

Thank You

© 2013 VeriSign, Inc. All rights reserved. The VERISIGN word mark, the Verisign logo, and other Verisign trademarks, service marks, and designs that may appear herein are registered or unregistered trademarks or service marks of VeriSign, Inc. and its subsidiaries in the United States and in foreign countries. All other trademarks are property of their respective owners. Verisign has made efforts to ensure the accuracy and completeness of the information in this document; however, Verisign makes no warranties of any kind (whether express, implied, or statutory) with respect to the information contained herein, and assumes no liability to any party for any loss or damage (whether direct or indirect) caused by any errors, omissions, or statements of any kind contained in this document. Further, Verisign assumes no liability arising from the application or use of the products, services, or materials described or referenced herein and specifically disclaims any representation that any such products, services, or materials do not infringe upon any existing or future intellectual property rights.