On the Locality of Java 8 Streams in Real-Time Big Data Applications. Yu Chan, Ian Gray, Andy Wellings, Neil Audsley. Real-Time Systems Group, Computer Science, University of York, UK.

Presentation transcript:

On the Locality of Java 8 Streams in Real-Time Big Data Applications Yu Chan Ian Gray Andy Wellings Neil Audsley Real-Time Systems Group, Computer Science University of York, UK

Outline
- Context of the work
- Focus of the current paper
- Previous work on Stored Collections
- Java 8: Streams and Pipelines, and their relationship to the Fork/Join framework
- The impact of ccNUMA and locality on the Java 8 model
- Conclusions
(The Java 8 implementation of Streams and pipelines is very complex)

Context I
The JUNIPER EU project is currently investigating how the Java 8 platform, augmented by the RTSJ, can be used for real-time Big Data applications.

Context II
JUNIPER is interested in Big Data applications both on clusters of servers and on supercomputers.
- Here we are concerned with the cluster environment.
JUNIPER wants to use Java 8 streams to provide the underlying programming model for the individual programs executing on the server computers.

Context III
The Java support is targeted at the server computers contained within the clusters.
- It is not an alternative to, for example, the Hadoop framework, whose main concern is the distribution of the data.
Current work is considering how to extend the Java stream support to a distributed environment.

Context IV
A JUNIPER application consists of a set of Java 8 programs (augmented with the RTSJ) that are mapped to a distributed computing cluster, such as an internet-based cloud service.
Performance is critical for Big Data applications.
- We need to understand the impact of using Java streams and pipelines.
Currently, aicas is updating Jamaica for Java 8 and to support locality.

Focus of the Paper
To evaluate the JVM server-level support.
Java is architecture-neutral: the programming model essentially assumes SMP support.
- But servers nowadays tend to have a ccNUMA architecture.
The JVM has the responsibility of optimizing performance.
- But we are also interested in the potential to have FPGA accelerators.

Previous Work I
Java's built-in stream sources have a number of drawbacks for use in Big Data processing:
1. The in-memory sources (e.g. arrays and collections) store all their data in heap memory. This implies populating the collection before any operations can be performed, resulting in a potentially long delay while the population takes place. Furthermore, heap memory is small compared to disk space, so for Big Data computations there may not be enough heap memory to load the entire dataset from disk.
2. The file-based sources (e.g. BufferedReader.lines) produce sequential streams, making parallel execution of the pipeline impossible.

Previous Work II
To overcome these limitations, we have introduced the idea of a Stored Collection, which:
- reads its data from a file on demand, thus eliminating the initial population step
- generates a parallel stream to take advantage of multi-core hardware
Stored Collection programs are up to 1.44 times faster, and their heap usage is 2.35%-84.1% of that of the equivalent in-memory collection programs.
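The on-demand idea can be expressed in a few lines. The following is a minimal in-memory sketch, not the actual Stored Collection implementation: a LongFunction stands in for the file read, and LongStream.range supplies a well-splitting parallel source. All class and method names here are ours, chosen for illustration.

```java
import java.util.function.LongFunction;
import java.util.stream.LongStream;
import java.util.stream.Stream;

// Sketch of an on-demand stream source: elements are generated (or, in a
// real Stored Collection, read from file) only when the pipeline pulls them.
class StoredCollectionSketch<T> {
    private final long size;
    private final LongFunction<T> reader; // stands in for a per-element file read

    StoredCollectionSketch(long size, LongFunction<T> reader) {
        this.size = size;
        this.reader = reader;
    }

    // LongStream.range reports its exact size, so its spliterator partitions
    // evenly across cores, and no element is materialised before the
    // pipeline demands it - there is no up-front population step.
    Stream<T> parallelStream() {
        return LongStream.range(0, size).parallel().mapToObj(reader);
    }
}
```

A pipeline over such a source looks the same as one over an ordinary collection, e.g. `new StoredCollectionSketch<Long>(n, i -> i + 1).parallelStream()`.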

Streams and Pipelines

List<Integer> transactionIds =
    transactions.stream()
                .filter(t -> t.getType() == Transaction.GROCERY)
                .sorted(comparing(Transaction::getValue).reversed())
                .map(Transaction::getId)
                .collect(toList());

Lazy evaluation: the data is pulled through the stream, not pushed.
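This laziness can be observed directly: building the pipeline runs nothing, and a short-circuiting terminal operation pulls only as many elements as it needs. A self-contained sketch (the element values and counter are illustrative only):

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Stream;

class LazyDemo {
    // Builds a pipeline over five elements, then lets a short-circuiting
    // terminal operation pull just two of them through map().
    static int mappedCount() {
        AtomicInteger mapped = new AtomicInteger();
        Stream<Integer> pipeline = Stream.of(1, 2, 3, 4, 5)
                .map(x -> { mapped.incrementAndGet(); return x * 2; });
        // Nothing has executed yet: mapped.get() == 0 at this point.
        pipeline.limit(2).forEach(x -> { });   // pulls only two elements
        return mapped.get();                   // 2, not 5
    }

    public static void main(String[] args) {
        System.out.println("elements mapped: " + mappedCount());
    }
}
```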

Streams and Pipelines

class InputData {
    private long sensorReading;
    // ...
    public long getSensorReading() { return sensorReading; }
}

class OutputData {
    private byte[] hashedSensorReading;
    // ...
    public void setHashedSensorReading(byte[] hash) { hashedSensorReading = hash; }
}

Streams and Pipelines

class ProcessData {
    public void run() {
        Collection<InputData> inputs = ...;
        inputs.parallelStream().map(data -> { ... })
                               .forEach(outData -> { ... });
    }
}

(Diagram: input → stream operation → … → terminal operation)

Streams and Pipelines

class ProcessData {
    public void run() {
        Collection<InputData> inputs = ...;
        inputs.parallelStream().map(data -> {
            long value = data.getSensorReading();
            byte[] hash = new byte[32];
            SHA256 sha256 = new SHA256();
            for (int shift = 0; shift < 64; shift += 8)
                sha256.hash((byte) (value >> shift));
            sha256.digest(hash);
            OutputData out = new OutputData();
            out.setHashedSensorReading(hash);
            // ...
            return out;
        }).forEach(outData -> { ... });
    }
}
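The SHA256 class above is the authors' own; a self-contained equivalent of the hashing step can be written with the JDK's standard MessageDigest API. This sketch is ours and only approximates the slide's per-element work:

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

class HashSketch {
    // Hashes the 8 bytes of a long with the JDK's SHA-256 implementation,
    // feeding the bytes least-significant first, as in the slide's loop.
    static byte[] hashLong(long value) {
        try {
            MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
            for (int shift = 0; shift < 64; shift += 8)
                sha256.update((byte) (value >> shift));
            return sha256.digest();   // 32-byte digest
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError(e); // SHA-256 is mandated in every JDK
        }
    }
}
```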

Streams and the Fork/Join Framework
Each parallel stream source can provide a spliterator, which partitions the stream.
Internally, the Java 8 stream support calls the spliterator to generate sub-streams.
Each sub-stream is then processed by a task submitted to the default fork/join pool.
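The partitioning step can be seen directly on an in-memory source: trySplit() carves off roughly half of the remaining elements for another fork/join task, and the framework recurses on the halves. A small demonstration (our own, using only the standard Spliterator API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Spliterator;

class SplitDemo {
    // Performs one split and returns the estimated sizes of the two halves.
    static long[] splitOnce(List<Integer> data) {
        Spliterator<Integer> right = data.spliterator();
        Spliterator<Integer> left = right.trySplit(); // carves off a prefix
        return new long[] { left.estimateSize(), right.estimateSize() };
    }

    public static void main(String[] args) {
        long[] halves = splitOnce(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8));
        System.out.println(halves[0] + " + " + halves[1]);
    }
}
```

For an array-backed source of eight elements the split is even: four elements each. Where the halves land in memory is exactly the locality question: on a ccNUMA machine the two sub-ranges may live on different nodes.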

In-core Stream Sources and Locality
- Here the memory used to hold the partitioned stream source spans two ccNUMA nodes.
- Hence threads executing the tasks may be accessing remote memory.
- In our experimental set-up, remote access is 18% slower than local access.
- Setting thread affinities does not necessarily help.

Experimental Setup
2 GHz AMD Opteron 8350 running Ubuntu
- 16 cores, 4 cores per NUMA node
- 2 MB L2 cache: 512 KB per node
- 2 MB of shared L3 cache
- 16 GB of main memory: 4 GB per node
- Swap disabled
Java SE 8u5
- 14 GB initial and maximum heap memory
- GC avoided by reusing objects

Experiment
Measure the main processing time of computing the SHA-256 cryptographic hash function on consecutive long integers starting from 1:
- without thread affinity
- binding one thread to one core
- binding not more than 4 threads to each NUMA node
Use an array-backed stream and a stored-collection-backed stream.
For the stored collection, the data is created when needed rather than read from disk.

Performance of Array-backed Streams (graph: cumulative histograms of long-integer runs, 200 runs)

Performance of Stored Collection-backed Streams (graph: long-integer runs)

Experiment
Measure the execution time of computing the SHA-256 cryptographic hash function on consecutive long integers starting from 1:
- without thread affinity
- binding one thread to one core
- binding not more than 4 threads to each NUMA node
Use an array-backed stream and a stored-collection-backed stream.
This stream source is on disk, and hence more similar to a Big Data application.

Array-backed versus Stored Collection-backed Streams (graph: Array vs Stored Collection, long-integer runs)

Conclusions
The goal of this work has been, in the context of Java 8 streams and pipelines, to understand what impact a ccNUMA architecture will have on the ability of a JVM to optimize performance without programmer help.
If we just use thread affinity, we may undermine any attempt made by the JVM to optimize.
Stored collections backed by a partitioned heap (or physical scoped memory areas) should allow the programmer more control and enforce locality of access.