Yak: A High-Performance Big-Data-Friendly Garbage Collector

Slides:

Advertisements

Similar presentations

Object-Orientation Meets Big Data Language Techniques towards Highly- Efficient Data-Intensive Computing Harry Xu UC Irvine.

Advertisements

Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,

Data-Intensive Computing with MapReduce/Pig Pramod Bhatotia MPI-SWS Distributed Systems – Winter Semester 2014.

APACHE GIRAPH ON YARN Chuan Lei and Mohammad Islam.

Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.

Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans Qifa Ke, Michael Isard, Yuan Yu Microsoft Research Silicon Valley EuroSys 2013.

Dynamic Tainting for Deployed Java Programs Du Li Advisor: Witawas Srisa-an University of Nebraska-Lincoln 1.

Facade: A Compiler and Runtime for (Almost) Object-Bounded Big Data Applications UC Irvine USA Khanh Nguyen Khanh Nguyen, Kai Wang, Yingyi Bu, Lu Fang,

PARALLEL DBMS VS MAP REDUCE “MapReduce and parallel DBMSs: friends or foes?” Stonebraker, Daniel Abadi, David J Dewitt et al.

Hadoop Ecosystem Overview

Committed to Deliver….  We are Leaders in Hadoop Ecosystem.  We support, maintain, monitor and provide services over Hadoop whether you run apache Hadoop,

Hyracks: A new partitioned-parallel platform for data-intensive computation Vinayak Borkar UC Irvine (joint work with M. Carey, R. Grover, N. Onose, and.

Joe Hummel, PhD Visiting Researcher: U. of California, Irvine Adjunct Professor: U. of Illinois, Chicago & Loyola U., Chicago Materials:

Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.

Introduction to Hadoop and HDFS

Harp: Collective Communication on Hadoop Bingjing Zhang, Yang Ruan, Judy Qiu.

Message Analysis-Guided Allocation and Low-Pause Incremental Garbage Collection in a Concurrent Language Konstantinos Sagonas Jesper Wilhelmsson Uppsala.

Large scale IP filtering using Apache Pig and case study Kaushik Chandrasekaran Nabeel Akheel.

Speculative Region-based Memory Management for Big Data Systems Khanh Nguyen, Lu Fang, Harry Xu, Brian Demsky Donald Bren School of Information and Computer.

CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.

© Hortonworks Inc Hadoop: Beyond MapReduce Steve Loughran, Big Data workshop, June 2013.

ApproxHadoop Bringing Approximations to MapReduce Frameworks

PDAC-10 Middleware Solutions for Data- Intensive (Scientific) Computing on Clouds Gagan Agrawal Ohio State University (Joint Work with Tekin Bicer, David.

Page 1 A Platform for Scalable One-pass Analytics using MapReduce Boduo Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy SIGMOD 2011 IDS Fall Seminar 2011.

1 Tree and Graph Processing On Hadoop Ted Malaska.

Microsoft Ignite /28/2017 6:07 PM

Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,

Leverage Big Data With Hadoop Analytics Presentation by Ravi Namboori Visit

Image taken from: slideshare

Connected Infrastructure

Big Data is a Big Deal!.

Sushant Ahuja, Cassio Cristovao, Sameep Mohta

How Alluxio (formerly Tachyon) brings a 300x performance improvement to Qunar’s streaming processing Xueyan Li (Qunar) & Chunming Li (Garena)

About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.

Machine Learning Library for Apache Ignite

Distributed Programming in “Big Data” Systems Pramod Bhatotia wp

INTRODUCTION TO PIG, HIVE, HBASE and ZOOKEEPER

Tutorial: Big Data Algorithms and Applications Under Hadoop

Pathology Spatial Analysis February 2017

MongoDB Er. Shiva K. Shrestha ME Computer, NCIT

Spark Presentation.

Connected Infrastructure

Abstract Major Cloud computing companies have started to integrate frameworks for parallel data processing in their product portfolio, making it easy for.

Data Platform and Analytics Foundational Training

Speculative Region-based Memory Management for Big Data Systems

Hadoop Clusters Tess Fulkerson.

Central Florida Business Intelligence User Group

Skyway: Connecting Managed Heaps in Distributed Big Data Systems

Introduction to Spark.

Distributed Systems CS

Yak: A High-Performance Big-Data-Friendly Garbage Collector

Introduction to PIG, HIVE, HBASE & ZOOKEEPER

CS110: Discussion about Spark

Ch 4. The Evolution of Analytic Scalability

Apache Spark Lecture by: Faria Kalim (lead TA) CS425, UIUC

Introduction to Apache

Pregelix: Think Like a Vertex, Scale Like Spandex

Overview of big data tools

Spark and Scala.

TIM TAYLOR AND JOSH NEEDHAM

Apache Spark Lecture by: Faria Kalim (lead TA) CS425 Fall 2018 UIUC

Distributed Systems CS

Charles Tappert Seidenberg School of CSIS, Pace University

MAPREDUCE TYPES, FORMATS AND FEATURES

Fast, Interactive, Language-Integrated Cluster Computing

CS 239 – Big Data Systems Fall 2018

Analysis of Structured or Semi-structured Data on a Hadoop Cluster

Map Reduce, Types, Formats and Features

Presentation transcript:

Yak: A High-Performance Big-Data-Friendly Garbage Collector Khanh Nguyen, Lu Fang, Guoqing Xu, Brian Demsky, Sanazsadat Alamian, Shan Lu Onur Mutlu University of California, Irvine University of Chicago ETH Zürich

BIG DATA Naiad Cloudera is the first that commercialized Hadoop – claimed to be the leader – opened in 2008 Hortonworks is Hadoop-enterprise – opened in 2011 Giraph: open-source counterpart of Pregel Hama: general Bulk Synchronous Processing computing engine on top of Hadoop to do scientific computing, iterative algorithm such Matrix, Graph and Machine Learning Hive: data warehouse: query and manage large datasets in distributed storage using SQL-like language, based on MapReduce – initially developed by Facebook Pig: has high-level language to express data analysis program – its infrastructure support parallel execution, auto-tune efficiency. On Hadoop Storm: distributed realtime computation system, easy to process unbounded streams of data. Utilize queueing and database technologies to consume data in arbitrary complex way, repartition streams between each stage Flume: library for collecting, aggregating and moving large amounts of log data. Main goal is to deliver data from applications to HDFS, based on streaming data flow. Mahout: machine learning, dumped Hadoop, used Spark Dryad

High cost of the managed runtime is a fundamental problem! Scalability JVM crashes due to OutOfMemory error at early stage [Fang et al., SOSP’15] Management cost GC time accounts for up to 50% of the execution time [Bu et al., ISMM’13] High cost of the managed runtime is a fundamental problem!

Two Paths, Two Hypotheses MASTER SLAVE SLAVE Control path Data path Epochal Hypothesis Generational Hypothesis

Region-based memory management WordCount Many data objects have the same life span: Document Mapper Memory Region Heap setup setup Region-based memory management words GC map cleanup cleanup GC run in the middle is wasted Can be efficiently reclaimed together

Region-based Memory Management Sophisticated static analysis won’t work for data-intensive systems What about control path? How to marry generational GC with region-based memory management?

Reduced memory management cost Yak Approach Heap Control Space Data Space Generational GC Region-based Memory Management Reduced memory management cost annotate epoch boundary: - epoch_start() - epoch_end()

annotation & refactoring Correctness User-level approach: Facade [Nguyen et al., ASPLOS’15] Region annotation & refactoring Developers Yak

Yak: An Automatic Solution Yak: the first hybrid GC Implemented in OpenJDK 8 Modified the interpreter, two JIT compilers, the heap layout, the Parallel Scavenge GC NO code refactoring needed; correctness guaranteed by the system On average, vs. default production GC (PS): Reduce 33% execution time Reduce 78% GC time

Challenges How to create regions? How to reclaim regions correctly?

How to Create Regions? A region starts at a call to epoch-start and ends at a call to epoch-end The location of epochs affects performance but not correctness Regions are thread-local Regions can be nested

How to Create Regions? Region Semilattice CS,* 1,t1 1,t2 region 2,t1 void main() { } //end of main CS,* epoch #1 epoch_start(); for( ) { } epoch_end(); epoch #2 epoch_start(); for( ) { } epoch_end(); 1,t1 1,t2 epoch #3 epoch_start(); for( ) { } epoch_end(); region 2,t1 2,t2 3,t1 3,t2 JOIN( , ) = 2,t1 2,t2 CS,* JOIN( , ) = 3,t1 2,t1

How to Reclaim Regions Correctly? Object Promotion Algorithm Two key tasks: What: Identify escaping objects: Tracking of cross-region/space references in write barrier A fast path for intra-region references Inter-region references are recorded in the remember sets of their destination regions Where: Decide the relocation destination: Query region semilattice

Region Deallocation epoch_end() <CS,*> <r1, t1> t1’s stack Remember Set escaping roots Barrier … <r2, t1>

Evaluations 3 real-world systems, 9 applications: Hadoop Popular distributed MapReduce implementation Hyracks [Borkar et al., ICDE’11] Data-parallel platform to run data-intensive jobs on a cluster of shared-nothing machines GraphChi [Kyrola et al., OSDI’12] High-performance graph analytical framework for a single machine

-78% -33% -2% Improvement Summary GC Time App. Time Exec. Time Normalized Performance (to PS) -78% -33% -2% GC Time App. Time Exec. Time

More Data in the Paper Statistics on Yak’s heap Overhead breakdowns Write barrier cost Extra space cost

Conclusion Goal: Reduce memory management cost in Big Data systems Solution: Yak, the first hybrid GC 33% execution time savings 78% GC time reduction Requires almost zero user effort

Thank You!