Object-Orientation Meets Big Data: Language Techniques towards Highly-Efficient Data-Intensive Computing. Harry Xu, UC Irvine.

Presentation transcript:

Object-Orientation Meets Big Data: Language Techniques towards Highly-Efficient Data-Intensive Computing
Harry Xu, UC Irvine

Big Data Applications
- Large-scale, data-intensive applications are pervasively used to extract useful information from a sea of data items
- Three categories of applications
  - Dataflow frameworks such as Hadoop, Hyracks, Spark, and Storm
  - Message passing systems such as Giraph and Pregel
  - High-level languages such as Pig, Hive, AsterixDB, FlumeJava

Big Data, Big Challenges
- All of these applications (except for Spark) are written in Java
  - Spark is written in Scala but relies on the JVM to execute
- Huge performance and scalability problems

Example Performance Problems
- The implementations of PageRank in Giraph, Spark, and Mahout all failed to process 500 MB of data on a 10 GB heap
- Numerous complaints about poor performance and scalability can be found on various mailing lists and programming Q/A sites (e.g., stackoverflow.com)
- 47% of the execution time for answering a simple SQL query is taken by GC
- More details can be found in our ISMM'13 paper

What is Happening?
- Excessive object creation is the root of all evil
  - Each object has a header (space overhead)
  - Each object is subject to GC traversal (space and time overhead)
  - A Java program makes heavy use of data structures that use pointers to connect objects (space and time overhead)
- Object-oriented developers are encouraged to create objects

Consequences
- Low packing factor
  - Suppose each vertex has m outgoing edges and n incoming edges
  - The space overhead of the vertex is 16(m + n) + 148
  - The actual data only needs 8(m + n) + 24
- Large GC costs
  - A Java hash table can contain millions of objects in a typical Big Data application
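
To make the two expressions on this slide concrete, here is a worked instance for a hypothetical vertex with m = n = 10. It assumes one reading of the slide, namely that the first expression is the space the object-based vertex occupies and the second is the space its data needs; the slide does not state units or define the packing factor more precisely, so the ratio below is illustrative only.

```latex
\begin{align*}
\text{object-based vertex:}\quad & 16(m+n) + 148 = 16 \cdot 20 + 148 = 468 \\
\text{actual data:}\quad & 8(m+n) + 24 = 8 \cdot 20 + 24 = 184 \\
\text{ratio:}\quad & \frac{184}{468} \approx 0.39,
\qquad
\lim_{m+n \to \infty} \frac{8(m+n)+24}{16(m+n)+148} = \frac{1}{2}
\end{align*}
```

Under this reading, even for vertices with very many edges no more than half of the occupied space holds actual data.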

Our Proposal
- Stage 1: a bloat-free design methodology
  - A buffer-based memory management technique
  - A bloat-free programming model
- Stage 2: provide compiler support
  - Automatically transform a regular Java program into a program under the bloat-free design
  - Various static/dynamic analysis and optimization techniques will be developed
- Stage 3: provide maintenance support
  - Design novel algorithms for analysis, debugging, testing, etc. to reduce maintenance costs

Buffer-based Memory Management
- The key to designing a scalable Big Data application is to bound the number of objects
  - It cannot grow proportionally with the size of the input data set
- Allocate data items in buffers, which themselves are Java objects (e.g., java.nio.ByteBuffer)
- Data items in a buffer have similar lifetimes and can be discarded all together at the end of the iteration in which they are processed
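
A minimal sketch of the buffer idea, not the authors' implementation: the class name `RankBuffer`, the record layout (an 8-byte vertex id followed by an 8-byte rank), and the method names are all assumptions made for illustration. Only `java.nio.ByteBuffer` comes from the slide. Records are written as raw bytes, so the number of Java objects stays constant no matter how many records are stored, and `reset()` discards every record at once at the end of an iteration.

```java
import java.nio.ByteBuffer;

// Hypothetical example: stores (vertexId, rank) records as raw bytes in one buffer.
final class RankBuffer {
    // Assumed record layout: long vertex id followed by double rank.
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    private final ByteBuffer buffer;

    RankBuffer(int maxRecords) {
        // One buffer object, however many records it ends up holding.
        buffer = ByteBuffer.allocate(maxRecords * RECORD_BYTES);
    }

    /** Appends one record; no per-record Java object is created. */
    void append(long vertexId, double rank) {
        buffer.putLong(vertexId).putDouble(rank);
    }

    /** Number of records written since the last reset. */
    int count() {
        return buffer.position() / RECORD_BYTES;
    }

    /** Sums the rank field of every record using absolute reads. */
    double sumOfRanks() {
        double sum = 0.0;
        for (int i = 0; i < count(); i++) {
            sum += buffer.getDouble(i * RECORD_BYTES + Long.BYTES);
        }
        return sum;
    }

    /** Discards all records together, e.g. at the end of an iteration. */
    void reset() {
        buffer.clear();
    }
}
```

Because the records never become individual heap objects, the garbage collector only ever sees the single buffer, which is the bound on object count that the slide asks for.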

Accessor-based Programming Model
- Data are all in buffers now
- We create objects only to represent accessors (not data items)
- An accessor can bind itself to different data items (i.e., reuse)
- For each type of data structure, we form an accessor structure, in which each accessor accesses a data item
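
The accessor idea can be sketched as follows. This is an illustration under assumptions, not the programming model from the talk: `VertexAccessor`, `AccessorDemo`, `bind`, and the record layout (an 8-byte id followed by an 8-byte rank) are hypothetical names and choices. The point it shows is that one accessor object is rebound to successive records in a buffer instead of one object being created per data item.

```java
import java.nio.ByteBuffer;

// Hypothetical accessor: a reusable view over one record in a buffer.
final class VertexAccessor {
    // Assumed record layout: long vertex id followed by double rank.
    static final int RECORD_BYTES = Long.BYTES + Double.BYTES;

    private ByteBuffer buffer;
    private int offset;

    /** Points this accessor at another record; no new object is allocated. */
    VertexAccessor bind(ByteBuffer buffer, int recordIndex) {
        this.buffer = buffer;
        this.offset = recordIndex * RECORD_BYTES;
        return this;
    }

    long id() {
        return buffer.getLong(offset);
    }

    double rank() {
        return buffer.getDouble(offset + Long.BYTES);
    }

    void setRank(double rank) {
        buffer.putDouble(offset + Long.BYTES, rank);
    }
}

final class AccessorDemo {
    /** Scales every rank in place and returns the new total, using one accessor. */
    static double scaleAll(ByteBuffer buffer, int records, double factor) {
        VertexAccessor v = new VertexAccessor(); // the only object created here
        double total = 0.0;
        for (int i = 0; i < records; i++) {
            v.bind(buffer, i);               // reuse: rebind instead of allocating
            v.setRank(v.rank() * factor);
            total += v.rank();
        }
        return total;
    }
}
```

The loop touches every record but allocates a single accessor, so the object count is independent of the number of data items.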

Accessor Structure

An Initial Evaluation of Performance on PageRank
- The bloat-free version scales well with the size of the data set
- All the object-based implementations run out of memory at the 10 GB scale

The Current Status
- This is joint work with the database group at UCI, particularly
  - Vinayak Borkar, Yingyi Bu, and Michael Carey
- We are between stage 1 and stage 2

Why Not Switch Back to an Unmanaged Language?
- A typical Big Data application exhibits a clear separation between control path and data path
- The control path takes the majority of the development effort
  - Our study of 6 Big Data applications shows that on average 64% of the LOC is in the control path
- The data path creates the majority of the heap objects
  - More than 90% of the heap objects are created by the data path
- Can we enjoy both the benefits of a managed language for the control path and high performance for the data path?