Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework
Yi Wang, Linchuan Chen, Gagan Agrawal (The Ohio State University)
Outline
Introduction, Challenges, System Design, Experimental Results, Conclusion
Two Trends of Big Data Analytics
More interactive: online processing is favored; approximate algorithms are a good fit, since 100% accuracy is often unnecessary and accuracy can be traded for performance and responsiveness.
More cloud-based: pay as you go, so early termination also cuts cost.
Online Analytics
Processes data incrementally: handles a (small) portion at a time and provides an online estimate.
Early termination: stops when the estimate is good enough, saving time and resources.
Example: online aggregation (see the sketch below).
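To make the incremental loop concrete, here is a minimal, single-process sketch of online aggregation. It is illustrative only, not the paper's implementation: process one block per iteration, scale the partial sum up to the full input size, and stop once consecutive estimates agree.

```cpp
// Hypothetical single-node online aggregation loop (not the framework's code).
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> data(1000000, 1.0);   // stand-in dataset
    const std::size_t block = 10000;          // records processed per iteration
    double sum = 0.0;
    std::size_t seen = 0;
    double prev_estimate = 0.0;

    while (seen < data.size()) {
        std::size_t end = std::min(seen + block, data.size());
        for (std::size_t i = seen; i < end; ++i) sum += data[i];
        seen = end;

        // Scale the partial sum up to the full input size.
        double estimate = sum * static_cast<double>(data.size()) / seen;

        // Early termination: stop when two consecutive estimates agree.
        if (seen > block &&
            std::fabs(estimate - prev_estimate) < 1e-6 * std::fabs(estimate)) {
            std::printf("terminated after %zu records, estimate = %f\n", seen, estimate);
            return 0;
        }
        prev_estimate = estimate;
    }
    std::printf("full scan, sum = %f\n", sum);
    return 0;
}
```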
Online Analytics + MapReduce
MapReduce's major advantage: hides all the parallelization complexity behind a simplified API.
Limitation: mainly designed for batch processing.
Challenges
Online mode + batch mode: the online mode is similar to iterative processing, where each iteration processes a portion of the data and produces an estimate. Backward compatibility matters: old batch-mode code should still work, and the original API should not be broken.
Runtime estimation + synced termination: there is no more communication among workers after shuffling, so how do we obtain a global estimate at each iteration, and how do we terminate every node synchronously?
Challenges (Cont'd)
Associating the termination condition with the application semantics:
Accuracy: online aggregation
Threshold (not necessarily accuracy): top-k
Convergence or delta of a series of estimates: clustering and regression
Bridging the Gap
Addresses all the challenges: applies online sampling and adds 2 optional functions for customizable runtime estimation and early termination; falls back to batch mode by default.
The master node combines local estimates and controls termination (unlike ETL, the output of online analytics is usually small), evaluating the termination condition over the estimates of the past few sampling iterations (see the sketch below).
In contrast, some existing works simply patch MapReduce in ad hoc ways and break the batch mode.
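A hedged sketch of the master-side control described above; the names (Worker, local_estimate, should_terminate) are illustrative placeholders, not the framework's real API. After each sampling iteration the master combines one local estimate per worker, keeps a short history, and declares termination at the same iteration for every worker, which keeps shutdown synchronized.

```cpp
// Illustrative master-side control loop (hypothetical API names).
#include <deque>
#include <functional>
#include <vector>

struct Worker {
    // Returns this worker's local estimate after the current sampling iteration.
    std::function<double()> local_estimate;
};

// Keep the last few global estimates so the user-defined termination function
// can check accuracy, a threshold, or convergence across iterations.
bool run_master(std::vector<Worker>& workers,
                const std::function<bool(const std::deque<double>&)>& should_terminate,
                int max_iterations) {
    std::deque<double> history;
    for (int iter = 0; iter < max_iterations; ++iter) {
        // Combine local estimates; a simple sum here, but the combine
        // step is application-specific.
        double global = 0.0;
        for (auto& w : workers) global += w.local_estimate();

        history.push_back(global);
        if (history.size() > 3) history.pop_front();

        // All workers stop at the same iteration, so termination stays in sync.
        if (should_terminate(history)) return true;
    }
    return false;  // fell back to a full scan (batch mode)
}
```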
Precursor System: Smart
Smart is our MR-like framework (SC'15).
System Overview
3 major extensions: online sampling, runtime estimation, and early termination.
Online Sampling
Stratified sampling without replacement:
Stratified: good accuracy.
Without replacement: the sampled data grows strictly until early termination or a full scan.
The sample size per iteration is constant, so no tricky dynamic sample-size adjustment is needed (see the sketch below).
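The following is an illustrative sketch, not Smart's actual sampler, of stratified sampling without replacement: each iteration draws a fixed number of not-yet-sampled records from every stratum (data block), so the sampled set grows strictly until early termination or a full scan.

```cpp
// Hypothetical stratified sampler without replacement.
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

struct Stratum {
    std::vector<int> indices;   // record indices belonging to this stratum
    std::size_t next = 0;       // boundary between sampled and unsampled records
};

// Draw `per_stratum` new records from each stratum; returns the indices
// drawn this iteration (empty once every stratum is exhausted).
std::vector<int> sample_iteration(std::vector<Stratum>& strata,
                                  std::size_t per_stratum,
                                  std::mt19937& rng) {
    std::vector<int> drawn;
    for (auto& s : strata) {
        for (std::size_t k = 0; k < per_stratum && s.next < s.indices.size(); ++k) {
            // Pick uniformly among records not sampled yet and swap it into the
            // "sampled" prefix, so it can never be picked again.
            std::uniform_int_distribution<std::size_t> pick(s.next, s.indices.size() - 1);
            std::swap(s.indices[s.next], s.indices[pick(rng)]);
            drawn.push_back(s.indices[s.next]);
            ++s.next;
        }
    }
    return drawn;
}
```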
Runtime Estimation and Early Termination
Runtime estimation further reduces the sampled data along the time dimension: it reuses the same merge (reduce) function to combine all the data processed so far, and estimates based on the total input size and the running reduction results.
Early termination evaluates the most recent snapshot(s) of the estimates.
Extended System APIs
Runtime estimation function: the runtime provides the total input size plus the stats of the data processed so far, and the function estimates based on this partial input. E.g., with 10 GB of data in total, after processing 1 GB with sum = 1, the estimated sum = 10. By default, it returns the results over the data processed so far (i.e., no magnification or adjustment); see the sketch below.
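A minimal sketch of such an estimation function for the sum example above; the struct and function names (PartialStats, estimate_sum) are hypothetical placeholders, not the framework's real API.

```cpp
// Illustrative user-defined estimation function: scale the running sum by
// total size / processed size.
struct PartialStats {
    double running_sum;        // reduction result over the data processed so far
    long long processed_bytes; // amount of input processed so far
    long long total_bytes;     // total input size, provided by the runtime
};

// E.g., total 10 GB, 1 GB processed, running_sum = 1  ->  estimate = 10.
double estimate_sum(const PartialStats& s) {
    if (s.processed_bytes == 0) return 0.0;
    return s.running_sum * static_cast<double>(s.total_bytes) / s.processed_bytes;
}
```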
Extended System APIs (Cont'd)
Early termination function: to evaluate accuracy or a threshold, it retrieves the most recent estimate; to evaluate convergence or a delta, it retrieves the most recent 2 or more estimates. By default, it returns false to disable early termination and fall back to batch mode. A convergence-style sketch follows below.
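A minimal sketch of a convergence-based termination function (e.g., for k-means); the name, signature, and tolerance are assumptions for illustration, not the framework's actual API.

```cpp
// Illustrative user-defined termination function: stop once the relative
// change between the two most recent estimates drops below a tolerance.
#include <algorithm>
#include <cmath>
#include <deque>

bool converged(const std::deque<double>& recent_estimates, double tol = 1e-3) {
    if (recent_estimates.size() < 2) return false;  // not enough history yet
    double prev = recent_estimates[recent_estimates.size() - 2];
    double curr = recent_estimates.back();
    double denom = std::max(std::fabs(curr), 1e-12); // avoid division by zero
    return std::fabs(curr - prev) / denom < tol;
}
```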
Experimental Setup
32 nodes * 8 cores, 12 GB memory; 112 GB ocean dataset.
Apps: online aggregation (accuracy-based termination), top-k (threshold-based termination), k-means (convergence-based termination).
Performance Evaluation
Lazy loading: partial I/O + partial computation. Full loading: full I/O + partial computation. Baseline: batch processing.
Lazy loading is up to 30x faster than the baseline and up to 11x faster than full loading.
Which Factors Delay Termination?
Selectivity = sampled data size / total input size.
Selectivity ∝ sample size; selectivity ∝ # of nodes.
Which Factors Affect Accuracy?
Note that sample size is not equivalent to sampling rate here: the sample size is the amount drawn from a data block at a time, whereas the sampling rate commonly refers to the total sampled data.
No noticeable difference when varying the # of nodes; a stricter termination condition yields higher accuracy; the sample size used in each sampling iteration also has some effect on accuracy.
Conclusion
Online analytics + MR: big data analytics can be done incrementally and stopped early to deliver timely insight; there is a clear need for approximate algorithms on MR.
Critical extensions: online sampling; extended APIs for flexible runtime estimation and early termination conditions w.r.t. application semantics.