Rhea: automatic filtering for unstructured cloud storage Christos Gkantsidis, Dimitrios Vytiniotis, Orion Hodson, Dushyanth Narayanan, Florin Dinu, and.

Slides:



Advertisements
Similar presentations
Introduction to Java 2 Programming Lecture 10 API Review; Where Next.
Advertisements

Compilation 2011 Static Analysis Johnni Winther Michael I. Schwartzbach Aarhus University.
Data-Flow Analysis Framework Domain – What kind of solution is the analysis looking for? Ex. Variables have not yet been defined – Algorithm assigns a.
Esma Yildirim Department of Computer Engineering Fatih University Istanbul, Turkey DATACLOUD 2013.
Chapter 10 Introduction to Arrays
Spark: Cluster Computing with Working Sets
BigData Tools Seyyed mohammad Razavi. Outline  Introduction  Hbase  Cassandra  Spark  Acumulo  Blur  MongoDB  Hive  Giraph  Pig.
Shark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Hive on Spark.
CS 536 Spring Global Optimizations Lecture 23.
1 “White box” or “glass box” tests “White Box” (or “Glass Box”) Tests.
4/25/08Prof. Hilfinger CS164 Lecture 371 Global Optimization Lecture 37 (From notes by R. Bodik & G. Necula)
An Introduction to MapReduce: Abstractions and Beyond! -by- Timothy Carlstrom Joshua Dick Gerard Dwan Eric Griffel Zachary Kleinfeld Peter Lucia Evan May.
Prof. Fateman CS 164 Lecture 221 Global Optimization Lecture 22.
Describing Syntax and Semantics
Ben Livshits Based in part of Stanford class slides from
Prof. Bodik CS 164 Lecture 16, Fall Global Optimization Lecture 16.
1 CS428 Web Engineering Lecture 18 Introduction (PHP - I)
CSC 8310 Programming Languages Meeting 2 September 2/3, 2014.
1 Guide to JSP common functions 1.Including the libraries as per a Java class, e.g. not having to refer to java.util.Date 2.Accessing & using external.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
U.S. Department of the Interior U.S. Geological Survey David V. Hill, Information Dynamics, Contractor to USGS/EROS 12/08/2011 Satellite Image Processing.
A Brief Overview by Aditya Dutt March 18 th ’ Aditya Inc.
Cloud MapReduce : a MapReduce Implementation on top of a Cloud Operating System Speaker : 童耀民 MA1G Authors: Huan Liu, Dan Orban Accenture.
Introduction to Java Appendix A. Appendix A: Introduction to Java2 Chapter Objectives To understand the essentials of object-oriented programming in Java.
15-740/ Oct. 17, 2012 Stefan Muller.  Problem: Software is buggy!  More specific problem: Want to make sure software doesn’t have bad property.
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Distributed Indexing of Web Scale Datasets for the Cloud {ikons, eangelou, Computing Systems Laboratory School of Electrical.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
Introduction to Hadoop and HDFS
JAVA SERVER PAGES. 2 SERVLETS The purpose of a servlet is to create a Web page in response to a client request Servlets are written in Java, with a little.
Bug Localization with Machine Learning Techniques Wujie Zheng
Chapter 6: User-Defined Functions
An Introduction to HDInsight June 27 th,
Hello.java Program Output 1 public class Hello { 2 public static void main( String [] args ) 3 { 4 System.out.println( “Hello!" ); 5 } // end method main.
Debugging in Java. Common Bugs Compilation or syntactical errors are the first that you will encounter and the easiest to debug They are usually the result.
Oracle Data Integrator Procedures, Advanced Workflows.
Flow of Control Part 1: Selection
Problem Solving Techniques. Compiler n Is a computer program whose purpose is to take a description of a desired program coded in a programming language.
Bi-Hadoop: Extending Hadoop To Improve Support For Binary-Input Applications Xiao Yu and Bo Hong School of Electrical and Computer Engineering Georgia.
The Only Constant is Change: Incorporating Time-Varying Bandwidth Reservations in Data Centers Di Xie, Ning Ding, Y. Charlie Hu, Ramana Kompella 1.
Looping and Counting Lecture 3 Hartmut Kaiser
Page 1 5/2/2007  Kestrel Technology LLC A Tutorial on Abstract Interpretation as the Theoretical Foundation of CodeHawk  Arnaud Venet Kestrel Technology.
BEGINNING PROGRAMMING.  Literally – giving instructions to a computer so that it does what you want  Practically – using a programming language (such.
Core Java Introduction Byju Veedu Ness Technologies httpdownload.oracle.com/javase/tutorial/getStarted/intro/definition.html.
Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.
CS 115 OBJECT ORIENTED PROGRAMMING I LECTURE 9 GEORGE KOUTSOGIANNAKIS Copyright: 2014 Illinois Institute of Technology- George Koutsogiannakis 1.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
Controlling Program Flow with Looping Structures
1 Test Coverage Coverage can be based on: –source code –object code –model –control flow graph –(extended) finite state machines –data flow graph –requirements.
© Dr. A. Williams, Fall Present Software Quality Assurance – Clover Lab 1 Tutorial / lab 2: Code instrumentation Goals of this session: 1.Create.
Hello world !!! ASCII representation of hello.c.
Introduction to Programming G50PRO University of Nottingham Unit 6 : Control Flow Statements 2 Paul Tennent
Chapter 6 Controlling Program Flow with Looping Structures.
1 VLDB, Background What is important for the user.
Coming up Implementation vs. Interface The Truth about variables Comparing strings HashMaps.
COMP7330/7336 Advanced Parallel and Distributed Computing MapReduce - Introduction Dr. Xiao Qin Auburn University
Information and Computer Sciences University of Hawaii, Manoa
TensorFlow– A system for large-scale machine learning
CIS3931 – Intro to JAVA Lecture Note Set 2 17-May-05.
About Hadoop Hadoop was one of the first popular open source big data technologies. It is a scalable fault-tolerant system for processing large datasets.
MapReduce Computing Paradigm Basics Fall 2013 Elke A. Rundensteiner
Microsoft Visual Basic 2005 BASICS
Program Slicing Baishakhi Ray University of Virginia
“White box” or “glass box” tests
Overview of big data tools
LCC 6310 Computation as an Expressive Medium
Server-Side Programming
Supporting Online Analytics with User-Defined Estimation and Early Termination in a MapReduce-Like Framework Yi Wang, Linchuan Chen, Gagan Agrawal The.
Presentation transcript:

Rhea: automatic filtering for unstructured cloud storage Christos Gkantsidis, Dimitrios Vytiniotis, Orion Hodson, Dushyanth Narayanan, Florin Dinu, and Antony Rowstron Presented by Gourav Khaneja

Motivation: Unstructured data Relational Databases had well-defined schema Unstructured “text” data (or loose structure): The structure of data is implicit in the application (flexibility)

Cluster design for data analytics Hadoop, Dryad, Map Reduce co-locate Storage and Compute

Elastic Cloud Amazon S3 & EC2: Amazon Elastic MapReduce Microsoft Azure Storage and computer cloud: Hadoop Scalable storage Elastic computeDC Network

Why separate clusters ? Security & Performance Isolation Independent Evolution (scalability & provisioning) (User) don’t pay for compute to keep data alive Scalable storage Elastic compute

Bottleneck Core DC bandwidth: Scarce & oversubscribe Scalable storage Elastic compute Bottleneck

Execute Mapper on storage ? Intuition: Mappers throw away a lot of data, but Data reduction not guaranteed Difficult to stop mappers during storage overload Storage nodes have to execute complicated logic (Hadoop system & protocol) Dependencies on runtime environment, libraries, etc

Solution: Rhea Filters unnecessary data at storage nodes Through static analysis of java byte code of mappers Filters are executable java code

Rhea: Design Storage Job Data Hadoo p Cluster Input Job Filter Generator Network Filter descriptions Filter Proxy 9 Extract row (select) & column (project) filters

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; String geoPoint = entries[2]; if (GEO_RSS_URI.equals(pointType)) { StringTokenizer st = new StringTokenizer(geoPoint, " "); String strLat = st.nextToken(); String strLong = st.nextToken(); double lat = Double.parseDouble(strLat); double lang = Double.parseDouble(strLong); String locationKey = ……… String locationName = ……… geoLocationKey.set(locationKey); geoLocationName.set(locationName); outputCollector.collect(geoLocationKey, geoLocationName); } Row Filters

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; String geoPoint = entries[2]; if (GEO_RSS_URI.equals(pointType)) { StringTokenizer st = new StringTokenizer(geoPoint, " "); String strLat = st.nextToken(); String strLong = st.nextToken(); double lat = Double.parseDouble(strLat); double lang = Double.parseDouble(strLong); String locationKey = ……… String locationName = ……… geoLocationKey.set(locationKey); geoLocationName.set(locationName); outputCollector.collect(geoLocationKey, geoLocationName); } 1. Label output lines.

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; String geoPoint = entries[2]; if (GEO_RSS_URI.equals(pointType)) { StringTokenizer st = new StringTokenizer(geoPoint, " "); String strLat = st.nextToken(); String strLong = st.nextToken(); double lat = Double.parseDouble(strLat); double lang = Double.parseDouble(strLong); String locationKey = ……… String locationName = ……… geoLocationKey.set(locationKey); geoLocationName.set(locationName); outputCollector.collect(geoLocationKey, geoLocationName); } 2. Collect all control flow path that reach to output labels (loops, conditional statements creates branches in the control flow)

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; String geoPoint = entries[2]; if (GEO_RSS_URI.equals(pointType)) { StringTokenizer st = new StringTokenizer(geoPoint, " "); String strLat = st.nextToken(); String strLong = st.nextToken(); double lat = Double.parseDouble(strLat); double lang = Double.parseDouble(strLong); String locationKey = ……… String locationName = ……… geoLocationKey.set(locationKey); geoLocationName.set(locationName); outputCollector.collect(geoLocationKey, geoLocationName); } 3. Create a flow map : For each instruction, for each variable referenced in that instruction: what instruction affects that variable.

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; if (GEO_RSS_URI.equals(pointType)) { outputCollector.collect(geoLocationKey, geoLocationName); } 4. Keep only the statements which are reaching destination for control flow statements.

public void map(… value …) { String[] entries = value.toString().split(“\t”); String articleName = entries[0]; String pointType = entries[1]; if (GEO_RSS_URI.equals(pointType)) { return true; } return false; } 5. Disjunction of paths: Return true for control reaching output labels. *This is a simplified version. The actual Rhea-generated code differs in terms of variable names and condition check.

Column Filters StringTokenizer, String.split based on regular expressions. Can be extended to other APIs. Conservative: do not filter otherwise Replace irrelevant tokens Generate fillers dynamically

State machine for column filter v=value.toString() t=new StringTokenizer(t,sep)t.nextToken() T=v.split(sep) … START

Filter Properties Correct Isolation and safety: No system calls, I/O call etc. Fully Transparent. Thus, best effort: can be killed anytime. Stateless: less memory usage (unlike mappers) Guarantee output < input : unlike mappers Termination: proof ?

Evaluation: Job Selectivity Many Jobs are very selective either on rows or columns or both Normalized selectivity of example jobs Many Jobs are very selective either on rows or columns or both 30 % of data transferred

Job Run Time Job run time normalized to baseline execution (without Rhea) Discussion: Filter time not included.

Throughput of Filtering Engine OK for a 2 core machine, transmitting at full line rate of 1 Gbps Optimizations only for column filter

Across Datacenters: WAN is the bottleneck Similar results as for LAN For a few jobs, LAN is a bottleneck instead of WAN

Dollar costs Why compute cost is reduced ? Per second compute cost (instead of per dollars)

Discussion The example jobs might be biased towards selectivity. How does system generalize beyond Hadoop/Java (Pig, Spark, streaming) ? Experiments to study computing availability at storage nodes. Not optimal (throughput-wise, selectivity-wise). False-positive rate ? Debugging becomes harder, in case of mapper bugs.

Stateful Mappers Statements may modify mapper state Example: A mapper emitting every nth row Solution: Treat state accessing statements as output labels

Optimizations Merge control paths if all the branches lead to output labels (loops and conditions) if (GEO_RSS_URI.equals(pointType)) { … }else{ … } While(condition){ … } outputCollector.collect(geoLocationKey, geoLocationName);

Evaluation Input data size and run time for 9 example jobs without Rhea Out of 160 mappers, 50% (26%) gives non-trivial row (column filters)

DC bandwidth: Scarce & oversubscribe 631 Mbps 230 Mbps