Yongzhi Wang, Jinpeng Wei VIAF: Verification-based Integrity Assurance Framework for MapReduce.

Slides:



Advertisements
Similar presentations
Wei Lu 1, Kate Keahey 2, Tim Freeman 2, Frank Siebenlist 2 1 Indiana University, 2 Argonne National Lab
Advertisements

Lecture 12: MapReduce: Simplified Data Processing on Large Clusters Xiaowei Yang (Duke University)
LIBRA: Lightweight Data Skew Mitigation in MapReduce
MapReduce Online Created by: Rajesh Gadipuuri Modified by: Ying Lu.
Locality-Aware Dynamic VM Reconfiguration on MapReduce Clouds Jongse Park, Daewoo Lee, Bokyeong Kim, Jaehyuk Huh, Seungryoul Maeng.
SkewTune: Mitigating Skew in MapReduce Applications
SecureMR: A Service Integrity Assurance Framework for MapReduce Wei Wei, Juan Du, Ting Yu, Xiaohui Gu North Carolina State University, United States Annual.
Distributed Approximate Spectral Clustering for Large- Scale Datasets FEI GAO, WAEL ABD-ALMAGEED, MOHAMED HEFEEDA PRESENTED BY : BITA KAZEMI ZAHRANI 1.
Ragib Hasan University of Alabama at Birmingham CS 491/691/791 Fall 2011 Lecture 10 09/15/2011 Security and Privacy in Cloud Computing.
Spark: Cluster Computing with Working Sets
Homework 2 In the docs folder of your Berkeley DB, have a careful look at documentation on how to configure BDB in main memory. In the docs folder of your.
Improving MapReduce Performance Using Smart Speculative Execution Strategy Qi Chen, Cheng Liu, and Zhen Xiao Oct 2013 To appear in IEEE Transactions on.
L22: SC Report, Map Reduce November 23, Map Reduce What is MapReduce? Example computing environment How it works Fault Tolerance Debugging Performance.
MapReduce Simplified Data Processing On large Clusters Jeffery Dean and Sanjay Ghemawat.
MapReduce : Simplified Data Processing on Large Clusters Hongwei Wang & Sihuizi Jin & Yajing Zhang
Google Distributed System and Hadoop Lakshmi Thyagarajan.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc
Advanced Topics: MapReduce ECE 454 Computer Systems Programming Topics: Reductions Implemented in Distributed Frameworks Distributed Key-Value Stores Hadoop.
HeteroPar 2013 Optimization of a Cloud Resource Management Problem from a Consumer Perspective Rafaelli de C. Coutinho, Lucia M. A. Drummond and Yuri Frota.
IPDPS, Supporting Fault Tolerance in a Data-Intensive Computing Middleware Tekin Bicer, Wei Jiang and Gagan Agrawal Department of Computer Science.
Investigation of Data Locality in MapReduce
A Workflow-Aware Storage System Emalayan Vairavanathan 1 Samer Al-Kiswany, Lauro Beltrão Costa, Zhao Zhang, Daniel S. Katz, Michael Wilde, Matei Ripeanu.
Research on cloud computing application in the peer-to-peer based video-on-demand systems Speaker : 吳靖緯 MA0G rd International Workshop.
Report : Zhen Ming Wu 2008 IEEE 9th Grid Computing Conference.
Google MapReduce Simplified Data Processing on Large Clusters Jeff Dean, Sanjay Ghemawat Google, Inc. Presented by Conroy Whitney 4 th year CS – Web Development.
Interpreting the data: Parallel analysis with Sawzall LIN Wenbin 25 Mar 2014.
Map Reduce for data-intensive computing (Some of the content is adapted from the original authors’ talk at OSDI 04)
Emalayan Vairavanathan
CS525: Special Topics in DBs Large-Scale Data Management Hadoop/MapReduce Computing Paradigm Spring 2013 WPI, Mohamed Eltabakh 1.
SecureMR: A Service Integrity Assurance Framework for MapReduce Author: Wei Wei, Juan Du, Ting Yu, Xiaohui Gu Source: Annual Computer Security Applications.
Presented By HaeJoon Lee Yanyan Shen, Beng Chin Ooi, Bogdan Marius Tudor National University of Singapore Wei Lu Renmin University Cang Chen Zhejiang University.
Active Monitoring in GRID environments using Mobile Agent technology Orazio Tomarchio Andrea Calvagna Dipartimento di Ingegneria Informatica e delle Telecomunicazioni.
Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid KutluGagan Agrawal Department of Computer Science and Engineering The Ohio State.
Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal
1 Configurable Security for Scavenged Storage Systems NetSysLab The University of British Columbia Abdullah Gharaibeh with: Samer Al-Kiswany, Matei Ripeanu.
EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.
Hadoop/MapReduce Computing Paradigm 1 Shirish Agale.
An Autonomic Framework in Cloud Environment Jiedan Zhu Advisor: Prof. Gagan Agrawal.
High Throughput Computing on P2P Networks Carlos Pérez Miguel
Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.
MapReduce How to painlessly process terabytes of data.
Data Replication and Power Consumption in Data Grids Susan V. Vrbsky, Ming Lei, Karl Smith and Jeff Byrd Department of Computer Science The University.
SIGCOMM 2012 (August 16, 2012) Private and Verifiable Interdomain Routing Decisions Mingchen Zhao * Wenchao Zhou * Alexander Gurney * Andreas Haeberlen.
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
HADOOP DISTRIBUTED FILE SYSTEM HDFS Reliability Based on “The Hadoop Distributed File System” K. Shvachko et al., MSST 2010 Michael Tsitrin 26/05/13.
CS525: Big Data Analytics MapReduce Computing Paradigm & Apache Hadoop Open Source Fall 2013 Elke A. Rundensteiner 1.
DynamicMR: A Dynamic Slot Allocation Optimization Framework for MapReduce Clusters Nanyang Technological University Shanjiang Tang, Bu-Sung Lee, Bingsheng.
Massive Semantic Web data compression with MapReduce Jacopo Urbani, Jason Maassen, Henri Bal Vrije Universiteit, Amsterdam HPDC ( High Performance Distributed.
ApproxHadoop Bringing Approximations to MapReduce Frameworks
A N I N - MEMORY F RAMEWORK FOR E XTENDED M AP R EDUCE 2011 Third IEEE International Conference on Coud Computing Technology and Science.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Hadoop/MapReduce Computing Paradigm 1 CS525: Special Topics in DBs Large-Scale Data Management Presented By Kelly Technologies
By: Joel Dominic and Carroll Wongchote 4/18/2012.
Secure Offloading of Legacy IDSes Using Remote VM Introspection in Semi-trusted IaaS Clouds Kenichi Kourai Kazuki Juda Kyushu Institute of Technology.
Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation.
Talal H. Noor, Quan Z. Sheng, Lina Yao,
Introduction to Distributed Platforms
Distributed Network Traffic Feature Extraction for a Real-time IDS
Introduction to MapReduce and Hadoop
PA an Coordinated Memory Caching for Parallel Jobs
Selectivity Estimation of Big Spatial Data
MapReduce Simplied Data Processing on Large Clusters
MapReduce: Data Distribution for Reduce
February 26th – Map/Reduce
Cse 344 May 4th – Map/Reduce.
CS110: Discussion about Spark
Maria Méndez Real, Vincent Migliore, Vianney Lapotre, Guy Gogniat
Harrison Howell CSCE 824 Dr. Farkas
Presentation transcript:

Yongzhi Wang, Jinpeng Wei VIAF: Verification-based Integrity Assurance Framework for MapReduce

MapReduce in Brief Satisfying the demand for large scale data processing It is a parallel programming model invented by Google in 2004 and become popular in recent years. Data is stored in DFS(Distributed File System) Computation entities consists of one Master and many Workers (Mapper and Reducer) Computation can be divided into two phases: Map and Reduce 2

Word Count instance Hello, 2 Hello, 4 3

Integrity Vulnerability How to detect and eliminate malicious mappers to guarantee high computation integrity ? Straightforward approach: Duplication 4 Non-collusive worker

Integrity vulnerability But how to deal with collusive malicious workers? 5

VIAF Solution Duplication Introduce trusted entity to do random verification 6

Outline Motivation System Design Analysis and Evaluation Related work Conclusion 7

Architecture 8

Assumption Trusted entities Master DFS Reducers Verifiers Untrusted entities Mappers Information will not be tampered in network communication Mappers’ result reported to the master should be consistent with its local storage (Commitment based protocol in SecureMR) 9

Solution Idea: Duplication + Verification Deterministically duplicate each task to TWO mappers to discern the non-collusive mappers Non-deterministically verify the consistent result to discern the collusive mappers The credit of each mapper is accumulated by passing verification A Mapper become trustable only when its credit achieves Quiz Threshold 100% accuracy in detecting non-collusive malicious mapper The more verification applied on a mapper, the higher accuracy to determine whether it is collusive malicious. 10

Data flow control b 5.a

Outline Motivation System Design Analysis and Evaluation Related work Conclusion 12

Theoretical Analysis Measurement metric (for each task) Accuracy -- The probability the reducer receive a good result from mapper Mapping Overhead – The average number of execution launched by the mapper Verification Overhead – The average number of execution launched by the verifier 13

Accuracy vs Quiz Threshold Collusive mappers only, different p: m – Malicious worker ratio c – Collusive worker ratio p – probability that two assigned collusive workers are in one collusion group and can commit a cheat q – probability that two collusive workers commit a cheat r – probability that a non collusive worker commit a cheat v – Verification Probability k – Quiz Threshold. 14

Overhead vs Quiz Threshold Collusive worker only: Overhead for each task stays between 2.0 to 2.4 Collusive worker only: Verification Overhead for each task stays between 0.2 to when v is

Experimental Evaluation Implementation based on Hadoop virtual machines(512 MB of Ram, 40 GB disk each, Debian ) deployed on a 2.93 GHz, 8-core Intel Xeon CPU with 16 GB of RAM word count application, 400 mapping tasks and 1 reduce task Out of 11 virtual hosts 1 is both the Master and the benign worker 1 is the verifier 4 are collusive workers (malicious ratio is 40%) 5 are benign workers 16

Evaluation result Quiz Threshold Accuracy Mapper Overhead Verification Overhead % % % % % 3100% % 4100% % 5100% % 6100% % 7100% % Where c is 1.0, p is 1.0, q is 1.0, m is 0.4  0 17

Outline Motivation System Design Analysis and Evaluation Related work Conclusion 18

Related work Replication, sampling, checkpoint-based solutions were proposed in several distributed computing contexts Duplicate based for Cloud: SecureMR(Wei et al, ACSAC ‘09), Fault Tolerance Middleware for Cloud Computing( Wenbing et al, IEEE CLOUD ‘10) Quiz based for P2P: Result verification and trust based scheduling in peer- to-peer grids (S. Zhao et al, P2P ‘05) Sampling based for Grid: Uncheatable Grid Computing (Wenliang et al, ICDCS’04) Accountability in Cloud computing research Hardware-based attestation: Seeding Clouds with Trust Anchors (Joshua et al, CCSW ‘10) Logging and Auditing: Accountability as a Service for the Cloud(Jinhui, et al, ICSC ‘10) 19

Outline Motivation System Design Analysis and Evaluation Related work Conclusion 20

Conclusion Contributions Proposed a Framework (VIAF) to defeat both collusive and non-collusive malicious mappers in MapReduce calculation. Implemented the system and proved from theoretical analysis and experimental evaluation that VIAF can guarantee high accuracy while incurring acceptable overhead. Future work Further exploration without the assumption of trust of reducer. Reuse the cached result to better utilize the verifier resource. 21