Performance Considerations of Data Acquisition in Hadoop System
Baodong Jia, Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger
Namrata Patil
Contents
- Introduction
- Sub-projects of Hadoop
- Two solutions for data acquisition
- Workflow of the Chukwa system
- Primary components
- Setup for performance analysis
- Factors influencing performance comparison
- Conclusion
Department of computer science & Engg
Introduction
- Oil and gas industry
- Drilling is done by service companies
http://bub.blicio.us/social-media-for-oil-and-gas/
Continued…
Companies collect drilling data by placing sensors on drilling bits and platforms, and make it available on their servers.
Advantages
- Operators can follow the drilling status
- Operators can get useful information from the historical data
Problems
- Vast amounts of data are accumulated
- Reasoning over it is infeasible or very time-consuming
Solution
- Investigate the application of a MapReduce system: Hadoop
Sub-projects of Hadoop
1. Hadoop Common
2. Chukwa
3. HBase
4. HDFS
HDFS – a distributed file system that stores application data in a replicated way and provides high throughput.
Chukwa – an open-source data collection system designed for monitoring large distributed systems.
http://hadoop.apache.org/
Two solutions for data acquisition
- Solution 1: acquiring data from the data sources, then copying the data files to HDFS
- Solution 2: a Chukwa-based solution
Solution 1
Hadoop runs MapReduce jobs on the cluster and stores the results on HDFS.
Steps:
1. Prepare the required data set for the job
2. Copy it to HDFS
3. Submit the job to Hadoop
4. Store the result in a user-specified directory on HDFS
5. Get the result out of HDFS
Pros & cons
Pros
- Works efficiently for a small number of files with large file sizes
Cons
- Takes a lot of extra time for a large number of files with small file sizes
- Does not support appending file content
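The small-file penalty above can be sketched with a simple cost model. This is illustrative only: the per-file overhead and throughput constants are assumptions, not measurements from the paper.

```python
# Illustrative cost model for copying files into HDFS (assumed constants).
# Each file pays a fixed per-file overhead (metadata operations, connection
# setup) plus a size-dependent transfer cost, so many small files lose badly.

PER_FILE_OVERHEAD_S = 0.5   # assumed seconds of fixed cost per file
THROUGHPUT_MB_S = 50.0      # assumed sustained write throughput in MB/s

def copy_time(num_files: int, total_mb: float) -> float:
    """Estimated seconds to copy num_files totalling total_mb into HDFS."""
    return num_files * PER_FILE_OVERHEAD_S + total_mb / THROUGHPUT_MB_S

one_big = copy_time(1, 1024)        # one 1 GB file
many_small = copy_time(1024, 1024)  # 1024 files of 1 MB each

print(f"one big file:     {one_big:.1f} s")
print(f"many small files: {many_small:.1f} s")
```

With the same total volume, the many-small-files case is dominated by the per-file overhead term, which is the effect the slide describes.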
Solution 2
- Overcomes the extra time spent copying files to HDFS
- Exists on top of Hadoop
- Chukwa feeds the organized data into the cluster
- Uses a temporary file to store the data collected from the different agents
http://incubator.apache.org/chukwa/
Chukwa
- Open-source data collection system built on top of Hadoop
- Inherits Hadoop's scalability and robustness
- Provides a flexible and powerful toolkit to display, monitor, and analyze results
http://incubator.apache.org/chukwa/
Workflow of the Chukwa system
[Figure: workflow of the Chukwa system]
Primary components
- Agents – run on each machine and emit data
- Collectors – receive data from the agents and write it to stable storage
- MapReduce jobs – parse and archive the data
- HICC (Hadoop Infrastructure Care Center) – a web-portal-style interface for displaying data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
Continued…
Agents
- Collect data through their adaptors
- Adaptors: small, dynamically controllable modules that run inside the agent process
- An agent may run several adaptors
- Agents run on every node of the Hadoop cluster
- Different hosts may generate different data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
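As a sketch of how adaptors are wired up: a Chukwa agent can read an `initial_adaptors` file listing one adaptor per line (adaptor class, data type, parameters, initial offset). The adaptor class and file path below are examples, not taken from the paper; the exact class names are in the Chukwa 0.4.0 documentation.

```
# initial_adaptors — adaptors started when the agent boots
# format: add <adaptor class> <data type> <params> <initial offset>
add filetailer.CharFileTailingAdaptorUTF8 SysLog /var/log/messages 0
```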
Collectors
- Gather the data through HTTP
- A collector receives data from up to several hundred agents
- Writes all this data to a single Hadoop sequence file called a sink file
- Periodically close their sink files, rename them to mark them available for processing, and resume writing to a new file
Advantages
- Reduces the number of HDFS files generated by Chukwa
- Hides the details of the HDFS file system in use, such as its Hadoop version, from the adaptors
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
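The close–rename–resume cycle of a collector can be sketched as follows. This is a simplified stand-in: real collectors write Hadoop sequence files to HDFS, while this sketch uses local files; the ".chukwa" → ".done" renaming convention follows the Chukwa design docs.

```python
import os
import tempfile

class SinkWriter:
    """Toy collector: appends chunks to a sink file and rotates it by
    renaming, so downstream MapReduce jobs can pick up closed files."""

    def __init__(self, directory: str):
        self.directory = directory
        self.seq = 0
        self._open_new()

    def _open_new(self):
        # Open the next sink file; ".chukwa" marks it as still being written.
        self.path = os.path.join(self.directory, f"sink_{self.seq}.chukwa")
        self.fh = open(self.path, "w")
        self.seq += 1

    def write_chunk(self, chunk: str):
        self.fh.write(chunk + "\n")

    def rotate(self) -> str:
        """Close the current sink file, rename it to mark it available
        for processing, and resume writing to a new file."""
        self.fh.close()
        done_path = self.path.replace(".chukwa", ".done")
        os.rename(self.path, done_path)
        self._open_new()
        return done_path

d = tempfile.mkdtemp()
w = SinkWriter(d)
w.write_chunk("host1: log line")
done = w.rotate()
print(done)
```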
MapReduce processing
Aim: organizing and processing incoming data
MapReduce jobs
- Archiving – takes chunks from its input and outputs new sequence files of chunks, ordered and grouped
- Demux – takes chunks as input and parses them to produce ChukwaRecords (key–value pairs)
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
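The Demux step can be illustrated with a toy parse. The record format here is hypothetical; real Chukwa demux processors are Java classes selected by data type, but the shape of the output — records of key–value fields grouped by type — is the same idea.

```python
# Toy demux: parse raw chunks of "key=value" text into ChukwaRecord-like
# field dictionaries, grouped by data type.
from collections import defaultdict

def demux(chunks):
    """chunks: iterable of (datatype, raw_text) pairs.
    Returns a dict mapping datatype -> list of parsed records."""
    records = defaultdict(list)
    for datatype, raw in chunks:
        fields = dict(
            pair.split("=", 1) for pair in raw.split() if "=" in pair
        )
        records[datatype].append(fields)
    return dict(records)

out = demux([
    ("SysLog", "host=n1 level=INFO msg=started"),
    ("SysLog", "host=n2 level=WARN msg=slow"),
])
print(out["SysLog"][0]["host"])  # n1
```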
HICC (Hadoop Infrastructure Care Center)
- Web interface for displaying data
- Fetches the data from a MySQL database
- Makes it easier to monitor data
http://incubator.apache.org/chukwa/docs/r0.4.0/design.html
Setup for performance analysis
- Hadoop cluster consisting of 15 Unix hosts in the Unix lab at the University of Stavanger
- One host is tagged as the name node; the others are used as data nodes
- Data is stored on the data nodes in a replicated way
Factors influencing performance comparison
- Quality of the data acquired in different ways
- Time used for data acquisition for small data sizes
- Data copying to HDFS for big data sizes
Quality of the data acquired in different ways
- Sink file size = 1 GB
- The Chukwa agent checks the file content every 2 seconds
Figure: the size of data acquired over time
Time used for data acquisition for small data sizes
- Time used to acquire data from the servers
- Time used to put the acquired data into HDFS
Figure: actual time used for acquisition over a certain period
Data copying to HDFS for big data sizes
- The slope of the line is steeper when the replica number is larger
Figure: time used to copy the data set to HDFS with different replica numbers
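The replica effect can be approximated with a linear model: each block is written through a pipeline of data nodes, so the effective copy time grows with the replica count, which is why the slope is steeper for a larger replica number. This is an illustration of the trend, not the measured figures; the bandwidth constant is assumed.

```python
# Illustrative model: with replication, the write pipeline carries each
# byte `replicas` times, so copy time scales with size * replicas.

BANDWIDTH_MB_S = 40.0  # assumed effective write bandwidth in MB/s

def hdfs_copy_time(size_mb: float, replicas: int) -> float:
    """Rough copy time in seconds for a data set of size_mb megabytes."""
    return size_mb * replicas / BANDWIDTH_MB_S

for r in (2, 3):
    print(f"replicas={r}: slope = {r / BANDWIDTH_MB_S:.3f} s/MB")
```

In this model the slope of time versus data size is `replicas / BANDWIDTH_MB_S`, so tripling the replica count steepens the line accordingly.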
Critical value for generating time differences
Corresponding size of the data file at which a time difference in data acquisition appears.

Time used for copying according to the size of the data set, with a replica number of 2:

  Size of data set   Time used
  20M                2s
  30M                3s
  40M                -
  50M                8s
Continued…
Time used for copying according to the size of the data set, with a replica number of 3:

  Size of data set   Time used
  10M                2s
  15M                -
  20M                8s
  30M                10s
  40M                -
Conclusion
Chukwa was demonstrated to work more efficiently for big data sizes, while for small data sizes there was no difference between the two solutions.
Thanks…