Big Data Processing in Cloud Computing Environments
James Vasky
February 20, 2013
Article Title: Big Data Processing in Cloud Computing Environments
Authors: Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li
Conference: 2012 International Symposium on Pervasive Systems, Algorithms and Networks
Date: December 2012
Introduction
- There has been an overwhelming increase in data flow over the past two decades.
- DZero, a particle physics experiment, generates more than one TB of data per day.
- Facebook serves 570 billion page views per month, stores 3 billion new photos per month, and manages 25 billion pieces of content.
Managing Large Data Sets
- Traditional DBMSs may not be suitable for managing large data sets; scalability and cost are key concerns and one of the focuses of Big Data research.
- To manage data at this scale, we rely on Big Data management systems.
- The paper discusses Big Data architecture from three aspects: distributed file systems, non-structured and semi-structured data storage, and open source cloud platforms.
Distributed File System
- Google File System (GFS): a chunk-based distributed file system used by Google, mainly for its search engine.
- Hadoop Distributed File System (HDFS): essentially an open source counterpart to GFS. Both are user-level file systems optimized for files measured in GBs.
- Amazon Simple Storage Service (S3): Amazon's cloud-based storage service.
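To make the storage model concrete, here is a minimal sketch of writing and reading an object in S3 using the boto3 Python library (not part of the paper); the bucket name and object key are hypothetical, and AWS credentials are assumed to be configured.

    # Minimal sketch: store and retrieve one object in Amazon S3 via boto3.
    # "my-example-bucket" and the key below are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    # Write a small object. Like GFS/HDFS, S3 favors whole-object,
    # large sequential I/O rather than in-place updates.
    s3.put_object(Bucket="my-example-bucket", Key="logs/day1.txt",
                  Body=b"click,user42,page7\n")

    # Read it back.
    obj = s3.get_object(Bucket="my-example-bucket", Key="logs/day1.txt")
    print(obj["Body"].read())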
Non-Structured and Semi-Structured Data Storage
- Used for non-relational data, e.g. search logs, crawled web content, and click streams.
- Bigtable is Google's distributed storage system for managing structured data at petabyte scale; it does not support a full relational data model.
- PNUTS is designed to support Yahoo!'s web applications.
- Dynamo is a key/value data store for Amazon's internal applications.
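As a toy illustration of the key/value model these systems expose, here is an in-memory sketch of the get/put interface; the class and methods are illustrative, not Dynamo's actual API (real systems add partitioning, replication, and versioning).

    # Toy key/value store illustrating the get/put interface of
    # Dynamo-style systems; purely illustrative, not a real API.
    class ToyKeyValueStore:
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key, default=None):
            return self._data.get(key, default)

    store = ToyKeyValueStore()
    store.put("user:42:cart", ["book", "lamp"])
    print(store.get("user:42:cart"))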
Open Source Cloud Platform
- Amazon Web Services (AWS), Eucalyptus, OpenNebula, CloudStack, and OpenStack are the most popular cloud management platforms for Infrastructure as a Service (IaaS).
- AWS is not free but is easy to use, with pay-as-you-go pricing.
- Eucalyptus, which is open source, is the earliest IaaS cloud platform and has signed an API compatibility agreement with AWS.
- OpenNebula offers the richest features, flexible configuration, and good interoperability for building private, public, or hybrid clouds.
- CloudStack is an Apache open source project. It delivers public-cloud-style computing on users' own hardware.
- OpenStack is a collection of open source software projects; the community's shared goal is a cloud that is simple to deploy, massively scalable, and feature-rich. It works well for specific enterprise applications, but still has shortcomings such as incomplete functionality and a lack of commercial support.
Applications
- Parallel processing models are necessary to process large amounts of data. Popular models include MPI, general-purpose GPU (GPGPU), MapReduce, and MapReduce-like frameworks.
- MapReduce, proposed by Google, is very popular and has been studied and applied by both industry and academia. It has two advantages: it hides details related to data storage, distribution, replication, etc., and it is so simple that programmers specify only two functions for processing big data: map and reduce.
- Map: the master node takes the input, divides it into smaller sub-problems, and sends them to worker nodes (which may do the same in turn); the workers process the sub-problems and send back their output.
- Reduce: collects those outputs and combines them into the solution to the original problem (see the sketch after this list).
- This method can sort a petabyte of data in a few hours.
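Here is a minimal, single-process sketch of the two functions for the classic word-count example, in plain Python with no Hadoop; the framework normally handles the distribution, shuffling, and fault tolerance that this sketch simulates locally.

    # Word count expressed as map and reduce functions.
    # A real framework runs many mappers/reducers in parallel and
    # shuffles intermediate pairs by key; we simulate that locally.
    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # Emit (word, 1) for every word in an input line.
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all counts emitted for one key.
        return (word, sum(counts))

    lines = ["big data big cloud", "cloud data data"]
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))  # the "shuffle": group pairs by key
    result = [reduce_fn(k, (c for _, c in grp))
              for k, grp in groupby(pairs, key=itemgetter(0))]
    print(result)  # [('big', 2), ('cloud', 2), ('data', 3)]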
Applications
- MapReduce is also criticized as a step backwards compared with DBMSs: it is index- and schema-free, so it requires parsing each record when reading input (a sketch of that cost follows below).
- The conclusion is that neither is good at what the other does, so they are complementary. Some DBMS vendors have integrated a MapReduce front end into their DBMS.
- HadoopDB takes the best of both: the scalability of MapReduce and the performance of a DBMS. In the reported results, Hadoop's task processing times improved by a large factor.
- MapReduce is also used for large statistical analyses. Example: Ricardo, which combines R with Hadoop.
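The parsing criticism is easy to see in code: because MapReduce input is schema-free text, every job re-parses each raw record at read time, whereas a DBMS parses once at load time into typed, indexed storage. A small sketch; the click-log field layout is hypothetical.

    # Schema-free input: each job must re-parse raw records at read time.
    def map_fn(raw_line):
        # Hypothetical click-log layout: timestamp,user_id,url
        timestamp, user_id, url = raw_line.rstrip("\n").split(",")
        yield (url, 1)  # this parsing repeats on every scan of the data

    for pair in map_fn("1355310000,user42,/home"):
        print(pair)  # ('/home', 1)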
Optimizations
- The paper outlines approaches to improving the performance of MapReduce.
- Data transfer bottlenecks: cloud users must consider how to minimize the cost of data transmission.
- Map-Reduce-Merge is a new model that adds a Merge phase after Reduce and combines the reduced outputs of two different jobs (sketched below).
- Map-Join-Reduce improves the MapReduce runtime by adding a Join stage before Reduce, to perform complex data analysis tasks on large clusters.
- MRShare takes a batch of queries and organizes them into groups of jobs so that machines performing similar work can share it.
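To illustrate the Merge idea, here is a toy sketch that joins the reduced outputs of two separate jobs on a shared key; the data and the merge logic are illustrative, not the model's formal definition.

    # Toy Merge phase: combine the reduced outputs of two different jobs.
    reduced_a = {"user1": 10, "user2": 7}   # job A: clicks per user
    reduced_b = {"user1": 2, "user3": 5}    # job B: purchases per user

    def merge(out_a, out_b):
        # Join the two reduced outputs on their shared keys.
        for key in out_a.keys() & out_b.keys():
            yield (key, out_a[key], out_b[key])

    print(list(merge(reduced_a, reduced_b)))  # [('user1', 10, 2)]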
Optimizations, Discussion, and Challenges
- Online processing: MapReduce is poorly suited to online workloads. MapReduce Online is designed to support online aggregation and continuous queries in MapReduce; HOP (the Hadoop Online Prototype) lets users get early returns from a job while it is still being computed (see the sketch after this list).
- Join query optimization: joins are a problem in big data. MapReduce was devised for processing a single input, while a join needs two or more. A three-stage approach has been proposed that efficiently balances the workload and minimizes the need for replication.
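Online aggregation can be sketched as emitting a running estimate while the input is still streaming in, rather than only a final answer; this is a simplification of the early-returns behavior described above, not HOP's actual API.

    # Sketch of online aggregation: report early, refined snapshots
    # while input is still being consumed, not just a final answer.
    def running_average(stream, report_every=3):
        total = count = 0
        for value in stream:
            total += value
            count += 1
            if count % report_every == 0:
                yield total / count  # early, approximate snapshot
        yield total / count          # final answer

    for estimate in running_average([4, 8, 6, 2, 10, 6, 9]):
        print(estimate)  # 6.0, 6.0, then the final 6.43...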
Big Data Storage and Management
- Current technologies cannot satisfy the needs of big data: storage capacity is growing far more slowly than data volume.
- A reconstruction of the information framework is desperately needed, and the bottleneck problems must be solved.
Big Data Computation and Analysis
- Speed is a problem because a process cannot traverse all the related data in a short time. Indexing is an optimal choice here, but current indices target simple data types while big data is becoming more complicated. A combination of appropriate big data indexes and up-to-date preprocessing technology is desirable.
- Traditional sequential algorithms are inefficient for big data; we need to be able to increase parallelism, as sketched below.
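The shift from sequential to parallel processing can be sketched with Python's standard multiprocessing module: partition the data, process the partitions concurrently, then combine the partial results, mirroring the map/reduce split above. The workload function is a stand-in for real analysis.

    # Parallel aggregation sketch: partition the data, process the
    # partitions concurrently, then combine the partial results.
    from multiprocessing import Pool

    def partial_sum(chunk):
        return sum(x * x for x in chunk)  # stand-in for real analysis

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]  # 4 partitions
        with Pool(processes=4) as pool:
            print(sum(pool.map(partial_sum, chunks)))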
Big Data Security
- The massive use of third-party services and infrastructures to host important data makes security and privacy important issues.
- The scale of data and applications grows exponentially, posing challenges for dynamic data monitoring and security protection.
- How can we perform data mining without exposing users' sensitive information?
- Current technologies for privacy protection are mainly based on static data sets, while real data changes dynamically; implementing effective protection in this complex circumstance is therefore a challenge.
- Legal and regulatory issues also need attention.
Conclusion
- Key issues were discussed, including cloud storage and computing architecture, popular parallel processing frameworks, and major applications and optimizations of MapReduce.
- Big Data calls for scalable storage indexes and a distributed approach that retrieves results in near real time.
- Big data is complex and will only become more so; it poses significant challenges to industry and academia that require urgent attention.