Big Data Processing in Cloud Computing Environments
James Vasky
February 20, 2013
Article Title: Big Data Processing in Cloud Computing Environments
Authors: Changqing Ji, Yu Li, Wenming Qiu, Uchechukwu Awada, Keqiu Li
Conference: 2012 International Symposium on Pervasive Systems, Algorithms and Networks
Date: December 2012
Introduction
- There has been an overwhelming increase in data flow over the past two decades.
- DZero, a particle physics experiment, generates more than one TB of data per day.
- Facebook serves 570 billion page views per month, stores 3 billion new photos per month, and manages 25 billion pieces of content.
Managing Large Data Sets
- Traditional DBMSs may not be suitable for managing large data sets; scalability and cost are key concerns and one of the focuses of Big Data research.
- To manage data at this scale, we rely on Big Data management systems.
- The paper discusses Big Data architecture from three aspects: distributed file systems, non-structured and semi-structured data storage, and open source cloud platforms.
Distributed File System
- Google File System (GFS): a chunk-based distributed file system used by Google, mainly for its search engine.
- Hadoop Distributed File System (HDFS): essentially an open source counterpart to GFS. Both are user-level file systems optimized for files measured in GBs.
- Amazon Simple Storage Service (S3): Amazon's cloud-based storage service.
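To make the storage model concrete, here is a minimal sketch of writing and reading an object in S3 using the boto3 Python library (not part of the paper); the bucket name and object key are hypothetical, and AWS credentials are assumed to be configured.

    # Minimal sketch: store and retrieve one object in Amazon S3 via boto3.
    # "my-example-bucket" and the key below are hypothetical.
    import boto3

    s3 = boto3.client("s3")

    # Write a small object. Like GFS/HDFS, S3 favors whole-object,
    # large sequential I/O rather than in-place updates.
    s3.put_object(Bucket="my-example-bucket", Key="logs/day1.txt",
                  Body=b"click,user42,page7\n")

    # Read it back.
    obj = s3.get_object(Bucket="my-example-bucket", Key="logs/day1.txt")
    print(obj["Body"].read())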
Non-Structured and Semi-Structured Data Storage
- Used for non-relational data, e.g. search logs, crawled web content, and click streams.
- Bigtable is Google's distributed storage system for managing structured data at petabyte scale; it does not support a full relational data model.
- PNUTS is designed to support Yahoo!'s web applications.
- Dynamo is a key/value data store for Amazon's internal applications.
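As a toy illustration of the key/value model these systems expose, here is an in-memory sketch of the get/put interface; the class and methods are illustrative, not Dynamo's actual API (real systems add partitioning, replication, and versioning).

    # Toy key/value store illustrating the get/put interface of
    # Dynamo-style systems; purely illustrative, not a real API.
    class ToyKeyValueStore:
        def __init__(self):
            self._data = {}

        def put(self, key, value):
            self._data[key] = value

        def get(self, key, default=None):
            return self._data.get(key, default)

    store = ToyKeyValueStore()
    store.put("user:42:cart", ["book", "lamp"])
    print(store.get("user:42:cart"))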
Open Source Cloud Platform
- Amazon Web Services (AWS), Eucalyptus, OpenNebula, CloudStack, and OpenStack are the most popular cloud management platforms for Infrastructure as a Service (IaaS).
- AWS is not free but is easy to use, with pay-as-you-go pricing.
- Eucalyptus, which is open source, is the earliest IaaS cloud platform and has signed an API compatibility agreement with AWS.
- OpenNebula offers the richest features, flexible configuration, and good interoperability for building private, public, or hybrid clouds.
- CloudStack is an Apache open source project. It delivers public-cloud-style computing on users' own hardware.
- OpenStack is a collection of open source software projects; the community's shared goal is a cloud that is simple to deploy, massively scalable, and feature-rich. It works well for specific enterprise applications, but still has shortcomings such as incomplete functionality and a lack of commercial support.
Applications
- Parallel processing models are necessary to process large amounts of data. Popular models include MPI, general-purpose GPU (GPGPU), MapReduce, and MapReduce-like frameworks.
- MapReduce, proposed by Google, is very popular and has been studied and applied by both industry and academia. It has two advantages: it hides details related to data storage, distribution, replication, etc., and it is so simple that programmers specify only two functions for processing big data: map and reduce.
- Map: the master node takes the input, divides it into smaller sub-problems, and sends them to worker nodes (which may do the same in turn); the workers process the sub-problems and send back their output.
- Reduce: collects those outputs and combines them into the solution to the original problem (see the sketch after this list).
- This method can sort a petabyte of data in a few hours.
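Here is a minimal, single-process sketch of the two functions for the classic word-count example, in plain Python with no Hadoop; the framework normally handles the distribution, shuffling, and fault tolerance that this sketch simulates locally.

    # Word count expressed as map and reduce functions.
    # A real framework runs many mappers/reducers in parallel and
    # shuffles intermediate pairs by key; we simulate that locally.
    from itertools import groupby
    from operator import itemgetter

    def map_fn(line):
        # Emit (word, 1) for every word in an input line.
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all counts emitted for one key.
        return (word, sum(counts))

    lines = ["big data big cloud", "cloud data data"]
    pairs = [kv for line in lines for kv in map_fn(line)]
    pairs.sort(key=itemgetter(0))  # the "shuffle": group pairs by key
    result = [reduce_fn(k, (c for _, c in grp))
              for k, grp in groupby(pairs, key=itemgetter(0))]
    print(result)  # [('big', 2), ('cloud', 2), ('data', 3)]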
Applications
- MapReduce is also criticized as a step backwards compared with DBMSs: it is index- and schema-free, so it requires parsing each record when reading input (a sketch of that cost follows below).
- The conclusion is that neither is good at what the other does, so they are complementary. Some DBMS vendors have integrated a MapReduce front end into their DBMS.
- HadoopDB takes the best of both: the scalability of MapReduce and the performance of a DBMS. In the reported results, Hadoop's task processing times improved by a large factor.
- MapReduce is also used for large statistical analyses. Example: Ricardo, which combines R with Hadoop.
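The parsing criticism is easy to see in code: because MapReduce input is schema-free text, every job re-parses each raw record at read time, whereas a DBMS parses once at load time into typed, indexed storage. A small sketch; the click-log field layout is hypothetical.

    # Schema-free input: each job must re-parse raw records at read time.
    def map_fn(raw_line):
        # Hypothetical click-log layout: timestamp,user_id,url
        timestamp, user_id, url = raw_line.rstrip("\n").split(",")
        yield (url, 1)  # this parsing repeats on every scan of the data

    for pair in map_fn("1355310000,user42,/home"):
        print(pair)  # ('/home', 1)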
Optimizations
- The paper outlines approaches to improving the performance of MapReduce.
- Data transfer bottlenecks: cloud users must consider how to minimize the cost of data transmission.
- Map-Reduce-Merge is a new model that adds a Merge phase after Reduce and combines the reduced outputs of two different jobs (sketched below).
- Map-Join-Reduce improves the MapReduce runtime by adding a Join stage before Reduce, to perform complex data analysis tasks on large clusters.
- MRShare takes a batch of queries and organizes them into groups of jobs so that machines performing similar work can share it.
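To illustrate the Merge idea, here is a toy sketch that joins the reduced outputs of two separate jobs on a shared key; the data and the merge logic are illustrative, not the model's formal definition.

    # Toy Merge phase: combine the reduced outputs of two different jobs.
    reduced_a = {"user1": 10, "user2": 7}   # job A: clicks per user
    reduced_b = {"user1": 2, "user3": 5}    # job B: purchases per user

    def merge(out_a, out_b):
        # Join the two reduced outputs on their shared keys.
        for key in out_a.keys() & out_b.keys():
            yield (key, out_a[key], out_b[key])

    print(list(merge(reduced_a, reduced_b)))  # [('user1', 10, 2)]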
Optimizations, Discussion, and Challenges
- Online processing: MapReduce is poorly suited to online workloads. MapReduce Online is designed to support online aggregation and continuous queries in MapReduce; HOP (the Hadoop Online Prototype) lets users get early returns from a job while it is still being computed (see the sketch after this list).
- Join query optimization: joins are a problem in big data. MapReduce was devised for processing a single input, while a join needs two or more. A three-stage approach has been proposed that efficiently balances the workload and minimizes the need for replication.
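Online aggregation can be sketched as emitting a running estimate while the input is still streaming in, rather than only a final answer; this is a simplification of the early-returns behavior described above, not HOP's actual API.

    # Sketch of online aggregation: report early, refined snapshots
    # while input is still being consumed, not just a final answer.
    def running_average(stream, report_every=3):
        total = count = 0
        for value in stream:
            total += value
            count += 1
            if count % report_every == 0:
                yield total / count  # early, approximate snapshot
        yield total / count          # final answer

    for estimate in running_average([4, 8, 6, 2, 10, 6, 9]):
        print(estimate)  # 6.0, 6.0, then the final 6.43...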
Big Data Storage and Management
- Current technologies cannot satisfy the needs of big data: storage capacity is growing far more slowly than data volume.
- A reconstruction of the information framework is desperately needed, and the bottleneck problems must be solved.
Big Data Computation and Analysis
- Speed is a problem because a process cannot traverse all the related data in a short time. Indexing is an optimal choice here, but current indices target simple data types while big data is becoming more complicated. A combination of appropriate big data indexes and up-to-date preprocessing technology is desirable.
- Traditional sequential algorithms are inefficient for big data; we need to be able to increase parallelism, as sketched below.
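The shift from sequential to parallel processing can be sketched with Python's standard multiprocessing module: partition the data, process the partitions concurrently, then combine the partial results, mirroring the map/reduce split above. The workload function is a stand-in for real analysis.

    # Parallel aggregation sketch: partition the data, process the
    # partitions concurrently, then combine the partial results.
    from multiprocessing import Pool

    def partial_sum(chunk):
        return sum(x * x for x in chunk)  # stand-in for real analysis

    if __name__ == "__main__":
        data = list(range(1_000_000))
        chunks = [data[i::4] for i in range(4)]  # 4 partitions
        with Pool(processes=4) as pool:
            print(sum(pool.map(partial_sum, chunks)))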
Big Data Security
- The massive use of third-party services and infrastructures to host important data makes security and privacy important issues.
- The scale of data and applications grows exponentially, posing challenges for dynamic data monitoring and security protection.
- How can we perform data mining without exposing users' sensitive information?
- Current technologies for privacy protection are mainly based on static data sets, while real data changes dynamically; implementing effective protection in this complex circumstance is therefore a challenge.
- Legal and regulatory issues also need attention.
Conclusion
- Key issues were discussed, including cloud storage and computing architecture, popular parallel processing frameworks, and major applications and optimizations of MapReduce.
- Big Data calls for scalable storage indexes and a distributed approach that retrieves results in near real time.
- Big data is complex and will only become more so; it poses significant challenges to industry and academia that require urgent attention.