Cloud Computing

Evolution of Computing with Network (1/2)
- Network Computing: the network is the computer (client/server); separation of functionalities
- Cluster Computing: tightly coupled computing resources (CPU, storage, data, etc.), usually connected within a LAN and managed as a single resource; built from commodity, open-source components

Evolution of Computing with Network (2/2)
- Grid Computing: resource sharing across several administrative domains; decentralized, open standards; global resource sharing
- Utility Computing: a new ownership model; don't buy computers, lease computing power; upload your job, run it, download the results

The Next Step: Cloud Computing
- Services and data live in the cloud, accessible from any device connected to the cloud with a browser
- A key technical issue for developers: scalability
- Services are location-transparent: users need not know where they run geographically

Applications on the Web

Cloud Computing Definition
"Cloud computing is a concept of using the internet to allow people to access technology-enabled services. It allows users to consume services without knowledge of, expertise with, or control over the technology infrastructure that supports them." - Wikipedia

Major Types of Cloud
- Compute and Data Cloud: Amazon Elastic Compute Cloud (EC2), Google MapReduce, science clouds; provide a platform for running user code
- Host Cloud: Google AppEngine; provides high availability, fault tolerance, and robustness for web applications

Cloud Computing Example - Amazon EC2
http://aws.amazon.com/ec2

Cloud Computing Example - Google AppEngine
Google AppEngine APIs:
- Python runtime environment
- Datastore API
- Images API
- Mail API
- Memcache API
- URL Fetch API
- Users API
A free account can use up to 500 MB of storage and enough CPU and bandwidth for about 5 million page views a month.
http://code.google.com/appengine/
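To make this concrete, here is a minimal sketch of an AppEngine request handler, using the original Python webapp framework of that era; the handler name and route are illustrative, not taken from the slides:

    from google.appengine.ext import webapp
    from google.appengine.ext.webapp.util import run_wsgi_app

    class MainPage(webapp.RequestHandler):
        def get(self):
            # App Engine routes the HTTP request here; no web server setup needed.
            self.response.headers['Content-Type'] = 'text/plain'
            self.response.out.write('Hello from App Engine!')

    # Map URL paths to handlers; Google's infrastructure does the rest.
    application = webapp.WSGIApplication([('/', MainPage)], debug=True)

    def main():
        run_wsgi_app(application)

    if __name__ == '__main__':
        main()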

Cloud Computing Advantages
- Separation of infrastructure maintenance duties from application development
- Separation of application code from physical resources
- Ability to use external assets to handle peak loads
- Ability to scale to meet user demands quickly
- Sharing capability among a large pool of users, improving overall utilization

Cloud Computing Summary
- Cloud computing is a kind of network service and is a trend for future computing
- Scalability matters in cloud computing technology
- Users focus on application development
- Services are location-transparent: users need not know where they run geographically

Counting the Numbers vs. Programming Model
- Personal Computer: one to one
- Client/Server: one to many
- Cloud Computing: many to many

What Powers Cloud Computing at Google? Commodity Hardware
- Performance: a single machine is not interesting
- Reliability: even the most reliable hardware will still fail, so fault-tolerant software is needed; fault-tolerant software in turn enables the use of commodity components
- Standardization: use standardized machines to run all kinds of applications

What Powers Cloud Computing at Google? Infrastructure Software
- Distributed storage: Google File System (GFS)
- Distributed semi-structured data system: BigTable
- Distributed data processing system: MapReduce
What are the common issues across all of these systems?

Google File System
- Files are broken into chunks (typically 64 MB)
- Chunks are replicated across three machines for safety (tunable)
- Data transfers happen directly between clients and chunkservers
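To illustrate the last point, here is a minimal sketch of the read path, assuming a hypothetical master.lookup() control API; the real GFS client protocol differs in detail, but the idea that the master stays on the control path only is from the slide:

    CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks

    def gfs_read(master, filename, offset, length):
        # Control path: ask the master which chunk holds this byte offset
        # and where its replicas live (hypothetical API).
        chunk_index = offset // CHUNK_SIZE
        chunk_handle, replica_servers = master.lookup(filename, chunk_index)

        # Data path: fetch the bytes directly from a chunkserver;
        # the bulk data never flows through the master.
        chunkserver = replica_servers[0]  # e.g. pick the closest replica
        return chunkserver.read(chunk_handle, offset % CHUNK_SIZE, length)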

GFS Usage @ Google
- 200+ clusters
- Filesystem clusters of up to 5,000+ machines
- Pools of 10,000+ clients
- 5+ petabyte filesystems
- All in the presence of frequent hardware failure

BigTable
Data model: (row, column, timestamp) → cell contents
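A toy in-memory illustration of that map in Python; the row and column names below are invented for the example, echoing the web-table example from the BigTable paper:

    # BigTable's data model is essentially a sparse, sorted map:
    #   (row, column, timestamp) -> cell contents
    table = {
        ('com.cnn.www', 'contents:', 1200000000): '<html>...</html>',
        ('com.cnn.www', 'anchor:cnnsi.com', 1199000000): 'CNN',
    }

    def read_latest(table, row, column):
        # Return the most recent version of a cell, or None if absent.
        versions = [(ts, value) for (r, c, ts), value in table.items()
                    if r == row and c == column]
        return max(versions)[1] if versions else None

    print(read_latest(table, 'com.cnn.www', 'contents:'))  # '<html>...</html>'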

BigTable
A distributed, multi-level sparse map that is scalable, self-managing, fault-tolerant, and persistent:
- Scalable: thousands of servers; terabytes of in-memory data; petabytes of disk-based data
- Self-managing: servers can be added/removed dynamically; servers adjust to load imbalance

Why Not Just Use a Commercial DB?
- Scale is too large, or cost too high, for most commercial databases
- Low-level storage optimizations help performance significantly, and they are much harder to do when running on top of a database layer
- Also, it is fun and challenging to build large-scale systems

BigTable Summary
- Data model applicable to a broad range of clients
- Actively deployed in many of Google's services
- Provides a high-performance storage system at large scale: self-managing; thousands of servers; millions of ops/second; multiple GB/s of reading/writing
- Currently 500+ BigTable cells; the largest cell manages ~3 PB of data spread over several thousand machines

Distributed Data Processing
Problem: how do we count the words in a set of text files?
- Input: N text files, totaling multiple physical disks in size
- Processing phase 1: launch M processes; each takes N/M of the text files as input and outputs partial counts for each word
- Processing phase 2: merge the M output files of phase 1

Pseudo Code of WordCount
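The original slide's code is not preserved in the transcript; a minimal Python sketch of the two-phase approach just described, with invented file names, would be:

    from collections import Counter

    def phase1(input_files):
        # One of the M processes: count words in its share of the N files
        # and return the partial counts.
        counts = Counter()
        for path in input_files:
            with open(path) as f:
                for line in f:
                    counts.update(line.split())
        return counts

    def phase2(partial_counts):
        # Merge the M partial results into the final word counts.
        total = Counter()
        for counts in partial_counts:
            total.update(counts)
        return total

    # e.g. two "processes", each handling half of four input files:
    partials = [phase1(['f1.txt', 'f2.txt']), phase1(['f3.txt', 'f4.txt'])]
    print(phase2(partials).most_common(3))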

Task Management
- Logistics: decide which computers run phase 1, and make sure the input files are accessible (NFS-like access, or copy them); similarly for phase 2
- Execution: launch the phase 1 programs with appropriate command-line flags; re-launch failed tasks until phase 1 is done
- Automation: build task scripts on top of an existing batch system

Technical Issues
- File management: where to store the files? Storing all files on the same file server creates a bottleneck; a distributed file system offers the opportunity to run tasks locally
- Granularity: how to decide N and M?
- Job allocation: which task goes to which node? Prefer local jobs, which requires knowledge of the file system
- Fault recovery: what if a node crashes? Data must be redundant, and crash detection and job re-allocation are necessary

MapReduce
A simple programming model that applies to many data-intensive computing problems, hiding the messy details in the MapReduce runtime library:
- Automatic parallelization
- Load balancing
- Network and disk transfer optimization
- Handling of machine failures
- Robustness
- Easy to use

MapReduce Programming Model
Borrowed from functional programming:
map(f, [x1, ..., xm, ...]) = [f(x1), ..., f(xm), ...]
reduce(f, x1, [x2, x3, ...]) = reduce(f, f(x1, x2), [x3, ...]) = ... (continue until the list is exhausted)
Users implement two functions:
map(in_key, in_value) → list of (key, value)
reduce(key, [value1, ..., valuem]) → f_value
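Both primitives exist in ordinary Python, which makes the analogy easy to see; this is a sketch of the functional idea, not the MapReduce API itself:

    from functools import reduce

    # map applies f to every element independently, so it parallelizes trivially.
    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))   # [1, 4, 9, 16]

    # reduce folds the list pairwise: f(f(f(1, 4), 9), 16) = 30.
    total = reduce(lambda a, b: a + b, squares)          # 30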

MapReduce – A New Model and System
Two phases of data processing:
Map: (in_key, in_value) → {(key_j, value_j) | j = 1...k}
Reduce: (key, [value1, ..., valuem]) → (key, f_value)

MapReduce Version of the Pseudo Code
- No file I/O
- Only data processing logic
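The transcript omits the slide's code; in the spirit of the paper's pseudocode, a Python sketch of the two user-supplied functions follows. Note there is no file I/O here: the runtime, not the user, reads the input, shuffles the pairs, and writes the output.

    def mapper(in_key, in_value):
        # in_key: document URL; in_value: document contents.
        # Emit (word, "1") once per word in the document.
        for word in in_value.split():
            yield (word, '1')

    def reducer(key, values):
        # key: a word; values: the list of '1' strings emitted for it.
        # The MapReduce library has already grouped the pairs by key.
        yield str(sum(int(v) for v in values))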

Example – WordCount (1/2)
- Input is files with one document per record
- Specify a map function that takes a key/value pair: key = document URL, value = document contents
- The output of the map function is a list of key/value pairs; in our case, output (w, "1") once per word w in the document

Example – WordCount (2/2)
- The MapReduce library gathers together all pairs with the same key (shuffle/sort)
- The reduce function combines the values for a key; in our case, it computes the sum
- The output of reduce is paired with the key and saved

MapReduce Framework
For certain classes of problems, the MapReduce framework provides:
- Automatic and efficient parallelization/distribution
- I/O scheduling: run each mapper close to its input data
- Fault tolerance: restart failed mapper or reducer tasks on the same or different nodes
- Robustness: tolerate even massive failures, e.g. large-scale network maintenance (once lost 1,800 out of 2,000 machines)
- Status and monitoring

Task Granularity and Pipelining
Fine-grained tasks: many more map tasks than machines
- Minimizes time for fault recovery
- Can pipeline shuffling with map execution
- Better dynamic load balancing
Often uses 200,000 map tasks and 5,000 reduce tasks on 2,000 machines

MapReduce: Uses at Google
- Typical configuration: 200,000 mappers, 500 reducers on 2,000 nodes
- Broad applicability has been a pleasant surprise: quality experiments, log analysis, machine translation, ad-hoc data processing
- The production indexing system was rewritten with MapReduce: ~10 MapReductions, much simpler than the old code

MapReduce Summary
- MapReduce has proven to be a useful abstraction
- It greatly simplifies large-scale computation at Google
- It is fun to use: focus on the problem, and let the library deal with the messy details

A Data Playground
MapReduce + BigTable + GFS = data playground
- A substantial fraction of the internet available for processing
- Easy-to-use teraflops and petabytes, with quick turn-around
- Cool problems, great colleagues

Open Source Cloud Software: Project Hadoop
- Google published papers on GFS ('03), MapReduce ('04) and BigTable ('06)
- Project Hadoop: an open source project with the Apache Software Foundation that implements Google's cloud technologies in Java
- HDFS (GFS) and Hadoop MapReduce are available; HBase (BigTable) is being developed
- Google is not directly involved in the development, to avoid conflicts of interest

Industrial Interest in Hadoop
- Yahoo! hired core Hadoop developers, and announced on Feb. 19, 2008 that its Webmap is produced on a Hadoop cluster with 2,000 hosts (dual/quad cores)
- Amazon EC2 (Elastic Compute Cloud) supports Hadoop: write your mapper and reducer, upload your data and program, run, and pay by resource utilization
- TIFF-to-PDF conversion of 11 million scanned New York Times articles (1851-1922) was done in 24 hours on Amazon S3/EC2, using Hadoop on 100 EC2 machines
- Many Silicon Valley startups are using EC2, and starting to use Hadoop, for their coolest ideas on internet-scale data
- IBM announced "Blue Cloud," which will include Hadoop among other software components

AppEngine
- Run your application on Google's infrastructure and data centers
- Focus on your application; forget about machines, operating systems, web server software, database setup/maintenance, load balancing, etc.
- Opened for public sign-up on 2008/5/28
- Python API to the Datastore and Users services
- Free to start, pay as you expand
http://code.google.com/appengine/

Summary
- Cloud computing is about scalable web applications and the data processing needed to make those applications interesting
- Lots of commodity PCs: good for scalability and cost
- Build web applications to be scalable from the start
- AppEngine lets developers use Google's scalable infrastructure and data centers
- Hadoop enables scalable data processing