EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES.



- INTRODUCTION
- MOTIVATION
- IMPLEMENTATION
  - Core Logic (Map-Reduce Framework)
  - Job Scheduling
  - Load Balancing
- HADOOP & GOOGLE APP ENGINE
- CHALLENGES & ISSUES
- PERFORMANCE ANALYSIS & RESULTS
- QUESTIONS

GOOGLE APP ENGINE?
- PaaS (Platform as a Service)
- A platform for hosting web applications
- Virtualizes applications across multiple servers and Google-managed data centers

Project Description
- Distribute computation across multiple servers and share the load among them
- Use multiple accounts on App Engine
- A Task Tracker runs on each account
- The Job Tracker runs on a stand-alone machine

WHY GOOGLE APP ENGINE?
- WRITE THE CORE LOGIC OF THE APP AND DEPLOY IT
- NO NEED TO WORRY ABOUT DATA CENTERS
- AUTOMATIC SCALING
- FREE UP TO A CERTAIN LIMIT
- PAY AS YOU GO BEYOND THAT LIMIT

WHAT DID WE DO?
- BUILT APPLICATIONS (INVERTED INDEX, WORDCOUNT, MOVIE RATINGS)
- BUILT MAP-REDUCE FUNCTIONS FOR THESE APPLICATIONS
- DEPLOYED THESE MAP/REDUCE FUNCTIONS ON TASK TRACKERS
- A JOB TRACKER, ACTING AS THE MASTER, DISTRIBUTES DATA THROUGH URLFETCH

- PROVIDED A UI THAT LETS THE USER UPLOAD INPUT DATA TO GOOGLE’S PERSISTENT STORAGE, BIGTABLE
- LIBRARIES USED TO CONNECT TO THE PERSISTENT STORAGE: JDO/JPA
- THE USER CAN CHOOSE THE APPLICATION TO BE RUN
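
The map and reduce functions built for an application such as WordCount can be sketched as follows. This is a minimal illustration in Python, not the deployed code: the function names are hypothetical, and everything runs in one process, whereas the project sent each chunk to a task tracker through URLFetch.

```python
from collections import defaultdict


def map_wordcount(chunk):
    """Map step: emit a (word, 1) pair for every word in a text chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_wordcount(pairs):
    """Reduce step: sum the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_wordcount(chunks):
    """Local stand-in for the job tracker: map every chunk, then reduce."""
    pairs = []
    for chunk in chunks:
        # Assumption: in the real system this call would be an HTTP request
        # to the map function deployed on one task tracker.
        pairs.extend(map_wordcount(chunk))
    return reduce_wordcount(pairs)
```

Calling `run_wordcount(["a b a", "b c"])` counts the words across both chunks. An inverted index would follow the same shape, with the map step emitting (word, document id) pairs instead of (word, 1).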

- A JOB IS SUBMITTED TO THE JOB TRACKER
- THE JOB TRACKER MAINTAINS A QUEUE OF JOBS
- SCHEDULER
- PRIORITY SCHEDULER
  - THE USER CAN SPECIFY A PRIORITY FOR THE JOB; THE JOB IS INSERTED INTO THE QUEUE BASED ON THAT PRIORITY
  - USED WHEN THE USER SPECIFIES A PRIORITY

- FIFO SCHEDULER
  - THE SUBMITTED JOB IS INSERTED AT THE BACK OF THE QUEUE
  - A JOB IS PICKED FROM THE FRONT, SO JOBS RUN IN FIFO ORDER
  - DEFAULT SCHEDULER
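
One way to get both scheduling behaviors from a single job queue is sketched below. It is an assumption-laden illustration, not the project's implementation: the class name, the convention that a lower number means higher priority, and the `DEFAULT_PRIORITY` value given to jobs submitted without a priority are all hypothetical.

```python
import heapq
import itertools

DEFAULT_PRIORITY = 10  # assumed priority for jobs submitted without one


class JobQueue:
    """Job queue ordered by priority, FIFO among jobs of equal priority."""

    def __init__(self):
        self._heap = []
        # Submission counter: breaks ties so equal priorities stay FIFO.
        self._seq = itertools.count()

    def submit(self, job, priority=None):
        """Insert a job; without a priority it behaves as plain FIFO."""
        if priority is None:
            priority = DEFAULT_PRIORITY
        heapq.heappush(self._heap, (priority, next(self._seq), job))

    def next_job(self):
        """Remove and return the next job to run, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        return job
```

If no job ever carries a priority, every entry has the same priority value and the submission counter alone decides the order, which is exactly the default FIFO behavior.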

RESOURCE            DAILY LIMIT (FREE)   MAX RATE (FREE)      DAILY LIMIT (BILLED)           MAX RATE (BILLED)
REQUESTS            1,300,000 REQUESTS   7,400 REQUESTS/MIN   43,000,000 REQUESTS            30,000 REQUESTS/MIN
OUTGOING BANDWIDTH  1 GB                 56 MB/MIN            1 GB FREE; 1046 GB MAX         740 MB/MIN
INCOMING BANDWIDTH  1 GB                 56 MB/MIN            1 GB FREE; 1046 GB MAX         740 MB/MIN
CPU TIME            6.5 CPU HOURS        15 CPU-MIN/MIN       6.5 CPU HOURS FREE; 1729 MAX   72 CPU-MIN/MIN

WHY LOAD BALANCING?
- EVERY ACCOUNT HAS A FIXED QUOTA
- DATA IS DISTRIBUTED ACROSS MULTIPLE TASK TRACKERS TO STAY WITHIN THE QUOTA
- COST MODEL FOR LOAD BALANCING
  - COST IS PROPORTIONAL TO THE AMOUNT OF DATA PROCESSED BY A TASK TRACKER
  - DATA IS DIVIDED INTO EQUAL-SIZED CHUNKS AND SENT TO EACH TASK TRACKER’S MAP FUNCTION
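
The cost model can be illustrated with a short sketch. It assumes that chunks are equal in record count and that cost is measured in UTF-8 bytes; the slides say only that chunks are equal-sized and that cost is proportional to the data processed, so the function names and both choices are hypothetical.

```python
def split_into_chunks(records, num_trackers):
    """Split records into num_trackers chunks whose lengths differ by at most one."""
    if num_trackers < 1:
        raise ValueError("num_trackers must be at least 1")
    base, extra = divmod(len(records), num_trackers)
    chunks = []
    start = 0
    for i in range(num_trackers):
        # The first `extra` chunks take one additional record.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def chunk_cost(chunk):
    """Cost of a chunk: the number of UTF-8 bytes a task tracker must process."""
    return sum(len(record.encode("utf-8")) for record in chunk)


def within_quota(chunks, quota_bytes):
    """True if no task tracker is handed more data than its quota allows."""
    return all(chunk_cost(chunk) <= quota_bytes for chunk in chunks)
```

With a check like `within_quota`, the job tracker could tell before dispatching whether the chosen number of task trackers keeps every account inside its limit.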

HANDLING HUGE DATA SETS
- DATA IS DIVIDED INTO CHUNKS
- WHAT IF THE CHUNK SIZE IS HUGE?
  - AT LEAST ONE OF THE TASK TRACKERS WILL FAIL, NO MATTER WHICH LOAD BALANCING ALGORITHM IS USED
  - SOLUTION: DYNAMICALLY INCREASE THE NUMBER OF TASK TRACKERS IF ONE OF THEM FAILS AFTER A FIXED NUMBER OF TRIES
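
The retry-then-scale idea can be sketched as below. The slides do not say how many trackers are added after a failure or what the retry limit is, so adding one tracker at a time, the default `max_tries=3`, the `max_trackers` cap, and the `TaskTrackerError` exception are all assumptions made for illustration.

```python
class TaskTrackerError(Exception):
    """Hypothetical error raised when a task tracker cannot process a chunk."""


def split_evenly(records, parts):
    """Split records into `parts` contiguous chunks of near-equal length."""
    base, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def try_chunk(process_chunk, chunk, max_tries):
    """Call process_chunk up to max_tries times; return (succeeded, result)."""
    for _ in range(max_tries):
        try:
            return True, process_chunk(chunk)
        except TaskTrackerError:
            continue
    return False, None


def run_with_scaling(records, process_chunk, trackers=1, max_trackers=8, max_tries=3):
    """Re-split the data over more trackers whenever a chunk keeps failing."""
    while trackers <= max_trackers:
        results = []
        for chunk in split_evenly(records, trackers):
            ok, result = try_chunk(process_chunk, chunk, max_tries)
            if not ok:
                break  # chunk too large for one tracker: scale out
            results.append(result)
        else:
            # Every chunk succeeded with this number of trackers.
            return results, trackers
        trackers += 1
    raise TaskTrackerError("chunks still fail with max_trackers task trackers")
```

Each added tracker shrinks the chunk size, so a chunk that exceeded one account's quota eventually fits, provided the cap on trackers is high enough.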

LIMITED CONTROL ON GOOGLE APP ENGINE
- NO SPAWNING OF THREADS
- INABILITY TO WRITE TO THE FILESYSTEM OF GOOGLE’S SERVERS
- NO CONTROL OVER DATA LOCALITY
  - THE MACHINE ON WHICH DATA IS STORED IS DYNAMICALLY ALLOCATED BY GOOGLE
- IN HADOOP, THREADS AND FILE I/O ARE AVAILABLE
- IMPLEMENTING HADOOP ON GOOGLE APP ENGINE IS THEREFORE DIFFICULT

- DATA IS NOT RETRIEVED IN THE ORDER IN WHICH IT WAS STORED, BECAUSE OF GOOGLE’S STORAGE ARCHITECTURE
- NO CONTROL OVER THE USE OF NETWORK BANDWIDTH BETWEEN THE JOB TRACKER AND THE TASK TRACKERS
- JOIN AND UNION OPERATIONS ARE EXPENSIVE WHEN THE NUMBER OF TABLES INVOLVED IS LARGE

- THE RESULT IS THE SAME AS WHEN RUNNING THE APPLICATION ON HADOOP
- TESTED THE WORDCOUNT APPLICATION ON A DATA SET CONSISTING OF WORDS USING 3 TASK TRACKERS
- NETWORK BANDWIDTH IS A BOTTLENECK IN THE RUNTIME OF THE APPLICATION, AS DATA HAS TO BE TRANSFERRED FROM THE TASK TRACKERS TO THE JOB TRACKER AND VICE VERSA