EXPOSE GOOGLE APP ENGINE AS TASKTRACKER NODES AND DATA NODES
- INTRODUCTION
- MOTIVATION
- IMPLEMENTATION
  - Core Logic (Map-Reduce Framework)
  - Job Scheduling
  - Load Balancing
- HADOOP & GOOGLE APP ENGINE
- CHALLENGES & ISSUES
- PERFORMANCE ANALYSIS & RESULTS
- QUESTIONS
GOOGLE APP ENGINE?
- PaaS (Platform as a Service)
- A platform for hosting web applications
- Virtualizes applications across multiple servers and Google-managed data centers
Project Description
- Distribute computation across multiple servers and share the load across them
- Use multiple accounts on App Engine
- A Task Tracker runs on each account
- The Job Tracker runs on a stand-alone machine
WHY GOOGLE APP ENGINE?
- WRITE THE CORE LOGIC OF THE APP AND DEPLOY IT
- NO NEED TO WORRY ABOUT DATA CENTERS
- AUTOMATIC SCALING
- FREE UP TO A CERTAIN LIMIT
- PAY AS WE GO BEYOND THAT LIMIT
WHAT WE DID?
- BUILT APPLICATIONS (INVERTED INDEX, WORDCOUNT, MOVIE RATINGS)
- BUILT MAP-REDUCE FUNCTIONS FOR THESE APPLICATIONS
- DEPLOYED THESE MAP/REDUCE FUNCTIONS ON TASK TRACKERS
- A JOB TRACKER, ACTING AS A MASTER, DISTRIBUTES DATA THROUGH URLFETCH
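The map and reduce functions for the WordCount application could look roughly like the sketch below. It is a minimal, single-process illustration and not the deployed code: the function names (`map_word_count`, `reduce_word_count`, `run_word_count`) are hypothetical, and the transfer of chunks to Task Trackers over URLFetch is replaced by a plain loop.

```python
from collections import defaultdict


def map_word_count(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def reduce_word_count(pairs):
    """Reduce step: sum the counts emitted for each distinct word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


def run_word_count(chunks):
    """Stand-in for the Job Tracker: map every chunk, then reduce once.

    In the real system each chunk would be sent to a Task Tracker;
    here the map calls simply run one after another.
    """
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_word_count(chunk))
    return reduce_word_count(intermediate)
```

Calling `run_word_count(["a b a", "b c"])` counts words across both chunks, which mirrors how per-tracker map output is merged by the reduce step.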
- PROVIDED A UI THAT LETS THE USER UPLOAD INPUT DATA TO GOOGLE’S PERSISTENT STORAGE (BIGTABLE)
- LIBRARIES USED TO CONNECT TO THE PERSISTENT STORAGE: JDO/JPA
- THE USER CAN CHOOSE THE APPLICATION TO BE RUN
- THE JOB IS SUBMITTED TO THE JOB TRACKER
- THE JOB TRACKER MAINTAINS A QUEUE OF JOBS
SCHEDULER
PRIORITY SCHEDULER
- THE USER CAN SPECIFY A PRIORITY FOR THE JOB; THE JOB IS INSERTED INTO THE QUEUE BASED ON IT
- USED WHEN THE USER SPECIFIES A PRIORITY
FIFO SCHEDULER
- THE SUBMITTED JOB IS INSERTED AT THE BACK OF THE QUEUE
- A JOB IS PICKED FROM THE FRONT, SO JOBS RUN IN FIFO ORDER
- DEFAULT SCHEDULER
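The two scheduling policies can be sketched as two small queue classes. This is an illustration under assumptions, not the project's scheduler: the class and method names are hypothetical, a lower number is assumed to mean a higher priority, and jobs with equal priority are assumed to run in submission order.

```python
import heapq
from collections import deque
from itertools import count


class FifoScheduler:
    """Default policy: jobs join at the back and leave from the front."""

    def __init__(self):
        self._queue = deque()

    def submit(self, job):
        self._queue.append(job)

    def next_job(self):
        # Return None when the queue is empty instead of raising.
        return self._queue.popleft() if self._queue else None


class PriorityScheduler:
    """Policy used when the user supplies a priority for the job."""

    def __init__(self):
        self._heap = []
        self._order = count()  # tie-breaker: keeps equal priorities in FIFO order

    def submit(self, job, priority):
        # Assumption: a smaller number means the job should run sooner.
        heapq.heappush(self._heap, (priority, next(self._order), job))

    def next_job(self):
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]
```

The sequence counter in `PriorityScheduler` avoids comparing job objects directly and keeps the ordering stable among jobs that share a priority.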
RESOURCE           | DAILY LIMIT (FREE)  | MAX RATE (FREE)     | DAILY LIMIT (BILLED)          | MAX RATE (BILLED)
REQUESTS           | 13,00,000 REQUESTS  | 7,400 REQUESTS/MIN  | 4,30,00,000 REQUESTS          | 30,000 REQUESTS/MIN
OUTGOING BANDWIDTH | 1 GB                | 56 MB/MIN           | 1 GB FREE; 1046 GB MAX        | 740 MB/MIN
INCOMING BANDWIDTH | 1 GB                | 56 MB/MIN           | 1 GB FREE; 1046 GB MAX        | 740 MB/MIN
CPU TIME           | 6.5 CPU HOURS       | 15 CPU-MIN/MIN      | 6.5 CPU HOURS FREE; 1729 MAX  | 72 CPU-MIN/MIN
WHY?
- EVERY ACCOUNT HAS A FIXED QUOTA
- DATA IS DISTRIBUTED ACROSS MULTIPLE TASK TRACKERS TO STAY WITHIN THE QUOTA
COST MODEL FOR LOAD BALANCING
- COST IS PROPORTIONAL TO THE AMOUNT OF DATA PROCESSED BY A TASK TRACKER
- DATA IS DIVIDED INTO EQUAL-SIZED CHUNKS AND SENT TO EACH TASK TRACKER’S MAP FUNCTION
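Since the cost model charges each Task Tracker in proportion to the data it processes, equal-sized chunks give every tracker roughly the same cost. The splitting step might be sketched as follows; `split_into_chunks` is a hypothetical helper, and it assumes the input is an in-memory list of records.

```python
def split_into_chunks(records, num_trackers):
    """Split records into num_trackers chunks whose sizes differ by at most one."""
    if num_trackers <= 0:
        raise ValueError("num_trackers must be positive")
    base, extra = divmod(len(records), num_trackers)
    chunks = []
    start = 0
    for i in range(num_trackers):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks
```

For ten records and three trackers this yields chunk sizes 4, 3, and 3, so no tracker carries more than one record above the others.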
HANDLING HUGE DATA SETS
- DATA IS DIVIDED INTO CHUNKS
- WHAT IF THE CHUNK SIZE IS HUGE?
  - AT LEAST ONE OF THE TASK TRACKERS WILL FAIL, NO MATTER WHICH LOAD BALANCING ALGORITHM IS USED
- SOLUTION: DYNAMICALLY INCREASE THE NUMBER OF TASK TRACKERS IF ONE OF THEM FAILS AFTER A FIXED NUMBER OF TRIALS
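The recovery strategy above can be sketched as a retry loop that adds a Task Tracker whenever a chunk keeps failing. Everything named here is an assumption made for illustration: `ChunkTooLarge` stands in for whatever failure a tracker reports, `process_chunk` stands in for the remote map call, and adding exactly one tracker per round is one possible policy among several.

```python
class ChunkTooLarge(Exception):
    """Hypothetical error raised when a chunk exceeds a tracker's limits."""


def split_evenly(records, parts):
    """Split records into `parts` chunks whose sizes differ by at most one."""
    base, extra = divmod(len(records), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(records[start:start + size])
        start += size
    return chunks


def run_with_scaling(records, process_chunk, num_trackers, max_trackers, max_trials=3):
    """Retry each chunk up to max_trials times; on failure, add a tracker and re-split."""
    while num_trackers <= max_trackers:
        results = []
        failed = False
        for chunk in split_evenly(records, num_trackers):
            for _ in range(max_trials):
                try:
                    results.append(process_chunk(chunk))
                    break
                except ChunkTooLarge:
                    continue
            else:
                # Every trial failed: give up on this tracker count.
                failed = True
                break
        if not failed:
            return num_trackers, results
        num_trackers += 1  # smaller chunks on the next round
    raise RuntimeError("data set does not fit within max_trackers")
```

With a `process_chunk` that rejects chunks of more than three records, ten records starting from two trackers succeed once the count reaches four, because the chunk sizes shrink from 5 to 4 and then to 3.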
LIMITED CONTROL ON GOOGLE APP ENGINE
- NO SPAWNING OF THREADS
- INABILITY TO WRITE TO THE FILESYSTEM OF GOOGLE’S SERVERS
- NO CONTROL OVER DATA LOCALITY
  - THE MACHINE ON WHICH DATA IS STORED IS DYNAMICALLY ALLOCATED BY GOOGLE
- IN HADOOP, THREADS AND FILE I/O ARE AVAILABLE
- IMPLEMENTING HADOOP USING GOOGLE APP ENGINE IS DIFFICULT
- DATA RETRIEVAL IS NOT IN THE SAME ORDER AS DATA STORAGE, BECAUSE OF GOOGLE’S STORAGE ARCHITECTURE
- NO CONTROL OVER THE USE OF NETWORK BANDWIDTH BETWEEN THE JOB TRACKER AND TASK TRACKERS
- JOIN AND UNION OPERATIONS ARE EXPENSIVE WHEN THE NUMBER OF TABLES INVOLVED IS LARGE
- THE RESULT IS THE SAME AS WHEN RUNNING THE APPLICATION ON HADOOP
- TESTED THE WORDCOUNT APPLICATION ON A DATA SET CONSISTING OF WORDS USING 3 TASK TRACKERS
- NETWORK BANDWIDTH IS A BOTTLENECK IN THE RUNTIME OF THE APPLICATION, AS DATA HAS TO BE TRANSFERRED FROM TASK TRACKERS TO THE JOB TRACKER AND VICE VERSA