A Framework for Flexible Programming in Complex Grid Environments
04/24/08
Taura Lab. 2nd Year 76426 Ken Hironaka

2 New Context for Grid Computing
Grid computing:
– Computation across multiple clusters over a WAN
Conventionally, high-performance computing
– Parallel programming experts
Broadening of demands and needs
– Natural language processing
– Genetic sequence analysis
– The user base is extending to non-experts in parallel programming

3 More Applications for Grid Computing
Just computing ⇒ computing is only one part
“Cloud Computing”
– Applications with a Grid computing backend
Handle intensive computation, load-balancing
– e.g., Web application backends
A simple job submitter is not enough
[Figure labels: backend, application, publicly accessible]

4 Problems with Conventional Frameworks
Conventional Grid computing
– Distributed task computation frameworks
– No interaction
Answering broader applications and demands
– Complex interaction
Need for flexible workflow coordination without loss of simplicity
Programming support for the Grid
[Figure labels: task, file, fine-grained interaction]

5 Problems with Grid Computing Deployment
Complexity of Grid environments
– Dynamically joining nodes
– Node/network failures
– Network environment
  Prevalence of NAT/firewalls
  Unreliable WAN connections
Configuration? Communication (sockets)? Error handling?
Need for simple deployment on complex environments
[Figure labels: join, leave, firewall, faulty link]

6 Our Contribution
A distributed object-oriented programming framework that alleviates the burden of Grid environments
– Flexibility of programming without loss of simplicity
– Simplicity of deployment
  Runs on the Grid with minimal configuration
– Real-life applications
  Deployed an application on over 900 cores across 9 clusters
  “Troubleshooting” search engine on the Grid
  – As an example of Cloud Computing

7 Agenda
Introduction
Related Work
Proposal
Preliminary Experiments
Conclusion and Future Work

8 Distributed Objects and RMI
ProActive [Huet ‘04]
Distributed object-oriented
– Objects on remote nodes
Work delegation
– RMI (Remote Method Invocation)
Parallel computation via asynchronous RMI
– Possible race conditions
Active objects
– 1 object = 1 thread
– Induces deadlocks
Need for synchronization ⇒ code cluttered with locks/synchronization; coding becomes complex
[Figure: foo.doJob(args) is sent as an RMI and computed on foo; asynchronous RMIs b.f() and a.g() between objects a and b deadlock]

9 Handling Joins and Failures
JoJo [Nakada ‘04]
– Master-Worker framework
– Event-driven coding
– A handler is invoked for each event
  Task completion
  Node joins
  Node failures
Issues:
– Synchronization issues
– Event-driven programming
– For more complex problems, code easily becomes unreadable
[Figure labels: join, failure, failure handler, join handler]

10 Resolving Connectivity on the Grid
ProActive [Huet ‘04]
– Overlay network for communication
– Resorts to manual network configuration files
  Specify each connection
– Configuration overhead becomes enormous at Grid scale
[Figure labels: connection configuration file, NAT, firewall, configure each link]

12 Our Proposal
A distributed object-oriented framework for the Grid
– Distributed objects with Grid programming support
  Deadlock-free synchronization
  Additional constructs to cope with joins/failures of nodes
– Automatic and adaptive overlay construction for the Grid runtime
Object-oriented with support for race conditions/joins/failures: flexibility and simplicity
Deployment requires minimal configuration: simplicity

13 Object Synchronization Model
Parallel programming with minimal use of explicit locks
Distributed objects with ownership
– An object's methods can be executed by only 1 thread at a time: the owner thread
– Eliminates data races
The owner gives up ownership for blocking operations
– Other threads may contest for ownership
– Eliminates deadlocks for common cases
[Figure: the owner thread blocks and gives up ownership; a waiting thread becomes the new owner; on unblocking, the original thread re-contests for ownership]
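
The ownership rule on this slide can be sketched with a condition variable. This is a minimal illustration and not the framework's implementation: the names `OwnedObject`, `owned()`, `blocking()`, and `Node` are hypothetical, and the "blocking operation" is an ordinary local method call standing in for a blocking RMI. The demo runs the cross-call pattern that deadlocks with one-thread-per-object active objects (`a.f()` calls `b.g()` while `b.f()` calls `a.g()`).

```python
import threading
from contextlib import contextmanager


class OwnedObject:
    """Hypothetical base class: at most one thread (the owner) runs inside the object."""

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None  # thread that currently owns the object, or None

    def _acquire(self):
        with self._cond:
            while self._owner is not None:      # wait until nobody owns the object
                self._cond.wait()
            self._owner = threading.current_thread()

    def _release(self):
        with self._cond:
            self._owner = None
            self._cond.notify_all()             # let waiting threads contest

    @contextmanager
    def owned(self):
        """Run a method body as the owner thread."""
        self._acquire()
        try:
            yield
        finally:
            self._release()

    @contextmanager
    def blocking(self):
        """Give up ownership around a blocking operation, then re-contest for it."""
        self._release()
        try:
            yield
        finally:
            self._acquire()


class Node(OwnedObject):
    def __init__(self):
        super().__init__()
        self.calls = 0

    def g(self):
        with self.owned():
            self.calls += 1

    def f(self, other):
        with self.owned():
            with self.blocking():   # ownership is released while the call is pending
                other.g()


def run_cross_calls():
    a, b = Node(), Node()
    t1 = threading.Thread(target=a.f, args=(b,))
    t2 = threading.Thread(target=b.f, args=(a,))
    t1.start()
    t2.start()
    t1.join(timeout=5)
    t2.join(timeout=5)
    still_running = t1.is_alive() or t2.is_alive()
    return a.calls, b.calls, still_running
```

Because each caller releases its own object before waiting on the other, neither `g()` call can be blocked forever by the opposite `f()`; both threads finish.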

14 Adaptation to Dynamic Resources
Programming support for joining/leaving nodes
Decentralized object lookup
– Allows joining nodes to access other objects and join the computation
Node failure ⇒ RMI failure
– Failure is returned as an exception to the method invocation
– The user can catch the exception and perform rollback procedures if necessary
[Figure: a new object on a joining node looks up objects in the computation; an RMI to an object on a failed node raises an exception]
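
The "failure as exception, then rollback" pattern might look like the following in user code. This is a sketch under assumptions: `RemoteFailure`, `FlakyRef`, and `Ledger` are hypothetical names, and the remote reference is a local stand-in whose `fail` flag simulates a dead node.

```python
class RemoteFailure(Exception):
    """Hypothetical exception standing in for a failed RMI."""


class FlakyRef:
    """Local stand-in for a reference to a remote object."""

    def __init__(self, fail):
        self.fail = fail  # True simulates a failed node

    def add(self, amount):
        if self.fail:
            raise RemoteFailure("node unreachable")
        return amount


class Ledger:
    def __init__(self):
        self.sent = 0

    def transfer(self, remote, amount):
        snapshot = self.sent          # remember state before the invocation
        self.sent += amount
        try:
            remote.add(amount)        # the (simulated) RMI
            return True
        except RemoteFailure:
            self.sent = snapshot      # rollback: undo the local update
            return False
```

The caller decides what rollback means; here it is simply restoring a saved value.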

15 Automatic Overlay Construction (1)
– Automatic/transparent communication
– Configuration ONLY for firewalled clusters
– Adapts to dynamic joins/leaves
Nodes cooperatively create a TCP overlay network
– Each node picks a small number of nodes to connect to
– Creates a connected graph
[Figure labels: NAT, firewall, global IP, attempted connection, established connections]

16 Automatic Overlay Construction (2)
NAT clusters
– NAT nodes can connect to global nodes
Firewalled clusters
– Automatic SSH port-forwarding
  User specifies points
Transparent communication
– Point-to-point communication is routed over the network
– Ad-hoc routing protocol AODV [Perkins ‘97]
– Adapts to node joins/leaves
[Figure labels: SSH, firewall traversal, P-to-P communication]

17 Failure Detection on Overlay
How do we detect failures on the overlay?
RMI failure
– Intermediate/end node failure ⇒ link failure
Path pointers
– Forwarding nodes remember the next hop
– The RMI reply is returned the same way
For a link failure along a pointer, back-propagate the failure to the invoker
[Figure labels: path pointer, RMI handler, failure, back-propagate]
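
A small in-process model of path pointers is sketched below. It assumes details the slide does not give: each hop stores both the previous and the next hop per request id, forwarding is a direct method call, and a link failure is reported by calling `link_failed()` on the surviving neighbor. All names (`OverlayNode`, `send`, `deliver`, `link_failed`) are hypothetical.

```python
class OverlayNode:
    def __init__(self, name):
        self.name = name
        self.pointers = {}  # request id -> (previous hop, next hop)
        self.results = {}   # request id -> outcome, filled in at the invoker

    def send(self, req_id, route, prev=None):
        """Pass a request along `route`, leaving a path pointer at every hop."""
        nxt = route[0] if route else None
        self.pointers[req_id] = (prev, nxt)
        if nxt is not None:
            nxt.send(req_id, route[1:], prev=self)

    def deliver(self, req_id, outcome):
        """Return a reply or a failure toward the invoker along the pointers."""
        prev, _ = self.pointers.pop(req_id)
        if prev is None:
            self.results[req_id] = outcome   # this node is the invoker
        else:
            prev.deliver(req_id, outcome)

    def link_failed(self, peer):
        """Called when the connection to `peer` breaks."""
        broken = [r for r, (_, nxt) in self.pointers.items() if nxt is peer]
        for req_id in broken:
            # Every pending request routed through the dead link fails.
            self.deliver(req_id, ("failure", peer.name))


# Usage: request 1 is pending when the link b-c breaks; request 2 completes.
a, b, c, d = (OverlayNode(n) for n in "abcd")
a.send(1, [b, c, d])
b.link_failed(c)
a.send(2, [b, c, d])
d.deliver(2, ("ok", 42))
```

The invoker does not need to monitor the whole path: the node adjacent to the broken link notices it and the failure travels back hop by hop.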

19 Experiment Cluster Settings
900 cores over 9 clusters: hongo, chiba, okubo, suzuk, imade, kototoi, kyoto, istbs, tsubame
[Figure labels: global IPs, firewall, private IPs, all packets dropped]

20 Overlay Construction Simulation
– Evaluates the overlay construction scheme
– For different cluster configurations, varied the number of attempted connections per peer
– 1000 trials for each cluster/attempted-connection configuration
Even for the pathological case, 20 connections per peer are enough
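
A simulation in the spirit of this slide can be sketched as follows. The reachability rule is an assumption made for illustration (a connection attempt succeeds if the target is in the same cluster or has a global IP; firewalled clusters are not modeled), as are the function names and the node representation. It is not the simulator used for the reported results.

```python
import random


def build_overlay(nodes, k, rng):
    """nodes: list of (cluster_id, has_global_ip). Returns undirected edges."""
    n = len(nodes)
    edges = set()
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for j in rng.sample(others, min(k, n - 1)):   # k attempted connections
            same_cluster = nodes[i][0] == nodes[j][0]
            if same_cluster or nodes[j][1]:           # assumed reachability rule
                edges.add((min(i, j), max(i, j)))
    return edges


def is_connected(n, edges):
    """Depth-first search from node 0 over the established connections."""
    adj = {i: [] for i in range(n)}
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    seen, stack = {0}, [0]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == n


def connected_fraction(nodes, k, trials=1000, seed=0):
    """Fraction of trials in which the overlay came out as one connected graph."""
    rng = random.Random(seed)
    ok = sum(is_connected(len(nodes), build_overlay(nodes, k, rng))
             for _ in range(trials))
    return ok / trials
```

Sweeping `k` for a given cluster layout and plotting `connected_fraction` gives the kind of curve the experiment describes.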

21 Dynamic Master-Worker
A master object distributes work to worker objects
– 10,000 tasks altogether
– Tasks as RMIs
Worker nodes join/leave at runtime
– New task for each new node
– Reassignment of tasks on failed nodes
– No tasks were lost during computation
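
The join/failure handling above can be illustrated with a deliberately simplified, sequential master loop. Assumptions: invocations are synchronous local calls rather than asynchronous RMIs, `joins` maps a step number to the workers that join at that step, and `fail_after` makes a stand-in worker fail after a fixed number of jobs. `RemoteFailure`, `Worker`, and `run_master` are hypothetical names.

```python
from collections import deque


class RemoteFailure(Exception):
    """Hypothetical exception standing in for a failed RMI."""


class Worker:
    def __init__(self, name, fail_after=None):
        self.name = name
        self.fail_after = fail_after  # simulate a node that dies after N jobs
        self.jobs = 0

    def do_job(self, task):
        if self.fail_after is not None and self.jobs >= self.fail_after:
            raise RemoteFailure(self.name)
        self.jobs += 1
        return task * 2


def run_master(tasks, workers, joins=None):
    joins = dict(joins or {})
    pending = deque(tasks)
    live = list(workers)
    results = {}
    step = 0
    while pending:
        live.extend(joins.pop(step, []))      # nodes joining at this step
        if not live:
            if not joins:
                raise RuntimeError("no workers left and none will join")
            step += 1
            continue
        worker = live[step % len(live)]       # simple round-robin choice
        task = pending.popleft()
        try:
            results[task] = worker.do_job(task)
        except RemoteFailure:
            live.remove(worker)               # drop the failed node
            pending.append(task)              # reassign its task later
        step += 1
    return results
```

A failed invocation never records a result, so the task goes back to the queue and is completed by another worker.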

22 Dynamic Master-Worker
As the number of workers changes, the number of assigned tasks changes accordingly.
The master adaptively distributes, rolls back, and redistributes tasks.

23 A Real-Life Application
Solving a combinatorial optimization problem
– Permutation Flow Shop Problem
– Parallel branch-and-bound
  Master-Worker style
  Periodic updates
– Work distribution
  Divide the search space evenly into subtasks
Load-balancing
– Unfinished tasks are subdivided and redistributed
– Wasteful computation is quite possible

24 Master-Worker Coordination
Master does RMI to Worker
– Worker: periodic bound exchange with the master
– Not a straightforward Master-Worker application
– Requires a flexible framework like ours
[Figure: Master invokes doJob() on Worker; Worker invokes exchange_bound() on Master]
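
The two-way interaction on this slide (the master calls the worker, and the worker calls back to exchange bounds) can be sketched with plain objects. The method names follow the figure (`do_job`, `exchange_bound`), but everything else is an assumption for illustration: the search space is a toy list of `(lower_bound, cost)` pairs for a minimization problem, and calls are local rather than RMIs.

```python
class Master:
    def __init__(self):
        self.best = float("inf")   # best (lowest) cost known globally

    def exchange_bound(self, worker_best):
        """Merge a worker's bound into the global one and return the result."""
        self.best = min(self.best, worker_best)
        return self.best


class Worker:
    def __init__(self, master, period=2):
        self.master = master
        self.period = period       # exchange bounds every `period` nodes
        self.evaluated = 0

    def do_job(self, nodes):
        """nodes: list of (lower_bound, cost) pairs in this subtask."""
        best = self.master.exchange_bound(float("inf"))
        for i, (lower, cost) in enumerate(nodes, 1):
            if lower < best:                   # otherwise the node is pruned
                self.evaluated += 1
                best = min(best, cost)
            if i % self.period == 0:
                best = self.master.exchange_bound(best)
        return self.master.exchange_bound(best)


# Usage: the second worker prunes with the bound found by the first.
master = Master()
w1, w2 = Worker(master), Worker(master)
r1 = w1.do_job([(1, 10), (2, 7), (8, 9)])
r2 = w2.do_job([(7, 8), (3, 5)])
```

A plain task-farm interface has no place for the `exchange_bound` callback, which is why the slide calls this not a straightforward Master-Worker application.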

25 Runtime Speedup
Lacks scalability with over 900 cores

26 Cumulative Computation Time
Growth in cumulative computation time is attributed to increased re-execution of tasks
If the cumulative computation time is taken into account, the speedup from 169 cores to 948 cores (5.64 times) is 4.94

27 Troubleshoot Search Engine
Ever been stuck debugging or troubleshooting?
Re-rank Google query results and give weight to pages from web forums and solutions
– Natural language processing and machine learning
Parallel computation on a Grid backend
– Real-time response
[Figure: the search engine passes the query “vmware kernel panic” to the Grid backend for computation]

29 Conclusion
A distributed object-oriented programming framework for Grid environments
– A novel distributed object-oriented programming model
– Grid-enabled via automatic overlay construction
Showed that real-life Grid application needs can be addressed by our framework
– Deployed actual parallel applications on over 900 cores over 9 clusters with NAT/firewalls, joins, and failures
– Implemented a Grid computing backend for a troubleshooting search engine

30 Future Work
Reliable WAN communication for the Grid overlays
– Node failure
– Connection failure
Weakness of WAN connections
– Router policies close connections after a given period
– Obscure kernel bugs with NAT
WAN links are more vulnerable, and failures will occur
[Figure labels: connection resets, faulty link]

31 Some Related Work
Robust Tree Topologies for Sensor Networks [England ‘06]
– Create a spanning tree for data reduction
– Flat tree for high reliability
  Fewest hops
– Tree with short distances for low power consumption
  Shortest path
⇒ A spanning tree that merges the two metrics for the best of both worlds
Fewest hops: high reliability, high power usage
Shortest path: low reliability, low power usage

32 Possible Future Direction
Our context: Grid computing
– Communication latency = metric for link reliability
Fewest hops
– Reliability against node failure
Shortest distance
– Reliability against link failure
Can we construct an overlay connection topology that takes the best of both worlds?
[Figure labels: short reliable links, long faulty links]

36 Problems with Grid Computing (2)
Complexity of programming on the Grid
– Low-level computing
  Communication (sockets)
  Multi-threaded computing (synchronization)
  Heavy burden on non-experts
– Flexibility and integration
  Grid frameworks for task distribution
  Independent parallel programming languages
  Computing is not the execution of many independent tasks
  – Need for finer-grained communication
  Bad interface with user applications
  – Java, Ruby, Python, PHP

37 Related Work
Discussed with respect to criteria necessary for modern Grid computing
– Workflow coordination
  Flexibility without putting the burden on the user
– Joining nodes / failure of resources
  Handling these events should not dominate the programming overhead
– Connectivity in wide-area networks
  Adaptation to networks with NAT/firewalls with few manual settings

38 Workflow Coordination (1)
Condor / DAGMan [Thain ‘05]
– “Tasks” are expressed as script files and distributed to idle nodes
– Dependencies between tasks can be expressed as a DAG (Directed Acyclic Graph)
Ibis / Satin [Wrzesinska ‘06]
– Framework for divide-and-conquer problems
– Tasks can be broken into smaller sub-tasks, on which they depend
Limitations:
– Many computations cannot be expressed as “Tasks” with dependencies
– A task’s communication is limited to the tasks on which it depends
[Figure labels: DAG dependency relationship, central manager, busy nodes, assign, cluster, task]

39 Object Synchronization Example
    class A:
        def __init__(self, x):
            self.x = x

        def f(self, b):
            self.x += 1
            # blocking RMI
            b.g()
            self.x -= 1
            return
In method f(), instance a invokes the blocking method g() on object b
– Atomic section: only 1 thread at a time
– Ownership is given up while the RMI blocks
– Value x stays consistent

40 Adaptation to Dynamic Resources
Signal delivery to objects
– Unblocks any thread that is blocking in the object’s context
Can be used to notify asynchronous events
– A joining node
Node failure ⇒ RMI failure
– Failure is returned as an exception to the method invocation
– The user can catch the exception and perform rollback procedures if necessary
[Figure labels: thread, object, block, unblock, signal, exception]

41 Preliminary Experiments
Overlay Construction Simulation
A Simple Master-Worker Application with dynamically joining/leaving nodes
A Real-life Parallel Application on the Grid
A Troubleshoot-Search Engine

