1 Indranil Gupta (Indy) Lecture 4 The Grid. Clouds. January 29, 2009 CS 525 Advanced Distributed Systems Spring 09
2 Two Questions We’ll Try to Answer What is the Grid? Basics, no hype. What is its relation to p2p?
3 Example: Rapid Atmospheric Modeling System, ColoState U Hurricane Georges, 17 days in Sept 1998 –“RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data” –Used 5 km spacing instead of the usual 10 km –Ran on 256+ processors Can one run such a program without access to a supercomputer?
4 Wisconsin MIT NCSA Distributed Computing Resources
5 An Application Coded by a Physicist Job 0 Job 2 Job 1 Job 3 Output files of Job 0 Input to Job 2 Output files of Job 2 Input to Job 3 Jobs 1 and 2 can be concurrent
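The job dependencies on this slide form a small workflow graph. A minimal sketch of how a scheduler might derive which jobs can run concurrently, assuming a dependency map in which only the chain Job 0 -> Job 2 -> Job 3 comes from the slide; the edges touching Job 1 and the names `deps` and `waves` are hypothetical:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# job -> set of jobs whose output files it needs as input.
# Only job0 -> job2 -> job3 is stated on the slide; the edges
# involving job1 are assumptions made for illustration.
deps = {
    "job0": set(),
    "job1": {"job0"},
    "job2": {"job0"},
    "job3": {"job1", "job2"},
}

def waves(deps):
    """Group jobs into waves; jobs in the same wave can run concurrently."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all jobs whose inputs exist
        result.append(ready)
        sorter.done(*ready)                 # pretend the whole wave finished
    return result

print(waves(deps))
```

Under these assumed edges, Jobs 1 and 2 land in the same wave, which matches the slide's remark that they can be concurrent.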
6 An Application Coded by a Physicist Job 2 Output files of Job 0 Input to Job 2 Output files of Job 2 Input to Job 3 May take several hours/days 4 stages of a job Init Stage in Execute Stage out Publish Computation Intensive, so Massively Parallel Several GBs
7 Wisconsin MIT NCSA Job 0 Job 2 Job 1 Job 3
8 Job 0 Job 2 Job 1 Job 3 Wisconsin MIT Condor Protocol NCSA Globus Protocol
9 Job 0 Job 2 Job 1 Job 3 Wisconsin MIT NCSA Globus Protocol Internal structure of different sites invisible to Globus External Allocation & Scheduling Stage in & Stage out of Files
10 Job 0 Job 3 Wisconsin Condor Protocol Internal Allocation & Scheduling Monitoring Distribution and Publishing of Files
11 Tiered Architecture (OSI 7 layer-like) Resource discovery, replication, brokering High energy Physics apps Globus, Condor Workstations, LANs Opportunity for Crossover ideas from p2p systems
12 The Grid Today Some are 40Gbps links! (The TeraGrid links) “A parallel Internet”
13 Globus Alliance Alliance involves U. Illinois Chicago, Argonne National Laboratory, USC-ISI, U. Edinburgh, Swedish Center for Parallel Computers Activities : research, testbeds, software tools, applications Globus Toolkit (latest ver - GT3) “The Globus Toolkit includes software services and libraries for resource monitoring, discovery, and management, plus security and file management. Its latest version, GT3, is the first full-scale implementation of new Open Grid Services Architecture (OGSA).”
14 More Entire community, with multiple conferences, get-togethers (GGF), and projects Grid Projects: Grid Users: –Today: Core is the physics community (since the Grid originates from the GriPhyN project) –Tomorrow: biologists, large-scale computations (nug30 already)?
15 Some Things Grid Researchers Consider Important Single sign-on: collective job set should require once-only user authentication Mapping to local security mechanisms: some sites use Kerberos, others use Unix Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job 1 Community authorization: e.g., third-party authentication
16 Grid History – 1990’s CASA network: linked 4 labs in California and New Mexico –Paul Messina: Massively parallel and vector supercomputers for computational chemistry, climate modeling, etc. Blanca: linked sites in the Midwest –Charlie Catlett, NCSA: multimedia digital libraries and remote visualization More testbeds in Germany & Europe than in the US I-way experiment: linked 11 experimental networks –Tom DeFanti, U. Illinois at Chicago, and Rick Stevens, ANL: for a week in Nov 1995, a national high-speed network infrastructure with 60 application demonstrations, from distributed computing to virtual reality collaboration. I-Soft: secure sign-on, etc.
17 Trends: Technology Doubling Periods – storage: 12 mos, bandwidth: 9 mos, and (what law is this?) cpu speed: 18 mos Then and Now Bandwidth –1985: mostly 56Kbps links nationwide –2004: 155 Mbps links widespread Disk capacity –Today’s PCs have 100GBs, same as a 1990 supercomputer
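The doubling periods above translate into very different growth rates over the same time span. A small worked example, assuming simple exponential growth (a factor of 2 per doubling period); the five-year horizon and the names `growth_factor` and `DOUBLING_MONTHS` are illustrative choices, not from the slide:

```python
def growth_factor(months, doubling_period_months):
    """Multiplicative growth after `months`, doubling once per period."""
    return 2 ** (months / doubling_period_months)

# Doubling periods quoted on the slide, in months.
DOUBLING_MONTHS = {"storage": 12, "bandwidth": 9, "cpu": 18}

horizon = 60  # five years, an arbitrary horizon chosen for illustration
factors = {name: growth_factor(horizon, period)
           for name, period in DOUBLING_MONTHS.items()}

for name, factor in factors.items():
    print(f"{name}: about {factor:.0f}x in {horizon} months")
```

Over five years this gives 32x for storage, roughly 100x for bandwidth, and roughly 10x for CPU speed, which is one way to see why moving data becomes relatively cheaper than computing on it.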
18 Trends: Users Then and Now Biologists: –1990: were running small single-molecule simulations –2004: want to calculate structures of complex macromolecules, want to screen thousands of drug candidates Physicists –2006: CERN’s Large Hadron Collider produced 10^15 B/year Trends in Technology and User Requirements: Independent or Symbiotic?
19 Prophecies In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”. Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Communicate Application –[Will this be a reality with the Grid?]
20 “We must address scale & failure” “We need infrastructure” P2P Grid
21 Definitions Grid P2P “Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” (1998) “A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS” (2002) “Applications that takes advantage of resources at the edges of the Internet” (2000) “Decentralized, self-organizing distributed systems, in which all or most communication is symmetric” (2002)
22 Definitions Grid P2P “Infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities” (1998) “A system that coordinates resources not subject to centralized control, using open, general-purpose protocols to deliver nontrivial QoS” (2002) “Applications that takes advantage of resources at the edges of the Internet” (2000) “Decentralized, self-organizing distributed systems, in which all or most communication is symmetric” (2002) 525: (good legal applications without intellectual fodder) 525: (clever designs without good, legal applications)
23 Grid versus P2P - Pick your favorite
24 Applications Grid Often complex & involving various combinations of –Data manipulation –Computation –Tele-instrumentation Wide range of computational models, e.g. –Embarrassingly || –Tightly coupled –Workflow Consequence –Complexity often inherent in the application itself P2P Some –File sharing –Number crunching –Content distribution –Measurements Legal Applications? Consequence –Low Complexity
26 Scale and Failure P2P Very large numbers of entities Moderate activity –E.g., 1-2 TB in Gnutella (’01) Diverse approaches to failure –Centralized (SETI) –Decentralized and Self-Stabilizing As of 2/19/’03: FastTrack 4,277,745; iMesh 1,398,532; eDonkey 500,289; DirectConnect 111,454; Blubster 100,266; FileNavigator 14,400; Ares 7,731 Grid Moderate number of entities –10s institutions, 1000s users Large amounts of activity –4.5 TB/day (D0 experiment) Approaches to failure reflect assumptions –E.g., centralized components
28 Services and Infrastructure Grid Standard protocols (Global Grid Forum, etc.) De facto standard software (open source Globus Toolkit) Shared infrastructure (authentication, discovery, resource access, etc.) Consequences Reusable services Large developer & user communities Interoperability & code reuse P2P Each application defines & deploys completely independent “infrastructure” JXTA, BOINC, XtremWeb? Efforts started to define common APIs, albeit with limited scope to date Consequences New (albeit simple) install per application Interoperability & code reuse not achieved
30 Coolness Factor Grid P2P
32 Summary: Grid and P2P 1) Both are concerned with the same general problem –Resource sharing within virtual communities 2) Both take the same general approach –Creation of overlays that need not correspond in structure to underlying organizational structures 3) Each has made genuine technical advances, but in complementary directions –“Grid addresses infrastructure but not yet scale and failure” –“P2P addresses scale and failure but not yet infrastructure” 4) Complementary strengths and weaknesses => room for collaboration (Ian Foster at UChicago)
33 Crossover Ideas Some P2P ideas useful in the Grid –Resource discovery (DHTs), e.g., how do you make “filenames” more expressive, i.e., a computer cluster resource? –Replication models, for fault-tolerance, security, reliability –Membership, i.e., which workstations are currently available? –Churn-Resistance, i.e., users log in and out; problem difficult since a free host gets an entire computation, not just small files All above are open research directions, waiting to be explored!
34 Cloud Computing What’s it all about? A First Step
35 Life of Ra (a Research Area) [Plot: popularity of area vs. time] Hype: “Wow!” First peak, end of hype: “This is a hot area!” First trough: “I told you so!” Stages: Young (low-hanging fruits), Adolescent (interesting problems), Middle Age (solid base, hybrid algorithms), Old Age (incremental solutions) Where is Grid? Where is cloud computing?
36 How do I identify what stage a research area is in? 1. If the research area has no publications more than 1-2 years old, it is in the “Young” phase. 2. Pick a paper published in the research area in the last year. Read it. If you think that you could have come up with the core idea in that paper (given all the background etc.), then the research area is in its “Young” phase. 3. Find the latest published paper that you think you could have come up with the idea for. If this paper has been cited by one round of papers (but these citing papers themselves have not been cited), then the research area is in the “Adolescent” phase. 4. Do Step 3 above, and if you find that the citing papers themselves have been cited, and so on, then the research area is at least in the “Middle Age” phase. 5. Pick a paper from the last 1-2 years. If you find that there are only incremental developments in these latest published papers, and the ideas may be innovative but are not yielding large enough performance benefits, then the area is mature. 6. If no one works in the research area, or everyone you talk to thinks negatively about the area (except perhaps the inventors of the area), then the area is dead.
37 What is a cloud? It’s a cluster! It’s a supercomputer! It’s a datastore! It’s superman! None of the above Cloud = Lots of storage + compute cycles nearby
38 Data-intensive Computing Computation-Intensive Computing –Example areas: MPI-based, High-performance computing, Grids –Typically run on supercomputers (e.g., NCSA Blue Waters) Data-Intensive –Typically store data at datacenters –Use compute nodes nearby –Compute nodes run computation services In data-intensive computing, the focus shifts from computation to the data: problem areas include –Storage –Communication bottleneck –Moving tasks to data (rather than vice-versa) –Security –Availability of Data –Scalability
39 Distributed Clouds A single-site cloud consists of –Compute nodes (split into racks) –Switches, connecting the racks –Storage (backend) nodes connected to the network –Front-end for submitting jobs –Services: physical resource set, software services A geographically distributed cloud consists of –Multiple such sites –Each site perhaps with a different structure and services
40 Cirrus Cloud at University of Illinois [Figure components: head node (2 ports); four storage nodes; two ProCurve switches (8 ports); four internal switches, each with 32 DL160 nodes. Only the internal switches used for data transfers are shown; they are 1GbE with 48 ports.] Note: System management, monitoring, and operator console will use a different set of switches not pictured here.
41 Example: Cirrus Cloud at U. Illinois 128 servers. Each has –8 cores (total 1024 cores) –16 GB RAM –2 TB disk Backing store of about 250 TB Total storage: 0.5 PB Gigabit Networking
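The aggregate figures on this slide can be checked with a few lines of arithmetic. This sketch assumes that "total storage" means the sum of the per-server disks and the backing store; the slide does not say how the 0.5 PB figure is composed, and the variable names are illustrative:

```python
servers = 128
cores_per_server = 8
disk_tb_per_server = 2
backing_store_tb = 250  # "about 250 TB" on the slide

total_cores = servers * cores_per_server        # matches the 1024 cores quoted
local_disk_tb = servers * disk_tb_per_server    # disks attached to compute nodes
# Assumption: total storage = local disks + backing store.
total_storage_tb = local_disk_tb + backing_store_tb

print(total_cores, "cores")
print(total_storage_tb, "TB, i.e. about", round(total_storage_tb / 1000, 1), "PB")
```

Under that assumption the total comes to 506 TB, consistent with the quoted 0.5 PB.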
42 6 Diverse Sites within Cirrus I.UIUC – Systems Research for Cloud Computing + Cloud Computing Applications II.Karlsruhe Institute of Tech (KIT, Germany): Grid-style jobs III.IDA, Singapore IV.Intel V.HP VI.Yahoo!: CMU’s M45 cluster All will be networked together: see
43 What “Services”? Different Clouds Export different services Industrial Clouds –Amazon S3 (Simple Storage Service): store arbitrary datasets –Amazon EC2 (Elastic Compute Cloud): upload and run arbitrary images –Google AppEngine: develop applications within their appengine framework, upload data that will be imported into their format, and run Academic Clouds –Google-IBM Cloud (U. Washington): run apps programmed atop Hadoop –Cirrus cloud: run (i) apps programmed atop Hadoop and Pig, and (ii) systems-level research on this first generation of cloud computing models
44 Software “Services” Computational –MapReduce (Hadoop) –Pig Latin Naming and Management –Zookeeper –Tivoli, OpenView Storage –HDFS –PNUTS
45 Sample Service: MapReduce Google uses MapReduce to run 100K jobs per day, processing up to 20 PB of data Yahoo! has released open-source software Hadoop that implements MapReduce Other companies that have used MapReduce to process their data: A9.com, AOL, Facebook, The New York Times Highly-Parallel Data-Processing
46 What is MapReduce? Terms are borrowed from functional languages (e.g., Lisp) Sum of squares: (map square '(1 2 3 4)) –Output: (1 4 9 16) [processes each record sequentially and independently] (reduce + '(1 4 9 16)) –(+ 16 (+ 9 (+ 4 1) ) ) –Output: 30 [processes set of all records in a batch]
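The same sum-of-squares computation can be written with Python's functional primitives. This is a sketch that assumes the input list is (1 2 3 4), which is consistent with the expansion (+ 16 (+ 9 (+ 4 1))) shown on the slide:

```python
from functools import reduce
from operator import add

def square(x):
    return x * x

records = [1, 2, 3, 4]  # assumed input, consistent with the slide's expansion

# map: each record is processed independently of the others
squares = list(map(square, records))

# reduce: the whole set of intermediate values is folded into one result
total = reduce(add, squares)

print(squares, total)
```

Python's `reduce` folds from the left, ((1 + 4) + 9) + 16, whereas the slide nests from the right; for addition the result is the same.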
47 Map Process individual key/value pair to generate intermediate key/value pairs. Input: “Welcome Everyone”, “Hello Everyone” Map output: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1)
48 Reduce Processes and merges all intermediate values associated with each given key assigned to it Reduce input: (Welcome, 1) (Everyone, 1) (Hello, 1) (Everyone, 1) Reduce output: (Everyone, 2) (Hello, 1) (Welcome, 1)
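The word-count example on the Map and Reduce slides can be sketched in a single process as follows. The function names `map_fn`, `reduce_fn`, and `run_word_count` are illustrative, and the in-memory grouping stands in for the distributed shuffle that a real system such as Hadoop performs:

```python
from collections import defaultdict

def map_fn(line):
    """Map: turn one input record into intermediate (word, 1) pairs."""
    return [(word, 1) for word in line.split()]

def reduce_fn(key, values):
    """Reduce: merge all intermediate values for one key."""
    return key, sum(values)

def run_word_count(lines):
    groups = defaultdict(list)
    for line in lines:                      # map phase
        for key, value in map_fn(line):
            groups[key].append(value)       # group values by key ("shuffle")
    # reduce phase: one call per distinct key
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

counts = run_word_count(["Welcome Everyone", "Hello Everyone"])
print(counts)
```

The result is Everyone 2, Hello 1, Welcome 1, as on the Reduce slide.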
49 Some Applications Distributed Grep: –Map - Emits a line if it matches the supplied pattern –Reduce - Copies the intermediate data to output Count of URL access frequency –Map – Processes web log and outputs <URL, 1> –Reduce - Emits <URL, total count> Reverse Web-Link Graph –Map – Processes web log and outputs <target, source> –Reduce - Emits <target, list(source)>
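Distributed grep is the simplest of these applications, since Reduce is just the identity. A single-process sketch, with hypothetical function names and a made-up three-line log used only for illustration:

```python
import re

def grep_map(line, pattern):
    """Map: emit the line (as a key with no value) only if it matches."""
    return [(line, None)] if re.search(pattern, line) else []

def grep_reduce(intermediate):
    """Reduce: identity, copying the intermediate data to the output."""
    return [line for line, _ in intermediate]

def distributed_grep(lines, pattern):
    intermediate = []
    for line in lines:  # each call is independent, so it could run in parallel
        intermediate.extend(grep_map(line, pattern))
    return grep_reduce(intermediate)

log = ["GET /index.html", "POST /login", "GET /about.html"]  # made-up input
matches = distributed_grep(log, r"^GET ")
print(matches)
```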
50 Programming MapReduce Externally: For user 1.Write a Map program (short), write a Reduce program (short) 2.Submit job; wait for result 3.Need to know nothing about parallel/distributed programming! Internally: For the cloud (and for us distributed systems researchers) 1.Parallelize Map 2.Transfer data from Map to Reduce 3.Parallelize Reduce 4.Implement Storage for Map input, Map output, Reduce input, and Reduce output
51 Inside MapReduce For the cloud (and for us distributed systems researchers) 1.Parallelize Map: easy! each map job is independent of the other! 2.Transfer data from Map to Reduce: all Map output records with the same key are assigned to the same Reduce task, using a partitioning function (more soon) 3.Parallelize Reduce: easy! each reduce job is independent of the other! 4.Implement Storage for Map input, Map output, Reduce input, and Reduce output Map input: from distributed file system Map output: to local disk (at Map node); uses local file system Reduce input: from (multiple) remote disks; uses local file systems Reduce output: to distributed file system local file system = Linux FS, etc. distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)
52 Internal Workings of MapReduce
53 Flow of Data Input slices are typically 16MB to 64MB. Map workers use a partitioning function to store intermediate key/value pairs to the local disk. –e.g., Hash (key) mod R [Figure: map workers, partitioning, reduce workers, output files]
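The partitioning function Hash(key) mod R can be sketched as follows. The use of CRC32 as the hash is an assumption made for this example (Python's built-in `hash()` on strings is randomized per process, so it would not agree across workers); the names `partition` and `bucket` are illustrative:

```python
import zlib

def partition(key, R):
    """Hash(key) mod R, with a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % R

def bucket(pairs, R):
    """Split a map worker's intermediate pairs into R regions, one per reduce task."""
    buckets = [[] for _ in range(R)]
    for key, value in pairs:
        buckets[partition(key, R)].append((key, value))
    return buckets

pairs = [("Welcome", 1), ("Everyone", 1), ("Hello", 1), ("Everyone", 1)]
for index, region in enumerate(bucket(pairs, R=3)):
    print(index, region)
```

Because the partition depends only on the key, both ("Everyone", 1) pairs land in the same region no matter which map worker produced them, which is what lets a single reduce task see all values for a key.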
54 Fault Tolerance Worker Failure –Master keeps 3 states for each worker task (idle, in-progress, completed) –Master sends periodic pings to each worker to keep track of it (central failure detector) If a worker fails while a task is in-progress, mark the task as idle If a map worker fails after its task completed, also mark the task as idle (its output was on the failed node's local disk) Notify the reduce tasks about the map worker failure Master Failure –Checkpoint
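The master's bookkeeping for worker failures can be sketched as a small state table. This is an illustrative model, not Hadoop's or Google's implementation; the class and method names are hypothetical, and ping timeouts are abstracted into a direct call to `worker_failed`:

```python
from enum import Enum

class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"

class Master:
    def __init__(self, map_tasks, reduce_tasks):
        self.kind = {t: "map" for t in map_tasks}
        self.kind.update({t: "reduce" for t in reduce_tasks})
        self.state = {t: State.IDLE for t in self.kind}
        self.worker = {}  # task -> worker it was assigned to

    def assign(self, task, worker):
        self.state[task] = State.IN_PROGRESS
        self.worker[task] = worker

    def complete(self, task):
        self.state[task] = State.COMPLETED

    def worker_failed(self, worker):
        """Called when pings to `worker` time out; returns tasks reset to idle."""
        reset = []
        for task, owner in list(self.worker.items()):
            if owner != worker:
                continue
            in_progress = self.state[task] is State.IN_PROGRESS
            # Completed map output lived on the failed node's local disk,
            # so it is lost; completed reduce output is in the distributed FS.
            lost_map_output = (self.kind[task] == "map"
                               and self.state[task] is State.COMPLETED)
            if in_progress or lost_map_output:
                self.state[task] = State.IDLE
                del self.worker[task]
                reset.append(task)
        return sorted(reset)
```

Tasks returned by `worker_failed` become eligible for rescheduling; a fuller model would also notify reduce tasks that were reading from the failed map worker.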
55 Locality and Backup tasks Locality –Since cloud has hierarchical topology –GFS stores 3 replicas of each of 64MB chunks Maybe on different racks –Attempt to schedule a map task on a machine that contains a replica of corresponding input data: why? Stragglers (slow nodes) –Due to Bad Disk, Network Bandwidth, CPU, or Memory. –Perform backup (replicated) execution of straggler task: task done when first replica complete
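The effect of backup tasks on job completion time can be illustrated with a toy model. All durations below are made-up numbers, and the model assumes a job finishes when its slowest task finishes and a task finishes when its first replica completes:

```python
def finish_time(primary, backup=None):
    """A task is done when its first replica completes.

    primary: finish time of the original execution.
    backup:  optional (start_time, duration) of a replicated execution.
    """
    if backup is None:
        return primary
    start, duration = backup
    return min(primary, start + duration)

def job_completion(tasks):
    """The job is done when its slowest task is done."""
    return max(finish_time(primary, backup) for primary, backup in tasks)

# Hypothetical task finish times in seconds; the last task is a straggler.
without_backup = [(10, None), (11, None), (9, None), (60, None)]
# Same tasks, but a backup copy of the straggler starts at t=15 and takes 10 s.
with_backup = [(10, None), (11, None), (9, None), (60, (15, 10))]

print(job_completion(without_backup), job_completion(with_backup))
```

In this toy case a single backup execution cuts completion time from 60 s to 25 s, at the cost of one extra task's worth of resources.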
56 Grep Locality optimization helps: 1800 machines read 1 TB at peak ~31 GB/s W/out this, rack switches would limit to 10 GB/s Startup overhead is significant for short jobs Workload: 10^10 100-byte records to extract records matching a rare pattern (92K matching records) Testbed: 1800 servers each with 4GB RAM, dual 2GHz Xeon, dual 160 GB IDE disks, 100 Gbps aggregate bandwidth, Gigabit Ethernet per machine
57 Sort [Plots: normal execution; no backup tasks; 200 processes killed] Backup tasks reduce job completion time a lot! System deals well with failures M = R = 4000 Workload: 10^10 100-byte records (modeled after TeraSort benchmark)
58 Discussion Points Storage: Is the local write-remote read model good for Map output/Reduce input? –What happens on node failure? Entire Reduce phase needs to wait for all Map tasks to finish –Why? What is the disadvantage? What are the other issues related to our challenges: –Storage –Communication bottleneck –Moving tasks to data (rather than vice-versa) –Security –Availability of Data –Scalability –Locality: within clouds, or across them –Inter-cloud/multi-cloud computations –Other Programming Models? Based on MapReduce Beyond MapReduce-based ones Concern: Do clouds run the risk of going the Grid way?
59 P2P and Clouds/Grid Opportunity to use p2p design techniques, principles, and algorithms in cloud computing Cloud computing vs. Grid computing: what are the differences?
60 Prophecies In 1965, MIT's Fernando Corbató and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company”. Plug your thin client into the computing Utility and Play your favorite Intensive Compute & Storage & Communicate Application –[Will this be a reality with the Grid and Clouds?] Are we there yet? Are we going towards it?
61 Administrative Announcements Student-led paper presentations (see instructions on website) Start from February 12th Groups of up to 2 students each class, responsible for a set of 3 “Main Papers” on a topic –45 minute presentations (total) followed by discussion –Set up appointment with me to show slides by 5 pm day prior to presentation List of papers is up on the website Each of the other students (non-presenters) is expected to read the papers before class and turn in a one to two page review of any two of the main set of papers (summary, comments, criticisms and possible future directions)
62 Announcements (contd.) Presentation Deadline: form groups by midnight of January 31 by dropping by my office hours (10.45 am – 12 pm, Tu, Th in 3112 SC) –Hurry! Some interesting topics are already taken! –I can help you find partners Use course newsgroup for forming groups and discussion: class.cs525
63 Announcements (contd.) Projects Groups of 2 (need not be same as presentation groups) We’ll start detailed discussions “soon” (a few classes into the student-led presentations) Please turn in filled-out “Student Infosheets” today or next lecture.
64 Next week No lecture Tuesday February 3 (no office hours either) Thursday (February 5) lecture: read Basic Distributed Computing Concepts papers
65 Backup Slides
66 Example: Rapid Atmospheric Modeling System, ColoState U Weather Prediction is inaccurate Hurricane Georges, 17 days in Sept 1998
68 Next Week Onwards Student-led presentations start –Organization of presentation is up to you –Suggested: describe background and motivation for the session topic, present an example or two, then get into the paper topics Reviews: You have to submit both an electronic copy (which will appear on the course website) and a hardcopy (on which I will give you feedback). See website for detailed instructions. –1-2 pages only, 2 papers only
69 Refinements and Extensions Local Execution –For debugging purpose –Users have control on specific Map tasks Status Information –Master runs an HTTP server –Status page shows the status of computation –Link to output file –Standard Error list
70 Refinements and Extensions Combiner Function –User defined –Done within map task. –Save network bandwidth. Skipping Bad records –Best solution is to debug & fix Not always possible ~ third-party source libraries –On segmentation fault: Send UDP packet to master from signal handler Include sequence number of record being processed –If master sees two failures for same record: Next worker is told to skip the record
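A combiner pre-aggregates a map task's output before it crosses the network. A minimal sketch for the word-count case, assuming the combiner uses the same summing logic as the reducer (which is valid because addition is associative and commutative); the function names and the sample line are illustrative:

```python
from collections import Counter

def map_fn(line):
    """Map: emit (word, 1) for every word in the record."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: runs inside the map task and merges pairs with the same key."""
    partial = Counter()
    for key, value in pairs:
        partial[key] += value
    return sorted(partial.items())

raw = map_fn("to be or not to be")
combined = combine(raw)

print(len(raw), "pairs before combining,", len(combined), "after")
print(combined)
```

Here six intermediate pairs shrink to four, and the saving grows with the amount of key repetition within a single map task's input.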