Cloud Computing Systems
Hadoop, HDFS, and Microsoft Cloud Computing Technologies
Lin Gu, Hong Kong University of Science and Technology
Oct. 3, 2011
The Microsoft Cloud — Categories of Services: Application Services, Software Services, Platform Services, Infrastructure Services
[Application patterns diagram: users (web browsers, mobile browsers, Silverlight and WPF applications) and enterprise private-cloud systems (enterprise data, web services, applications, identity) interact with public cloud services—ASP.NET web roles, web-service roles, and worker-role jobs—backed by the Table, Blob, and Queue storage services, Service Bus, Access Control Service, and Workflow Service; the patterns span user data, application data, reference data, and grid/parallel-computing applications, plus data, storage, identity, and application services.]
Hadoop—History
Started in 2005 by Doug Cutting
Yahoo! became the primary contributor in 2006
– Scaled it to 4000-node clusters in 2009
Yahoo! deployed large-scale science clusters in 2007
Many users today
– Amazon/A9, Facebook, Google, IBM, Joost, Last.fm, New York Times, PowerSet, Veoh
Hadoop at Facebook
Production cluster (July 2009): 8000 cores, 1000 machines, 32 GB of RAM per machine
– 4 SATA disks of 1 TB each per machine
– 2-level network hierarchy, 40 machines per rack
– Total cluster size is 2 PB (projected to be 12 PB in Q3 2009)
Another test cluster has 800 cores, with 16 GB of RAM per machine
Source: Dhruba Borthakur
Hadoop—Motivation
Need a general infrastructure for fault-tolerant, data-parallel distributed processing
Open-source implementation of MapReduce
– Apache License
Workloads are expected to be I/O-bound, not CPU-bound
First, a file system is needed—HDFS
A very large distributed file system running on commodity hardware
– Replicated
– Detects failures and recovers from them
Optimized for batch processing
– High aggregate bandwidth, locality-aware
User-space FS, runs on heterogeneous OSes
HDFS architecture
A Client sends a filename to the NameNode (1), receives the block IDs and the DataNodes holding them (2), and then reads the data directly from those DataNodes (3).
NameNode: manages metadata
DataNode: manages file data—maps a block ID to a physical location on disk
Secondary NameNode: fault tolerance—periodically merges the transaction log
HDFS
Provides a single namespace for the entire cluster
– Files, directories, and their hierarchy
Files are broken up into large blocks
– Typically 128 MB block size
– Each block is replicated on multiple DataNodes
Metadata is kept in memory
– Metadata: names of files (including directories), a list of blocks for each file, a list of DataNodes for each block, and file attributes (e.g., creation time, replication factor)
– High performance (high throughput, low latency)
A transaction log records file creations, file deletions, etc.
Data coherency: emphasizes the append operation
A client can
– find the locations of blocks
– access data directly from a DataNode
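To make the read path concrete, here is a minimal sketch (not from the original slides) using the HDFS Java API in org.apache.hadoop.fs; the file path is a made-up example. It first asks the NameNode for the block locations of a file, then streams the bytes, which come directly from the DataNodes rather than through the NameNode.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.*;

public class HdfsReadSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);               // handle to the cluster's file system
        Path file = new Path("/user/demo/input.txt");       // hypothetical file

        // Metadata from the NameNode: the blocks of the file and the DataNodes holding each block.
        FileStatus status = fs.getFileStatus(file);
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("block at offset " + block.getOffset()
                    + " stored on " + String.join(",", block.getHosts()));
        }

        // The actual bytes are read directly from the DataNodes, not through the NameNode.
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
        }
    }
}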
Hadoop—Design
Hadoop Core
– Distributed File System—distributes data
– Map/Reduce—distributes logic (processing)
– Written in Java
– Runs on Linux, Mac OS X, Windows, and Solaris
Fault tolerance
– In a large cluster, failure is the norm
– Hadoop re-executes failed tasks
Locality
– Map and Reduce in Hadoop query HDFS for the locations of data
– Map tasks are scheduled close to their inputs when possible
Hadoop Ecosystem
Hadoop Core
– Distributed File System
– MapReduce Framework
Pig (initiated by Yahoo!)
– Parallel programming language and runtime
HBase (initiated by Powerset)
– Table storage for semi-structured data
ZooKeeper (initiated by Yahoo!)
– Coordinating distributed systems
Storm
Hive (initiated by Facebook)
– SQL-like query language and storage
Word Count Example
Read text files and count how often words occur.
– The input is a collection of text files
– The output is a text file; each line contains: word, tab, count
Map: produce (word, count) pairs
Reduce: for each word, sum up the counts
WordCount Overview

public class WordCount {

  public static class Map extends MapReduceBase implements Mapper... {
    public void map(...) { ... }
  }

  public static class Reduce extends MapReduceBase implements Reducer... {
    public void reduce(...) { ... }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    // ... job configuration elided; see the sketch after the Reducer below ...
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}
WordCount Mapper

public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);   // emit (word, 1)
    }
  }
}
WordCount Reducer

public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));   // emit (word, total count)
  }
}
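The overview slide elides the job configuration. For reference, here is a sketch of the full main() body, mirroring the classic org.apache.hadoop.mapred WordCount example; note the combiner reuses the reducer to pre-sum counts on the map side.

public static void main(String[] args) throws Exception {
  JobConf conf = new JobConf(WordCount.class);
  conf.setJobName("wordcount");
  conf.setOutputKeyClass(Text.class);            // key/value types emitted by map and reduce
  conf.setOutputValueClass(IntWritable.class);
  conf.setMapperClass(Map.class);
  conf.setCombinerClass(Reduce.class);           // combiner: pre-sums counts on the map side
  conf.setReducerClass(Reduce.class);
  conf.setInputFormat(TextInputFormat.class);
  conf.setOutputFormat(TextOutputFormat.class);
  FileInputFormat.setInputPaths(conf, new Path(args[0]));
  FileOutputFormat.setOutputPath(conf, new Path(args[1]));
  JobClient.runJob(conf);
}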
Invocation of wordcount
1. /usr/local/bin/hadoop dfs -mkdir <hdfs-dir>
2. /usr/local/bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
3. /usr/local/bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
Example Hadoop Application: Search Assist™
The database for Search Assist™ is built using Hadoop: 3 years of log data, a 20-step map-reduce pipeline.

                     Before Hadoop    After Hadoop
Time                 26 days          20 minutes
Language             C++              Python
Development time     2–3 weeks        2–3 days
Large Hadoop Jobs
Webmap: ~70 hours runtime, ~300 TB shuffling, ~200 TB output on 1480 nodes → ~73 hours runtime, ~490 TB shuffling, ~280 TB output on 2500 nodes
Sort benchmarks (Jim Gray contest): 1 TB sorted in 209 seconds (900 nodes); 1 TB sorted in 62 seconds (1500 nodes); 1 PB sorted in a matter of hours (3700 nodes)
Largest cluster: 2000 nodes, 6 PB raw disk, 16 TB RAM, 16K CPUs → 4000 nodes, 16 PB raw disk, 64 TB RAM, 32K CPUs (40% faster CPUs too)
Source: Eric Baldeschwieler, Yahoo!
Data Warehousing at Facebook
Pipeline: Web Servers → Scribe Servers → Network Storage → Hadoop Cluster → Oracle RAC / MySQL
– 15 TB of uncompressed data ingested per day
– 55 TB of compressed data scanned per day
– jobs on the production cluster per day
– 80M compute minutes per day
Source: Dhruba Borthakur
But all these are data analytics applications. Can the approach extend to general computation? How do we construct a simple, generic, and automatic parallelization engine for the cloud? Let's look at an example...
Tomasulo's Algorithm
Designed initially for the IBM 360/91
– Out-of-order execution
Its descendants include: Alpha 21264, HP PA-8000, MIPS R10000, Pentium III, PowerPC 604, …
Three Stages of the Tomasulo Algorithm
1. Issue—get an instruction from a queue
– Record the instruction's information in the processor's internal control, and rename registers
2. Execute—operate on operands (EX)
– When all operands are ready, execute; otherwise, watch the Common Data Bus (CDB) for the result
3. Write result—finish execution (WB)
– Write the result to the CDB; all awaiting units receive the result
Tomasulo organization (block diagram): an FP op queue and FP registers feed the reservation stations—Add1–Add3 in front of the FP adders and Mult1–Mult2 in front of the FP multipliers—while load buffers (Load1–Load6) bring data from memory and store buffers send data to memory; all units are connected by the Common Data Bus (CDB).
How does Tomasulo exploit parallelism?
Naming and renaming
– Keep track of data dependences and resolve conflicts by renaming registers
Reservation stations
– Record instructions' control information and the values of operands; data has versions
In Tomasulo, data drives the logic: when the data is ready, execute!
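As an illustration of this data-driven rule (a sketch invented for this note, not real hardware), the Java program below models two reservation stations whose operands arrive as tagged broadcasts on a simulated Common Data Bus; each station fires as soon as both operands have arrived, regardless of program order, and broadcasts its own result under its tag.

import java.util.*;

class TomasuloSketch {
    // A reservation-station entry: an operation waiting for two tagged operands.
    static class Station {
        final String op, tagA, tagB;   // producer tags this entry is waiting for
        Double a, b;                   // operand values; null = not yet arrived
        Station(String op, String tagA, String tagB) { this.op = op; this.tagA = tagA; this.tagB = tagB; }
        boolean ready() { return a != null && b != null; }
    }

    public static void main(String[] args) {
        // mul1 waits for the results tagged load1 and load2; add1 waits for mul1 and load3.
        Map<String, Station> stations = new LinkedHashMap<>();
        stations.put("mul1", new Station("*", "load1", "load2"));
        stations.put("add1", new Station("+", "mul1", "load3"));

        // The Common Data Bus broadcasts (tag, value) results; arrival order, not program order.
        Deque<Map.Entry<String, Double>> cdb = new ArrayDeque<>();
        cdb.add(Map.entry("load2", 4.0));
        cdb.add(Map.entry("load3", 1.0));
        cdb.add(Map.entry("load1", 2.5));

        while (!cdb.isEmpty()) {
            Map.Entry<String, Double> result = cdb.poll();
            Iterator<Map.Entry<String, Station>> it = stations.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<String, Station> e = it.next();
                Station s = e.getValue();
                if (result.getKey().equals(s.tagA)) s.a = result.getValue();
                if (result.getKey().equals(s.tagB)) s.b = result.getValue();
                if (s.ready()) {                               // data-driven: fire as soon as operands arrive
                    double v = s.op.equals("*") ? s.a * s.b : s.a + s.b;
                    System.out.println(e.getKey() + " fires -> " + v);
                    cdb.add(Map.entry(e.getKey(), v));         // broadcast its own result on the CDB
                    it.remove();                               // the station is now free
                }
            }
        }
    }
}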
Dryad Distributed/parallel execution – Improve throughput, not latency – Automatic management of scheduling, distribution, fault tolerance, and parallelization! Computations are expressed as a DAG – Directed Acyclic Graph: vertices are computations, edges are communication channels – Each vertex has several input and output edges
Why use a dataflow graph?
A general abstraction of computation
The programmer may not have to know how to construct the graph
– "SQL-like" queries: LINQ
Can all computation be represented by a finite graph?
Yet Another WordCount, in Dryad (DAG): Count → Distribute → MergeSort → Count, with each edge carrying (word, n) pairs.
Organization
Job as a DAG (Directed Acyclic Graph): processing vertices connected by channels (file, pipe, or shared memory), with input vertices at the top and output vertices at the bottom.
Scheduling at the JM (Job Manager)
A vertex can run on any computer once all its inputs are ready
– Prefers executing a vertex near its inputs (locality)
Fault tolerance
– If a task fails, run it again
– If a task's inputs are gone, run the upstream vertices again (recursively)
– If a task is slow, run another copy elsewhere and use the output from the faster computation
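A minimal sketch of this scheduling rule follows (invented names, not the Dryad Job Manager): every vertex whose inputs are all complete is dispatched, and a vertex that "fails" simply stays runnable and is tried again.

import java.util.*;

class DagSchedulerSketch {
    static class Vertex {
        final String name;
        final List<Vertex> inputs;     // upstream vertices whose output this vertex consumes
        boolean done = false;
        Vertex(String name, Vertex... inputs) { this.name = name; this.inputs = List.of(inputs); }
        boolean runnable() {
            if (done) return false;
            for (Vertex in : inputs) if (!in.done) return false;   // all inputs must be ready
            return true;
        }
        void run() {
            done = Math.random() > 0.2;                            // pretend execution; may fail
            System.out.println(name + (done ? " completed" : " failed, will retry"));
        }
    }

    public static void main(String[] args) {
        // A tiny job: two inputs feed two "map" vertices, which feed one "merge" vertex.
        Vertex in1 = new Vertex("input-1"), in2 = new Vertex("input-2");
        Vertex map1 = new Vertex("map-1", in1), map2 = new Vertex("map-2", in2);
        Vertex merge = new Vertex("merge", map1, map2);
        List<Vertex> dag = List.of(in1, in2, map1, map2, merge);

        // Keep dispatching runnable vertices until every vertex has completed.
        while (dag.stream().anyMatch(v -> !v.done)) {
            for (Vertex v : dag) if (v.runnable()) v.run();        // failed vertices stay runnable
        }
        System.out.println("job complete");
    }
}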
Distributed Data-Parallel Computing Research problem: How to write distributed data-parallel programs for a compute cluster? The DryadLINQ programming model – Sequential, single machine programming abstraction – Same program runs on single-core, multi-core, or cluster – Familiar programming languages – Familiar development environment
LINQ
LINQ: a language for relational queries
– Language INtegrated Query
– More general than distributed SQL
– Inherits the flexible C# type system and libraries
– Available in Visual Studio products
A set of operators to manipulate datasets in .NET
– Supports traditional relational operators: Select, Join, GroupBy, Aggregate, etc.
– Integrated into .NET: programs can call operators; operators can invoke arbitrary .NET functions
Data model
– Data elements are strongly typed .NET objects
– More expressive than SQL tables
Is SQL Turing-complete? Is LINQ?
LINQ + Dryad = DryadLINQ

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);

var results = from c in collection
              where IsLegal(c.key)
              select new { hash = Hash(c.key), c.value };

[Diagram: the C# program's collection and results live on the client; DryadLINQ turns the query into a query plan (a Dryad job) whose vertices run C# code over the partitioned data.]
DryadLINQ System Architecture: on the client, a .NET program's LINQ query expression is handed to DryadLINQ, which produces a distributed query plan and vertex code; Dryad executes the plan on the cluster over input tables and writes output tables; the results are returned to the client as .NET objects (via ToTable / foreach).
Yet Yet Another Word Count
Count word frequency in a set of documents:

var docs = [A collection of documents];
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
Word Count in DryadLINQ
Count word frequency in a set of documents:

var docs = DryadLinq.GetTable<Doc>("file://docs.txt");
var words = docs.SelectMany(doc => doc.words);
var groups = words.GroupBy(word => word);
var counts = groups.Select(g => new WordCount(g.Key, g.Count()));
counts.ToDryadTable("counts.txt");
Distributed Execution of Word Count: DryadLINQ compiles the LINQ expression into a Dryad graph—IN → SM (SelectMany) → GB (GroupBy) → S (Select) → OUT—which Dryad then executes on the cluster.
DryadLINQ Design An optimizing compiler generates the distributed execution plan – Static optimizations: pipelining, eager aggregation, etc. – Dynamic optimizations: data-dependent partitioning, dynamic aggregation, etc. Automatic code generation and distribution by DryadLINQ and Dryad – Generates vertex code that runs on vertices, channel serialization code, callback code for runtime optimizations – Automatically distributed to cluster machines
Summary
The DAG dataflow graph is a powerful computation model
Language integration lets programmers easily use DAG-based computation
Decoupling of Dryad and DryadLINQ
– Dryad: execution engine (given a DAG, schedules tasks and handles fault tolerance)
– DryadLINQ: programming language and tools (given a query, generates the DAG)
Development
Works with any LINQ-enabled language
– C#, VB, F#, IronPython, …
Works with multiple storage systems
– NTFS, SQL, Windows Azure, Cosmos DFS
Released within Microsoft and used on a variety of applications
External academic release announced at PDC
– DryadLINQ in source, Dryad in binary
– UW, UCSD, Indiana, ETH, Cambridge, …
Advantages of DAG over MapReduce
Dependence is naturally specified
– MapReduce: a complex job runs as one or more MR stages; each stage adds tasking overhead, and the reduce tasks of each stage write their output to replicated storage
– Dryad: each job is represented by a single DAG; intermediate vertices write to local files
Dryad provides a more flexible and general framework
– E.g., multiple types of input/output
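To illustrate the overhead being compared, here is a hedged sketch of chaining two MapReduce stages with the classic org.apache.hadoop.mapred API; FirstStage, SecondStage, and all paths are hypothetical, and the per-job mapper/reducer configuration is omitted. Stage 1 must materialize its output in replicated HDFS before stage 2 can even be launched, whereas a Dryad DAG would connect the two stages directly through files, pipes, or shared memory.

// Inside a driver's main(String[] args) throws Exception { ... }
Path input = new Path(args[0]);
Path intermediate = new Path("/tmp/stage1-out");     // materialized in replicated HDFS storage
Path output = new Path(args[1]);

JobConf stage1 = new JobConf(FirstStage.class);      // hypothetical first-stage job class
FileInputFormat.setInputPaths(stage1, input);
FileOutputFormat.setOutputPath(stage1, intermediate);
JobClient.runJob(stage1);                            // blocks until stage 1 finishes

JobConf stage2 = new JobConf(SecondStage.class);     // hypothetical second-stage job class
FileInputFormat.setInputPaths(stage2, intermediate);
FileOutputFormat.setOutputPath(stage2, output);
JobClient.runJob(stage2);                            // in Dryad, the DAG connects the stages directly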
DryadLINQ in the Software Stack (layer diagram): applications—machine learning, image processing, graph analysis, data mining, and other applications—are written against DryadLINQ (or other languages); DryadLINQ runs on Dryad; Dryad runs on cluster services (Windows Server cluster services, the Azure platform) and storage systems (Cosmos DFS, SQL Servers, CIFS/NTFS), all on Windows Server.