DryadLINQ: Computer Vision (among other things) on a cluster
ECCV AC workshop, 14th June 2008
Michael Isard, Microsoft Research, Silicon Valley


Parallel programming, yada yada
- Intel claims we will all have many-core, etc.
- "This algorithm is easily parallelizable"
  - Not "we implemented a parallel version"
- Historically, low-latency fine-grain parallelism
  - Shared-memory SMP (threads, locks, etc.)
  - MPI (finite-element analysis, etc.)
- But also data-parallel!
  - We have lots of data now (video, the web)
  - But most people still use their laptops/toy data
  - Even "big" systems use tens of computers

Why do people use Matlab?
- Parallel programming is tedious and complex
  - Distributed programming is even worse
  - Perl scripts, manual management of data, ...
- Matlab is easy (or at least popular)
  - Relatively few high-level constructs
  - System "does the right thing"
  - Programmers willing to put up with a lot
- We want a similarly low barrier to entry
  - Familiar languages, legacy codebases, etc.

What are we doing?
- When single-computer processing runs out of steam
  - Web-scale processing of terabytes of data: infeasible without a big cluster
  - Network log-mining, machine learning
- Multi-week job → 4 hours on 250 computers
- 1-hour iteration → 3.5 minutes on 4 computers

A typical data-intensive query

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user == "ulfar"
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

Ulfar's most frequently visited web pages

Steps in the query

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user == "ulfar"
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

- Go through logs and keep only lines that are not comments; parse each line into a LogEntry object.
- Go through logentries and keep only entries that are accesses by ulfar.
- Group ulfar's accesses according to what page they correspond to; for each page, count the occurrences.
- Sort the pages ulfar has accessed according to access frequency.

Serial execution

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user == "ulfar"
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

- For each line in logs, do...
- For each entry in logentries, do...
- Sort the entries in user by page, then iterate over the sorted list, counting the occurrences of each page as you go.
- Re-sort the entries by page frequency.
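The serial plan above is just ordinary collection processing. As a minimal illustration, here is the same filter → group/count → sort pipeline as a Python sketch; the simplified two-field "user page" log format and helper name are invented for this example, not taken from the slides:

```python
from collections import Counter

def most_visited_pages(logs, user="ulfar"):
    # Keep only non-comment lines and parse each into (user, page).
    entries = [line.split() for line in logs if not line.startswith("#")]
    # Keep only this user's accesses.
    accesses = [e[1] for e in entries if e[0] == user]
    # Group by .htm page and count, then sort by descending frequency.
    counts = Counter(p for p in accesses if p.endswith(".htm"))
    return sorted(counts.items(), key=lambda kv: -kv[1])

logs = [
    "# comment",
    "ulfar index.htm",
    "ulfar index.htm",
    "bob index.htm",
    "ulfar about.htm",
]
print(most_visited_pages(logs))  # [('index.htm', 2), ('about.htm', 1)]
```

Each step corresponds to one of the query clauses; the point of DryadLINQ is that the same declarative form also admits the parallel plan on the next slide.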

Parallel execution

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user == "ulfar"
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

Linear Regression

Vectors x = input(0), y = input(1);
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));
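The program computes the least-squares solution A = (Σᵢ yᵢxᵢᵀ)(Σᵢ xᵢxᵢᵀ)⁻¹; the distributed pattern is per-partition partial sums followed by a combine, and the inverse/multiply run on the single combined result. A Python sketch of that pattern for the scalar (1-D) case, where the "outer products" degenerate to ordinary products (the function name and the toy partitions are illustrative):

```python
def linear_regression_1d(partitions):
    # Each partition holds (x, y) pairs; compute per-partition partial
    # sums of x*x and y*x (the 1-D analogue of the outer-product sums).
    xx_parts = [sum(x * x for x, _ in part) for part in partitions]
    yx_parts = [sum(y * x for x, y in part) for part in partitions]
    # Combine the partial sums, then "invert" and multiply.
    xxs, yxs = sum(xx_parts), sum(yx_parts)
    return yxs / xxs  # A such that y ≈ A * x

# Data drawn from y = 3x, split across two "partitions".
parts = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0)]]
print(linear_regression_1d(parts))  # 3.0
```

In the DryadLINQ version, Sum() is exactly such a partial-sum-then-combine operator over distributed matrix partitions.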

Execution Graph

[Figure: execution DAG for the linear regression. Input partitions X[0], X[1], X[2] and Y[0], Y[1], Y[2] feed X×Xᵀ and Y×Xᵀ vertices; each branch is summed (Σ), the X×Xᵀ sum is inverted ([ ]⁻¹), and a final multiply (*) produces A.]

DryadLINQ
- Programmer writes sequential C# code
  - Rich type system, libraries, modules, loops...
  - System can figure out the data-parallelism
- System sees declarative expression plans
  - Full control of high-level optimizations
  - Traditional parallel-database tricks

Dryad execution engine
- General-purpose execution environment for distributed, data-parallel applications
  - Concentrates on throughput, not latency
  - Assumes a private data center
- Automatic management of scheduling, distribution, fault tolerance, etc.
- Well tested over two years on clusters of thousands of computers

Andrew Birrell, Mihai Budiu, Dennis Fetterly, Michael Isard, Yuan Yu

Job = Directed Acyclic Graph

[Figure: a Dryad job is a DAG of processing vertices connected by channels (file, pipe, or shared memory), reading from inputs and writing to outputs.]

Scheduler state machine
- Scheduling a DAG
  - A vertex can run anywhere once all of its inputs are ready
    - Constraints/hints place it near its inputs
- Fault tolerance
  - If A fails, run it again
  - If A's inputs are gone, run the upstream vertices again (recursively)
  - If A is slow, run another copy elsewhere and use the output from whichever finishes first
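The "run once all inputs are ready" rule is a topological traversal of the job DAG. A minimal Python sketch (the vertex names reuse the linear-regression graph from the earlier slide; the function and dictionary layout are invented for illustration, and real Dryad additionally tracks placement hints and failures):

```python
def topo_schedule(deps):
    """deps maps vertex -> set of vertices it reads from.
    Returns one valid execution order: a vertex runs only
    once all of its inputs have completed."""
    done, order = set(), []
    pending = set(deps)
    while pending:
        ready = {v for v in pending if deps[v] <= done}
        if not ready:
            raise ValueError("cycle: not a DAG")
        for v in sorted(ready):  # deterministic order for the example
            order.append(v)
            done.add(v)
        pending -= ready
    return order

# The linear-regression execution graph as a toy DAG.
deps = {
    "X": set(), "Y": set(),
    "XX": {"X"}, "YX": {"X", "Y"},
    "inv": {"XX"}, "A": {"inv", "YX"},
}
print(topo_schedule(deps))  # ['X', 'Y', 'XX', 'YX', 'inv', 'A']
```

Fault tolerance fits naturally into this state machine: a failed vertex simply moves back into the pending set, along with any upstream vertices whose outputs were lost.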

Static/dynamic optimizations
- Static optimizer builds the execution graph
- Dynamic optimizer mutates the running graph
  - Picks the number of partitions when the size is known
  - Builds aggregation trees based on locality
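An aggregation tree combines partial results in stages with bounded fan-in rather than at a single vertex. A hedged Python sketch of the shape of that optimization (the function name and fan-in grouping are illustrative; the real system chooses groups by rack/machine locality, not list position):

```python
def aggregation_tree(partials, fanin=2, combine=sum):
    # Repeatedly combine groups of up to `fanin` partial results
    # until a single result remains.
    level = list(partials)
    while len(level) > 1:
        level = [combine(level[i:i + fanin])
                 for i in range(0, len(level), fanin)]
    return level[0]

print(aggregation_tree([1, 2, 3, 4, 5], fanin=2))  # 15
```

With n partials this takes O(log n) stages, and each combine vertex reads only a few, ideally nearby, inputs instead of all n.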

LINQ
- Constructs/type system in .NET v3.5
- Operators to manipulate datasets
  - Data elements are arbitrary .NET types
- Traditional relational operators
  - Select, Join, Aggregate, etc.
- Extensible
  - Add new operators
  - Add new implementations

DryadLINQ
- Automatically distributes a LINQ program
- Few Dryad-specific extensions
- The same source program runs on single-core through multi-core up to a cluster

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

A complete DryadLINQ program

public class LogEntry {
    public string user;
    public string ip;
    public string page;
    public LogEntry(string line) {
        string[] fields = line.Split(' ');
        this.user = fields[8];
        this.ip = fields[9];
        this.page = fields[5];
    }
}

public class UserPageCount {
    public string user;
    public string page;
    public int count;
    public UserPageCount(string user, string page, int count) {
        this.user = user;
        this.page = page;
        this.count = count;
    }
}

DryadDataContext ddc = new DryadDataContext("fs://logfile");
DryadTable<LogEntry> logs = ddc.GetTable<LogEntry>();

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);
var user = from access in logentries
           where access.user == "ulfar"
           select access;
var accesses = from access in user
               group access by access.page into pages
               select new UserPageCount("ulfar", pages.Key, pages.Count());
var htmAccesses = from access in accesses
                  where access.page.EndsWith(".htm")
                  orderby access.count descending
                  select access;

htmAccesses.ToDryadTable("fs://results");

DryadLINQ: From LINQ to Dryad

var logentries = from line in logs
                 where !line.StartsWith("#")
                 select new LogEntry(line);

[Figure: the LINQ query above becomes a Dryad query plan (logs → where → select). Query plan generation is automatic; the plan is then executed in a distributed fashion by Dryad.]

How does it work?
- Sequential code "operates" on datasets
- But really it just builds an expression graph
  - Lazy evaluation
- When a result is retrieved
  - The entire graph is handed to DryadLINQ
  - The optimizer builds an efficient DAG
  - The program is executed on the cluster
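The lazy-evaluation idea above can be sketched in a few lines: each operator records itself in a plan instead of running, and work happens only when results are requested. This toy Python Query class is an invented illustration, not DryadLINQ's actual implementation:

```python
class Query:
    """Operators append to an expression plan instead of executing;
    evaluation is deferred until results are requested."""
    def __init__(self, source, plan=()):
        self.source, self.plan = source, tuple(plan)

    def where(self, pred):
        return Query(self.source, self.plan + (("where", pred),))

    def select(self, f):
        return Query(self.source, self.plan + (("select", f),))

    def to_list(self):
        # The only point where work actually happens; a real system
        # would hand self.plan to an optimizer here.
        data = list(self.source)
        for op, f in self.plan:
            if op == "select":
                data = [f(x) for x in data]
            else:
                data = [x for x in data if f(x)]
        return data

q = Query(range(10)).where(lambda x: x % 2 == 0).select(lambda x: x * x)
print(q.to_list())  # [0, 4, 16, 36, 64]
```

Because the whole plan is visible before anything runs, the optimizer can reorder, fuse, and partition the operators, which is what lets the same program scale from one core to a cluster.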

Terasort
- 10 billion 100-byte records (10^12 bytes)
- 240 computers, 960 disks
- 349 secs, comparable with the record

public struct TeraRecord : IComparable<TeraRecord> {
    public const int RecordSize = 100;
    public const int KeySize = 10;
    public byte[] content;
    public int CompareTo(TeraRecord rec) {
        for (int i = 0; i < KeySize; i++) {
            int cmp = this.content[i] - rec.content[i];
            if (cmp != 0) return cmp;
        }
        return 0;
    }
    public static TeraRecord Read(DryadBinaryReader rd) {
        TeraRecord rec;
        rec.content = rd.ReadBytes(RecordSize);
        return rec;
    }
    public static int Write(DryadBinaryWriter wr, TeraRecord rec) {
        return wr.WriteBytes(rec.content);
    }
}

class Terasort {
    public static void Main(string[] args) {
        DryadDataContext ddc = new DryadDataContext(…);
        DryadTable<TeraRecord> records = ddc.GetPartitionedTable<TeraRecord>("sherwood-sort2.pt");
        var q = records.OrderBy(x => x);
        q.ToDryadPartitionedTable("sherwood-sort2.pt");
    }
}
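The OrderBy above expands into the standard distributed-sort plan: range-partition records by key, sort each partition locally, and concatenate the sorted partitions. A Python sketch of that plan on byte-keyed records (the function name and the naive bucketing by first key byte are illustrative assumptions, not the slides' actual partitioning scheme):

```python
def terasort(records, key_size=10, buckets=4):
    # Range-partition by the first key byte so that every record in
    # bucket i sorts before every record in bucket i+1.
    parts = [[] for _ in range(buckets)]
    for r in records:
        parts[r[0] * buckets // 256].append(r)
    # Sort each partition locally on the key prefix, then concatenate.
    out = []
    for p in parts:
        out.extend(sorted(p, key=lambda r: r[:key_size]))
    return out

records = [bytes([200, 97]), bytes([10, 98]), bytes([130, 99])]
print(terasort(records, key_size=2))
```

The local sorts are independent, so each partition can run on a different machine; the only global coordination is choosing the range boundaries.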

Machine Learning in DryadLINQ

[Figure: software stack; machine learning and data analysis applications sit on a large-vector library, which is built on DryadLINQ, which runs on Dryad.]

Kannan Achan, Mihai Budiu

Linear Regression Code

Vectors x = input(0), y = input(1);
Matrices xx = x.PairwiseOuterProduct(x);
OneMatrix xxs = xx.Sum();
Matrices yx = y.PairwiseOuterProduct(x);
OneMatrix yxs = yx.Sum();
OneMatrix xxinv = xxs.Map(a => a.Inverse());
OneMatrix A = yxs.Map(xxinv, (a, b) => a.Multiply(b));

Expectation Maximization

[Figure: EM clustering example; 3 iterations shown.]

Computer vision
- Ongoing
  - Epitomes, features for image search, ...
- Anecdotal evidence
  - Nebojsa Jojic, Anitha Kannan
  - Tutorial from Mihai
  - Anitha implemented the Probabilistic Image Map algorithm in an afternoon

Continuing research
- Application-level research
  - What can we write with DryadLINQ?
- System-level research
  - Performance, usability, etc.
- Lots of interest from learning/vision researchers