Condor Project Computer Sciences Department University of Wisconsin-Madison Running Map-Reduce Under Condor
Cast of thousands › Mihai Pop, Michael Schatz, Dan Sommer (University of Maryland Center for Computational Biology) › Faisal Khan, Ken Hahn (UW) › David Schwartz (LMCG)
In 2003…
Shortly thereafter…
Two main Hadoop parts › HDFS (distributed storage) › MapReduce (distributed computation)
For more detail, see Dhruba Borthakur's CondorWeek 2009 talk: Week2009/condor_presentations/borthakur-hadoop_univ_research.ppt
HDFS overview › Making a POSIX distributed file system go fast is easy…
HDFS overview › …If you get rid of the POSIX part › Remove: random access, support for small files, authentication, in-kernel support
HDFS Overview › Add in: data replication (key for distributed systems), command line utilities (examples below)
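A quick taste of those command line utilities (a sketch; the file names and /user/alice paths are made up for illustration):
    hadoop fs -put reads.fastq /user/alice/reads.fastq   # copy a local file into HDFS
    hadoop fs -ls /user/alice                            # list it (the replication factor is one of the columns)
    hadoop fs -cat /user/alice/reads.fastq | head        # read it back
    hadoop fs -setrep 5 /user/alice/reads.fastq          # raise the replication factor for a hot file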
HDFS Architecture
HDFS Condor Integration › HDFS daemons run under the condor_master for management/control (config sketch below) › Added HAD support for the namenode › Added host-based security
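A minimal condor_config sketch of that setup (the HDFS knob names changed across HTCondor releases, so treat them as illustrative and check the manual for your version; the host names are hypothetical):
    # let the condor_master start and monitor the HDFS daemon
    DAEMON_LIST = MASTER, STARTD, HDFS
    # hypothetical namenode location (illustrative knob name)
    HDFS_NAMENODE = hdfs://namenode.example.edu:9000
    # host-based security: only cluster hosts may write
    ALLOW_WRITE = *.chtc.example.edu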
Condor HDFS: II › File transfer support: transfer_input_files = hdfs://… (submit file sketch below) › Spool in HDFS
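A sketch of a submit file that pulls its input straight from HDFS (the namenode host, port, paths, and the count_kmers executable are all hypothetical):
    universe                = vanilla
    executable              = count_kmers
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    transfer_input_files    = hdfs://namenode.example.edu:9000/user/alice/reads.fastq
    output = kmers.out
    error  = kmers.err
    log    = kmers.log
    queue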
Map Reduce
Shell hackers map reduce › grep tag input | sort | uniq -c | grep (grep is the map, sort is the shuffle, uniq -c is the reduce; a Hadoop Streaming version follows below)
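The same idea written as a Hadoop Streaming job (a sketch; the streaming jar path and HDFS paths vary by installation):
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -input  /user/alice/input \
        -output /user/alice/tag-counts \
        -mapper  "grep tag" \
        -reducer "uniq -c"
    # caveat: streaming treats a mapper's non-zero exit as failure, and grep
    # exits non-zero on a split with no matches, so a tiny wrapper script helps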
MapReduce lingo for the native Condor speaker › Task tracker ≈ startd/starter › Job tracker ≈ condor_schedd
Map Reduce under Condor › Zeroth law of software engineering › Job tracker/task tracker must be managed! Otherwise very bad things happen
Hadoop on Demand w/Condor
Map Reduce as overlay › Parallel Universe job › Starts the job tracker on rank 0 › Task trackers everywhere else (sketch below) › Open question: run more small jobs, or fewer bigger ones? › One job tracker per user (i.e. per job)
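A sketch of the overlay (mr_overlay.sh is a hypothetical wrapper; _CONDOR_PROCNO is the rank the parallel universe hands each node, and the config that points the task trackers at rank 0's host is omitted here):
    # submit file
    universe      = parallel
    executable    = mr_overlay.sh
    machine_count = 9          # 1 job tracker + 8 task trackers
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue

    # mr_overlay.sh
    #!/bin/sh
    if [ "$_CONDOR_PROCNO" = "0" ]; then
        $HADOOP_HOME/bin/hadoop jobtracker     # rank 0 becomes the job tracker
    else
        $HADOOP_HOME/bin/hadoop tasktracker    # everyone else runs a task tracker
    fi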
On to real science… › David Schwartz, matchmaker Mihai Pop
Contrail – MR genome assembly /contrail-bio/index.php
Genome assembly
DNA: ~3 billion base pairs in the human genome › Sequencing machines only read short fragments (reads) at a time
Already done this?
High throughput sequencers
Contrail: Scalable Genome Assembly with MapReduce › Genome: African male NA18507 (Bentley et al., 2008) › Input: 3.5B 36bp reads, 210bp insert (SRA000271) › Preprocessor: Quality-Aware Error Correction › Assembly stages:
    Stage              N        Max        N50
    Initial            >10 B    27 bp
    Compressed         >1 B     303 bp     < 100 bp
    Error Correction   5.0 M    14,… bp
    Cloud Surfing      4.2 M    20,… bp
    Resolve Repeats    In Progress
Running it under Condor › Used the CHTC B-240 cluster › ~100 machines: 8-way Nehalem CPUs, 12 GB total, 1 disk partition dedicated to HDFS › HDFS running under the condor_master
Running it on Condor › Used the MapReduce PU overlay › Started with fruit flies › … › And it crashed › Zeroth law of software engineering: version mismatch › Debugging…
Debugging › After a couple of debugging rounds › Fruit Fly sequenced!! On to humans!
Cardinality › How many slots per task tracker? (the task tracker, like the schedd, can manage multiple slots) › One machine: 8 cores, 1 disk, 1 memory system › How many mappers per slot? (config sketch below)
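One place cardinality shows up is the task tracker's slot count, set in mapred-site.xml in Hadoop 0.20-era deployments (the values here are just one plausible split for an 8-core, 1-disk node):
    <configuration>
      <property>
        <name>mapred.tasktracker.map.tasks.maximum</name>
        <value>6</value>    <!-- map slots per task tracker -->
      </property>
      <property>
        <name>mapred.tasktracker.reduce.tasks.maximum</name>
        <value>2</value>    <!-- reduce slots per task tracker -->
      </property>
    </configuration>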
More MR under Condor › More debugging: NullPointerExceptions › Updated MR again › Some performance regressions › One power outage › 12 weeks later…
Success!
Conclusions › Job trackers must be managed! › Glide-in is more than just Condor on batch › Hadoop: more than just MapReduce › HDFS: a good partner for Condor › All this stuff is moving fast