The case for cloud computing in genome informatics by Lincoln D Stein Genome Biology 2010, 11:207.

Introduction  The impending collapse of the genome informatics ecosystem  Major sequence-data-related databases: GenBank, EMBL, DDBJ, SRA, GEO, and ArrayExpress  Value-added integrators of genomic data: Ensembl, the UCSC Genome Browser, Galaxy, and many model organism databases

The old genome informatics ecosystem

Basis for the production and consumption of genomic information  Moore's Law: “the number of transistors that can be placed on an integrated circuit is increasing exponentially, with a doubling time of roughly 18 months.”  Similar exponential trends have been observed for disk storage and network capacity.  Until recently, the doubling time of DNA sequencing was just a bit slower than the growth of compute and storage capacity.

Historical trends in storage prices vs DNA sequencing costs  Note: this is a logarithmic plot, so exponential curves appear as straight lines.  “Next-generation” sequencing technologies introduced in the mid-2000s changed these trends and now threaten the conventional genome informatics ecosystem.

Current Situation  Soon it will cost less to sequence a base of DNA than to store it on a hard disk.  We are facing a potential tsunami of genome data that will swamp our storage systems and crush our compute clusters.  Who is affected:  Genome sequence repositories, which need to maintain their data and keep it up to date  “Power users” who are accustomed to downloading data to local computers for analysis

Cloud Computing  Cloud computing is a general term for computation-as-a-service.  Computation-as-a-service means that customers rent hardware and storage only for the time needed to achieve their goals.  In cloud computing the rentals are virtual.  The service provider implements the infrastructure that lets users create, upload, and launch virtual machines on its extremely large and powerful compute farm.

Virtual Machine  A virtual machine is software that emulates the properties of a physical computer.  Any operating system can be chosen to run on a virtual machine, and all the tasks we usually perform on a computer can be performed on it.  A virtual machine's bootable disk can be stored as an image; this image can be used as a template to start up multiple virtual machines, so launching a virtual compute cluster is very easy.  Large data sets are usually stored on virtual disks; a virtual disk image can be attached to a virtual machine as a local or shared hard drive.
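As a concrete, purely hypothetical illustration of how simple launching such machines has become, the sketch below uses the boto3 Python library to start three instances from a stored machine image and attach a virtual disk. The image ID, region, and sizes are placeholders rather than anything from the article, and a real script would wait for the instance and volume to become available before attaching.

    import boto3  # AWS SDK for Python; assumes credentials are already configured

    ec2 = boto3.client('ec2', region_name='us-east-1')

    # Launch three identical virtual machines from a stored image (placeholder AMI ID).
    reservation = ec2.run_instances(
        ImageId='ami-0123456789abcdef0',   # hypothetical bioinformatics image
        InstanceType='m5.large',
        MinCount=3,
        MaxCount=3,
    )

    # Create a 500 GB virtual disk and attach it to the first instance as extra storage
    # (in practice, wait until the instance is running and the volume is available).
    instance_id = reservation['Instances'][0]['InstanceId']
    volume = ec2.create_volume(Size=500, AvailabilityZone='us-east-1a')
    ec2.attach_volume(VolumeId=volume['VolumeId'], InstanceId=instance_id, Device='/dev/sdf')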

The “new” genome informatics ecosystem based on cloud computing  Casual users continue to work with the data via web pages in their accustomed way.  Power users have two options: continue downloading data to local clusters as before, or use cloud computing and move the cluster to the data.

Cloud computing service providers and their offered services  Some cloud computing service providers: Amazon (Elastic Compute Cloud, EC2), Rackspace Cloud, Flexiant.  Amazon offers a variety of bioinformatics-oriented virtual machine images:  Images prepopulated with Galaxy  Bioconductor – programming environment built on the R statistical package  GBrowse – genome browser  BioPerl  JCVI Cloud BioLinux – collection of bioinformatics tools including the Celera Assembler  Amazon also provides several large genomic datasets in its cloud:  A complete copy of GenBank (200 GB)  30x-coverage sequencing reads of a trio of individuals from the 1000 Genomes Project (700 GB)  Genome databases from Ensembl, including the annotated genomes of human and 50 other species

Cloud computing service providers, from the Nature Biotechnology news article “Up in a cloud?”

Academic cloud computing  A growing number of academic compute cloud projects are based on open-source cloud management software.  The Open Cloud Consortium is one example of such a project.

Cloud computing disadvantages  Cost of migrating existing systems into a new and unfamiliar environment  Security and privacy of the data:  An additional level of data encryption is needed (see the sketch below)  New software is needed to restrict access to authorized users only  Major question: does cloud computing make economic sense?
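To make the encryption point concrete, here is a minimal sketch (my illustration, not from the article) of encrypting a data file on the client side before it is uploaded to a cloud store, using the Python cryptography package; the file names are hypothetical.

    from cryptography.fernet import Fernet  # symmetric encryption; assumes the package is installed

    key = Fernet.generate_key()             # keep this key outside the cloud
    cipher = Fernet(key)

    with open('reads.fastq', 'rb') as fh:       # hypothetical sequencing file
        ciphertext = cipher.encrypt(fh.read())

    with open('reads.fastq.enc', 'wb') as fh:   # only the encrypted copy is uploaded
        fh.write(ciphertext)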

Comparing the relative costs of renting vs. buying computational services  According to a technical report on cloud computing by the UC Berkeley Reliable Adaptive Distributed Systems Laboratory:  When all the costs of maintaining a large compute cluster are considered, renting a data center from Amazon is only marginally more expensive than buying one.  When the flexibility of the cloud to support a virtual data center that downsizes or grows as needed is factored in, the economics start to look better.

Obstacles to moving genomics toward cloud computing  Network bandwidth – transferring a 100 gigabyte next-generation sequencing data file across a 10 Gb/s connection will take less than a day only if it hogs much of the institution's bandwidth.  Cloud computing service providers will have to offer some flexibility in how large datasets get into the system.

Cloud Technologies for Bioinformatics Applications  By Jaliya Ekanayake, Thilina Gunarathne, and Judy Qiu  IEEE Transactions on Parallel and Distributed Systems

Motivation for presenting the article  The exponential increase in the size of datasets implies that parallelism is essential to process the information in a timely fashion.  The algorithms presented in this paper might reduce computational analysis time significantly, whether used on our own cluster or on a rented Amazon virtual cluster.  This paper presents a nice example of how these new parallel computing algorithms can be applied to bioinformatics.

MapReduce  MapReduce is a software framework introduced by Google to support distributed computing on large data sets on computer clusters.  Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key.  Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.
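A minimal sketch of this programming model is below, using the standard word-count illustration (not an example from the paper); the plain Python only simulates the grouping and parallel execution that the framework would normally perform.

    from collections import defaultdict

    def map_fn(doc_id, text):
        # emit an intermediate (key, value) pair for every word
        for word in text.split():
            yield word, 1

    def reduce_fn(word, counts):
        # merge all intermediate values that share the same key
        yield word, sum(counts)

    # Simulate the shuffle/grouping step that the framework performs between the two phases.
    documents = {'d1': 'the cloud and the cluster', 'd2': 'the cloud'}
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            groups[word].append(count)

    results = dict(pair for word, counts in groups.items() for pair in reduce_fn(word, counts))
    print(results)   # {'the': 3, 'cloud': 2, 'and': 1, 'cluster': 1}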

Two steps of MapReduce  Map: The master node takes the input, chops it up into smaller sub-problems, and distributes them to worker nodes. Each worker node processes its smaller problem and passes the answer back to the master node.  Reduce: The master node then takes the answers to all the sub-problems and combines them to produce the output – the answer to the problem it was originally trying to solve.

MapReduce Illustrations

Objectives  Examine three technologies:  Microsoft Dryad/DryadLINQ  Apache Hadoop  Message Passing Interface (MPI)  These three technologies are applied to two bioinformatics applications:  Expressed Sequence Tag (EST) assembly  Alu clustering

Technologies  MPI – The goal of MPI is to provide a widely used standard for writing message-passing programs. MPI is used in distributed and parallel computing.  Dryad/DryadLINQ – a distributed execution engine for coarse-grain data-parallel applications. It combines the MapReduce programming style with dataflow graphs to solve computational problems.  Apache Hadoop – an open-source Java implementation of MapReduce. Hadoop schedules MapReduce computation tasks according to data locality.
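For contrast with the MapReduce style, the rough sketch below (my illustration, assuming the mpi4py bindings; not code from the paper) shows a scatter/compute/gather pattern written with explicit message passing, the way an MPI programmer manages parallelism directly.

    from mpi4py import MPI   # Python bindings for MPI; assumes the library is installed

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # The root process splits the work into one chunk per process.
    chunks = [list(range(i, 100, size)) for i in range(size)] if rank == 0 else None

    my_chunk = comm.scatter(chunks, root=0)       # explicit data distribution
    partial = sum(x * x for x in my_chunk)        # local computation on each process

    totals = comm.gather(partial, root=0)         # explicit collection of results
    if rank == 0:
        print('sum of squares:', sum(totals))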

Bioinformatics applications  Alu sequence classification is one of the most challenging problems for sequence clustering because Alus represent the largest repeat family in the human genome. For the specific objectives of the paper, an open-source version of the Smith-Waterman-Gotoh (SW-G) algorithm was used.  An EST (Expressed Sequence Tag) corresponds to messenger RNA transcribed from genes residing on chromosomes. Each individual EST sequence represents a fragment of mRNA, and EST assembly aims to reconstruct the full-length mRNA sequence for each expressed gene. The specific implementation used for this research is the CAP3 software.
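Both applications are pleasingly parallel: the Alu work computes all-pairs distances and CAP3 assembles independent inputs. The sketch below is my own illustration of farming an all-pairs computation out to workers; the scoring function is a runnable placeholder standing in for a real SW-G alignment, and the sequences are toy input.

    from itertools import combinations
    from multiprocessing import Pool

    def pair_distance(pair):
        # Placeholder for a real Smith-Waterman-Gotoh alignment score;
        # here we just compare sequence lengths so the sketch is runnable.
        (i, a), (j, b) = pair
        return i, j, abs(len(a) - len(b))

    sequences = ['GGCCGGGCGCGGT', 'GGCTCACGCCTGTAATCC', 'CAGGAGTTCGAGACC']  # toy input

    if __name__ == '__main__':
        pairs = list(combinations(enumerate(sequences), 2))
        with Pool() as pool:                      # independent tasks, like map tasks
            distances = pool.map(pair_distance, pairs)
        print(distances)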

Scalability of SW-G implementations

Scalability of CAP3 implementations

How different technologies perform on cloud computing clusters  The authors were interested in studying the overhead of virtual machines for applications that use frameworks such as Hadoop and Dryad.  The performance degradation of the Hadoop SW-G application in a virtual environment ranges from 15% to 25%.
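Assuming the usual definition of overhead as relative slowdown versus bare metal (the slide does not spell it out), the quoted percentages correspond to:

    def vm_overhead(time_on_vm, time_on_bare_metal):
        # Relative slowdown of the virtualized run compared with bare metal;
        # a value of 0.20 corresponds to the ~20% degradation quoted above.
        return (time_on_vm - time_on_bare_metal) / time_on_bare_metal

    print(vm_overhead(120.0, 100.0))   # 0.2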

How different technologies perform on cloud computing clusters (cont'd)  The performance degradation of CAP3 remains roughly constant, near 20%.  The CAP3 application does not show the decrease in VM overhead with increasing problem size that the SW-G application does.

Conclusions  The flexibility of clouds and MapReduce suggests they will become the preferred approaches.  The support for handling large data sets, the concept of moving computation to the data, and the better quality of service provided by cloud technologies simplify the implementation of some problems compared with traditional systems.  Experiments showed that DryadLINQ and Hadoop achieve similar performance and that virtual machines give overheads of around 20%.  Support for inhomogeneous data is important, and currently Hadoop performs better than DryadLINQ in this respect.  MapReduce offers a higher-level interface, and the user needs less explicit control of the parallelism.

Any questions? Thank you for your time