

The cloud … The cluster

What is “the cloud”?
1. Many computers “in the sky”
2. A service “in the sky”
3. Sometimes #1 and #2

Basically virtual computers…

To you.

What is a virtual computer?

What is a “regular” computer?
[Diagram: Core 1, Core 2, Core 3, Core 4; 8 GB]

[Diagram: Core 1, Core 2, Core 3, Core 4; 8 GB]

[Diagram: tasks shown: transcript assembly; mrbayes, model 1; mrbayes, model 2]

But it’s even cooler than that. You can have it your way!
– Each machine can be set up just like your computer (programs, settings, etc.)
– Different machines for different tasks
– Or one large machine for all tasks
– Caveat: pretty much command line only

Momentary Digression
What is the command line?
– Text-based means of interacting with your computer
– More likely to be used on OS X or Linux
– Fast
– Somewhat obscure

So, why, again, is this helpful?
The Cloud can make similar resources available at a fraction of their overall cost. It’s essentially “on-demand” computing power.
48 cores, 256 GB RAM = $33,500

Benefits of The Cloud
Pay by the hour
Use what you need
No purchase/depreciation of equipment
Almost instant access to many resources
– If you need 1 node, no problem
– If you need 500 nodes, no problem

Costs of The Cloud
Few safety nets
– With flexibility comes the power to do wrong
Interactions can be complex
– Requires proficiency in seemingly arcane tools (the CLI)
Can be expensive
Must rely on “others”

68.4 GB RAM, 8 cores

$2.00/hr.
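The two prices quoted on the slides can be put side by side with a little arithmetic. The sketch below assumes the $33,500 figure is a one-time purchase price and ignores power, administration and the fact that the two machines are not the same size (48 cores and 256 GB versus 8 cores and 68.4 GB), so it is only a rough guide. The function name is made up for illustration.

```python
# Figures from the slides; the comparison itself is an illustrative assumption.
PURCHASE_PRICE_USD = 33_500      # 48 cores, 256 GB RAM
CLOUD_RATE_USD_PER_HOUR = 2.00   # 8 cores, 68.4 GB RAM


def break_even_hours(purchase_price, hourly_rate):
    """Hours of rental that cost as much as buying the hardware outright."""
    return purchase_price / hourly_rate


hours = break_even_hours(PURCHASE_PRICE_USD, CLOUD_RATE_USD_PER_HOUR)
print(f"{hours:,.0f} instance-hours before renting costs as much as buying")
print(f"about {hours / 24:,.0f} days of running one instance around the clock")
```

If a job only needs a few hundred hours a year, renting is likely the cheaper route; if the machine would be busy nearly all the time, the balance shifts.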

Why would you use this?
Data pre-processing
– Read trimming, adapter trimming
Genome assembly
Long-running processes that tie up machines
– mrbayes, raxml, best
– alignments (blast, blat, lastz, bwa)

Practical example
De novo genome assembly
– Have many reads
– Need to put them together
– Generally RAM intensive
– Generally slow

Actual example
Start an Amazon EC2 “instance”
Add in necessary software
Add 454 assembly software
Get data to machine
Start assembly
Let it run
Download assembled data

Reads → Align and orient → Assemble

Why is this hard?
Must ensure correct ends overlap
Must put correct pieces together
Must do this quickly
– Do things in RAM/memory
Must deal with massive amounts of data
– 0.5 to 2 to 20 GB or more
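To make “correct ends overlap” concrete, here is a toy sketch of the core check: does the end of one read match the start of another? It is a naive exact-match comparison written for illustration, not what real assemblers do (they tolerate sequencing errors and use indexed data structures). The function names and the `min_length` parameter are assumptions, and `min_length` should be at least 1.

```python
def overlap_length(left, right, min_length=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_length` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(max_len, min_length - 1, -1):
        if left[-length:] == right[:length]:
            return length
    return 0


def merge_reads(left, right, min_length=3):
    """Join two reads at their overlap, or return None if they do not overlap."""
    n = overlap_length(left, right, min_length)
    if n == 0:
        return None
    return left + right[n:]


print(merge_reads("ACGTTGCA", "TGCAGGT"))
```

Doing this for every pair among millions of reads is what makes assembly slow and memory hungry.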

What, exactly, is a “cluster”?
A group of machines interacting to achieve a common goal

Clusters
1000 work units

125 work units: ~8X speedup, or 1/8th the time
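The 1000 versus 125 figures correspond to dividing the work evenly across 8 machines. The sketch below shows that arithmetic under idealized assumptions: every work unit takes the same time, and there is no cost for distributing work or collecting results. The function names are illustrative.

```python
def split_work(n_units, n_nodes):
    """Number of work units each node receives, spread as evenly as possible."""
    base, extra = divmod(n_units, n_nodes)
    # The first `extra` nodes take one additional unit each.
    return [base + 1 if i < extra else base for i in range(n_nodes)]


def ideal_speedup(n_units, n_nodes):
    """Serial time divided by the time of the busiest node (no overhead assumed)."""
    return n_units / max(split_work(n_units, n_nodes))


print(split_work(1000, 8))     # units handled by each of the 8 nodes
print(ideal_speedup(1000, 8))  # upper bound on the speedup
```

Real clusters fall short of this bound because scheduling, data transfer and uneven job lengths all add time.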

Why?
Very long running processes/complex jobs
– Genome:Genome alignments
– Substitution models for thousands of loci
– Species trees for thousands of loci
Sometimes the only way to accomplish a “genome-scale” job in a reasonable time frame

Practical example
[Diagram: chr1, labeled “Similar”]

Practical example
[Diagram: chr1, chr2, chr3, chr4]

Practical example
[Diagram: chr1, chr2, chr3, chr4 alongside chr1, chr2, chr3, chr4]
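One way to read the chromosome diagrams is that a genome:genome alignment can be broken into independent chromosome-versus-chromosome jobs that a cluster runs side by side. The sketch below assumes that reading; the chromosome lists come from the slides, while the function names and the round-robin assignment are illustrative choices rather than how any particular scheduler works.

```python
from itertools import product

# Chromosome names as shown on the slides, one list per genome.
QUERY_CHROMS = ["chr1", "chr2", "chr3", "chr4"]
TARGET_CHROMS = ["chr1", "chr2", "chr3", "chr4"]


def alignment_jobs(query_chroms, target_chroms):
    """Every (query, target) chromosome pair, each an independent job."""
    return list(product(query_chroms, target_chroms))


def assign_round_robin(jobs, n_nodes):
    """Deal jobs out to nodes one at a time, like dealing cards."""
    return [jobs[i::n_nodes] for i in range(n_nodes)]


jobs = alignment_jobs(QUERY_CHROMS, TARGET_CHROMS)
for node, node_jobs in enumerate(assign_round_robin(jobs, 4), start=1):
    print(f"node {node}: {node_jobs}")
```

With 4 chromosomes per genome there are 16 pairs, so 4 nodes each take 4 jobs.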

Cluster Caveats
Sometimes not suited to certain jobs
– Essentially those without component parts
– Some modeling (e.g. MCMC)
Complex
– More moving parts = more to break

Clusters in the Cloud
You have a big, complicated job
You need many computers for a job
You need to run the job infrequently
You don’t have massive computer resources

The Cloud as a service
Alternative meaning of The Cloud
Essentially web-powered software
“Galaxy” is one such service

Galaxy
Very powerful analyses
Relatively simple to use
Repeatable
Understandable
Extendable

Galaxy – Basic services
Convert fastq to fasta
Summarize fastq reads
Fasta + Qual to Fastq
Trim fastq reads
Merge data sets
Convert SFF
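To give a feel for what a basic service such as “Convert fastq to fasta” does, here is a minimal sketch of the conversion itself. It is not Galaxy’s implementation. It assumes the common four-line FASTQ layout (header, sequence, “+” line, qualities) with no wrapped sequences, and the function name is made up.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line FASTQ records to FASTA, dropping the quality lines."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    records = []
    for i in range(0, len(lines), 4):
        header, sequence, plus = lines[i], lines[i + 1], lines[i + 2]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # FASTA headers start with '>' instead of '@'.
        records.append(f">{header[1:]}\n{sequence}")
    return "\n".join(records) + "\n" if records else ""


example = "@read1\nACGT\n+\nIIII\n@read2\nGGCC\n+\nIIII\n"
print(fastq_to_fasta(example), end="")
```

The conversion simply discards quality scores, which is why it only goes one way without a separate quality file.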

Galaxy – Advanced services
Intersect genomic regions
Merge genomic regions
Map with bowtie
Map with bwa
Use bwa to identify variants
Convert genome coordinates
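“Merge” and “intersect” on genomic regions are interval operations. The sketch below illustrates the idea and is not Galaxy’s code. It assumes regions are `(chromosome, start, end)` tuples with half-open coordinates, and that regions which touch are merged; both are choices made for this example.

```python
def merge_regions(regions):
    """Combine overlapping or touching regions on the same chromosome."""
    merged = []
    for chrom, start, end in sorted(regions):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            last_chrom, last_start, last_end = merged[-1]
            # Extend the previous region instead of starting a new one.
            merged[-1] = (last_chrom, last_start, max(last_end, end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect_regions(first, second):
    """Portions shared by a region in `first` and a region in `second`."""
    shared = []
    for chrom_a, start_a, end_a in first:
        for chrom_b, start_b, end_b in second:
            start, end = max(start_a, start_b), min(end_a, end_b)
            if chrom_a == chrom_b and start < end:
                shared.append((chrom_a, start, end))
    return sorted(shared)


regions = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 100, 200)]
print(merge_regions(regions))
print(intersect_regions([("chr1", 100, 300)], [("chr1", 250, 400), ("chr2", 100, 300)]))
```

The all-pairs loop in `intersect_regions` is fine for a handful of regions but far too slow for genome-scale inputs, where sorted sweeps or interval trees are the usual approach.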

Actual example
Finding “missing” genes
– You have a genome sequence
– You have gene annotation (i.e. RefSeq)
– You have aligned mRNA data
– You want to know where these do not overlap
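The “missing genes” question boils down to finding regions in one set that have no overlap with any region in another set. The sketch below assumes `(chromosome, start, end)` tuples with half-open coordinates and uses invented coordinates purely for illustration. The slides do not say which direction the comparison runs, so the function works either way.

```python
def overlaps(a, b):
    """True when two (chrom, start, end) regions share at least one base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def without_overlap(regions, others):
    """Regions that do not overlap anything in `others`."""
    return [r for r in regions if not any(overlaps(r, o) for o in others)]


# Invented coordinates, for illustration only.
annotation = [("chr1", 100, 500), ("chr1", 1000, 1500)]
mrna_alignments = [("chr1", 200, 400), ("chr1", 2000, 2300), ("chr2", 100, 300)]

# Aligned mRNA with no annotated gene underneath: candidate "missing" genes.
print(without_overlap(mrna_alignments, annotation))
# Annotated genes with no aligned mRNA.
print(without_overlap(annotation, mrna_alignments))
```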

Galaxy is very flexible
Runs locally
Runs on network
Runs on cluster
Runs in cloud
Runs on cluster in cloud

Galaxy has some prerequisites
You know what you want to do
You generally know how to do it
You know what data you need
You know how to ensure the results are correct
Galaxy abstracts away the complexity of the implementation steps