The cloud… The cluster…
What is “the cloud”? 1. Many computers “in the sky” 2. A service “in the sky” 3. Sometimes #1 and #2
Basically virtual computers…
To you.
What is a virtual computer?
What is a “regular” computer? [Diagram: Core 1, Core 2, Core 3, Core 4; 8 GB]
[Diagram: Core 1, Core 2; 8 GB; Core 3, Core 4]
[Diagram labels: transcript assembly; mrbayes – model 1; mrbayes – model 2]
But it’s even cooler than that. You can have it your way! – Each machine can be set up just like your computer: programs, settings, etc. – Different machines for different tasks – Or one large machine for all tasks – Caveat: pretty much command line only
Momentary Digression What is the command line? – Text-based means of interacting with your computer – More likely to be used on OS X or Linux – Fast – Somewhat obscure
So, why, again, is this helpful? The Cloud can make similar resources available at a fraction of their overall cost. It’s essentially “on-demand” computing power. 48 Cores, 256 GB RAM = $33,500
Benefits of The Cloud Pay by the hour Use what you need No purchase/depreciation of equipment Almost instant access to many resources – If you need 1 node, no problem – If you need 500 nodes, no problem
Costs of The Cloud Few safety nets – With flexibility comes the power to do wrong Interactions can be complex – Requires proficiency in seemingly arcane tools (the CLI) Can be expensive Must rely on “others”
68.4 GB RAM 8 Cores
$2.00/hr.
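A rough way to compare the two price tags is to ask how many rented hours the purchase price would buy. This is only a sketch under stated assumptions: the purchased machine (48 cores, 256 GB RAM) and the rented one (8 cores, 68.4 GB RAM) are not the same size, and power, administration, storage, and data-transfer costs are ignored. The function name is made up for illustration.

```python
# Prices are the ones quoted on the slides; everything else is illustrative.
PURCHASE_PRICE = 33_500.00  # 48 cores, 256 GB RAM, bought outright
HOURLY_RATE = 2.00          # 8 cores, 68.4 GB RAM, rented by the hour


def break_even_hours(purchase_price: float, hourly_rate: float) -> float:
    """Hours of rental that cost the same as buying the machine."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_price / hourly_rate


hours = break_even_hours(PURCHASE_PRICE, HOURLY_RATE)
print(f"{hours:.0f} rented hours, or about {hours / 24:.0f} days of continuous use")
```

If a job only needs a machine for a few days a year, renting stays far below the break-even point; if the machine would be busy around the clock, buying starts to look better.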
Why would you use this? Data pre-processing – Read trimming, Adapter trimming Genome assembly Long-running processes that tie up machines – mrbayes, raxml, best – alignments (blast, blat, lastz, bwa)
Practical example De novo Genome assembly – Have many reads – Need to put them together – Generally RAM intensive – Generally slow
Actual example Start an Amazon EC2 “instance” Add in necessary software Add 454 assembly software Get data to machine Start assembly Let it run Download assembled data
Reads Align and orient Assemble
Why is this hard? Must ensure correct ends overlap Must put correct pieces together Must do this quickly – Do things in RAM/memory Must deal with massive amounts of data – 0.5 to 2 to 20 GB or more
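The “correct ends overlap” requirement can be illustrated with a toy example. This is a minimal sketch, not how any real assembler (including the 454 software mentioned earlier) works: it assumes error-free reads on the same strand and simply looks for the longest suffix of one read that matches the prefix of another. The function names and the `min_length` parameter are made up for illustration.

```python
from typing import Optional


def overlap_length(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no overlap of at least `min_length` bases exists.
    """
    longest = min(len(left), len(right))
    shortest = max(min_length, 1)
    # Try the longest candidate first so the best overlap wins.
    for size in range(longest, shortest - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def merge_reads(left: str, right: str, min_length: int = 3) -> Optional[str]:
    """Join two reads on their overlap, or return None if they do not overlap."""
    size = overlap_length(left, right, min_length)
    if size == 0:
        return None
    return left + right[size:]


print(merge_reads("ACGTTGCA", "TGCAGGT"))  # ACGTTGCAGGT
```

Comparing every read against every other read this way grows with the square of the number of reads, which is one reason real assemblies need so much RAM and time.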
What, exactly, is a “cluster” Group of machines interacting to achieve a common goal
1000 Work Units Clusters
125 Work Units ~ 8X speedup or 1/8th the time
Why? Very long running processes/complex jobs – Genome:Genome alignments – Substitution models for thousands of loci – Species trees for thousands of loci Sometimes the only way to accomplish a “genome-scale” job in a reasonable time frame
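The arithmetic behind the 8X figure can be written out. This sketch assumes 8 machines (1000 / 125), equally sized work units, and no cost for distributing work or collecting results, which is why the real speedup is only approximately 8X. The function names are hypothetical.

```python
import math


def units_per_machine(total_units: int, machines: int) -> int:
    """Largest share any one machine receives when work is split evenly."""
    return math.ceil(total_units / machines)


def ideal_speedup(total_units: int, machines: int) -> float:
    """Speedup if the job finishes when the busiest machine finishes."""
    return total_units / units_per_machine(total_units, machines)


print(units_per_machine(1000, 8))  # 125
print(ideal_speedup(1000, 8))      # 8.0
```

When the work does not divide evenly, the busiest machine sets the finish time, so the speedup falls slightly short of the machine count.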
Practical example chr1 Similar
Practical example chr1 chr2 chr3 chr4
Practical example chr1 chr2 chr3 chr4 chr1 chr2 chr3 chr4
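One reading of the chromosome diagrams is that a genome:genome alignment breaks into independent chromosome-against-chromosome jobs that can be handed to different machines. The sketch below assumes that reading. The function names are hypothetical, and the round-robin assignment stands in for whatever scheduler a real cluster would use.

```python
from itertools import product
from typing import List, Sequence, Tuple

Job = Tuple[str, str]


def pairwise_jobs(genome_a: Sequence[str], genome_b: Sequence[str]) -> List[Job]:
    """Every chromosome of genome A paired with every chromosome of genome B."""
    return list(product(genome_a, genome_b))


def assign_round_robin(jobs: Sequence[Job], n_workers: int) -> List[List[Job]]:
    """Deal jobs out to workers one at a time, like dealing cards."""
    buckets: List[List[Job]] = [[] for _ in range(n_workers)]
    for index, job in enumerate(jobs):
        buckets[index % n_workers].append(job)
    return buckets


chroms = ["chr1", "chr2", "chr3", "chr4"]
jobs = pairwise_jobs(chroms, chroms)
workers = assign_round_robin(jobs, n_workers=4)
print(len(jobs))                  # 16
print([len(w) for w in workers])  # [4, 4, 4, 4]
```

Because no pair depends on the result of another pair, the jobs can run in any order on any machine, which is what makes this kind of task a good fit for a cluster.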
Cluster Caveats Sometimes not suited to certain jobs – Essentially those without component parts – Some modeling (e.g. MCMC) Complex – More moving parts = more to break
Clusters in the Cloud You have a big, complicated job You need many computers for a job You need to run job infrequently You don’t have massive computer resources
The Cloud as a service Alternative meaning of The Cloud Essentially web-powered software “Galaxy” is one such service
Galaxy Very powerful analyses Relatively simple to use Repeatable Understandable Extendable
Galaxy – Basic services Convert fastq to fasta Summarize fastq reads Fasta + Qual to Fastq Trim fastq reads Merge data sets Convert SFF
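To give a sense of what a basic service such as “Convert fastq to fasta” does, here is a minimal sketch. It is not Galaxy’s implementation. It assumes strict four-line FASTQ records with no blank lines, and the function name is made up for illustration.

```python
from typing import Iterable, Iterator, List


def fastq_to_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines (header, sequence) from four-line FASTQ records."""
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if len(record) == 4:
            header, sequence, separator, _quality = record
            if not header.startswith("@") or not separator.startswith("+"):
                raise ValueError(f"malformed FASTQ record: {header!r}")
            yield ">" + header[1:]  # FASTA headers start with '>' instead of '@'
            yield sequence          # quality scores are dropped
            record = []
    if record:
        raise ValueError("truncated FASTQ record at end of input")


example = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print(list(fastq_to_fasta(example)))  # ['>read1', 'ACGT']
```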
Galaxy – Advanced services Intersect genomic regions Merge genomic regions Map with bowtie Map with bwa Use bwa to identify variants Convert genome coordinates
Actual example Finding “missing” genes – You have a genome sequence – You have gene annotation (i.e., RefSeq) – You have aligned mRNA data – You want to know where these do not overlap
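The “do not overlap” step amounts to an interval comparison. The sketch below assumes half-open (start, end) coordinates and reports aligned mRNA regions that touch no annotated gene. It checks one direction only, compares every pair (fine for a toy example, too slow for a genome), and is not how Galaxy performs the operation. The `Region` type and function names are hypothetical.

```python
from typing import List, NamedTuple, Sequence


class Region(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True when both regions are on the same chromosome and share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def unannotated(mrna: Sequence[Region], genes: Sequence[Region]) -> List[Region]:
    """mRNA alignments that overlap no annotated gene."""
    return [m for m in mrna if not any(overlaps(m, g) for g in genes)]


genes = [Region("chr1", 100, 500), Region("chr1", 900, 1200)]
mrna = [Region("chr1", 150, 300), Region("chr1", 600, 800), Region("chr2", 100, 200)]
for region in unannotated(mrna, genes):
    print(region.chrom, region.start, region.end)
```

Here the chr1 alignment at 600 to 800 and the chr2 alignment are reported as candidates for “missing” genes, since neither touches an annotated region.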
Galaxy is very flexible Runs locally Runs on network Runs on cluster Runs in cloud Runs on cluster in cloud
Galaxy has some pre-requisites You know what you want to do You generally know how to do it You know what the data are that you need You know how to ensure the results are correct Galaxy abstracts away the complexity of the implementation steps