1
Scalable systems
7
Need for scaling
Mapping a sequence read to the human genome
50 characters to 3 billion characters
with sequencing errors
while handling substitutions, InDels
taking read quality into account
for a billion sequence reads
to compare dozens of samples
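To make the cost of the mapping problem concrete, here is a minimal sketch of the most naive approach: slide the read along the reference and keep every offset within a mismatch budget. The names `hamming_within` and `map_read` and the toy sequences are assumptions for illustration. The sketch handles substitutions only, not InDels or read quality, and it is not the method of any particular mapper.

```python
def hamming_within(read, window, max_mismatches):
    """Return the mismatch count, or None once it exceeds max_mismatches."""
    mismatches = 0
    for a, b in zip(read, window):
        if a != b:
            mismatches += 1
            if mismatches > max_mismatches:
                return None  # give up early: this offset cannot be a hit
    return mismatches


def map_read(read, reference, max_mismatches=2):
    """Naively scan every reference offset; return (position, mismatches) hits."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        d = hamming_within(read, window, max_mismatches)
        if d is not None:
            hits.append((pos, d))
    return hits


# Toy stand-ins (assumptions): a real reference has about 3 billion characters
# and a real read about 50.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "TAGCCGTT"  # differs from reference[8:16] by one substitution

hits = map_read(read, reference, max_mismatches=1)
print(hits)
```

At the sizes quoted on the slide, this scan would visit roughly 3 billion offsets for each of a billion reads, on the order of 3 x 10^18 window checks per sample. That is why the rest of the deck turns to indexing, GPUs and distributed computing.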
8
Using GPUs
Computing power vs heterogeneity and power requirements
9
Towards open standards
CUDA, OpenVIDIA
10
First public servers built around GPUs
11
MapReduce
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
Processes input key/value pair
Produces set of intermediate pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
Combines all intermediate values for a particular key
Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages
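The two signatures above can be illustrated with a small single-process sketch. The driver `run_mapreduce`, the 3-mer counting task and the toy reads are assumptions chosen for illustration. A real framework would distribute the map, grouping and reduce phases across many machines, which this sketch does not attempt.

```python
from collections import defaultdict


def map_fn(in_key, in_value):
    """map(in_key, in_value) -> list of (out_key, intermediate_value).

    Emits (k-mer, 1) for every 3-character substring of the read.
    """
    k = 3
    return [(in_value[i:i + k], 1) for i in range(len(in_value) - k + 1)]


def reduce_fn(out_key, intermediate_values):
    """reduce(out_key, list(intermediate_value)) -> list(out_value).

    Combines all counts for one k-mer into a single total.
    """
    return [sum(intermediate_values)]


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    grouped = defaultdict(list)
    for in_key, in_value in inputs:
        for out_key, value in map_fn(in_key, in_value):
            grouped[out_key].append(value)
    return {key: reduce_fn(key, values) for key, values in sorted(grouped.items())}


# Toy input (assumption): key = read name, value = read sequence.
reads = [("read1", "ACGTAC"), ("read2", "GTACGT")]
counts = run_mapreduce(reads, map_fn, reduce_fn)
print(counts)
```

Because `map_fn` looks at one input pair at a time and `reduce_fn` at one key at a time, both phases can run in parallel without shared state, which is the property the distributed implementations rely on.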
12
Parallel RMAP
13
Genome Analysis Toolkit (GATK)
Walkers as a reference implementation
Standardization of workflows, reproducibility, documentation
17
Hadoop
Apache Software Foundation
Distributed File System (HDFS)
Open MapReduce implementation
18
Cloud computing: handling peak usage
s.org/
EC2, Eucalyptus, Cloud BioLinux
Ownership
Cost
Cloud vs Grid
Private clouds
Cloud on a Chip
19
Times Machine
405,000 TIFF files
3.3 million articles
XML files with coordinates
AWS/Hadoop processing in less than 36 hours
Welcome. TimesMachine can take you back to any issue from Volume 1, Number 1 of The New-York Daily Times, on September 18, 1851, through The New York Times of December 30, Choose a date in history and flip electronically through the pages, displayed with their original look and feel.
This all adds up to terabytes of data, in a less-than-web-friendly format. So reusing the EC2/S3/Hadoop method I discussed back in November, I got to work writing a few lines of code. Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 XML files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files, all of it ready to be assembled into a TimesMachine. By leveraging the power of AWS and Hadoop, we were able to utilize hundreds of machines concurrently and process all the data in less than 36 hours.
20
Crossbow: local vs external
21
Crossbow: bootstrap approach
22
Ergatis: workflow creation system
23
Ergatis: monitoring interface
24
Stand-alone servers built on Ergatis
25
CloVR: shrink-wrapped workflows on a cloud system
26
CloVR
27
Do-it-yourself