Scalable systems.


1 Scalable systems

2-7 Need for scaling
Mapping a sequence read to the human genome:
50 characters to 3 billion characters
with sequencing errors
while handling substitutions and InDels
taking read quality into account
for a billion sequence reads
to compare dozens of samples
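To make the scale concrete, the brute-force approach below scans every position of the reference for every read; it is a minimal illustrative sketch (toy strings, exact matching only), not how real mappers work, and it already ignores errors, InDels and base quality.

    # Naive exact matching of one read against a reference string.
    # Real mappers use indexed, error-tolerant algorithms instead
    # (e.g. seed-and-extend over a suffix array or FM-index).
    def naive_map(read, genome):
        hits = []
        for pos in range(len(genome) - len(read) + 1):  # ~3 billion positions for the human genome
            if genome[pos:pos + len(read)] == read:
                hits.append(pos)
        return hits

    print(naive_map("GATTACA", "TTGATTACACGATTACA"))  # [2, 10]

One 50-character read against 3 billion positions is on the order of 10^11 character comparisons in the worst case; multiplied by a billion reads and dozens of samples, this is what forces indexed data structures and distributed execution.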

8 Using GPUs
Computing power vs. heterogeneity and power requirements

9 CUDA, OpenVIDIA
Towards open standards

10 First public servers built around GPUs

11 MapReduce
Input & output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages
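As a concrete illustration of the two functions, here is a minimal single-machine word-count sketch in Python (the helper names such as run_mapreduce are invented for illustration; a real framework distributes the map tasks, shuffles by key, and distributes the reduce tasks):

    from collections import defaultdict

    def map_fn(in_key, in_value):
        # map: one input key/value pair -> list of (intermediate_key, intermediate_value)
        return [(word, 1) for word in in_value.split()]

    def reduce_fn(out_key, intermediate_values):
        # reduce: an intermediate key plus all of its values -> list of output values
        return [sum(intermediate_values)]

    def run_mapreduce(inputs, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for in_key, in_value in inputs:                     # map phase
            for out_key, value in map_fn(in_key, in_value):
                grouped[out_key].append(value)              # "shuffle": group by intermediate key
        return {key: reduce_fn(key, values)                 # reduce phase
                for key, values in grouped.items()}

    print(run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn))
    # {'to': [2], 'be': [2], 'or': [1], 'not': [1]}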

12 Parallel RMAP

13-16 Genome Analysis Toolkit (GATK)
Walkers as a reference implementation
Standardization of workflows, reproducibility, documentation
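The walker idea is that the engine owns traversal, data access and parallelization, while the analysis supplies only per-locus (or per-read) callbacks. The Python sketch below shows that pattern in the abstract; it is not GATK's actual Java walker API, and the class and function names are invented for illustration.

    class CoverageWalker:
        # Conceptual walker: the engine calls map() at each position it visits
        # and folds the results together with reduce().
        def map(self, locus, pileup):
            return len(pileup)            # e.g. depth of coverage at this locus

        def reduce(self, value, accumulator):
            return accumulator + value    # running total across loci

    def traverse(walker, loci_with_pileups, initial=0):
        # The "engine": iterates the data and drives the walker's callbacks.
        acc = initial
        for locus, pileup in loci_with_pileups:
            acc = walker.reduce(walker.map(locus, pileup), acc)
        return acc

    print(traverse(CoverageWalker(), [(1, ["r1", "r2"]), (2, ["r2"]), (3, [])]))  # 3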

17 Hadoop
Apache Software Foundation
Hadoop Distributed File System (HDFS)
Open MapReduce implementation
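With Hadoop Streaming, the map and reduce steps can be ordinary scripts that read lines on stdin and write tab-separated key/value lines on stdout. A minimal word-count pair of scripts might look like the sketch below (two separate files; the names are illustrative, and Hadoop sorts the mapper output by key before it reaches the reducer):

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- keys arrive grouped, so totals can be accumulated per word
    import sys
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(current_word + "\t" + str(total))
            total = 0
        current_word = word
        total += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(total))

The job would then be submitted with something like hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, with the exact jar path and options depending on the installation.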

18 Cloud computing: handling peak usage
EC2, Eucalyptus
Cloud BioLinux
Ownership, cost, private clouds
Cloud vs. grid
Cloud on a chip
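Handling a peak then comes down to requesting extra instances for the burst and releasing them afterwards. The sketch below uses today's boto3 SDK purely as an illustration (the presentation predates it); the region, AMI ID, instance type and counts are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Request extra worker instances for the peak (placeholder AMI, e.g. a Cloud BioLinux image).
    response = ec2.run_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=10,
    )
    instance_ids = [i["InstanceId"] for i in response["Instances"]]

    # ... run the burst of work on the new instances ...

    # Release the capacity once the peak has passed.
    ec2.terminate_instances(InstanceIds=instance_ids)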

19 TimesMachine: 405,000 TIFF files, 3.3 million articles
Welcome. TimesMachine can take you back to any issue from Volume 1, Number 1 of The New-York Daily Times, on September 18, 1851, through The New York Times of December 30. Choose a date in history and flip electronically through the pages, displayed with their original look and feel. This all adds up to terabytes of data, in a less-than-web-friendly format. So, reusing the EC2/S3/Hadoop method I discussed back in November, I got to work writing a few lines of code. Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 XML files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files, all of it ready to be assembled into a TimesMachine. By leveraging the power of AWS and Hadoop, we were able to use hundreds of machines concurrently and process all the data in less than 36 hours.
405,000 TIFF files
3.3 million articles in SGML
405,000 XML files mapping articles to page coordinates
AWS/Hadoop processing in under 36 hours
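The per-file work in such a pipeline is simple; the value of AWS and Hadoop is running it on hundreds of machines at once. As a rough sketch of the image-conversion step only (using Pillow, with made-up file names and thumbnail size; not the Times' actual code):

    from PIL import Image

    def tiff_to_png(tiff_path, png_path, thumb_path, thumb_width=300):
        # Convert one scanned TIFF page to a full-size PNG plus a small thumbnail.
        with Image.open(tiff_path) as img:
            img.save(png_path, "PNG")
            scale = thumb_width / float(img.width)
            img.resize((thumb_width, int(img.height * scale))).save(thumb_path, "PNG")

    tiff_to_png("page-1851-09-18-001.tif", "page.png", "page-thumb.png")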

20 Crossbow
Local vs. external

21 Crossbow
Bootstrap approach

22 Ergatis: workflow creation system

23 Ergatis: monitoring interface

24 Stand-alone servers built on Ergatis

25 CloVR
Shrink-wrapped workflows on a cloud system

26 CloVR

27 Do-it-yourself

