Scalable systems.


1 Scalable systems

2-7 Need for scaling
Mapping a sequence read to the human genome:
50 characters to 3 billion characters
with sequencing errors
while handling substitutions and InDels
taking read quality into account
for a billion sequence reads
to compare dozens of samples
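To make the scale concrete, the brute-force approach below scans every position of the reference for every read; it is a minimal illustrative sketch (toy strings, exact matching only), not how real mappers work, and it already ignores errors, InDels and base quality.

    # Naive exact matching of one read against a reference string.
    # Real mappers use indexed, error-tolerant algorithms instead
    # (e.g. seed-and-extend over a suffix array or FM-index).
    def naive_map(read, genome):
        hits = []
        for pos in range(len(genome) - len(read) + 1):  # ~3 billion positions for the human genome
            if genome[pos:pos + len(read)] == read:
                hits.append(pos)
        return hits

    print(naive_map("GATTACA", "TTGATTACACGATTACA"))  # [2, 10]

One 50-character read against 3 billion positions is on the order of 10^11 character comparisons in the worst case; multiplied by a billion reads and dozens of samples, this is what forces indexed data structures and distributed execution.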

8 Using GPUs
Computing power vs. heterogeneity and power requirements

9 CUDA, OpenVIDIA
Towards open standards

10 First public servers built around GPUs

11 MapReduce
Input & output: each a set of key/value pairs
Programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_value)
  Combines all intermediate values for a particular key
  Produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other languages
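As a concrete illustration of the two functions, here is a minimal single-machine word-count sketch in Python (the helper names such as run_mapreduce are invented for illustration; a real framework distributes the map tasks, shuffles by key, and distributes the reduce tasks):

    from collections import defaultdict

    def map_fn(in_key, in_value):
        # map: one input key/value pair -> list of (intermediate_key, intermediate_value)
        return [(word, 1) for word in in_value.split()]

    def reduce_fn(out_key, intermediate_values):
        # reduce: an intermediate key plus all of its values -> list of output values
        return [sum(intermediate_values)]

    def run_mapreduce(inputs, map_fn, reduce_fn):
        grouped = defaultdict(list)
        for in_key, in_value in inputs:                     # map phase
            for out_key, value in map_fn(in_key, in_value):
                grouped[out_key].append(value)              # "shuffle": group by intermediate key
        return {key: reduce_fn(key, values)                 # reduce phase
                for key, values in grouped.items()}

    print(run_mapreduce([("doc1", "to be or not to be")], map_fn, reduce_fn))
    # {'to': [2], 'be': [2], 'or': [1], 'not': [1]}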

12 Parallel RMAP

13-16 Genome Analysis Toolkit (GATK)
Walkers as a reference implementation
Standardization of workflows, reproducibility, documentation
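The walker idea is that the engine owns traversal, data access and parallelization, while the analysis supplies only per-locus (or per-read) callbacks. The Python sketch below shows that pattern in the abstract; it is not GATK's actual Java walker API, and the class and function names are invented for illustration.

    class CoverageWalker:
        # Conceptual walker: the engine calls map() at each position it visits
        # and folds the results together with reduce().
        def map(self, locus, pileup):
            return len(pileup)            # e.g. depth of coverage at this locus

        def reduce(self, value, accumulator):
            return accumulator + value    # running total across loci

    def traverse(walker, loci_with_pileups, initial=0):
        # The "engine": iterates the data and drives the walker's callbacks.
        acc = initial
        for locus, pileup in loci_with_pileups:
            acc = walker.reduce(walker.map(locus, pileup), acc)
        return acc

    print(traverse(CoverageWalker(), [(1, ["r1", "r2"]), (2, ["r2"]), (3, [])]))  # 3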

17 Hadoop
Apache Software Foundation
Hadoop Distributed File System (HDFS)
Open MapReduce implementation
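With Hadoop Streaming, the map and reduce steps can be ordinary scripts that read lines on stdin and write tab-separated key/value lines on stdout. A minimal word-count pair of scripts might look like the sketch below (two separate files; the names are illustrative, and Hadoop sorts the mapper output by key before it reaches the reducer):

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    #!/usr/bin/env python3
    # reducer.py -- keys arrive grouped, so totals can be accumulated per word
    import sys
    current_word, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current_word and current_word is not None:
            print(current_word + "\t" + str(total))
            total = 0
        current_word = word
        total += int(count)
    if current_word is not None:
        print(current_word + "\t" + str(total))

The job would then be submitted with something like hadoop jar hadoop-streaming.jar -input in/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py, with the exact jar path and options depending on the installation.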

18 Cloud computing: handling peak usage
EC2, Eucalyptus
Cloud BioLinux
Ownership, cost, private clouds
Cloud vs. grid
Cloud on a chip
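Handling a peak then comes down to requesting extra instances for the burst and releasing them afterwards. The sketch below uses today's boto3 SDK purely as an illustration (the presentation predates it); the region, AMI ID, instance type and counts are placeholders.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Request extra worker instances for the peak (placeholder AMI, e.g. a Cloud BioLinux image).
    response = ec2.run_instances(
        ImageId="ami-xxxxxxxx",
        InstanceType="m5.large",
        MinCount=1,
        MaxCount=10,
    )
    instance_ids = [i["InstanceId"] for i in response["Instances"]]

    # ... run the burst of work on the new instances ...

    # Release the capacity once the peak has passed.
    ec2.terminate_instances(InstanceIds=instance_ids)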

19 TimesMachine: 405,000 TIFF files, 3.3 million articles
Welcome. TimesMachine can take you back to any issue from Volume 1, Number 1 of The New-York Daily Times, on September 18, 1851, through The New York Times of December 30. Choose a date in history and flip electronically through the pages, displayed with their original look and feel. This all adds up to terabytes of data, in a less-than-web-friendly format. So, reusing the EC2/S3/Hadoop method I discussed back in November, I got to work writing a few lines of code. Using Amazon Web Services, Hadoop and our own code, we ingested 405,000 very large TIFF images, 3.3 million articles in SGML and 405,000 XML files mapping articles to rectangular regions in the TIFFs. This data was converted to a more web-friendly 810,000 PNG images (thumbnails and full images) and 405,000 JavaScript files, all of it ready to be assembled into a TimesMachine. By leveraging the power of AWS and Hadoop, we were able to use hundreds of machines concurrently and process all the data in less than 36 hours.
405,000 TIFF files
3.3 million articles in SGML
405,000 XML files mapping articles to page coordinates
AWS/Hadoop processing in under 36 hours
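The per-file work in such a pipeline is simple; the value of AWS and Hadoop is running it on hundreds of machines at once. As a rough sketch of the image-conversion step only (using Pillow, with made-up file names and thumbnail size; not the Times' actual code):

    from PIL import Image

    def tiff_to_png(tiff_path, png_path, thumb_path, thumb_width=300):
        # Convert one scanned TIFF page to a full-size PNG plus a small thumbnail.
        with Image.open(tiff_path) as img:
            img.save(png_path, "PNG")
            scale = thumb_width / float(img.width)
            img.resize((thumb_width, int(img.height * scale))).save(thumb_path, "PNG")

    tiff_to_png("page-1851-09-18-001.tif", "page.png", "page-thumb.png")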

20 Crossbow
Local vs. external

21 Crossbow
Bootstrap approach

22 Ergatis: workflow creation system

23 Ergatis: monitoring interface

24 Stand-alone servers built on Ergatis

25 CloVR
Shrink-wrapped workflows on a cloud system

26 CloVR

27 Do-it-yourself

