Canadian Bioinformatics Workshops www.bioinformatics.ca.

Slides:



Advertisements
Similar presentations
1 CS222: Principles of Database Management Fall 2010 Professor Chen Li Department of Computer Science University of California, Irvine Notes 01.
Advertisements

Secondary Storage Management Hank Levy. 8/7/20152 Secondary Storage • Secondary Storage is usually: –anything outside of “primary memory” –storage that.
11:15:01 Storage device. Computer memory Primary storage 11:15:01.
Block1 Wrapping Your Nugget Around Distributed Processing.
Lecture 3 Page 1 CS 111 Online Disk Drives An especially important and complex form of I/O device Still the primary method of providing stable storage.
Introduction: Memory Management 2 Ideally programmers want memory that is large fast non volatile Memory hierarchy small amount of fast, expensive memory.
Silberschatz, Galvin and Gagne  2002 Modified for CSCI 399, Royden, Operating System Concepts Operating Systems Lecture 4 Computer Systems Review.
Galaxy Community Conference July 27, 2012 The National Center for Genome Analysis Support and Galaxy William K. Barnett, Ph.D. (Director) Richard LeDuc,
Using Deduplicating Storage for Efficient Disk Image Deployment Xing Lin, Mike Hibler, Eric Eide, Robert Ricci University of Utah.
Canadian Bioinformatics Workshops
Canadian Bioinformatics Workshops
Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Chapter 12: File System Implementation.
File organization Secondary Storage Devices Lec#7 Presenter: Dr Emad Nabil.
Information Technology (IT). Information Technology – technology used to create, store, exchange, and use information in its various forms (business data,
Lecture 2. A Computer System for Labs
Getting the Most out of Scientific Computing Resources
CMSC 611: Advanced Computer Architecture
STORAGE DEVICES Towards the end of this unit you will be able to identify the type of storage devices and their storage capacity.
Getting the Most out of Scientific Computing Resources
Operating System (013022) Dr. H. Iwidat
Activity 1 6 minutes Research Activity: What is RAM? What is ROM?
Lecture 16: Data Storage Wednesday, November 6, 2006.
Ramya Kandasamy CS 147 Section 3
How will execution time grow with SIZE?
Kay Ousterhout, Christopher Canel, Sylvia Ratnasamy, Scott Shenker
Swapping Segmented paging allows us to have non-contiguous allocations
IT 344: Operating Systems Winter 2008 Module 13 Secondary Storage
Chapter III, Desktop Imaging Systems and Issues: Lesson II Storing Image Data
Lecture 11: DMBS Internals
Data Storage In today’s lesson we will look at:
CSE 451: Operating Systems Spring 2006 Module 18 Redundant Arrays of Inexpensive Disks (RAID) John Zahorjan Allen Center.
RAID Disk Arrays Hank Levy 1.
STORAGE DEVICES Towards the end of this unit you will be able to identify the type of storage devices and their storage capacity.
Lecture 14 Virtual Memory and the Alpha Memory Hierarchy
Data Structures and Algorithms
ICOM 6005 – Database Management Systems Design
Memory and cache CPU Memory I/O.
Advanced Topics in Data Management
Building a Database on S3
RAID Disk Arrays Hank Levy 1.
CSE 451: Operating Systems Spring 2005 Module 17 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
CSE 451: Operating Systems Autumn 2010 Module 19 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
CSE 451: Operating Systems Winter 2009 Module 13 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Mark Zbikowski Gary Kimura 1.
Types of Computers Mainframe/Server
Cooperative Caching, Simplified
CSE 451: Operating Systems Winter 2007 Module 13 Secondary Storage
Mark Zbikowski and Gary Kimura
CSE 451: Operating Systems Autumn 2004 Redundant Arrays of Inexpensive Disks (RAID) Hank Levy 1.
1.1 The Characteristics of Contemporary Processors, Input, Output and Storage Devices Types of Processors.
Secondary Storage Management Brian Bershad
Persistence: hard disk drive
CSE 451: Operating Systems Secondary Storage
CSE 451: Operating Systems Winter 2012 Redundant Arrays of Inexpensive Disks (RAID) and OS structure Mark Zbikowski Gary Kimura 1.
Chapter 2: Operating-System Structures
2.C Memory GCSE Computing Langley Park School for Boys.
Introduction to Operating Systems
CSE 451: Operating Systems Winter 2007 Module 18 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
RAID Disk Arrays Hank Levy 1.
Chapter 14: File-System Implementation
CSE 451: Operating Systems Spring 2005 Module 13 Secondary Storage
Secondary Storage Management Hank Levy
CSE 451: Operating Systems Autumn 2004 Secondary Storage
CSE 451: Operating Systems Winter 2004 Module 13 Secondary Storage
Chapter Five Large and Fast: Exploiting Memory Hierarchy
Year 10 Computer Science Hardware - CPU and RAM.
CSE 451: Operating Systems Winter 2004 Module 17 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
IT 344: Operating Systems Winter 2007 Module 18 Redundant Arrays of Inexpensive Disks (RAID) Chia-Chi Teng CTB
Chapter 2: Operating-System Structures
CSE 332: Data Abstractions Memory Hierarchy
CSE 451: Operating Systems Winter 2006 Module 18 Redundant Arrays of Inexpensive Disks (RAID) Ed Lazowska Allen Center 570.
Presentation transcript:

Canadian Bioinformatics Workshops

Data Management Asim Siddiqui Bioinformatics Workshop Next Generation Sequencing 26 th July 2008

It’s a new world Next gen sequencers generated Gb of data Architecture matters

Why should you care? Architecture matters –If you are choosing the compute resources for your lab, designing the correct system architecture is important to get the most use out of the systems –If you are using the compute resources of your lab, understanding the system architecture will help you get the most out of the system

What you won’t learn. This lecture will provide you with a basic understanding of computer systems, but.... Consult your local expert If you are the local expert, make friends at a large genome centre

The basics CPU Disk space RAM Bandwidth

CPU The speed of your CPU determines how quickly it can process instructions Many bioinformatics operations fall into the “embarrassingly parallel” category Getting a results faster is as simple as adding more CPUs => clusters

How many CPUs? For current throughputs, you will need ~8 CPUs per sequencer to handle data rates 8 way boxes are relatively easy to come by.

RAM RAM is fast storage close to the CPU By loading data from disk to RAM, the CPU can execute instructions much more rapidly

How much RAM? Typical sizing is 2GB of RAM per CPU This works fine for most aligners Assemblers typically need much more RAM If you don’t have enough RAM, the CPU will need to make use of the disk storage – When a computer has run out of RAM it is said to be “swapping”

Disk space Unlike RAM, information is retained after the machine is switched off Speed of access is slower than RAM Magnetic disks have a seek time and read time –Read data from a block is faster than seeking all over the disk to get the data Can be RAIDed to improve performance

How much disk space? An Illumina GA2 generates 5.35 GB per run (3 days) Including quality values and additional files results in 60GB per run Each machine will generate 7.3TB of data per year –Plus you will need to ~double that to store alignments and other data derived from the reads themselves Scaling

Space, or lack thereof, is the problem Disk space is probably the biggest problem today 60GB is for sequence data and quality values Should you store images?

To store or not to store? Storing images or their derivative, intensity values, is probably the biggest question at this time –SOliD/Illumina generate >1TB per run –Helicos ~50TB per run Fridge vs. amortization value of machine time

Understanding bandwidth Bandwidth of a connection represents the maximum rate of transfer between two points Most commonly, we think of network bandwidth, but there is also bandwidth –between the disk and CPU –between a RAID array and the CPU –between the RAM and the CPU

More bandwidth Depending on the algorithm, the CPU will process data at a particular rate The trick is always max out the CPU’s utilization Bandwidth to the CPU must be >= the rate at which the CPU can process data

Even more bandwidth E.g. An aligner can process X reads per second on a single CPU at a data rate of Y bytes/sec –~200 million reads in 10 hours –Each read 50 bases at 10 bytes per base –2.7 MB/sec Design 100TB storage and connect it to a CPU resource Design bandwidth to be 10 Mb/sec – plenty of spare bandwidth Now we want to complete the job in an hour and get permission to buy 10 more CPUs – great!

Not so great For the 10 CPUs to run at maximum speed, they need to be supplied data at 27MB/sec Our bandwidth is 10MB/sec Therefore, no matter how many CPUs we buy, the job will never run faster than ~ 2.5 hours

Balancing it all The best balance of compute resources is application dependant Aligners may require a different balance than assembler (or other algorithms) Decisions –If limited resource, design system to deal with the biggest bottleneck –Ideally, have different systems for different parts of the pipeline

Backing it up Where to start.... –Who’s backing up their data today? Back up to active disk is probably the easiest Backup to tape expensive –Person time –Need to refresh tapes –Slow Active disk can’t be taken offsite No great solutions out there... Except perhaps SEP machine...

Clouds Clouds, such as Amazon EC2, are emerging as a viable alternative to owning your own resources –Remote disk storage –Remote CPU –Pay as you go This could be a good option for a small lab

LIMS Generating all that data is no good if you don’t know where you’ve put it LIMS provide a mechanism for keeping track of all of the data Metadata (i.e. data about data) related to the experiments is stored in a database The database stores the location of the data files Tracking may be paper based or bar code based Essential for a centre running lots of expts

LIMS and other s/w Build in-house and off-the-shelf –Commercial or free solutions Is there a solution that meets your needs? –If missing some requirements, will the company/lab modify their s/w for your needs? How do you want to spend your time?

Emerging Standards Sequence Alignment Best practises Why are standards important? –Hint: wouldn’t you just like to focus on the science? “If you build it, will they come?”