
Canadian Bioinformatics Workshops


Module 5: Data Management

It's a new world
- Next-gen sequencers generate gigabytes (terabytes?) of data
- Architecture matters

Why should you care?
- Architecture matters
  - If you are choosing the compute resources for your lab, designing the correct system architecture is important to get the most use out of the systems
  - If you are using the compute resources of your lab, understanding the system architecture will help you get the most out of them
- Different algorithms are optimized for different platforms

Disk Capacity vs Sequencing Capacity
[Figure: log-scale plot of Disk Storage (Mbytes/$) and DNA Sequencing (bp/$) over time. Hard disk storage (MB/$): doubling time 14 months. Pre-nextgen sequencing (bp/$): doubling time 19 months. Nextgen sequencing (bp/$): doubling time 4 months. From Stein, 2010, http://dx.doi.org/ /gb]

What you won't learn
- This lecture will provide you with a basic understanding of computer systems, but...
- Consult your local expert
- If you are the local expert, make friends at a large genome centre

The basics
- CPU
- Disk space
- RAM
- Bandwidth

CPU
- The speed of your CPU determines how quickly it can process instructions
- Many bioinformatics operations fall into the "embarrassingly parallel" category (see the sketch below)
- Getting a result faster is as simple as adding more CPUs => clusters
- Though clusters still dominate, increasing attention is being paid to servers with a large number of cores
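A minimal sketch of the "embarrassingly parallel" idea, assuming the reads have already been split into chunks; align_chunk and the file names are hypothetical stand-ins for a real aligner invocation:

```python
# Each chunk of reads is independent, so extra CPUs give near-linear speedup.
from multiprocessing import Pool

def align_chunk(chunk_path):
    # Stand-in for running an aligner on one chunk of reads;
    # in practice this would invoke a command-line aligner.
    return f"{chunk_path}.aligned"

chunks = [f"reads_chunk_{i}.fastq" for i in range(8)]   # hypothetical chunk files

if __name__ == "__main__":
    with Pool(processes=8) as pool:      # one worker per core/CPU
        results = pool.map(align_chunk, chunks)
    print(results)
```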

How many CPUs?
- For current throughputs, you will need ~8 CPU nodes per sequencer to handle data rates
- 8-way boxes are relatively easy to come by

RAM
- RAM is fast storage close to the CPU
- By loading data from disk into RAM, the CPU can execute instructions much more rapidly

How much RAM?
- Typical sizing is 2 GB of RAM per core (worked out below)
- This works fine for most aligners
  - Certain aligners require a minimum number of cores for optimal efficiency
  - 16-48 GB of RAM per node
- Some aligners have a minimum memory limit for optimal performance
- Assemblers typically need much more RAM
  - Human de novo assembly requires lots of RAM!
- If you don't have enough RAM, the CPU will need to make use of disk storage
  - When a computer has run out of RAM it is said to be "swapping"
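A back-of-envelope sizing under the 2 GB-per-core rule of thumb above; the 16-core node is an illustrative assumption, not a recommendation:

```python
# RAM sizing under the rule of thumb on this slide (2 GB per core).
cores_per_node = 16     # assumed node size, for illustration only
gb_per_core = 2

ram_per_node_gb = cores_per_node * gb_per_core
print(f"{cores_per_node}-core node -> {ram_per_node_gb} GB RAM")   # 32 GB

# A de novo assembler can need far more than this rule of thumb suggests,
# so assembly nodes should be sized separately.
```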

Disk space
- Unlike RAM, information is retained after the machine is switched off
- Speed of access is slower than RAM
- Magnetic disks have a seek time and a read time
  - Reading data from a single block is faster than seeking all over the disk to get the data (see the sketch below)
- Disks can be RAIDed to improve performance
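A rough way to see the seek-versus-sequential effect for yourself; the file name and sizes are arbitrary, and OS caching makes the absolute numbers unreliable, so treat this only as a sketch:

```python
# Read the same blocks in order vs. in random order (many seeks).
import os, random, time

PATH = "scratch.bin"
BLOCK = 4096
N_BLOCKS = 25_000                       # ~100 MB test file

with open(PATH, "wb") as f:             # create the test file
    f.write(os.urandom(BLOCK * N_BLOCKS))

def timed(label, offsets):
    start = time.time()
    with open(PATH, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(BLOCK)
    print(f"{label}: {time.time() - start:.2f} s")

sequential = [i * BLOCK for i in range(N_BLOCKS)]
scattered = sequential[:]
random.shuffle(scattered)               # same blocks, random order

timed("sequential", sequential)
timed("random", scattered)

os.remove(PATH)
```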

How much disk space?
- Sequencers generate a lot of data (example estimate below)
- Quality values and additional files (alignments etc.) act as a multiplier
- Binary formats such as BAM are helping, but there are limited binary formats for basic data
  - Life and Illumina are both moving to binary formats for basic data
- Debates on what needs to be stored
- Scaling
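A back-of-envelope estimate of per-run storage; every number here (reads per run, read length, bytes per base, multiplier for downstream files) is an illustrative assumption:

```python
# Rough per-run storage estimate, uncompressed.
reads_per_run = 200e6
read_length = 50              # bases per read
bytes_per_base = 2            # base character + quality character, assumed
downstream_multiplier = 3     # alignments, indexes, intermediates, assumed

raw_gb = reads_per_run * read_length * bytes_per_base / 1e9
total_gb = raw_gb * downstream_multiplier
print(f"raw: {raw_gb:.0f} GB, with downstream files: {total_gb:.0f} GB")
```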

Space, or lack thereof, is the problem
- Disk space remains the biggest problem
- Next: an evolution in the data we keep is required
- Lots of data, but how much information?
- The key findings from resequencing experiments are SNVs and SVs
  - Do we need to store the sequences?
- Food for thought... how much data is required depends on the accuracy of the underlying data... the data/information ratio...

Understanding bandwidth
- The bandwidth of a connection is the maximum rate of transfer between two points
- Most commonly we think of network bandwidth, but there is also bandwidth
  - between the disk and the CPU
  - between a RAID array and the CPU
  - between the RAM and the CPU

More bandwidth
- Depending on the algorithm, the CPU will process data at a particular rate
- The trick is to always max out the CPU's utilization
- Bandwidth to the CPU must be >= the rate at which the CPU can process data

Even more bandwidth
- E.g. an aligner can process X reads per second on a single CPU at a data rate of Y bytes/sec
  - ~200 million reads in 10 hours
  - Each read is 50 bases at 10 bytes per base
  - ~2.7 MB/sec (worked out below)
- Design 100 TB of storage and connect it to a CPU resource
- Design the bandwidth to be 10 MB/sec: plenty of spare bandwidth
- Now we want to complete the job in an hour and get permission to buy 10 more CPUs. Great!
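The slide's figures, worked through (all values are the assumptions stated above):

```python
# Data rate needed to keep one CPU busy, using the slide's numbers.
reads = 200e6                  # ~200 million reads
hours = 10
read_length = 50               # bases per read
bytes_per_base = 10

total_bytes = reads * read_length * bytes_per_base           # 100 GB
rate_mb_per_s = total_bytes / (hours * 3600) / 1e6
print(f"required data rate per CPU: {rate_mb_per_s:.1f} MB/sec")   # ~2.8 MB/sec
```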

Not so great
- For the 10 CPUs to run at maximum speed, they need to be supplied with data at 27 MB/sec
- Our bandwidth is 10 MB/sec
- Therefore, no matter how many CPUs we buy, the job will never run faster than ~2.5 hours (see the arithmetic below)
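The same arithmetic extended to show why bandwidth becomes the limit; the figures are taken from the slides, and the wall-clock time is set by whichever resource is slower:

```python
# Bottleneck sketch: CPU-limited vs. bandwidth-limited runtime.
total_gb = 100                 # data to align, from the previous slide
per_cpu_mb_s = 2.7             # rate one CPU can consume data
bandwidth_mb_s = 10            # storage-to-CPU bandwidth
n_cpus = 10                    # the slide's 10 CPUs

cpu_limited_h = total_gb * 1000 / (per_cpu_mb_s * n_cpus) / 3600
io_limited_h = total_gb * 1000 / bandwidth_mb_s / 3600
print(f"CPU-limited: {cpu_limited_h:.1f} h, bandwidth-limited: {io_limited_h:.1f} h")
print(f"actual runtime ~ {max(cpu_limited_h, io_limited_h):.1f} h")
# ~2.8 h, close to the ~2.5 hours quoted on the slide
```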

Balancing it all
- The best balance of compute resources is application dependent
- Aligners may require a different balance than assemblers (or other algorithms)
- Decisions
  - If resources are limited, design the system to deal with the biggest bottleneck
  - Ideally, have different systems for different parts of the pipeline

Backing it up
- Where to start...
  - Who's backing up their data today?
- Backing up to active disk is probably the easiest
- Backup to tape is expensive
  - Person time
  - Need to refresh tapes
  - Slow
- Active disk can't be taken offsite
- No great solutions out there... except perhaps the SEP machine...

Clouds
- Clouds, such as Amazon EC2, are establishing themselves as a viable alternative to owning your own resources
  - Remote disk storage
  - Remote CPU
  - Pay as you go
- There are a growing number of companies and open-source platforms that are using cloud resources

More on clouds
- Clouds have gained a lot of attention in genomics
- The limiting factor is uploading data to the cloud (rough estimate below)
  - Most clouds have a bulk data upload facility
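A rough estimate of why upload is the limiting factor; the data volume and link speed are illustrative assumptions:

```python
# Time to push one run's worth of data over a typical institutional link.
data_tb = 1.0                  # data to move to the cloud, assumed
link_mbit_s = 100              # effective upload speed, assumed

seconds = data_tb * 1e12 * 8 / (link_mbit_s * 1e6)
print(f"{data_tb} TB over {link_mbit_s} Mbit/s ~ {seconds / 86400:.1f} days")   # ~0.9 days
```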

LIMS
- Generating all that data is no good if you don't know where you've put it
- A LIMS provides a mechanism for keeping track of all of the data
- Metadata (i.e. data about data) related to the experiments is stored in a database (sketched below)
- The database stores the location of the data files
- Tracking may be paper based or barcode based
- Essential for a centre running lots of experiments
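A minimal sketch of the LIMS idea using SQLite: a table holding metadata about each run plus the location of the data files. The table, columns, and paths are made up for illustration; a real LIMS is far richer than this.

```python
import sqlite3

conn = sqlite3.connect("lims.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS runs (
        run_id      TEXT PRIMARY KEY,
        sample      TEXT,
        instrument  TEXT,
        run_date    TEXT,
        data_path   TEXT          -- where the sequence files actually live
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO runs VALUES (?, ?, ?, ?, ?)",
    ("RUN0001", "sampleA", "sequencer-1", "2012-07-01", "/data/runs/RUN0001/"),
)
conn.commit()

for row in conn.execute("SELECT run_id, data_path FROM runs"):
    print(row)
conn.close()
```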

LIMS and other software
- Build in-house or buy off-the-shelf
  - Commercial or free solutions
- Is there a solution that meets your needs?
  - If it is missing some requirements, will the company/lab modify their software for your needs?
- How do you want to spend your time?
- LIMS are now being applied to basic bioinformatics tasks as well, e.g. alignment

Pushing the pain down the chain...
- The field needs to simplify basic processing tasks
- Transforming data into information should be easier

We are on a Coffee Break & Networking Session