1
Canadian Bioinformatics Workshops
2
Module #: Title of Module
2
4
You are free to:
Copy, share, adapt, or re-mix;
Photograph, film, or broadcast;
Blog, live-blog, or post video of
this presentation, provided that you attribute the work to its author and respect the rights and licenses associated with its components.
Slide Concept by Cameron Neylon, who has waived all copyright and related or neighbouring rights. This slide only ccZero. Social Media Icons adapted with permission from originals by Christopher Ross. Original images are available under GPL at;
5
Module 1 Intro to Cloud Computing and Virtual Machines
BF Francis Ouellette
Bioinformatics on Big Data: Computing on the Human Genome
September 29 – September 30, 2016
6
Disclaimer I do not (and will not) profit in any way, shape or form, from any of the brands, products or companies I may mention.
7
@bffo #CBWBD16
9
https://bioinformatics
10
Workshops planned for 2017: http://bioinformatics.ca/workshops
Bioinformatics for Cancer Genomics
High-throughput Biology: From Sequence to Networks (CSHL)
Introduction to R
Exploratory Analysis of Biological Data using R
Informatics for RNA-sequence Analysis
Informatics on High Throughput Sequencing Data
Pathway and Network Analysis of -omics Data
Informatics and Statistics for Metabolomics
Analysis of Metagenomic Data
Bioinformatics for Big Data
Epigenomic Data Analysis
(other workshop? Stay tuned)
12
New for CBW: all on GitHub!
13
E-mail: course_info@bioinformatics.ca Web: http://bioinformatics.ca
Workshop announcement mailing list:
14
Soap-Box! Open Source Open Access Open Data Opencourseware
Open Access, Open Data and Open Source are essential for good Science. Openness is a responsibility, an obligation, and something that comes with the privilege of doing publicly funded work.
17
from the National Center for Biotechnology Information
PANIC!
23
Learning Objectives of Module 1
Participants will be introduced to and understand:
The scope of the “Bioinformatics on Big Data: Computing on the Human Genome” workshop
Why we need to be computing in the cloud
What we should be concerned about when doing so
What it means to be working in the cloud
What it means to be using a virtual machine
24
“Big Data” is a relative term!
This is what a 5 MB hard drive looked like in 1956. This is what a 5 TB drive (1 million times more) looks like in 2016.
26
Wikipedia cheat sheet
27
What is driving this data growth?
Technology! 2001 (Whitehead) (Illumina)
28
HiSeq X Sequencing Systems: 18,000 whole human genomes per year
1,800 years to sequence everybody in Canada
1.5 months to sequence all genomes from the PCAWG project
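The arithmetic behind these two figures can be sketched as follows. The only number taken directly from the slide is the throughput of 18,000 genomes per year. The genome counts are back-calculated from the slide's own "1,800 years" and "1.5 months" and are assumptions for illustration, not official census or PCAWG figures.

```python
GENOMES_PER_YEAR = 18_000  # HiSeq X throughput quoted on the slide


def years_to_sequence(n_genomes, genomes_per_year=GENOMES_PER_YEAR):
    """Years needed to sequence n_genomes at a fixed yearly throughput."""
    return n_genomes / genomes_per_year


# Genome counts implied by the slide's figures (assumptions, back-calculated).
implied_canada = 1_800 * GENOMES_PER_YEAR      # genomes behind "1,800 years"
implied_pcawg = 1.5 / 12 * GENOMES_PER_YEAR    # genomes behind "1.5 months"

print(f"Implied genomes for Canada: {implied_canada:,}")
print(f"Implied genomes for PCAWG: {implied_pcawg:,.0f}")
print(f"Years for Canada: {years_to_sequence(implied_canada):,.0f}")
print(f"Months for PCAWG: {years_to_sequence(implied_pcawg) * 12:.1f}")
```

Swapping in a different population or throughput shows how sensitive the "years to sequence everybody" figure is to the number of instruments available.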
29
Cloud computing … and a new software paradigm
Data sets are at the petabyte scale and will soon reach the exabyte scale. Data (and the security rules that come with it) will be somewhere else (not in your own data centre), and you will move your software to it. The software development paradigm will change: no more reading files into RAM, processing them, and then writing output. You need to think about processing streaming data coming from a sequencing machine somewhere on the net.
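A minimal sketch of the streaming style described above, assuming simple four-line FASTQ records. The function names and the toy data are hypothetical. In practice the handle could be a network stream or `sys.stdin` rather than an in-memory string.

```python
import io


def stream_fastq(handle):
    """Yield (name, sequence, quality) one record at a time.

    Assumes well-formed four-line FASTQ records; nothing is buffered
    beyond the record currently being read.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of stream
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip("\n")
        yield header.rstrip("\n")[1:], seq, qual


def gc_fraction(records):
    """Compute the overall GC fraction while keeping only two counters."""
    gc = total = 0
    for _name, seq, _qual in records:
        gc += sum(base in "GCgc" for base in seq)
        total += len(seq)
    return gc / total if total else 0.0


# Toy data standing in for a stream from a remote sequencer.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n")
print(gc_fraction(stream_fastq(demo)))
```

Because `stream_fastq` is a generator, memory use stays constant no matter how many records arrive, which is the point of the paradigm shift.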
30
Disk Capacity vs Sequencing Capacity, 1990-2009
[Figure: Disk storage (MB/$) and DNA sequencing (bp/$) plotted on log scales against year, 1990 to 2012. Hard disk storage (MB/$): doubling time = 14 months. Pre-nextgen sequencing (bp/$): doubling time = 19 months. Nextgen sequencing (bp/$): doubling time = 4 months.]
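To get a feel for what these doubling times mean, a quantity with doubling time T grows by a factor of 2^(t/T) after time t. The sketch below applies that to the three doubling times from the figure. The five-year horizon is an arbitrary assumption chosen for illustration.

```python
def growth_factor(months, doubling_time_months):
    """Factor by which a quantity grows after `months`, given its doubling time."""
    return 2 ** (months / doubling_time_months)


# Doubling times (in months) as read from the figure.
DOUBLING_TIMES = {
    "hard disk storage (MB/$)": 14,
    "pre-nextgen sequencing (bp/$)": 19,
    "nextgen sequencing (bp/$)": 4,
}

HORIZON_MONTHS = 60  # assumption: an arbitrary five-year window

for label, doubling_time in DOUBLING_TIMES.items():
    factor = growth_factor(HORIZON_MONTHS, doubling_time)
    print(f"{label}: x{factor:,.1f} over {HORIZON_MONTHS} months")
```

The gap between the storage and nextgen sequencing factors is why storing the data becomes the bottleneck rather than producing it.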
31
About DNA and computers
We now have the ~$1,000 genome, but now need to think more about the cost of the analysis. The cost of sequencing halves on a time scale in the “many months” range. Storage and network bandwidth double on a time scale in the “very small number of years” range. The doubling time of CPU speed is 18 months. The cost of sequencing a base pair will equal the cost of storing a base pair within the next “very small number” of years.
32
What is the general biomedical scientist to do?
Lots of data
Inadequate IT infrastructure in most labs
Where do they go?
Write more grants?
Get more hardware?
Look to the sky?
33
Genomic companies already there!
Typical sequencing company pipeline: ACGTACGTAAGTTCGGATGGCGTAGTCCCTTTTTGGGGTGTAGTGAGGCGCTGATTCGGAGAG All of the hard work done here!
34
Most people already there!
Google Docs, Dropbox, Netflix, Twitter, Oxford Nanopore, Illumina
36
Amazon Web Services (AWS)
Infinite storage (scalable): S3 (Simple Storage Service)
Compute per hour: EC2 (Elastic Compute Cloud)
Ready when you are
High Performance Computing
Multiple football fields of HPC throughout the world
HPC is expanded one container at a time:
37
Some of the challenges with cloud computing:
Not cheap!
Getting files there (free) and back from there (not free)
Not the best solution for everybody
Standardization
PHI: personal health information and security concerns
It is a US company, so you need to deal with the “Patriot Act”.
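The asymmetry between uploading (free) and downloading (not free) can be sketched with a small cost function. The price per GB below is a made-up placeholder, not an actual provider rate. Check the provider's current price list before budgeting.

```python
def transfer_cost(gb_in, gb_out, price_per_gb_out, price_per_gb_in=0.0):
    """Estimated transfer bill: inbound is free by default, outbound is charged."""
    return gb_in * price_per_gb_in + gb_out * price_per_gb_out


# Hypothetical egress price in currency units per GB (an assumption, not a quote).
HYPOTHETICAL_EGRESS_PRICE = 0.10

upload_bill = transfer_cost(gb_in=1_000, gb_out=0,
                            price_per_gb_out=HYPOTHETICAL_EGRESS_PRICE)
download_bill = transfer_cost(gb_in=0, gb_out=1_000,
                              price_per_gb_out=HYPOTHETICAL_EGRESS_PRICE)

print(f"Uploading 1 TB: {upload_bill:.2f}")
print(f"Downloading 1 TB: {download_bill:.2f}")
```

This is one reason to move the software to the data and bring back only small result files.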
38
Academic clouds: Compute Canada
Vision: To make Canada a world leader in the use of advanced computing for research, discovery and innovation.
Mission: To enable excellence in research and innovation for the benefit of Canada by effectively, efficiently and sustainably deploying a state-of-the-art advanced research computing network supported by world-class expertise. To use this network to support a growing base of excellent researchers, and to serve them as a national voice for advanced research computing.
40
Compute Canada infrastructure
Usually only available to people from Canada
Usable by all in this workshop
Cancer Genome Collaboratory is developing an alternative sustainability model: the data is there, but you pay for compute cycles.
41
How to interact with the cloud?
Think of it as a high-performance computing system that somebody else is taking care of. The concept of “elasticity” touted by AWS is also very useful: you use what you need, and then turn it off when you are done.
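The "use what you need, then turn it off" habit can be made hard to forget by wrapping it in a context manager. This is a minimal sketch: `fake_start` and `fake_stop` are hypothetical stand-ins for whatever start and stop calls a given cloud provider's client library offers.

```python
from contextlib import contextmanager


@contextmanager
def elastic_resource(start, stop):
    """Start a resource, hand it to the caller, and always stop it afterwards."""
    resource = start()
    try:
        yield resource
    finally:
        stop(resource)  # runs even if the analysis raises an exception


# Hypothetical stand-ins for a provider's start/stop calls.
running = set()


def fake_start():
    running.add("vm-1")
    return "vm-1"


def fake_stop(vm):
    running.discard(vm)


with elastic_resource(fake_start, fake_stop) as vm:
    print(f"analysing on {vm}; running = {running}")

print(f"after the block; running = {running}")
```

The `finally` clause is the design choice that matters: the machine is released even when the analysis crashes, so nothing keeps billing by accident.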
42
Virtual Machine Monitor
[Figure: Traditional computer (application(s) on one operating system on hardware) compared with a virtual machine setup (several App/OS pairs running on a virtual machine monitor on hardware).]
43
Virtual Machine vs Docker
44
https://www.docker.com/
Docker containers wrap a piece of software in a complete filesystem that contains everything needed to run it. This guarantees that the software will always run the same, regardless of its environment.
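As a rough illustration of what running containerized software looks like, the sketch below builds (but does not execute) a `docker run` command line. The image name, the tool arguments, and the paths are hypothetical. Only the `docker run`, `--rm` and `-v` parts are standard Docker CLI usage.

```python
import shlex


def docker_run_command(image, command, mounts=None):
    """Build a `docker run` argument list without executing it.

    `mounts` maps host paths to container paths (passed with -v).
    """
    args = ["docker", "run", "--rm"]  # --rm removes the container on exit
    for host_path, container_path in (mounts or {}).items():
        args += ["-v", f"{host_path}:{container_path}"]
    return args + [image] + list(command)


# Hypothetical image, tool arguments and paths, for illustration only.
cmd = docker_run_command(
    "example/aligner:1.0",
    ["align", "/data/sample.fastq"],
    mounts={"/home/user/data": "/data"},
)
print(shlex.join(cmd))
```

On a machine with Docker installed, the list could be passed to `subprocess.run(cmd, check=True)`. The same command should behave the same on a laptop or a cloud VM, because the tool's dependencies travel inside the image.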
45
Human Data
Personal health information, and things that can identify you, are private. That also includes genomic sequences that can identify you. Society has provided a way for scientists in the research community to use this data, but scientists have to agree to some important rules.
46
In this workshop:
You will learn about the ethics and rules allowing one to use human data.
You will learn about VMs.
You will learn about Docker.
You will learn about the Cancer Genome Collaboratory.
You will learn about PCAWG.