
0 Genomics: a journey into the Cloud
June 2, 2015

1 Overview
Big Data
Big Data in Genomics
Enter: The Cloud
Cloud Technologies: Hadoop/MapReduce
Cloud Technologies: NoSQL
Applications in Genomics
Million Veteran Program
Challenges and Lessons Learned
Questions

2 Big Data
Big Data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management techniques
Three V’s of Big Data
  Variety
  Volume
  Velocity
Big Data is not just “a lot of data”; it is really about the methods applied to large data sets to derive value from them. Google is the second most valuable company, and its main revenue source is Big Data, much like Twitter and Facebook. Notice that this definition indicates a paradigm shift in the way data is used. In essence, it is about analyzing many different types of data in real time to derive value from them. Why do we care? Government, corporations.

3 Big Data in Genomics
Hypothesis-driven vs. data-driven approach
  Cause-and-effect paradigm is inconsequential
Data analytics techniques
  Hidden Markov Models
  Support Vector Machines
  Boltzmann Chains
In the data-driven approach you are not using data to “test your hypothesis” in the traditional sense, but “asking questions of the data” and letting the data guide your interrogation. Similarly, data-driven discoveries are agnostic to the cause-and-effect paradigm: if altering a certain variable leads to a different outcome, or if you have enough data and variables to safely predict an outcome, then that is all you need to know, without asking why or how (as in medicine: inputs and outputs, without the why and how). So if you have done your job right, Big Data will reveal information and trends that could not have been found without it. The Big Data analytics techniques used to mine Twitter data for consumer sentiment can also be used to predict flu outbreaks or to find genes associated with cancer.

4 If I had a peso…
Transistor, the basic unit of electronics: cost is halved every 18 months
We are very much in the era of big data, and because of this cost graph we need new ways to transmit, analyze, and store genomic data.
This document contains Booz Allen Hamilton Inc. proprietary and confidential business information.
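The halving rule on this slide can be written as cost(t) = cost(0) × 0.5^(t / 18), with t in months. A minimal sketch of that arithmetic, assuming an arbitrary starting cost of 1.0; the function name and defaults are illustrative and not taken from the presentation:

```python
def cost_after(months: float, initial_cost: float = 1.0,
               halving_period: float = 18.0) -> float:
    """Cost of a unit whose price halves every `halving_period` months."""
    return initial_cost * 0.5 ** (months / halving_period)


if __name__ == "__main__":
    # Relative cost after 0, 3, 6, and 9 years under the 18-month rule.
    for years in (0, 3, 6, 9):
        print(f"after {years} years: {cost_after(12 * years):.4f}")
```

Any cost curve that falls faster than this baseline outpaces the growth in affordable compute, which is the motivation the slide gives for new ways to store and analyze genomic data.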

5 Enter: The Cloud
Definition: Deploying groups of remote servers and software networks that allow centralized data storage and online access to computer services or resources
Benefits of the Cloud
  Resource Pooling
  Economies of Scale
  Rapid Elasticity and Scaling
  On-demand Storage and Compute
  Co-locating Data and Analytics
Service Models
Deployment Models
The Cloud enables Big Data analytics by:
  Eliminating transfer bottlenecks by storing all your data sets in one location and bringing analytical tools to the data
  Providing the ability to scale massively in terms of storage and compute
  Reducing cost through on-demand provisioning: you need 100 processors today but 0 tomorrow
  Moving from a commodity model to a service model
  Harnessing the power of bigger data sets by eliminating silos and allowing larger data sets to be aggregated and shared

6 Big Data and analytics: a match made in the Cloud
Cloud service models demonstrate the migration of infrastructure, platform, and software tools to services instead of commodities
Harnessing the power of Big Data and Cloud Computing entails bringing data and analytics together
The Hadoop/MapReduce platform is the most widely used platform for Big Data analytics in the Cloud
Software as a Service
  Galaxy
  GATK
  HBase
Platform as a Service
  Hadoop/MapReduce
  Spark
  MapR
Infrastructure as a Service
  Amazon Web Services
  Microsoft Azure
  Rackspace
Now you have leveraged some benefits of the cloud by putting all your data together with analytical tools, eliminating transfer, using on-demand compute, and so on. But the real power of the cloud comes into focus when you are able to automate scalability and elasticity across platforms; this is where Hadoop comes in.

7 Google’s Solution to the Big Data Problem

8 Harnessing the Power of the Cloud: Hadoop/MapReduce
Hadoop/MapReduce are frameworks for automatically scaling storage and compute
  Data and computations are spread over thousands of computers
  HDFS handles storage while MapReduce handles compute
MapReduce is Google's framework for large data computations
  Hadoop, developed at Yahoo, is an open-source implementation
  GATK is an alternative implementation specifically for NGS
Benefits
  Scalable, efficient, reliable
  Easy to program
  Runs on commodity computers
  Fast for very large jobs
  Fault tolerant
Challenges
  Redesigning / retooling applications
  Data storage efficiency
  Threshold to reap processing benefits
  Slow for small jobs

9 What is HDFS?
The Hadoop Distributed File System breaks data down into chunks and distributes them across a cluster of machines
Replication factor, block storage, automatic recovery from node failure

10 How does HDFS work?
NameNode: the “master” node determines how chunks of data are distributed across DataNodes
DataNodes: store “chunks” of data and replicate them across other DataNodes
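The NameNode/DataNode split described above can be illustrated with a toy model. This is only a sketch: the function names are hypothetical, the round-robin placement is a simplifying assumption (real HDFS placement is rack-aware), and nothing here calls an actual Hadoop API.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a file into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, datanodes: list[str],
                 replication: int) -> dict[int, list[str]]:
    """Toy NameNode: map each block id to the DataNodes holding a replica.

    Assumption: simple round-robin placement, chosen only for clarity.
    """
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds number of DataNodes")
    return {
        block_id: [datanodes[(block_id + r) % len(datanodes)]
                   for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_replicas(placement: dict[int, list[str]],
                       failed_node: str) -> dict[int, list[str]]:
    """Replicas still reachable after one DataNode fails."""
    return {block: [node for node in nodes if node != failed_node]
            for block, nodes in placement.items()}


if __name__ == "__main__":
    blocks = split_into_blocks(b"ACGTACGTAC", block_size=4)
    placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"], 3)
    print(placement)
    print(surviving_replicas(placement, "dn3"))
```

With a replication factor of 3, losing any single DataNode still leaves at least two copies of every block, which is what makes automatic recovery from node failure possible.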

11 What is MapReduce?
MapReduce is a programming model for processing large data sets with parallel, distributed algorithms on a cluster

12 How does MapReduce work? – part 1

13 How does MapReduce work? – part 2
Each task is mapped onto a data block on HDFS, essentially taking the analytical operation to the data in its most granular form
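The map, shuffle, and reduce phases can be sketched in plain Python. The example below counts k-mers in sequencing reads, with each inner list standing in for one HDFS data block; it is a single-process illustration of the programming model under those assumptions, not Hadoop code, and all names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_kmers(read: str, k: int) -> Iterator[tuple[str, int]]:
    """Map step: emit (k-mer, 1) for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle step: group all emitted values by key."""
    grouped: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: sum the counts for one k-mer."""
    return key, sum(values)


def count_kmers(blocks: list[list[str]], k: int) -> dict[str, int]:
    """Run map over every block, then shuffle and reduce the results."""
    mapped = (pair
              for block in blocks      # one map task per data block
              for read in block
              for pair in map_kmers(read, k))
    return dict(reduce_counts(key, values)
                for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(count_kmers([["ACGT"], ["CGTA", "ACG"]], k=3))
```

On a cluster, each map task would run on the machine that already stores its block, so only the small (k-mer, count) pairs travel over the network during the shuffle.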

14 How does MapReduce work? – part 3

15 Harnessing the Power of the Cloud: NoSQL
NoSQL, or Not Only SQL, is a class of databases that model data by means other than the tabular format of relational databases
  Column based instead of row based
  Efficient scale-out architecture
  Flexible schema suited to object-oriented programming
Four basic types
  Document databases
  Graph stores
  Key-value stores
  Wide-column stores
Schema-less architecture pushes database relationships out to the software level
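To illustrate the last point, the sketch below uses a plain dictionary as a stand-in for a document or key-value store. Documents of the same kind carry different fields (flexible schema), and the relationship between a patient and its samples is resolved in application code rather than by a database join. The keys, fields, and helper names are invented for illustration.

```python
from typing import Any

Store = dict[str, dict[str, Any]]


def put(store: Store, key: str, document: dict[str, Any]) -> None:
    """Insert or overwrite a document; no schema is enforced."""
    store[key] = document


def follow(store: Store, key: str, field: str) -> list[dict[str, Any]]:
    """Resolve the references held in `field`.

    The 'join' lives here, in software, not in the database.
    """
    return [store[ref] for ref in store[key].get(field, []) if ref in store]


store: Store = {}
# Two sample documents with different sets of fields.
put(store, "sample:1", {"type": "sample", "platform": "SNP array"})
put(store, "sample:2", {"type": "sample", "platform": "NGS", "read_length": 100})
put(store, "patient:1", {"type": "patient", "samples": ["sample:1", "sample:2"]})

if __name__ == "__main__":
    for sample in follow(store, "patient:1", "samples"):
        print(sample)
```

The trade-off is that consistency checks a relational schema would enforce (for example, that every referenced sample exists) become the application's responsibility.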

16 Hadoop applications in genomics
Short Read Mapping
  Typical query/subject example
  Query: read libraries split into smaller chunks by MapReduce
  Subject: genome split into blocks by HDFS
Genome Assembly
  De Bruijn Graphs
Genome Wide Association Studies
  NoSQL SNP indexing
Genomic Sequence Manager
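The query/subject split for short read mapping might look like the sketch below: the genome is cut into blocks, each map task searches its block for the reads, and the hits are merged per read. It is deliberately simplified. It does exact matching only, and it makes the blocks overlap so that reads spanning a block boundary are still found; that overlap is an assumption of this sketch, not something HDFS does on its own. All names are hypothetical.

```python
from collections import defaultdict
from typing import Iterator


def genome_blocks(genome: str, block_size: int,
                  overlap: int) -> Iterator[tuple[int, str]]:
    """Yield (offset, block) pairs; each block extends `overlap` bases
    into the next one so boundary-spanning reads are not missed."""
    for start in range(0, len(genome), block_size):
        yield start, genome[start:start + block_size + overlap]


def map_block(offset: int, block: str,
              reads: list[str]) -> Iterator[tuple[str, int]]:
    """Map task: emit (read, genome position) for each exact match."""
    for read in reads:
        pos = block.find(read)
        while pos != -1:
            yield read, offset + pos
            pos = block.find(read, pos + 1)


def map_reads(genome: str, reads: list[str],
              block_size: int) -> dict[str, list[int]]:
    """Run one map task per block, then merge positions per read."""
    if not reads:
        return {}
    overlap = max(len(read) for read in reads) - 1
    hits: dict[str, set[int]] = defaultdict(set)
    for offset, block in genome_blocks(genome, block_size, overlap):
        for read, position in map_block(offset, block, reads):
            hits[read].add(position)  # a set removes hits seen in two blocks
    return {read: sorted(positions) for read, positions in hits.items()}


if __name__ == "__main__":
    print(map_reads("ACGTACGTTAGC", ["GTAC", "TAG", "ACG"], block_size=4))
```

Production read mappers use indexed, mismatch-tolerant alignment rather than exact substring search; the point here is only how the work divides across blocks.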

17 The Million Veteran Program (MVP)
National voluntary research program funded by the Department of Veterans Affairs Office of Research & Development
Goal is to study how genes and environmental factors affect veterans’ health
Building one of the world's largest medical databases, containing biological samples and health information from one million veterans
  Blood samples for genomic profiling
    Single Nucleotide Polymorphism (SNP) Array Analysis
    Next Generation Sequencing (NGS) Analysis
  Personal health surveys and military deployment history
  Electronic health records
Genomic Informatics for Integrative Science (GenISIS) comprises hardware, platform, and tools to manage, store, and analyze MVP data
Current recruitment has passed 400K samples, with a goal of 1 million samples in 5 years
Total data volume is expected to exceed 10 petabytes in 5 years

18 Overview

19 MVP Data Warehouse
Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database
Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system
The Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations to integrate genomic, survey, and clinical data
The Query Mart will enable researchers to build cohorts and subset data using clinical and genomic information, and to export the results to the Data Mart for further analysis
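The honest-broker linkage and cohort building described above can be sketched roughly as follows. This is not the MVP implementation: the class, the record fields, and the use of random study IDs are all assumptions made for illustration, and the records in the usage section are invented toy data.

```python
import secrets


class HonestBroker:
    """Toy honest broker: the only holder of the real-ID to study-ID map."""

    def __init__(self) -> None:
        self._study_ids: dict[str, str] = {}

    def study_id(self, patient_id: str) -> str:
        """Return a stable random study ID for a patient."""
        if patient_id not in self._study_ids:
            self._study_ids[patient_id] = secrets.token_hex(8)
        return self._study_ids[patient_id]


def deidentify(records: list[dict], broker: HonestBroker,
               id_field: str = "patient_id") -> list[dict]:
    """Replace the identifying field with the broker's study ID."""
    cleaned = []
    for record in records:
        clean = {k: v for k, v in record.items() if k != id_field}
        clean["study_id"] = broker.study_id(record[id_field])
        cleaned.append(clean)
    return cleaned


def build_cohort(clinical: list[dict], genomic: list[dict],
                 condition: str, genotype: str) -> set[str]:
    """Study IDs that match both a clinical and a genomic criterion."""
    with_condition = {r["study_id"] for r in clinical
                      if r["condition"] == condition}
    with_genotype = {r["study_id"] for r in genomic
                     if r["genotype"] == genotype}
    return with_condition & with_genotype


if __name__ == "__main__":
    broker = HonestBroker()
    # Invented toy records, for illustration only.
    clinical = deidentify([{"patient_id": "P1", "condition": "X"},
                           {"patient_id": "P2", "condition": "Y"}], broker)
    genomic = deidentify([{"patient_id": "P1", "genotype": "T"},
                          {"patient_id": "P2", "genotype": "T"}], broker)
    print(build_cohort(clinical, genomic, "X", "T"))
```

Because researchers only ever see study IDs, clinical and genomic records can be joined for cohort building without exposing the original identifiers.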

20 Cloud Broker
Cloud Portal manages access control for different types of data and users
Cloud Engine co-locates data with analytical tools
Intelligent Orchestration Tool maps data and processes to storage and compute clusters to manage resources efficiently
Geographically distributed computational resources are pooled through a virtual private cloud

21 Data Lake – Key Value Data Store
[Figure: example key-value pairs in the data lake, grouped under three access-control tiers (Tier 1, Tier 2, Tier 3). Keys and example values shown in the diagram: Patient PT-00589A; Condition Diabetes Type II; Survey S A3288; Deployment Vietnam War; Sample SHIP; SNP rs; Genome Loc Chr7:; Gene TCF7L2; Genotype T.]
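A key-value store with tiered access control, as suggested by this slide, could be sketched like this. The rule that a higher tier number means more restricted data, and the tier assigned to each attribute below, are assumptions of the sketch; the transcript does not say how the tiers are defined. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    value: str
    tier: int  # assumption: a higher tier number means more restricted data


class TieredKeyValueStore:
    """Toy key-value store in which every entry carries an access tier."""

    def __init__(self) -> None:
        self._entries: dict[tuple[str, str], Entry] = {}

    def put(self, subject: str, attribute: str, value: str, tier: int) -> None:
        self._entries[(subject, attribute)] = Entry(value, tier)

    def get(self, subject: str, attribute: str,
            clearance: int) -> Optional[str]:
        """Return the value only if the caller's clearance covers its tier."""
        entry = self._entries.get((subject, attribute))
        if entry is None or entry.tier > clearance:
            return None
        return entry.value

    def attributes(self, subject: str, clearance: int) -> list[str]:
        """Attribute names of `subject` visible at this clearance."""
        return sorted(attr for (subj, attr), entry in self._entries.items()
                      if subj == subject and entry.tier <= clearance)


store = TieredKeyValueStore()
# Values come from the slide; the tier assignments are hypothetical.
store.put("PT-00589A", "Condition", "Diabetes Type II", tier=1)
store.put("PT-00589A", "Deployment", "Vietnam War", tier=2)
store.put("PT-00589A", "Genotype", "T", tier=3)

if __name__ == "__main__":
    print(store.attributes("PT-00589A", clearance=2))
```

Attaching the tier to each key-value pair, rather than to whole tables, lets one patient's record be partially visible to users with different clearances.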

22 Challenges and Lessons Learned
Petabyte-scale genomics data poses storage, transfer, and processing challenges
Cloud computing offers optimal solutions for data storage and analytics
  Next-generation algorithms with built-in scalability features (e.g. Apache Hadoop/MapReduce)
  Co-locating data and analytical tools to reduce data replication and transfer bottlenecks
Genomic data is PHI and should be protected using Data-in-Motion and Data-at-Rest best practices
  Encryption and decryption of genomic datasets constitute a significant fraction of data transfer and analysis time (YMMV)
  Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks
Data integration and metadata annotation are critical in deriving knowledge from data
  Lack of unified standard formats in genomics necessitates substantial effort in highly specialized analytical pipelines
  Data integration can be powered by annotation using multiple ontologies
  Data annotation upon ingest is crucial in a rapidly changing genomic sequencing landscape

23 Questions

