Genomics: a journey into the Cloud
June 2, 2015

Overview
- Big Data
- Big Data in Genomics
- Enter: The Cloud
- Cloud Technologies: Hadoop/MapReduce
- Cloud Technologies: NoSQL
- Applications in Genomics
- Million Veteran Program
- Challenges and Lessons Learned
- Questions

Big Data
Big Data describes the realization of greater business intelligence by storing, processing, and analyzing data that was previously ignored or siloed due to the limitations of traditional data management techniques.
The three V's of Big Data: Variety, Volume, Velocity.
Big Data is not just a "lot of data"; it is really about the methods applied to large data sets to derive value from them. Google is the second most valuable company, and its main revenue source is Big Data, much like Twitter and Facebook. This indicates a paradigm shift in the way data is used: in essence, it is about analyzing many different types of data in real time to derive value from them. Why do we care? Governments and corporations certainly do.

Big Data in Genomics
Hypothesis-driven vs. data-driven approaches; the cause-and-effect paradigm becomes inconsequential.
Data analytics techniques: Hidden Markov Models, Support Vector Machines, Boltzmann Chains.
In the traditional sense, you are not using data to "test your hypothesis" but "asking questions of the data" and letting the data guide your interrogation. Similarly, data-driven discoveries are agnostic to the cause-and-effect paradigm: if altering a certain variable leads to a different outcome, or if you have enough data and variables to safely predict an outcome, then that is all you need to know, without asking why or how. Medicine becomes inputs and outputs without the why and how. If you have done your job right, Big Data will reveal information and trends that could not have been found without it. The same Big Data analytics techniques used to mine Twitter data for consumer sentiment can be used to predict flu outbreaks or genes associated with cancer. A minimal sketch of one such technique appears below.
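To make the first technique on the slide concrete, here is a toy Viterbi decoder for a two-state Hidden Markov Model that labels each base of a DNA sequence as high-GC "island" or background, a classic HMM application in genomics. The model, its probabilities, and the input sequence are invented for illustration; this is a minimal sketch, not the presenter's code.

```python
# Toy two-state HMM: label DNA positions as GC-rich "island" vs. background.
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # V[t][s] = (best probability of any path ending in state s at time t,
    #            the predecessor state on that path)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                for p in states
            )
            V[t][s] = (prob, prev)
    # Backtrack from the best final state.
    state = max(states, key=lambda s: V[-1][s][0])
    path = [state]
    for t in range(len(obs) - 1, 0, -1):
        state = V[t][state][1]
        path.append(state)
    return list(reversed(path))

states = ("island", "background")
start_p = {"island": 0.5, "background": 0.5}
trans_p = {"island": {"island": 0.9, "background": 0.1},
           "background": {"island": 0.1, "background": 0.9}}
# Islands emit G/C more often; background emits A/T more often.
emit_p = {"island": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
          "background": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

print(viterbi("ATGCGCGCAT", states, start_p, trans_p, emit_p))
```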

If I had a peso… The transistor, the basic unit of electronics, halves in cost roughly every 18 months. We are very much in the era of big data, and because of this cost curve we need new ways to transmit, analyze, and store genomic data.

Enter: The Cloud
Definition: deploying groups of remote servers and software networks that allow centralized data storage and online access to computer services or resources.
Benefits of the Cloud: resource pooling, economies of scale, rapid elasticity and scaling, on-demand storage and compute, co-locating data and analytics. Service models and deployment models are covered on the next slide.
Clouds enable Big Data analytics by:
- Eliminating transfer bottlenecks: store all your data sets in one location and bring the analytical tools to the data
- Providing the ability to scale massively in terms of storage and compute
- Reducing cost through on-demand provisioning: you may need 100 processors today and none tomorrow
- Moving from a commodity model to a service model
- Harnessing the power of bigger data sets: eliminating silos allows larger data sets to be aggregated and shared

Big Data and analytics: a match made in the Cloud
Cloud service models reflect the migration of infrastructure, platform, and software tools into services rather than commodities. Harnessing the power of Big Data and cloud computing entails bringing data and analytics together; the Hadoop/MapReduce platform is the most widely used platform for Big Data analytics in the Cloud.
- Software as a Service: Galaxy, GATK, HBase
- Platform as a Service: Hadoop/MapReduce, Spark, MapR
- Infrastructure as a Service: Amazon Web Services, Microsoft Azure, Rackspace
You have now leveraged some benefits of the cloud by putting all your data together with the analytical tools: no transfer bottlenecks, on-demand compute, and so on. But the real power of the cloud comes into focus when you can automate scalability and elasticity across platforms; this is where Hadoop comes in.

Google’s Solution to the Big Data Problem

Harnessing the Power of the Cloud: Hadoop/MapReduce
Hadoop/MapReduce are frameworks for automatically scaling storage and compute: data and computations are spread over thousands of computers, with HDFS handling storage while MapReduce handles compute. MapReduce is Google's framework for large data computations; Hadoop, developed at Yahoo!, is an open-source implementation, and GATK is an alternative implementation built specifically for NGS.
Benefits: scalable, efficient, and reliable; easy to program; runs on commodity computers; fast for very large jobs; fault tolerant.
Challenges: redesigning and retooling applications; data storage efficiency; a job-size threshold below which the processing benefits are not reaped; slow for small jobs.

What is HDFS? The Hadoop Distributed File System breaks data down into chunks and distributes them across a cluster of machines. Key features: block storage, a configurable replication factor, and automatic recovery from node failure.

How does HDFS work?
NameNode: the "master" node that determines how chunks of data are distributed across DataNodes.
DataNodes: store the chunks of data and replicate them across other DataNodes.
A toy sketch of this block placement follows.
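The Python sketch below illustrates the NameNode's job: splitting a file into fixed-size blocks and assigning each block to several DataNodes. It is an invented illustration, not Hadoop code; the block size and replication factor mirror common HDFS defaults, and the round-robin placement is a deliberate simplification of HDFS's rack-aware policy.

```python
# Toy sketch of NameNode-style block placement (not Hadoop code).
import itertools

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS block size
REPLICATION = 3                 # the default HDFS replication factor

def place_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return {block_index: [datanodes holding a replica]} for one file."""
    n_blocks = -(-file_size // block_size)  # ceiling division
    ring = itertools.cycle(datanodes)       # naive round-robin placement
    return {block: [next(ring) for _ in range(replication)]
            for block in range(n_blocks)}

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
# A 300 MB file becomes three blocks, each replicated on three DataNodes.
for block, replicas in place_blocks(300 * 1024 * 1024, nodes).items():
    print(f"block {block}: {replicas}")
```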

What is MapReduce? MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A minimal word-count sketch of the model follows.
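Here is the canonical word-count example, written as a minimal sketch of the MapReduce model in plain Python. The shuffle step is simulated with a dictionary; in a real Hadoop job the framework performs the shuffle and runs the map and reduce functions on the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(record):
    # Map: emit (key, value) pairs -- here, (word, 1) for each word.
    for word in record.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    # Reduce: combine all values that share a key.
    return key, sum(values)

records = ["the cloud scales", "the cloud stores", "genomics in the cloud"]

# Shuffle: group intermediate values by key (the framework does this for you).
groups = defaultdict(list)
for record in records:            # in Hadoop, each record maps on its own node
    for key, value in map_phase(record):
        groups[key].append(value)

print(dict(reduce_phase(k, v) for k, v in groups.items()))
# {'the': 3, 'cloud': 3, 'scales': 1, 'stores': 1, 'genomics': 1, 'in': 1}
```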

How does MapReduce work? – part 1

How does MapReduce work? – part 2
Each task is mapped onto a data block on HDFS, essentially taking the analytical operation to the data in its most granular form.

How does MapReduce work – part 3

Harnessing the Power of the Cloud: NoSQL
NoSQL, or "Not Only SQL", is a class of databases modeled in ways other than the tabular format of relational databases: column-based instead of row-based, an efficient scale-out architecture, and a flexible schema suited to object-oriented programming.
Four basic types: document databases, graph stores, key-value stores, wide-column stores.
The schema-less architecture pushes database relationships out to the software level, as the sketch below illustrates.
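This sketch uses a plain Python dict as a stand-in for a key-value store; the record identifiers are invented for illustration. There is no table schema: each value carries whatever fields that record happens to have, and "joins" between records are performed by application code rather than by the database.

```python
store = {}

# Two records with different "shapes" -- no table schema to satisfy.
store["patient:PT-001"] = {"condition": "Type II diabetes",
                           "samples": ["sample:S-42"]}
store["sample:S-42"] = {"assay": "SNP array", "snps": ["snp:rs0000001"]}
store["snp:rs0000001"] = {"gene": "TCF7L2", "genotype": "T"}

def related(key, field):
    """Follow a 'relationship' by hand: the application, not the DB, joins."""
    return [store[k] for k in store[key].get(field, [])]

for sample in related("patient:PT-001", "samples"):
    print(sample["assay"], "->", [store[s]["gene"] for s in sample["snps"]])
# SNP array -> ['TCF7L2']
```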

Hadoop applications in genomics
- Short read mapping: a typical query/subject example. Query: read libraries are split into smaller chunks by MapReduce. Subject: the genome is split into blocks by HDFS.
- Genome assembly: De Bruijn graphs (see the toy construction below)
- Genome-wide association studies: NoSQL SNP indexing
- Genomic Sequence Manager
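To show what the assembly bullet refers to, here is a toy De Bruijn graph construction in Python: each read is decomposed into overlapping k-mers, nodes are (k-1)-mers, and an edge connects each k-mer's prefix to its suffix. Assemblers recover contigs by walking paths through this graph, and the underlying k-mer counting parallelizes naturally under MapReduce. The reads and the k value are invented for illustration.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer prefix to the suffixes that follow it in the reads."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph

reads = ["ATGGC", "TGGCG", "GGCGT"]  # overlapping reads from "ATGGCGT"
for prefix, suffixes in sorted(de_bruijn(reads, 3).items()):
    print(prefix, "->", suffixes)
# AT -> ['TG']
# CG -> ['GT']
# GC -> ['CG', 'CG']
# GG -> ['GC', 'GC', 'GC']
# TG -> ['GG', 'GG']
```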

The Million Veteran Program (MVP)
A national voluntary research program funded by the Department of Veterans Affairs Office of Research & Development. The goal is to study how genes and environmental factors affect veterans' health by building one of the world's largest medical databases, containing biological samples and health information from one million veterans:
- Blood samples for genomic profiling: Single Nucleotide Polymorphism (SNP) array analysis and Next Generation Sequencing (NGS) analysis
- Personal health surveys and military deployment history
- Electronic health records
Genomic Informatics for Integrative Science (GenISIS) comprises the hardware, platform, and tools to manage, store, and analyze MVP data. Current recruitment has passed 400K samples, with a goal of 1 million samples in 5 years; total data volume is expected to exceed 10 petabytes in 5 years.

Overview

MVP Data Warehouse
- Metadata extracted from vendor-generated genomic data (SNP array genotyping, whole genome sequencing, and whole exome sequencing) will be cataloged in a Metadata Database
- Genomic data will be linked with corresponding de-identified clinical and survey data by an Honest Broker system (sketched after this list)
- A Terminology and Annotation Server will allow researchers to incorporate a wide array of genomic and clinical annotations, integrating genomic, survey, and clinical data
- A Query Mart will enable researchers to build cohorts, subset data using clinical and genomic information, and export to the Data Mart for further analysis
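The following is a hypothetical sketch of the honest-broker idea, not the GenISIS implementation: a trusted intermediary holds the only secret that links real patient IDs to research pseudonyms, so genomic and clinical records can be joined for research without exposing identities to researchers. The IDs and values are taken from or modeled on the slide examples.

```python
import hmac, hashlib

SECRET = b"broker-only-key"  # held by the honest broker, never by researchers

def pseudonym(patient_id: str) -> str:
    """Deterministic keyed hash: same patient -> same pseudonym, no reversal."""
    return hmac.new(SECRET, patient_id.encode(), hashlib.sha256).hexdigest()[:12]

clinical = {"PT-00589A": {"condition": "Type II diabetes"}}
genomic = {"PT-00589A": {"rs4362914": "T"}}

# The broker releases both data sets keyed only by pseudonym, so the
# genomic and clinical records still join -- but not to an identity.
released = {
    pseudonym(pid): {**clinical.get(pid, {}), **genomic.get(pid, {})}
    for pid in clinical.keys() | genomic.keys()
}
print(released)  # {'<pseudonym>': {'condition': ..., 'rs4362914': 'T'}}
```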

Cloud Broker
- The Cloud Portal manages access control for different types of data and users
- The Cloud Engine co-locates data with analytical tools
- An Intelligent Orchestration Tool maps data and processes to storage and compute clusters to manage resources efficiently
- Geographically distributed computational resources are pooled through a virtual private cloud

Data Lake – Key Value Data Store
[Diagram: example key-value records in the data lake, grouped into Tier 1, Tier 2, and Tier 3 access-control levels. Keys such as Patient PT-00589A, Sample SHIP000675221, Survey S-2014-06-18-A3288, and SNP rs4362914 link values such as Condition: Diabetes Type II, Deployment: Vietnam War, Gene: TCF7L2, Genome Loc: Chr7:4344859978, and Genotype: T. A tiered-access sketch follows.]
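A minimal sketch of what the diagram implies, assuming a tier label on every record (the tier assignments below are invented; the keys and values come from the slide): a query returns only the records at or below the caller's clearance.

```python
records = [
    {"tier": 1, "key": "patient:PT-00589A", "field": "condition", "value": "Diabetes Type II"},
    {"tier": 2, "key": "patient:PT-00589A", "field": "sample", "value": "SHIP000675221"},
    {"tier": 1, "key": "snp:rs4362914", "field": "gene", "value": "TCF7L2"},
    {"tier": 3, "key": "snp:rs4362914", "field": "genotype", "value": "T"},
]

def query(key, clearance):
    """Return visible fields for a key, filtered by the caller's tier."""
    return {r["field"]: r["value"]
            for r in records
            if r["key"] == key and r["tier"] <= clearance}

print(query("snp:rs4362914", clearance=1))  # {'gene': 'TCF7L2'}
print(query("snp:rs4362914", clearance=3))  # {'gene': 'TCF7L2', 'genotype': 'T'}
```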

Challenges and Lessons Learned
Petabyte-scale genomic data poses storage, transfer, and processing challenges. Cloud computing offers strong solutions for data storage and analytics:
- Next-generation algorithms with built-in scalability features (e.g., Apache Hadoop/MapReduce)
- Co-locating data and analytical tools to reduce data replication and transfer bottlenecks
Genomic data is PHI and should be protected using data-in-motion and data-at-rest best practices:
- Encryption and decryption of genomic data sets constitute a significant fraction of data transfer and analysis time (YMMV)
- Efficient architectural design of storage and processing systems diminishes security risks and encryption/decryption bottlenecks (a minimal data-at-rest sketch follows)
Data integration and metadata annotation are critical to deriving knowledge from data:
- The lack of unified standard formats in genomics necessitates substantial effort on highly specialized analytical pipelines
- Data integration can be powered by annotation using multiple ontologies
- Annotating data upon ingest is crucial in a rapidly changing genomic sequencing landscape
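As a concrete, hedged example of the data-at-rest point, the sketch below encrypts and decrypts a toy variant file with the third-party Python "cryptography" package (pip install cryptography) and times the round trip. This illustrates the overhead the slide warns about; it is not the MVP architecture, and at petabyte scale this cost becomes a significant fraction of analysis time.

```python
import time
from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, managed by a key management service
fernet = Fernet(key)

plaintext = b"rs4362914\tChr7\tT\n" * 1_000_000  # ~17 MB of toy variant records

start = time.perf_counter()
ciphertext = fernet.encrypt(plaintext)   # what lands on shared storage
recovered = fernet.decrypt(ciphertext)   # what analysis must pay to read back
elapsed = time.perf_counter() - start

assert recovered == plaintext
print(f"encrypt+decrypt of {len(plaintext) / 1e6:.0f} MB took {elapsed:.2f} s")
```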

Questions