Using Big D to Fight the Big C David Patterson February 14, 2013.

Slides:



Advertisements
Similar presentations
Three Perspectives & Two Problems Shivnath Babu Duke University.
Advertisements

The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
LESSON 1: What is Genetic Research? PowerPoint slides to accompany Using Bioinformatics : Genetic Research.
Parallel Research at Illinois Parallel Everywhere
Running Large Graph Algorithms – Evaluation of Current State-of-the-Art Andy Yoo Lawrence Livermore National Laboratory – Google Tech Talk Feb Summarized.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
Wrapup. NHGRI strategic plan What does the NIH think genomics should be for the next 10 years? [Nature, Feb. 2011]
Matei Zaharia University of California, Berkeley Spark in Action Fast Big Data Analytics using Scala UC BERKELEY.
Berkeley Data Analytics Stack (BDAS) Overview Ion Stoica UC Berkeley UC BERKELEY.
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Discretized Streams An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker,
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica Spark Fast, Interactive,
A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011.
Parallel Programming Henri Bal Rob van Nieuwpoort Vrije Universiteit Amsterdam Faculty of Sciences.
MSSG: A Framework for Massive-Scale Semantic Graphs Timothy D. R. Hartley, Umit Catalyurek, Fusun Ozguner, Andy Yoo, Scott Kohn, Keith Henderson Dept.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
Bioinformatics and Phylogenetic Analysis
Bindley Bioscience Center Vision: Nurture interactive communication and interdisciplinary discovery with flexible laboratory project spaces and an open.
SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014.
3 rd Summer School in Computational Biology September 10, 2014 Frank Emmert-Streib & Salissou Moutari Computational Biology and Machine Learning Laboratory.
Cancer Genomics and Computing UC Berkeley CS David Patterson (UCB) and Taylor Sittler (UCSF) August 28, 2011.
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
CLL Research Consortium FISH studies, Core C June, 2005 NCI Submission.
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Copyright © 2012 Cleversafe, Inc. All rights reserved. 1 Combining the Power of Hadoop with Object-Based Dispersed Storage.
Bioinformatics Jan Taylor. A bit about me Biochemistry and Molecular Biology Computer Science, Computational Biology Multivariate statistics Machine learning.
D4M – Signal Processing On Databases
Clearstorydata.com Using Spark and Shark for Fast Cycle Analysis on Diverse Data Vaibhav Nivargi.
CceHUB A Knowledge Discovery Environment for Cancer Care Engineering Research Ann Christine Catlin HUBzero Workshop November 7, 2008.
Bioinformatics and it’s methods Prepared by: Petro Rogutskyi
Storage in Big Data Systems
PIER & PHI Overview of Challenges & Opportunities Ryan Huebsch † Joe Hellerstein † °, Boon Thau Loo †, Sam Mardanbeigi †, Scott Shenker †‡, Ion Stoica.
Cluster-based SNP Calling on Large Scale Genome Sequencing Data Mucahid KutluGagan Agrawal Department of Computer Science and Engineering The Ohio State.
Master’s Degrees in Bioinformatics in Switzerland: Past, present and near future Patricia M. Palagi Swiss Institute of Bioinformatics.
1 CS 294: Big Data System Research: Trends and Challenges Fall 2015 (MW 9:30-11:00, 310 Soda Hall) Ion Stoica and Ali Ghodsi (
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
©Edited by Mingrui Zhang, CS Department, Winona State University, 2008 Identifying Lung Cancer Risks.
SHIP protein identified as a B-cell tumor suppressor Lymphoma is a cancer of the immune system. White blood cells divide again and again, spreading abnormally.
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
How Companies are Using Spark And where the Edge in Big Data will be Matei Zaharia.
Genomes To Life Biology for 21 st Century A Joint Initiative of the Office of Advanced Scientific Computing Research and Office of Biological and Environmental.
Bringing Genomics Home Your DNA: A Blueprint for Better Health Dr. Brad Popovich Chief Scientific Officer Genome British Columbia March 24, 2015 / Vancouver,
Big Data Analytics Carlos Ordonez. Big Data Analytics research Input? BIG DATA (large data sets, large files, many documents, many tables, fast growing)
WMU CS 6260 Parallel Computations II Spring 2013 Presentation #1 about Semester Project Feb/18/2013 Professor: Dr. de Doncker Name: Sandino Vargas Xuanyu.
Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Site Technology TOI Fest Q Celebration From Keyword-based Search to Semantic Search, How Big Data Enables That?
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
Enabling Genomic BIG DATA with Content Centric Networking J.J. Garcia-Luna-Aceves UC Santa Cruz
Cloud Distributed Computing Environment Hadoop. Hadoop is an open-source software system that provides a distributed computing environment on cloud (data.
INTERPRETING GENETIC MUTATIONAL DATA FOR CLINICAL ONCOLOGY Ben Ho Park, M.D., Ph.D. Associate Professor of Oncology Johns Hopkins University May 2014.
Beyond Hadoop The leading open source system for processing big data continues to evolve, but new approaches with added features are on the rise. Ibrahim.
WHAT IS THE IMPACT OF THE HUMAN GENOME PROJECT FOR DRUG DEVELOPMENT? Arman & Fin.
Center for Bioinformatics and Genomic Systems Engineering Bioinformatics, Computational and Systems Biology Research in Life Science and Agriculture.
CSE 5810 Biomedical Informatics and Cloud Computing Zhitong Fei Computer Science & Engineering Department The University of Connecticut CSE5810: Introduction.
The Future of Whole Human Genome Data Management and Analysis, Available on the Microsoft Azure Platform Today MICROSOFT AZURE APP BUILDER PROFILE: SPIRAL.
An Overview of The Cancer Genome Atlas (TCGA)
Resilient Distributed Datasets A Fault-Tolerant Abstraction for In-Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave,
Raju Subba Open Source Project: Apache Spark. Introduction Big Data Analytics Engine and it is open source Spark provides APIs in Scala, Java, Python.
To develop the scientific evidence base that will lessen the burden of cancer in the United States and around the world. NCI Mission Key message:
Connected Infrastructure
SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data - Aditi Thuse.
Pathology Spatial Analysis February 2017
Spark Presentation.
Connected Infrastructure
ATOM Accelerating Therapeutics for Opportunities in Medicine
Introduction to Spark.
Genomes and Their Evolution
Overview of big data tools
Vrije Universiteit Amsterdam
Presentation transcript:

Using Big D to Fight the Big C David Patterson February 14, 2013

Outline AMPLab Overview How can Computer Scientists Help? Genetics 101 Berkeley’s fastest genome aligner: SNAP Fighting Cancer in the Future A 1M Genome Cancer Warehouse Conclusion 2

AMP Lab: Algorithms, Machines & People Adaptive/Active Machine Learning and Analytics Cloud Computing CrowdSourcing/ Human Computation Massive and Diverse Data Massive and Diverse Data Machine Learning, Databases, Systems, + Networking Release Berkeley Data Analysis Stack (BDAS)

AMP Expedition 4

MLBase (Declarative Machine Learning) BlinkDB (approx QP) Berkeley Data Analytics System 5 HDFS Shark (SQL) + Streaming AMPLab (released)3 rd partyAMPLab (in progress) Streaming Hadoop MR MPI Graphlab etc. Spark Shared RDDs (distributed memory) Mesos (cluster resource manager)

What is Spark? Fast, MapReduce-like engine –In-memory storage for very fast iterative queries –General execution graphs –Up to 100x faster than Hadoop (2-10x even on-disk data) Language-integrated API in Scala, Java, + Python Compatible with Hadoop’s storage APIs –Can access HDFS, HBase, S3, SequenceFiles, etc Matei Zahari will talk about Spark in PhD Session

Where CS can Help with War on Cancer 1.Create easy-to-use, fast, accurate, reliable genetic analysis software pipelines 2.Create massive, cheap, easy-to-use, privacy- protecting repository for cancer treatments showing tumor genomes over time, therapies, and outcomes

Cancer: Good and Bad News

Genetics 101 Double stranded DNA: ½ Mom, ½ Dad Base pairs: links between 2 bases Guanine-Cytosine, Adenine-Thymine Human DNA = 3.2B base pairs Gene: unit of heredity that corresponds to stretches of DNA Humans have ~25,000 genes, avg ~25,000 bp / gene Metabolic role of gene to produce enzymes, which controls a protein Pathway: chemical reactions in a cell that maintain organism, controlled in part by genes

Cancer 101 Normal cells have built-in limits to cell division/growth Cancer cells are immortal and mutate Cancer tumors are heterogeneous colonies of cancer cells Later spread throughout body (metastasize) Some pathways are associated with cancer cell growth / survival Taxonomy: location in body vs. genetics 1000s of subtypes of cancer? 1.6M new cases per year in US

Genetic Sequencing Overview consensus reference genome variant reconstructed genome variants

Read Alignment SNP Calling Structural Variant Detection Reconstructed Genome Reads Genetic Data Processing Software Pipeline

“The reality of implementing such a pipeline and optimizing underlying tasks is complex” 119 th step Malachi Griffith, Washington University, August 19, 2012 “Cancer genome and transcriptome sequencing – analysis challenges and bottlenecks”

Lack of SW Engineering by Scientists 2008 survey –Most scientists are self-taught in programming –Only ⅓ think formal training in SW Eng is important –< ½ have a good understanding of SW testing For example, bug in SW supplied by another research lab forced UCSD Scripps Prof to retract 5 papers –Science, Journal of Molecular Biology, and Proceedings of the National Academy of Sciences “Computational science:...Error…why scientific programming does not compute,” by Zeeya Merali, 13 October 2010, Nature 467,

Pipeline goals Build a faster, more scalable, more accurate pipeline Apply to both medicine & research  focus on cancer Interdisciplinary team: UC Berkeley, Intel, Microsoft, UCSF

AMP-Microsoft-Intel Genome Team UC Students/ Post-Docs – Ma’ayan Bresler – Kristal Curtis – Jesse Liptrap – Ameet Talwalkar – Jonathan Terhorst – Matei Zaharia –Yuchen Zhang UC Faculty – Michael Jordan – David Patterson – Satish Rao – Scott Shenker – Yun Song – Ion Stoica Expertise – Computational Biology/Medicine – Machine Learning – Systems External – Bill Bolosky (MS/MSR) – Mishali Naik (Intel) – Paolo Narvaez (Intel) – Ravi Pandya (MS) – Abirami Prabhakaran (Intel) – Taylor Sittler (UCSF) – Gans Srinivasa (Intel) – Arun Wiita (UCSF)

First Result: SNAP Scalable Nucleotide Alignment Program (SNAP) Longer seeds, Overlapping seeds, O(nd) vs. O(n 2 ) edit distance [Ukkonen] Aligner Reads/ sec Time (hours) Bowtie2 14, BWA 9, SNAP180,000 2 Fraction Errors % Aligned For 16 cores

SNAP C++ pipeline engine Many-core shared-memory parallelism Fast multi-threaded I/O Low-level optimization Many-core shared-memory parallelism Fast multi-threaded I/O Low-level optimization End-to-end in-memory pipeline Efficient memory usage End-to-end in-memory pipeline Efficient memory usage Common algorithms + data structures Common algorithms + data structures

Pipeline design Modern computer systems Spark Scala API SNAP pipeline engine

Where CS can Help with War on Cancer 1.Create easy-to-use, fast, accurate, reliable genetic analysis software pipelines 2.Create massive, cheap, easy-to-use, privacy- protecting repository for cancer treatments showing tumor genomes over time, therapies, and outcomes

Fighting Cancer in Future Patient arrives at oncologist withentire sequence of cancer’s genome – Mutations organized into key pathways Software identifies key pathwayscontributing to growth of cancer Therapies target these pathways aftertumor removed Patients starts with 1 st drug cocktail, switch to 2 nd when cancer mutates, switch to 3 rd when mutates again … – Take some medicine for rest of life? 2050???? Emperor of All Maladies, page 464

Feel Like Character in SciFi Movie Terminator 2 to prevent SkyNet from being invented: Sarah Connors 22

A Million Cancer Genome Warehouse Why 1M? Sufficient number of samples for cancer subtypes to discover meaningful patterns in the data (enough statistical power) (UCSF) (UCSC)

National effort: The Cancer Genome Atlas 10,000 tumors from 20 adult cancers DATA CENTER ANALYSIS CENTER SEQUENCING CENTER ANALYSIS CENTER TCGA CENTERS Baylor College of Medicine University of Texas, M.D. Anderson Cancer Ctr TCGA CENTERS Baylor College of Medicine University of Texas, M.D. Anderson Cancer Ctr ANALYSIS CENTER TCGA CENTERS BC Cancer Research Center Fred Hutchinson Cancer Research Center Complete Genomics Inc. Pacific NW National Laboratory University of Southern California Oregon Health & Science University Institute for Systems Biology University of California, Santa Cruz TCGA CENTERS BC Cancer Research Center Fred Hutchinson Cancer Research Center Complete Genomics Inc. Pacific NW National Laboratory University of Southern California Oregon Health & Science University Institute for Systems Biology University of California, Santa Cruz ANALYSIS CENTER TCGA CENTERS Boise State University TCGA CENTERS Boise State University PROTEOME CENTER ANALYSIS CENTER TCGA CENTERS Brigham & Women’s Hospital and Harvard Medical School Broad Institute John Hopkins University Memorial Sloan-Kettering Cancer Center TCGA CENTERS Brigham & Women’s Hospital and Harvard Medical School Broad Institute John Hopkins University Memorial Sloan-Kettering Cancer Center TCGA CENTERS Vanderbilt University Washington University Genome Institute TCGA CENTERS Vanderbilt University Washington University Genome Institute TCGA CENTERS University of North Carolina TCGA CENTERS University of North Carolina PROTEOME CENTER TCGA CENTERS International Genomics Consortium TCGA CENTERS International Genomics Consortium BIOSPECIMEN CORE BIOSPECIMEN CORE GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER TCGA CENTERS Nationwide Children’s Hospital TCGA CENTERS Nationwide Children’s Hospital GENOME CENTER GENOME CENTER GENOME CENTER GENOME CENTER PROTEOME CENTER ANALYSIS CENTER BIOSPECIMEN CORE BIOSPECIMEN CORE PROTEOME CENTER SEQUENCING CENTER TCGA Centers: Biospecimen Core Resource Genome Characterization Centers (GCCs) Genome Sequencing Centers (GSCs) Proteome Characterization Centers (PCCs) Data Coordination Center (DCC) Genome Data Analysis Centers (GDACs) TCGA Centers: Biospecimen Core Resource Genome Characterization Centers (GCCs) Genome Sequencing Centers (GSCs) Proteome Characterization Centers (PCCs) Data Coordination Center (DCC) Genome Data Analysis Centers (GDACs) GENOME CENTER GENOME CENTER DATA COORDINATING CENTER PROTEOME CENTER

Cancer Genomics Hub (CGHub) Built at UCSD supercomputer center to store DNA information in for The Cancer Genome Atlas Designed for 50,000 genomes with average of 100 gigabytes per genome: 5 petabytes Currently 24,000 files from ~5,500 cases, ~60 gigabytes/case, in total 2 PB of downloads in first 6 months, routine 3 Gbit/s Total Cost ~ $100/year/genome at 50K genomes, i.e. $5M/year. The technology cost is about ½ the total Co-location opportunities in same data center for groups who want to compute on the data

What would it cost to store and analyze 1M Cancer Genomes in 2014? Internet network bandwidth 100X cost datacenter networking, computation should be near data Cloud computing demonstrates large economies of scale of large computer warehouses White paper estimates ~ $25/genome/year in 2014 to store and analyze 100 petabytes reliably –25,000 disks and 100,000 processor cores –Including operating costs: space, electricity, operators –Including 2 nd center to protect against disasters Even at $1000 for wet lab costs, $25/year is cheap

Non-Technical Challenges for 1MGW Patient data: Ethicists v. Disease Advocates Cannot aggregate data for research without consent –E.g., Wilbanks patient “Portable Legal Consent” Can’t use “my” data until we publish –CGHub lifts competing publication moratorium after data published or after 18 months, whichever first Claiming IP rights on discoveries made from data –CGHub data is not encumbered by IP provisions Access: all qualified researchers from all institutions –CGHub data open to all qualified researcher Who: Government? Business? Non-Profit?

Conclusion: Societal-Scale Big Data App Genetic sequencing costs 1,000,000X less –$1000 per genome by end of year? Cancer: genetic disease that kills 0.6M/yr Chance for Computer Scientists to use Big Data technology to help fight Cancer(!) –Fast, accurate, easy to use genetics analysis pipeline –Fast, cheap, easy to use, privacy protecting repository of cancer genetics, treatments, outcomes Accelerate Personalized Cancer Therapy from ~2050 to ~2018?

Using Big D to Fight the Big C: Opportunity or Obligation? If a chance that Computer Scientists could help millions of cancer patients live longer and better lives, as moral people, aren’t we obligated to try? David Patterson, “Computer Scientists May Have What It Takes to Help Cure Cancer,” New York Times, 12/5/