Cancer Genomics and Computing UC Berkeley CS 294-75 David Patterson (UCB) and Taylor Sittler (UCSF) August 28, 2011.

Slides:



Advertisements
Similar presentations
Frank van Harmelen Vrije Universiteit Amsterdam The Information Universe of the (Near) Futur e Creative Commons License: allowed to share & remix, but.
Advertisements

Frank van Harmelen Vrije Universiteit Amsterdam The Web of data and LarKC’s role in it Creative Commons License: allowed to share & remix, but must attribute.
Microsoft A Vision for Health. Consumerism/ Choice A Challenging World Public Health Healthcare spend increasing as % of GDP spend Increasing social cost.
The Datacenter Needs an Operating System Matei Zaharia, Benjamin Hindman, Andy Konwinski, Ali Ghodsi, Anthony Joseph, Randy Katz, Scott Shenker, Ion Stoica.
Large Scale Computing Systems
Siniša Ivković, Goran Rakočević, Prof. Veljko Milutinovic University of Belgrade School of Electrical Engineering.
Nokia Technology Institute Natural Partner for Innovation.
FUTURE TECHNOLOGIES Lecture 13.  In this lecture we will discuss some of the important technologies of the future  Autonomic Computing  Cloud Computing.
LESSON 1: What is Genetic Research? PowerPoint slides to accompany Using Bioinformatics : Genetic Research.
CS 431 The Semester in Elevator Speak Carl Lagoze – Cornell University May 5, 2004.
THE DATACENTER NEEDS AN OPERATING SYSTEM MATEI ZAHARIA, BENJAMIN HINDMAN, ANDY KONWINSKI, ALI GHODSI, ANTHONY JOSEPH, RANDY KATZ, SCOTT SHENKER, ION STOICA.
Turning Data into Value Ion Stoica CEO, Databricks (also, UC Berkeley and Conviva) UC BERKELEY.
A Berkeley View of Big Data Ion Stoica UC Berkeley BEARS February 17, 2011.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
Mesos A Platform for Fine-Grained Resource Sharing in Data Centers Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy.
Bioinformatics: a Multidisciplinary Challenge Ron Y. Pinter Dept. of Computer Science Technion March 12, 2003.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
SNAP: Fast, accurate sequence alignment enabling biological applications Ravi Pandya, Microsoft Research ASHG 10/19/2014.
UC Berkeley Improving MapReduce Performance in Heterogeneous Environments Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica University.
1 Reliable Adaptive Distributed Systems Armando Fox, Michael Jordan, Randy H. Katz, David Patterson, George Necula, Ion Stoica, Doug Tygar.
Introduction to Data Science Kamal Al Nasr, Matthew Hayes and Jean-Claude Pedjeu Computer Science and Mathematical Sciences College of Engineering Tennessee.
AMPCamp Introduction to Berkeley Data Analytics Systems (BDAS)
Big Data A big step towards innovation, competition and productivity.
CS Machine Learning. What is Machine Learning? Adapt to / learn from data  To optimize a performance function Can be used to:  Extract knowledge.
SYSTEMS SUPPORT FOR GRAPHICAL LEARNING Ken Birman 1 CS6410 Fall /18/2014.
Annual SERC Research Review - Student Presentation, October 5-6, Extending Model Based System Engineering to Utilize 3D Virtual Environments Peter.
Hadoop Team: Role of Hadoop in the IDEAL Project ●Jose Cadena ●Chengyuan Wen ●Mengsu Chen CS5604 Spring 2015 Instructor: Dr. Edward Fox.
Cloud Computing: The New Reality or Just Vapor? CMG2010 December 9, 2010 Paper 5137 Session 504 Dr. Tim R. Norton Simalytic Solutions, LLC
Tyson Condie.
Science, research and developmentEuropean Commission Chile-EC S&T Agreement Brussels, 24 September 2002 Life Sciences, Genomics and Biotechnology for Health.
Cell Cycle, Cancer, and the Biology Student Workbench An intro to what BSW can do 2002 Steve Moore & Kathy Gabric.
Counseling for IIT/NIT Aspirants Presented by YES Centre and The Hindu Group 16 June 2013, Hyderabad.
Sage Bionetworks Mission Sage Bionetworks is a non-profit organization with a vision to create a “commons” where integrative bionetworks are evolved by.
Charles Tappert Seidenberg School of CSIS, Pace University
Using Big D to Fight the Big C David Patterson February 14, 2013.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
Sage Bionetworks A non-profit organization with a vision to enable networked team approaches to building better models of disease BIOMEDICINE INFORMATION.
Above the Clouds : A Berkeley View of Cloud Computing
Dr. T. Y. Lin | SJSU | CS 157A | Fall 2011 Chapter 1 THE WORLDS OF DATABASE SYSTEMS 1.
Mesos A Platform for Fine-Grained Resource Sharing in the Data Center Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony Joseph, Randy.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
Bioinformatics Core Facility Guglielmo Roma January 2011.
Initial sequencing and analysis of the human genome Averya Johnson Nick Patrick Aaron Lerner Joel Burrill Computer Science 4G October 18, 2005.
Bringing Genomics Home Your DNA: A Blueprint for Better Health Dr. Brad Popovich Chief Scientific Officer Genome British Columbia March 24, 2015 / Vancouver,
Sage Bionetworks A non-profit organization with a vision to enable networked team approaches to building better models of disease BIOMEDICINE INFORMATION.
Overview of Bioinformatics 1 Module Denis Manley..
+ Big Data IST210 Class Lecture. + Big Data Summary by EMC Corporation ( More videos that.
Bioinformatics Curriculum Issues, goals, curriculum.
Wearable sensors: MJFF's wearable sensor platform to quantify PD in clinical trials and across patient populations Ken Kubota Director of Data Science.
Matei Zaharia, in collaboration with Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Haoyuan Li, Justin Ma, Murphy McCauley, Joshua Rosen, Reynold Xin,
CISC 849 : Applications in Fintech Namami Shukla Dept of Computer & Information Sciences University of Delaware iCARE : A Framework for Big Data Based.
Bioinformatics: Cool stuff you can do with Computers and Biology Oded Magger Tel Aviv University / Autodesk inc. GIP course 2010.
Axis AI Solves Challenges of Complex Data Extraction and Document Classification through Advanced Natural Language Processing and Machine Learning MICROSOFT.
Smart Grid Big Data: Automating Analysis of Distribution Systems Steve Pascoe Manager Business Development E&O - NISC.
Bringing Genomics Home Your DNA: A Blueprint for Better Health Genome British Columbia November 18, 2015 / West Vancouver, BC.
Data Analytics (CS40003) Introduction to Data Lecture #1
To develop the scientific evidence base that will lessen the burden of cancer in the United States and around the world. NCI Mission Key message:
Bringing Genomics Home Your DNA: A Blueprint for Better Health
More than IaaS Academic Cloud Services for Researchers
Chapter 1- Introduction
Big Data Enterprise Patterns
Rocky K. C. Chang September 4, 2017
KnowEnG: A SCALABLE KNOWLEDGE ENGINE FOR LARGE SCALE GENOMIC DATA
U.S. Healthcare Artificial Intelligence Market to grow at 38% CAGR from 2017 to 2024
Hadoop Clusters Tess Fulkerson.
XtremeData on the Microsoft Azure Cloud Platform:
CompSci 1: Principles of Computer Science Lecture 1 Course Overview
Cloud Computing MapReduce in Heterogeneous Environments
TECHNOLOGY, ENGINEERING AND DATA CONTINUING AND PROFESSIONAL EDUCATION
SOFTWARE ENGINEERING CS-5337: Introduction
Presentation transcript:

Cancer Genomics and Computing UC Berkeley CS David Patterson (UCB) and Taylor Sittler (UCSF) August 28, 2011

Outline Cancer Challenge Big Data Technology Opportunity/Obligation Big Data / AMP Lab Course Overview Genetics 101, by Dr. Andy Poggio 2

Big Data, Genomics, and Cancer Cancer: a perversion of a normal cell –Limitless growth, evolves, then spreads: immortal disease Cancer is a genetic disease Accidental DNA cell copy flaws + carcinogen-caused mutations lead to cancer Turns on growth accelerator (oncogenes) or turns off tumor brakes (anti-oncogenes) 3

Big Data, Genomics, and Cancer “Indeed, as the fraction of those affected by cancer creeps inexorably in some nations form one in four to one in three to on in two, cancer will, indeed, become the new normal–an inevitability. The question will not be if we will encounter this immortal disease in our lives, but when.” ¼ US deaths, 7M/year worldwide ⅓ US women will get cancer ½ US men will get cancer 4

Five Steps from Genes to Diagnosis 1.Sequencing machine that identifies many 300 base pair segments from a cancer tumor 2.Create full genome sequence of the cancer tumor from many segments 3.Verify correctness of this sequence 4.Insights from comparing many sequences to cancer tumor genome 5.Diagnose and suggest of therapeutic targets for cure or non-progression based on tumor genome comparisons and patient records 5

1. Sequencing Machine Costs Improving faster than Moore’s Law 2007: $1,000, : $10, : $1, wet lab processing problem => 2012 digital processing problem Looks like sequencing machines not the bottleneck in speed or cost 6

2 & 3. Full sequence and verification UCSF proposal to use open source SW, Cloud, Hadoop, Hypertable, Berkeley tools (Spark, Mesos) >1 year on PC to <1 day in the cloud 7

4. Compare Sequences Sequence Alignment Machine Learning + Data Analytics Cloud Programming Frameworks 8

5. Clinical Diagnosis Suggest effective therapeutic targets for cure or stabilization –Often events are relatively rare mutations, and not identifiable by traditional statistical methods (solutions lie in long tail of rare mutations) Use patient records + natural language processing? Crowd Sourcing? –Can turn it into a “foldit” like game? 9

Big Data Opportunity The Cancer Genome Atlas (TCGA) –20 cancer types, 500 tumors each: 5 petabytes –Datacenter of David Haussler opens 10/1/11 –Place Berkeley cluster next to 5 PB of data –Novelty: Academic Access yet Important Big Data 10 Slide from David Haussler, UCSC, “Cancer Genomics,” AMP retreat, 5/24/11

TCGA Potential Impact? “We fully expect that 10 years from now, each cancer patient is going to want to get a genomic analysis of their cancer and will expect customized therapy based on that information.” Brad Ozenberger TCGA program director “Cracking Cancer's Code” Time Magazine June 2,

Information Technology Obstacle? “There is a growing gap between the generation of massively parallel sequencing output and the ability to process and analyze the resulting data. New users are left to navigate a bewildering maze of base calling, alignment, assembly and analysis tools with often incomplete documentation and no idea how to compare and validate their outputs. Bridging this gap is essential, or the coveted $1,000 genome will come with a $20,000 analysis price tag.” John D McPherson “Next-generation gap,” Nature Methods, October 15,

An Opportunity or Obligation? Given increasing genomic databases, next breakthroughs in cancer fight more likely to come from computer scientists than from biological scientists If it is plausible that we could help millions of cancer patients live longer and better lives, as moral people, don’t we have an obligation to try? 13

NoYes No Pure Basic Research (Bohr) Pure Applied Research (Edison) Research is inspired by: Consideration of use? Quest for Fundamental Understanding? From Pasteur’s Quadrant: Basic Science and Technological Innovation, Donald E. Stokes, 1997 Slide from “Engineering Education and the Challenges of the 21st Century,” Charles Vest, 9/22/09 Use-inspired Basic Research (Pasteur) Big Data and Pasteur’s Quadrant Attack Big Data by Helping Fight Cancer?

Early Result Sequence alignment 100X to 1000X faster due to better algorithm, use of clusters Matei Zaharia, Kristal Curtis, Taylor Sittler 15

Outline Cancer Challenge Big Data Technology Opportunity/Obligation Big Data / AMP Lab Course Overview Genetics 101, by Dr. Andy Poggio 16

Big Data is … Massive –Facebook: TB/day: 83 million pictures –Google: > 25 PB/day processed data Growing –More devices (cell phones), More people (3 rd world), Bigger disks (2TB/$100) Dirty –Diverse, No Schema, Uncurated, Inconsistent Syntax and Semantics 17 Many biologists don't trust any data that they didn't produce themselves. An important distinction is that biologists typically do trust tools they didn't create themselves, despite a similar potential for inaccuracy. They tend to acknowledge data bugs but not software bugs.

Current Biologist Tools These algorithms implemented as tools are critical Tools come from diverse individuals or small, ad hoc groups in the computational biology community –Not writing in modern ways or modern languages; Perl is popular Obvious improvements needed in interoperability and parallelism “The current state of genomics data is turning biologists into Perl programmers.” Gene Myers, Celera Human Genome Project 18

“Big Data”: Working Definition 19

3 Dimensions to Improve Data Analysis 20 1.Improve scale, efficiency, and quality of algorithms to increase value from Big Data (Algorithms) 2.Use cloud computing to get value from Big Data and enhance datacenter infrastructure to cut costs of Big Data management (Machines) 3.Leverage human activity and intelligence to extract value from Big Data cases that are hard for algorithms (People)

Algorithms, Machines, People Today’s solutions: 21 Algorithms Machines People search Watson/IBM

The AMPLab 22 search Watson/IBM Machines People Algorithms sense at scale by integrating Algorithms, Machines, and People Making sense at scale by integrating Algorithms, Machines, and People

CS Course Overview Mondays 5:30-7PM, 2 units Beginning–read background material, learn jargon –(see who does best on 9/12 jargon test!) Middle– Distinguished speakers and read their papers, select project, project choice and project progress End– Project Presentations + Reflect on what we learned + Identify low hanging fruit See 23

Outline Cancer Challenge Big Data Technology Opportunity/Obligation Big Data / AMP Lab Course Overview Genetics 101, by Dr. Andy Poggio 24

Backup Slides 25

AMP Faculty and Sponsors Started March 2011 Faculty –Alex Bayen (mobile sensing platforms) –Armando Fox (systems) –Michael Franklin (databases) Director –Michael Jordan (machine learning) –Anthony Joseph (security & privacy) –Randy Katz (systems) –David Patterson (systems) –Ion Stoica (systems) –Scott Shenker (networking) 26