C. Titus Brown Asst Prof, CSE and Micro Michigan State University

Slides:



Advertisements
Similar presentations
De Bruijn NG (1946). A Combinatorial Problem. Koninklijke Nederlandse Akademie v. Wetenschappen 49: 758–764. Eddy SR (2009). A new generation of homology.
Advertisements

Marius Nicolae Computer Science and Engineering Department
Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S.Deogun University of Nebraska-Lincoln Evaluating Distributed.
Approaches for our growing metagenomes Kostas Konstantinidis Carlton S. Wilder Associate Professor School of Civil and Environmental Engineering & School.
SAN DIEGO SUPERCOMPUTER CENTER Blue Gene for Protein Structure Prediction (Predicting CASP Targets in Record Time) Ross C. Walker.
Jeff Dangl, UNC Chapel Hill Phil Hugenholtz, Susannah Tringe, JGI Ruth Ley, Cornell Rhizosphere Grand Challenge Pilot Project Scott Clingenpeel Project.
Next-generation sequencing
Bioinformatics for Whole-Genome Shotgun Sequencing of Microbial Communities By Kevin Chen, Lior Pachter PLoS Computational Biology, 2005 David Kelley.
An Efficient Parallel Approach for Identifying Protein Families from Large-scale Metagenomics Data Changjun Wu, Ananth Kalyanaraman School of Electrical.
Class 02: Whole genome sequencing. The seminal papers ``Is Whole Genome Sequencing Feasible?'' ``Whole-Genome DNA.
Characterizing and Predicting TCP Throughput on the Wide Area Network Dong Lu, Yi Qiao, Peter Dinda, Fabian Bustamante Department of Computer Science Northwestern.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Titus Brown Qingpeng Zhang John Blischak Welcome!.
Computer Science 101 Introduction to Programming.
De-novo Assembly Day 4.
C. Titus Brown Assistant Professor CSE, MMG, BEACON Michigan State University The pro-shotgun-assembly talk.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
CS 394C March 19, 2012 Tandy Warnow.
Todd J. Treangen, Steven L. Salzberg
Genomics Virtual Lab: analyze your data with a mouse click Igor Makunin School of Agriculture and Food Sciences, UQ, April 8, 2015.
Graph Algorithms for Irregular, Unstructured Data John Feo Center for Adaptive Supercomputing Software Pacific Northwest National Laboratory July, 2010.
KMERSTREAM Streaming algorithms for k-mer abundance estimation Páll joint work with Bjarni V. Halldórsson.
Accurate estimation of microbial communities using 16S tags Julien Tremblay, PhD
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
Improving the Accuracy of Genome Assemblies July 17 th 2012 Roy Ronen *,1, Christina Boucher *,1, Hamidreza Chitsaz 2 and Pavel Pevzner 1 1. University.
Sequence assembly using paired- end short tags Pramila Ariyaratne Genome Institute of Singapore SOC-FOS-SICS Joint Workshop on Computational Analysis of.
IPlant Collaborative Tools and Services Workshop iPlant Collaborative Tools and Services Workshop Collaborating with iPlant.
A Survey of Distributed Task Schedulers Kei Takahashi (M1)
Metagenomics Assembly Hubert DENISE
The Future of the iPlant Cyberinfrastructure: Coming Attractions.
Cloud infrastructure for training in Life Sciences Manuel Corpas The Genome Analysis Centre.
The iPlant Collaborative
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
Predator/Prey Simulation for Investigating Emergent Behavior Jay Shaffstall.
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
Current Challenges in Metagenomics: an Overview Chandan Pal 17 th December, GoBiG Meeting.
Cancer Genome Assemblies and Variations between Normal and Tumour Human Cells Zemin Ning The Wellcome Trust Sanger Institute.
Using SWARM service to run a Grid based EST Sequence Assembly Karthik Narayan Primary Advisor : Dr. Geoffrey Fox 1.
10 Billion Piece Jigsaw Puzzles John Cleary Real Time Genomics.
Building a Real Workflow Thursday morning, 9:00 am Lauren Michael Research Computing Facilitator University of Wisconsin - Madison.
The iPlant Collaborative Community Cyberinfrastructure for Life Science Tools and Services Workshop Discovery Environment Overview.
1 HKU CS Bioinformatics Research Siu Ming Yiu Department of Computer Science The University of Hong Kong Other faculty members: Prof. Francis Chin Prof.
Genetics 760: Genomic Methods for Genetic Analysis Course Organizer: Jim TAs: Tim
University of Connecticut School of Engineering Assembler Reference Abyss Simpson et al., J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones,
De novo assembly validation
Accurate estimation of microbial communities using 16S tags
LECTURE 02: EVALUATING MODELS January 27, 2016 SDS 293 Machine Learning.
….. The cloud The cluster…... What is “the cloud”? 1.Many computers “in the sky” 2.A service “in the sky” 3.Sometimes #1 and #2.
CyVerse Workshop Transcriptome Assembly. Overview of work RNA-Seq without a reference genome Generate Sequence QC and Processing Transcriptome Assembly.
MERmaid: Distributed de novo Assembler Richard Xia, Albert Kim, Jarrod Chapman, Dan Rokhsar.
Zach Miller Computer Sciences Department University of Wisconsin-Madison Supporting the Computation Needs.
Why do Research? To learn more about the world. To learn more about us. To improve our lives and protect the environment. Its Fun!
CyVerse Workshop Discovery Environment Overview. Welcome to the Discovery Environment A Simple Interface to Hundreds of Bioinformatics Apps, Powerful.
INTRODUCTION TO WIRELESS SENSOR NETWORKS
Big Data Analytics and HPC Platforms
HPC In The Cloud Case Study: Proteomics Workflow
CyVerse Tools and Services
Tools and Services Workshop
Joslynn Lee – Data Science Educator
Considerations for metagenomics data analysis and summary of workflows
Metafast High-throughput tool for metagenome comparison
Introduction to Algorithms
Single-molecule sequencing and chromatin conformation capture enable de novo reference assembly of the domestic goat genome.
به نام خدا Big Data and a New Look at Communication Networks Babak Khalaj Sharif University of Technology Department of Electrical Engineering.
CSCI1600: Embedded and Real Time Software
CS110: Discussion about Spark
Metagenomics Microbial community DNA extraction
DNA Interlude.
CSCI1600: Embedded and Real Time Software
Campus and Phoenix Resources
Presentation transcript:

C. Titus Brown Asst Prof, CSE and Micro Michigan State University

Acknowledgements Lab members involvedCollaborators Adina Howe (w/Tiedje) Jason Pell Arend Hintze Rosangela Canino-Koning Qingpeng Zhang Elijah Lowe Likit Preeyanon Jiarong Guo Tim Brom Kanchan Pavangadkar Eric McDonald Jim Tiedje, MSU Janet Jansson, LBNL Susannah Tringe, JGI Funding USDA NIFA; NSF IOS; BEACON.

Open science, blogging, etc. All pub-ready work is available through arXiv. “k-mer percolation” “diginorm” Future papers will go there on submission, too. I discuss this stuff regularly on my blog (ivory.idyll.org/blog/) and Twitter All source code (Python/C++) is freely available, open source, documented, tested, etc. (github.com/ctb/khmer)/ …life’s too short to hide useful approaches! ~20-50 people independently using our approaches.

Number of Sequences Number of OTUs Soil contains thousands to millions of species (“Collector’s curves” of ~species)

SAMPLING LOCATIONS

A “Grand Challenge” dataset (DOE/JGI)

Assembly It was the best of times, it was the wor, it was the worst of times, it was the isdom, it was the age of foolishness mes, it was the age of wisdom, it was th It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness …but for lots and lots of fragments!

Repeats do cause problems: Assemble based on word overlaps:

De Bruijn graph assembly: k-mers as nodes, connected using overlaps. J.R. Miller et al. / Genomics (2010)

Fun Facts about de Bruijn graphs Memory usage lower bound scales as # of unique k-mers. Real sequence + Sequencing errors Assembly is a “big graph” problem – famously difficult to parallelize. We are looking at graphs with billion nodes. Need bigmem machines.

3 years of compressible probabilistic de Bruijn graphs & lossy compression algorithms =>

Contig assembly now scales with underlying genome size Transcriptomes, microbial genomes incl MDA, and most metagenomes can be assembled in under 50 GB of RAM, with identical or improved results. Has solved some specific problems with eukaryotic genome assembly, too. Can do ~300 Gbp agricultural soil sample in 300 GB of RAM, ~15 days (compare: 3 TB for others).

What do we get from assembling soil? Total Assembly Total Contigs % Reads Assembled Predicted protein coding rplb genes 2.5 bill4.5 mill19%5.3 mill bill5.9 mill22%6.8 mill466 This estimates number of species ^ Adina Howe Putting it in perspective: Total equivalent of ~1200 bacterial genomes Human genome ~3 billion bp (20% of reads assembled; est 50 Gbp needed for thorough sampling.)

What’s next? (~2 years) Improved digital normalization algorithms. Lossy compression for resequencing analysis. Eukaryotic genome assembly, incl high-heterozygosity samples. ~1.1 pass error-correction approach. “Infinite” metagenomic/mRNAseq assembly. Application to 454, PacBio, etc. data. Reasonably confident we can solve all assembly scaling problems.

…workflows? Mostly we’re running on single machines now: single pass, fixed memory algorithms => discard your HPC. One big remaining problem (much bigger metagenome data!) remains ahead… Want to enable sophisticated users to efficiently run pipeline, tweak parameters, and re-run. How??? Hack Makefiles? NONONONONONONONONONO. It might catch on!

Things we don’t need more of: 1. Assemblers. 2. Workflow management systems.

Things people like to write 1. Assemblers. 2. Workflow management systems.

Slightly more serious thoughts For workflows distributed with my tool, Want to enable good default behavior. Want to avoid stifling exploration and adaptation. GUI as an option, but not a necessity.

Introducing… ipython notebook Interactive ipython prompt, with Mathematica style cells ipython already contains DAG dependency tools, various kinds of scheduling, and multiple topologies for parallel work spawning and gathering. And it’s python! At first glance, may not seem that awesome. It is worth a second & third glance, and an hour or three of your time.

…but it’s still young. ipynb is not built for long-running processes. Right now I’m using it for “terminal” data analysis, display, and publications; ‘make’ (just ‘make’) for doing analyses. Many plans for Periodic reporting Collaboration environments Replicable research …all of which is easily supported at a protocol level by the nb.

Our problems We have “Big Data” and not “Big Compute” problems. We need bigmem more than anything else. …so nobody likes us . We also need a workflow system that supports interaction. We really like Amazon. People at the NIH are exploring…