Nature Reviews/2012. Next-Generation Sequencing (NGS): Data Generation NGS will generate more broadly applicable data for various novel functional assays.

Presentation transcript:

Nature Reviews/2012

Next-Generation Sequencing (NGS): Data Generation
NGS will generate more broadly applicable data for various novel functional assays
– Protein-DNA binding
– Histone modification
– Transcript levels
– Spatial interactions
– Combination of applications into larger studies
1000 Genomes Project

Next-Generation Sequencing (NGS): Data Interpretation
Meaningful interpretation of sequencing data is important
Interpretation relies heavily on complex computation
Major problems
– Low adoption of existing practices
– Difficulty of reproducibility

Problem 1: Low Adoption of Existing Practices
Example: Variant discovery
A series of accepted and accessible practices from the "1000 Genomes Project"
– 299 articles in 2011 cited this project
– Only 10 studies used the recommended tools
– Only 4 studies used the full workflow
Not following tested practices undermines the quality of biomedical research
Why low adoption?
– Overcomplicated logistical challenges (e.g. re-sorting input data)
– Limited applicability of the toolkit (e.g. only a handful of well-annotated genomes)
– Little agreement on what is considered to be the "best practice"

Problem 2: Difficulty of Reproducibility
Example: Read mapping
To repeat a mapping experiment, a reader needs: primary data, software and its version, parameter settings, name of reference genome
– Of 19 studies that cited the "1000 Genomes Project", only 6 provide all of these details
– Of 50 randomly selected papers using the Burrows-Wheeler Aligner, only 7 provide all of these details
Most results in today's publications cannot be accurately verified, reproduced, adopted or used
Why difficult?
– Lack of a mechanism for documenting analytical steps
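The four details listed above can be treated as a checklist. The following is a minimal sketch in Python, not taken from the presentation: the class name `MappingRecord`, the function `missing_details` and all example values are hypothetical, chosen only to illustrate how a missing detail could be flagged before publication.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MappingRecord:
    """Details a reader needs to repeat a read-mapping experiment (hypothetical schema)."""
    primary_data: Optional[str] = None       # where the raw reads can be obtained
    software: Optional[str] = None           # name of the mapping tool
    software_version: Optional[str] = None   # exact version that was run
    parameters: Optional[dict] = None        # parameter settings used for the run
    reference_genome: Optional[str] = None   # name of the reference genome


def missing_details(record: MappingRecord) -> list:
    """Return the names of all fields that are empty or were never filled in."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


# Illustrative values only: a record that names the tool but omits
# its version and its parameter settings.
record = MappingRecord(
    primary_data="reads.fastq",
    software="bwa",
    reference_genome="hg19",
)
print(missing_details(record))
```

A record for which `missing_details` returns an empty list corresponds to a study that "provides all details" in the sense used on this slide.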

Solution: Democratization of Biomedical Computation
To achieve democratization
– Developing best practices
– Removing obstacles associated with heterogeneous software
– Facilitating the interactive exploration of analysis parameters
– Promoting the concepts of analysis transparency and reproducibility

Potential of Integrative Frameworks
Combinations of diverse tools under the umbrella of a unified interface
– E.g. BioExtract, Galaxy, GenePattern, GeneProf
Advantages
1. Making data analysis transparent and reproducible
2. Making use of high-performance computing infrastructure
3. Improving long-term archiving

1. Promoting Transparency and Reproducibility
Automatic tracking, recording and dissemination of all details of computational analyses
– GenePattern: embeds details into Microsoft Word documents while preparing a publication
– Galaxy: creates interactive Web-based supplements with analysis details
Allows readers to inspect the described analysis in detail
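The idea of automatic tracking can be sketched in a few lines of Python. This is an illustration only and does not reflect how GenePattern or Galaxy are implemented: the decorator `tracked`, the list `HISTORY`, the helper `export_history` and the toy step `filter_reads` are all hypothetical names.

```python
import functools
import json

# In-memory log of every analysis step that has been run (hypothetical).
HISTORY = []


def tracked(step):
    """Wrap an analysis step so its name and arguments are recorded on each call."""
    @functools.wraps(step)
    def wrapper(*args, **kwargs):
        result = step(*args, **kwargs)
        HISTORY.append({
            "step": step.__name__,
            "args": list(args),
            "kwargs": kwargs,
        })
        return result
    return wrapper


@tracked
def filter_reads(reads, min_length=20):
    """Toy analysis step: keep only reads of at least min_length bases."""
    return [r for r in reads if len(r) >= min_length]


def export_history():
    """Serialize the recorded steps so they can accompany a publication."""
    return json.dumps(HISTORY, indent=2, sort_keys=True)


kept = filter_reads(["ACGT", "ACGTACGT"], min_length=5)
print(export_history())
```

Because the record is written by the wrapper and not by the analyst, the parameter settings are captured even when nobody remembers to note them down, which is the point of automatic tracking.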

2. Using High-performance Computing Infrastructure
High-performance computing resources
– Computing clusters at institutions or nationwide efforts, e.g. XSEDE
– Private and public clouds
Not accessible to the broad biomedical community
– Virtual machines or application-programming interfaces
With integrative frameworks, anyone can deploy a solution on any type of resource
– E.g. CloudMan: user interface for managing computing clusters on cloud resources

3. Improving Long-term Archiving
General vulnerability of centralized resources: longevity of hosted analysis services
– Depends on various external factors, e.g. funding climate
With integrative frameworks
– Create snapshots of a particular analysis
– Compose virtual machine images from an analysis to be stored as an archival resource (e.g. in the Dryad system or Figshare)
– Export the complete collection of analyses automatically for archiving
Anyone can recreate a new virtual instance from this archive
– Improved reproducibility
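A much simpler stand-in for the snapshot idea is to bundle an analysis directory into a single archive together with a checksum manifest. The sketch below uses only the Python standard library. It is not a virtual machine image and is not how any of the frameworks named in this presentation export analyses. The function names and the `MANIFEST.json` file name are assumptions made for illustration.

```python
import hashlib
import json
import tarfile
from pathlib import Path

MANIFEST_NAME = "MANIFEST.json"  # hypothetical file name for the checksum list


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(analysis_dir, archive_path):
    """Write a checksum manifest into analysis_dir, then pack it as a .tar.gz.

    archive_path should point outside analysis_dir so the archive
    does not try to include itself.
    """
    analysis_dir = Path(analysis_dir)
    manifest = {
        p.relative_to(analysis_dir).as_posix(): sha256_of(p)
        for p in sorted(analysis_dir.rglob("*"))
        if p.is_file() and p.name != MANIFEST_NAME
    }
    manifest_file = analysis_dir / MANIFEST_NAME
    manifest_file.write_text(json.dumps(manifest, indent=2, sort_keys=True))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(analysis_dir, arcname=analysis_dir.name)
    return manifest
```

Calling `snapshot("my_analysis", "my_analysis.tar.gz")` (both paths hypothetical) yields one file that could be deposited in a repository. The manifest lets a later reader confirm that the unpacked files are the ones originally archived.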

Future Directions: Tool Distribution
Current practice
– Tools need to be compiled, installed and supplied with associated data (e.g. a short-read mapper requires genome indices)
Better practice
– Digital platforms providing a set of tools to be automatically installed into users' integrative framework environment (pioneering work: e.g. GParc, Galaxy Tool Shed)
– Allow sharing of analysis workflows, data sets, visualizations and any other analysis artifacts

Future Directions: Integrate Analysis and Visualization
Current practice
– Visualization is the last step of an analysis
Better practice
– Visualization as an active component during analysis
Advantages
– Users can see directly, in real time, how parameter changes affect the final result
– In the context of publication, it helps readers evaluate and inspect the results

Conclusion
To sustain the growing application of NGS, data interpretation must be as accessible as data generation
Necessary to bridge the gap between experimentalists and computational scientists
– Experimentalists should embrace the unavoidable computational components
– Computational scientists should ensure that their software is appealing to use
Emergence of integrative frameworks
– Tracking details precisely
– Ensuring transparency and reproducibility