Challenges in data management and running DNA sequencing experiments on grid EGI community forum – March 27, 2012 Barbera van Schaik, Mark Santcroos, Vladimir.

Presentation transcript:

Challenges in data management and running DNA sequencing experiments on grid EGI community forum – March 27, 2012 Barbera van Schaik, Mark Santcroos, Vladimir Korkhov, Aldo Jongejan, Marcel Willemsen, Antoine van Kampen and Silvia Olabarriaga

Introduction to the groups: sequence facility, research laboratories, bioinformatics NGS team, e-BioScience team, grid

Presented at EGEE 2010: BLAST for virus discovery. Proof of concept: 30x speed-up. The application is currently used by the virus discovery unit. “Last week we did a new sequence run and we found 3 new viruses the next day!”

How (1): e-BioInfra architecture. Silvia Olabarriaga et al. (2010) IEEE Transactions on Information Technology in Biomedicine; Tristan Glatard (2008) International Journal of High Performance Computing Applications

How (2): workflow technology. Agile development; iteration strategy; re-use of components; replace components when better tools are available; visual representation of analysis steps in workflow. J. Montagnat et al. (2009) Workshop on Workflows in Support of Large-Scale Science (WORKS'09)

Changes: diversity of analyses Which gene(s) cause disease Z? Are there specific microRNAs in HIV infected patients? We have sequenced 20 bacterial genomes, what are the commonalities / differences? Which genes are differentially expressed in situation X versus Y? Workflows have been implemented for these cases

Common in most projects: BWA. Aligns sequences to a reference database (human genome, HIV genome, bacterial genome). Especially designed for shorter sequences. Puts the entire database in memory and aligns all experiment sequences. Run time is almost linear in the number of sequences.
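As a rough illustration of the alignment step described on this slide, the sketch below builds the commands for a single-end BWA run. It assumes the classic short-read commands `bwa index`, `bwa aln` and `bwa samse`; the slides do not say which BWA mode or options were used, and the function name and file naming are hypothetical.

```python
def bwa_single_end_commands(reference, reads, prefix):
    """Return (argv, stdout_path) pairs for a single-end BWA alignment.

    Assumption: the classic short-read mode (index, aln, samse) is used.
    A stdout_path of None means the command writes its own output files.
    """
    sai = f"{prefix}.sai"
    sam = f"{prefix}.sam"
    return [
        # Build the index once per reference database.
        (["bwa", "index", reference], None),
        # Align the reads; the result is written to stdout.
        (["bwa", "aln", reference, reads], sai),
        # Convert the alignments to SAM; the result is written to stdout.
        (["bwa", "samse", reference, sai, reads], sam),
    ]
```

Each command could then be executed with `subprocess.run(argv, stdout=..., check=True)`, redirecting stdout to the given path where one is listed.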

Changes: expansion of the DNA sequence facility. ~1 GB per run, ~60 GB per run, ~120 GB per run. In total around 16 TB per year. After data analysis: 10x the size of the input data.

Datasets per grid job became larger: 8 GB, 16 GB, 70 GB, ? GB. Result: job time-outs, and the disk quota per job was reached.

Improvements for BWA: split the input data (split, then merge). + speed-up; + smaller files per job; - more jobs → more failed jobs
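The split and merge idea on this slide can be sketched as follows. This is a minimal illustration, assuming the input is a FASTQ file with four lines per read; the function names, chunk naming and chunk size are hypothetical and not taken from the actual workflows.

```python
from itertools import islice
from pathlib import Path


def split_fastq(path, reads_per_chunk, out_dir):
    """Split a FASTQ file (4 lines per read) into chunks of at most reads_per_chunk reads."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks = []
    with open(path) as handle:
        while True:
            # Read the next block of whole reads (4 lines each).
            lines = list(islice(handle, 4 * reads_per_chunk))
            if not lines:
                break
            chunk = out_dir / f"chunk_{len(chunks):04d}.fastq"
            chunk.write_text("".join(lines))
            chunks.append(chunk)
    return chunks


def merge_outputs(parts, merged_path):
    """Concatenate per-chunk result files, in order, into one file.

    Assumption: the parts are plain text without headers. Real alignment
    output (e.g. SAM) carries a header per file that needs extra care.
    """
    with open(merged_path, "w") as out:
        for part in parts:
            out.write(Path(part).read_text())
    return Path(merged_path)
```

Smaller chunks mean shorter jobs and smaller files per job, at the cost of more jobs to track, which is the trade-off the slide lists.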

Implemented loops in workflow: checks if all files are generated. Steps: split, process, check.
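A check loop of this kind might look like the sketch below. It is an assumption about the general pattern, not the workflow engine's own mechanism: `process` and `output_exists` are hypothetical callables supplied by the caller, and `max_rounds` is an illustrative limit.

```python
def run_until_complete(chunks, process, output_exists, max_rounds=3):
    """Process every chunk, then re-submit only the chunks whose output is missing.

    Returns the chunks that are still missing after max_rounds rounds.
    """
    pending = list(chunks)
    for _ in range(max_rounds):
        if not pending:
            break
        for chunk in pending:
            try:
                process(chunk)
            except Exception:
                # A failed job is left in the pending list and retried next round.
                pass
        # The check step: keep only chunks without a generated output file.
        pending = [c for c in pending if not output_exists(c)]
    return pending
```

Checking for the output rather than trusting the job's exit status also catches jobs that were killed by a time-out or a disk quota.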

More changes and challenges: analyzing many big datasets. Total raw data: 45 TB. After alignment: 10x increase. Project partners are performing consecutive analyses on grid.
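The storage implication of these figures can be worked out with a small helper. This is only a back-of-the-envelope sketch: it assumes "10x increase" means the aligned data is ten times the raw data, and that the raw data is kept alongside it; the function name is hypothetical.

```python
def storage_after_alignment(raw_tb, growth_factor=10, keep_raw=True):
    """Return (aligned_tb, total_tb) for a given amount of raw data in TB.

    Assumption: aligned output is growth_factor times the raw input.
    """
    aligned = raw_tb * growth_factor
    total = aligned + raw_tb if keep_raw else aligned
    return aligned, total
```

Under these assumptions, 45 TB of raw data leads to 450 TB of aligned data, or 495 TB in total if the raw data is retained.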

But first… getting the data onto grid storage. This step was less than ideal: it took one week to transfer 10 TB. Luckily there is a more efficient system now. These types of transfers (HD > grid storage) will definitely occur more often.
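To put the transfer time in perspective, the average throughput can be estimated as below. The sketch assumes decimal units (1 TB = 10^12 bytes) and a transfer that ran continuously for the full week; both function names are hypothetical.

```python
def transfer_rate_mb_per_s(terabytes, days):
    """Average throughput in MB/s, using decimal units (1 TB = 10**12 bytes)."""
    seconds = days * 24 * 60 * 60
    return terabytes * 10**12 / seconds / 10**6


def days_to_transfer(terabytes, rate_mb_per_s):
    """Days needed to move a dataset at a given sustained rate in MB/s."""
    return terabytes * 10**12 / (rate_mb_per_s * 10**6) / 86400
```

Under these assumptions, 10 TB in one week corresponds to roughly 16.5 MB/s, and at that rate the 45 TB of raw data mentioned earlier would take about a month to upload.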

After data analysis: share results (LFC, wiki). Tomorrow 11:20, Tom Visser (Sara), this room.

Changes in the workflow engine. - Needed to convert all component descriptions and workflows. + End-users from Virus Discovery didn’t notice (except for changes to the web-service URL and monitoring dashboard).

New changes ahead. Bioinformaticians just got introduced to the portal. Need to convert all 150 applications (again)?

Why go through all this trouble? Why not write scripts instead of workflows? Why not buy a bigger cluster?

Tools for next generation sequencing new tools for sequencing in the past two years! Better method available? Just replace component.

And … more data is expected Data throughput for each DNA sequence method

Genome projects: Human genome project (1 individual); Exome sequencing (~10 individuals); Genome of the Netherlands (770 individuals); 1000 genome project (1000 individuals); UK 10K project (10,000 individuals); … URLs are in the notes of this presentation.

Finally: an example of an in-house project. Measure gene activity; measure non-protein-coding gene activity; search for mutations causing disease (exome sequencing).

Verification of de novo mutations. De novo mutations found in Nicolaides-Baraitser patients. Reviewers: are these mutations specific for the disease? Deadline: yesterday :) Variants of 223 healthy people; variants of 770 healthy people. Implementation of the workflow and gathering input data: 2 weeks. Run time: 1 day. Repeat with more samples: run time 1.5 days. Annotation of variants.
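The core of such a verification can be sketched as a filter that keeps only the candidate mutations not seen in any healthy individual. This is a simplified illustration, not the project's workflow: it assumes each variant is represented as a `(chromosome, position, ref, alt)` tuple, and the function name is hypothetical.

```python
def disease_specific(candidates, control_variant_sets):
    """Return the candidate variants that occur in none of the control individuals.

    candidates: iterable of (chromosome, position, ref, alt) tuples.
    control_variant_sets: one set of such tuples per healthy individual.
    """
    # Pool the variants of all controls into one lookup set.
    seen_in_controls = set().union(*control_variant_sets)
    return [v for v in candidates if v not in seen_in_controls]
```

Because the filter is independent of the number of control sets, repeating the analysis with more samples only changes the input, not the method.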

How e-science changes the work for bioinformaticians and biomedical researchers: respond to requests quickly; share both data and methods; analyze multiple datasets at once; work on several projects simultaneously.

Acknowledgements
Virus discovery unit, AMC: Lia van der Hoek, Bas Oude Munnink, Michel de Vries
Department of genome analysis, AMC: Frank Baas, Ted Bradley, Marja Jakobs
Department of Pediatrics, AMC: Raoul Hennekam
Laboratory division of AMC
Bioinformatics Laboratory, AMC: Antoine van Kampen
NGS bioinformatics team: Aldo Jongejan, Marcel Willemsen
e-Bioscience team: Silvia Olabarriaga, Mark Santcroos, Vladimir Korkhov, Souley Madougou, Kyriacos Neocleous, Shayan Shahand
University of Amsterdam: Piter de Boer
BiG Grid: Jan Just Keijser, Tom Visser, Grid support
Modalis, France: Johan Montagnat
Creatis, France: Tristan Glatard