Published by Miranda Stokes. Modified over 9 years ago.
1
Challenges in data management and running DNA sequencing experiments on grid
EGI community forum – March 27, 2012
Barbera van Schaik, Mark Santcroos, Vladimir Korkhov, Aldo Jongejan, Marcel Willemsen, Antoine van Kampen and Silvia Olabarriaga
2
Introduction to the groups
Sequence facility
Research laboratories
Bioinformatics NGS team
e-BioScience team
Grid
3
Proof of concept: 30x speed-up
Application is currently used by the virus discovery unit
Presented at EGEE 2010: BLAST for virus discovery
“Last week we did a new sequence run and we found 3 new viruses the next day!”
4
How (1)
e-BioInfra architecture
Silvia Olabarriaga et al. (2010) IEEE Transactions on Information Technology in Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
5
How (2)
Workflow technology
Agile development
Iteration strategy
Re-use of components
Replace components when better tools are available
Visual representation of analysis steps in workflow
J. Montagnat et al. (2009) Workshop on Workflows in Support of Large-Scale Science (WORKS'09)
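The idea of re-using components and replacing one when a better tool appears can be sketched as follows. This is only an illustration: the `Workflow` class, the step names and the toy "aligners" are hypothetical and are not part of the e-BioInfra workflow engine.

```python
from typing import Callable, Dict, List

Step = Callable[[List[str]], List[str]]


class Workflow:
    """Hypothetical workflow: an ordered set of named, replaceable steps."""

    def __init__(self) -> None:
        self.steps: Dict[str, Step] = {}  # dicts preserve insertion order

    def add(self, name: str, step: Step) -> None:
        self.steps[name] = step

    def replace(self, name: str, step: Step) -> None:
        """Swap one component; its position in the workflow is unchanged."""
        if name not in self.steps:
            raise KeyError(name)
        self.steps[name] = step

    def run(self, data: List[str]) -> List[str]:
        for step in self.steps.values():
            data = step(data)
        return data


def drop_short_reads(reads: List[str]) -> List[str]:
    return [r for r in reads if len(r) >= 4]


def toy_aligner(reads: List[str]) -> List[str]:
    return [f"{r}:aligned-v1" for r in reads]


def better_toy_aligner(reads: List[str]) -> List[str]:
    return [f"{r}:aligned-v2" for r in reads]


wf = Workflow()
wf.add("filter", drop_short_reads)
wf.add("align", toy_aligner)
before = wf.run(["ACGTAC", "AC"])

# A better method became available: replace only the alignment component.
wf.replace("align", better_toy_aligner)
after = wf.run(["ACGTAC", "AC"])
```

The filter step is untouched by the replacement, which is the point of composing an analysis from components.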
6
Changes: diversity of analyses
Which gene(s) cause disease Z?
Are there specific microRNAs in HIV infected patients?
We have sequenced 20 bacterial genomes, what are the commonalities / differences?
Which genes are differentially expressed in situation X versus Y?
Workflows have been implemented for these cases
7
Common in most projects: BWA
Aligns sequences to a reference database
–Human genome
–HIV genome
–Bacterial genome
Designed especially for shorter sequences
Puts entire database in memory and aligns all experiment sequences
Run time is almost linear in the number of sequences
http://bio-bwa.sourceforge.net/
8
Changes: expansion of the DNA sequence facility
~1 GB per run → ~60 GB per run → ~120 GB per run
In total around 16 TB per year
After data analysis: 10x size of the input data
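A rough storage estimate follows from the two figures on this slide. The sketch assumes that "10x" means the analysis output is ten times the input and that the raw data is kept alongside it; both are interpretations, not statements from the slide.

```python
RAW_TB_PER_YEAR = 16   # "around 16 TB per year" (from the slide)
EXPANSION_FACTOR = 10  # "10x size of the input data" (from the slide)


def yearly_storage_tb(raw_tb: float, factor: float) -> dict:
    """Return raw, derived and total storage in TB for one year."""
    derived = raw_tb * factor
    return {"raw": raw_tb, "derived": derived, "total": raw_tb + derived}


estimate = yearly_storage_tb(RAW_TB_PER_YEAR, EXPANSION_FACTOR)
print(estimate)
```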
9
Datasets per grid job became larger
8 GB → 16 GB → 70 GB → ? GB
Result: job time-outs and disk quota per job reached
10
Improvements for BWA – split the input data
Split / Merge
+ speed-up
+ smaller files per job
- more jobs → more failed jobs
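A minimal sketch of the split and merge steps, assuming the input is an uncompressed FASTQ file with four lines per read. The function names and the tiny input are made up for illustration. Real alignment output such as SAM carries a header, so merging it needs more care than the plain concatenation shown here.

```python
import tempfile
from itertools import islice
from pathlib import Path
from typing import List

LINES_PER_READ = 4  # one FASTQ record spans four lines


def split_fastq(src: Path, out_dir: Path, reads_per_chunk: int) -> List[Path]:
    """Write consecutive chunks of whole reads and return the chunk paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    chunks: List[Path] = []
    with src.open() as handle:
        while True:
            block = list(islice(handle, reads_per_chunk * LINES_PER_READ))
            if not block:
                break
            chunk = out_dir / f"part{len(chunks):04d}.fastq"
            chunk.write_text("".join(block))
            chunks.append(chunk)
    return chunks


def merge_parts(parts: List[Path], dest: Path) -> None:
    """Concatenate per-chunk files in chunk order (assumes header-free files)."""
    with dest.open("w") as out:
        for part in parts:
            out.write(part.read_text())


# Tiny made-up input: five reads, split into chunks of two reads each.
with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp)
    reads = "".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5))
    (work / "sample.fastq").write_text(reads)
    chunks = split_fastq(work / "sample.fastq", work / "chunks", reads_per_chunk=2)
    merge_parts(chunks, work / "merged.fastq")
    chunk_count = len(chunks)
    round_trip_ok = (work / "merged.fastq").read_text() == reads
```

Smaller chunks keep each grid job within its time and disk limits, at the cost of more jobs to track.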
11
Implemented loops in workflow
http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/CreatingWorkflows
Workflow steps: Split, Process, Check
Check: checks if all files are generated
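The check-and-repeat loop can be sketched like this. It is a hypothetical stand-in, not the workflow engine's own loop construct: `run_until_complete`, `flaky_process` and the chunk names are invented for the example, and a job failure is simulated by an exception.

```python
from typing import Callable, Dict, Iterable, List, Set


def run_until_complete(
    chunks: Iterable[str],
    process: Callable[[str], None],
    is_done: Callable[[str], bool],
    max_rounds: int = 3,
) -> List[str]:
    """Process chunks, re-run those without output, return what is still missing."""
    pending = list(chunks)
    for _ in range(max_rounds):
        if not pending:
            break
        for chunk in pending:
            try:
                process(chunk)
            except Exception:
                pass  # a failed job simply leaves its output missing
        # The "check" step: keep only chunks whose output was not generated.
        pending = [c for c in pending if not is_done(c)]
    return pending


produced: Set[str] = set()
attempts: Dict[str, int] = {}


def flaky_process(chunk: str) -> None:
    """Simulated grid job: the first attempt for 'part2' fails."""
    attempts[chunk] = attempts.get(chunk, 0) + 1
    if chunk == "part2" and attempts[chunk] == 1:
        raise RuntimeError("simulated job failure")
    produced.add(chunk)


missing = run_until_complete(
    ["part1", "part2", "part3"], flaky_process, produced.__contains__
)
```

Only the failed chunk is re-run, so a single failed job does not force the whole dataset to be processed again.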
12
More changes and challenges: analyzing many big datasets
http://www.dutchgenomeproject.com/
Total raw data: 45 TB
After alignment: 10x increase
Project partners are performing consecutive analyses on grid
13
But first… getting the data on grid storage
This step was less than ideal
–It took one week to transfer 10 TB
Luckily there is a more efficient system now
This type of transfer (HD > grid storage) will definitely occur more often
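To put "one week for 10 TB" in perspective, the average transfer rate can be worked out directly. The sketch assumes decimal units (1 TB = 1,000,000 MB) and a constant rate. Applying that rate to the 45 TB of raw data mentioned for the Dutch genome project is an extrapolation for illustration, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60
MB_PER_TB = 1_000_000  # decimal units, an assumption


def sustained_rate_mb_per_s(size_tb: float, days: float) -> float:
    """Average rate in MB/s needed to move size_tb in the given number of days."""
    return size_tb * MB_PER_TB / (days * SECONDS_PER_DAY)


def transfer_days(size_tb: float, rate_mb_per_s: float) -> float:
    """Days needed to move size_tb at a constant rate."""
    return size_tb * MB_PER_TB / rate_mb_per_s / SECONDS_PER_DAY


observed_rate = sustained_rate_mb_per_s(10, 7)         # roughly 16.5 MB/s
days_for_45_tb = transfer_days(45, observed_rate)      # 4.5 times as long
print(f"{observed_rate:.1f} MB/s, {days_for_45_tb:.1f} days for 45 TB")
```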
14
After data analysis: share results
Tomorrow 11:20, Tom Visser (Sara), this room
http://www.beehub.nl/
LFC
WIKI
15
Changes in the workflow engine
- Needed to convert all component descriptions and workflows
+ End-users from Virus Discovery didn’t notice (except for changes to the web-service URL and monitoring dashboard)
16
New changes ahead
Bioinformaticians just got introduced to the portal
Need to convert all 150 applications (again)?
17
Why go through all this trouble?
Why not write scripts instead of workflows?
Why not buy a bigger cluster?
18
Tools for next generation sequencing
http://seqanswers.com/wiki/Software
500 new tools for sequencing in the past two years!
Better method available? Just replace component.
19
And … more data is expected
http://www.wellcome.ac.uk/Education-resources/Teaching-and-education/Big-Picture/All-issues/Genes-Genomes-and-Health/WTDV027167.htm
Figure: data throughput for each DNA sequencing method
20
Genome projects
Human genome project (1 individual)
Exome sequencing (~10 individuals)
Genome of the Netherlands (770 individuals)
1000 Genomes Project (1000 individuals)
UK 10K project (10,000 individuals)
…
URLs are in notes of this presentation
21
Finally: An example of an in-house project
Measure gene activity
Measure non-protein-coding gene activity
Search for mutations causing disease (exome sequencing)
22
Verification of de novo mutations
De novo mutations found in Nicolaides-Baraitser patients
Reviewers: Are these mutations specific for the disease?
Deadline: yesterday :)
Variants of 223 healthy people
Variants of 770 healthy people
Implementing the workflow and gathering input data: 2 weeks
Run time: 1 day
Repeat with more samples
Run time: 1.5 days
Annotation of variants
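The reviewers' question comes down to checking whether the candidate mutations also occur in healthy people. A minimal sketch of that comparison, assuming each variant is identified by chromosome, position, reference allele and alternate allele. The variants below are placeholders, not data from the study, and the actual workflow is not described on the slide.

```python
from typing import Iterable, List, Set, Tuple

# chromosome, position, reference allele, alternate allele
Variant = Tuple[str, int, str, str]


def absent_from_controls(
    candidates: Iterable[Variant], controls: Set[Variant]
) -> List[Variant]:
    """Keep candidate variants that were not observed in any healthy control."""
    return [v for v in candidates if v not in controls]


# Placeholder variants for illustration only.
candidates: List[Variant] = [
    ("chr1", 1000, "A", "G"),
    ("chr2", 2000, "C", "T"),
    ("chr3", 3000, "G", "A"),
]
controls: Set[Variant] = {
    ("chr2", 2000, "C", "T"),
    ("chr9", 9000, "T", "C"),
}

remaining = absent_from_controls(candidates, controls)
```

Holding the control variants in a set keeps each lookup constant-time, so repeating the check with a larger control group mainly costs the time to load it.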
23
How e-science changes the work for bioinformaticians and biomedical researchers
Respond to requests quickly
Share both data and methods
Analyze multiple datasets at once
Work on several projects simultaneously
24
Acknowledgements
Virus discovery unit, AMC: Lia van der Hoek, Bas Oude Munnink, Michel de Vries
Department of genome analysis, AMC: Frank Baas, Ted Bradley, Marja Jakobs
Department of Pediatrics, AMC: Raoul Hennekam
Laboratory division of AMC
Bioinformatics Laboratory, AMC: Antoine van Kampen
NGS bioinformatics team: Aldo Jongejan, Marcel Willemsen
e-Bioscience team: Silvia Olabarriaga, Mark Santcroos, Vladimir Korkhov, Souley Madougou, Kyriacos Neocleous, Shayan Shahand
University of Amsterdam: Piter de Boer
BiG Grid: Jan Just Keijser, Tom Visser, Grid support
Modalis, France: Johan Montagnat
Creatis, France: Tristan Glatard
http://www.bioinformaticslaboratory.nl/