1
Cyrille2: a high throughput workflow management system
Mark Fiers
NBIC practical course on web services and workflow management, December 2007
2
Applied Bioinformatics - WUR
Wageningen University and Research Center
Plant Research International
Applied Bioinformatics
  sequencing center - Greenomics
  bioinformatics
3
overview
  introduction
  introduction to Cyrille2
  scheduling
  executing
  some high throughput issues
4
finished sequence AAACCTCCTTAATTTTTTTCCCCCCTCCCCTGGGGTTAAATTCATGGCATCAAGAGGCCAAGGGTACTGAGCA GATGAGGGACTGTCCTACTTCAATCCAGTCTTGCAATCATCTCACACTTCTGTGAGTTTTTTTTTCTCTTTCCCA TTCTTTTATATTATAGTCACCTAACTACGCATTCCCGTGCCCTAAACTTTTTTGAACATTCCTTTTTCTCATTAAT GGCATCAGCATCTGGAACATAATATGTGATTTTAGCTGGCCACGCTGGATGATTTTAAAAAATAGCTTCTTGTT AAATTAGCTGCACCCAACCCTGACAAAATGTGCAGAACTTCTGAGATAAAAATATTAGCATTATTCCCCTTGTG TTTTGTTCATTTATAATTTCCTGTTTATAGATATTATTTGCTTCAAAATATGAACAGAGCAGGACTGGTAATGGG TTGACGGTTACTCATTGGTTGATTGGTCATGGGGTTTCATATACTGTTCTGTATTTGGAAATGGGCAAAATTAT CCTTAACCATAATTTAGAGCCCCGGTGATGCAGTGGTTCAGTGTTTCGCCGCTAACCACTTTGAACCCACCTG CCACTTTGTGGGAGAAAAGACGGCAGTCTGCTCCAGTACAGATTTCAGCCTAGTAAAACTATGGGGCAGCTC TACTCTGTTCTATAGGGTTGTTGTGAGACAATCGACGCAACGGCAAACAGCAACAAACCATAATTAAAAAACAA AGCCACTGTATACCCAGTGCCATCGAGTCGACCGACTCATAGCGACCCTATAGGACAGTAGTAGAAAGGCCC AAATTATCACACAGCCACGTTGTGAACTATTAAGCATATTTTTTTTAAAAATCCCATATACATATGTATGAAAATT CACCAACACATCTAAAATAGTTCTTTTTAACTTCTGGATAATGGTGTCTTAGTTTACATTTATTACCTAGGGTTT TTGTTTGTTTTACAAATGTAATACTTTCATAACCTGAAAATTAATTTGTTACAGAGCTGCCTCCCATCACCTAAA ACTCGCCATCACCTGTTTAAGCCTCATGTTCTCAAATTTCCTACTCAAGTATTTTTATCGCAACCTTATGGAACT GGGGATTCTGACAGGCAGATAGATCCTTAAGATCTGTGTGGACTGGCAGATCACTCAAATTTTAATTTCACGA AGATTATTTTAATGAAAAAAAGCCTCAGTGTCTGACAAATTGTATGTTCTCCGTGAAATCCTCTGCACCTATTAC AAAGGCATAGAAGTATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTAT AATATGGCACCACTTTTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACA GTGTCTCTTCGTGAATTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATT ATATAGACACACATTCCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTC CCAAATGCTGCCACAACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTC AGGCTCTATGCCTTCCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG ATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTATAATATGGCACCACTT TTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACAGTGTCTCTTCGTGAA TTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATTATATAGACACACATT 
CCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTCCCAAATGCTGCCAC AACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTCAGGCTCTATGCCTT CCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG
5
finished sequence - annotate AAACCTCCTTAATTTTTTTCCCCCCTCCCCTGGGGTTAAATTCATGGCATCAAGAGGCCAAGGGTACTGAGCA GATGAGGGACTGTCCTACTTCAATCCAGTCTTGCAATCATCTCACACTTCTGTGAGTTTTTTTTTCTCTTTCCCA TTCTTTTATATTATAGTCACCTAACTACGCATTCCCGATGCCTAAACTTTTTTGAACATTCCTTTTTCTCATTAA TGGCATCAGCATCTGGAACATAATATGTGATTTTAGCTGGCCACGCTGGATGATTTTAAAAAATAGCTTCTT GTTAAATTAGCTGCACCCAACCCTGACAAAATGTGCAGAACTTCTGAGATAAAAATATTAGCATTATTCCCC TTGTGTTTTGTTCATTTATAATTTCCTGTTTATAGATATTATTTGCTTCAAAATATGAACAGAGCAGGACTGGT AATGGGTTGACGGTTACTCATTGGTTGATTGGTCATGGGGTTTCATATACTGTTCTGTATTTGGAAATGGGCA AAATTATCCTTAACCATAATTTAGAGCCCCGGTGATGCAGTGGTTCAGTGTTTCGCCGCTAACCACTTTGAACC CACCTGCCACTTTGTGGGAGAAAAGACGGCAGTCTGCTCCAGTACAGATTTCAGCCTAGTAAAACTATGGGG CAGCTCTACTCTGTTCTATAGGGTTGTTGTGAGACAATCGACGCAACGGCAAACAGCAACAAACCATAATTA AAAAACAAAGCCACTGTATACCCAGTGCCATCGAGTCGACCGACTCATAGCGACCCTATAGGACAGTAGTA GAAAGGCCCAAATTATCACACAGCCACGTTGTGAACTATTAAGCATATTTTTTTTAAAAATCCCATATACAT ATGTATGAAAATTCACCAACACATCTAAAATAGTTCTTTTTAACTTCTGGATAATGGTGTCTTAGTTTACATTT ATTACCTAGGGTTTTTGTTTGTTTTACAAATGTAATACTTTCATAACCTGAAAATTAATTTGTTACAGAGCTGC CTCCCATCACCTAAAACTCGCCATCACCTGTTTAAGCCTCATGTTCTCAAATTTCCTACTCAAGTATTTTTATC GCAACCTTATGGAACTGGGGATTCTGACAGGCAGATAGATCCTTAAGATCTGTGTGGACTGGCAGATCACTC AAATTTTAATTTCACGAAGATTATTTTAATGAAAAAAAGCCTCAGTGTCTGACAAATTGTATGTTCTCCGTGAAA TCCTCTGCACCTATTACAAAGGCATAGAAGTATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAG GACAAAATAGCCTGTATAATATGGCACCACTTTTGTAAAATCATGTAAGCACAGTAAATGGTAAAGGCTTTG AAAAACCTATCTTTAAACAGTGTCTCTTCGTGAATTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTA TGAATGTTCATACTTTATTATATAGACACACATTCCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAAC AGACTCTCACACACTCTTCCCAAATGCTGCCACAACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCAT GAGCCCAGGTGCCAGGTCAGGCTCTATGCCTTCCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAG GCAAGAGAATGCAGG ATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTATAATATGGCACCACTT TTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACAGTGTCTCTTCGTGAA TTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATTATATAGACACACATT 
CCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTCCCAAATGCTGCCAC AACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTCAGGCTCTATGCCTT CCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG disease resistance gene
6
finished sequence - size AAACCTCCTTAATTTTTTTCCCCCCTCCCCTGGGGTTAAATTCATGGCATCAAGAGGCCAAGGGTACTGAGCA GATGAGGGACTGTCCTACTTCAATCCAGTCTTGCAATCATCTCACACTTCTGTGAGTTTTTTTTTCTCTTTCCCA TTCTTTTATATTATAGTCACCTAACTACGCATTCCCGTGCCCTAAACTTTTTTGAACATTCCTTTTTCTCATTAAT GGCATCAGCATCTGGAACATAATATGTGATTTTAGCTGGCCACGCTGGATGATTTTAAAAAATAGCTTCTTGTT AAATTAGCTGCACCCAACCCTGACAAAATGTGCAGAACTTCTGAGATAAAAATATTAGCATTATTCCCCTTGTG TTTTGTTCATTTATAATTTCCTGTTTATAGATATTATTTGCTTCAAAATATGAACAGAGCAGGACTGGTAATGGG TTGACGGTTACTCATTGGTTGATTGGTCATGGGGTTTCATATACTGTTCTGTATTTGGAAATGGGCAAAATTAT CCTTAACCATAATTTAGAGCCCCGGTGATGCAGTGGTTCAGTGTTTCGCCGCTAACCACTTTGAACCCACCTG CCACTTTGTGGGAGAAAAGACGGCAGTCTGCTCCAGTACAGATTTCAGCCTAGTAAAACTATGGGGCAGCTC TACTCTGTTCTATAGGGTTGTTGTGAGACAATCGACGCAACGGCAAACAGCAACAAACCATAATTAAAAAACAA AGCCACTGTATACCCAGTGCCATCGAGTCGACCGACTCATAGCGACCCTATAGGACAGTAGTAGAAAGGCCC AAATTATCACACAGCCACGTTGTGAACTATTAAGCATATTTTTTTTAAAAATCCCATATACATATGTATGAAAATT CACCAACACATCTAAAATAGTTCTTTTTAACTTCTGGATAATGGTGTCTTAGTTTACATTTATTACCTAGGGTTT TTGTTTGTTTTACAAATGTAATACTTTCATAACCTGAAAATTAATTTGTTACAGAGCTGCCTCCCATCACCTAAA ACTCGCCATCACCTGTTTAAGCCTCATGTTCTCAAATTTCCTACTCAAGTATTTTTATCGCAACCTTATGGAACT GGGGATTCTGACAGGCAGATAGATCCTTAAGATCTGTGTGGACTGGCAGATCACTCAAATTTTAATTTCACGA AGATTATTTTAATGAAAAAAAGCCTCAGTGTCTGACAAATTGTATGTTCTCCGTGAAATCCTCTGCACCTATTAC AAAGGCATAGAAGTATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTAT AATATGGCACCACTTTTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACA GTGTCTCTTCGTGAATTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATT ATATAGACACACATTCCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTC CCAAATGCTGCCACAACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTC AGGCTCTATGCCTTCCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG ATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTATAATATGGCACCACTT TTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACAGTGTCTCTTCGTGAA TTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATTATATAGACACACATT 
CCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTCCCAAATGCTGCCAC AACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTCAGGCTCTATGCCTT CCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG
genome sizes
  Arabidopsis 130,000,000 bp
  Potato 840,000,000 bp
  Human 4,000,000,000 bp
  Wheat 15,000,000,000 bp
7
genome annotation
8
genome annotation: upload sequence
9
genome annotation: upload sequence, gene prediction (output: gene)
10
genome annotation: upload sequence (output: sequence), gene prediction (output: gene), BLAST (output: blast report)
11
genome annotation: upload sequence(s), repeat finder, gene prediction 1, gene prediction 2, consolidate, BLASTP, BLASTN, InterPro, miRNA prediction, predict targets
12
execute (many) third-party tools
time consuming
data intensive: multiple data formats
requires expert knowledge, e.g. BLAST settings
tools are mutually dependent: the output of one is the input of the next
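The mutual dependency of tools can be sketched as a small dependency graph. This is a minimal illustration and not Cyrille2 code: the tool names follow the genome annotation slides, while the dictionary layout and the function name are assumptions made for this example. It uses the standard library `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline description: each tool maps to the set of tools
# whose output it consumes. The layout is an assumption for this sketch.
PIPELINE = {
    "upload_sequence": set(),
    "gene_prediction": {"upload_sequence"},
    "blast": {"gene_prediction"},
}


def execution_order(pipeline):
    """Return the tool names so that every tool comes after its inputs."""
    return list(TopologicalSorter(pipeline).static_order())


order = execution_order(PIPELINE)
print(order)
```

Any workflow system has to respect such an order; the following slides describe how Cyrille2 derives the work to do from the objects that already exist.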
13
what do we need?
automated
high throughput
flexible
  extend with new tools
  add or change BACs
  change tools, e.g. a new BLAST database
reliable
  error tracking
  problem recovery
14
cyrille2
15
What about Taverna? (or others?)
16
brief system overview
components: user interface, pipeline database, status database, biological database, scheduler, executor, Linux cluster, tools (e.g. BLAST)
users: pipeline operator, end user
17
data transport & storage
BioMOBY
  XML-based standard
  web services
    remote execution of tools
    data transport
  uniform description of data
  uniform identification of data
18
iterative execution (diagram: pipeline database, status database, biological database)
19
pipeline execution, step 1: create pipeline (pipeline: seq, gen, blast)
20
pipeline execution, step 2: upload sequence (object icons: s)
21
pipeline execution, step 2: register in the status database, as a pointer (object icons: ss)
22
scheduler: the status database holds pointers to objects in the biological database
pointer = unique object identification: id, articlename, namespace, database id
23
system is db agnostic: a pointer can refer to another biological database
pointer = unique object identification: id, articlename, namespace, another database id
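A pointer of this kind could be modelled as a small immutable record. The four field names come from the slide; everything else here (the class name, the toy databases, the `resolve` helper and the meaning given to each field in the comments) is an assumption for illustration and not the Cyrille2 implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """Reference to an object that lives in some biological database."""
    id: str           # identifier of the object inside its database
    articlename: str  # assumed here: the role of the object, e.g. "sequence"
    namespace: str    # assumed here: the kind of identifier
    database_id: str  # which biological database holds the object


# Toy stand-ins for two separate biological databases (hypothetical data).
DATABASES = {
    "db_a": {"bac001": "AAACCTCCTTAATT"},
    "db_b": {"bac002": "GATGAGGGACTGTC"},
}


def resolve(pointer):
    """Fetch the data a pointer refers to, whichever database holds it."""
    return DATABASES[pointer.database_id][pointer.id]


first = Pointer("bac001", "sequence", "BAC", "db_a")
second = Pointer("bac002", "sequence", "BAC", "db_b")
print(resolve(first), resolve(second))
```

Because the database is just one field of the pointer, code that only passes pointers around does not need to know which biological database is behind them.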
24
pipeline execution, step 3: scheduler operates on the status db (pipeline: seq, gen, blast; object icons: ssg)
25
pipeline execution, step 4: executor (object icons: ssggggg)
26
pipeline execution, step 4: executor (object icons: ssggggg)
27
pipeline execution, back to step 3: scheduler (object icons: ssgggggbb)
28
pipeline execution, step 4: executor ... until finished (object icons: ssgggggbbbbbb)
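The alternation of scheduler (step 3) and executor (step 4) can be sketched as a loop that runs until no new jobs appear. This is a toy model under stated assumptions: the letters s, g and b are read as sequence, gene and BLAST report objects, the status database is a plain dictionary, and `fake_tool` stands in for real tools; none of these names come from Cyrille2 itself.

```python
# Pipeline nodes from the slides: seq -> gen -> blast.
# Each node consumes objects of one type and produces objects of another.
PIPELINE = [
    {"tool": "gen", "consumes": "s", "produces": "g"},
    {"tool": "blast", "consumes": "g", "produces": "b"},
]


def fake_tool(tool, obj):
    """Stand-in for a real tool; 'gen' yields two objects, others one."""
    count = 2 if tool == "gen" else 1
    return [f"{obj}.{tool}{i}" for i in range(count)]


def schedule(status, done):
    """Scheduler: list (node, object) pairs that have not been run yet."""
    jobs = []
    for node in PIPELINE:
        for obj, kind in status.items():
            if kind == node["consumes"] and (node["tool"], obj) not in done:
                jobs.append((node, obj))
    return jobs


def execute(jobs, status, done):
    """Executor: run every job and register its output objects."""
    for node, obj in jobs:
        for out in fake_tool(node["tool"], obj):
            status[out] = node["produces"]
        done.add((node["tool"], obj))


def run(status, done):
    """Alternate scheduler and executor until nothing is left to do."""
    rounds = 0
    while True:
        jobs = schedule(status, done)
        if not jobs:
            return rounds
        execute(jobs, status, done)
        rounds += 1


status = {"bac1": "s"}  # toy status database: object id -> object type
done = set()            # (tool, object) pairs that already ran
rounds = run(status, done)
print(rounds, sorted(status.values()))
```

Adding a BAC later, for example `status["bac2"] = "s"` followed by another `run(status, done)`, schedules jobs only for the new objects, because the earlier (tool, object) pairs are already recorded as done.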
29
pipeline execution: add a BAC (pipeline: seq, gen, blast; object icons: ssgggggbbbbbbss)
30
pipeline execution: add a BAC (object icons: ssgggggbbbbbbssg)
31
pipeline execution: add a BAC (object icons: ssgggggbbbbbbssggg)
32
pipeline execution: add a BAC (object icons: ssgggggbbbbbbssgggb)
33
pipeline execution: add a BAC (object icons: ssgggggbbbbbbssgggbbb)
34
change BLAST db (pipeline: seq, gen, blast; object icons: ssgggggbbbbbbssgggbbb)
35
pipeline execution: change BLAST db (object icons: ssgggggbbssgggb)
36
pipeline execution: change BLAST db (object icons: ssgggggbbbbbbssgggbbb)
37
scheduler
38
operates on the status database
status database
  stores all object relations
  allows complex scheduling operations
39
pipeline execution: more complex schedule steps (pipeline: seq, gen, blast; object icons: sgggbbbbsggbb, gbbgbb)
40
scheduler: more complex schedule steps, all objects at once (pipeline: seq, gen, X; object icons: sgggsgg, ggX)
41
scheduler: more complex schedule steps, all objects per analysis (pipeline: seq, gen, X; object icons: sgggsgg, ggXX)
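The difference between these schedule steps can be sketched as three ways of grouping input objects into jobs. The reading of "per analysis" as "grouped by the run that produced the objects" is an interpretation of the slides, and the gene and run identifiers are hypothetical.

```python
from collections import defaultdict

# (gene id, id of the gene prediction run that produced it); this mirrors
# the slide icons: one sequence with three genes, one with two.
GENES = [
    ("g1", "run1"), ("g2", "run1"), ("g3", "run1"),
    ("g4", "run2"), ("g5", "run2"),
]


def per_object(genes):
    """Default behaviour: one job for every single input object."""
    return [[gene] for gene, _run in genes]


def all_at_once(genes):
    """One job that receives all objects of the input type together."""
    return [[gene for gene, _run in genes]]


def per_analysis(genes):
    """One job for each analysis run, receiving all objects of that run."""
    groups = defaultdict(list)
    for gene, run in genes:
        groups[run].append(gene)
    return list(groups.values())


print(len(per_object(GENES)), len(all_at_once(GENES)), len(per_analysis(GENES)))
```

In this toy data the three modes create five jobs, one job and two jobs respectively, which matches the single X and the two X icons on the slides.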
42
scheduler: more complex schedule steps, based on object history
Upload sequences, Upload cDNAs, Align cDNA (Sim4)
43
scheduler: more complex schedule steps, based on object history
Upload sequences, Upload cDNAs, create BLAST db, BLAST, BLAST hit
BLAST hit ≈ segment containing cDNA
44
scheduler: more complex schedule steps, based on object history
Upload sequences, Upload cDNAs, create BLAST db, BLAST, BLAST hit, Align cDNA (Sim4)
45
scheduler: more complex schedule steps, based on object history
Upload sequences, Upload cDNAs, create BLAST db, BLAST, BLAST hit, Align cDNA (Sim4)
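One way to read these slides: instead of aligning every cDNA against every sequence, the scheduler looks at the history of each BLAST hit and schedules an alignment only for the cDNA and sequence that the hit links. The sketch below illustrates that idea with hypothetical object ids and a dictionary in place of the status database; it is an interpretation, not the documented Cyrille2 behaviour.

```python
# Hypothetical history records: every BLAST hit remembers the cDNA and
# the sequence it was derived from.
PARENTS = {
    "hit1": ("cdna1", "seqA"),
    "hit2": ("cdna2", "seqB"),
    "hit3": ("cdna1", "seqA"),  # a second hit for the same pair
}
CDNAS = ["cdna1", "cdna2", "cdna3"]
SEQUENCES = ["seqA", "seqB"]


def all_pairs(cdnas, sequences):
    """Naive schedule: align every cDNA against every sequence."""
    return [(cdna, seq) for cdna in cdnas for seq in sequences]


def pairs_from_history(parents):
    """History-based schedule: align only pairs linked by a BLAST hit."""
    return sorted(set(parents.values()))


naive = all_pairs(CDNAS, SEQUENCES)
targeted = pairs_from_history(PARENTS)
print(len(naive), targeted)
```

With this toy data the naive schedule contains six alignment jobs and the history-based schedule two.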
46
executor
47
executor (pipeline: seq, gen, blast; object icons: sgggbbsggb, gbgb)
48
executor with SGE cluster (pipeline: seq, gen, blast; object icons: sgggbbsggb, gbgbbbbbb)
49
single job execution: cyrille2 core; node: get from database, tool wrapper, store in db
50
single job execution: cyrille2 core; node: get from database, tool wrapper, store in db; biological database; status database; BioMOBY XML (arrow labels: pointer, data)
51
single job execution: cyrille2 core; node: get from database, tool wrapper, tool, store in db; biological database; status database; BioMOBY XML (arrow labels: data, pointer, data)
52
single job execution: cyrille2 core; node: get from database, tool wrapper, tool, store in db; biological database (shown twice); status database (shown twice); BioMOBY XML (arrow labels: data, pointer, data, pointer)
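The three stages of a single job (get from database, tool wrapper, store in db) can be sketched as one function that receives a pointer and returns a new pointer. The dictionaries, the pointer format and the reverse complement "tool" are stand-ins chosen for this example; a real job would call a third-party tool such as BLAST through its wrapper.

```python
import itertools

biological_db = {"seq1": "ATGGCATAA"}  # toy biological database
status_db = []                         # toy status database: list of pointers
_ids = itertools.count(1)              # source of new object ids


def reverse_complement(seq):
    """Stand-in for a third-party tool."""
    table = str.maketrans("ACGT", "TGCA")
    return seq.translate(table)[::-1]


def run_job(pointer, tool):
    """Run one job on a node: only pointers enter and leave this function."""
    data = biological_db[pointer]        # 1. get from database
    result = tool(data)                  # 2. tool wrapper runs the tool
    new_pointer = f"obj{next(_ids)}"
    biological_db[new_pointer] = result  # 3. store in db
    status_db.append(new_pointer)        # 4. register pointer in status db
    return new_pointer


new_pointer = run_job("seq1", reverse_complement)
print(new_pointer, biological_db[new_pointer])
```

The data itself is read and written only inside the job, so the core only has to hand out and collect pointers.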
53
high throughput issues
54
high throughput issues - lots of data
web services and XML do not cope well with lots of data
Cyrille2
  sends pointers
  avoids using XML
  limits the use of web services
55
high throughput issues - jobs
many small jobs
jobs taking a long time
Cyrille2
  possible to create batches of many small jobs
  execution model can handle long-running jobs
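Batching many small jobs can be as simple as cutting the job list into fixed-size chunks and submitting each chunk as one cluster job. The function name and the batch size are assumptions for this sketch; how Cyrille2 forms its batches is not described on the slide.

```python
def make_batches(jobs, batch_size):
    """Split a list of small jobs into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]


small_jobs = [f"blast_job_{n}" for n in range(7)]
batches = make_batches(small_jobs, 3)
print([len(batch) for batch in batches])
```

Each batch then costs one round of cluster scheduling overhead instead of one per job.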
56
high throughput issues - don't repeat yourself
update of the BLAST database
new version of a single input object (sequence)
Cyrille2
  repeat only that part of the pipeline that needs repeating
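Repeating only the affected part of a pipeline comes down to finding which stored objects became stale. The sketch below assumes that every object records the tool that produced it and its parent object; the record layout and function names are hypothetical.

```python
# object id -> (tool that produced it, parent object id)
STATUS = {
    "s1": ("upload", None),
    "g1": ("gen", "s1"),
    "g2": ("gen", "s1"),
    "b1": ("blast", "g1"),
    "b2": ("blast", "g2"),
    "s2": ("upload", None),
    "g3": ("gen", "s2"),
    "b3": ("blast", "g3"),
}


def descendants(status, roots):
    """Return the roots plus every object derived from them."""
    stale = set(roots)
    changed = True
    while changed:
        changed = False
        for obj, (_tool, parent) in status.items():
            if parent in stale and obj not in stale:
                stale.add(obj)
                changed = True
    return stale


def stale_after_tool_change(status, tool):
    """A tool changed (e.g. new BLAST db): its output and below is stale."""
    roots = {obj for obj, (t, _parent) in status.items() if t == tool}
    return descendants(status, roots)


def stale_after_new_version(status, obj):
    """An input object changed: everything derived from it is stale."""
    return descendants(status, {obj}) - {obj}


blast_stale = stale_after_tool_change(STATUS, "blast")
s1_stale = stale_after_new_version(STATUS, "s1")
print(sorted(blast_stale), sorted(s1_stale))
```

After a BLAST database update only the BLAST results are recomputed and the gene predictions are kept; after a new version of sequence s1 nothing derived from s2 is touched.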
57
questions? some screenshots
64
people: Mark Fiers, Erwin Datema, Ate van der Burgt, Joost de Groot, Sander Peters, Jan van Haarst, Marjo van Staveren, Rene Klein Lankhorst, Roeland van Ham
65
questions