A high throughput workflow management system: Cyrille2. Mark Fiers. NBIC practical course on web services and workflow management, December 2007.


1 A high throughput workflow management system: Cyrille2. Mark Fiers. NBIC practical course on web services and workflow management, December 2007

2 Applied Bioinformatics - WUR: Wageningen University and Research Center; Plant Research International; Applied Bioinformatics; sequencing center - Greenomics; bioinformatics

3 overview: introduction; introduction to Cyrille2; scheduling; executing; some high throughput issues

4 finished sequence AAACCTCCTTAATTTTTTTCCCCCCTCCCCTGGGGTTAAATTCATGGCATCAAGAGGCCAAGGGTACTGAGCA GATGAGGGACTGTCCTACTTCAATCCAGTCTTGCAATCATCTCACACTTCTGTGAGTTTTTTTTTCTCTTTCCCA TTCTTTTATATTATAGTCACCTAACTACGCATTCCCGTGCCCTAAACTTTTTTGAACATTCCTTTTTCTCATTAAT GGCATCAGCATCTGGAACATAATATGTGATTTTAGCTGGCCACGCTGGATGATTTTAAAAAATAGCTTCTTGTT AAATTAGCTGCACCCAACCCTGACAAAATGTGCAGAACTTCTGAGATAAAAATATTAGCATTATTCCCCTTGTG TTTTGTTCATTTATAATTTCCTGTTTATAGATATTATTTGCTTCAAAATATGAACAGAGCAGGACTGGTAATGGG TTGACGGTTACTCATTGGTTGATTGGTCATGGGGTTTCATATACTGTTCTGTATTTGGAAATGGGCAAAATTAT CCTTAACCATAATTTAGAGCCCCGGTGATGCAGTGGTTCAGTGTTTCGCCGCTAACCACTTTGAACCCACCTG CCACTTTGTGGGAGAAAAGACGGCAGTCTGCTCCAGTACAGATTTCAGCCTAGTAAAACTATGGGGCAGCTC TACTCTGTTCTATAGGGTTGTTGTGAGACAATCGACGCAACGGCAAACAGCAACAAACCATAATTAAAAAACAA AGCCACTGTATACCCAGTGCCATCGAGTCGACCGACTCATAGCGACCCTATAGGACAGTAGTAGAAAGGCCC AAATTATCACACAGCCACGTTGTGAACTATTAAGCATATTTTTTTTAAAAATCCCATATACATATGTATGAAAATT CACCAACACATCTAAAATAGTTCTTTTTAACTTCTGGATAATGGTGTCTTAGTTTACATTTATTACCTAGGGTTT TTGTTTGTTTTACAAATGTAATACTTTCATAACCTGAAAATTAATTTGTTACAGAGCTGCCTCCCATCACCTAAA ACTCGCCATCACCTGTTTAAGCCTCATGTTCTCAAATTTCCTACTCAAGTATTTTTATCGCAACCTTATGGAACT GGGGATTCTGACAGGCAGATAGATCCTTAAGATCTGTGTGGACTGGCAGATCACTCAAATTTTAATTTCACGA AGATTATTTTAATGAAAAAAAGCCTCAGTGTCTGACAAATTGTATGTTCTCCGTGAAATCCTCTGCACCTATTAC AAAGGCATAGAAGTATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTAT AATATGGCACCACTTTTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACA GTGTCTCTTCGTGAATTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATT ATATAGACACACATTCCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTC CCAAATGCTGCCACAACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTC AGGCTCTATGCCTTCCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG ATATTCACTAACATGGGGGAAGAGTTCACAGTTTGAAAAAAGGACAAAATAGCCTGTATAATATGGCACCACTT TTGTAAAATCATGTAAGCACAGAAAATGGTAAAGGCTTTGAAAAACCTATCTTTAAACAGTGTCTCTTCGTGAA TTATGGGTGTTTTTAGCTTGTCTGCAGTTTTTTTCCCAGTATGAATGTTCATACTTTATTATATAGACACACATT 
CCTCAACCCTCTAGGCTCTCTCCACCCCCTCGAAGTAAACAGACTCTCACACACTCTTCCCAAATGCTGCCAC AACAGGATCTAACTTTTCCCCTGGACCTCTGTGACTATCATGAGCCCAGGTGCCAGGTCAGGCTCTATGCCTT CCATCCCGATGAACCCACATTTGGATTCATTCTCTGACAAGGCAAGAGAATGCAGG

5 finished sequence - annotate [DNA sequence display as on slide 4; label: disease resistance gene]

6 finished sequence - size [DNA sequence display as on slide 4] Arabidopsis 130.000.000 bp; Potato 840.000.000 bp; Human 4.000.000.000 bp; Wheat 15.000.000.000 bp

7 genome annotation

8 genome annotation [diagram: upload sequence]

9 genome annotation [diagram: upload sequence; gene prediction; object: gene]

10 genome annotation [diagram: upload sequence; gene prediction; BLAST; objects: sequence, gene, blast report]

11 genome annotation [diagram: upload sequence(s); repeat finder; gene prediction 1; gene prediction 2; consolidate; BLASTP; BLASTN; Interpro; miRNA prediction; predict targets]

12 execute (many) 3rd party tools: time consuming; data intensive – multiple data formats; requires expert knowledge, e.g. BLAST settings; tools are mutually dependent: the output of one is the input of the next

13 what do we need? automated; high throughput; flexible (extend with new tools, add or change BACs, change tools, e.g. a new BLAST database); reliable (error tracking, problem recovery)

14 cyrille2

15 What about Taverna? (or others?)

16 brief system overview [diagram: end user; pipeline operator; user interface; pipeline database; status database; biological database; scheduler; executor; linux cluster; tools, e.g. BLAST]

17 data transport & storage: BioMOBY, an XML based standard; web services (remote execution of tools, data transport); uniform description of data; uniform identification of data

18 iterative execution [diagram: pipeline database; biological database; status database]

19 pipeline execution, step 1: create pipeline [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast)]

20 pipeline execution, step 2: upload sequence [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: s]

21 pipeline execution, step 2: register in the status database [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ss; pointer]

22 pointers: unique object identification (id, articlename, namespace, database id) [diagram: scheduler; pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ss; pointer]

23 system is db agnostic: unique object identification (id, articlename, namespace, another database id) [diagram: scheduler; pipeline database; status database; another biological database; pipeline (seq, gen, blast); objects: ss; pointer]
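
The pointer fields named on slides 22 and 23 can be pictured as a small record. The sketch below is only an illustration of that idea, not Cyrille2's actual schema: the class name `ObjectPointer`, the helper `resolve`, the example namespace string, and the use of plain dictionaries as stand-in databases are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectPointer:
    """Hypothetical pointer record; the four fields follow slide 22."""
    id: str            # identifier of the object inside its database
    articlename: str   # role of the object, e.g. "sequence"
    namespace: str     # data type namespace (example value is made up)
    database_id: str   # which biological database holds the data


def resolve(pointer, databases):
    """Fetch the data from whichever database the pointer names."""
    return databases[pointer.database_id][pointer.id]


# Two stand-in "biological databases": plain dicts, for illustration only.
databases = {
    "db_a": {"seq1": "AAACCTCC"},
    "db_b": {"seq1": "GGGTTTAA"},
}

p1 = ObjectPointer("seq1", "sequence", "DNASequence", "db_a")
p2 = ObjectPointer("seq1", "sequence", "DNASequence", "db_b")

print(resolve(p1, databases))
print(resolve(p2, databases))
```

Because the database id is part of the pointer, the same object id can live in two databases without clashing, which is one way to read "system is db agnostic".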

24 pipeline execution, step 3: scheduler operates on the status db [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssg]

25 pipeline execution, step 4: executor [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssggggg]

26 pipeline execution, step 4: executor [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssggggg]

27 pipeline execution, back to step 3: scheduler [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbb]

28 pipeline execution, step 4: executor ... until finished [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbb]
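
The alternation between step 3 (scheduler) and step 4 (executor) can be sketched as a loop that stops when the scheduler finds nothing left to do. This is a toy model under stated assumptions, not Cyrille2 code: the `PIPELINE` table, the function names, the dictionary used as a status database, and the fixed fan-out of two outputs per job are all invented for illustration.

```python
# Pipeline from the slides: seq -> gen -> blast. Each tool maps to the
# object type it consumes and the object type it produces (assumed names).
PIPELINE = {"gen": ("seq", "gene"), "blast": ("gene", "blast_report")}


def schedule(status_db, done):
    """Step 3: list (tool, object id) jobs that have not run yet."""
    jobs = []
    for obj_id, obj_type in status_db.items():
        for tool, (consumes, _produces) in PIPELINE.items():
            if obj_type == consumes and (tool, obj_id) not in done:
                jobs.append((tool, obj_id))
    return jobs


def execute(job, status_db, done, fanout=2):
    """Step 4: pretend to run a tool and register its output objects."""
    tool, obj_id = job
    produces = PIPELINE[tool][1]
    for i in range(fanout):
        status_db[f"{obj_id}/{tool}{i}"] = produces
    done.add(job)


def run_until_finished(status_db, done):
    """Alternate scheduler and executor until no jobs remain."""
    executed = 0
    while True:
        jobs = schedule(status_db, done)
        if not jobs:
            return executed
        for job in jobs:
            execute(job, status_db, done)
            executed += 1


status_db = {"bac1": "seq"}   # one uploaded sequence
done = set()
first_run = run_until_finished(status_db, done)

status_db["bac2"] = "seq"     # add a bac later
second_run = run_until_finished(status_db, done)
print(first_run, second_run, len(status_db))
```

Keeping the `done` set between runs means that a newly added sequence only triggers jobs for its own objects; earlier results are left alone.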

29 pipeline execution: add a bac [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbss]

30 pipeline execution: add a bac [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssg]

31 pipeline execution: add a bac [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssggg]

32 pipeline execution: add a bac [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssgggb]

33 pipeline execution: add a bac [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssgggbbb]

34 change BLAST db [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssgggbbb]

35 pipeline execution: change BLAST db [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbssgggb]

36 pipeline execution: change BLAST db [diagram: pipeline database; status database; biological database; pipeline (seq, gen, blast); objects: ssgggggbbbbbbssgggbbb]

37 scheduler

38 operates on the status database; status database stores all object relations; allows complex scheduling operations

39 pipeline execution: more complex schedule steps [diagram: status database; pipeline database; pipeline (seq, gen, blast); objects: sgggbbbbsggbb, gbbgbb]

40 scheduler: more complex schedule steps, all objects at once [diagram: pipeline database; status database; pipeline (seq, gen, X); objects: sgggsgg, ggX]

41 scheduler: more complex schedule steps, all objects per analysis [diagram: pipeline database; status database; pipeline (seq, gen, X); objects: sgggsgg, ggXX]
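
The difference between scheduling one job per object, one job for all objects at once, and one job per upstream analysis comes down to how input objects are grouped. A minimal sketch, assuming each gene object records the sequence it was predicted from; the data and function names are hypothetical and chosen to mirror the "sgggsgg" picture (three genes from one sequence, two from another).

```python
from collections import defaultdict

# (gene id, id of the sequence it came from); made-up example data.
genes = [("g1", "s1"), ("g2", "s1"), ("g3", "s1"), ("g4", "s2"), ("g5", "s2")]


def per_object(objs):
    """Default mode: one job for every input object."""
    return [[gid] for gid, _parent in objs]


def all_at_once(objs):
    """One job that receives every input object."""
    return [[gid for gid, _parent in objs]]


def per_analysis(objs):
    """One job per upstream analysis, here per parent sequence."""
    groups = defaultdict(list)
    for gid, parent in objs:
        groups[parent].append(gid)
    return list(groups.values())


print(len(per_object(genes)), "jobs per object")
print(len(all_at_once(genes)), "job for all objects at once")
print(len(per_analysis(genes)), "jobs for all objects per analysis")
```

Grouping by parent is only possible because the object relations are stored, which is the role slide 38 gives to the status database.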

42 scheduler: more complex schedule steps, based on object history [diagram: Upload sequences; Upload cDNAs; Align cDNA (Sim4)]

43 scheduler: more complex schedule steps, based on object history [diagram: Upload sequences; Upload cDNAs; create BLAST db; BLAST; BLAST hit ≈ segment containing cDNA]

44 scheduler: more complex schedule steps, based on object history [diagram: Upload sequences; Upload cDNAs; create BLAST db; BLAST; BLAST hit; Align cDNA (Sim4)]

45 scheduler: more complex schedule steps, based on object history [diagram: Upload sequences; Upload cDNAs; create BLAST db; BLAST; BLAST hit; Align cDNA (Sim4)]
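
Scheduling on object history can be read as follows: instead of aligning every cDNA against every sequence, the alignment step is only scheduled for the (cDNA, sequence) pairs that a BLAST hit connects. The sketch below illustrates that reading with made-up identifiers and hit records; it does not call BLAST or Sim4, and the record layout is an assumption.

```python
from itertools import product

sequences = ["seqA", "seqB", "seqC"]
cdnas = ["cdna1", "cdna2"]

# Hypothetical BLAST hits; each hit remembers the objects in its history.
blast_hits = [
    {"cdna": "cdna1", "sequence": "seqB"},
    {"cdna": "cdna2", "sequence": "seqA"},
    {"cdna": "cdna2", "sequence": "seqA"},  # second hit for the same pair
]


def naive_alignment_jobs(cdnas, sequences):
    """Without history: align every cDNA against every sequence."""
    return list(product(cdnas, sequences))


def history_alignment_jobs(hits):
    """With history: one alignment per distinct (cDNA, sequence) hit pair."""
    pairs = []
    for hit in hits:
        pair = (hit["cdna"], hit["sequence"])
        if pair not in pairs:
            pairs.append(pair)
    return pairs


print(len(naive_alignment_jobs(cdnas, sequences)), "alignments without history")
print(len(history_alignment_jobs(blast_hits)), "alignments with history")
```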

46 executor

47 executor [diagram: status database; pipeline database; pipeline (seq, gen, blast); objects: sgggbbsggb, gbgb]

48 executor [diagram: SGE cluster; status database; pipeline database; pipeline (seq, gen, blast); objects: sgggbbsggb, gbgbbbbbb]

49 single job execution [diagram: cyrille2 core; node: get from database, tool wrapper, store in db]

50 single job execution [diagram: cyrille2 core; node: get from database, tool wrapper, store in db; biological database; status database; BioMOBY XML; labels: pointer, data]

51 single job execution [diagram: cyrille2 core; node: get from database, tool wrapper, store in db; biological database; status database; BioMOBY XML; tool; labels: data, pointer, data]

52 single job execution [diagram: cyrille2 core; node: get from database, tool wrapper, store in db; biological database (twice); status database (twice); BioMOBY XML; tool; labels: data, pointer, data, pointer]
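
The three boxes on the node (get from database, tool wrapper, store in db) suggest a simple shape for a single job. The sketch below is an assumption about that shape rather than the real wrapper: the dictionary standing in for the biological database, the list standing in for the status database, the pointer naming scheme, and the toy tool `find_atg` are all hypothetical.

```python
def find_atg(sequence):
    """Toy stand-in for a wrapped tool: positions of ATG in a sequence."""
    return [i for i in range(len(sequence) - 2) if sequence[i:i + 3] == "ATG"]


def run_single_job(input_pointer, tool, bio_db, status_db):
    """Fetch data by pointer, run the tool, store results and pointers."""
    data = bio_db[input_pointer]               # get from database
    results = tool(data)                       # tool wrapper
    out_pointers = []
    for i, result in enumerate(results):       # store in db
        pointer = f"{input_pointer}.{tool.__name__}.{i}"
        bio_db[pointer] = result               # data goes to the biological db
        status_db.append((input_pointer, pointer))  # relation goes to status db
        out_pointers.append(pointer)
    return out_pointers


bio_db = {"seq1": "CCATGGATGA"}
status_db = []
new_pointers = run_single_job("seq1", find_atg, bio_db, status_db)
print(new_pointers)
```

Only pointers travel back to the core in this sketch; the data itself stays in the biological database.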

53 high throughput issues

54 high throughput issues - lots of data: webservices and XML do not like lots of data. Cyrille2: sends pointers; avoids using XML; limits use of webservices

55 high throughput issues – jobs: many small jobs; jobs taking a long time. Cyrille2: possible to create batches of many small jobs; execution model can handle long running jobs
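
Batching many small jobs amounts to cutting the job list into fixed-size chunks, so that one cluster submission carries several jobs. A minimal sketch; the function name and the batch size are assumptions, and no cluster submission is performed here.

```python
def make_batches(jobs, batch_size):
    """Split a list of jobs into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]


small_jobs = [f"blast_gene_{n}" for n in range(10)]
batches = make_batches(small_jobs, 4)
print(len(batches), "batches instead of", len(small_jobs), "submissions")
```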

56 high throughput issues – don't repeat yourself: update of the BLAST database; new version of a single input object (sequence). Cyrille2: repeat only that part of the pipeline that needs repeating
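
Repeating only what needs repeating can be sketched as selecting the affected objects from the stored object relations. The example assumes a child-to-parent table and a record of which tool produced each object; both tables and the function names are hypothetical.

```python
# child -> parent relations and producing tool, as a status database
# might record them (made-up example data).
parents = {"g1": "s1", "g2": "s1", "b1": "g1", "b2": "g2", "g3": "s2", "b3": "g3"}
tool_of = {"g1": "gen", "g2": "gen", "g3": "gen",
           "b1": "blast", "b2": "blast", "b3": "blast"}


def descendants(root):
    """Objects derived from root: redo these when root gets a new version."""
    found = []
    frontier = [root]
    while frontier:
        current = frontier.pop()
        for child, parent in parents.items():
            if parent == current:
                found.append(child)
                frontier.append(child)
    return sorted(found)


def produced_by(tool):
    """Objects made by one tool: redo these when that tool's setup changes."""
    return sorted(obj for obj, t in tool_of.items() if t == tool)


print(descendants("s1"))      # new version of sequence s1
print(produced_by("blast"))   # BLAST database was updated
```

In both cases the gene predictions for untouched sequences are left alone, which matches the "change BLAST db" slides where only the blast objects are removed and regenerated.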

57 questions? some screenshots

58-63 [screenshots]

64 people: Mark Fiers, Erwin Datema, Ate van der Burgt, Joost de Groot, Sander Peters, Jan van Haarst, Marjo van Staveren, Rene Klein Lankhorst, Roeland van Ham

65 questions

