Model of a real workflow A subset of the plasmodb pipeline (in progress!) And issues to discuss…


1 Model of a real workflow A subset of the plasmodb pipeline (in progress!) And issues to discuss…

2 PlasmoDB workflow
P. falciparum: Standard genome
P. gallinaceum: Standard genome
P. vivax: Standard genome
P. yoelii: Standard genome
P. berghei: Standard genome
P. chabaudi: Standard genome
P. knowlesi: Standard genome
P. reichenowi: Standard genome
P. falciparum: Non-standard synteny

3 Standard Genome Workflow
Diagram nodes: blastx NRDB genome; Splign (In: Pf, Pb, Py, Pv); blastp NRDB proteins; molecular weight; isoelectric point; molecular weight min/max; psipred; run TMHMM; Load TMHMM; taxonomy; SO; NRDB; Genome; TIGR TGI; Extract proteins; Extract genomic sequence; Copy proteins to cluster; Copy genomic seqs to cluster; Calculate translated protein (In: Pf, Pk)
Legend: Global steps (oval); Subflows (double line); Compile-time include/exclude

4 Standard Genome Workflow
Diagram nodes: blastx NRDB genome; Splign (In: Pf, Pb, Py, Pv); blastp NRDB proteins; Calculate translated protein (In: Pf, Pk); molecular weight; isoelectric point; molecular weight min/max; psipred; run TMHMM; Load TMHMM; taxonomy; SO; NRDB; Genome; TIGR TGI; Extract proteins; Extract genomic sequence; Copy proteins to cluster; Copy genomic seqs to cluster

5 NRDB
Diagram nodes: Copy from download site; Shorten defline; NRDB resource; Copy to cluster

6 Resources
Diagram nodes: acquire; unpack; ext db; ext db rls; insert

7 Psipred
Diagram nodes: fix protein IDs for psipred; create psipred task dir; copy data dir to cluster; copy psipred protein file to cluster; start psipred on cluster; wait for cluster; copy psipred files from cluster; fix psipred file names; make Alg Inv; load psipred; create psipred data dir

8 BLAST
Diagram nodes: Create Similarity dir; Start blast; Wait for cluster; Copy files from cluster; Extract IDs from Blast result; Load Subject subset; Load Result
Legend: Optional step (runtime test)

9 Splign
Diagram nodes: runSplign; Extract subject sequence; Alt defline; insertSplign; Extract query sequence; Alt defline

10 Issues

11 Steps
Subflows
  – Parameters
  – Constants
  – Interpolating variables
Global steps
  – Steps that are executed only once by the whole workflow, even if they appear in multiple subflows
  – Declare a namespace?
Include/exclude
  – Compile-time inclusion/exclusion
  – If not compiled in, flow passes right through
Skippable steps
  – Runtime exclusion, based on a dynamic test
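The three behaviours listed above (global steps, compile-time include/exclude, runtime skipping) can be sketched roughly as follows. This is a minimal illustration in Python, not the pipeline's actual implementation: the `Step` class, `run_subflow`, and the `optionalLoad` step are hypothetical names. The only detail taken from the slides is that "Calculate translated protein" is included for Pf and Pk.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    name: str
    action: Callable[[], None]
    is_global: bool = False   # run once per workflow, even if in several subflows
    included: bool = True     # compile-time include/exclude
    skip_if: Optional[Callable[[], bool]] = None  # runtime (dynamic) test


def run_subflow(steps: List[Step], done_globals: Set[str], log: List[str]) -> None:
    for step in steps:
        if not step.included:
            continue  # not compiled in: flow passes right through
        if step.is_global:
            if step.name in done_globals:
                continue  # already executed by another subflow
            done_globals.add(step.name)
        if step.skip_if is not None and step.skip_if():
            log.append(f"skipped {step.name}")
            continue
        step.action()
        log.append(f"ran {step.name}")


log: List[str] = []
done_globals: Set[str] = set()
nrdb = Step("NRDB", lambda: None, is_global=True)

for organism in ("Pf", "Pv"):
    run_subflow(
        [
            nrdb,
            # Included only for Pf and Pk, as annotated on the workflow slide.
            Step(f"{organism}.calcTranslatedProtein", lambda: None,
                 included=organism in ("Pf", "Pk")),
            # Hypothetical skippable step whose dynamic test says "skip".
            Step(f"{organism}.optionalLoad", lambda: None, skip_if=lambda: True),
        ],
        done_globals,
        log,
    )
```

Here the NRDB step runs once even though both organism subflows reference it, and the excluded step leaves no trace in the log.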

12 Step Values
Avoid side effects in the file system (OK in the database)
  – All files shared by steps must be passed as param values
      outputFiles
      inputFiles
Avoid hard-coded values
  – Use Constants
Avoid hand-coded values that change each build
  – Must be computed by a step
  – E.g. the blast Y= value
External Db Rls values
  – Always pass the external db rls spec, e.g. Plasmodium Falciparum Chromosomes:2008-07-13
  – Upgrade steps to conform to this
Table names
  – Want to be able to reuse these values across steps
  – Always use the same format, e.g. Dots.ExternalNaSequence
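If every step receives these values in one fixed format, a shared helper could validate them in one place. The sketch below assumes the external db rls spec is "name:version" split at the last colon, and that table names are "Schema.Table"; both function names are hypothetical and only the two example strings come from the slide.

```python
from typing import Tuple


def parse_ext_db_rls_spec(spec: str) -> Tuple[str, str]:
    """Split an external db rls spec of the assumed form 'name:version'."""
    name, sep, version = spec.rpartition(":")  # split at the last colon
    if not sep or not name.strip() or not version.strip():
        raise ValueError(f"expected 'name:version', got {spec!r}")
    return name.strip(), version.strip()


def parse_table_name(qualified: str) -> Tuple[str, str]:
    """Split a table name of the assumed form 'Schema.Table'."""
    parts = qualified.split(".")
    if len(parts) != 2 or not all(parts):
        raise ValueError(f"expected 'Schema.Table', got {qualified!r}")
    return parts[0], parts[1]


db_name, db_version = parse_ext_db_rls_spec(
    "Plasmodium Falciparum Chromosomes:2008-07-13"
)
schema, table = parse_table_name("Dots.ExternalNaSequence")
```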

13 Cluster
Wait for cluster step
  – Sends email
  – (Takes a list of email addresses as config. Maybe we should set up a mailing list?)
Followed by a waitForHuman step
  – By default is in the “WAIT_FOR_HUMAN” state
      Orthogonal to other states and offline status
  – Pilot can turn that off, and it will run
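One way to read "orthogonal" is that waiting for a human is a separate flag rather than another value of the step's state. A minimal sketch under that assumption follows; the state names other than WAIT_FOR_HUMAN, and the `StepStatus` class itself, are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical state names; the slide names only WAIT_FOR_HUMAN.
    READY = "READY"
    RUNNING = "RUNNING"
    DONE = "DONE"
    FAILED = "FAILED"


@dataclass
class StepStatus:
    state: State = State.READY
    offline: bool = False
    wait_for_human: bool = True  # on by default for a waitForHuman step

    def can_run(self) -> bool:
        # Three independent conditions, none encoded inside the others.
        return (
            self.state is State.READY
            and not self.offline
            and not self.wait_for_human
        )


status = StepStatus()
blocked_by_default = status.can_run()

status.wait_for_human = False  # the pilot turns the flag off
runs_after_release = status.can_run()

status.offline = True  # offline status still blocks, independently
runs_while_offline = status.can_run()
```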

14 Configuration
Steps configuration
  – Global
      Commonly used properties
      Not validated until runtime
  – Static
      Defined per step class
      Convenient, often all that is necessary
  – Cascading?
  – Multi-steps file
Distinguish between stable properties and mutable ones
  – Version numbers often change
Svn
Pilot configuration?
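The "Cascading?" option could mean that a property is looked up in the most specific scope first and falls back to broader ones. The sketch below shows that reading using Python's standard `collections.ChainMap`. All property names and values are made up for illustration (only the file name nrdb.fsa appears in the slides), and checking required keys up front is one possible answer to "not validated until runtime", not something the slides specify.

```python
from collections import ChainMap

# Hypothetical property sets, from broadest to most specific.
global_props = {"clusterServer": "cluster.example.org", "psipredVersion": "2.5"}
static_props = {"psipredVersion": "2.6", "nrdbFile": "nrdb.fsa"}  # per step class
step_props = {"nrdbFile": "nrdb_short.fsa"}                       # per step instance

# Lookups search step_props first, then static_props, then global_props.
config = ChainMap(step_props, static_props, global_props)


def missing_properties(required, props):
    """Return the required keys that no scope defines, before any step runs."""
    return [key for key in required if key not in props]


missing = missing_properties(["clusterServer", "nrdbFile", "emailList"], config)
```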

15 File & Directory Structure
Avoid side effects
Use explicit input/output params in the xml file
Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/ Seqfiles/ nrdb.fsa Pvivax/ Seqfiles/ Psipred/ Assembly/ ESTs/ Initial/ Intermediate/
  – Would use the namespace attribute, somehow
Use path statement, e.g.:
  – ../
  – ../tmhmm
Steps directories
  – Use nested structure for subflows?
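Relative path statements such as ../tmhmm would have to be resolved against a step's own data directory. A minimal sketch with the standard `posixpath` module follows. The `resolve` function is hypothetical, the nesting Pvivax/Psipred is an assumption about the proposed layout, and the check that a path stays under the data root is an added safeguard, not something the slide asks for.

```python
import posixpath

# Data root as written on the slide (trailing slash dropped).
DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"


def resolve(step_dir: str, relative: str) -> str:
    """Resolve a relative path statement against a step's data directory."""
    base = posixpath.join(DATA_ROOT, step_dir)
    resolved = posixpath.normpath(posixpath.join(base, relative))
    # Refuse paths that climb out of the workflow's data directory.
    if resolved != DATA_ROOT and not resolved.startswith(DATA_ROOT + "/"):
        raise ValueError(f"{relative!r} escapes the data root")
    return resolved


tmhmm_dir = resolve("Pvivax/Psipred", "../tmhmm")
parent_dir = resolve("Pvivax/Psipred", "../")
```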

16 GUI
Should it run in the web context?
  – Security issues
  – Avoids having to install software
  – Would work from home
  – All members of the team could see the flow
  – Somehow restrict editability
  – Could be posted on the real site as documentation?
      Overkill? Too detailed?
Needs to handle subflows
  – Subflow node needs to show a summary of what is going on inside the subflow
      Multi-colored, to show the various states inside it
      Gray out paths that are offline
      Expand/collapse?
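The summary a collapsed subflow node shows could be a count of step states, gathered recursively through nested subflows. The sketch below assumes a simple nested-dict representation and hypothetical state names. The step names are borrowed from the Psipred slide, but their grouping here is invented for the example.

```python
from collections import Counter


def summarize(node: dict) -> Counter:
    """Count the states of all steps inside a (possibly nested) subflow."""
    if "steps" not in node:          # a leaf step carries its own state
        return Counter([node["state"]])
    total: Counter = Counter()
    for child in node["steps"]:      # a subflow sums over its children
        total += summarize(child)
    return total


flow = {
    "name": "psipred",
    "steps": [
        {"name": "create psipred data dir", "state": "DONE"},
        {"name": "start psipred on cluster", "state": "RUNNING"},
        {
            "name": "cluster",       # hypothetical nested subflow
            "steps": [
                {"name": "wait for cluster", "state": "READY"},
                {"name": "copy psipred files from cluster", "state": "READY"},
            ],
        },
    ],
}

summary = summarize(flow)
```

A GUI could map each state in `summary` to a color segment on the collapsed node, sized by its count.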

17 Resource Pipeline
Not worked out yet
Needs to be handled by a regular subflow
Unpacks will need to be collapsed into a single unpack script
The Resources.xml file needed by the front end can be produced by a documentation run of the pipeline
Does it need to be configured in xml, or would a properties file be good enough?

18 Documentation of the workflow
Workflow must be able to run in “documentation” mode
  – Doesn’t run any steps
  – Instead, produces documentation as expected by the front end
      Methods xml file
      Resources xml file
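A documentation mode can be sketched as a flag on the workflow runner: when set, the runner walks the same step list but emits a description of each step instead of executing it. The XML element names below are placeholders, since the format the front end expects is not given in the slides, and the step descriptions are invented. Only the step names runSplign and insertSplign come from the Splign slide.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Tuple

executed: List[str] = []

# (name, description, action); descriptions are made up for illustration.
steps: List[Tuple[str, str, Callable[[], None]]] = [
    ("runSplign", "Align query sequences with Splign.",
     lambda: executed.append("runSplign")),
    ("insertSplign", "Load the Splign alignments.",
     lambda: executed.append("insertSplign")),
]


def run_workflow(step_list, documentation: bool = False) -> Optional[str]:
    if documentation:
        # Documentation mode: describe every step, run none of them.
        root = ET.Element("methods")  # placeholder element names
        for name, description, _action in step_list:
            method = ET.SubElement(root, "method", name=name)
            method.text = description
        return ET.tostring(root, encoding="unicode")
    for _name, _description, action in step_list:
        action()
    return None


methods_xml = run_workflow(steps, documentation=True)
```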

19 Slides after this are notes, and other junk

20 Standard resources
Diagram nodes: taxonomy; EnzymeDB; SO; NRDB; dbEST [tax_id]; GO; GO Codes; Bibliographic Ref terms; MO terms; MO types; MO; InterPro; MO Entry; Orthomcl phyletic; orthomcl

21 Plasmodb resources
Diagram nodes: IEDB epitopes; IEDB dbxrefs; NA Genbank dbrefs; AA Genbank dbrefs; pdb; Pdb index

22 P. falciparum resources
Zhang ESTs Apicoplast Florens 2002 Pf plastid Florent ESTs Pf mitochon Watanabe Pf transcripts Watanabe Pf ESTs Pf GO Associations Sanger IT SNPs SU SNPs Broad SNPs Combined SNPs DeRisi Oligos Winzeler Genetic Var. array DeRisi Dd2 DeRisi HB3 Winzeler Cell Cycle DeRisi 3D7 Scripps Array Winzeler Gametocyte DeRisi Array 7282 MTC KI Array Baum Meta data Durasingh Meta data GSE5247 Meta data Cowman Meta data Pfab Array E-MEXP 449 Meta data E-MEXP 439 Meta data Plasmodb Gene ids E-MEXP 128 Meta data Waters Meta data Waters Gametocyte Mass spec Daily Meta data GSE2265 Meta data GSE8099 Meta data interactome Waters Female Gametes mass Mutual info Plasmo map y2h Sage tag Array design Sage tag freqs Pf chr Genbank refs TIGR gene indexes Baum Array data Durasingh Array data GSE5247 Array data Cowman Array data E-MEXP 449 Array data E-MEXP 439 Array data E-MEXP 128 Array data Waters Array data Daily Array data GSE2265 array data GSE8099 Array data Baum RAD anal Durasingh RAD anal GSE5247 RAD anal Cowman RAD anal E-MEXP 449 RAD anal E-MEXP 439 RAD anal E-MEXP 128 RAD anal Waters RAD anal Daily RAD anal GSE2265 RAD anal GSE8099 RAD anal Waters male Gametes mass Waters mixed Gametes mass PASA Db refs Hagai EC Winzeler Db refs Winzeler Lit refs Predicted Protein structs mr4 Cowman subcellular Haldar subcellular Merozoite peptides Lasonder oocysts Florens 2004 Broad SNP coverage evigan Lasonder Oocysts sporozoites Entrez Dbrefs Pubmed dbrefs Broad bar code Broad 3k genotyping Lasonder salivary sporozoites

23 P. vivax resources
Diagram nodes: Watanabe Pv transcripts; Pv contigs; Watanabe Pv ESTs; Pv dbrefs; Pv GB dbrefs; Pv mitochon; Pv chromosomes; TIGR gene indexes

24 C.parvum C.hominis Synteny start End PlasmoToxo Api End Start

