Model of a real workflow


1 Model of a real workflow
And issues to discuss…

2 PlasmoDB workflow
P. falciparum  Standard genome
P. vivax  Standard genome
P. yoelii  Standard genome
P. berghei  Standard genome
P. chabaudi  Standard genome
P. knowlesi  Standard genome
P. reichenowi  Standard genome
P. gallinaceum  Standard genome
P. falciparum  Non-standard synteny

3 Standard Genome Workflow
Legend: Subflows (double line); Global steps (oval); Compile time Include/Exclude
Steps:
  NRDB
  taxonomy
  SO
  TIGR TGI
  Genome
  Calculate translated protein (In: Pf, Pk)
  Extract genomic sequence
  Copy genomic seqs to cluster
  Extract proteins
  Copy proteins to cluster
  Molecular weight
  Min/max molecular weight
  Isoelectric point
  Run TMHMM
  blastx nrdb genome
  psipred
  blastp nrdb proteins
  Splign (In: Pf, Pb, Py, Pv)
  Load TMHMM
Note: Some steps will need to know if they have already been done in another subflow, eg copying and fixing nrdb. Make these “global steps”; they just need a “global step name”.

4 Standard Genome Workflow
Steps:
  NRDB
  taxonomy
  SO
  TIGR TGI
  Genome
  Calculate translated protein (In: Pf, Pk)
  Extract genomic sequence
  Copy genomic seqs to cluster
  Extract proteins
  Copy proteins to cluster
  Molecular weight
  Min/max molecular weight
  Isoelectric point
  Run TMHMM
  blastx nrdb genome
  psipred
  blastp nrdb proteins
  Splign (In: Pf, Pb, Py, Pv)
  Load TMHMM
Note: Some steps will need to know if they have already been done in another subflow, eg copying and fixing nrdb. Make these “global steps”; they just need a “global step name”.

5 NRDB
Steps:
  NRDB resource
  Copy from download site
  Shorten defline
  Copy to cluster
  Copy to cluster
Note: This subflow should deal with different flavors of blast, and with the optional running of the loading of the subject subset. The steps will need to be parameterized to know how to do this.

6 Resources
Steps:
  acquire
  unpack
  ext db
  ext db rls
  insert
Note: Some steps will need to know if they have already been done in another subflow, eg copying and fixing nrdb.

7 Psipred
Steps:
  Create psipred data dir
  Fix protein IDs for psipred
  Create psipred task dir
  Copy data dir to cluster
  Copy psipred protein file to cluster
  Start psipred on cluster
  Wait for cluster
  Copy psipred files from cluster
  Fix psipred file names
  Make Alg Inv
  Load psipred
Note: Some steps will need to know if they have already been done in another subflow, eg copying and fixing nrdb.

8 BLAST
Legend: Optional step (runtime test)
Steps:
  Create similarity dir
  Start blast
  Wait for cluster
  Copy files from cluster
  Extract IDs from blast result
  Load subject subset
  Load result
Note: This subflow should deal with different flavors of blast, and with the optional running of the loading of the subject subset. The steps will need to be parameterized to know how to do this.

9 Splign
Steps:
  Extract query sequence
  Alt defline
  Extract subject sequence
  Alt defline
  runSplign
  insertSplign
Note: This subflow should deal with different flavors of blast, and with the optional running of the loading of the subject subset. The steps will need to be parameterized to know how to do this.

10 Discussion

11 Graph file -- features --
Workflow xml file
Subflows
Parameters
Constants
Interpolating variables
Global steps
  Steps that are only executed once by the whole workflow, even if in multiple subflows
  Declare a namespace?
Include/exclude
  Compile time inclusion/exclusion
  If not compiled in, flow passes right through
Skip-able steps
  Runtime exclusion, based on a dynamic test
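The three step behaviours listed on this slide (global steps, compile-time include/exclude, runtime skipping) can be sketched in a few lines of Python. This is only an illustration of the ideas, not the workflow system's actual implementation: the `Step` fields, `compile_flow`, `run_flow`, and the step names used with them are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set


@dataclass
class Step:
    name: str
    # Hypothetical "global step name": steps sharing it run once per workflow.
    global_name: Optional[str] = None
    # Hypothetical compile-time include list (None means "all projects").
    include_in: Optional[Set[str]] = None
    # Hypothetical runtime test: return True to skip the step.
    skip_if: Optional[Callable[[], bool]] = None


def compile_flow(steps: List[Step], project: str) -> List[Step]:
    """Compile-time include/exclude: excluded steps vanish, flow passes through."""
    return [s for s in steps if s.include_in is None or project in s.include_in]


def run_flow(steps: List[Step], done_globals: Set[str], log: List[str]) -> None:
    """Run one subflow, recording what happened in `log`."""
    for step in steps:
        # Global step already executed by another subflow: do nothing.
        if step.global_name is not None and step.global_name in done_globals:
            continue
        # Skip-able step: runtime exclusion based on a dynamic test.
        if step.skip_if is not None and step.skip_if():
            log.append(f"skipped {step.name}")
            continue
        log.append(f"ran {step.name}")
        if step.global_name is not None:
            done_globals.add(step.global_name)
```

Sharing one `done_globals` set across all subflows of a workflow run is what makes a step "global" in this sketch; a namespace, if declared, could be folded into `global_name`.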

12 Graph file -- sharing across projects --
Live in svn: ApiCommonData/Load/lib/xml/workflow
Found by system in $GUS_HOME/lib/xml/workflow
Shared across all projects
  Use include/exclude to specify project-specific functionality
  Therefore, each build must be on its own branch, to avoid interference

13 Graph file -- step values --
Avoid side effects in file system (ok in database)
  All files shared by steps must be passed as param values
    outputFiles
    inputFiles
Avoid hard-coded values
  Use Constants
Avoid hand-coded values that change each build
  Must be computed by step
  Eg blast Y= value
External Db Rls values
  Always pass external db rls spec, eg Plasmodium Falciparum Chromosomes:
  Upgrade steps to conform to this
Table names
  Want to be able to reuse these values across steps
  Always use same format, eg: Dots.ExternalNaSequence
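As a rough illustration of passing files as explicit parameters and replacing hard-coded values with constants, the sketch below interpolates named constants into step parameter values using Python's standard `string.Template`. The `${name}` syntax, the `interpolate` helper, and the parameter and constant names are assumptions made for this example; the slides do not specify how the workflow XML expresses interpolation.

```python
from string import Template
from typing import Dict


def interpolate(params: Dict[str, str], constants: Dict[str, str]) -> Dict[str, str]:
    """Replace ${name} references in each param value with the named constant.

    Template.substitute raises KeyError for an undefined constant, so a typo
    in a constant name fails loudly instead of producing a wrong path.
    """
    return {key: Template(value).substitute(constants) for key, value in params.items()}


# Hypothetical constants and step params, for illustration only.
constants = {
    "dataDir": "Pvivax/Psipred",
    "seqTable": "Dots.ExternalNaSequence",
}
step_params = {
    "inputFile": "${dataDir}/proteins.fsa",
    "outputFile": "${dataDir}/proteins.fixed.fsa",
    "table": "${seqTable}",
}
resolved = interpolate(step_params, constants)
```

Because every file a step reads or writes appears in its params, the dependency between two steps is visible in the graph file rather than hidden in the file system.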

14 Graph file -- cluster --
Wait for cluster step
  Sends email (takes list of addresses as config. Maybe we should set up mailing list?)
  Followed by a waitForHuman step
    By default is in “WAIT_FOR_HUMAN” state
    Orthogonal to other states and offline status
    Pilot can turn that off, and it will run
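One way to read "orthogonal to other states and offline status" is that the human hold is a separate flag rather than another value of the step's state. The sketch below models that reading. The `StepStatus` class, the `READY` state name, and the `release` method are hypothetical; only the `WAIT_FOR_HUMAN` idea comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class StepStatus:
    state: str = "READY"          # hypothetical ordinary step state
    offline: bool = False         # offline status, independent of state
    wait_for_human: bool = True   # hold is on by default for a waitForHuman step

    def release(self) -> None:
        """What the pilot would do to turn the hold off."""
        self.wait_for_human = False

    def can_run(self) -> bool:
        """The step runs only if it is ready, online, and not held for a human."""
        return self.state == "READY" and not self.offline and not self.wait_for_human
```

Keeping the hold as its own flag means releasing it never overwrites the step's state or its offline status.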

15 Graph file -- resources pipeline --
We still use a resources.xml file
  Needed by the front end
    Pubmed
    Descriptions
    Data sources and attributions
Handled by a regular subflow
Only one unpack step
  Current multiple unpack steps need to be combined into a simple script
Dedicated step classes:
  ApiCommonData::Load::Step::AcquireExternalResource
  ApiCommonData::Load::Step::UnpackExternalResource
  ApiCommonData::Load::Step::InsertExternalDatabase
  ApiCommonData::Load::Step::InsertExternalDatabaseRelease
  ApiCommonData::Load::Step::InsertExternalResource
  Are subclasses of ApiCommonData::Load::Step::AcquireExternalStep
    Knows how to parse the resources.xml file

16 Configuration files
Steps Configuration
  Global
    Commonly used properties
    Not validated until runtime
  Static
    Defined per step class
    Convenient, often all that is necessary
  Cascading?
  Multi-steps file
Distinguish between stable properties and mutable ones
  Version numbers often change
  Svn?
Pilot configuration?

17 Runtime File & Directory Structure
Avoid side-effects
  Use explicit input/output params in xml file
Move to a nested data directory structure?
  /files/cbil/data/cbil/Plasmodb/5.5/workflow/data/
    Seqfiles/
    nrdb.fsa
    Pvivax/
    Psipred/
    Assembly/
    ESTs/
    Initial/
    Intermediate/
  Would use the namespace attribute, somehow
  Use path statement, eg: ../ ../tmhmm
Steps directories
  Use nested structure for subflows?
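A relative path statement such as `../tmhmm` only makes sense relative to some directory in the nested layout. The sketch below shows one possible resolution rule using Python's standard `posixpath` module. The `resolve` helper and the assumption that `Psipred/` sits under `Pvivax/` are illustrative guesses, not something the slide states.

```python
import posixpath


def resolve(step_dir: str, path_statement: str) -> str:
    """Resolve a relative path statement against a step's data directory."""
    return posixpath.normpath(posixpath.join(step_dir, path_statement))


# Assumed layout: data root from the slide, with Psipred/ nested under Pvivax/.
DATA_ROOT = "/files/cbil/data/cbil/Plasmodb/5.5/workflow/data"
psipred_dir = posixpath.join(DATA_ROOT, "Pvivax", "Psipred")

sibling = resolve(psipred_dir, "../tmhmm")  # a sibling directory of Psipred
parent = resolve(psipred_dir, "../")        # the enclosing organism directory
```

Under this rule a subflow can refer to its neighbours' files without knowing the absolute data root, which is the main appeal of nesting.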

18 External Files Repository
Do we need it? If so, what needs to be improved?

19 Documentation of the workflow
Workflow must be able to run in “documentation” mode
  Doesn’t run any steps
  Instead, produces documentation as expected by front end
    Methods xml file
    Resources xml file
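A documentation mode can be pictured as a second way of walking the same step list: instead of executing each step, the runner asks it to describe itself. The sketch below is a minimal illustration under that assumption. The `Step` class, its `document` method, and `run_workflow` are hypothetical, and the output here is plain strings rather than the methods and resources XML files the front end expects.

```python
from typing import Callable, List


class Step:
    def __init__(self, name: str, description: str, action: Callable[[], None]):
        self.name = name
        self.description = description
        self.action = action

    def run(self) -> None:
        """Normal mode: perform the step's work."""
        self.action()

    def document(self) -> str:
        """Documentation mode: describe the step without running it."""
        return f"{self.name}: {self.description}"


def run_workflow(steps: List[Step], documentation_mode: bool = False) -> List[str]:
    """Run every step, or return their descriptions when in documentation mode."""
    if documentation_mode:
        return [step.document() for step in steps]
    for step in steps:
        step.run()
    return []
```

Because both modes traverse the same step list, the generated documentation cannot drift from what the workflow actually does.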

20 GUI
Should it run in the web context?
  Security issues
  Avoids having to have installed software
  Would work from home
  All members of team could see the flow
  Somehow restrict editability
  Could be posted on real site as documentation?
    Overkill? Too detailed?
Needs to handle subflows
  Subflow node needs to show a summary of what is going on inside the subflow
    Multi-colored, to show various states inside it
  Gray out paths that are offline
  Expand/collapse?

21 Mini-flows like mini-pipes, but for workflows…

22 Slides after this are notes, and other junk

23 Standard resources taxonomy SO EnzymeDB GO Codes GO NRDB dbEST [tax_id] Bibliographic Ref terms MO terms MO types MO MO Entry InterPro Orthomcl phyletic orthomcl

24 Plasmodb resources IEDB epitopes IEDB dbxrefs NA Genbank dbrefs
AA Genbank dbrefs pdb Pdb index

25 P.falciparum resources
Watanabe Pf transcripts Watanabe Pf ESTs Zhang ESTs Florent ESTs Pf plastid Pf mitochon Pf GO Associations Sanger IT SNPs SU SNPs Broad SNPs Combined SNPs Winzeler Genetic Var. array MTC KI Array Winzeler Cell Cycle Winzeler Gametocyte Scripps Array Pfab Array DeRisi Array 7282 Daily Meta data GSE2265 Meta data Cowman Meta data Durasingh Meta data Baum Meta data Waters Meta data E-MEXP 128 Meta data E-MEXP 439 Meta data E-MEXP 449 Meta data GSE5247 Meta data GSE8099 Meta data Daily Array data GSE2265 array data Cowman Array data Durasingh Array data Baum Array data Waters Array data E-MEXP 128 Array data E-MEXP 439 Array data E-MEXP 449 Array data GSE5247 Array data GSE8099 Array data Daily RAD anal GSE2265 RAD anal Cowman RAD anal Durasingh RAD anal Baum RAD anal Waters RAD anal E-MEXP 128 RAD anal E-MEXP 439 RAD anal E-MEXP 449 RAD anal GSE5247 RAD anal GSE8099 RAD anal Waters Gametocyte Mass spec Waters Female Gametes mass Waters male Gametes mass Waters mixed Gametes mass Plasmodb Gene ids Sage tag Array design y2h Plasmo map interactome TIGR gene indexes Mutual info Pf chr Genbank refs PASA Db refs Hagai EC Winzeler Db refs Winzeler Lit refs Sage tag freqs Predicted Protein structs mr4 Cowman subcellular Haldar subcellular Apicoplast Florens 2002 Florens 2004 Broad SNP coverage Merozoite peptides lasonder oocysts Lasonder Oocysts sporozoites Lasonder salivary sporozoites evigan Broad bar code Broad 3k genotyping Entrez Dbrefs Pubmed dbrefs DeRisi Oligos DeRisi Dd2 DeRisi HB3 DeRisi 3D7

26 P. vivax resources Watanabe Pv transcripts Watanabe Pv ESTs Pv contigs
TIGR gene indexes Watanabe Pv transcripts Watanabe Pv ESTs Pv contigs Pv dbrefs Pv GB dbrefs Pv mitochon Pv chromosomes

27 Start Plasmo Toxo start C.parvum C.hominis Synteny End Api End

