UMR ASP
Structural & Comparative Genomics in Bread Wheat
TriAnnotPipeline: A LifeGrid Project based on AUVERGRID
F. Giacomoni, M. Reichstadt, P. Leroy
Génétique, Diversité & Ecophysiologie des Céréales - Clermont-Ferrand, France
3rd EGEE User Forum, February 12th, 2008
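Wheat as a challenge for Genomics
* Important economic crop
* Large genome size
[Figure: genome size (Mb) and repeat-sequence content compared across barley, rice, bread wheat, maize, human and A. thaliana. Values surviving extraction, no longer matched to a species: 380 Mb, 140 Mb; repeat sequences 85%, 70-80%, 50-80%, 50%, 10%.]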
I.N.R.A. Work on the Wheat Genome
* Sequencing
* Annotating
  * Discover genes
  * Find transposable elements
  * Study other biological components
DNA sequence: AAAATCGATATAGAGTATGTAGACAAATTTTAAACCCGGGGGAGAGAGAGA
[Figure: results after annotation of the DNA sequence]
General Pipeline Structure of TriAnnot
[Diagram labels]
* Input: DNA sequences
* TriAnnot Pipeline (GRID)
* Ab initio gene predictors: Eugene, GenemarkHMM, GeneID
* Training data set: TriSet, GeneFarm (manual curation)
* Genes: manual curation
* TEs: TREPcons, REPET; manual curation
* Output: DataBase (chado) & Viewers (GBrowse)
TriAnnotPipelineGRID Architecture
[Diagram labels]
* WEB / Pipeline Production (login/password) and WEB / Pipeline Development
* GRID & Cluster: RepeatMasker, est2genome, Gmap, BLAST, HMMPfam; DataBanks
* Users: UpLoad (login/password); DownLoad of gff for ARTEMIS and gameXml for APOLLO
* Manual curation with APOLLO
* GBrowse (login/password); GnpDB On Line; Local Gnp Genome (GFF)
TriAnnotPipelineGRID Detailed Architecture
Input: BAC sequence (FASTA format)
Panel 1: Transposable elements & repeats (annotation and masking)
* Block1a, Block1b: RepeatMasker against TREPnr, TREPtotal, RepBase; BLASTx against TREPprot; TRF (SSR)
* Output: BAC with masked TEs
Panel 2: Gene annotation
* Block2: gene structure prediction, ab initio (GeneMarkHMM, GeneID, EuGene, GENSCAN, GeneZilla)
* Block3a, Block3b: BLASTx against SwissProt / TrEMBL; BLAST/Gmap with transcripts (FL-cDNA, EST, mRNA)
* Block3c: gene model with EVM + PASA (US), RAP-like (Japan), EUGENE (France)
* Block4: gene function following the IWGSC annotation guideline (Known Protein, Putative Protein, Domain Containing Protein, Expressed Gene, Conserved Hypothetical Gene, Hypothetical Gene); best-hit proteins in At and Os
* Output: BAC with masked TEs & genes
Panel 3: Other biological target searches
* Block5a: BLASTn against UGset / IRGSP / TIGR pseudo
* Block5b, Block5c, Block5d: … nt, sts, htgs, gss; tRNA, miRNA; mtDNA, cpDNA
WEB INTERFACE PART:
* Upload of the BAC sequence in FASTA format (Wheat Seq)
* Setting of the annotation parameters, organised in 5 blocks
* Production of a step.xml file
PIPELINE PART:
STEP_0:
* 3 RepeatMasker vs 3 DataBanks
STEP_1:
* 8 BLASTn vs 8 DataBanks
* 1 BLASTx vs 1 DataBank
* 1 Tandem Repeat Finder
STEP_2:
* 1 EugeneIMM Rice
* 1 GeneId
* 4 GeneMarkHMM with 4 matrices
STEP_3:
* 1 tBLASTx vs 1 DataBank
* 1 BLASTn vs 1 DataBank
* 1 BLASTx vs 1 DataBank
STEP_4:
* 2 tBLASTn vs 2 DataBanks
RESULTS FILES (GFF Format)
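The step list above can be held as a small data structure and serialised to XML. The following is a minimal sketch only: the slides do not show the layout of step.xml, so the element and attribute names (`pipeline`, `step`, `task`, `tool`, `runs`) and the file name `bac.fasta` are hypothetical. Only the tools and run counts come from the slide.

```python
import xml.etree.ElementTree as ET

# Step plan transcribed from the slide: (tool, number of runs) per step.
STEPS = {
    "STEP_0": [("RepeatMasker", 3)],
    "STEP_1": [("BLASTn", 8), ("BLASTx", 1), ("TandemRepeatFinder", 1)],
    "STEP_2": [("EugeneIMM", 1), ("GeneId", 1), ("GeneMarkHMM", 4)],
    "STEP_3": [("tBLASTx", 1), ("BLASTn", 1), ("BLASTx", 1)],
    "STEP_4": [("tBLASTn", 2)],
}


def build_step_xml(steps, sequence_file):
    """Build an XML tree for the step plan (hypothetical schema)."""
    root = ET.Element("pipeline", {"sequence": sequence_file})
    for step_name, tasks in steps.items():
        step_el = ET.SubElement(root, "step", {"name": step_name})
        for tool, runs in tasks:
            ET.SubElement(step_el, "task", {"tool": tool, "runs": str(runs)})
    return root


def count_runs(steps):
    """Return the number of program runs in each step."""
    return {name: sum(runs for _, runs in tasks) for name, tasks in steps.items()}


if __name__ == "__main__":
    root = build_step_xml(STEPS, "bac.fasta")  # hypothetical input file name
    print(ET.tostring(root, encoding="unicode"))
    print(count_runs(STEPS))
```

Counting the runs per step makes the workload visible: with the figures on the slide, one BAC sequence triggers 24 program runs, which is the motivation for distributing them.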
TriAnnotPipelineGRID Architecture
WEB INTERFACE PART:
* Upload of the BAC sequence in FASTA format (Wheat Seq)
* Setting of the annotation parameters, organised in 5 blocks
* Production of a step.xml file
PIPELINE_GRID PART I (STEP_1A)
PIPELINE LOCAL PART:
STEP_1B:
* 1 TRF
STEP_2:
* 1 EugeneIMM Rice
* 1 GeneId
* 4 GeneMarkHMM
STEP_3C:
* 3 Gene Modelling
PIPELINE_GRID PART II (STEP_1B, 3A, 3B, 4A, 4B, 5A and 5D)
Grid jobs: 5 RepeatMasker (RM), 3 BLASTx, 8 GMap, 6 BLASTp, 1 PFAM, 1 tBLASTn, 14 BLASTn
RESULTS FILES (GFF Format)
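Both the local and the grid parts end in GFF result files. As an illustration of what one result record looks like, here is a minimal sketch that formats a feature as a nine-column, tab-separated GFF line. It assumes GFF3-style `key=value` attributes; the slides do not state which GFF version TriAnnot writes, and the sequence name, feature type and ID in the usage example are hypothetical.

```python
def gff_line(seqid, source, ftype, start, end, strand,
             score=None, phase=None, attributes=None):
    """Format one feature as a GFF line (9 tab-separated columns).

    Coordinates are 1-based and inclusive; missing values are written as '.'.
    """
    if start < 1 or end < start:
        raise ValueError("GFF coordinates must satisfy 1 <= start <= end")
    attr = ";".join(f"{key}={value}" for key, value in (attributes or {}).items())
    columns = [
        seqid,
        source,
        ftype,
        str(start),
        str(end),
        "." if score is None else str(score),
        strand,
        "." if phase is None else str(phase),
        attr or ".",
    ]
    return "\t".join(columns)


if __name__ == "__main__":
    # Hypothetical repeat found by RepeatMasker on a BAC named BAC1.
    print(gff_line("BAC1", "RepeatMasker", "repeat_region", 100, 250, "+",
                   attributes={"ID": "rep1"}))
```

Keeping the program name in the source column lets results from the different tools be merged into one file and still be told apart in a viewer.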
[Diagram: server part and grid part]
* Server part: Server, User Interface (UI), bioinformatic package, DB update service, JDL
* Grid part: Computing Element (CE) running the bioinformatic algorithms; SE holding the bioinformatic databases
[Diagram: Server, UI (JDL), Computing Element (CE) with bioinformatic algorithms]
* Get the parameters
* Create the XML step file
* Get the input (sequence) file
* Create the grid environment (JDL, shell scripts)
* Mask the repeated sequences
* RepeatMasker / Blast / GMap / HMMer
* Retrieve the output
* Fill the database
[Diagram: Server, UI (JDL), Computing Element (CE) with bioinformatic algorithms]
1- Parameters + input file
2- Creation of the XML file
3- Copy of the input files
4- Creation of the environment
5- Job submission
6- Job running (BLAST / HMMer / RepeatMasker / GMap)
7- Job output
8- Output transfer
9- DB filling
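The "creation of the environment" step produces a JDL file for each job before submission from the UI. The sketch below shows one way such a file could be generated. The attribute names (Executable, Arguments, StdOutput, StdError, InputSandbox, OutputSandbox) are standard JDL attributes, but the wrapper script, file names and the choice of attributes are assumptions, not taken from the TriAnnot code.

```python
def jdl_value(value):
    """Render a Python value as a JDL string or a JDL list of strings."""
    if isinstance(value, (list, tuple)):
        return "{" + ", ".join(f'"{item}"' for item in value) + "}"
    return f'"{value}"'


def make_jdl(executable, arguments, input_files, output_files):
    """Return the text of a JDL job description.

    The input sandbox is shipped to the Computing Element with the job;
    the output sandbox is what comes back to the UI when the job ends.
    """
    attributes = [
        ("Executable", executable),
        ("Arguments", arguments),
        ("StdOutput", "std.out"),
        ("StdError", "std.err"),
        ("InputSandbox", list(input_files)),
        ("OutputSandbox", ["std.out", "std.err", *output_files]),
    ]
    return "\n".join(f"{name} = {jdl_value(value)};" for name, value in attributes) + "\n"


if __name__ == "__main__":
    # Hypothetical wrapper script and file names for one BLAST job.
    print(make_jdl("run_blast.sh", "bac.fasta",
                   ["run_blast.sh", "bac.fasta"], ["blast.gff"]))
```

Generating one small JDL file per program run keeps the jobs independent, so a failed run can be resubmitted without repeating the others.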
TriAnnotPipelineGRID Partners
F. Giacomoni, C. Charpentier, N. Guilhot, F. Choulet, P. Leroy, C. Feuillet, T. Tanaka, H. Ikawa, H. Numa, T. Itoh, M. Alaux, T. Flutre, I. Blanc-Lenfle, S. Reboux, H. Quesneville, B. Haas, F. Legeai, B. Kronmiller, M. Reichstadt, A. Claude, M. Liauzu, A. Mahul