Published by Gordon Hutchinson. Modified over 9 years ago.
1
GMODTools, Argos & cetera
A Replicable Genome infOrmation System of Common Components
GMOD Meeting, Oct. 2004
Don Gilbert, gilbertd@indiana.edu
2
Genome DB building blocks
- GMOD Tools for public data releases
- Argos framework for genome databases
- LuceGene fast document/object search
- Genome Directory System for genome data mining
- Unified Gene Pages (XML, web page)
3
GMOD Tools: Bulkfiles
cvs.sourceforge.net:/cvsroot/gmod checkout schema/GMODTools
4
Genome Data Tools
- Support common data update and public release tasks.
- GMODTools loads and extracts reagent sequences (EST, cDNA, GSS) to and from Chado databases.
- GMOD Bulkfiles creates bulk genome sequence and feature files for public distribution from a Chado database.
- Citrina is a workflow tool to automate external databank updates, such as GenBank and Gene Ontologies.
5
12 New genomes to go
- Need to publish numerous new genomes
- Bulk files are standard public access: sequence (fasta, …), features (gff, …), searches (Blast, …)
- 11 new Drosophila genomes; Daphnia genome; many more
- Chado database; XORT & other GMOD Tools to export data
- http://flybase.net/species
6
Bulkfiles
- Build release files from Chado DB
- Standardized files, headers
- DNA: fasta, raw
- Features: GFF3, gnomap
- Blast indices
- Lucene file indices
- Config files (blast, gbrowse, …)
7
Bulkfiles - BLAST indices
8
Bulkfiles - Map features
9
Bulkfiles OUTPUTS
- DNA files (full chromosomes) in raw and fasta formats
- GFF (v3) and FFF (used in FlyBase) feature files
- Fasta sequence for each feature set, with standardized headers (ID, names, db_xref, …) from feature files
- NCBI BLAST indices & configs
- Gbrowse config files with feature sets matching db
- Others added as needed (more easily than before)
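As an illustration of the "standardized headers" item above, here is a minimal sketch of writing one FASTA record whose header carries an ID, a name, and db_xrefs. The slides do not show the exact header layout Bulkfiles emits, so the `key=value; ` layout, the function name `fasta_record`, and the sample values are assumptions made for this example.

```python
def fasta_record(seq_id, sequence, name=None, db_xref=None, width=60):
    """Format one FASTA record with a key=value style header.

    The header layout here is an assumption for illustration, not the
    documented Bulkfiles format.
    """
    fields = []
    if name:
        fields.append(f"name={name}")
    if db_xref:
        fields.append("db_xref=" + ",".join(db_xref))
    header = ">" + seq_id
    if fields:
        header += " " + "; ".join(fields)
    # Wrap the sequence into fixed-width lines.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header, *body]) + "\n"


if __name__ == "__main__":
    # Hypothetical feature: the ID, name, and db_xref are made up.
    print(fasta_record("FBgn0000001", "ACGT" * 20,
                       name="example", db_xref=["DB:1"]), end="")
```

Keeping the header fields in a fixed order makes the files easy to parse with simple text tools, which seems to be the point of standardizing them.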
10
Bulkfiles Logic
- Organism/database logic (mostly) in configuration files
- Dump all chado db features using simple sql to common intermediate table files
- Feature info is simple: type, location, name/id, and a few attributes (db_xrefs, …; GFF-like)
- Easier checking of SQL to get all features desired
- Fast (30 - 60 min for full fly genome)
- Postprocess table files to create public use formats
- Tested with four different Chado dbs (Dmel, Dmel_hetero, Dpse_Dmel, and SGDLite)
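The "dump to an intermediate table, then postprocess" idea above can be sketched as a conversion from a tab-delimited feature table to GFF3 lines. The column names of the intermediate table (`seqid`, `type`, `start`, `end`, `strand`, `id`, `name`, `db_xref`) and the `chado` source label are assumptions; the slides only say the table holds type, location, name/id, and a few attributes. The nine-column layout and the `ID`, `Name`, and `Dbxref` attribute names follow the GFF3 format.

```python
import csv
import io

# Characters that must be percent-escaped inside GFF3 column 9 values.
GFF3_RESERVED = {"%": "%25", ";": "%3B", "=": "%3D",
                 "&": "%26", ",": "%2C", "\t": "%09"}


def escape_attr(value):
    return "".join(GFF3_RESERVED.get(ch, ch) for ch in value)


def table_row_to_gff3(row, source="chado"):
    """Turn one intermediate-table row (a dict) into a GFF3 line."""
    attrs = [f"ID={escape_attr(row['id'])}"]
    if row.get("name"):
        attrs.append(f"Name={escape_attr(row['name'])}")
    if row.get("db_xref"):
        xrefs = ",".join(escape_attr(x) for x in row["db_xref"].split(","))
        attrs.append(f"Dbxref={xrefs}")
    # Score and phase are left as "." in this sketch.
    return "\t".join([row["seqid"], source, row["type"], row["start"],
                      row["end"], ".", row["strand"] or ".", ".",
                      ";".join(attrs)])


def convert(table_text):
    """Yield a GFF3 document, line by line, from tab-delimited text."""
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    yield "##gff-version 3"
    for row in reader:
        yield table_row_to_gff3(row)


# Hypothetical one-row table; all values are made up.
SAMPLE_TABLE = (
    "seqid\ttype\tstart\tend\tstrand\tid\tname\tdb_xref\n"
    "2L\tgene\t1000\t2000\t+\tFBgn0000001\tgeneA\tDB1:X1,DB2:X2\n"
)

if __name__ == "__main__":
    print("\n".join(convert(SAMPLE_TABLE)))
```

Because the table is plain text, it can be inspected or diffed before any public format is generated, which matches the "easier checking" point on the slide.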
11
Bulkfiles stages
- Postprocess table files in stages
- Recode feature “oddities” to public view needs
- Better debugging of steps in the process
- Engineering time and configuration here
- Stages are loosely coupled; go back, tweak configurations, re-run partially as needed.
- Convert common feature table + dna to several output formats in one step.
- Combine features from several dbs and other sources like cytology here.
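The loosely coupled stages described above can be sketched as a list of named steps that each read earlier results and store their own, so a run can restart at any stage after a configuration tweak. The stage names (`dump`, `recode`, `write`), the recode mapping, and the in-memory `state` dict are hypothetical; per the slides, the real tool keeps intermediate results in table files.

```python
def dump_features(state):
    # Stand-in for the SQL dump to intermediate table files.
    return [{"id": "f1", "type": "internal_gene"},
            {"id": "f2", "type": "exon"}]


def recode_features(state):
    # Map internal feature types to their public names via configuration.
    recode = state["config"]["recode"]
    return [dict(f, type=recode.get(f["type"], f["type"]))
            for f in state["dump"]]


def write_formats(state):
    # Stand-in for writing public formats from the recoded table.
    return [f"{f['id']}\t{f['type']}" for f in state["recode"]]


STAGES = [("dump", dump_features),
          ("recode", recode_features),
          ("write", write_formats)]


def run_pipeline(state, start_at=None):
    """Run all stages, or only those from `start_at` onward.

    Results of earlier stages are reused from `state`.
    """
    names = [name for name, _ in STAGES]
    first = names.index(start_at) if start_at is not None else 0
    ran = []
    for name, func in STAGES[first:]:
        state[name] = func(state)
        ran.append(name)
    return ran


if __name__ == "__main__":
    state = {"config": {"recode": {"internal_gene": "gene"}}}
    run_pipeline(state)                          # full run
    state["config"]["recode"]["exon"] = "exon_public"
    run_pipeline(state, start_at="recode")       # partial re-run
    print(state["write"])
```

The partial re-run skips the expensive dump and only repeats the cheap postprocessing, which is the benefit the slide attributes to loose coupling.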
12
Bulkfiles config example
<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/" TMP="${GMOD_ROOT}/tmp" datadir="genomes/Drosophila_melanogaster" >
  FlyBase Chado DB r3.2
  Configuration for feature and sequence bulk files from FlyBase chado data release 3.2.1
  dmel
  Drosophila melanogaster
  D. melanogaster euchromatin genome data from FlyBase Release 3.2.1. See http://flybase.net/annot/dmel_r3.2.1.txt
  fbreleases
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302" user="" password="" />
  (FBgn|FBti)\d+
  filesets
  featuresets
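To show how a configuration like the one above might be consumed, here is a minimal sketch that reads the `<opt>` attributes and the `<db>` element, expands `${GMOD_ROOT}`, and applies the `(FBgn|FBti)\d+` ID pattern. Only the parts of the config that survive on the slide are used. The expansion rule, the function name `load_config`, and the `/opt/gmod` path are assumptions, and this is not the Bulkfiles implementation.

```python
import re
import xml.etree.ElementTree as ET
from string import Template

# Trimmed copy of the slide's config: only the <opt> attributes and
# the <db> element, with a closing tag added so it parses.
CONFIG_XML = '''<opt name="fbbulk-r3" relid="3" ROOT="${GMOD_ROOT}/"
     TMP="${GMOD_ROOT}/tmp" datadir="genomes/Drosophila_melanogaster">
  <db driver="Pg" name="dmel_chado" host="localhost" port="7302"
      user="" password="" />
</opt>'''

# ID pattern taken from the slide.
ID_PATTERN = re.compile(r"(FBgn|FBti)\d+")


def load_config(xml_text, env):
    """Return <opt> attributes with ${VAR} expanded, plus db settings."""
    root = ET.fromstring(xml_text)
    opts = {key: Template(value).safe_substitute(env)
            for key, value in root.attrib.items()}
    db = root.find("db")
    opts["db"] = dict(db.attrib) if db is not None else {}
    return opts


if __name__ == "__main__":
    cfg = load_config(CONFIG_XML, {"GMOD_ROOT": "/opt/gmod"})  # hypothetical path
    print(cfg["ROOT"], cfg["TMP"], cfg["db"]["name"])
    print(bool(ID_PATTERN.fullmatch("FBgn0000001")))  # made-up ID
```

`safe_substitute` leaves unknown variables untouched instead of raising, which is a forgiving choice for a config that may be shared across machines.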
13
Bulkfiles quick test
# get the software
cvs -d $cvsd co -d GMODTools schema/GMODTools
# load a genome chado db into Postgres
wget http://sgdlite.princeton.edu/download/sgdlite/2004_05_19_sgdlite.sql.gz
createdb sgdlite_20040519
(zcat *sgdlite.sql.gz | psql -d sgdlite_20040519 -f - ) >& log.load
# generate file set for sgdbulk1
cd GMODTools
env GMOD_ROOT=$PWD perl -I./lib/ bin/bulkfiles.pl sgdbulk1
14
ARGOS http://www.gmod.org/argos
15
ARGOS Genome DBs
16
ARGOS Focus
- Automate genome database install & update
- Eliminate { fetch, compile, install, configure, … } cycle
- Developers test, compile, config once; others copy/run
- Start new project quickly: copy existing project & edit to suit
- Clone servers easily (local cluster; global mirrors; company/lab; laptop)
- Compatible with most GMOD projects
- Secure collaborative genome db features
- Goal: easy for biologists to use with minimal informatics expertise
17
ARGOS Components
18
ARGOS INSTALL
20
Edit wFleaBase
21
LuceGene (‘Lucy Jean’) for Genome Information Search and Retrieval
22
LuceGene
Document/Object Search and Retrieval in Genome Databases
- High-volume data search and retrieval system for genomics and bioinformatics databases
- Standard search features: booleans, phrase, near, relevance
- Performance exceeds and extends relational databases
- Suited to range of genome data: genes, literature, sequences, XML annotations, Medline abstracts, HTML, PDF and text documents.
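The boolean and phrase search features listed above rest on an inverted index that records where each term occurs. The following is a toy, pure-Python sketch of that idea, not the LuceGene or Lucene API; the function names and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_index(docs):
    """Map each term to {doc_id: [positions]} (a positional index)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def search_and(index, *terms):
    """Documents containing every term."""
    sets = [set(index.get(t.lower(), {})) for t in terms]
    return set.intersection(*sets) if sets else set()


def search_or(index, *terms):
    """Documents containing at least one term."""
    return set().union(*(index.get(t.lower(), {}) for t in terms))


def search_phrase(index, phrase):
    """Documents where the terms occur consecutively, in order."""
    terms = phrase.lower().split()
    if not terms:
        return set()
    hits = set()
    for doc_id in search_and(index, *terms):
        starts = index[terms[0]][doc_id]
        if any(all(p + i in index[t][doc_id] for i, t in enumerate(terms))
               for p in starts):
            hits.add(doc_id)
    return hits


# Made-up documents for the example.
DOCS = {
    "d1": "white eye color gene",
    "d2": "eye development gene",
    "d3": "color of the white eye",
}

if __name__ == "__main__":
    idx = build_index(DOCS)
    print(sorted(search_and(idx, "white", "eye")))
    print(sorted(search_phrase(idx, "eye color")))
```

Storing positions, not just document IDs, is what makes phrase and "near" queries possible on top of plain boolean matching.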
23
Example LuceGene libraries
FlyBase database
- Annotation GAME XML, Medline XML (gamexml, medxml)
- Genes, Annotation, References (fbgn, fban, fbrf)
- Web, literature PDF Documents (docs)
- Unified Gene Page XML (ugpxml)
- Sequences, Genome Features (seqs)
euGenes database
- Gene summaries, Sequences, Genome Features
- Unified Gene Page XML
- Web Documents
wFleaBase database
- Sequences, Medline XML, Web documents
24
Thanks to these folks
- Josh Goodman (gmod)
- Paul Poole (gmod/iubio)
- Hardik Sheth (flybase)
- Nihar Sheth (flybase)
- Vasanth Singan (gmod)
- Victor Strelets (flybase)
And to many developers whose work we learn from and borrow from
25
Tool Status
- GMOD Tools: in use to make FlyBase public data; tested with SGDLite
- Argos framework: used now for 3 DBs; replicated in UK, JP; several test dbs
- LuceGene: indexer working well; web interface needs work
- Genome Directory System: preliminary. http://flybase.net/ws/services/Directory
- Unified Gene Pages: need time and collaborators. Have FlyBase, euGenes UGP XML and other-mod web page scraper