A generic and modular platform for automated sequence processing and annotation Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo.

A generic and modular platform for automated sequence processing and annotation
Arthur Gruber
Instituto de Ciências Biomédicas, Universidade de São Paulo
AG-ICB-USP

Sequence processing and annotation
- Analyzing and processing sequencing reads is a tedious and error-prone job
- It is a multistep process
- All sequences are submitted to the same processing steps
- Sequences processed by a given step are the input for the next one
- The steps require different programs
- Integrated system: PIPELINE
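The idea that the output of one step is the input of the next can be sketched in a few lines of Python. This is only an illustration of the chaining principle, not EGene code (EGene is written in Perl); the record layout and the two step functions `uppercase` and `drop_short` are hypothetical.

```python
from functools import reduce
from typing import Callable, Dict, List

Read = Dict[str, str]                      # hypothetical record: {"id": ..., "seq": ...}
Step = Callable[[List[Read]], List[Read]]  # every step maps reads to reads


def uppercase(reads: List[Read]) -> List[Read]:
    """Normalize every sequence to upper case."""
    return [{**r, "seq": r["seq"].upper()} for r in reads]


def drop_short(reads: List[Read], min_len: int = 5) -> List[Read]:
    """Discard reads shorter than min_len bases."""
    return [r for r in reads if len(r["seq"]) >= min_len]


def run_pipeline(reads: List[Read], steps: List[Step]) -> List[Read]:
    """Feed the output of each step into the next one."""
    return reduce(lambda data, step: step(data), steps, reads)


reads = [{"id": "r1", "seq": "acgtacgt"}, {"id": "r2", "seq": "acg"}]
result = run_pipeline(reads, [uppercase, drop_short])
print(result)
```

Because every step shares the same signature, steps can be reordered or swapped without touching the driver function.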

Problem: how to build pipelines
- Creating scripts for new pipelines involves good programming knowledge
- Once created, most pipelines are difficult to change and customize
- Many programs must be used: Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.

Problem: how to build pipelines
- Each program needs a specific environment to work (e.g. directories with specific names)
- Each program produces output in different ways and formats
- Integrating programs is a hard task

Solution: creating an environment to build pipelines
Requirements:
- Abstract the environment of each program
- Abstract output format
- Easily specify “coupling” of different programs
- Document how the pipe was built
- Easy to inspect and monitor
- Easy to store (e.g. in a database)
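One way to meet these requirements is to describe the pipeline as plain data and let a small driver look up each component by name. The sketch below assumes this design for illustration only; the component names `mask_prefix` and `filter_size`, the JSON layout and the registry are hypothetical and do not reproduce EGene's configuration format.

```python
import json


def filter_size(seqs, min_len):
    """Keep sequences with at least min_len bases."""
    return [s for s in seqs if len(s) >= min_len]


def mask_prefix(seqs, prefix):
    """Replace a known leading prefix (e.g. a primer) with X characters."""
    return ["X" * len(prefix) + s[len(prefix):] if s.startswith(prefix) else s
            for s in seqs]


# Hypothetical registry: component name -> function(sequences, **params)
COMPONENTS = {"filter_size": filter_size, "mask_prefix": mask_prefix}

# The pipeline is plain data, so it documents itself and is easy to store.
config_text = json.dumps([
    {"component": "mask_prefix", "params": {"prefix": "GGA"}},
    {"component": "filter_size", "params": {"min_len": 6}},
])


def run_from_config(seqs, text):
    """Run the configured steps in order and log how many sequences survive each."""
    log = []
    for step in json.loads(text):
        seqs = COMPONENTS[step["component"]](seqs, **step["params"])
        log.append((step["component"], len(seqs)))
    return seqs, log


out, log = run_from_config(["GGAACGTAC", "GGAAC", "TTTTTTT"], config_text)
print(out)
print(log)
```

The log gives a simple way to monitor the run, and the configuration string can be kept alongside the results as a record of how the pipeline was built.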

EGene
Aims and characteristics:
- To develop a platform for pipeline construction that is simple to use and configure
  - Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
  - They are too complex for small- and mid-sized labs
- Platform should be generic
  - Useful for any sequencing project
- Platform should provide components for the most common tasks
- New components should be easy to develop

EGene: a generic platform for pipeline construction
Characteristics:
- Written in Perl language
- Modular
- Easy to build specific components to interact with third-party programs
- EGene components can be integrated to fulfill user-specific needs
- CoEd: a graphical configuration editor written in Java, with a user-friendly interface

Sequence processing pipeline: the Eimeria ORESTES project
- Input chromatogram files
- Base calling and quality assignment: Phred
- Primer screening and masking: Cross_Match
- Vector masking and screening: Cross_Match
- Quality filtering: Filter-quality.pl
- End trimming: Trim-ends.pl
- Size filtering: Filter-size
- Mitochondrial sequence filtering: Cross_Match
- Plastid sequence filtering: Cross_Match
- Ribosomal sequence filtering: Cross_Match
- Repetitive sequence filtering: Cross_Match
- Bacterial sequence filtering: Blast
- Chicken sequence filtering: Blast
- Human sequence filtering: Blast
- Assembly: CAP3
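The quality filtering, end trimming and size filtering steps can be illustrated with a minimal Python sketch operating on Phred-style quality values. The thresholds and the three functions are assumptions made for this example; they do not reproduce the logic of Filter-quality.pl, Trim-ends.pl or Filter-size.

```python
def passes_quality(quals, min_q=20, min_fraction=0.5):
    """Accept a read when at least min_fraction of its bases reach min_q."""
    if not quals:
        return False
    good = sum(q >= min_q for q in quals)
    return good / len(quals) >= min_fraction


def trim_ends(seq, quals, min_q=20):
    """Remove bases with quality below min_q from both ends of the read."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


def passes_size(seq, min_len=4):
    """Accept a read only if it still has at least min_len bases."""
    return len(seq) >= min_len


seq = "NNACGTACGN"
quals = [5, 8, 30, 35, 40, 12, 38, 33, 25, 9]

kept = None
if passes_quality(quals):
    trimmed_seq, trimmed_quals = trim_ends(seq, quals)
    if passes_size(trimmed_seq):
        kept = trimmed_seq
print(kept)
```

Only the low-quality ends are removed here; the single low-quality base in the middle of the read is kept, which is why a separate overall quality filter is still useful.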

Sequence processing and graphical report

How to get EGene
- Internet site:
- EGene is distributed under the GNU General Public License
- EGene is Open Source


Recent developments
- Incorporation of forks
- Enhancement of the data model: incorporation of annotation evidence
- Development of annotation components
- Evidence-based annotation

Genome annotation
- Annotation is the process of adding information to DNA sequence.
- The information usually has a DNA coordinate.
- Features could be repeats, genes, promoters, protein domains, etc.
- Features can be cross-referenced to other databases (e.g. Pfam/Pubmed)
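A feature with coordinates and cross-references can be modeled as a small record. The Python sketch below is one possible representation, not EGene's data model; the field names and the accession `PF00000` are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Feature:
    kind: str                 # e.g. "gene", "repeat_region"
    start: int                # 1-based, inclusive
    end: int                  # 1-based, inclusive
    strand: str = "+"
    xrefs: Dict[str, str] = field(default_factory=dict)  # database -> accession

    def length(self) -> int:
        """Number of bases covered by the feature."""
        return self.end - self.start + 1

    def overlaps(self, other: "Feature") -> bool:
        """True when the two features share at least one base."""
        return self.start <= other.end and other.start <= self.end


gene = Feature("gene", 100, 400, "+", {"Pfam": "PF00000"})  # placeholder accession
repeat = Feature("repeat_region", 350, 500)
print(gene.length(), gene.overlaps(repeat))
```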


Annotation file
A typical annotation file contains:
- A header with:
  - Information about the sequence
  - Organism
  - Authors
  - References
  - Comments
- A feature table containing:
  - Sequence features and coordinates

Feature table format
- Flatfile format
- Format definition available at
- Covers DDBJ/EMBL/GenBank
- Defines all accepted annotation terms and hierarchy

Incorporating annotation
- EGene’s data model was enriched to incorporate annotation information into the representation of the sequences
- All collected data are converted into a proprietary XML format
- The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.
- We provide some converters, and new ones can be easily implemented
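A converter from XML to GFF3 can be quite short. The sketch below uses an invented XML layout, since EGene's own schema is not shown in these slides; only the GFF3 side (nine tab-separated columns and percent-encoding of reserved characters in attribute values) follows the published format.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote

# Hypothetical XML layout, made up for this example.
XML_TEXT = """
<sequence id="contig1">
  <feature type="gene" start="100" end="400" strand="+" source="demo">
    <attr key="ID" value="gene1"/>
    <attr key="Note" value="kinase; putative"/>
  </feature>
</sequence>
"""


def xml_to_gff3(text):
    """Convert the hypothetical XML above into GFF3 text."""
    root = ET.fromstring(text.strip())
    lines = ["##gff-version 3"]
    for feat in root.iter("feature"):
        # Reserved characters such as ';' must be percent-encoded in values.
        attrs = ";".join(
            f"{a.get('key')}={quote(a.get('value'), safe=' ')}"
            for a in feat.findall("attr")
        )
        columns = [
            root.get("id"),       # seqid
            feat.get("source"),   # source
            feat.get("type"),     # type
            feat.get("start"),    # start (1-based)
            feat.get("end"),      # end (inclusive)
            ".",                  # score: not available
            feat.get("strand"),   # strand
            ".",                  # phase: only meaningful for CDS features
            attrs or ".",         # attributes
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines)


gff = xml_to_gff3(XML_TEXT)
print(gff)
```

A converter for another target format would reuse the same parsing loop and change only how each feature is written out.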

Annotation components
A comprehensive set of annotation components has been implemented:
- ORF finding and translation
- Tandem repeats finding: TRF, String, mREPS
- tRNA finding: tRNAscan-SE
- Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
- Motif finding: HMMer x Pfam, RPS-BLAST, InterproScan
- Similarity search: BLAST
- EST mapping: Sim4, Exonerate
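As an illustration of the first item, ORF finding and translation, here is a minimal Python sketch. It assumes the standard genetic code and scans only the three forward frames; it is not EGene's component, and the function name `find_orfs` and its `min_codons` parameter are made up for this example.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def find_orfs(seq, min_codons=2):
    """Return (start, end, protein) for ATG..stop ORFs in the forward frames.

    Coordinates are 1-based; end includes the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start, protein = None, []
        for i in range(frame, len(seq) - 2, 3):
            aa = CODON_TABLE.get(seq[i:i + 3], "X")  # X for ambiguous codons
            if start is None:
                if aa == "M":                 # open an ORF at the first ATG
                    start, protein = i, ["M"]
            elif aa == "*":                   # a stop codon closes the ORF
                if len(protein) >= min_codons:
                    orfs.append((start + 1, i + 3, "".join(protein)))
                start = None
            else:
                protein.append(aa)
    return orfs


orfs = find_orfs("CCATGAAATTTTAGGG")
print(orfs)
```

A complete component would also scan the reverse complement and handle ORFs that run off the end of the sequence, which this sketch ignores.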

Annotation components
A comprehensive set of annotation components has been implemented:
- Transmembrane domain finding: TMHMM, Phobius
- Signal peptide: SignalP, Phobius
- GPI anchor: DGPI
- GO mapping and quantification
- Orthology assignment and quantification: COG/KOG
- Pathway mapping: KEGG
- Annotation visualization with GBrowse: web inspection
- Annotation report generation: feature table, GFF3
- Web site generation: HTML/PHP

EGene generates annotation files that can be inspected using regular editors (Artemis, Apollo, etc.)

EGene’s annotation
EGene can generate annotation in different formats:
- XML: local use, easy to feed a database management system
- Feature table
  - Convenient for manual curation on Artemis
  - Ready for submission to public databases
- GFF3
  - Current annotation interchange format
  - Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
  - Compliant with Sequence Ontology terms

EGene performs GO term mapping and constructs web pages for inspection

EGene performs an integrated and quantitative orthology analysis (COG/KOG) and constructs web pages

EGene automatically constructs a full web site for evidence inspection

Current developments
- Full integration with a database management system
- Automated task distribution management across multiple processing nodes
- Development of a graphical interface for evidence inspection and manual curation
- “Intelligent” annotation: use of probabilistic methods to evaluate evidence and designate protein functions

Why use EGene2?
- Ideal for small- and mid-sized laboratories
  - Genome and EST sequencing projects
- Conceived for biologists
  - Does not require programming skills
- Generic tool for any sequencing/annotation project, customized for specific user’s requirements
- Very easy to implement new components
- Multiplatform: MacOS, UNIX, Linux, etc.
- Well documented: HOWTOs, tutorials, example datasets available
- Easy configuration
  - CoEd: application with a GUI for pipeline construction
  - Generic pipeline templates provided

Research team
- Prof. Alan M. Durham, IME-USP
- Annotation
  - Milene Ferro, ICB-USP
  - Ricardo Yamamoto Abe, IME-USP
  - Luiz Thiberio Rangel, ICB-USP
- Sequence pre-processing
  - André Yoshiaki Kashiwabara, IME-USP
  - Fernando Tadashi G. Matsunaga, ICB-USP
  - Paulo Henrique Ahagon, ICB-USP
  - Leonardo Varuzza, ICB-USP

Financial Support
- FAPESP: São Paulo State Science Foundation
- CNPq: National Research Council

Thanks for your attention