Published by Mervyn Grant. Modified over 9 years ago.
1
A generic and modular platform for automated sequence processing and annotation
Arthur Gruber
Instituto de Ciências Biomédicas, Universidade de São Paulo
AG-ICB-USP
2
Sequence processing and annotation
- Analyzing and processing sequencing reads is a tedious and error-prone job
- It is a multistep process
- All sequences are submitted to the same processing steps
- Sequences processed by a given step are the input for the next one
- Different steps require different programs
- Integrated system – PIPELINE
3
Problem: how to build pipelines
- Creating scripts for new pipelines requires good programming knowledge
- Once created, most pipelines are difficult to change and customize
- Many programs must be used: Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.
4
Problem: how to build pipelines
- Each program needs a specific environment to work (e.g. directories with specific names)
- Each program produces output in different ways and formats
- Integrating programs is a hard task
5
Solution: creating an environment to build pipelines
Requirements:
- Abstract the environment of each program
- Abstract the output format
- Easily specify the “coupling” of different programs
- Document how the pipeline was built
- Easy to inspect and monitor
- Easy to store (e.g. in a database)
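The requirements above can be illustrated with a minimal Python sketch. This is not EGene's code (EGene is written in Perl); the `Read` and `Pipeline` classes and the two example components are hypothetical names chosen for illustration. The idea shown is that every component takes and returns the same record type, so components can be coupled in any order, and the pipeline records what each step did.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Read:
    """Hypothetical common record passed between all components."""
    name: str
    bases: str
    notes: Dict[str, str] = field(default_factory=dict)


# A component is any callable that maps a list of reads to a list of reads.
Component = Callable[[List[Read]], List[Read]]


@dataclass
class Pipeline:
    steps: List[Component] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, step: Component) -> "Pipeline":
        """Couple one more component to the end of the pipeline."""
        self.steps.append(step)
        return self

    def run(self, reads: List[Read]) -> List[Read]:
        """Feed the output of each step into the next and log the counts."""
        for step in self.steps:
            before = len(reads)
            reads = step(reads)
            self.log.append(f"{step.__name__}: {before} -> {len(reads)}")
        return reads


def uppercase(reads: List[Read]) -> List[Read]:
    """Example component: normalise bases to upper case."""
    return [Read(r.name, r.bases.upper(), dict(r.notes)) for r in reads]


def drop_short(reads: List[Read], min_len: int = 5) -> List[Read]:
    """Example component: discard reads shorter than min_len bases."""
    return [r for r in reads if len(r.bases) >= min_len]


pipe = Pipeline().add(uppercase).add(drop_short)
result = pipe.run([Read("a", "acgtacgt"), Read("b", "acg")])
```

After the run, `pipe.log` holds one line per step, which is one simple way to keep a pipeline easy to inspect and to store.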
6
EGene
Aims and characteristics:
- To develop a platform for pipeline construction that is simple to use and configure
  - Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
  - They are too complex for small- and mid-sized labs
- Platform should be generic
  - Useful for any sequencing project
- Platform should provide components for the most common tasks
- New components should be easy to develop
7
EGene: a generic platform for pipeline construction
Characteristics:
- Written in Perl
- Modular
- Easy to build specific components to interact with third-party programs
- EGene components can be integrated to fulfill user-specific needs
- CoEd – a graphical configuration editor written in Java, with a user-friendly interface
15
Sequence processing pipeline
The Eimeria ORESTES project
- Size filtering: Filter-size
- End trimming: Trim-ends.pl
- Quality filtering: Filter-quality.pl
- Vector masking and screening: Cross_Match
- Primer screening and masking: Cross_Match
- Base calling and quality assignment: Phred
- Input chromatogram files
- Assembly: CAP3
- Human sequence filtering: Blast
- Chicken sequence filtering: Blast
- Bacterial sequence filtering: Blast
- Repetitive sequence filtering: Cross_Match
- Ribosomal sequence filtering: Cross_Match
- Plastid sequence filtering: Cross_Match
- Mitochondrial sequence filtering: Cross_Match
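As a rough illustration of what the quality filtering, end trimming and size filtering steps do, here is a minimal Python sketch. It is not the code of Filter-quality.pl or Trim-ends.pl; the function names, the thresholds and the order in which the three steps run are all assumptions made for the example. Reads are represented as a string of bases plus a list of Phred-style quality scores.

```python
from typing import List, Tuple

# A read is (bases, per-base quality scores); this layout is an assumption.
Read = Tuple[str, List[int]]


def passes_quality(read: Read, min_q: int = 20, min_fraction: float = 0.5) -> bool:
    """Keep a read only if enough of its bases reach the quality cutoff."""
    _, quals = read
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Remove low-quality bases from both ends of the read."""
    bases, quals = read
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


def passes_size(read: Read, min_len: int = 100) -> bool:
    """Keep a read only if it is still long enough after trimming."""
    return len(read[0]) >= min_len


def process(reads: List[Read], min_len: int = 100) -> List[Read]:
    """Apply quality filtering, end trimming and size filtering in turn."""
    kept = []
    for read in reads:
        if not passes_quality(read):
            continue
        trimmed = trim_ends(read)
        if passes_size(trimmed, min_len):
            kept.append(trimmed)
    return kept


example_reads = [
    ("ACGTACGT", [5, 30, 30, 30, 30, 30, 30, 5]),  # low-quality ends get trimmed
    ("ACGT", [5, 5, 5, 5]),                        # fails the quality filter
    ("ACG", [30, 30, 30]),                         # too short
]
kept_reads = process(example_reads, min_len=4)
```

In this example only the first read survives, with its two low-quality end bases removed.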
16
Sequence processing and graphical report
17
How to get EGene
- Internet site: http://www.coccidia.icb.usp.br/egene
- EGene is distributed under the GNU General Public License
- EGene is Open Source
20
Recent developments
- Incorporation of forks
- Enhancement of the data model – incorporation of annotation evidence
- Development of annotation components
- Evidence-based annotation
24
Genome annotation
- Annotation is the process of adding information to a DNA sequence.
- The information usually has a DNA coordinate.
- Features could be repeats, genes, promoters, protein domains, etc.
- Features can be cross-referenced to other databases (e.g. Pfam/PubMed)
26
Annotation file
A typical annotation file contains:
- A header with:
  - Information about the sequence
  - Organism
  - Authors
  - References
  - Comments
- A feature table containing sequence features and coordinates
27
Feature table format
- Flatfile format
- Format definition available at http://www.ncbi.nlm.nih.gov/projects/collab/FT/
- Covers DDBJ/EMBL/GenBank
- Defines all accepted annotation terms and hierarchy
28
Incorporating annotation
- EGene’s data model was enriched to incorporate annotation information into the representation of the sequences
- All collected data is converted into a proprietary XML format
- The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.
- We provide some converters and new ones can be easily implemented
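To give an idea of what such a converter involves, here is a minimal Python sketch that turns a small XML document into GFF3. The XML layout used here (a `sequence` element with `feature` children and their attributes) is invented for the example and is not EGene's actual XML schema. The GFF3 side follows the published format: a version header and nine tab-separated columns per feature.

```python
import xml.etree.ElementTree as ET

# Hypothetical input; EGene's real XML format is not shown in this presentation.
EXAMPLE_XML = """
<sequence id="contig1">
  <feature id="gene1" type="gene" start="100" end="900" strand="+" source="GlimmerHMM"/>
  <feature id="trna1" type="tRNA" start="1200" end="1272" strand="-" source="tRNAscan-SE"/>
</sequence>
"""


def gff3_escape(value: str) -> str:
    """Percent-encode the characters that are reserved in GFF3 attributes."""
    # '%' must be handled first so the escapes added below are not re-escaped.
    for ch in "%;=&,\t\n\r":
        value = value.replace(ch, "%{:02X}".format(ord(ch)))
    return value


def xml_to_gff3(xml_text: str) -> str:
    """Convert the hypothetical XML layout above into GFF3 text."""
    root = ET.fromstring(xml_text)
    seqid = root.get("id")
    lines = ["##gff-version 3"]
    for feat in root.iter("feature"):
        columns = [
            seqid,                     # 1: sequence ID
            feat.get("source", "."),   # 2: program that produced the evidence
            feat.get("type"),          # 3: feature type (Sequence Ontology term)
            feat.get("start"),         # 4: start, 1-based
            feat.get("end"),           # 5: end, inclusive
            ".",                       # 6: score (not available here)
            feat.get("strand", "."),   # 7: strand
            ".",                       # 8: phase (only meaningful for CDS features)
            "ID=" + gff3_escape(feat.get("id")),  # 9: attributes
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


gff3_text = xml_to_gff3(EXAMPLE_XML)
```

A converter for another target format would reuse the same parsing loop and change only how each feature is written out.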
29
Annotation components
A comprehensive set of annotation components has been implemented:
- ORF finding and translation
- Tandem repeats finding: TRF, String, mREPS
- tRNA finding: tRNAscan-SE
- Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
- Motif finding: HMMer x Pfam, RPS-BLAST, InterproScan
- Similarity search: BLAST
- EST mapping: Sim4, Exonerate
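As an example of the simplest task in this list, ORF finding and translation can be sketched in a few lines of Python. This is an illustration only, not EGene's component: it assumes the standard genetic code, scans only the forward strand, and reports ORFs that run from an ATG to the first in-frame stop codon. The function names and the `min_codons` parameter are hypothetical.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, written in the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def find_orfs(dna: str, min_codons: int = 2) -> List[Tuple[int, int, str]]:
    """Return (start, end, protein) for forward-strand ORFs.

    Coordinates are 1-based and inclusive, and the end includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and CODON_TABLE.get(codon) == "*":
                protein = translate(dna[start:i])
                if len(protein) >= min_codons:
                    orfs.append((start + 1, i + 3, protein))
                start = None
    return orfs


orfs = find_orfs("CCATGGCTTAAGG")
```

A real component would also scan the reverse strand and support alternative genetic codes.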
30
Annotation components
A comprehensive set of annotation components has been implemented:
- Transmembrane domain finding: TMHMM, Phobius
- Signal peptide: SignalP, Phobius
- GPI anchor: DGPI
- GO mapping and quantification
- Orthology assignment and quantification: COG/KOG
- Pathway mapping: KEGG
- Annotation visualization with GBrowse: web inspection
- Annotation report generation: feature table, GFF3
- Web site generation: HTML/PHP
34
EGene generates annotation files that can be inspected using standard annotation editors (Artemis, Apollo, etc.)
35
EGene’s annotation
EGene can generate annotation in different formats:
- XML – local use, easy to feed a database management system
- Feature table
  - Convenient for manual curation on Artemis
  - Ready for submission to public databases
- GFF3
  - Current annotation interchange format
  - Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
  - Compliant with Sequence Ontology terms
40
EGene performs GO term mapping and constructs web pages for inspection
61
EGene performs an integrated and quantitative orthology analysis (COG/KOG) and constructs web pages
82
EGene automatically constructs a full web site for evidence inspection
96
Current developments
- Full integration with a database management system
- Automated task distribution management across multiple processing nodes
- Development of a graphical interface for evidence inspection and manual curation
- “Intelligent” annotation – use of probabilistic methods to evaluate evidence and designate protein functions
97
Why use EGene2?
- Ideal for small- and mid-sized laboratories
- Genome and EST sequencing projects
- Conceived for biologists
- Does not require programming skills
- Generic tool for any sequencing/annotation project – customizable to each user’s requirements
- Very easy to implement new components
- Multiplatform – MacOS, UNIX, Linux, etc.
- Well documented – HOWTOs, tutorials, example datasets available
- Easy configuration
- CoEd – application with a GUI for pipeline construction
- Generic pipeline templates provided
98
Research team
Prof. Alan M. Durham – IME-USP
Annotation
- Milene Ferro – ICB-USP
- Ricardo Yamamoto Abe – IME-USP
- Luiz Thiberio Rangel – ICB-USP
Sequence pre-processing
- André Yoshiaki Kashiwabara – IME-USP
- Fernando Tadashi G. Matsunaga – ICB-USP
- Paulo Henrique Ahagon – ICB-USP
- Leonardo Varuzza – ICB-USP
99
Financial Support
- FAPESP – São Paulo State Science Foundation
- CNPq – National Research Council
100
Thanks for your attention