A generic and modular platform for automated sequence processing and annotation Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo.


1 A generic and modular platform for automated sequence processing and annotation Arthur Gruber Instituto de Ciências Biomédicas Universidade de São Paulo AG-ICB-USP 2

2 Sequence processing and annotation
- Analyzing and processing sequencing reads is a tedious and error-prone job
- Multistep process
- All sequences are submitted to the same processing steps
- Sequences processed by a given step are the input for the next one
- The steps require different programs
- Integrated system: PIPELINE
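
The idea that each step's output becomes the next step's input can be sketched in a few lines of Python. This is only an illustration, not EGene code (EGene itself is written in Perl); the step functions `to_upper` and `drop_short` are hypothetical stand-ins for calls to external programs.

```python
from typing import Callable, Iterable, List

# A step takes a batch of sequences and returns the processed batch.
Step = Callable[[List[str]], List[str]]


def run_pipeline(sequences: List[str], steps: Iterable[Step]) -> List[str]:
    """Feed the output of each step into the next one, in order."""
    for step in steps:
        sequences = step(sequences)
    return sequences


# Two hypothetical steps, standing in for wrappers around real programs.
def to_upper(seqs: List[str]) -> List[str]:
    return [s.upper() for s in seqs]


def drop_short(seqs: List[str], min_len: int = 5) -> List[str]:
    return [s for s in seqs if len(s) >= min_len]


reads = ["acgtacgt", "acg", "ttgacca"]
result = run_pipeline(reads, [to_upper, drop_short])
print(result)
```

Because every step shares one signature, steps can be reordered or swapped without touching `run_pipeline`.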

3 Problem: how to build pipelines
- Creating scripts for new pipelines requires good programming knowledge
- Once created, most pipelines are difficult to change and customize
- Many programs must be used: Phred, Cross_match, Phrap, CAP3, Blast, HMMer, InterproScan, TMHMM, etc.

4 Problem: how to build pipelines
- Each program needs a specific environment to work (e.g. directories with specific names)
- Each program produces output in different ways and formats
- Integrating programs is a hard task

5 Solution: creating an environment to build pipelines
Requirements:
- Abstract the environment of each program
- Abstract output format
- Easily specify “coupling” of different programs
- Document how the pipe was built
- Easy to inspect and monitor
- Easy to store (e.g. in a database)
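
One way to meet these requirements is to give every component the same interface and to describe the pipeline as plain data. The Python sketch below assumes a hypothetical `SeqRecord` type, a hypothetical component registry and two made-up components; it does not reproduce EGene's actual configuration format.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SeqRecord:
    """Common representation shared by every component (hypothetical)."""
    name: str
    bases: str
    notes: List[str] = field(default_factory=list)


# Registry mapping component names to functions with one uniform signature.
COMPONENTS: Dict[str, Callable[..., List[SeqRecord]]] = {}


def component(name: str):
    def register(func):
        COMPONENTS[name] = func
        return func
    return register


@component("mask_prefix")
def mask_prefix(records, length=3):
    # Stand-in for a wrapper around an external masking program.
    out = []
    for r in records:
        masked = "X" * min(length, len(r.bases)) + r.bases[length:]
        out.append(SeqRecord(r.name, masked, r.notes + ["mask_prefix"]))
    return out


@component("filter_size")
def filter_size(records, min_length=100):
    return [r for r in records if len(r.bases) >= min_length]


def run(config, records):
    """Apply the components named in the configuration, in order."""
    for entry in config:
        func = COMPONENTS[entry["component"]]
        records = func(records, **entry.get("params", {}))
    return records


# The configuration is plain data: it documents how the pipeline was built
# and can be stored as JSON or in a database.
config = [
    {"component": "mask_prefix", "params": {"length": 2}},
    {"component": "filter_size", "params": {"min_length": 5}},
]
records = [SeqRecord("r1", "ACGTAC"), SeqRecord("r2", "ACG")]
result = run(config, records)
print(json.dumps(config))
print([(r.name, r.bases) for r in result])
```

Keeping the coupling in a data structure rather than in a script is what lets a graphical editor build or modify a pipeline without any programming.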

6 EGene
Aims and characteristics:
- To develop a platform for pipeline construction that is simple to use and configure
- Big sequencing centers already have sophisticated pipelines, but many are not published and/or publicly available
- They are too complex for small- and mid-sized labs
- Platform should be generic
- Useful for any sequencing project
- Platform should provide components for the most common tasks
- New components should be easy to develop

7 EGene: a generic platform for pipeline construction
Characteristics:
- Written in Perl
- Modular
- Easy to build specific components to interact with third-party programs
- EGene components can be integrated to fulfill user-specific needs
- CoEd: a graphical configuration editor written in Java, with a user-friendly interface

15 Sequence processing pipeline: the Eimeria ORESTES project
[Pipeline diagram; steps and the program used for each, in the order they appear in the transcript]
- Size filtering: Filter-size
- End trimming: Trim-ends.pl
- Quality filtering: Filter-quality.pl
- Vector masking and screening: Cross_Match
- Primer screening and masking: Cross_Match
- Base calling and quality assignment: Phred
- Input chromatogram files
- Assembly: CAP3
- Human sequence filtering: Blast
- Chicken sequence filtering: Blast
- Bacterial sequence filtering: Blast
- Repetitive sequence filtering: Cross_Match
- Ribosomal sequence filtering: Cross_Match
- Plastid sequence filtering: Cross_Match
- Mitochondrial sequence filtering: Cross_Match
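
To make three of these steps concrete, here is a minimal Python sketch of quality filtering, end trimming and size filtering on reads that carry per-base Phred quality scores. The thresholds, the order of the steps and all function names are assumptions made for illustration; this is not the logic of Filter-quality.pl, Trim-ends.pl or Filter-size.

```python
from typing import List, Tuple

Read = Tuple[str, List[int]]  # (bases, per-base Phred quality scores)


def passes_quality(read: Read, min_q: int = 20, min_fraction: float = 0.7) -> bool:
    """Keep reads in which enough bases reach the quality threshold."""
    _, quals = read
    if not quals:
        return False
    good = sum(1 for q in quals if q >= min_q)
    return good / len(quals) >= min_fraction


def trim_ends(read: Read, min_q: int = 20) -> Read:
    """Remove low-quality bases from both ends of the read."""
    bases, quals = read
    start, end = 0, len(bases)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return bases[start:end], quals[start:end]


def passes_size(read: Read, min_length: int = 6) -> bool:
    """Keep reads that are still long enough after trimming."""
    return len(read[0]) >= min_length


def process(reads: List[Read]) -> List[Read]:
    kept = []
    for read in reads:
        if not passes_quality(read):
            continue
        read = trim_ends(read)
        if passes_size(read):
            kept.append(read)
    return kept


reads = [
    ("NACGTACGTN", [5, 30, 30, 30, 30, 30, 30, 30, 30, 8]),  # kept after trimming
    ("ACGT", [10, 10, 30, 10]),                               # fails the quality filter
    ("ACGTAC", [10, 30, 30, 30, 30, 30]),                     # too short after trimming
]
kept = process(reads)
print(kept)
```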

16 Sequence processing and graphical report

17 How to get EGene
Internet site: http://www.coccidia.icb.usp.br/egene
- EGene is distributed under the GNU General Public License
- EGene is Open Source

20 Recent developments
- Incorporation of forks
- Enhancement of the data model: incorporation of annotation evidence
- Development of annotation components
- Evidence-based annotation

24 Genome annotation
- Annotation is the process of adding information to a DNA sequence.
- The information usually has a DNA coordinate.
- Features could be repeats, genes, promoters, protein domains, etc.
- Features can be cross-referenced to other databases (e.g. Pfam/Pubmed)

26 Annotation file
A typical annotation file contains:
- A header with:
  - Information about the sequence
  - Organism
  - Authors
  - References
  - Comments
- A feature table containing:
  - Sequence features and coordinates

27 Feature table format
- Flatfile format
- Format definition available at http://www.ncbi.nlm.nih.gov/projects/collab/FT/
- Covers DDBJ/EMBL/GenBank
- Defines all accepted annotation terms and hierarchy

28 Incorporating annotation
- EGene’s data model was enriched to incorporate annotation information into the representation of the sequences
- All collected data is converted into a proprietary XML format
- The XML can be easily converted into different annotation formats: Feature Table, GFF3, etc.
- We provide some converters and new ones can be easily implemented
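
A converter of this kind can be quite small. The Python sketch below turns an XML document into GFF3 lines; the XML layout (the `sequence` and `feature` elements and their attributes) is invented for illustration, since EGene's own XML schema is not shown in these slides. The nine tab-separated GFF3 columns and the escaping of reserved characters follow the GFF3 specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML layout, NOT EGene's actual schema.
XML = """
<sequence id="contig1">
  <feature type="gene" start="100" end="900" strand="+" source="Genscan" name="gene 1"/>
  <feature type="tRNA" start="1200" end="1272" strand="-" source="tRNAscan-SE" name="tRNA-Ala"/>
</sequence>
"""


def escape(value: str) -> str:
    """Percent-encode the characters reserved in GFF3 attribute values."""
    for ch in "%;=&,\t\n\r":  # '%' goes first so it is not escaped twice
        value = value.replace(ch, f"%{ord(ch):02X}")
    return value


def xml_to_gff3(xml_text: str) -> str:
    root = ET.fromstring(xml_text.strip())
    lines = ["##gff-version 3"]
    for i, feat in enumerate(root.iter("feature"), start=1):
        attributes = f"ID=feat{i};Name={escape(feat.get('name', ''))}"
        columns = [
            root.get("id"),            # seqid
            feat.get("source", "."),   # program that produced the evidence
            feat.get("type"),          # feature type (Sequence Ontology term)
            feat.get("start"),
            feat.get("end"),
            ".",                       # score (not available here)
            feat.get("strand", "."),
            ".",                       # phase (only meaningful for CDS)
            attributes,
        ]
        lines.append("\t".join(columns))
    return "\n".join(lines) + "\n"


gff3 = xml_to_gff3(XML)
print(gff3, end="")
```

A converter for another target format, such as the feature table, would reuse the same parsing loop and change only how each feature is written out.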

29 Annotation components
A comprehensive set of annotation components has been implemented:
- ORF finding and translation
- Tandem repeats finding: TRF, String, mREPS
- tRNA finding: tRNAscan-SE
- Gene prediction: Genscan, GlimmerM, GlimmerHMM, Twinscan, Phat, ESTscan, SNAP
- Motif finding: HMMer x Pfam, RPS-BLAST, InterproScan
- Similarity search: BLAST
- EST mapping: Sim4, Exonerate

30 Annotation components
A comprehensive set of annotation components has been implemented:
- Transmembrane domain finding: TMHMM, Phobius
- Signal peptide: SignalP, Phobius
- GPI anchor: DGPI
- GO mapping and quantification
- Orthology assignment and quantification: COG/KOG
- Pathway mapping: KEGG
- Annotation visualization with GBrowse: web inspection
- Annotation report generation: feature table, GFF3
- Web site generation: HTML/PHP

34 EGene generates annotation files that can be inspected using regular editors (Artemis, Apollo, etc.)

35 EGene’s annotation
EGene can generate annotation in different formats:
- XML: local use, easy to feed a database management system
- Feature table
  - Convenient for manual curation on Artemis
  - Ready for submission to public databases
- GFF3
  - Current annotation interchange format
  - Manual curation/visualization on Artemis, Apollo and GMOD Genome Browser
  - Compliant with Sequence Ontology terms

40 EGene performs GO term mapping and constructs web pages for inspection

61 EGene performs an integrated and quantitative orthology analysis (COG/KOG) and constructs web pages

82 EGene automatically constructs a full web site for evidence inspection

96 Current developments
- Full integration with a database management system
- Automated task distribution management across multiple processing nodes
- Development of a graphical interface for evidence inspection and manual curation
- “Intelligent” annotation: use of probabilistic methods to evaluate evidence and designate protein functions

97 Why use EGene2?
- Ideal for small- and mid-sized laboratories
- Genome and EST sequencing projects
- Conceived for biologists
- Does not require programming skills
- Generic tool for any sequencing/annotation project, customizable to each user’s requirements
- Very easy to implement new components
- Multiplatform: MacOS, UNIX, Linux, etc.
- Well documented: HOWTOs, tutorials, example datasets available
- Easy configuration
- CoEd: application with a GUI for pipeline construction
- Generic pipeline templates provided

98 Research team
Prof. Alan M. Durham - IME-USP
Annotation
- Milene Ferro - ICB-USP
- Ricardo Yamamoto Abe - IME-USP
- Luiz Thiberio Rangel - ICB-USP
Sequence pre-processing
- André Yoshiaki Kashiwabara - IME-USP
- Fernando Tadashi G. Matsunaga - ICB-USP
- Paulo Henrique Ahagon - ICB-USP
- Leonardo Varuzza - ICB-USP

99 Financial Support
- FAPESP - São Paulo State Science Foundation
- CNPq - National Research Council

100 Thanks for your attention

