Published by Andrea Lawrence. Modified over 9 years ago.
1
Analyzing digital gene expression data in Galaxy
Supervisors: Peter-Bram A.C. ’t Hoen, Kostas Karasavvas
Students: Ilya Kurochkin, Ivan Rusinov
2
Galaxy
Galaxy is an open, web-based platform for accessible, reproducible, and transparent computational biomedical research.
3
Adding a new tool to Galaxy
To add a new tool to Galaxy you need:
- A tool definition file in XML format
- The tool script
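As a rough illustration of the second item, a tool script can be an ordinary command-line program that reads an input path and writes an output path. The sketch below is hypothetical: the `--input` and `--output` option names and the line-counting task are assumptions made for illustration, not the tools from this presentation.

```python
import argparse


def run(input_path, output_path):
    """Count non-empty lines in the input and write a one-line tabular report."""
    with open(input_path) as src:
        n = sum(1 for line in src if line.strip())
    with open(output_path, "w") as dst:
        dst.write(f"non_empty_lines\t{n}\n")
    return n


def main(argv=None):
    # Hypothetical option names; the tool definition file would have to pass
    # the dataset paths using whatever options the script actually declares.
    parser = argparse.ArgumentParser(description="Toy tool script")
    parser.add_argument("--input", required=True, help="path of the input dataset")
    parser.add_argument("--output", required=True, help="path of the output dataset")
    args = parser.parse_args(argv)
    run(args.input, args.output)
```

A real script would end by calling `main()`, and the tool definition file would contain the command line that invokes it.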
4
...
5
SAGE
- Sequence and count short tags, each representative of a transcript
- Gives the absolute abundance of each transcript
6
Existing pipeline for analyzing DeepSAGE data
GAPSS: General analysis pipeline for second generation sequencers
- Implemented in Galaxy
- Some final steps were missing:
  - Gene annotation (ENSEMBL/BioMart) and summarization
  - Statistical analysis of differential gene expression
7
Existing workflow
8
Gene annotation and summarization
- Tool for counting DeepSAGE tags in ENSEMBL-annotated exons
- Tool for automatically obtaining a BioMart format file
9
Obtain BioMart format file
10
Count DeepSAGE tags in annotated exons
Input files:
1) BioMart format file
2) SAM format file
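To make the SAM input concrete, here is a minimal sketch of reading alignment lines and counting reads per chromosome, strand and position. It relies only on the standard SAM columns (FLAG in column 2, RNAME in column 3, POS in column 4) and the FLAG bits 0x4 (unmapped) and 0x10 (reverse strand). The function name `count_sam_positions` is an assumption, not the name used in the original tool.

```python
from collections import Counter


def count_sam_positions(lines):
    """Return a Counter keyed by (chromosome, strand, position)."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):        # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # SAM has 11 mandatory columns
            continue
        flag = int(fields[1])
        if flag & 0x4:                  # read is unmapped
            continue
        strand = "-" if flag & 0x10 else "+"
        # POS is the 1-based leftmost mapped base of the read
        counts[(fields[2], strand, int(fields[3]))] += 1
    return counts
```

The same function can be fed an open file handle, for example `count_sam_positions(open(path))`, since it only iterates over lines.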
11
Count DeepSAGE tags in annotated exons
12
Output file:
13
Count DeepSAGE tags in annotated exons
1. For each line in the SAM file, the whole BioMart file is read. (~1 second/line)
2. The BioMart file is loaded into a dictionary, with data split by chromosome name and strand. (50 seconds for 10,000 lines)
3. The SAM file is loaded into a dictionary, with data split by chromosome name, strand and genomic position. (16 seconds for 10,000 lines)
4. Work with several SAM files.
5. Both files are loaded into dictionaries. (16 seconds for 10,000 lines; ~16 minutes for 7,768,787 lines)
6. The BioMart dictionary is sorted by exon coordinates; overlapping and repeated exons are a problem.
7. Binary search for each position from the SAM file in the sorted list of exon coordinates was implemented. (77 seconds for 7,768,787 lines)
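The last steps (a dictionary keyed by chromosome and strand, sorted exon coordinates, binary search) can be sketched as follows. This is one possible approach, not the authors' implementation: the tuple layout, the function names `build_index` and `lookup`, and the choice to handle overlapping exons by merging them into blocks are all assumptions. Merging means a position may be reported for every gene in its block, even where the exons only partly overlap.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(exons):
    """exons: iterable of (chrom, strand, start, end, gene_id), 1-based inclusive."""
    by_key = defaultdict(set)
    for chrom, strand, start, end, gene in exons:
        by_key[(chrom, strand)].add((start, end, gene))   # a set drops repeated exons
    index = {}
    for key, items in by_key.items():
        merged = []                                       # non-overlapping [start, end, genes]
        for start, end, gene in sorted(items):
            if merged and start <= merged[-1][1]:         # overlaps the previous block
                merged[-1][1] = max(merged[-1][1], end)
                merged[-1][2].add(gene)
            else:
                merged.append([start, end, {gene}])
        index[key] = ([m[0] for m in merged], merged)     # sorted starts + blocks
    return index


def lookup(index, chrom, strand, pos):
    """Return the set of gene ids whose merged exon block contains pos."""
    starts, merged = index.get((chrom, strand), ([], []))
    i = bisect_right(starts, pos) - 1                     # last block starting at or before pos
    if i >= 0 and pos <= merged[i][1]:
        return merged[i][2]
    return set()
```

Each lookup costs O(log n) in the number of exon blocks on that chromosome and strand, rather than a scan over the whole annotation, which is the kind of change that would explain the drop in running time reported in step 7.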
14
About R/Bioconductor
- R is a language and environment for statistical computing and graphics.
- Bioconductor provides tools for the analysis and comprehension of high-throughput genomic data.
- Bioconductor uses the R statistical programming language, and is open source and open development.
15
Statistical analysis of differential gene expression
- Tool for examining differential expression of replicated count data using the edgeR package of Bioconductor
- Tool for estimating the variance in count data and testing for differential expression using the DESeq package of Bioconductor
16
Analysis of differentially expressed genes (edgeR)
Input files:
1. DeepSAGE tags in annotated exons counter output file
2. Metadata file
Design matrix
Contrast vector 1 0
Generalized linear model
17
Analysis of differentially expressed genes (edgeR)
18
Output file:
19
Analysis of differentially expressed genes (DESeq)
Test for differences between the base means of two levels
Input files:
1. DeepSAGE tags in annotated exons counter output file
2. Metadata file
- Create a CountDataSet object
- Estimate the effective library size for a CountDataSet
- Estimate the variance functions for a CountDataSet
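The "effective library size" step can be illustrated outside R. DESeq estimates one size factor per sample as the median, over genes, of the ratio between that sample's count and the gene's geometric mean across samples. The pure-Python sketch below shows that calculation only; it is not the DESeq code. The function name `size_factors` and the list-of-rows input layout are assumptions.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: list of gene rows, each a list with one count per sample."""
    n_samples = len(counts[0])
    usable = []           # rows with no zero counts
    log_geo_means = []    # log of each usable row's geometric mean
    for row in counts:
        if all(c > 0 for c in row):      # a zero makes the geometric mean zero
            usable.append(row)
            log_geo_means.append(sum(math.log(c) for c in row) / n_samples)
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(row[j]) - lg
                      for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors
```

Dividing each sample's counts by its factor puts the samples on a common scale. If every gene has a zero in at least one sample, `median` raises an error, so real data would need that case handled.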
20
Analysis of differentially expressed genes (DESeq)
21
Output file:
22
Comparison of results obtained by edgeR and DESeq
23
Full workflow
24
Thank you for your attention
Any questions?