1
Transcriptome Analysis
Song Li
2
Introduction to Transcriptome Analysis
Goals of transcriptome analysis:
- Differentially expressed genes/small RNA
- Clustering/network analysis
- Identify novel transcripts (w/o reference genome)
Types of transcriptome analysis:
- RNA-seq
- Small RNA-seq
- Ribo-seq, PEAT-seq, PolyA-seq
- Normalized RNA-seq (for lincRNA)
3
Workflow for Transcriptome Analysis
RNA-seq reads
- De-multiplexing, quality assessment, adapter trimming: FASTX-Toolkit, Btrim, FastQC
- Map reads to reference genome/transcriptome: Bowtie, BWA, TopHat, STAR and more
- De novo assembly: Trinity, Trans-ABySS, SOAPdenovo
- Differential expression (gene, small RNA): HTSeq, edgeR, DESeq, baySeq and more
- Differential splicing: Cufflinks, DEXSeq, MISO, MATS
- Merge, compare and filter: BLAST-like tools
Always follow up with wet-bench validation!
4
Introduction Steps for transcriptome analysis
- Read mapping: bowtie/tophat (splicing variant detection: cufflinks)
- Read counting: HTSeq
- Differential expression analysis: edgeR/DESeq
5
Software Installation
For bowtie/tophat/cufflinks, download the binary for your platform:
- Bowtie2: bowtie linux-x86_64.zip
- Tophat2: tophat Linux_x86_64.tar.gz
- Cufflinks2: cufflinks Linux_x86_64.tar.gz
6
Data
# Data from the read mapping class (Fall 2014)
# Genome annotation file
ITAG2.4_gene_models.chr01.gff3
# Genomic sequence
SL2.50ch01.fa
# RNAseq reads, three replicates per sample
Sohab_LA1777_apmer_r1.c01.fq; Sohab_LA1777_apmer_r2.c01.fq; Sohab_LA1777_apmer_r3.c01.fq
Solyc_M82_apmer_r1.c01.fq; …
Sopen_LA0716_apmer_r1.c01.fq; …
7
Read Mapping (1) Build reference genome index
# buildgenome.sh
# define your working directory
WORKDIR=/home/li/Research/BioinforClass/Transcriptome/
# define bowtie home
BOWTIEHOME=$WORKDIR/softwares/bowtie
# change the working directory to where the index files will be located
cd $WORKDIR/data/bowtie2index
# build index
$BOWTIEHOME/bowtie2-build ../SL2.50ch01.fa SL2index >bt2index.log
8
Read Mapping (1) Expected output
# log file for the bowtie2-build program
bt2index.log
# 6 index files
SL2index.1.bt2
SL2index.2.bt2
SL2index.3.bt2
SL2index.4.bt2
SL2index.rev.1.bt2
SL2index.rev.2.bt2
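A quick way to confirm that the index build finished is to check for the six files listed above. The following is a minimal sketch; check_bt2_index is a hypothetical helper name introduced here, not part of bowtie2.

```shell
#!/usr/bin/env bash
# Sketch: verify that bowtie2-build left all six index files behind.
# check_bt2_index is a hypothetical helper, not a bowtie2 command.
check_bt2_index() {
    local prefix=$1 missing=0 suffix
    for suffix in 1.bt2 2.bt2 3.bt2 4.bt2 rev.1.bt2 rev.2.bt2; do
        # -s is true only if the file exists and is not empty
        if [ ! -s "$prefix.$suffix" ]; then
            echo "missing or empty: $prefix.$suffix" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (index prefix as used in buildgenome.sh):
# check_bt2_index "$WORKDIR/data/bowtie2index/SL2index" && echo "index OK"
```

The function returns a non-zero status if any file is absent, so it can guard the alignment step in a longer script.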
9
Read Mapping (2) Use Tophat for read alignment
Splicing-aware alignment
# tophatalign_Sohab1.sh
WORKDIR=/home/li/Research/BioinforClass/Transcriptome/
TOPHATHOME=$WORKDIR/softwares/tophat Linux_x86_64
BOWTIEINDEX=$WORKDIR/data/bowtie2index/SL2index
INPUTDATA=$WORKDIR/data/Sohab_LA1777_apmer_r1.c01.fq
OUTDIR=$WORKDIR/results/Sohab1
# create the output directory before changing into it
mkdir -p $OUTDIR
cd $OUTDIR
# align reads; the runtime log goes to tophat.log
$TOPHATHOME/tophat2 -o $OUTDIR $BOWTIEINDEX $INPUTDATA 2> tophat.log
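The script above aligns one replicate. A loop over all nine libraries might look like the sketch below. It assumes that tophat2 is on the PATH and that every file follows the naming pattern of the files listed on the Data slide (the elided replicate names are an assumption). tophat_commands is a hypothetical helper that only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch: generate one tophat2 command per sample and replicate.
# Assumptions: tophat2 is on PATH; replicate files r1..r3 exist for all
# three samples and follow the same naming pattern.
WORKDIR=${WORKDIR:-/home/li/Research/BioinforClass/Transcriptome}
BOWTIEINDEX=$WORKDIR/data/bowtie2index/SL2index

tophat_commands() {
    local sample accession rep
    for sample in Sohab_LA1777 Solyc_M82 Sopen_LA0716; do
        accession=${sample%%_*}    # Sohab, Solyc, Sopen
        for rep in 1 2 3; do
            echo "tophat2 -o $WORKDIR/results/$accession$rep" \
                 "$BOWTIEINDEX" \
                 "$WORKDIR/data/${sample}_apmer_r$rep.c01.fq"
        done
    done
}

# Print the commands for inspection; pipe the output to "bash" to run them.
tophat_commands
```

Printing the commands first makes it easy to check paths before committing to nine alignment runs.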
10
Read Mapping (2) Expected output
accepted_hits.bam # alignment file
unmapped.bam # unmapped reads
junctions.bed # splice junctions
deletions.bed # deletions
insertions.bed # insertions
prep_reads.info # number of input/output reads
align_summary.txt # alignment rate (% of reads aligned)
logs # other log files
tophat.log # runtime log file
11
Read Mapping (2) Useful parameters
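-N/--read-mismatches: discard alignments with more than this many mismatches
-r/--mate-inner-dist: expected inner distance between paired-end mates
-a/--min-anchor-length: minimum anchor length on each side of a junction
-i/--min-intron-length: minimum intron length
-I/--max-intron-length: maximum intron length
-g/--max-multihits: maximum number of alignments reported per read
--report-secondary-alignments: also report secondary alignments
-G/--GTF: use genome annotation (GTF/GFF3 file)
--transcriptome-index: reuse a transcriptome index built from the annotation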
12
Introduction Steps for transcriptome analysis
- Read mapping: bowtie/tophat (~20 lines of code)
- Read counting: HTSeq
- Differential expression analysis: edgeR/DESeq
13
Counting Reads Install htseq-count for counting reads
HTSeq-count (Python program)
Download from:
Installation: $ python setup.py install --user
Installed directory (add this to your PATH): ~/.local/bin
14
Counting Reads Modes of overlap (htseq-count --mode: union, intersection-strict, intersection-nonempty)
15
Counting Reads Script for read count
workdir=/home/li/Research/BioinforClass/Transcriptome
# annotation, name-sorted alignments and output file
gff=$workdir/data/ITAG2.4_gene_models.chr01.gff3
sam=$workdir/results/Sohab1/sorted.sam
output=$workdir/results/counts/Sohab1.count
# count reads overlapping exons, grouped by the Parent attribute
htseq-count --format=sam \
    --stranded=no \
    --order=name \
    --type=exon \
    --idattr=Parent \
    $sam $gff > $output
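The counting script reads sorted.sam, while TopHat writes accepted_hits.bam; the slides do not show the conversion. One possible way to produce a name-sorted SAM file is sketched below. It assumes samtools 1.x is on the PATH (older releases use a different sort syntax); run and make_sorted_sam are hypothetical helpers, not part of the original scripts.

```shell
#!/usr/bin/env bash
# Sketch: turn TopHat's accepted_hits.bam into a name-sorted SAM file.
# Assumes samtools 1.x; helper names are made up for this example.

# With DRY_RUN=1 the command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

make_sorted_sam() {
    local dir=$1
    # sort alignments by read name, as expected by --order=name
    run samtools sort -n -o "$dir/sorted.bam" "$dir/accepted_hits.bam" || return 1
    # convert to SAM, keeping the header (-h)
    run samtools view -h -o "$dir/sorted.sam" "$dir/sorted.bam"
}

# Usage:
# DRY_RUN=1; make_sorted_sam "$WORKDIR/results/Sohab1"   # preview only
```

htseq-count can also read BAM directly with --format=bam, so the SAM conversion is a choice rather than a requirement.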
16
Counting Reads Expected output: ./HTseqcount.sopen.sh
72727 GFF lines processed.
SAM alignment records processed.
SAM alignment records processed.
SAM alignments processed.
In the result folder:
Sohab1.count Sohab2.count Sohab3.count
Solyc1.count Solyc2.count Solyc3.count
Sopen1.count Sopen2.count Sopen3.count
17
Counting Reads Expected output:
mRNA:Solyc01g112300.2.1 13
__no_feature
__ambiguous
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique
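Each count file mixes per-feature counts with bookkeeping rows whose names start with two underscores. A small awk sketch can separate the two and report how many reads were assigned. summarize_counts is a hypothetical helper; it assumes the tab-separated two-column layout shown above.

```shell
#!/usr/bin/env bash
# Sketch: summarize one htseq-count output file.
# Assumes two tab-separated columns: feature id, read count.
summarize_counts() {
    awk -F'\t' '
        /^__/ { special += $2; next }    # bookkeeping rows (__no_feature, ...)
        { assigned += $2; if ($2 > 0) expressed++ }
        END {
            printf "assigned=%d special=%d genes_with_reads=%d\n",
                   assigned, special, expressed
        }
    ' "$1"
}

# Usage:
# summarize_counts "$workdir/results/counts/Sohab1.count"
```

A low assigned count next to a large special count is a hint to revisit the --type, --idattr or --stranded settings.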
18
Introduction Steps for transcriptome analysis
- Read mapping: bowtie/tophat (~20 lines of code)
- Read counting: HTSeq (~10 lines of code)
- Differential expression analysis: edgeR/DESeq
19
Combine Read Counts MergeReads.R
# in source code MergeReads.R
filenames=dir('results')
filenames=grep('csv',filenames,value=TRUE)
# in console
> filenames
[1] "Sohab1.csv" "Sohab2.csv" "Sohab3.csv" "Solyc1.csv" "Solyc2.csv" "Solyc3.csv"
[7] "Sopen1.csv" "Sopen2.csv" "Sopen3.csv"
20
Differential Expression
MergeReads.R
# in console: check the size of the inputs
> tail(tmpmat)
X1
Solyc01g
__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique 25744
> dim(tmpmat)
[1]
21
Differential Expression
MergeReads.R
# in source code MergeReads.R
# create data matrix: one row per gene, one column per count file
datmat<-matrix(0,ncol=9,nrow=4293)
colnames(datmat)<-filenames
rownames(datmat)<-rownames(tmpmat)[1:4293]
22
Differential Expression
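MergeReads.R
# in source code MergeReads.R
# use loop to read all the files
for(eachfn in filenames) {
  print(eachfn)
  # the count files live in the 'results' folder listed by dir()
  tmpmat<-read.table(file.path('results',eachfn),sep='\t',as.is=TRUE,col.names=1)
  # keep the first 4293 rows, dropping the trailing bookkeeping rows
  datmat[,eachfn]<-tmpmat[1:4293,1]
}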
23
Differential Expression
MergeReads.R
# in console: check the data matrix
> datmat[1:3,1:3]
Sohab1 Sohab2 Sohab3
Solyc01g
Solyc01g
Solyc01g
# in source: save the data matrix for future use
write.table(datmat,'mergedcounts.csv')
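The merge done in MergeReads.R can also be sketched at the shell level, which is handy for a quick look before starting R. The example below is an alternative under stated assumptions, not the author's method: merge_counts is a hypothetical helper, the input files are tab-separated htseq-count outputs, and the leading "gene" header cell is an addition of this sketch.

```shell
#!/usr/bin/env bash
# Sketch: merge several htseq-count files into one tab-separated matrix.
# Column names are the file names without directory and extension.
merge_counts() {
    awk -F'\t' '
        FNR == 1 {                        # first line of each input file
            ncol++
            n = split(FILENAME, parts, "/")
            name[ncol] = parts[n]
            sub(/\.[^.]*$/, "", name[ncol])
        }
        /^__/ { next }                    # skip bookkeeping rows
        !($1 in seen) { seen[$1] = 1; order[++ngene] = $1 }
        { count[$1, ncol] = $2 }
        END {
            printf "gene"
            for (j = 1; j <= ncol; j++) printf "\t%s", name[j]
            printf "\n"
            for (i = 1; i <= ngene; i++) {
                printf "%s", order[i]
                for (j = 1; j <= ncol; j++) printf "\t%d", count[order[i], j]
                printf "\n"
            }
        }
    ' "$@"
}

# Usage:
# merge_counts results/counts/*.count > mergedcounts.tsv
```

Genes missing from a file are written as 0, and rows keep the order in which genes are first seen.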
24
Introduction Steps for transcriptome analysis
- Read mapping: bowtie/tophat (~20 lines of code)
- Read counting: HTSeq (< 10 lines of code)
- Differential expression analysis: edgeR/DESeq
  ~15 lines of code for data preparation
25
Differential Expression
Install edgeR
source('
biocLite('edgeR')
26
Differential Expression
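CallDiffGene.R
# load edgeR
library(edgeR)
# load data
datmat<-read.table('mergedcounts.csv',as.is=T)
# set group: three replicates each for Sohab, Solyc, Sopen
group <- c(1,1,1,2,2,2,3,3,3)
# create an object for the count data
dge <- DGEList(counts=datmat,group=group)
# filter low exp genes: keep genes whose CPM summed over all samples exceeds 1
keep<-rowSums(cpm(dge))>1
dge<-dge[keep,]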
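The filter above rests on counts per million (CPM): each count divided by its library size and multiplied by 1e6. The arithmetic can be sketched in awk on a tab-separated count matrix with a header row. cpm_filter is a hypothetical helper, and library size is taken as the plain column sum, which matches cpm() only before normalization factors are applied.

```shell
#!/usr/bin/env bash
# Sketch: keep genes whose CPM, summed over all samples, exceeds 1.
# Input: tab-separated matrix, header row, gene id in column 1.
cpm_filter() {
    awk -F'\t' '
        NR == FNR {                       # pass 1: library size per column
            if (FNR > 1) for (j = 2; j <= NF; j++) lib[j] += $j
            next
        }
        FNR == 1 { print; next }          # pass 2: copy the header
        {
            total = 0
            for (j = 2; j <= NF; j++)
                if (lib[j] > 0) total += $j / lib[j] * 1e6
            if (total > 1) print
        }
    ' "$1" "$1"
}

# Usage:
# cpm_filter mergedcounts.tsv > filtered.tsv
```

The file is read twice so that library sizes are known before any row is tested.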
27
Differential Expression
CallDiffGene.R
# MDS plot
plotMDS(dge)
28
Differential Expression
CallDiffGene.R
# normalization
dge<-calcNormFactors(dge)
# make design matrix
groupf<-factor(group)
design<-model.matrix(~0+groupf)
# estimate dispersion
dge<-estimateGLMCommonDisp(dge,design)
dge<-estimateGLMTagwiseDisp(dge,design)
# fit the GLM for each gene
fit<-glmFit(dge,design)
29
Differential Expression
Normalization (TMM)
> dge$samples
group lib.size norm.factors
Sohab
Sohab
Sohab
Solyc
Solyc
Solyc
Sopen
Sopen
Sopen
30
Differential Expression
Design matrix
> design
  groupf1 groupf2 groupf3
1       1       0       0
2       1       0       0
3       1       0       0
4       0       1       0
5       0       1       0
6       0       1       0
7       0       0       1
8       0       0       1
9       0       0       1
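For reference, the role of this design matrix can be written out. This is a sketch of the standard edgeR log-linear model rather than something shown on the slides; the symbols are notation introduced here: \mu_{gi} is the expected count of gene g in sample i, N_i the effective library size, \mathbf{x}_i the i-th row of the design matrix and \boldsymbol{\beta}_g the gene's coefficients.

```latex
\log \mu_{gi} = \mathbf{x}_i^{\top} \boldsymbol{\beta}_g + \log N_i ,
\qquad
\mathbf{x}_i^{\top} \boldsymbol{\beta}_g = \beta_{g,k(i)}
\quad \text{for the design } {\sim}0 + \texttt{groupf}
```

Here k(i) is the group of sample i, so each coefficient is one group's log abundance, and a contrast such as c(1,-1,0) tests H_0: \beta_{g1} - \beta_{g2} = 0.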
31
Differential Expression
Biological Coefficient of Variation
dge<-estimateGLMCommonDisp(dge,design)
dge<-estimateGLMTagwiseDisp(dge,design)
33
Differential Expression
CallDiffGene.R
# likelihood ratio test: group 1 (Sohab) vs group 2 (Solyc)
lrt.habvslyc<-glmLRT(fit,contrast = c(1,-1,0))
# select the top 1000 genes
tmpdiff<-topTags(lrt.habvslyc,n=1000)
# filter by FDR (column 5 of the table)
diff.habvslyc<-tmpdiff$table[tmpdiff$table[,5]<0.05,]
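The FDR column used for filtering comes from a multiple-testing adjustment of the p-values; topTags uses the Benjamini-Hochberg method by default. The arithmetic can be sketched outside R as follows. bh_adjust is a hypothetical helper, the two-column input layout is an assumption, and sort -g (general numeric sort) is assumed to be available, as it is in GNU and BSD sort.

```shell
#!/usr/bin/env bash
# Sketch: Benjamini-Hochberg adjustment of p-values.
# Input: lines of "id<TAB>pvalue"; output: "id<TAB>pvalue<TAB>fdr",
# sorted by increasing p-value.
bh_adjust() {
    sort -t "$(printf '\t')" -k2,2g "$1" | awk -F'\t' '
        { id[NR] = $1; p[NR] = $2 }
        END {
            n = NR; running = 1
            # walk from the largest p-value down, keeping the running minimum
            for (i = n; i >= 1; i--) {
                q = p[i] * n / i
                if (q < running) running = q
                fdr[i] = running
            }
            for (i = 1; i <= n; i++)
                printf "%s\t%s\t%.4g\n", id[i], p[i], fdr[i]
        }'
}

# Usage: keep rows with FDR below 0.05
# bh_adjust pvalues.tsv | awk -F'\t' '$3 < 0.05'
```

The running minimum enforces that adjusted values never decrease as p-values grow, which is what makes a single FDR cutoff meaningful.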
34
Differential Expression
CallDiffGene.R
# check number of differentially expressed genes
> dim(diff.habvslyc)
[1]
> dim(diff.habvspen)
[1]
> dim(diff.lycvspen)
[1]
35
Differential Expression
CallDiffGene.R
> diff.habvslyc[1:2,]
logFC logCPM
Solyc01g
Solyc01g
LR PValue
Solyc01g e-25
Solyc01g e-19
FDR
Solyc01g e-22
Solyc01g e-16
36
Summary Steps for transcriptome analysis
- Read mapping: bowtie/tophat (~20 lines of code)
- Read counting: HTSeq (< 10 lines of code)
- Differential expression analysis: edgeR/DESeq
  ~15 lines of code for data preparation
  ~20-30 lines of code for differential expression