Transcriptome Analysis

Name: Transcriptome Analysis
Uploaded: 2017-07-18T10:43:47+00:00
Duration: PTM20S56
Channel: Austin Jonas Powell
Description: Transcriptome Analysis

Transcriptome Analysis
Song Li

Introduction to Transcriptome Analysis
Goal of transcriptome analysis
  Differentially expressed genes/small RNA
  Clustering/network analysis
  Identify novel transcripts (w/o reference genome)
Types of transcriptome analysis
  RNA-seq
  Small RNA-seq
  Ribo-seq, PEAT-seq, PolyA-seq
  Normalized RNA-seq (for lincRNA)

Workflow for Transcriptome Analysis
RNA-seq reads
De-multiplexing, quality assessment, adapter trimming: FASTX-toolkits, Btrim, FASTQC
Map reads to reference genome/transcriptome: Bowtie, BWA, tophat, STAR and more
De novo assembly: Trinity, Trans-Abyss, Soap-denovo
Differential expression (gene, small RNA): HTseq, edgeR, DESeq, BayesSeq and more
Differential splicing: Cufflinks, DEXSeq, MISO, MATS
Merge, compare and filtering: BLAST like tools
Always follow up with wet-bench validation!

Introduction Steps for transcriptome analysis
Read mapping: bowtie/tophat (Splicing variant detection: cufflinks)
Read counting: HTseq
Differential expression analysis: edgeR/DESeq

Software Installation
For bowtie/tophat/cufflinks, look for the binary for your platform:
Bowtie2: bowtie linux-x86_64.zip
Tophat2: tophat Linux_x86_64.tar.gz
Cufflinks2: cufflinks Linux_x86_64.tar.gz

Data
# Data from the read mapping class (Fall 2014)
# Genome annotation file
ITAG2.4_gene_models.chr01.gff3
# Genomic sequence
SL2.50ch01.fa
# RNAseq reads, three replicates per sample
Sohab_LA1777_apmer_r1.c01.fq; Sohab_LA1777_apmer_r2.c01.fq
Sohab_LA1777_apmer_r3.c01.fq
Solyc_M82_apmer_r1.c01.fq; …
Sopen_LA0716_apmer_r1.c01.fq; …

Read Mapping (1) Build reference genome index
# buildgenome.sh
# define your working directory
WORKDIR=/home/li/Research/BioinforClass/Transcriptome/
# define bowtie home
BOWTIEHOME=$WORKDIR/softwares/bowtie
# change the working directory to where the index files will be located
cd $WORKDIR/data/bowtie2index
# build index
$BOWTIEHOME/bowtie2-build ../SL2.50ch01.fa SL2index >bt2index.log

Read Mapping (1) Expected output
# log file for the bowtie2-build program
bt2index.log
# 6 index files
SL2index.1.bt2
SL2index.2.bt2
SL2index.3.bt2
SL2index.4.bt2
SL2index.rev.1.bt2
SL2index.rev.2.bt2

Read Mapping (2) Use Tophat for read alignment
Splicing aware alignment
# tophatalign_Sohab1.sh
WORKDIR=/home/li/Research/BioinforClass/Transcriptome/
TOPHATHOME=$WORKDIR/softwares/tophat Linux_x86_64
BOWTIEINDEX=$WORKDIR/data/bowtie2index/SL2index
INPUTDATA=$WORKDIR/data/Sohab_LA1777_apmer_r1.c01.fq
OUTDIR=$WORKDIR/results/Sohab1
cd $OUTDIR
$TOPHATHOME/tophat2 -o $OUTDIR $BOWTIEINDEX $INPUTDATA 2> tophat.log
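The script above aligns a single replicate (Sohab1). A minimal sketch of the same call looped over all nine libraries follows. It assumes that the r1 to r3 naming shown for Sohab also holds for the Solyc and Sopen files (the data slide elides those names), that tophat2 is on the PATH unless TOPHAT is set, and that output folders are named Sohab1 ... Sopen3 as in the later count files. The function names list_samples and align_all are hypothetical.

```shell
# Assumed defaults; override them in the environment if your layout differs.
WORKDIR=${WORKDIR:-/home/li/Research/BioinforClass/Transcriptome}
TOPHAT=${TOPHAT:-tophat2}
BOWTIEINDEX=$WORKDIR/data/bowtie2index/SL2index

# Print one line per library: short name (output folder) and FASTQ file name.
list_samples() {
    for species in Sohab_LA1777 Solyc_M82 Sopen_LA0716; do
        for rep in 1 2 3; do
            # ${species%%_*} keeps the part before the first underscore, e.g. Sohab
            echo "${species%%_*}${rep} ${species}_apmer_r${rep}.c01.fq"
        done
    done
}

# Run tophat2 once per library, writing each run into its own results folder.
align_all() {
    list_samples | while read -r name fq; do
        outdir="$WORKDIR/results/$name"
        mkdir -p "$outdir"
        "$TOPHAT" -o "$outdir" "$BOWTIEINDEX" "$WORKDIR/data/$fq" \
            2> "$outdir/tophat.log" < /dev/null
    done
}
```

Calling list_samples first is a cheap way to check the file names before align_all starts the alignments.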

Read Mapping (2) Expected output
accepted_hits.bam   # alignment file
unmapped.bam        # unmapped reads
junctions.bed       # splice junctions
deletions.bed       # deletions
insertions.bed      # insertions
prep_reads.info     # number of input/output reads
align_summary.txt   # alignment rate (% of reads aligned)
logs                # other log files
tophat.log          # runtime log file

Read Mapping (2) Useful parameters
-N/--read-mismatches          discard alignments with more mismatches than this number
-r/--mate-inner-dist          expected inner distance between paired-end mates
-a/--min-anchor-length        minimum anchor length on each side of a junction
-i/--min-intron-length        minimum intron length
-I/--max-intron-length        maximum intron length
-g/--max-multihits            maximum number of alignments allowed per read
--report-secondary-alignments
-G/--GTF                      use genome annotation
--transcriptome-index

Read mapping: bowtie/tophat (~20 lines of code)
Read counting: HTseq
Differential expression analysis: edgeR/DESeq

Counting Reads Install htseq-count for counting reads
HTseq-count (Python program)
Download from:
Installation: $ python setup.py install --user
Installed directory (add this to your PATH): ~/.local/bin

Counting Reads Mode of overlapping

Counting Reads Script for read count
workdir=/home/li/Research/BioinforClass/Transcriptome
gff=$workdir/data/ITAG2.4_gene_models.chr01.gff3
sam=$workdir/results/Sohab1/sorted.sam
output=$workdir/results/counts/Sohab1.count
htseq-count --format=sam \
    --stranded=no \
    --order=name \
    --type=exon \
    --idattr=Parent \
    $sam $gff > $output
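The counting script reads results/Sohab1/sorted.sam, while tophat writes accepted_hits.bam, so a conversion step sits between the two. A minimal sketch of that step, assuming samtools 1.3 or later is installed (older 0.1.x releases use a different sort syntax) and using a hypothetical function name bam_to_sorted_sam: sort by read name to match --order=name, then write SAM text with its header.

```shell
# Assumed default; set SAMTOOLS to a full path if samtools is not on the PATH.
SAMTOOLS=${SAMTOOLS:-samtools}

# $1 = a tophat output folder containing accepted_hits.bam
bam_to_sorted_sam() {
    # -n sorts by read name, which is what htseq-count --order=name expects
    "$SAMTOOLS" sort -n -o "$1/sorted.bam" "$1/accepted_hits.bam" &&
    # -h keeps the header lines in the SAM output
    "$SAMTOOLS" view -h -o "$1/sorted.sam" "$1/sorted.bam"
}
```

For the first library this would be called as bam_to_sorted_sam $workdir/results/Sohab1.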

Counting Reads Expected output:
./HTseqcount.sopen.sh
72727 GFF lines processed.
SAM alignment records processed.
SAM alignment records processed.
SAM alignments processed.
In the result folder:
Sohab1.count Sohab2.count Sohab3.count
Solyc1.count Solyc2.count Solyc3.count
Sopen1.count Sopen2.count Sopen3.count

Counting Reads Expected output:
mRNA:Solyc01g112300.2.1 13
__no_feature
__ambiguous
__too_low_aQual 0
__not_aligned 0
__alignment_not_unique
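Each .count file lists one feature per line, followed by htseq-count's special counters, whose names start with two underscores. A small sketch for checking a file before merging: it totals the reads assigned to features and the reads that fell into the special counters. The function name count_summary is hypothetical, and the file is assumed to be tab-delimited as htseq-count writes it.

```shell
# $1 = one htseq-count output file (feature ID, tab, count)
count_summary() {
    awk -F '\t' '
        /^__/ { special += $2; next }      # __no_feature, __ambiguous, ...
        { genes++; assigned += $2 }        # ordinary feature lines
        END {
            printf "features=%d assigned=%d unassigned=%d\n", genes, assigned, special
        }
    ' "$1"
}
```

A large unassigned total relative to the assigned total is a hint to revisit the --type and --idattr settings or the alignment itself.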

Read mapping: bowtie/tophat (~20 lines of code)
Read counting: Htseq (~10 lines of code)
Differential expression analysis: edgeR/DESeq

Combine Read Counts MergeReads.R
# in source code MergeReads.R
filenames=dir('results')
filenames=grep('csv',filenames,value=TRUE)
# in console
> filenames
[1] "Sohab1.csv" "Sohab2.csv" "Sohab3.csv" "Solyc1.csv" "Solyc2.csv" "Solyc3.csv"
[7] "Sopen1.csv" "Sopen2.csv" "Sopen3.csv"

Differential Expression
MergeReads.R
# in console: check the size of the inputs
> tail(tmpmat)
X1
Solyc01g
__no_feature
__ambiguous
__too_low_aQual
__not_aligned
__alignment_not_unique 25744
> dim(tmpmat)
[1]

MergeReads.R
# in source code MergeReads.R
# create data matrix
datmat<-matrix(0, ncol=9,nrow=4293)
colnames(datmat)<-filenames
rownames(datmat)<-rownames(tmpmat)[1:4293]

MergeReads.R
# in source code MergeReads.R
# use loop to read all the files
for(eachfn in filenames) {
  print(eachfn)
  tmpmat<-read.table(eachfn,sep='\t', as.is=T,col.names=1)
  datmat[,eachfn]<-tmpmat[1:4293,1]
}

MergeReads.R
# in console: check the data matrix
> datmat[1:3,1:3]
Sohab1 Sohab2 Sohab3
Solyc01g
Solyc01g
Solyc01g
# in source: save the data matrix for future use
write.table(datmat,'mergedcounts.csv')
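The merge in MergeReads.R relies on a hard-coded row count (4293). As an illustration only, and not the author's method, the same table can be assembled in the shell with awk, which takes the feature list from the files themselves and drops the special counters. The function name merge_counts is hypothetical; the input files are assumed to be tab-delimited htseq-count outputs, and the result is a tab-delimited matrix with a header row rather than the exact layout write.table produces.

```shell
# Usage: merge_counts file1 file2 ... > merged.tsv
merge_counts() {
    awk -F '\t' -v OFS='\t' '
        FNR == 1 { nfiles++; header = header OFS FILENAME }   # new input file
        /^__/    { next }                                     # skip special counters
        {
            if (!($1 in seen)) { seen[$1] = 1; order[++n] = $1 }
            count[$1 SUBSEP nfiles] = $2
        }
        END {
            print "gene" header
            for (i = 1; i <= n; i++) {
                line = order[i]
                for (j = 1; j <= nfiles; j++) {
                    key = order[i] SUBSEP j
                    # a feature missing from one file is reported as 0
                    line = line OFS ((key in count) ? count[key] : 0)
                }
                print line
            }
        }
    ' "$@"
}
```

Rows appear in the order features are first seen, and columns follow the order of the file arguments.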

Read mapping: bowtie/tophat (~20 lines of code)
Read counting: Htseq (< 10 lines of code)
Differential expression analysis: edgeR/DESeq
  ~15 lines of code for data preparation

Install edgeR
source('
biocLite('edgeR')

CallDiffGene.R
# load the edgeR package
library(edgeR)
# load data
datmat<-read.table('mergedcounts.csv',as.is=T)
# set group
group <- c(1,1,1,2,2,2,3,3,3)
# create an object for the count data
dge <- DGEList(counts=datmat,group=group)
# filter low exp genes
keep<-rowSums(cpm(dge))>1
dge<-dge[keep,]
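The filter keep<-rowSums(cpm(dge))>1 retains a gene when its counts per million, summed over the nine libraries, exceed 1. As an illustration of that arithmetic outside R, the sketch below applies the same rule to a tab-delimited count matrix with a header row and gene IDs in the first column (a hypothetical layout; the function name cpm_filter is also hypothetical). It assumes every library has a non-zero total and, as at this point in the script, normalization factors equal to 1, so CPM is simply 1e6 * count / library size.

```shell
# $1 = tab-delimited count matrix: header row, gene ID in column 1
cpm_filter() {
    # The file is read twice: pass 1 sums each column (library size),
    # pass 2 converts counts to CPM and keeps rows whose CPM sum exceeds 1.
    awk -F '\t' '
        FNR == 1  { if (NR != FNR) print; next }              # header: print once, in pass 2
        NR == FNR { for (j = 2; j <= NF; j++) lib[j] += $j; next }
        {
            s = 0
            for (j = 2; j <= NF; j++) s += 1e6 * $j / lib[j]
            if (s > 1) print
        }
    ' "$1" "$1"
}
```

With library sizes in the millions, a gene with a single read in each of two libraries and none elsewhere falls below the threshold, which is the kind of row the filter is meant to drop.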

CallDiffGene.R
# MDS plot
plotMDS(dge)

CallDiffGene.R
# normalization
dge<-calcNormFactors(dge)
# make design matrix
groupf<-factor(group)
design<-model.matrix(~0+groupf)
# estimate dispersion
dge<-estimateGLMCommonDisp(dge,design)
dge<-estimateGLMTagwiseDisp(dge,design)
fit<-glmFit(dge,design)

Normalization (TMM)
> dge$samples
group lib.size norm.factors
Sohab
Sohab
Sohab
Solyc
Solyc
Solyc
Sopen
Sopen
Sopen

Design matrix
> design
groupf1 groupf2 groupf3

Biological Coefficient of Variation
dge<-estimateGLMCommonDisp(dge,design)
dge<-estimateGLMTagwiseDisp(dge,design)


CallDiffGene.R
# likelihood ratio test (LRT)
lrt.habvslyc<-glmLRT(fit,contrast = c(1,-1,0))
# select subset of genes
tmpdiff<-topTags(lrt.habvslyc,n=1000)
# filter by FDR
diff.habvslyc<-tmpdiff$table[tmpdiff$table[,5]<0.05,]
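With the design matrix built as ~0+groupf, the fit has one coefficient per group, and the contrast vector picks the pair to compare. c(1,-1,0) tests group 1 against group 2, which is hab vs. lyc given the group order in the script. The other two comparisons named on the next slide (diff.habvspen, diff.lycvspen) are not shown in the source; the vectors below are the natural counterparts and should be read as an assumption.

```latex
\begin{aligned}
\text{hab vs lyc:}\quad & c = (1,\,-1,\,0), & H_0 &: \beta_{1} - \beta_{2} = 0 \\
\text{hab vs pen:}\quad & c = (1,\,0,\,-1), & H_0 &: \beta_{1} - \beta_{3} = 0 \\
\text{lyc vs pen:}\quad & c = (0,\,1,\,-1), & H_0 &: \beta_{2} - \beta_{3} = 0
\end{aligned}
```

Column 5 of the topTags table is FDR, so the last line of the script keeps genes with FDR below 0.05. Because topTags is called with n=1000, at most 1000 genes can pass the filter.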

CallDiffGene.R
# check number of differentially expressed genes
> dim(diff.habvslyc)
[1]
> dim(diff.habvspen)
[1]
> dim(diff.lycvspen)
[1]

CallDiffGene.R
> diff.habvslyc[1:2,]
logFC logCPM
Solyc01g
Solyc01g
LR PValue
Solyc01g e-25
Solyc01g e-19
FDR
Solyc01g e-22
Solyc01g e-16

Summary Steps for transcriptome analysis
Read mapping: bowtie/tophat (~20 lines of code)
Read counting: Htseq (< 10 lines of code)
Differential expression analysis: edgeR/DESeq
  ~15 lines of code for data preparation
  ~20-30 lines of code for differential expression
