Published by Sabina Moore. Modified over 6 years ago.
Stubbs Lab Bioinformatics – 5
Review tophat, alignment summary and htseq-count
Exercises: MDS plots and differential expression
We want to be able to communicate A) what the software did, and B) what the output means.
Dec 06, 2016
Joe Troy
Agenda
Review key points of the alignment process
Review the alignment summary report
Review the count files from htseq-count
Explain why the choice of annotation file (.gtf or .gff file) is important
MDS plot exercise
Differential Gene Expression exercise (edgeR)
Review Alignment Process
Notes on tophat2 alignments, per the tophat2 command we use in our scripts:
  tophat2 -p 12 --library-type fr-firststrand -G file.gtf -o output-folder genome file.fastq
By default, final read alignments having more than TWO mismatches are discarded. This can be changed with the -N/--read-mismatches option.
By default, any given read will have at most 20 different alignments reported in the tophat2 output (accepted_hits.bam). This can be changed with the -g/--max-multihits option.
“TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments.”
IMPORTANT NOTE: The htseq-count script DOES NOT COUNT reads that map to multiple locations in the genome.
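The two defaults above can be pictured with a toy filter. This is only a sketch of the behavior described on the slide, not TopHat's actual implementation; the function names, the tuple layout (location, mismatches, score), and the example reads are all hypothetical.

```python
import random


def report_alignments(candidates, max_mismatches=2, max_multihits=20, seed=0):
    """Toy filter. candidates: list of (location, mismatches, score) tuples."""
    # Mimics -N/--read-mismatches: drop alignments with too many mismatches.
    kept = [c for c in candidates if c[1] <= max_mismatches]
    if len(kept) <= max_multihits:
        return kept
    # Mimics -g/--max-multihits: keep the best-scoring alignments. Shuffling
    # first and then using a stable sort breaks score ties at random.
    rng = random.Random(seed)
    rng.shuffle(kept)
    kept.sort(key=lambda c: c[2], reverse=True)
    return kept[:max_multihits]


def is_uniquely_aligned(reported):
    """Per the slide, only reads with a single reported alignment can be counted."""
    return len(reported) == 1


# Hypothetical reads: one hits 25 repeat copies, one has a single clean hit.
repeat_read = [(f"chr1:{1000 * i}", 0, 50) for i in range(25)]
noisy_read = [("chr2:500", 3, 40), ("chr2:900", 1, 45)]

print(len(report_alignments(repeat_read)))
print(report_alignments(noisy_read))
```

The repeat read keeps 20 of its 25 alignments and would therefore be skipped by htseq-count, while the second read keeps only its 1-mismatch alignment and remains eligible for counting.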
Review Alignment Summary Report
input: the number of reads in the sample’s .fastq file
mapped: the number of reads mapped to the genome (note that TWO mismatches are allowed by default)
multiple_alignments: reads mapped to more than one location
greater_than_20: reads mapping to more than 20 locations (note that only 20 alignments are kept in the accepted_hits.bam file)
overall: proportion of mapped reads to input reads
uniquely mapped reads = mapped - multiple_alignments
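The arithmetic behind the report fields can be sketched as follows. The function name and the read counts are made up for illustration; only the two formulas (overall = mapped / input, uniquely mapped = mapped - multiple_alignments) come from the slide.

```python
def summarize_alignment(input_reads, mapped, multiple_alignments):
    """Derive the ratios discussed above from raw read counts."""
    if input_reads <= 0:
        raise ValueError("input_reads must be positive")
    uniquely_mapped = mapped - multiple_alignments
    return {
        "overall": mapped / input_reads,
        "uniquely_mapped": uniquely_mapped,
        # Extra convenience value, not a column in the report itself.
        "unique_fraction": uniquely_mapped / input_reads,
    }


# Hypothetical sample: 1,000,000 input reads, 900,000 mapped, 50,000 multi-mapped.
stats = summarize_alignment(input_reads=1_000_000, mapped=900_000,
                            multiple_alignments=50_000)
print(stats)
```

Since htseq-count skips multi-mapped reads, the uniquely mapped number is the upper bound on how many reads can end up in the gene counts.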
Review htseq-count count files
Notes on count files created by htseq-count:
  htseq-count -f bam -a 0 -t exon -i gene_id -s reverse -m union accepted_hits.bam file.gtf
-f bam : the input file is a .bam file
-a 0 : minimum alignment quality required for a read to be counted (with 0, alignment quality is not checked)
-t exon : count a read if it aligns to an exon as specified in the .gtf file
-i gene_id : the .gtf attribute used as the feature ID, so counts are reported per gene
-s reverse : for stranded=reverse and single-end reads, the read has to be mapped to the opposite strand as the feature
-m union : a read that overlaps an exon (totally or partially) of one gene (and only one gene) is counted (see next slide)
more info at:
Stress the difference in how .gtf files are used by tophat vs htseq-count.
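The -s reverse and -m union rules can be illustrated with a small sketch. It is a simplification under stated assumptions, not htseq-count's code: each read is treated as one contiguous interval (spliced reads are ignored), coordinates are half-open, and the exons and gene names are hypothetical.

```python
def assign_read(read_start, read_end, read_strand, exons, stranded="reverse"):
    """Return a gene_id, '__no_feature', or '__ambiguous' for one alignment.

    exons: iterable of (start, end, strand, gene_id), half-open [start, end).
    """
    genes = set()
    for start, end, strand, gene_id in exons:
        overlaps = read_start < end and start < read_end  # partial overlap counts
        if stranded == "reverse":
            strand_ok = read_strand != strand  # read must be on the opposite strand
        elif stranded == "yes":
            strand_ok = read_strand == strand
        else:
            strand_ok = True
        if overlaps and strand_ok:
            genes.add(gene_id)
    if not genes:
        return "__no_feature"
    if len(genes) > 1:
        return "__ambiguous"
    return next(iter(genes))


# Hypothetical annotation: geneA has two exons, geneB overlaps the second one.
exons = [
    (100, 200, "+", "geneA"),
    (300, 400, "+", "geneA"),
    (350, 450, "+", "geneB"),
]

print(assign_read(190, 220, "-", exons))  # partial overlap with one gene
print(assign_read(360, 390, "-", exons))  # overlaps geneA and geneB
print(assign_read(500, 530, "-", exons))  # overlaps nothing
print(assign_read(150, 180, "+", exons))  # same strand as the feature
```

With these inputs the four reads come out as geneA, __ambiguous, __no_feature and __no_feature; the last case shows why the wrong -s setting can silently send most reads to __no_feature.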
htseq-count output
[Sample .count file: one row per gene with its count, for example rows ending in C13Rik, C21Rik, L01Rik, P08Rik, mir694 (0), rp9 (0), rpf2 (0), followed by the special counters __no_feature, __ambiguous (3169), __too_low_aQual (0), __not_aligned (0) and __alignment_not_unique. Most gene IDs and counts were lost in extraction.]
__no_feature: read alignments not hitting an exon in any gene. Not counted.
__ambiguous: read alignments hitting more than one gene. Not counted.
__alignment_not_unique: read alignments mapped to multiple locations. Not counted. 1 read is counted multiple times in this figure.
htseq-count – special counters documentation http://www-huber. embl
__no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above was empty).
__ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence were not counted for any of these (set S had more than one element).
__too_low_aQual: reads (or read pairs) which were skipped due to the -a option, see below.
__not_aligned: reads (or read pairs) in the SAM file without alignment.
__alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times, unless they get filtered out due to the -a option.)
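When reviewing a .count file it helps to separate the gene rows from the special counters, which all start with a double underscore. A minimal sketch, assuming the usual two-column tab-separated layout; the function name and the sample values are hypothetical.

```python
def split_counts(lines):
    """Split htseq-count output lines into (gene_counts, special_counters)."""
    genes, special = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        name, count = line.split("\t")
        # Special counters are the rows whose name starts with "__".
        target = special if name.startswith("__") else genes
        target[name] = int(count)
    return genes, special


# Hypothetical file content (values invented for illustration).
sample = [
    "geneA\t120\n",
    "geneB\t0\n",
    "__no_feature\t5000\n",
    "__ambiguous\t300\n",
    "__too_low_aQual\t0\n",
    "__not_aligned\t0\n",
    "__alignment_not_unique\t800\n",
]

genes, special = split_counts(sample)
print(sum(genes.values()), "reads assigned to genes")
print(special)
```

Keeping the special counters out of the gene table matters downstream, because they should not be treated as genes when counts are normalized or tested.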
Retrieve and un-compress short read files
  INPUT: .tgz file(s) from ftp.biotec.illinois.edu
  OUTPUT: .fastq short read files
  Tools: sftp command, tar command
Align Reads to genome
  INPUT: .fastq short read files
  OUTPUT: “accepted_hits.bam” for each sample
  OUTPUT: “align_summary.txt”
  Tools: Tophat 2 script
Alignment Summary Report to review alignment stats
  INPUT: align_summary.txt files from tophat2
  OUTPUT: alignment_summary.txt (can be opened in Excel)
  Tools: main_script_alignment_summary_16Gso.sh & create_alignment_summary.R
Create Count Files (un-normalized counts by gene)
  INPUT: “accepted_hits.bam” file from each sample
  OUTPUT: .count files
  Tools: main_script_htseq_count_16Gso.sh (note: this script uses htseq-count)
Create MDS plots
  INPUT: .count files from htseq-count
  INPUT: TARGET file
  OUTPUT: MDS_plot_in_color.pdf
  Tools: main_script_MDS_plot_16Gso.sh
edgeR Differential Gene Expression
  INPUT: .count files from htseq-count
  INPUT: TARGET file
  OUTPUT: Differential Gene Expression files
  Tools: main_script_de_w_edgeR_16Gso.sh
What is a TARGET file?
The TARGET file provides data for each sample to the MDS plot and edgeR DE scripts.
The MDS plot script uses the File, Report_Group2 & MDS_plot columns.
The edgeR script uses the File, Label & Group columns.
/home/share/example_rna_seq_project_16Gso/project_input_data/TARGET.txt
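A quick check that a TARGET file has the columns each script needs might look like the sketch below. The column names come from the slide; the tab-delimited layout, the helper names, and every value in the sample rows are assumptions for illustration.

```python
import csv
import io

# Columns each script reads, as listed on the slide.
REQUIRED_COLUMNS = {
    "MDS plot": ["File", "Report_Group2", "MDS_plot"],
    "edgeR": ["File", "Label", "Group"],
}


def read_target(text):
    """Parse TARGET file text (assumed tab-delimited) into (header, rows)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    rows = list(reader)
    return reader.fieldnames, rows


def missing_columns(header, script):
    """List the columns a script needs that the header does not contain."""
    return [col for col in REQUIRED_COLUMNS[script] if col not in header]


# Hypothetical TARGET content; real files will differ.
sample_target = (
    "File\tLabel\tGroup\tReport_Group2\tMDS_plot\n"
    "sample_01.count\tctrl_1\tcontrol\tcontrol\tred\n"
    "sample_02.count\ttreat_1\ttreated\ttreated\tblue\n"
)

header, rows = read_target(sample_target)
for script in REQUIRED_COLUMNS:
    print(script, "missing:", missing_columns(header, script))
```

Running a check like this before launching the scripts catches a misspelled column name early, instead of midway through an R run.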
Instructions to create an MDS plot
INSTRUCTION SLIDE 1 of 2. No need to use ‘screen’, as the script runs quickly!
  Josephs-MacBook-Pro:~ josephtroy$ ssh
  password:
  Last login: Mon Nov 21 20:15: from c hsd1.il.comcast.net
  ~]$ df -h
  Filesystem Size Used Avail Use% Mounted on
  /dev/sda T 4.2T 150G 97% /
  /dev/sda G 14G 77G 16% /var
  /dev/sdb M 29M 246M 11% /boot
  tmpfs G G 0% /dev/shm
  /dev/sdb G 116G 145G 45% /var/lib/mysql
  ~]$ cd /home/share/example_rna_seq_project_16Gso/
  example_rna_seq_project_16Gso]$ cd code_030_MDS_plots/
  code_030_MDS_plots]$ sh main_script_MDS_plot_16Gso.sh
  Loading required package: methods
  [1] "end reading arguments"
  The output folder is: /home/share/example_rna_seq_project_16Gso/output_030_MDS_plots_RUN_ _174635
  end script main_script_MDS_plot_16Gso.sh at Mon Dec 5 17:46:43 CST 2016
  code_030_MDS_plots]$
Instructions to view the MDS plot
INSTRUCTION SLIDE 2 of 2
In Cyberduck, find the “MDS_plot_in_color.pdf” file in the output folder and move it (drag and drop) to a local folder on your PC or laptop.
Open the “MDS_plot_in_color.pdf” file with Adobe Reader (or any other .pdf reader) on your PC or laptop.
-OR-
In Cyberduck, select the “MDS_plot_in_color.pdf” file and hit the space bar to view it.
Instructions to run edgeR Differential Gene Expression
INSTRUCTION SLIDE 1 of 2. No need to use ‘screen’, as the script runs quickly!
  Josephs-MacBook-Pro:~ josephtroy$ ssh
  password:
  Last login: Tue Dec 6 10:12: from nat-relay.igb.illinois.edu
  ~]$ cd /home/share/example_rna_seq_project_16Gso/
  example_rna_seq_project_16Gso]$ cd code_060_differential_expression_w_edgeR/
  code_060_differential_expression_w_edgeR]$ sh main_script_de_w_edgeR_16Gso.sh
  Loading required package: limma
  Loading required package: methods
  Loading required package: edgeR
  Running estimateCommonDisp() on DGEList object before proceeding with estimateTrendedDisp().
  Running estimateCommonDisp() on DGEList object before proceeding with estimateTagwiseDisp().
  The output folder is: /home/share/example_rna_seq_project_16Gso/output_060_differential_expression_w_edgeR_RUN_ _105505
  end script main_script_de_w_edgeR_16Gso.sh at Tue Dec 6 10:55:18 CST 2016
  code_060_differential_expression_w_edgeR]$
Instructions to run edgeR Differential Gene Expression
INSTRUCTION SLIDE 2 of 2
In Cyberduck, find the “… pairwise_DGE.txt” files in the output folder and move them (drag and drop) to a local folder on your PC or laptop.
These are tab-delimited text files and can be opened with Excel.
The header information at the top of each file explains the meaning of the columns.