Stubbs Lab Bioinformatics – 5 Review tophat, alignment summary and htseq-count exercises: MDS plots and Differential expression

Stubbs Lab Bioinformatics – 5
Review: tophat, alignment summary and htseq-count
Exercises: MDS plots and Differential expression
We want to be able to communicate A) what did the software do, and B) what does the output mean
Dec 06, 2016
Joe Troy

Agenda
- Review key points of the alignment process
- Review the alignment summary report
- Review the count files from htseq-count
- Explain why the choice of annotation file (.gtf or .gff file) is important
- MDS plot exercise
- Differential Gene Expression exercise (edgeR)

Review Alignment Process

Notes on tophat2 alignments, per the tophat2 command we use in our scripts:

    [ tophat2 -p 12 --library-type fr-firststrand -G file.gtf -o output-folder genome file.fastq ]

- By default, final read alignments having more than TWO mismatches are discarded. This can be changed with the -N/--read-mismatches option.
- By default, any given read will have at most 20 different alignments reported in the tophat2 output (accepted_hits.bam). This can be changed with the -g/--max-multihits option: “TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments.”

IMPORTANT NOTE: The htseq-count script DOES NOT COUNT reads that map to multiple locations in the genome.
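
The two defaults above can be made explicit when a script assembles the command. The following is a minimal sketch, not the lab's actual script: the helper name `build_tophat2_command` and its keyword arguments are hypothetical, and it only builds the argument list (it does not run tophat2). The flags themselves are the ones named on the slide.

```python
def build_tophat2_command(gtf, out_dir, genome_index, fastq,
                          threads=12, read_mismatches=2, max_multihits=20):
    """Return a tophat2 argument list (hypothetical helper, for illustration).

    Values equal to the defaults described on the slide (2 mismatches,
    20 multihits) are left implicit, so the default call reproduces the
    command used in the scripts.
    """
    cmd = ["tophat2", "-p", str(threads), "--library-type", "fr-firststrand"]
    if read_mismatches != 2:
        # -N/--read-mismatches: alignments with more mismatches are discarded
        cmd += ["--read-mismatches", str(read_mismatches)]
    if max_multihits != 20:
        # -g/--max-multihits: maximum number of alignments reported per read
        cmd += ["--max-multihits", str(max_multihits)]
    cmd += ["-G", gtf, "-o", out_dir, genome_index, fastq]
    return cmd


print(" ".join(build_tophat2_command("file.gtf", "output-folder", "genome", "file.fastq")))
```

With the default arguments the printed line is the same command shown on the slide; passing, for example, `max_multihits=5` adds `--max-multihits 5` before the `-G` option.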

Review Alignment Summary Report

input: the number of reads in the sample’s .fastq file
mapped: the number of reads mapped to the genome (note that TWO mismatches are allowed by default)
multiple_alignments: reads mapped to more than one location
greater_than_20: reads mapping to more than 20 locations (note that only 20 alignments are kept in the accepted_hits.bam file)
overall: proportion of mapped reads to input reads

mapped - multiple_alignments = uniquely mapped reads
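
The arithmetic behind the report columns can be checked by hand. Below is a small sketch of that calculation; the function name `summarize_alignment` and the read counts in the example are made up for illustration and do not come from the example project.

```python
def summarize_alignment(input_reads, mapped, multiple_alignments):
    """Derive the 'overall' rate and the uniquely mapped read count.

    overall          = mapped / input
    uniquely mapped  = mapped - multiple_alignments
    """
    uniquely_mapped = mapped - multiple_alignments
    return {
        "overall": mapped / input_reads,
        "uniquely_mapped": uniquely_mapped,
        "unique_fraction": uniquely_mapped / input_reads,
    }


# Hypothetical sample: 1,000,000 input reads, 900,000 mapped, 100,000 multi-mapped
summary = summarize_alignment(1_000_000, 900_000, 100_000)
print(summary["overall"])          # 0.9
print(summary["uniquely_mapped"])  # 800000
```

Because htseq-count skips reads that map to multiple locations, the uniquely mapped count is the upper bound on the reads that can end up in the gene counts.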

Review htseq-count count files

Notes on count files created by htseq-count:

    [ htseq-count -f bam -a 0 -t exon -i gene_id -s reverse -m union accepted_hits.bam file.gtf ]

-f bam : the input file is a .bam file
-a 0 : minimum alignment quality required for a read to be counted (with 0, alignment quality is not checked)
-t exon : count a read if it aligns to an exon as specified in the .gtf file
-i gene_id : the .gtf attribute used as the feature ID, so counts are reported per gene_id
-s reverse : for stranded=reverse and single-end reads, the read has to be mapped to the opposite strand as the feature
-m union : a read that overlaps an exon (totally or partially) for one gene (and only one gene) is counted (see next slide)

More info at: http://www-huber.embl.de/HTSeq/doc/count.html

Stress the difference in how gtf files are used for tophat vs htseq-count.
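
The -m union rule can be stated as a small decision over the set of genes a read touches. The sketch below follows the description in the htseq-count documentation (take the union of the genes overlapped at each aligned position, then count only if exactly one gene remains). It is a simplified illustration, not HTSeq's code: the function name and the per-position input format are assumptions, and strand handling is taken as already applied.

```python
def assign_read_union(genes_per_position):
    """Assign one read under the 'union' overlap mode.

    genes_per_position: list of sets, one set per aligned base, holding the
    gene IDs whose exons cover that base (empty set = no exon there).
    Returns the gene ID to count, or the special counter the read falls into.
    """
    overlapped = set().union(*genes_per_position)
    if not overlapped:
        return "__no_feature"   # read touches no exon of any gene
    if len(overlapped) > 1:
        return "__ambiguous"    # read touches exons of more than one gene
    return next(iter(overlapped))


# Read partially overlapping an exon of a single gene: counted for that gene
print(assign_read_union([{"geneA"}, {"geneA"}, set()]))   # geneA
# Read overlapping exons of two genes: not counted for either
print(assign_read_union([{"geneA"}, {"geneA", "geneB"}]))  # __ambiguous
```

Note that a partial overlap is enough in union mode, which is why the first example is counted even though its last base lies outside any exon.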

htseq-count output

    0610005C13Rik            0
    0610007C21Rik            1318
    0610007L01Rik            957
    0610007P08Rik            0
    .
    mCG_21548                0
    mCG_55969                0
    mir694                   0
    rp9                      0
    rpf2                     0
    __no_feature             364413    read alignments not hitting an exon in any gene: not counted
    __ambiguous              3169      read alignments hitting more than one gene: not counted
    __too_low_aQual          0
    __not_aligned            0
    __alignment_not_unique   239889    read alignments mapped to multiple locations: not counted

1 read is counted multiple times in this figure
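
When reading a count file in a script, the special counters at the bottom should be separated from the gene rows before any totals are computed. A minimal sketch follows, assuming the usual two-column, tab-delimited htseq-count layout and the `__` prefix shown above; the helper name `split_counts` is hypothetical, and the sample text contains only the rows visible on the slide (the rows elided by "." are not reconstructed).

```python
import io

# Only the rows visible on the slide; the real file has one row per gene.
SAMPLE = """0610005C13Rik\t0
0610007C21Rik\t1318
0610007L01Rik\t957
0610007P08Rik\t0
mCG_21548\t0
mCG_55969\t0
mir694\t0
rp9\t0
rpf2\t0
__no_feature\t364413
__ambiguous\t3169
__too_low_aQual\t0
__not_aligned\t0
__alignment_not_unique\t239889
"""


def split_counts(handle):
    """Split htseq-count output into gene counts and special counters."""
    genes, special = {}, {}
    for line in handle:
        line = line.strip()
        if not line:
            continue
        name, value = line.split("\t")
        target = special if name.startswith("__") else genes
        target[name] = int(value)
    return genes, special


genes, special = split_counts(io.StringIO(SAMPLE))
print(sum(genes.values()))    # reads counted for the genes shown
print(sum(special.values()))  # read alignments that were not counted
```

Comparing the two sums for a full file gives a quick sense of how many alignments were lost to the three "not counted" categories.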

htseq-count – special counters

documentation http://www-huber. embl

__no_feature: reads (or read pairs) which could not be assigned to any feature (set S as described above was empty).
__ambiguous: reads (or read pairs) which could have been assigned to more than one feature and hence were not counted for any of these (set S had more than one element).
__too_low_aQual: reads (or read pairs) which were skipped due to the -a option, see below
__not_aligned: reads (or read pairs) in the SAM file without alignment
__alignment_not_unique: reads (or read pairs) with more than one reported alignment. These reads are recognized from the NH optional SAM field tag. (If the aligner does not set this field, multiply aligned reads will be counted multiple times, unless they get filtered out due to the -a option.)
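
The NH tag mentioned for __alignment_not_unique is an optional field at the end of each SAM record, written as NH:i:<number of reported alignments>. The sketch below shows how such a tag can be read from one tab-delimited SAM line using plain string handling; the function names and the example record are made up for illustration, and this is not how HTSeq itself is implemented.

```python
def nh_value(sam_line):
    """Return the NH tag value of one SAM record, or None if it is absent.

    A SAM record has 11 mandatory tab-separated fields; optional fields
    follow in TAG:TYPE:VALUE form.
    """
    fields = sam_line.rstrip("\n").split("\t")
    for field in fields[11:]:
        tag, _type, value = field.split(":", 2)
        if tag == "NH":
            return int(value)
    return None


def is_multimapped(sam_line):
    """True if the aligner reported more than one alignment for this read."""
    nh = nh_value(sam_line)
    return nh is not None and nh > 1


# Hypothetical record: a 36-base read with three reported alignments
record = "\t".join(["read1", "16", "chr1", "100", "50", "36M", "*", "0", "0",
                    "A" * 36, "I" * 36, "NH:i:3"])
print(nh_value(record))        # 3
print(is_multimapped(record))  # True
```

If the aligner does not write NH, `nh_value` returns None and the read cannot be recognized as multi-mapped, which is the situation the documentation warns about.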

Retrieve and un-compress short read files
  Tools: sftp command, tar command
  INPUT: .tgz file(s) from ftp.biotec.illinois.edu
  OUTPUT: .fastq short read files

Align Reads to genome
  Tools: Tophat 2 script
  INPUT: .fastq short read files
  OUTPUT: “accepted_hits.bam” for each sample
  OUTPUT: “align_summary.txt”

Alignment Summary Report to review alignment stats
  Tools: main_script_alignment_summary_16Gso.sh & create_alignment_summary.R
  INPUT: align_summary.txt files from tophat2
  OUTPUT: alignment_summary.txt (can be opened in excel)

Create Count Files (un-normalized counts by gene)
  Tools: main_script_htseq_count_16Gso.sh
  note: this script uses htseq-count http://www-huber.embl.de/HTSeq/doc/count.html
  INPUT: “accepted_hits.bam” file from each sample
  OUTPUT: .count files

Create MDS plots
  Tools: main_script_MDS_plot_16Gso.sh
  INPUT: .count files from htseq-count
  INPUT: TARGET file
  OUTPUT: MDS_plot_in_color.pdf

edgeR Differential Gene Expression
  Tools: main_script_de_w_edgeR_16Gso.sh
  INPUT: .count files from htseq-count
  INPUT: TARGET file
  OUTPUT: Differential Gene Expression files

What is a TARGET file?

The target file provides data for each sample to the MDS plot and edgeR DE scripts.
- The MDS plot script uses the File, Report_Group2 & MDS_plot columns
- The edgeR script uses the File, Label & Group columns

/home/share/example_rna_seq_project_16Gso/project_input_data/TARGET.txt
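
To see which parts of the TARGET file each script relies on, the file can be read as a table and reduced to the relevant columns. The sketch below assumes the file is tab-delimited with a header row, which the slides do not state. The column names are the ones listed above, but the sample rows, their values, and the helper names are invented for illustration and are not the contents of the real TARGET.txt.

```python
import csv
import io

MDS_COLUMNS = ("File", "Report_Group2", "MDS_plot")   # used by the MDS plot script
EDGER_COLUMNS = ("File", "Label", "Group")            # used by the edgeR script

# Hypothetical TARGET content (assumed tab-delimited, values are placeholders)
SAMPLE_TARGET = (
    "File\tLabel\tGroup\tReport_Group2\tMDS_plot\n"
    "sample_1.count\tS1\tgroupA\tgroupA\tvalue1\n"
    "sample_2.count\tS2\tgroupB\tgroupB\tvalue2\n"
)


def read_target(handle):
    """Read a tab-delimited TARGET file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter="\t"))


def select_columns(rows, columns):
    """Keep only the named columns of each row."""
    return [{name: row[name] for name in columns} for row in rows]


rows = read_target(io.StringIO(SAMPLE_TARGET))
print(select_columns(rows, EDGER_COLUMNS))
print(select_columns(rows, MDS_COLUMNS))
```

A missing column raises a KeyError in `select_columns`, which is a convenient early check that a TARGET file has the columns a given script expects.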

Instructions to create an MDS plot
INSTRUCTION SLIDE 1 of 2 – no need to use ‘screen’ as it runs quickly!

    Josephs-MacBook-Pro:~ josephtroy$ ssh jmtroy2@stubbslab.igb.illinois.edu
    jmtroy2@stubbslab.igb.illinois.edu's password:
    Last login: Mon Nov 21 20:15:51 2016 from c-73-73-226-74.hsd1.il.comcast.net
    [jmtroy2@stubbslab ~]$ df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sda1       4.6T  4.2T  150G  97% /
    /dev/sda2        95G   14G   77G  16% /var
    /dev/sdb1       289M   29M  246M  11% /boot
    tmpfs            32G     0   32G   0% /dev/shm
    /dev/sdb2       275G  116G  145G  45% /var/lib/mysql
    [jmtroy2@stubbslab ~]$ cd /home/share/example_rna_seq_project_16Gso/
    [jmtroy2@stubbslab example_rna_seq_project_16Gso]$ cd code_030_MDS_plots/
    [jmtroy2@stubbslab code_030_MDS_plots]$ sh main_script_MDS_plot_16Gso.sh
    Loading required package: methods
    [1] "end reading arguments"
    The output folder is: /home/share/example_rna_seq_project_16Gso/output_030_MDS_plots_RUN_20161205_174635
    end script main_script_MDS_plot_16Gso.sh at Mon Dec 5 17:46:43 CST 2016
    [jmtroy2@stubbslab code_030_MDS_plots]$

Instructions to view the MDS plot
INSTRUCTION SLIDE 2 of 2

In Cyberduck, find the “MDS_plot_in_color.pdf” file in the output folder and move it (drag and drop) to a local folder on your PC or laptop.
Open the “MDS_plot_in_color.pdf” file with Adobe Reader (or any other .pdf reader) on your PC or laptop.
-OR-
In Cyberduck, select the “MDS_plot_in_color.pdf” file and hit the space bar to view.

Instructions to run edgeR Differential Gene Expression
INSTRUCTION SLIDE 1 of 2 – no need to use ‘screen’ as it runs quickly!

    Josephs-MacBook-Pro:~ josephtroy$ ssh jmtroy2@stubbslab.igb.illinois.edu
    jmtroy2@stubbslab.igb.illinois.edu's password:
    Last login: Tue Dec 6 10:12:14 2016 from nat-relay.igb.illinois.edu
    [jmtroy2@stubbslab ~]$ cd /home/share/example_rna_seq_project_16Gso/
    [jmtroy2@stubbslab example_rna_seq_project_16Gso]$ cd code_060_differential_expression_w_edgeR/
    [jmtroy2@stubbslab code_060_differential_expression_w_edgeR]$ sh main_script_de_w_edgeR_16Gso.sh
    Loading required package: limma
    Loading required package: methods
    Loading required package: edgeR
    Running estimateCommonDisp() on DGEList object before proceeding with estimateTrendedDisp().
    Running estimateCommonDisp() on DGEList object before proceeding with estimateTagwiseDisp().
    The output folder is: /home/share/example_rna_seq_project_16Gso/output_060_differential_expression_w_edgeR_RUN_20161206_105505
    end script main_script_de_w_edgeR_16Gso.sh at Tue Dec 6 10:55:18 CST 2016
    [jmtroy2@stubbslab code_060_differential_expression_w_edgeR]$

Instructions to run edgeR Differential Gene Expression
INSTRUCTION SLIDE 2 of 2

In Cyberduck, find the “… pairwise_DGE.txt” files in the output folder and move them (drag and drop) to a local folder on your PC or laptop.
These are tab-delimited text files and can be opened with Excel.
The header information at the top of each file explains the meaning of the columns.