RNA Sequencing Day 7 Wooohoooo! Aaron Odell June 17th 2015
Outline For The Day Map raw fastq files to reference genome Convert mapped data to visualization ready files “Quantify” mapped data over reference genome genes/annotations Differential Expression
Data Set Mouse RNA Seq mm10 Reference Genome (.fasta file) PolyA Strand Specific Library. Ethanol treatment 2 x 100 Paired End Reads = 2 Fastq Files mm10 Reference Genome (.fasta file) Ensemble mm10 gene annotation file (.gtf) Due to time and compute restraints we will only be working with Chromosome 19 Pull up what a gtf looks like
What Mapping Program To Use?... Need program that is able to handle spliced reads… Tophat2, GSNAP, STAR, etc … We will be using Tophat2 https://ccb.jhu.edu/software/tophat/manual.shtml Excellent spliced alignment program evaluation http://www.nature.com/nmeth/journal/v10/n12/full/nmeth.2722.html
First things first Log onto vieques and load all modules we will need to run analysis successfully
module load commands module load python_2.7.3 module load samtools_0.1.16 module load numpy_1.6.1 module load pysam_0.8.1 module load bowtie_bowtie2-2.0.2 module load tophat_2.0.6 module load htseq_0.6.1
Lets begin… cp -r /projects/sreadgrp/RNA /Users/username/ Edit /Users/username/RNA/SCRIPTS/ILS_E_tophat_samtools.pbs in text editor Change all “odella” to your actual username Example /Users/odella/ /Users/your_username/ Change #PBS –M your_email_address to your actual email address (or if you don’t want the email then do nothing) From the /Users/username/RNA/SCRIPTS directory qsub ILS_E_1_tophat_samtools.pbs Have half the class copy. Then wait a minute or two. Then have to other half the class do the copy
While we wait… Edit /Users/username/RNA/SCRIPTS/ILS_E_htSeqCount.pbs Change all “odella” to your actual username htSeqCount manual http://www-huber.embl.de/users/anders/HTSeq/doc/count.html Once /Users/username/RNA/SCRIPTS/ILS_E_tophat_samtools.pbs has run to completion, qsub /Users/username/RNA/SCRIPTS/ILS_E_htSeqCount.pbs
For those of you who are bored… Read through the tophat and htseq manuals Take a look at various files created by our mapping/quantification pipeline. Tophat manual explains what all these files are and where they come from Run this command from /Users/username/RNA/ILS_E directory samtools flagstat ILS_E_sorted.bam Do you understand the output? Search the internet or ask someone if you don’t
Time to visualize! Copy bam and genome files to student directory cp /Users/username/RNA/ILS_E/ILS_E_sorted.bam /projects/sreadgrp/student/username/ILS_E_sorted.bam cp /Users/username/RNA/ILS_E/ILS_E_sorted.bam.bai /projects/sreadgrp/student/username/ILS_E_sorted.bam.bai cp /User/username/RNA/GENOME/chr19.fasta /projects/sreadgrp/student/username/chr19.fasta cp /User/username/RNA/GENOME/chr19.fasta.fai /projects/sreadgrp/student/username/chr19.fasta.fai cp /User/username/RNA/GENOME/chr19.gtf /projects/sreadgrp/student/username/chr19.gtf
Load the approriate files Load genome from file Chr19.fasta Load from file Chr19.gtf ILS_E_sorted.bam
Look at ILS_E_htSeq.txt Count the number of reads you see mapping in IGV over any given gene and compare to the htSeqCount value Does it make sense?