Quality Control & Nascent Sequencing

Slides:



Advertisements
Similar presentations
IMGS 2012 Bioinformatics Workshop: RNA Seq using Galaxy
Advertisements

High Throughput Sequencing
Before we start: Align sequence reads to the reference genome
NGS Analysis Using Galaxy
Expression Analysis of RNA-seq Data
Bioinformatics and OMICs Group Meeting REFERENCE GUIDED RNA SEQUENCING.
Experimental validation. Integration of transcriptome and genome sequencing uncovers functional variation in human populations Tuuli Lappalainen et al.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
IPlant Collaborative Discovery Environment RNA-seq Basic Analysis Log in with your iPlant ID; three orange icons.
Introductory RNA-seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis.
RNA-Seq Transcriptome Profiling. Before we start: Align sequence reads to the reference genome The most time-consuming part of the analysis is doing the.
Genome-wide association study between DSE polymorphism and Poly-A usage in Human population Hiren Karathia Sridhar Hannenhalli.
Trinity College Dublin, The University of Dublin GE3M25: Data Analysis, Class 4 Karsten Hokamp, PhD Genetics TCD, 07/12/2015
IGV tools. Pipeline Download genome from Ensembl bacteria database Export the mapping reads file (SAM) Map reads to genome by CLC Using the mapping.
Objectives Genome-wide investigation – to estimate alternate Poly-Adenylation (APA) usage on 3’UTR – to identify polymorphism of Downstream Sequence Elements.
Chip – Seq Peak Calling in Galaxy Lisa Stubbs Lisa Stubbs | Chip-Seq Peak Calling in Galaxy1.
HOMER – a one stop shop for ChIP-Seq analysis
Introduction to Exome Analysis in Galaxy Carol Bult, Ph.D. Professor Deputy Director, JAX Cancer Center Short Course Bioinformatics Workshops 2014 Disclaimer…I.
Short Read Workshop Day 1 - Experimental Design Example 1: How to log in to vieques.
IGV Demo Slides:/g/funcgen/trainings/visualization/Demos/IGV_demo.ppt Galaxy Dev: 0.
RNA Seq Analysis Aaron Odell June 17 th Mapping Strategy A few questions you’ll want to ask about your data… - What organism is the data from? -
Introductory RNA-seq Transcriptome Profiling of the hy5 mutation in Arabidopsis thaliana.
Canadian Bioinformatics Workshops
Short Read Workshop Day 1 - Experimental Design Example 1: How to log in to vieques.
Short Read Workshop Day 1 - Experimental Design Video 1- Why to short read sequence (or not)
Canadian Bioinformatics Workshops
Konstantin Okonechnikov Qualimap v2: advanced quality control of
Introductory RNA-seq Transcriptome Profiling
Using command line tools to process sequencing data
NGS File formats Raw data from various vendors => various formats
Day 5 Mapping and Visualization
Short Read Sequencing Analysis Workshop
Placental Bioinformatics
What is a Hidden Markov Model?
Cancer Genomics Core Lab
WS9: RNA-Seq Analysis with Galaxy (non-model organism )
RNA Sequencing Day 7 Wooohoooo!
Variant Calling Workshop
Short Read Sequencing Analysis Workshop
RNA-Seq analysis in R (Bioconductor)
Chip – Seq Peak Calling in Galaxy
S1 Supporting information Bioinformatic workflow and quality of the metrics Number of slides: 10.
Introductory RNA-Seq Transcriptome Profiling
GE3M25: Data Analysis, Class 4
ChIP-Seq Analysis – Using CLCGenomics Workbench
Figure 1. Effect of acute TNF treatment on transcription in human SGBS adipocytes as assessed by RNA-seq and RNAPII ChIP-seq. Following 10 days in vitro.
Volume 5, Issue 3, Pages (November 2013)
by Leighton J. Core, Joshua J. Waterfall, and John T. Lis
TSS Annotation Workflow
Regulation of Gene Expression by Eukaryotes
Welcome to the LMS Quick Manager Guide.
Two-sided p-values (1.4) and Theory-based approaches (1.5)
Stat 217 – Day 28 Review Stat 217.
ChIP-Seq Data Processing and QC
Agenda 3/16 Eukaryotic Control Introduction and Reading
EXTENDING GENE ANNOTATION WITH GENE EXPRESSION
The transcription process is similar to replication.
Adrien Le Thomas, Georgi K. Marinov, Alexei A. Aravin  Cell Reports 
Maximize read usage through mapping strategies
Welcome to the LMS Quick Manager Guide.
Garbage In, Garbage Out: Quality control on sequence data
Regulatory Genomics Lab
Additional file 2: RNA-Seq data analysis pipeline
BF528 - Sequence Analysis Fundamentals
Introduction to RNA-Seq & Transcriptome Analysis
Chip – Seq Peak Calling in Galaxy
by Leighton J. Core, Joshua J. Waterfall, and John T. Lis
Short Read Sequencing Analysis Workshop

RNA-Seq Data Analysis UND Genomics Core.
Presentation transcript:

Quality Control & Nascent Sequencing 2019 Short Read Sequencing Analysis Workshop Day 7 Quality Control & Nascent Sequencing

Today’s plan Quality Control (QC) Pileup, RSeQC, MultiQC Nascent Analysis FStitch, Tfit, & DAStk

Before we begin.... Run the following script in the EC2 instance:       $ sh /scratch/Workshop/SR2019/7_nascent/scripts/install_python.sh

Part 1: Quality Control

Data quality affects the type and quality of information that can be obtained from a sequencing experiment What do we notice about the two samples below?

Preseq : Measures data complexity Complexity = # of unique reads / total reads sequenced

Pileup : Assesses coverage Coverage = # of base pairs with at least 1 read mapping (not depth)

Two Important Points 1. Coverage and complexity are correlative but are NOT the same 2. Increased coverage/complexity does not guarantee better data

Running FastQC We've already done this... so this should be a good warm-up! Copy the script /scratch/Workshop/SR2019/7_nascent/scripts/fastqc.sbatch  to your /scratch/Users directory Run FastQC on the fastq files in the RNA-seq and GRO-seq directories      RNA-seq directory: /scratch/Workshop/SR2019/RNA-seq/fastqs      GRO-seq directory: /scratch/Workshop/SR2019/GRO-seq/fastqs

Running Pileup Copy the script /scratch/Workshop/SR2019/7_nascent/scripts/pileup.sbatch to your /scratch/Users directory Edit the pileup.sbatch script and run it on one RNA-seq BAM file and one GRO-seq BAM file      RNA-seq directory: /scratch/Workshop/SR2019/RNA-seq/mapped/bams       GRO-seq directory: /scratch/Workshop/SR2019/GRO-seq/mapped/bams 3. Look at the two *coverage.stats.txt files – what do you notice?

Running RSeQC Read Distribution Copy the script /scratch/Workshop/SR2019/7_nascent/scripts/read_dist.sbatch to your /scratch/Users directory Edit the read_dist.sbatch script and run it on one RNA-seq BAM file and one GRO-seq BAM file       RNA-seq directory: /scratch/Workshop/SR2019/RNA-seq/mapped/bams       GRO-seq directory: /scratch/Workshop/SR2019/GRO-seq/mapped/bams 3. Look at the two *read_dist.txt files – what do you notice?

But... what if I have a lot of samples? It would be nice to be able to view the stats on all of your samples at once.... Fortunately, there's a tool for that: MultiQC And I already posted the link for it on Day 4: http://dowell.colorado.edu/ShortRead/sr2019.html#/day4

Running MultiQC If you kept the scripts formatted the way I had them, you should now have a /scratch/Users/USERNAME/qc directory In scratch, enter the following lines into your terminal:      $ export PATH=~/.local/bin:$PATH      $ multiqc /scratch/Users/USERNAME/qc -o /scratch/Users/USERNAME/qc/multiqc NOTE: Because there aren't many files, this is a super quick, low- memory. If it were a larger job, we would run this in the queue

Look at MultiQC html output file in X2Go

Part 2: Generating Annotations/Regions of Interest

Steady-state (RNA-seq) Nascent (GRO-seq, PRO-seq) Alternative gene splicing/isoform analysis Differential expression of steady- state Post-transcriptional modifications Exon/intron boundaries Transcriptional regulatory elements (TREs) RNA-polymerase activity Initiation Pausing Elongation Termination Immediate changes upon permutation

Steady-state vs. Nascent

Information Obtained from Nascent Sequencing Transcription Regulatory Elements (TREs) (orange) RNAP Pausing Index (red) 3'-end run-on (blue)

Need a way to quickly annotate transcriptional regions of interest Gene annotations are based on mRNA... Need a way to quickly annotate transcriptional regions of interest

FStitch | Train Generates model parameters given a set of hand-annotated training regions 1.  Look at training file format:         $ less /scratch/Workshop/SR2019/7_nascent/chr1_hct116.bed 2.  Log onto X2Go and load IGV:         $ sh /opt/igv/2.3.75/igv.sh 3.  Then, in IGV, import our training file:         Regions --> Region Navigator … 4.  Choose 3 more "OFF" regions and 3 more "ON" region         Regions --> Export Regions … 5. Edit and run 7_fstitch_train.sbatch script       

Training File Output

FStitch | Segment Given the model parameters, segment the genome into active ("ON") and inactive ("OFF") regions of transcription 1. Edit and run 7_fstitch_segment.sbatch script 2. When the job is complete, look at segment data in IGV       

FStitch | Bidir Using FStitch segment output and a gene annotation file, we can now parse bidirectionals (active regions of RNAP loading) 1. Edit and run 7_fstitch_bidir.sbatch script 2. When the job is complete, look at bidirectional annotations in IGV       

Applications of FStitch Output Differential transcription analysis of genes/bidirectionals

Applications of FStitch Output Pre-filter to Tfit (a tool that models RNAP behavior)

Tfit Model Cartoon

Motif Displacement Analysis Generate a "metagene" using the annotations Calculate the ratio of the number of motifs within 300bp from the center of each annotation to those found within 3000bp Determine whether the change in MD score is statistically significant

Homework Run QC on ChIP and ATAC and compare this data to RNA-seq and GRO-seq Use the scripts provided to calculate MD scores using Tfit annotations If you are struggling, keep in mind we will get more practice with this on Day 9/10

The End Questions?? Don’t forget the homework. Help session in JSCBB A108 from 1-3pm Watch videos for Day 8