Bulk RNA-Seq Analysis Using CLCGenomics Workbench

2019
Ansuman Chattopadhyay, PhD, Asst Director, Molecular Biology Information Service, Health Sciences Library System, University of Pittsburgh
Sri Chaparala, MS, Bioinformatics Specialist, Health Sciences Library System, University of Pittsburgh

Topics
- Brief introduction to RNA-Seq experiments
- Analyze RNA-Seq data:
  - Download sequence reads from EBI-ENA/NCBI SRA
  - Import reads to CLC Genomics Workbench
  - Align reads to the reference genome
  - Estimate expression at the gene level
  - Estimate expression at the transcript isoform level
  - Statistical analysis of differentially expressed genes and transcripts
  - Create heat maps, volcano plots, and Venn diagrams

Differential Gene Expressions
Raw Reads Venn Diagram Volcano Plot

Fall 2019 HSLS MolBio Workshops
Scaife Hall, Falk Library, Classroom 2
Descriptions & Registration:

SEPTEMBER
- 4th: Single Cell RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: ChIP-Seq & CLC Genomics (10am-12pm Overview & 1-3pm Hands-On)
- 25th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)

OCTOBER
- 2nd: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 9th: Pathway Analysis: Open Access Tools (10am-12pm Overview & 1-3pm Hands-On)
- 23rd: ChIP-Seq & Partek Flow (1-4pm)
- 30th: Gene Regulation (1-4pm)

NOVEMBER
- 6th: Single Cell RNA-Seq (10am-12pm Overview & 1-3pm Hands-On)
- 13th: Gene Expression Visualization (1-4pm)
- 20th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)

DECEMBER
- 4th: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: Genetic Variation (10am-12pm Overview & 1-3pm Hands-On)

CRC Workshops

HSLS MolBio

Partek Flow : Software for scRNA-Seq Data Analysis

NGS Software @ HSLS MolBio
NGS Analysis Sanger Seq Analysis

RNA-Seq Software @ HSLS MolBio
[Diagram: RNA-Seq software at HSLS MolBio]
- RNA-Seq Reads (any organism)
- RNA-Seq Analysis: CLC Genomics Workbench (differentially expressed genes; Volcano Plot, PCA Plot, Venn Diagram, Heat Map)
- Downstream Analysis / Enrichment Analysis: Ingenuity Pathway Analysis, Key Pathway Advisor, Illumina BaseSpace Correlation Engine (functions, diseases, pathways, upstream regulators)

RNA-Seq Data Analysis Support through HSLS MBIS

RNA Seq Questionnaire
- What is the scientific objective of the RNA-Seq experiment?
- How many classes will be compared?
- Is only coding RNA (mRNA) expected to be detected, or also long non-coding RNA and miRNA?
- Did all the samples pass RNA quality checks before sequencing?
- Are there biological replicates? If so, how many?
- What type of sequencing platform was used to sequence the reads (Illumina, Ion Torrent, SOLiD)?
- Where was the sequencing performed? (Facility name and contact info)
- When was the sequencing performed? (Year/date)
- Which RNA extraction method was used in the experiment? (Total RNA / poly-A / rRNA depletion; method and kit name and, if possible, a link to the protocol)
- Is the protocol strand specific? (Unstranded / forward / reverse; kit name and, if possible, a link to the protocol)
- Is the data single end or paired end?
- What is the expected read length?
- Do the reads contain adapters, or have they been removed? If not removed, please provide the adapter sequence, if available, or a link (this information is usually available from the facility).
- What are the experimental conditions for the differential expression analysis?
- Which organism and reference genome should be used for the analysis?

CLC Genomics Workbench

CLCGx 12 Genomics Workbench BioMedical Workbench

Install Plugins

CLCbio Genomics Workbench
System Requirements
- Windows: Windows Vista, Windows 7, Windows 8, Windows 10, Windows Server 2012 or 2016
- Mac: OS X 10.10, and macOS 10.12, 10.13, 10.14
- Linux: RHEL 7 and later, SUSE Linux Enterprise Server 11 and later. (The software is expected to run without problem on other recent Linux systems, but we do not guarantee this.)
- 8 GB RAM required; 16 GB RAM recommended
- 1024 x 768 display required; 1600 x 1200 display recommended
- Intel or AMD CPU required
- 500 GB disk space required in the CLC Genomics server

HPC Partnership with CRC to Mitigate Computational Bottleneck
NGS Pitt HSLS License Server

CLCBio Genomics Workbench Server
- You can connect your CLC Genomics Workbench software to the core HTC cluster available to University of Pittsburgh researchers through the Center for Research Computing (CRC).
- This allows you to transparently migrate data from your workstation to the cluster and run analyses on the cluster, which then run independently of your workstation (i.e. you can shut down your machine and your analyses will continue unabated).

Center for Research Computing (CRC)

Request Access to CRC

CLC Genomics Workbench
- Ensure you have the most up-to-date version of the CLCbio Genomics Workbench (the software should tell you if there's a more recent version when you start it, or you can check on the CLCbio website).
- If you have not already done so, request a user account/allocation on the Center for Research Computing (CRC) HTC cluster by filling out the required information.
- If your computer is not connected to the Pitt network (e.g. you are working from home or on a trip), or you are working from a laptop that is connected to the Pitt wireless system, make sure you set up Pitt VPN so that you can communicate with the CLC Bioserver on the HTC cluster.
- Start the CLC Genomics Workbench.

Connect to CLC Server @ CRC

Access to CRC-HTC Cluster – CLC Server
If you DO NOT HAVE a CRC-HTC account, use the following for limited access during the workshop:
- UserID: hslsmolb
- PW: library1#
- Server name: clcbio.crc.pitt.edu
- Port: 7777

If you have a CRC-HTC account, use:
- Pitt user name; Pitt password
- Server name: clcbio.crc.pitt.edu
- Port: 7777

File Structure at CRC CLC Gx Server
folders organized by PI’s name

Pre-analyzed Results

RNA-Seq Data

Bulk RNA-seq Study

NCBI SRA

NCBI SRA Untreated Vs DEX

RNA-Seq Basics

RNA-Seq vs. Microarrays
RNA-Seq:
- Covers more dynamic range
- Allows discovery of novel transcripts
- Able to detect SNPs
- More costly ($300-$1000/sample) than microarray ($100-$200/sample)
- Generates a larger dataset than microarray (uncompressed RNA-Seq raw files: >5 GB)

Riki Kawaguchi’s Blog:
Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE 2014 Jan 16;9(1):e78644.

Convert to cDNA fragments -> adaptor ligation -> short sequence reads -> align reads to reference genome

Bulk RNA-Seq
Fragmentation of RNA before cDNA synthesis was shown to reduce 3ʹ:5ʹ bias (ref. 4), and strand-specific library preparation methods, which allow sense and antisense transcripts to be differentiated, were shown to provide a more accurate estimate of transcript abundance.

Bulk RNA-Seq Data Analysis Workflow

Bulk RNA-Seq Data Analysis Steps
Command Line Tools / Graphical User Interface

In workflow A, aligners such as TopHat (ref. 112), STAR (ref. 113) or HISAT2 (ref. 114) use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq (ref. 133) and featureCounts (ref. 134), assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM) (ref. 142)), gene expression is modelled using tools such as edgeR (ref. 143), DESeq2 (ref. 155) and limma+voom (ref. 156), and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.

In workflow B, newer, alignment-free tools, such as Kallisto (ref. 119) and Salmon (ref. 120), assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport (ref. 130) (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.

Alternatively, workflow C begins by aligning the reads (typically performed with TopHat (ref. 112), although STAR (ref. 113) and HISAT (ref. 114) can also be used), followed by the use of CuffLinks (ref. 131) to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.

Other tools in common use include StringTie (ref. 116), which assembles a transcriptome model from TopHat (ref. 112) (or similar tools) before the results are passed through to RSEM (ref. 105) or MMSEQ (ref. 132) to estimate transcript abundance, and then to Ballgown (ref. 157) to identify differentially expressed genes or transcripts, and SOAPdenovo-trans (ref. 117), which simultaneously aligns and assembles reads for analysis via the path of choice.

Taken from Stark et al., Nat Rev Genet 2019 paper: Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. (2019). doi: /s

CommandLine vs Graphical User Interface
CLI GUI

CLC Genomics Software User Interface

Contact CLCBio Support Team
CLCGX 12.0 User Manual:

Create a Folder in CRC-HTC Cluster

Create Workshop Folder@ HTC-CLC Server

CLCGX Tools for RNA-Seq Data Analysis

Import FASTQ Reads to CLCGx

Import FASTQ Reads to CLCGx
- Import your saved data from a local computer or from CRC servers
- NCBI SRA download in CLC

NCBI Seq Read Archive, by NGS technology (several figures are truncated in the source):
- Illumina: 6,235591
- ABI SOLiD: 27,315
- Ion Torrent: 88,946
- PacBio: ,538
- MinION: ,404

Import Reads Stored in Local Computer Files to CLCGx

Import Reads to CLC

Import Reads from CRC Server
Select the Grid option: HTC Data. CRC can assign each group (faculty) an import/export directory on the server. Members of the group share this import/export directory with read/write permissions. Please open a support ticket on the CRC website if you do not find a folder matching your group.

Download Reads from NCBI SRA database

NCBI SRA download in CLC

Download FASTQ Reads from EBI ENA

Help : Import Illumina Reads

FASTQ Format

Results By CLC : Imported Illumina Reads
TrainingMaterials > Workshops > CBF_AMLeukemiaProject > RNASeq_GSE101788 > RNASeq_DifferentialExpression > Reads
The reads are already downloaded. You can find them in Server Folder > TrainingMaterials > Pre-analyzed Result_RNA-Seq.

Imported Illumina Reads

A Preprocessing includes experimental design, sequencing design, and quality control steps.

Number of Replicates Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth also can improve statistical power for lowly expressed genes.
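The low-expression filtering mentioned above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counts-per-million (CPM) cutoff, the minimum number of samples, the function names, and the toy gene counts are all hypothetical and are not taken from CLC Genomics Workbench.

```python
def counts_per_million(counts):
    """Convert one sample's raw counts {gene: count} to counts per million."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: c * 1_000_000 / total for gene, c in counts.items()}


def filter_low_expression(samples, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples.

    The thresholds are illustrative defaults, not recommended values.
    """
    cpm = [counts_per_million(s) for s in samples]
    return [
        gene
        for gene in samples[0]
        if sum(1 for sample_cpm in cpm if sample_cpm[gene] >= min_cpm) >= min_samples
    ]


# Hypothetical toy data: raw read counts per gene for three samples.
samples = [
    {"GENE_A": 500_000, "GENE_B": 500_000, "GENE_C": 0},
    {"GENE_A": 600_000, "GENE_B": 400_000, "GENE_C": 0},
    {"GENE_A": 450_000, "GENE_B": 549_998, "GENE_C": 2},
]
kept = filter_low_expression(samples)
print(kept)
```

GENE_C reaches the cutoff in only one sample, so it is dropped before testing, which leaves fewer hypotheses to correct for.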

QC for Sequencing Reads

https://galaxyproject. github

FASTQC Project

Phred Score wikipedia
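As a reminder of what a Phred score encodes, the sketch below converts a quality score Q to an error probability using P = 10^(-Q/10) and decodes a FASTQ quality string. It assumes the Phred+33 ASCII offset used by Sanger-style and recent Illumina FASTQ files; the function names and the example quality string are made up for illustration.

```python
def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def decode_phred33(quality_string):
    """Decode a FASTQ quality string, assuming the Phred+33 ASCII offset."""
    return [ord(ch) - 33 for ch in quality_string]


# Hypothetical quality string for a 4-base read.
scores = decode_phred33("II5!")
print(scores)
print([phred_to_error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and 40 to 1 in 10,000.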

Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training

Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiments. We recommend that outliers with over 30 % disagreement be discarded. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training

Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)

Create a Seq QC Report

Results By CLC: Read QC Report


Read Trimming (based on quality of reads or adapters)
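To make the idea of quality-based trimming concrete, here is a minimal Python sketch that removes low-quality bases from the 3’ end of a read. It is a deliberately simple rule, not the trimming algorithm implemented in CLC Genomics Workbench; the threshold of 20, the function name, and the example read are assumptions for illustration.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one has quality >= min_quality.

    `qualities` holds one Phred score per base. Simplified rule for
    illustration only; real trimmers use more robust criteria.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


# Hypothetical read whose quality decays toward the 3' end.
seq, quals = trim_3prime("ACGTACGT", [38, 37, 36, 30, 25, 18, 12, 5])
print(seq, quals)
```

Only the trailing run of low-quality bases is removed; a low-quality base in the middle of the read would be left in place by this rule.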

Trim Reads

Read Trimming

Annotate Reads: Create a Metadata Table

Create and Import a Metadata Table
Spread Sheet

Import Metadata
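A metadata table is just a spreadsheet with one row per sample and one column per attribute. The sketch below builds such a table as CSV text and reads it back to list the samples in each group. The sample names and the column names `Name` and `Treatment` are hypothetical, and whether a given CLC version imports CSV directly is not stated here, so the file may need to be saved as a spreadsheet first.

```python
import csv
import io

# Hypothetical samples and column names; replace with the real read names
# and experimental groups (e.g. Untreated vs DEX).
ROWS = [
    {"Name": "untreated_rep1", "Treatment": "Untreated"},
    {"Name": "untreated_rep2", "Treatment": "Untreated"},
    {"Name": "dex_rep1", "Treatment": "DEX"},
    {"Name": "dex_rep2", "Treatment": "DEX"},
]


def write_metadata_csv(rows):
    """Return the metadata table as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["Name", "Treatment"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def group_samples(csv_text, column="Treatment"):
    """Read the CSV back and collect sample names per group."""
    groups = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        groups.setdefault(row[column], []).append(row["Name"])
    return groups


metadata_csv = write_metadata_csv(ROWS)
groups = group_samples(metadata_csv)
print(metadata_csv)
print(groups)
```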

Read Mapping

Read Mapping Wikipedia

Read Mapping Ozsolak et al. Nature Review Genetics

CLC Read Mapper Documentation

Read Mapping 5

Reads Mapping 7

Reads Mapping 8

Reference Genome

Reference Genomes https://www.ncbi.nlm.nih.gov/grc

Reference Genome
- Human: GRCh38
- Mouse: mm10 (C57BL/6J); 16 other mouse strains are now available

Read Mapping

Read Mapping 9

Reads Mapping 10

Reads Mapping

Mapping Result GE : Gene Expression; TE: Transcript Expression; FG: Fusion Gene

Reads Mapping

Normalization and Expression Values
TMM (trimmed mean of M values): the weighted trimmed mean of the log expression ratios, used by edgeR and CLCGx.

Normalization Methods
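Two expression values that appear in RNA-Seq reports, RPKM and TPM, can be computed from raw counts and gene lengths as sketched below. These are the usual textbook definitions; the exact formulas used by CLC Genomics Workbench may differ in detail (for example in how total mapped reads are counted), and the gene names, counts, and lengths are invented for illustration.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of gene per million counted reads.

    Assumes the library size is the sum of the counts given here.
    """
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (total * lengths_bp[g]) for g in counts}


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {g: counts[g] / lengths_bp[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] / scale * 1e6 for g in counts}


# Hypothetical counts and gene lengths (in base pairs).
counts = {"GENE_A": 100, "GENE_B": 300, "GENE_C": 600}
lengths = {"GENE_A": 1000, "GENE_B": 3000, "GENE_C": 2000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
```

GENE_A and GENE_B get the same value because GENE_B has three times the reads but is also three times as long. TPM values always sum to one million within a sample, which is why they are often preferred for comparing proportions across samples.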

Reads Mapping GE

Transcript Expression

Read Mapping Report – SRR5861494
An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’).

Transcript Level Expression

Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases.

Create a Combined RNA-Seq Report

The biotypes are reported "as a percentage of all transcripts" or "as a percentage of all genes". For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein-coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issues in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied. CLC Gx Manual
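The two rules of thumb from the mapping report slides (roughly 70 to 90 % of reads mapping, and less than 15 % of reads mapping to rRNA) can be turned into a small per-sample check. This is a sketch only: the function name, the choice to express the rRNA fraction relative to mapped reads, and the example read counts are assumptions, and the thresholds should be adapted to the experiment.

```python
def qc_flags(total_reads, mapped_reads, rrna_reads,
             min_mapped_pct=70.0, max_rrna_pct=15.0):
    """Return (mapped %, rRNA %, list of warnings) for one sample.

    Assumption: the rRNA percentage is computed relative to mapped reads.
    """
    mapped_pct = 100.0 * mapped_reads / total_reads
    rrna_pct = 100.0 * rrna_reads / mapped_reads
    flags = []
    if mapped_pct < min_mapped_pct:
        flags.append("low mapping rate")
    if rrna_pct > max_rrna_pct:
        flags.append("high rRNA fraction")
    return mapped_pct, rrna_pct, flags


# Hypothetical read counts for one problematic sample.
mapped_pct, rrna_pct, flags = qc_flags(
    total_reads=1_000_000, mapped_reads=650_000, rrna_reads=130_000
)
print(mapped_pct, rrna_pct, flags)
```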


Create a PCA Plot - QC at the sample level

Differential Expression
- Differential expression between two groups, e.g. Treated vs Untreated, KO vs WT
- Differential expression between multiple groups

Differential Expressions Between Two Groups – Treated vs Untreated
First, select mapped reads from the test samples. Then, select mapped reads from the control samples.

Commonly Used Tools for Differential Gene Expression Analysis

Differential Gene Expression – Treated vs Untreated
TMM Normalization (Trimmed Mean of M values) calculates effective library sizes, which are then used as part of the per-sample normalization. TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed.
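The intuition behind TMM can be shown with a stripped-down Python sketch. It computes per-gene log2 ratios of proportions (M values) between a sample and a reference, trims the extremes, and averages the rest. This is a simplification and an assumption-laden illustration: the full method, as implemented in edgeR, also trims on average expression (A values) and uses precision weights, neither of which is done here. The function name and toy counts are hypothetical.

```python
import math


def simple_tmm_factor(sample, reference, trim=0.3):
    """Unweighted trimmed mean of M values, returned as a scaling factor.

    `sample` and `reference` are lists of raw counts in the same gene order.
    Simplified: no A-value trimming and no weights.
    """
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference)
        if s > 0 and r > 0
    )
    if not m_values:
        raise ValueError("no genes with non-zero counts in both samples")
    k = int(len(m_values) * trim)          # values trimmed from each tail
    kept = m_values[k:len(m_values) - k]
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: nine unchanged genes and one strongly induced gene.
reference = [100] * 10
sample = [100] * 9 + [1100]

factor = simple_tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(factor, effective_library_size)
```

The induced gene doubles the raw library size of the sample, but because it is trimmed away, the effective library size comes back to that of the reference, so the nine unchanged genes are not reported as down-regulated.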

Differential Expression - Treated vs Untreated
Use the metadata table to define groups

Differential Gene Expression

Differential Expression – Gene level

Fold Change in Natural Scale vs Log Scale
GraphPad Statistics Guide :
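The relationship between the two scales can be illustrated with a short sketch. It assumes the common convention in which a signed natural-scale fold change is the ratio itself when expression goes up and the negative reciprocal when it goes down; the function names and expression values are illustrative and not taken from the CLC output.

```python
import math


def log2_fold_change(treated, control):
    """Log2 ratio of treated over control expression."""
    return math.log2(treated / control)


def signed_fold_change(log2_fc):
    """Convert a log2 fold change to a signed natural-scale fold change.

    Up-regulation gives the ratio (>= 1); down-regulation gives the
    negative reciprocal (<= -1), so +4 and -4 are equally large changes.
    """
    ratio = 2 ** log2_fc
    return ratio if ratio >= 1 else -1 / ratio


up = log2_fold_change(treated=40, control=10)
down = log2_fold_change(treated=10, control=40)
print(up, signed_fold_change(up))
print(down, signed_fold_change(down))
```

On the log scale, a four-fold increase and a four-fold decrease are symmetric around zero (+2 and -2), whereas the plain ratios 4 and 0.25 are not, which is why log2 fold changes are used on plot axes.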

Data Visualization

Differential Expression - Volcano Plot
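A volcano plot places each gene at x = log2 fold change and y = -log10 of its p-value, so that large and statistically significant changes appear in the upper left and upper right corners. The sketch below computes those coordinates and labels genes using cutoffs; the cutoffs of |log2 FC| >= 1 and p <= 0.05, the function names, and the example genes are assumptions for illustration only.

```python
import math


def volcano_point(log2_fc, p_value):
    """Return the (x, y) position of a gene on a volcano plot."""
    return log2_fc, -math.log10(p_value)


def classify(log2_fc, p_value, min_abs_log2_fc=1.0, max_p=0.05):
    """Label a gene as 'up', 'down', or 'not significant' (illustrative cutoffs)."""
    if p_value <= max_p and log2_fc >= min_abs_log2_fc:
        return "up"
    if p_value <= max_p and log2_fc <= -min_abs_log2_fc:
        return "down"
    return "not significant"


# Hypothetical results: gene -> (log2 fold change, p-value).
results = {
    "GENE_A": (2.5, 0.001),
    "GENE_B": (-1.8, 0.01),
    "GENE_C": (3.0, 0.2),
    "GENE_D": (0.3, 0.0001),
}
labels = {gene: classify(fc, p) for gene, (fc, p) in results.items()}
print(labels)
print(volcano_point(*results["GENE_B"]))
```

GENE_C has a large fold change but a weak p-value, and GENE_D has a strong p-value but a small fold change, so neither is highlighted.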

Create a HeatMap

Running CLC Genomics software on CRC HTC Cluster

Create a Track

Expression Browser – all in one large spread sheet

Downstream Analysis

Downstream Analysis: DEG
Annotates differentially expressed genes from an RNA-seq experiment, using the curated public data from GEO

NextBio Research

Export Data from CLC

Find Correlated Gene Expression Studies from GEO

Ingenuity IPA Analysis

Suggested MBIS Workshops
