Bulk RNA-Seq Analysis Using CLC Genomics Workbench 2019. Ansuman Chattopadhyay, PhD, Asst Director, Molecular Biology Information Service, Health Sciences Library System, University of Pittsburgh, ansuman@pitt.edu. Sri Chaparala, MS, Bioinformatics Specialist, Health Sciences Library System, University of Pittsburgh, srichaparala@pitt.edu
Topics
- Brief introduction to RNA-Seq experiments
- Analyze RNA-Seq data
- Download sequence reads from EBI-ENA/NCBI SRA
- Import reads to CLC Genomics Workbench
- Align reads to the reference genome
- Estimate expression at the gene level
- Estimate expression at the transcript isoform level
- Statistical analysis of the differentially expressed genes and transcripts
- Create heat maps, volcano plots, and Venn diagrams
Differential Gene Expressions Raw Reads Venn Diagram Volcano Plot
Fall 2019 HSLS MolBio Workshops
Scaife Hall, Falk Library, Classroom 2. Descriptions & Registration: http://www.hsls.pitt.edu/calendar
SEPTEMBER
- 4th: Single Cell RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: ChIP-Seq & CLC Genomics (10am-12pm Overview & 1-3pm Hands-On)
- 25th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
OCTOBER
- 2nd: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 9th: Pathway Analysis: Open Access Tools (10am-12pm Overview & 1-3pm Hands-On)
- 23rd: ChIP-Seq & Partek Flow (1-4pm)
- 30th: Gene Regulation (1-4pm)
NOVEMBER
- 6th: Single Cell RNA-Seq (10am-12pm Overview & 1-3pm Hands-On)
- 13th: Gene Expression Visualization (1-4pm)
- 20th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
DECEMBER
- 4th: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: Genetic Variation (10am-12pm Overview & 1-3pm Hands-On)
CRC Workshops https://crc.pitt.edu/Register/CRC/Workshops/Fall/2019
NGS @ Pitt
NGS@Pitt http://www.hscrf.pitt.edu/sites/default/files/NGS/NGSFlow.htm
Software @ HSLS MolBio http://hsls.libguides.com/molbio/licensedtools/resources
Partek Flow : Software for scRNA-Seq Data Analysis http://hsls.libguides.com/molbio/partekflow
NGS Software @ HSLS MolBio NGS Analysis Sanger Seq Analysis
RNA-Seq Software @ HSLS MolBio Enrichment Analysis Differentially Expressed Genes CLC Genomics Workbench Ingenuity Pathway Analysis Functions Diseases Pathways Key Pathway Advisor Upstream Regulators Volcano Plot PCA Plot Venn Diagram Heat Map Any Organism Illumina BaseSpace Correlation Engine RNA-Seq Reads RNA-Seq Analysis Downstream Analysis
RNA-Seq Data Analysis Support through HSLS MBIS http://info.hsls.pitt.edu/updatereport/?p=9974
RNA-Seq Questionnaire
- What is the scientific objective of the RNA-Seq experiment?
- How many classes will be compared?
- Are only coding RNAs (mRNA) expected to be detected, or also long non-coding RNA and miRNA?
- Did all the samples pass RNA quality checks before sequencing?
- Are there biological replicates? If so, how many?
- What type of sequencing platform was used to sequence the reads? (Illumina, Ion Torrent, SOLiD)
- Where was the sequencing performed? (Facility name and contact info)
- When was the sequencing performed? (Year/date)
- Which RNA extraction method was used in the experiment? (Total RNA / poly-A / rRNA depletion; method and kit name and, if possible, a link to the protocol)
- Is the protocol strand specific? (Unstranded / forward / reverse; kit name and, if possible, a link to the protocol)
- Is the data single end or paired end?
- What is the expected read length?
- Do the reads still contain adapters, or have the adapters been removed? If they have not been removed, please provide the adapter sequence, if available, or a link (this information can usually be obtained from the facility)
- What are the experimental conditions for the differential expression analysis?
- Which organism and reference genome are to be used for the analysis?
CLC Genomics Workbench
CLCGx 12 Genomics Workbench BioMedical Workbench
Install Plugins
CLCbio Genomics Workbench System Requirements
- Windows: Windows Vista, Windows 7, Windows 8, Windows 10, Windows Server 2012 or 2016
- Mac: OS X 10.10, 10.11 and macOS 10.12, 10.13, 10.14
- Linux: RHEL 7 and later, SUSE Linux Enterprise Server 11 and later. (The software is expected to run without problem on other recent Linux systems, but we do not guarantee this.)
- 8 GB RAM required; 16 GB RAM recommended
- 1024 x 768 display required; 1600 x 1200 display recommended
- Intel or AMD CPU required
- 500 GB disk space required in the CLC Genomics server
HPC Partnership with CRC to Mitigate Computational Bottleneck NGS Analysis @ Pitt HSLS License Server
CLCBio Genomics Workbench Server
- You can connect your CLC Genomics Workbench software to the 8000-core HTC cluster available to University of Pittsburgh researchers through the Center for Research Computing (CRC). https://crc.pitt.edu/
- This allows you to transparently migrate data from your workstation to the cluster and run analyses on the cluster, which then run independently of your workstation (i.e. you can shut down your machine and your analyses will continue unabated).
Center for Research Computing (CRC) https://crc.pitt.edu/
Request Access to CRC
CLC Genomics Workbench
- Ensure you have the most up-to-date version of the CLCbio Genomics Workbench (the software should tell you if there's a more recent version when you start it, or you can check on the CLCbio website).
- If you have not already done so, request a user account/allocation on the Center for Research Computing (CRC) HTC cluster by filling out the required information at https://crc.pitt.edu/
- If your computer is not connected to the Pitt network (e.g. you are working from home or on a trip), or you are working from a laptop that is connected to the Pitt wireless system, make sure you set up the Pitt VPN so that you can communicate with the CLC Bioserver on the HTC cluster.
- Start the CLC Genomics Workbench.
Connect to CLC Server @ CRC
Access to CRC-HTC Cluster: CLC Server
- If you do NOT have a CRC-HTC account, use the following for limited access during the workshop:
  UserID: hslsmolb
  PW: library1#
  Server name: clcbio.crc.pitt.edu
  Port: 7777
- If you have a CRC-HTC account, use your Pitt user name and Pitt password:
  Server name: clcbio.crc.pitt.edu
  Port: 7777
File Structure at CRC CLC Gx Server: folders are organized by PI’s name
Pre-analyzed Results
RNA-Seq Data
Bulk RNA-seq Study http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099625
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778
NCBI SRA
NCBI SRA Untreated Vs DEX
RNA-Seq Basics
RNA-Seq vs. Microarrays
- RNA-Seq covers a wider dynamic range
- allows discovery of novel transcripts
- can detect SNPs
- costs more ($300-$1000/sample) than a microarray ($100-$200/sample)
- generates a dataset 30-40 times larger than a microarray; uncompressed RNA-Seq raw files: >5GB
Sources: Riki Kawaguchi’s Blog: https://bioinfomagician.wordpress.com/about/ ; Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014 Jan 16;9(1):e78644.
RNA-Seq steps: convert RNA to cDNA fragments; adaptor ligation; short sequence reads; align reads to the reference genome
https://www.nature.com/articles/s41576-019-0150-2
Bulk RNA-Seq: fragmentation of RNA before cDNA synthesis was shown to reduce 3ʹ:5ʹ bias, and strand-specific library preparation methods, which allow sense and antisense transcripts to be differentiated, were shown to provide a more accurate estimate of transcript abundance.
Bulk RNA-Seq Data Analysis Workflow http://education.knoweng.org/sequenceng/
Bulk RNA-Seq Data Analysis Steps: Command Line Tools, Graphical User Interface
In workflow A, aligners such as TopHat, STAR or HISAT2 use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq and featureCounts, assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM)), gene expression is modelled using tools such as edgeR, DESeq2 and limma+voom, and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
In workflow B, newer, alignment-free tools, such as Kallisto and Salmon, assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
Alternatively, workflow C begins by aligning the reads (typically performed with TopHat, although STAR and HISAT can also be used), followed by the use of CuffLinks to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Other tools in common use include StringTie, which assembles a transcriptome model from TopHat (or similar tools) before the results are passed through to RSEM or MMSEQ to estimate transcript abundance, and then to Ballgown to identify differentially expressed genes or transcripts, and SOAPdenovo-trans, which simultaneously aligns and assembles reads for analysis via the path of choice.
Taken from Stark et al., Nat Rev Genet 2019: Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. (2019). doi:10.1038/s41576-019-0150-2
Command Line vs Graphical User Interface CLI GUI
CLC Genomics Software User Interface
Contact CLCBio Support Team CLCGX 12.0 User Manual: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Introduction_CLC_Genomics_Workbench.html
Create a Folder in CRC-HTC Cluster 1 2
Create Workshop Folder@ HTC-CLC Server 1 2 3
CLCGX Tools for RNA-Seq Data Analysis 1 2
Import FASTQ Reads to CLCGx
Import FASTQ Reads to CLCGx: import your saved data from a local computer or from CRC servers, or download from NCBI SRA in CLC
NGS Technologies in the NCBI Sequence Read Archive: Illumina 6,235591; ABI SOLiD 27,315; Ion Torrent 88,946; PacBio 52,538; MinION 7,404
Import Reads Stored in Local Computer Files to CLCGx 1 2
Import Reads to CLC 3 4 5
Import Reads from CRC Server: select the Grid option, HTC Data. CRC can assign each group (faculty) an import/export directory on the server. Members of the group share this import/export directory with read/write permissions. Please open a support ticket on the CRC website if you do not find a folder matching your group. https://crc.pitt.edu/tickets
Download Reads from NCBI SRA database
NCBI SRA download in CLC
Download FASTQ Reads from EBI ENA https://www.ebi.ac.uk/ena
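As a possible command-line complement to downloading through the ENA website, the sketch below only builds a query URL that asks ENA where the FASTQ files of a run are stored. It assumes the ENA Portal API "filereport" endpoint and the field names shown (these are not given in the slides and should be checked against the current ENA documentation); the run accession is the one that appears later in the mapping report slide.

```python
from urllib.parse import urlencode

# Assumed endpoint of the ENA Portal API; verify against the ENA documentation.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def ena_fastq_report_url(run_accession: str) -> str:
    """Build a URL that requests the FASTQ file locations for one run.

    The field names below are assumptions about the ENA read_run result.
    """
    params = {
        "accession": run_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp,fastq_md5",
        "format": "tsv",
    }
    return f"{ENA_FILEREPORT}?{urlencode(params)}"


# Run accession taken from the read mapping report slide.
report_url = ena_fastq_report_url("SRR5861494")
print(report_url)
```

Opening the printed URL in a browser (or fetching it with any HTTP client) should return a small tab-separated table; the FASTQ files listed there can then be imported into CLC like any locally stored reads.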
Help : Import Illumina Reads
FASTQ Format http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
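To make the FASTQ layout concrete, here is a minimal sketch of a reader for the common four-lines-per-read form (header starting with "@", sequence, a "+" separator line, quality string). It assumes reads are not wrapped over several lines; the two reads in the example are made up for illustration and are not workshop data.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream that uses four lines per read."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            return  # end of file (or a blank line) stops the iteration
        sequence = handle.readline().rstrip()
        separator = handle.readline().rstrip()
        quality = handle.readline().rstrip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Hypothetical two-read example, not taken from the workshop data.
example = io.StringIO("@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\nII5#\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

Each quality character encodes the base-call confidence of the base at the same position in the sequence line.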
Results By CLC: Imported Illumina Reads TrainingMaterials Workshops CBF_AMLeukemiaProject RNASeq_GSE101788 RNASeq_DifferentialExpression Reads. Reads are already downloaded. You can find the reads in Server Folder – TrainingMaterials – Pre-analyzed Result_RNA-Seq
Imported Illumina Reads
Preprocessing includes experimental design, sequencing design, and quality control steps. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8
Number of Replicates: Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the multiple-testing correction and may improve the power of detection [20]. Increasing sequencing depth can also improve statistical power for lowly expressed genes.
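One common way to filter lowly expressed genes before differential expression testing is a counts-per-million (CPM) cutoff. The sketch below is only an illustration of that idea: the thresholds (`min_cpm`, `min_samples`), the gene names and the counts are all hypothetical, and this is not presented as the filter CLC Genomics Workbench applies.

```python
from typing import Dict, List


def filter_low_expression(
    counts: Dict[str, List[int]],
    min_cpm: float = 1.0,      # hypothetical threshold
    min_samples: int = 2,      # hypothetical threshold
) -> Dict[str, List[int]]:
    """Keep genes whose CPM reaches min_cpm in at least min_samples samples."""
    n_samples = len(next(iter(counts.values())))
    # Library size of each sample = sum of its counts over all genes.
    library_sizes = [
        sum(row[j] for row in counts.values()) for j in range(n_samples)
    ]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / library_sizes[j] for j, c in enumerate(row)]
        if sum(value >= min_cpm for value in cpm) >= min_samples:
            kept[gene] = row
    return kept


# Made-up counts for three genes in two samples (1,000,000 reads each).
toy_counts = {
    "geneA": [500000, 600000],
    "geneB": [500000, 399999],
    "geneC": [0, 1],
}
kept_genes = filter_low_expression(toy_counts)
print(sorted(kept_genes))
```

Fewer genes tested means fewer hypotheses to correct for, which is why the filter can improve detection power.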
QC for Sequencing Reads
https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#1
FASTQC Project http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Phred Score wikipedia
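The Phred score Q relates to the base-calling error probability P by Q = -10 log10(P), so Q20 means a 1 in 100 chance of error and Q30 means 1 in 1000. The sketch below decodes a FASTQ quality string, assuming the Phred+33 (Sanger / current Illumina) ASCII offset; older files may use a different offset.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(character) - offset for character in quality]


def error_probability(phred: int) -> float:
    """Convert a Phred score into the probability of a wrong base call."""
    return 10 ** (-phred / 10)


scores = phred_scores("I5!")
print(scores)
print([error_probability(q) for q in scores])
```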
Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training
Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30% disagreement be discarded. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability. Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training
Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews) http://bioinfo-core.org/index.php/9th_Discussion-28_October_2010
Create a Seq QC Report 1 2
Results By CLC: Read QC Report
Read Trimming (based on quality of reads or adapters)
Trim Reads
Read Trimming
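As a rough illustration of quality-based trimming, the sketch below removes bases from the 3’ end of a read while their Phred score is below a threshold. It is a deliberately simple trailing-base trimmer under a Phred+33 assumption, with a hypothetical default threshold of 20; it is not the trimming algorithm that CLC Genomics Workbench implements.

```python
from typing import Tuple


def trim_3prime(
    sequence: str,
    quality: str,
    min_phred: int = 20,   # hypothetical threshold
    offset: int = 33,      # Phred+33 encoding assumed
) -> Tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_phred."""
    end = len(sequence)
    while end > 0 and ord(quality[end - 1]) - offset < min_phred:
        end -= 1
    return sequence[:end], quality[:end]


# Made-up read: the last two bases have Phred scores 2 and 0.
trimmed_sequence, trimmed_quality = trim_3prime("GATTACA", "IIII5#!")
print(trimmed_sequence, trimmed_quality)
```

Adapter trimming works on the same principle of cutting the read, but the cut position comes from matching the known adapter sequence rather than from the quality string.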
Annotate Reads: Create a Metadata Table
Create and Import a Metadata Table Spreadsheet
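A metadata spreadsheet can also be generated from a script instead of being typed by hand. The sketch below writes a small comma-separated table for an untreated vs DEX comparison; the column headers and sample names are hypothetical placeholders, and it assumes the first column is what identifies each imported read set.

```python
import csv
import io
from typing import Iterable, Tuple


def build_metadata_csv(samples: Iterable[Tuple[str, str]]) -> str:
    """Return CSV text with one row per sample (name, treatment group)."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Sample", "Treatment"])  # hypothetical column names
    for name, treatment in samples:
        writer.writerow([name, treatment])
    return buffer.getvalue()


# Hypothetical sample names; replace with the names of the imported reads.
metadata_text = build_metadata_csv(
    [
        ("Sample_1", "Untreated"),
        ("Sample_2", "Untreated"),
        ("Sample_3", "DEX"),
        ("Sample_4", "DEX"),
    ]
)
print(metadata_text)
```

Saving the text to a `.csv` file gives a table that can be opened in a spreadsheet program and checked before import.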
Import Metadata
Read Mapping
Read Mapping Wikipedia
Read Mapping Ozsolak et al. Nature Review Genetics
CLC Read Mapper Documentation http://resources.qiagenbioinformatics.com/white-papers/White_paper_on_CLC_read_mapper.pdf
Read Mapping 5
Reads Mapping 7
Reads Mapping 8
Reference Genome
Reference Genomes https://www.ncbi.nlm.nih.gov/grc http://useast.ensembl.org/info/data/ftp/index.html
Reference Genome. Human: GRCh38. Mouse: mm10 (C57BL/6J); 16 other mouse strains are now available. http://useast.ensembl.org/info/data/ftp/index.html?redirect=no
Read Mapping
Read Mapping 9
Reads Mapping 10
Reads Mapping
Mapping Result GE : Gene Expression; TE: Transcript Expression; FG: Fusion Gene
Reads Mapping
Normalization and Expression Values. TMM (trimmed mean of M values): the weighted trimmed mean of the log expression ratios, used by edgeR and CLCGx
Normalization Methods
Reads Mapping GE
Transcript Expression
Read Mapping Report: SRR5861494. An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90% of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping equally well to a limited number of identical regions (‘multi-mapping reads’).
Transcript Level Expression
Other important mapping quality parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases.
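The percentage of mapped reads is easy to recompute from the totals in a mapping report. The sketch below does that and flags samples that fall under the lower end of the 70 to 90% guideline for human RNA-seq; the read numbers are invented, and the flag wording is only a suggestion.

```python
def percent_mapped(mapped_reads: int, total_reads: int) -> float:
    """Percentage of reads that were mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def mapping_rate_flag(percent: float, expected_min: float = 70.0) -> str:
    """Flag a sample whose mapping rate is under the expected minimum."""
    if percent >= expected_min:
        return "within expected range"
    return "below expected range"


# Invented totals for illustration only.
rate = percent_mapped(8_500_000, 10_000_000)
print(rate, mapping_rate_flag(rate))
```

A low value is a prompt to check for contamination, a wrong reference genome, or poor read quality before going on to differential expression.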
Create a Combined RNA-Seq Report
Biotypes are reported "as a percentage of all transcripts" or "as a percentage of all genes". For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein-coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issue in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied.
Source: CLC Gx Manual http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/User_Manual.pdf
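To show why a large rRNA fraction can distort TPM and RPKM, it helps to see how these values are computed: both divide by a total over all counted reads, so reads spent on rRNA shrink every other gene's value. The sketch below uses the standard textbook definitions with made-up counts and gene lengths; it is not taken from the CLC implementation.

```python
from typing import Dict


def rpkm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Reads per kilobase of gene per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }


def tpm(counts: Dict[str, int], lengths_bp: Dict[str, int]) -> Dict[str, float]:
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {gene: count / lengths_bp[gene] for gene, count in counts.items()}
    scale = sum(rates.values())
    return {gene: rate / scale * 1e6 for gene, rate in rates.items()}


def rrna_fraction_ok(rrna_reads: int, mapped_reads: int, max_percent: float = 15.0) -> bool:
    """Check the <15% rRNA guideline quoted in the text."""
    return 100.0 * rrna_reads / mapped_reads < max_percent


# Made-up counts and lengths for two genes.
toy_counts = {"geneA": 100, "geneB": 300}
toy_lengths = {"geneA": 1000, "geneB": 3000}
print(rpkm(toy_counts, toy_lengths))
print(tpm(toy_counts, toy_lengths))
```

In the toy example geneB has three times the reads but is three times longer, so both genes end up with the same RPKM and the same TPM.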
Create a PCA Plot - QC at the sample level
Differential Expression: between two groups (e.g., Treated vs Untreated, KO vs WT); between multiple groups
Differential Expressions Between Two Groups – Treated vs Untreated First, select mapped reads from Test Samples Then, select mapped reads from Control Samples
Commonly Used Tools for Differential Gene Expression Analysis
Differential Gene Expression: Treated vs Untreated. TMM Normalization (Trimmed Mean of M values) calculates effective library sizes, which are then used as part of the per-sample normalization. TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed.
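The idea behind TMM can be sketched in a few lines: compute per-gene log ratios (M values) between a sample and a reference sample, discard the most extreme genes, and use the weighted mean of what remains as a scaling factor for the library size. The code below is a simplified illustration under stated assumptions (trim fractions of 30% on M and 5% on average abundance, inverse-variance weights, genes with zero counts skipped, no rescaling of the factors across samples). It is not guaranteed to reproduce the numbers from edgeR or CLC Genomics Workbench.

```python
import math
from typing import List, Sequence, Set


def _middle_indices(values: List[float], trim: float) -> Set[int]:
    """Indices left after trimming the given fraction from each tail."""
    order = sorted(range(len(values)), key=values.__getitem__)
    cut = int(len(values) * trim)
    return set(order[cut:len(values) - cut])


def tmm_factor(
    sample: Sequence[int],
    reference: Sequence[int],
    m_trim: float = 0.30,   # assumed trim fraction for log ratios
    a_trim: float = 0.05,   # assumed trim fraction for abundances
) -> float:
    """Simplified TMM scaling factor of `sample` relative to `reference`."""
    n_s, n_r = sum(sample), sum(reference)
    m_values, a_values, weights = [], [], []
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue  # log ratio is undefined for zero counts
        p_s, p_r = y_s / n_s, y_r / n_r
        m_values.append(math.log2(p_s / p_r))
        a_values.append(0.5 * math.log2(p_s * p_r))
        variance = (n_s - y_s) / (n_s * y_s) + (n_r - y_r) / (n_r * y_r)
        weights.append(1.0 / variance)
    keep = _middle_indices(m_values, m_trim) & _middle_indices(a_values, a_trim)
    total_weight = sum(weights[i] for i in keep)
    mean_m = sum(weights[i] * m_values[i] for i in keep) / total_weight
    return 2 ** mean_m


# Made-up counts: one gene is strongly induced in the treated sample.
reference_counts = [100] * 10
treated_counts = [100] * 9 + [1000]
factor = tmm_factor(treated_counts, reference_counts)
effective_library_size = sum(treated_counts) * factor
print(factor, effective_library_size)
```

In this toy case the single induced gene inflates the raw library size to 1900, but it is trimmed away, so the effective library size comes back to about 1000 and the nine unchanged genes are no longer reported as down-regulated.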
Differential Expression - Treated vs Untreated Use the metadata table to define groups
Differential Gene Expression
Differential Expression – Gene level
Fold Change in Natural Scale vs Log Scale GraphPad Statistics Guide : https://www.graphpad.com/guides/prism/7/statistics/index.htm
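The two scales are related by a logarithm: a natural-scale ratio of 4 is a log2 fold change of 2, and a ratio of 0.25 is a log2 fold change of -2, which makes up- and down-regulation symmetric around zero. The sketch below converts between them; the "signed" fold change (reporting 0.25 as -4) is included as one convention some tools use, and it is an assumption here rather than something stated in the slides.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """Log2 of the treated/control expression ratio."""
    return math.log2(treated / control)


def ratio_from_log2(log2_fc: float) -> float:
    """Natural-scale ratio corresponding to a log2 fold change."""
    return 2 ** log2_fc


def signed_fold_change(log2_fc: float) -> float:
    """Report decreases as negative fold changes (0.25 becomes -4)."""
    if log2_fc >= 0:
        return 2 ** log2_fc
    return -(2 ** -log2_fc)


print(log2_fold_change(8, 2), log2_fold_change(2, 8))
print(ratio_from_log2(-2.0), signed_fold_change(-2.0))
```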
Data Visualization
Differential Expression - Volcano Plot
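A volcano plot places each gene at x = log2 fold change and y = -log10(p-value), so strongly changed and statistically significant genes end up in the upper left and upper right corners. The sketch below computes those coordinates and labels genes with hypothetical cutoffs (absolute log2 fold change of at least 1 and p-value of at most 0.05); it assumes p-values are greater than zero and does no plotting itself.

```python
import math
from typing import Tuple


def volcano_point(log2_fc: float, p_value: float) -> Tuple[float, float]:
    """Coordinates of one gene on a volcano plot (p_value must be > 0)."""
    return log2_fc, -math.log10(p_value)


def classify_gene(
    log2_fc: float,
    p_value: float,
    min_abs_log2_fc: float = 1.0,  # hypothetical cutoff
    max_p_value: float = 0.05,     # hypothetical cutoff
) -> str:
    """Label a gene as up, down or not significant."""
    if p_value > max_p_value or abs(log2_fc) < min_abs_log2_fc:
        return "not significant"
    return "up" if log2_fc > 0 else "down"


# Made-up results for three genes.
for gene, fc, p in [("geneA", 2.0, 0.01), ("geneB", -1.5, 0.001), ("geneC", 0.2, 0.5)]:
    print(gene, volcano_point(fc, p), classify_gene(fc, p))
```

In practice the adjusted (for example FDR-corrected) p-value is often used on the y-axis instead of the raw p-value.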
Create a HeatMap
Running CLC Genomics software on CRC HTC Cluster
Create a Track
Expression Browser – all in one large spreadsheet
Downstream Analysis
Downstream Analysis DEG Annotates differentially expressed genes from an RNA-seq experiment, using the curated public data from GEO
NextBio Research
Export Data from CLC
Find Correlated Gene Expression Studies from GEO
Ingenuity IPA Analysis
Suggested MBIS Workshops