Bulk RNA-Seq Analysis Using CLC Genomics Workbench

Bulk RNA-Seq Analysis Using CLC Genomics Workbench 2019
Ansuman Chattopadhyay, PhD, Asst Director, Molecular Biology Information Service, Health Sciences Library System, University of Pittsburgh, ansuman@pitt.edu
Sri Chaparala, MS, Bioinformatics Specialist, Health Sciences Library System, University of Pittsburgh, srichaparala@pitt.edu

Topics
- Brief introduction to RNA-Seq experiments
- Analyze RNA-Seq data:
  - Download sequence reads from EBI-ENA/NCBI SRA
  - Import reads to CLC Genomics Workbench
  - Align reads to the reference genome
  - Estimate expression at the gene level
  - Estimate expression at the transcript isoform level
  - Statistical analysis of differentially expressed genes and transcripts
  - Create heat maps, volcano plots, and Venn diagrams

Differential Gene Expressions Raw Reads Venn Diagram Volcano Plot

Fall 2019 HSLS MolBio Workshops
Location: Scaife Hall, Falk Library, Classroom 2
Descriptions & Registration: http://www.hsls.pitt.edu/calendar
SEPTEMBER
- 4th: Single Cell RNA-Seq. 10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On
- 11th: ChIP-Seq & CLC Genomics. 10am-12pm Overview & 1-3pm Hands-On
- 25th: Pathway Analysis (IPA & MetaCore). 10am-12pm Overview & 1-3pm Hands-On
OCTOBER
- 2nd: Bulk RNA-Seq. 10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On
- 9th: Pathway Analysis (Open Access Tools). 10am-12pm Overview & 1-3pm Hands-On
- 23rd: ChIP-Seq & Partek Flow. 1-4pm
- 30th: Gene Regulation. 1-4pm
NOVEMBER
- 6th: Single Cell RNA-Seq. 10am-12pm Overview & 1-3pm Hands-On
- 13th: Gene Expression Visualization. 1-4pm
- 20th: Pathway Analysis (IPA & MetaCore). 10am-12pm Overview & 1-3pm Hands-On
DECEMBER
- 4th: Bulk RNA-Seq. 10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On
- 11th: Genetic Variation. 10am-12pm Overview & 1-3pm Hands-On

CRC Workshops https://crc.pitt.edu/Register/CRC/Workshops/Fall/2019

NGS@Pitt http://www.hscrf.pitt.edu/sites/default/files/NGS/NGSFlow.htm

Software @ HSLS MolBio http://hsls.libguides.com/molbio/licensedtools/resources

Partek Flow : Software for scRNA-Seq Data Analysis http://hsls.libguides.com/molbio/partekflow

NGS Software @ HSLS MolBio NGS Analysis Sanger Seq Analysis

RNA-Seq Software @ HSLS MolBio
- RNA-Seq Analysis (RNA-Seq reads, any organism): CLC Genomics Workbench. Outputs: Differentially Expressed Genes, Volcano Plot, PCA Plot, Venn Diagram, Heat Map
- Downstream (Enrichment) Analysis: Ingenuity Pathway Analysis, Key Pathway Advisor, Illumina BaseSpace Correlation Engine. Outputs: Functions, Diseases, Pathways, Upstream Regulators

RNA-Seq Data Analysis Support through HSLS MBIS http://info.hsls.pitt.edu/updatereport/?p=9974

RNA Seq Questionnaire
- What is the scientific objective of the RNA-Seq experiment?
- How many classes will be compared?
- Is only coding RNA (mRNA) expected to be detected, or also long non-coding RNA and miRNA?
- Did all the samples pass RNA quality checks before sequencing?
- Are there biological replicates? If so, how many?
- Which sequencing platform was used to sequence the reads (Illumina, Ion Torrent, SOLiD)?
- Where was the sequencing performed? (Facility name and contact info)
- When was the sequencing performed? (Year/date)
- Which RNA extraction method was used in the experiment? (Total RNA / poly-A / rRNA depletion; method and kit name and, if possible, a link to the protocol)
- Is the protocol strand-specific? (Unstranded / forward / reverse; kit name and, if possible, a link to the protocol)
- Is the data single-end or paired-end?
- What is the expected read length?
- Do the reads still contain adapters, or have they been removed? If they have not been removed, please provide the adapter sequence, if available, or a link (this information is usually available from the facility).
- What are the experimental conditions for the differential expression analysis?
- Which organism and reference genome should be used for the analysis?

CLC Genomics Workbench

CLCGx 12 Genomics Workbench BioMedical Workbench

Install Plugins

CLCbio Genomics Workbench System Requirements
- Windows: Windows Vista, Windows 7, Windows 8, Windows 10, Windows Server 2012 or 2016
- Mac: OS X 10.10, 10.11 and macOS 10.12, 10.13, 10.14
- Linux: RHEL 7 and later, SUSE Linux Enterprise Server 11 and later. (The software is expected to run without problem on other recent Linux systems, but we do not guarantee this.)
- 8 GB RAM required; 16 GB RAM recommended
- 1024 x 768 display required; 1600 x 1200 display recommended
- Intel or AMD CPU required
- 500 GB disk space required on the CLC Genomics server

HPC Partnership with CRC to Mitigate Computational Bottleneck NGS Analysis @ Pitt HSLS License Server

CLCBio Genomics Workbench Server
- You can connect your CLC Genomics Workbench software to the 8000-core HTC cluster available to University of Pittsburgh researchers through the Center for Research Computing (CRC). https://crc.pitt.edu/
- This allows you to transparently migrate data from your workstation to the cluster and run analyses on the cluster, which then run independently of your workstation (i.e. you can shut down your machine and your analyses will continue unabated).

Center for Research Computing (CRC) https://crc.pitt.edu/

Request Access to CRC

CLC Genomics Workbench
- Ensure you have the most up-to-date version of the CLCbio Genomics Workbench (the software should tell you if there's a more recent version when you start it, or you can check on the CLCbio website).
- If you have not already done so, request a user account/allocation on the Center for Research Computing (CRC) HTC cluster by filling out the required information at https://crc.pitt.edu/
- If your computer is not connected to the Pitt network (e.g. you are working from home or on a trip), or you are working from a laptop that is connected to the Pitt wireless system, make sure you set up Pitt VPN so that you can communicate with the CLC Bioserver on the HTC cluster.
- Start the CLC Genomics Workbench.

Connect to CLC Server @ CRC

Access to CRC-HTC Cluster – CLC Server
If you DO NOT HAVE a CRC-HTC account, use the following for limited access during the workshop:
- UserID: hslsmolb
- PW: library1#
- Server name: clcbio.crc.pitt.edu
- Port: 7777
If you have a CRC-HTC account:
- Use your Pitt user name and Pitt password
- Server name: clcbio.crc.pitt.edu
- Port: 7777

File Structure at CRC CLC Gx Server folders organized by PI’s name

Pre-analyzed Results

RNA-Seq Data

Bulk RNA-seq Study http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0099625

http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE52778

NCBI SRA

NCBI SRA Untreated Vs DEX

RNA-Seq Basics

RNA-Seq vs. Microarrays
RNA-Seq:
- covers a wider dynamic range
- allows discovery of novel transcripts
- can detect SNPs
- costs more ($300-$1000/sample) than microarray ($100-$200/sample)
- generates a 30-40 times larger dataset than microarray; uncompressed RNA-Seq raw files are >5GB
Sources: Riki Kawaguchi's Blog: https://bioinfomagician.wordpress.com/about/
Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE. 2014 Jan 16;9(1):e78644.

RNA-Seq steps: convert RNA to cDNA fragments; ligate adaptors; sequence short reads; align reads to the reference genome

https://www.nature.com/articles/s41576-019-0150-2

Bulk RNA-Seq
Fragmentation of RNA before cDNA synthesis was shown to reduce 3ʹ:5ʹ bias, and strand-specific library preparation methods, which allow sense and antisense transcripts to be differentiated, were shown to provide a more accurate estimate of transcript abundance.

Bulk RNA-Seq Data Analysis Workflow http://education.knoweng.org/sequenceng/

Bulk RNA-Seq Data Analysis Steps: Command Line Tools, Graphical User Interface
In workflow A, aligners such as TopHat, STAR or HISAT2 use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq and featureCounts, assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM)), gene expression is modelled using tools such as edgeR, DESeq2 and limma+voom, and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
In workflow B, newer, alignment-free tools, such as Kallisto and Salmon, assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
Alternatively, workflow C begins by aligning the reads (typically performed with TopHat, although STAR and HISAT can also be used), followed by the use of CuffLinks to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Other tools in common use include StringTie, which assembles a transcriptome model from TopHat (or similar tools) before the results are passed through to RSEM or MMSEQ to estimate transcript abundance, and then to Ballgown to identify differentially expressed genes or transcripts, and SOAPdenovo-trans, which simultaneously aligns and assembles reads for analysis via the path of choice.
Taken from Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. (2019). doi:10.1038/s41576-019-0150-2

CommandLine vs Graphical User Interface CLI GUI

CLC Genomics Software User Interface

Contact CLCBio Support Team CLCGX 12.0 User Manual: http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/index.php?manual=Introduction_CLC_Genomics_Workbench.html

Create a Folder in CRC-HTC Cluster 1 2

Create Workshop Folder@ HTC-CLC Server 1 2 3

CLCGX Tools for RNA-Seq Data Analysis 1 2

Import FASTQ Reads to CLCGx

Import FASTQ Reads to CLCGx Import your saved data from local computer or from CRC servers NCBI SRA download in CLC

NGS Technologies in the NCBI Seq Read Archive
Illumina 6,235,591
ABI SOLiD 27,315
Ion Torrent 88,946
PacBio 52,538
MinION 7,404

Import Reads Stored in Local Computer Files to CLCGx 1 2

Import Reads to CLC 3 4 5

Import Reads from CRC Server
Select Grid option – HTC Data
CRC can assign each group (faculty) an import/export directory on the server. Members of the group share this import/export directory with read/write permissions. Please open a support ticket on the CRC website if you do not find a folder matching your group. https://crc.pitt.edu/tickets

Download Reads from NCBI SRA database

NCBI SRA download in CLC

Download FASTQ Reads from EBI ENA https://www.ebi.ac.uk/ena

Help : Import Illumina Reads

FASTQ Format http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/
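
The four-line FASTQ record layout described in the linked paper can be illustrated with a minimal Python sketch. It assumes the common unwrapped layout (one line each for header, sequence, separator and quality); the function and class names are hypothetical and this is not how CLC imports reads.

```python
import io
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    name: str       # text after the leading '@'
    sequence: str   # base calls
    quality: str    # one ASCII-encoded quality character per base


def read_fastq(handle) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:          # end of file
            return
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:].strip(), sequence, quality)


# Hypothetical single-read example
example = "@read1\nGATTACA\n+\nIIIIHG#\n"
records = list(read_fastq(io.StringIO(example)))
```

Each record keeps the quality string the same length as the sequence, which is the property the later QC and trimming steps rely on.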

Results By CLC: Imported Illumina Reads
Folder path: TrainingMaterials > Workshops > CBF_AMLeukemiaProject > RNASeq _GSE101788 > RNASeq_DifferentialExpression > Reads
Reads are already downloaded. You can find the reads in Server Folder – TrainingMaterials – Pre-analyzed Result_RNA-Seq

Imported Illumina Reads

Preprocessing includes experimental design, sequencing design, and quality control steps. https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8

Number of Replicates Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth also can improve statistical power for lowly expressed genes.
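
As a rough illustration of filtering out lowly expressed genes before differential expression analysis, the sketch below keeps a gene only if it reaches a counts-per-million (CPM) threshold in a minimum number of samples. The thresholds (1 CPM in at least 2 samples), the gene names and the counts are all assumptions made for this example, not values recommended by the workshop or used by CLC.

```python
def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples.

    counts maps gene name -> list of raw read counts, one per sample.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counted reads in each sample
    library_sizes = [sum(row[i] for row in counts.values())
                     for i in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / size for c, size in zip(row, library_sizes)]
        if sum(value >= min_cpm for value in cpm) >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical counts for three genes across four samples
counts = {
    "geneA": [500_000, 450_000, 520_000, 480_000],
    "geneB": [0, 1, 0, 0],
    "geneC": [500_000, 549_999, 480_000, 520_000],
}
kept = filter_low_expression(counts)
```

Fewer genes tested means a milder multiple-testing correction, which is the gain in detection power the slide refers to.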

QC for Sequencing Reads

https://galaxyproject.github.io/training-material/topics/sequence-analysis/tutorials/quality-control/slides.html#1

FASTQC Project http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Phred Score wikipedia
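
A Phred score Q relates to the base-calling error probability P by Q = -10 log10(P). The Python sketch below decodes a FASTQ quality string and converts scores to error probabilities; it assumes the Phred+33 ASCII offset (Sanger / Illumina 1.8 and later), and the function names are illustrative only.

```python
from typing import List


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q: int) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("II5#")
probabilities = [error_probability(q) for q in scores]
```

With this encoding, "I" decodes to Q40 (1 error in 10,000), "5" to Q20 (1 in 100) and "#" to Q2.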

Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training

Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30% disagreement be discarded. As a general rule, read quality decreases towards the 3' end of reads, and if it becomes too low, bases should be removed to improve mappability.
Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training

Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews) http://bioinfo-core.org/index.php/9th_Discussion-28_October_2010

Create a Seq QC Report 1 2

Results By CLC: Read QC Report

Read Trimming (based on quality of reads or adapters)

Trim Reads

Read Trimming
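
To make quality-based trimming concrete, here is a deliberately simple Python sketch that removes bases from the 3' end while their Phred score is below a cutoff. It assumes Phred+33 encoding and a hypothetical cutoff of Q20; it is not the trimming algorithm that CLC Genomics Workbench applies, and it does no adapter trimming.

```python
def trim_3prime(sequence, quality, min_q=20, offset=33):
    """Drop trailing bases whose Phred score is below min_q."""
    keep = len(sequence)
    # Walk back from the 3' end until a base passes the cutoff
    while keep > 0 and ord(quality[keep - 1]) - offset < min_q:
        keep -= 1
    return sequence[:keep], quality[:keep]


# Hypothetical read: the last two bases have quality '#' (Q2)
trimmed_seq, trimmed_qual = trim_3prime("GATTACA", "IIIII##")
```

A read made up entirely of low-quality bases comes back empty, so a real pipeline would also discard reads that fall below a minimum length after trimming.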

Annotate Reads: Create a Metadata Table

Create and Import a Metadata Table Spread Sheet
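
A metadata spreadsheet can also be generated programmatically. The Python sketch below writes a small comma-separated table with one row per sample; the column headers, sample names and group labels are hypothetical placeholders, so check the CLC manual for the layout its metadata importer expects.

```python
import csv
import io


def metadata_csv(rows, header=("Sample", "Treatment")):
    """Return a CSV table (as text) with one row per sample."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample names and groups for a two-group comparison
samples = [
    ("untreated_1", "Untreated"),
    ("untreated_2", "Untreated"),
    ("dex_1", "DEX"),
    ("dex_2", "DEX"),
]
table = metadata_csv(samples)
```

Writing `table` to a `.csv` file gives a spreadsheet that can be opened in Excel and edited before import.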

Import Metadata

Import Metadata

Read Mapping

Read Mapping Wikipedia

Read Mapping Ozsolak et al. Nature Review Genetics

CLC Read Mapper Documentation http://resources.qiagenbioinformatics.com/white-papers/White_paper_on_CLC_read_mapper.pdf

Read Mapping 5

Reads Mapping 7

Reads Mapping 8

Reference Genome

Reference Genomes https://www.ncbi.nlm.nih.gov/grc http://useast.ensembl.org/info/data/ftp/index.html

Reference Genome
Human: GRCh38
Mouse: mm10 (C57BL/6J); 16 other mouse strains are now available
http://useast.ensembl.org/info/data/ftp/index.html?redirect=no

Read Mapping

Read Mapping 9

Reads Mapping 10

Reads Mapping

Mapping Result GE : Gene Expression; TE: Transcript Expression; FG: Fusion Gene

Reads Mapping

Normalization and Expression Values
TMM: trimmed mean of M values, a weighted trimmed mean of the log expression ratios; used by edgeR and CLCGx.

Normalization Methods

Reads Mapping GE

Transcript Expression

Read Mapping Report – SRR5861494 An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’).

Transcript Level Expression

Other important mapping parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3' end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases.

Create a Combined RNA-Seq Report

The biotypes are "as a percentage of all transcripts" or "as a percentage of all genes".

For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein-coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issues in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied.
Source: CLC Gx Manual http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/current/User_Manual.pdf
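
The note on TPM and RPKM comparability is easier to see from the standard definitions. The Python sketch below computes both from raw counts and gene lengths; the counts and lengths are made-up values, and CLC's own implementation may differ in detail (for example in which reads enter the total).

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts)
    return [c * 1e9 / (length * total)
            for c, length in zip(counts, lengths_bp)]


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = [c / (length / 1000) for c, length in zip(counts, lengths_bp)]
    scale = sum(rates)
    return [1e6 * r / scale for r in rates]


# Hypothetical counts and gene lengths (bp) for three genes
counts = [100, 200, 700]
lengths = [1000, 4000, 7000]
rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
```

Both measures divide by a per-sample total, so when rRNA reads take up a large share of one sample, the values of all other genes in that sample shift relative to a sample without that problem.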

Create a PCA Plot - QC at the sample level

Differential Expression
- Differential expression between two groups, e.g. Treated vs Untreated, KO vs WT
- Differential expression between multiple groups

Differential Expressions Between Two Groups – Treated vs Untreated First, select mapped reads from Test Samples Then, select mapped reads from Control Samples

Commonly Used Tools for Differential Gene Expression Analysis

Differential Gene Expression – Treated vs Untreated
TMM normalization (Trimmed Mean of M values) calculates effective library sizes, which are then used as part of the per-sample normalization. TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed.
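
The idea behind TMM can be sketched in a few lines of Python. This is a simplified, unweighted version written for illustration: it trims only on the M values (log2 expression ratios) and skips the precision weights and the additional trimming on average abundance that edgeR uses, so it will not reproduce edgeR or CLC output exactly. The counts and the 30% trim are assumptions for the example.

```python
import math


def tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM scaling factor of `sample` relative to `reference`."""
    n_s, n_r = sum(sample), sum(reference)
    # M value: log2 ratio of library-size-scaled counts, per gene
    m_values = sorted(
        math.log2((y_s / n_s) / (y_r / n_r))
        for y_s, y_r in zip(sample, reference)
        if y_s > 0 and y_r > 0
    )
    if not m_values:
        raise ValueError("no genes with counts in both samples")
    cut = int(len(m_values) * trim)
    kept = m_values[cut:len(m_values) - cut] or m_values
    return 2 ** (sum(kept) / len(kept))


# Hypothetical counts: four genes doubled by sequencing depth,
# one gene strongly up-regulated in the sample
reference = [100, 200, 300, 400, 1000]
sample = [200, 400, 600, 800, 10000]
factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
```

The raw library size of the sample (12,000) is inflated by the one up-regulated gene; after trimming, the effective library size comes out near 4,000, twice the reference, which matches the twofold depth difference seen in the unchanged genes.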

Differential Expression - Treated vs Untreated Use the metadata table to define groups

Differential Gene Expression

Differential Expression – Gene level

Fold Change in Natural Scale vs Log Scale GraphPad Statistics Guide : https://www.graphpad.com/guides/prism/7/statistics/index.htm
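
The difference between the two scales can be shown with a short Python sketch. On the natural scale, up-regulation ranges from 1 to infinity while down-regulation is squeezed between 0 and 1; on the log2 scale the two are symmetric around 0. The signed fold change shown here is one common reporting convention, and the expression values are hypothetical.

```python
import math


def log2_fold_change(treated_mean, control_mean):
    """Log2 ratio of treated to control expression."""
    return math.log2(treated_mean / control_mean)


def signed_fold_change(log2_fc):
    """Convert log2 fold change to a signed natural-scale fold change.

    A fourfold decrease is reported as -4 rather than 0.25.
    """
    ratio = 2 ** abs(log2_fc)
    return ratio if log2_fc >= 0 else -ratio


up = log2_fold_change(8.0, 2.0)      # fourfold increase
down = log2_fold_change(2.0, 8.0)    # fourfold decrease
```

A fourfold increase and a fourfold decrease give log2 values of equal magnitude and opposite sign, which is why volcano plots and heat maps normally use the log scale.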

Data Visualization

Differential Expression - Volcano Plot
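
A volcano plot places each gene at x = log2 fold change and y = -log10(p-value). The Python sketch below computes those coordinates and labels each gene; the cutoffs (absolute log2 fold change of at least 1, p-value below 0.05), the gene names and the values are assumptions for illustration and are not CLC defaults.

```python
import math


def volcano_points(results, fc_cutoff=1.0, p_cutoff=0.05):
    """Return (gene, x, y, label) tuples for a volcano plot.

    results holds (gene, log2_fold_change, p_value) tuples;
    p-values must be greater than 0.
    """
    points = []
    for gene, log2_fc, p_value in results:
        y = -math.log10(p_value)
        if p_value < p_cutoff and log2_fc >= fc_cutoff:
            label = "up"
        elif p_value < p_cutoff and log2_fc <= -fc_cutoff:
            label = "down"
        else:
            label = "not significant"
        points.append((gene, log2_fc, y, label))
    return points


# Hypothetical differential expression results
results = [
    ("gene1", 2.5, 0.001),
    ("gene2", -1.8, 0.01),
    ("gene3", 0.2, 0.5),
    ("gene4", 3.0, 0.2),
]
points = volcano_points(results)
```

Genes in the upper left and upper right corners of the plot are the ones that pass both cutoffs; gene4 has a large fold change but stays unlabeled because its p-value is too high.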

Create a HeatMap

Create a HeatMap

Create a HeatMap

Running CLC Genomics software on CRC HTC Cluster

Create a Track

Expression Browser – all in one large spread sheet

Downstream Analysis

Downstream Analysis DEG Annotates differentially expressed genes from an RNA-seq experiment, using the curated public data from GEO

NextBio Research

Export Data from CLC

Find Correlated Gene Expression Studies from GEO

Find Correlated Gene Expression Studies from GEO

Ingenuity IPA Analysis

Suggested MBIS Workshops