Published by Anna Procházková. Modified over 5 years ago.
1
Bulk RNA-Seq Analysis Using CLC Genomics Workbench
2019
Ansuman Chattopadhyay, PhD, Assistant Director, Molecular Biology Information Service, Health Sciences Library System, University of Pittsburgh
Sri Chaparala, MS, Bioinformatics Specialist, Health Sciences Library System, University of Pittsburgh
2
Topics
- Brief introduction to RNA-Seq experiments
- Analyze RNA-Seq data
  - Download sequence reads from EBI ENA/NCBI SRA
  - Import reads to CLC Genomics Workbench
  - Align reads to the reference genome
  - Estimate expression at the gene level
  - Estimate expression at the transcript isoform level
  - Statistical analysis of the differentially expressed genes and transcripts
  - Create Heat Map, Volcano Plots, and Venn Diagram
3
Differential Gene Expressions
Raw Reads Venn Diagram Volcano Plot
4
Fall 2019 HSLS MolBio Workshops
Scaife Hall, Falk Library, Classroom 2
Descriptions & Registration:
SEPTEMBER
- 4th: Single Cell RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: ChIP-Seq & CLC Genomics (10am-12pm Overview & 1-3pm Hands-On)
- 25th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
OCTOBER
- 2nd: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 9th: Pathway Analysis: Open Access Tools (10am-12pm Overview & 1-3pm Hands-On)
- 23rd: ChIP-Seq & Partek Flow (1-4pm)
- 30th: Gene Regulation (1-4pm)
NOVEMBER
- 6th: Single Cell RNA-Seq (10am-12pm Overview & 1-3pm Hands-On)
- 13th: Gene Expression Visualization (1-4pm)
- 20th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
DECEMBER
- 4th: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: Genetic Variation (10am-12pm Overview & 1-3pm Hands-On)
5
CRC Workshops
6
Pitt
8
HSLS MolBio
9
Partek Flow: Software for scRNA-Seq Data Analysis
10
NGS Software @ HSLS MolBio
NGS Analysis Sanger Seq Analysis
11
RNA-Seq Software @ HSLS MolBio
Diagram labels: RNA-Seq Reads; RNA-Seq Analysis (CLC Genomics Workbench, any organism); Differentially Expressed Genes; Volcano Plot, PCA Plot, Venn Diagram, Heat Map; Downstream Analysis; Enrichment Analysis (Functions, Diseases, Pathways, Upstream Regulators); Ingenuity Pathway Analysis; Key Pathway Advisor; Illumina BaseSpace Correlation Engine
12
RNA-Seq Data Analysis Support through HSLS MBIS
13
RNA-Seq Questionnaire
- What is the scientific objective of the RNA-Seq experiment?
- How many classes will be compared?
- Are only coding RNAs (mRNA) expected to be detected, or also long non-coding RNA and miRNA?
- Did all the samples pass RNA quality checks before sequencing?
- Are there biological replicates? If so, how many?
- What type of sequencing platform was used to sequence the reads? (Illumina, Ion Torrent, SOLiD)
- Where was the sequencing performed? (Facility name and contact info)
- When was the sequencing performed? (Year/date)
- Which RNA extraction method was used in the experiment? (Total RNA/poly-A/rRNA depletion; method and kit name and, if possible, a link to the protocol)
- Is the protocol strand-specific? (Unstranded/forward/reverse; kit name and, if possible, a link to the protocol)
- Is the data single-end or paired-end?
- What is the expected read length?
- Do the reads still contain adapters, or have they been removed? If they are still present, please provide the adapter sequence, if available, or a link (this information can usually be obtained from the facility).
- What are the experimental conditions for the differential expression analysis?
- Which organism and reference genome should be used for the analysis?
14
CLC Genomics Workbench
15
CLCGx 12 Genomics Workbench BioMedical Workbench
16
Install Plugins
17
CLCbio Genomics Workbench
System Requirements
- Windows: Windows Vista, Windows 7, Windows 8, Windows 10, Windows Server 2012 or 2016
- Mac: OS X 10.10, and macOS 10.12, 10.13, 10.14
- Linux: RHEL 7 and later, SUSE Linux Enterprise Server 11 and later. (The software is expected to run without problem on other recent Linux systems, but we do not guarantee this.)
- 8 GB RAM required; 16 GB RAM recommended
- 1024 x 768 display required; 1600 x 1200 display recommended
- Intel or AMD CPU required
- 500 GB disk space required in the CLC Genomics server
18
HPC Partnership with CRC to Mitigate Computational Bottleneck
NGS Pitt HSLS License Server
19
CLCBio Genomics Workbench Server
- You can connect your CLC Genomics Workbench software to the core HTC cluster available to University of Pittsburgh researchers through the Center for Research Computing (CRC).
- This allows you to transparently migrate data from your workstation to the cluster and run analyses on the cluster, which then run independently of your workstation (i.e. you can shut down your machine and your analyses will continue unabated).
20
Center for Research Computing (CRC)
21
Request Access to CRC
22
CLC Genomics Workbench
- Ensure you have the most up-to-date version of the CLCbio Genomics Workbench (the software should tell you if there's a more recent version when you start it, or you can check on the CLCbio website).
- If you have not already done so, request a user account/allocation on the Center for Research Computing (CRC) HTC cluster by filling out the required information.
- If your computer is not connected to the Pitt network (e.g. you are working from home or on a trip), or you are working from a laptop that is connected to the Pitt wireless system, make sure you set up Pitt VPN so that you can communicate with the CLC Bioserver on the HTC cluster.
- Start the CLC Genomics Workbench.
23
Connect to CLC Server @ CRC
24
Access to CRC-HTC Cluster – CLC Server
If you DO NOT HAVE a CRC-HTC account, use the following for limited access during the workshop:
- UserID: hslsmolb
- PW: library1#
- Server name: clcbio.crc.pitt.edu
- Port: 7777
If you have a CRC-HTC account:
- Use your Pitt user name and Pitt password
- Server name: clcbio.crc.pitt.edu
- Port: 7777
25
File Structure at CRC CLC Gx Server
folders organized by PI’s name
26
Pre-analyzed Results
27
RNA-Seq Data
28
Bulk RNA-seq Study
30
NCBI SRA
31
NCBI SRA
32
NCBI SRA Untreated Vs DEX
33
RNA-Seq Basics
34
RNA-Seq vs. Microarrays
RNA-Seq compared with microarrays:
- covers a wider dynamic range
- allows discovery of novel transcripts
- able to detect SNPs
- more costly ($300-$1000/sample) than microarray ($100-$200/sample)
- generates a larger dataset than microarray (uncompressed RNA-Seq raw files: >5GB)
Figure panels: Microarray, RNA-Seq
Riki Kawaguchi’s Blog:
Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE Jan 16;9(1):e78644.
35
- convert to cDNA fragments
- adaptor ligation
- short sequence reads
- align reads to reference genome
37
Bulk RNA-Seq
Fragmentation of RNA before cDNA synthesis was shown to reduce 3ʹ:5ʹ bias [4], and strand-specific library preparation methods, which allow sense and antisense transcripts to be differentiated, were shown to provide a more accurate estimate of transcript abundance.
38
Bulk RNA-Seq Data Analysis Workflow
39
Bulk RNA-Seq Data Analysis Steps
Command Line Tools / Graphical User Interface
In workflow A, aligners such as TopHat [112], STAR [113] or HISAT2 [114] use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq [133] and featureCounts [134], assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM) [142]), gene expression is modelled using tools such as edgeR [143], DESeq2 [155] and limma+voom [156], and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
In workflow B, newer, alignment-free tools, such as Kallisto [119] and Salmon [120], assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport [130] (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
Alternatively, workflow C begins by aligning the reads (typically performed with TopHat [112], although STAR [113] and HISAT [114] can also be used), followed by the use of CuffLinks [131] to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Other tools in common use include StringTie [116], which assembles a transcriptome model from TopHat [112] (or similar tools) before the results are passed through to RSEM [105] or MMSEQ [132] to estimate transcript abundance, and then to Ballgown [157] to identify differentially expressed genes or transcripts, and SOAPdenovo-trans [117], which simultaneously aligns and assembles reads for analysis via the path of choice.
Taken from Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. (2019). doi: /s
40
CommandLine vs Graphical User Interface
CLI GUI
41
CLC Genomics Software User Interface
42
Contact CLCBio Support Team
CLCGX 12.0 User Manual:
43
Create a Folder in CRC-HTC Cluster
44
Create Workshop Folder @ HTC-CLC Server
45
CLCGX Tools for RNA-Seq Data Analysis
46
Import FASTQ Reads to CLCGx
47
Import FASTQ Reads to CLCGx
- Import your saved data from local computer or from CRC servers
- NCBI SRA download in CLC
48
NGS Technologies: NCBI Sequence Read Archive
- Illumina: 6,235591
- ABI SOLiD: 27,315
- Ion Torrent: 88,946
- PacBio: ,538
- MinION: ,404
49
Import Reads Stored in Local Computer Files to CLCGx
50
Import Reads to CLC (steps 3-5)
51
Import Reads from CRC Server
Select Grid option – HTC Data
CRC can assign each group (faculty) an import/export directory on the server. Members of the group share this import/export directory with read/write permissions. Please open a support ticket on the CRC website if you do not find a folder matching your group.
52
Download Reads from NCBI SRA database
53
NCBI SRA download in CLC
54
Download FASTQ Reads from EBI ENA
55
Help : Import Illumina Reads
56
FASTQ Format
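A FASTQ file stores each read as four lines: a header starting with `@`, the base sequence, a separator line starting with `+`, and a quality string with one character per base. The sketch below reads records in that layout. It assumes unwrapped, uncompressed records, and the names `FastqRecord` and `parse_fastq` are illustrative only (they are not part of CLC Genomics Workbench).

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the parser
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to exercise the parser.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!II\n")
records = list(parse_fastq(example))
```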
57
Results By CLC : Imported Illumina Reads
Folders shown: TrainingMaterials, Workshops, CBF_AMLeukemiaProject, RNASeq _GSE101788, RNASeq_DifferentialExpression, Reads
Reads are already downloaded. You can find the reads in Server Folder – TrainingMaterials – Pre-analyzed Result_RNA-Seq
58
Imported Illumina Reads
59
Preprocessing includes experimental design, sequencing design, and quality control steps.
60
Number of Replicates
Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the multiple-testing correction and may improve the power of detection [20]. Increasing sequencing depth can also improve statistical power for lowly expressed genes.
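One common way to filter low-expression genes is to keep only genes that reach a minimum counts-per-million (CPM) value in a minimum number of samples. The sketch below illustrates that idea. The thresholds (`min_cpm=1.0`, `min_samples=2`), the gene names, and the counts are assumptions for illustration, not settings taken from CLC Genomics Workbench.

```python
def filter_low_expression(counts_by_gene, min_cpm=1.0, min_samples=2):
    """Keep genes with CPM >= min_cpm in at least min_samples samples.

    counts_by_gene maps gene name -> list of raw read counts, one per sample.
    """
    n_samples = len(next(iter(counts_by_gene.values())))
    # Library size = total reads counted in each sample.
    library_sizes = [
        sum(row[i] for row in counts_by_gene.values()) for i in range(n_samples)
    ]
    kept = {}
    for gene, row in counts_by_gene.items():
        cpm = [count * 1e6 / size for count, size in zip(row, library_sizes)]
        if sum(value >= min_cpm for value in cpm) >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical counts for four samples; "OTHER" stands in for all remaining
# genes so that each library totals 2,000,000 reads.
counts = {
    "GENE_A": [500, 450, 600, 550],
    "GENE_B": [0, 1, 0, 2],
    "GENE_C": [30, 0, 25, 0],
    "OTHER": [1_999_470, 1_999_549, 1_999_375, 1_999_448],
}
kept = filter_low_expression(counts)
```

Here GENE_B reaches 1 CPM in only one sample and is dropped, so fewer tests enter the multiple-testing correction.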
61
QC for Sequencing Reads
62
https://galaxyproject. github
63
FASTQC Project
64
Phred Score (Wikipedia)
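A Phred quality score Q relates to the base-calling error probability p by Q = -10 log10(p), so Q20 corresponds to a 1 in 100 error rate and Q30 to 1 in 1000. The sketch below converts between the two and decodes a FASTQ quality string, assuming the Phred+33 ASCII encoding used by current Illumina output. The function names are illustrative.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Error probability for Phred score q: p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Phred score for error probability p: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality_string(quality: str, offset: int = 33) -> list[int]:
    """Convert FASTQ quality characters to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


scores = decode_quality_string("!I")  # '!' is the lowest score, 'I' is Q40
```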
65
Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training
66
Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training
Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30% disagreement be discarded. As a general rule, read quality decreases towards the 3’ end of reads, and if it becomes too low, bases should be removed to improve mappability.
67
Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)
68
Create a Seq QC Report (steps 1-2)
69
Results By CLC: Read QC Report
70
71
Read Trimming (based on quality of reads or adapters)
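Quality-based trimming can be illustrated with a very simple rule: remove bases from the 3’ end of a read until a base meets a minimum Phred score. This is a minimal sketch of the general idea, not the trimming algorithm implemented in CLC Genomics Workbench. The threshold of Q20 and the Phred+33 encoding are assumptions.

```python
def trim_3prime(sequence: str, quality: str, min_quality: int = 20,
                offset: int = 33) -> tuple[str, str]:
    """Drop trailing bases whose Phred score is below min_quality."""
    end = len(sequence)
    # Walk backwards from the 3' end while the base quality is too low.
    while end > 0 and ord(quality[end - 1]) - offset < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


# 'I' encodes Q40, '#' encodes Q2 and '!' encodes Q0 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII#!!")
```

Production trimmers usually use a running-sum or sliding-window criterion so that a single good base near the 3’ end does not stop trimming early.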
72
Trim Reads
73
Read Trimming
74
Annotate Reads: Create a Metadata Table
75
Create and Import a Metadata Table
Spreadsheet
76
Import Metadata
77
Import Metadata
78
Read Mapping
79
Read Mapping (Wikipedia)
80
Read Mapping (Ozsolak et al., Nature Reviews Genetics)
81
CLC Read Mapper Documentation
82
Read Mapping 5
83
Reads Mapping 7
84
Reads Mapping 8
85
Reference Genome
86
Reference Genomes https://www.ncbi.nlm.nih.gov/grc
87
Reference Genome
- Human: GRCh38
- Mouse: mm10 (C57BL/6J); 16 other mouse strains are now available
88
Read Mapping
89
Read Mapping 9
90
Reads Mapping 10
91
Reads Mapping
92
Mapping Result
GE: Gene Expression; TE: Transcript Expression; FG: Fusion Gene
93
Reads Mapping
94
Normalization and Expression Values
TMM: weighted trimmed mean of the log expression ratios (trimmed mean of M-values), used by edgeR and CLCGx
95
Normalization Methods
96
Reads Mapping GE
97
Transcript Expression
98
Read Mapping Report – SRR5861494
An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90% of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping equally well to a limited number of identical regions (‘multi-mapping reads’).
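The mapped-read percentage can be checked per sample once the mapped and total read counts are taken from the mapping report. The sketch below flags samples that fall under the lower end of the 70 to 90% range quoted above. The sample names and read counts are invented for illustration.

```python
def mapped_fraction(mapped_reads: int, total_reads: int) -> float:
    """Fraction of reads that mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return mapped_reads / total_reads


def flag_low_mapping(samples: dict[str, tuple[int, int]],
                     min_fraction: float = 0.70) -> list[str]:
    """Return names of samples whose mapped fraction is below min_fraction."""
    return [
        name
        for name, (mapped, total) in samples.items()
        if mapped_fraction(mapped, total) < min_fraction
    ]


# Hypothetical (mapped, total) read counts per sample.
samples = {
    "sample_1": (18_500_000, 20_000_000),
    "sample_2": (9_000_000, 20_000_000),
}
flagged = flag_low_mapping(samples)
```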
99
Transcript Level Expression
100
The percentage of mapped reads is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. We expect between 70 and 90% of regular RNA-seq reads to map onto the human genome (depending on the read mapper used), with a significant fraction of reads mapping equally well to a limited number of identical regions (‘multi-mapping reads’). Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3’ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases.
101
Create a Combined RNA-Seq Report
103
Biotypes are reported "as a percentage of all transcripts" or "as a percentage of all genes". For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein-coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issues in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied.
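The reason heavy rRNA carry-over makes TPM and RPKM hard to compare across samples is that both values are scaled by a sample-wide total. The sketch below uses the standard definitions, with the sum of the supplied counts standing in for the total mapped reads (an assumption; a given tool may define the total differently). Gene names, lengths, and counts are invented.

```python
def rpkm(counts: dict[str, int], lengths: dict[str, int]) -> dict[str, float]:
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(counts.values())
    return {g: counts[g] * 1e9 / (lengths[g] * total) for g in counts}


def tpm(counts: dict[str, int], lengths: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rates.values())
    return {g: rate * 1e6 / scale for g, rate in rates.items()}


lengths = {"GENE_A": 1000, "GENE_B": 2000, "RRNA": 2000}
clean = {"GENE_A": 100, "GENE_B": 400, "RRNA": 0}
contaminated = {"GENE_A": 100, "GENE_B": 400, "RRNA": 1000}

tpm_clean = tpm(clean, lengths)
tpm_contaminated = tpm(contaminated, lengths)
```

GENE_A has the same raw count in both samples, yet its TPM is lower in the contaminated sample because the rRNA reads take up part of the fixed total.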
104
Source: CLC Gx Manual
105
106
Create a PCA Plot - QC at the sample level
107
Differential Expression
- Differential expression between two groups, e.g. Treated vs Untreated, KO vs WT
- Differential expression between multiple groups
108
Differential Expressions Between Two Groups – Treated vs Untreated
First, select mapped reads from Test Samples.
Then, select mapped reads from Control Samples.
109
Commonly Used Tools for Differential Gene Expression Analysis
110
Differential Gene Expression – Treated vs Untreated
TMM Normalization (Trimmed Mean of M-values) calculates effective library sizes, which are then used as part of the per-sample normalization. TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed.
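The idea behind TMM can be sketched in a few lines: compute per-gene log ratios (M-values) between a sample and a reference, discard the most extreme ones, and average the rest to get a scaling factor for the library size. The version below is deliberately simplified. It uses an unweighted mean, whereas the published method uses precision weights, and it does not rescale factors across samples. The trimming fractions (30% on M, 5% on average abundance A) follow the commonly cited defaults and are assumptions here. It is not the implementation used by edgeR or CLC Genomics Workbench.

```python
import math


def tmm_factor(sample, reference, m_trim=0.30, a_trim=0.05):
    """Unweighted trimmed mean of M-values scaling factor (simplified sketch)."""
    n_s, n_r = sum(sample), sum(reference)
    m_values, a_values = [], []
    for y_s, y_r in zip(sample, reference):
        if y_s == 0 or y_r == 0:
            continue  # log ratios are undefined for zero counts
        p_s, p_r = y_s / n_s, y_r / n_r
        m_values.append(math.log2(p_s / p_r))        # log fold change
        a_values.append(0.5 * math.log2(p_s * p_r))  # average log abundance
    n = len(m_values)

    def untrimmed(values, trim):
        # Indices that survive trimming `trim` of the genes from each tail.
        order = sorted(range(n), key=values.__getitem__)
        cut = int(n * trim)
        return set(order[cut:n - cut])

    kept = untrimmed(m_values, m_trim) & untrimmed(a_values, a_trim)
    if not kept:
        raise ValueError("no genes left after trimming")
    return 2 ** (sum(m_values[i] for i in kept) / len(kept))


# Hypothetical counts: the sample was sequenced twice as deeply as the
# reference, and its last gene is strongly up-regulated.
reference = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
sample = [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 20000]

factor = tmm_factor(sample, reference)
effective_library_size = factor * sum(sample)
```

The raw library sizes differ by a factor of about 5.3 because of the one up-regulated gene, but the effective library size comes out at twice the reference total, which matches the true difference in depth for the unchanged genes.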
111
Differential Expression - Treated vs Untreated
Use the metadata table to define groups
112
Differential Gene Expression
113
Differential Expression – Gene level
114
Fold Change in Natural Scale vs Log Scale
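On the natural scale, fold changes are asymmetric: a four-fold increase is 4 while a four-fold decrease is 0.25. On the log2 scale the same changes are +2 and -2. The sketch below shows both, plus the signed convention in which a ratio below 1 is reported as its negative reciprocal. The expression values are invented, and the function names are illustrative.

```python
import math


def log2_fold_change(treated: float, control: float) -> float:
    """log2(treated / control); both values must be positive."""
    return math.log2(treated / control)


def signed_fold_change(ratio: float) -> float:
    """Report ratios below 1 as negative reciprocals (0.25 becomes -4)."""
    return ratio if ratio >= 1 else -1 / ratio


up = log2_fold_change(400, 100)    # four-fold increase
down = log2_fold_change(100, 400)  # four-fold decrease
```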
GraphPad Statistics Guide :
115
Data Visualization
116
Differential Expression - Volcano Plot
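A volcano plot places each gene at x = log2 fold change and y = -log10(p-value), so strongly changed and statistically significant genes appear in the upper left and upper right corners. The sketch below computes those coordinates and a simple label. The cut-offs (absolute log2 fold change of at least 1, adjusted p-value of at most 0.05) are common choices used here as assumptions, not values prescribed by the workshop.

```python
import math


def volcano_point(log2_fc: float, adj_p: float,
                  fc_cutoff: float = 1.0, p_cutoff: float = 0.05):
    """Return (x, y, label) for one gene on a volcano plot."""
    if not 0 < adj_p <= 1:
        raise ValueError("adj_p must be in (0, 1]")
    y = -math.log10(adj_p)
    if adj_p <= p_cutoff and log2_fc >= fc_cutoff:
        label = "up"
    elif adj_p <= p_cutoff and log2_fc <= -fc_cutoff:
        label = "down"
    else:
        label = "not significant"
    return log2_fc, y, label


# Hypothetical genes: (log2 fold change, adjusted p-value).
points = [volcano_point(2.0, 0.001), volcano_point(-1.5, 0.01),
          volcano_point(3.0, 0.2)]
```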
117
Create a HeatMap
118
Create a HeatMap
119
Create a HeatMap
120
Running CLC Genomics software on CRC HTC Cluster
121
Create a Track
122
Expression Browser – all in one large spreadsheet
123
Downstream Analysis
124
Downstream Analysis
DEG Annotates differentially expressed genes from an RNA-seq experiment, using the curated public data from GEO
125
NextBio Research
126
Export Data from CLC
127
Find Correlated Gene Expression Studies from GEO
128
Find Correlated Gene Expression Studies from GEO
129
Ingenuity IPA Analysis
130
Suggested MBIS Workshops