Bulk RNA-Seq Analysis Using CLCGenomics Workbench


1 Bulk RNA-Seq Analysis Using CLCGenomics Workbench
2019
Ansuman Chattopadhyay, PhD, Asst Director, Molecular Biology Information Service, Health Sciences Library System, University of Pittsburgh
Sri Chaparala, MS, Bioinformatics Specialist, Health Sciences Library System, University of Pittsburgh

2 Topics
- Brief introduction to RNA-Seq experiments
- Analyze RNA-seq data
  - Download seq reads from EBI-ENA/NCBI SRA
  - Import reads to CLC Genomics Workbench
  - Align reads to Reference Genome
  - Estimate expression at the gene level
  - Estimate expression at the transcript isoform level
  - Statistical analysis of the differentially expressed genes and transcripts
  - Create Heat Map, Volcano Plots, and Venn Diagram

3 Differential Gene Expressions
Raw Reads Venn Diagram Volcano Plot

4 Scaife Hall, Falk Library, Classroom 2
Fall 2019 HSLS MolBio Workshops
Descriptions & Registration:
SEPTEMBER
- 4th: Single Cell RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: ChIP-Seq & CLC Genomics (10am-12pm Overview & 1-3pm Hands-On)
- 25th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
OCTOBER
- 2nd: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 9th: Pathway Analysis: Open Access Tools (10am-12pm Overview & 1-3pm Hands-On)
- 23rd: ChIP-Seq & Partek Flow (1-4pm)
- 30th: Gene Regulation (1-4pm)
NOVEMBER
- 6th: Single Cell RNA-Seq (10am-12pm Overview & 1-3pm Hands-On)
- 13th: Gene Expression Visualization (1-4pm)
- 20th: Pathway Analysis: IPA & MetaCore (10am-12pm Overview & 1-3pm Hands-On)
DECEMBER
- 4th: Bulk RNA-Seq (10-11am Genomics Research Core, 11am-12pm Overview, 1-3pm Hands-On)
- 11th: Genetic Variation (10am-12pm Overview & 1-3pm Hands-On)

5 CRC Workshops

6 Pitt

7

8 HSLS MolBio

9 Partek Flow : Software for scRNA-Seq Data Analysis

10 NGS Software @ HSLS MolBio
NGS Analysis Sanger Seq Analysis

11 RNA-Seq Software @ HSLS MolBio
Enrichment Analysis Differentially Expressed Genes CLC Genomics Workbench Ingenuity Pathway Analysis Functions Diseases Pathways Key Pathway Advisor Upstream Regulators Volcano Plot PCA Plot Venn Diagram Heat Map Any Organism Illumina BaseSpace Correlation Engine RNA-Seq Reads RNA-Seq Analysis Downstream Analysis

12 RNA-Seq Data Analysis Support through HSLS MBIS

13 RNA Seq Questionnaire
- What is the scientific objective of the RNA-Seq experiment?
- How many classes will be compared?
- Are only coding RNAs (mRNA) expected to be detected, or also long non-coding RNA and miRNA?
- Did all the samples pass RNA quality checks before sequencing?
- Are there biological replicates? If so, how many?
- What type of sequencing platform was used to sequence the reads? (Illumina, Ion Torrent, SOLiD)
- Where was the sequencing performed? (Facility name and contact info)
- When was the sequencing performed? (Year/date)
- Which RNA extraction method was used in the experiment? (Total RNA / poly A / rRNA depletion; method and kit name and, if possible, a link to the protocol)
- Is the protocol strand specific? (Unstranded / forward / reverse; kit name and, if possible, a link to the protocol)
- Is the data single end or paired end?
- What is the expected read length?
- Do the reads contain adapters, or have they been removed? If not removed, please provide the adapter sequence, if available, or a link (this info can usually be obtained from the facility)
- What are the experimental conditions for the differential expression analysis?
- Which organism and reference genome should be used for the analysis?

14 CLC Genomics Workbench

15 CLCGx 12 Genomics Workbench BioMedical Workbench

16 Install Plugins

17 CLCbio Genomics Workbench
System Requirements
- Windows: Windows Vista, Windows 7, Windows 8, Windows 10, Windows Server 2012 or 2016
- Mac: OS X 10.10, and macOS 10.12, 10.13, 10.14
- Linux: RHEL 7 and later, SUSE Linux Enterprise Server 11 and later. (The software is expected to run without problem on other recent Linux systems, but we do not guarantee this.)
- 8 GB RAM required, 16 GB RAM recommended
- 1024 x 768 display required, 1600 x 1200 display recommended
- Intel or AMD CPU required
- 500 GB disk space required in the CLC Genomics server

18 HPC Partnership with CRC to Mitigate Computational Bottleneck
NGS Pitt HSLS License Server

19 CLCBio Genomics Workbench Server
- You can connect your CLC Genomics Workbench software to the core HTC cluster available to University of Pittsburgh researchers through the Center for Research Computing (CRC).
- This allows you to transparently migrate data from your workstation to the cluster and run analyses on the cluster, which then run independently of your workstation (i.e. you can shut down your machine and your analyses will continue unabated).

20 Center for Research Computing (CRC)

21 Request Access to CRC

22 CLC Genomics Workbench
- Ensure you have the most up-to-date version of the CLCbio Genomics Workbench (the software should tell you if there's a more recent version when you start it, or you can check on the CLCbio website).
- If you have not already done so, request a user account/allocation on the Center for Research Computing (CRC) HTC cluster by filling out the required information.
- If your computer is not connected to the Pitt network (e.g. you are working from home or on a trip), or you are working from a laptop that is connected to the Pitt wireless system, make sure you set up Pitt VPN so that you can communicate with the CLC Bioserver on the HTC cluster.
- Start the CLC Genomics Workbench.

23 Connect to CLC Server @ CRC

24 Access to CRC-HTC Cluster – CLC Server
If you DO NOT HAVE a CRC-HTC account, use the following for limited access during the workshop:
- UserID: hslsmolb
- PW: library1#
- Server name: clcbio.crc.pitt.edu
- Port: 7777
If you have a CRC-HTC account:
- Use your Pitt user name and Pitt password
- Server name: clcbio.crc.pitt.edu
- Port: 7777

25 File Structure at CRC CLC Gx Server
folders organized by PI’s name

26 Pre-analyzed Results

27 RNA-Seq Data

28 Bulk RNA-seq Study

29

30 NCBI SRA

31 NCBI SRA

32 NCBI SRA Untreated Vs DEX

33 RNA-Seq Basics

34 RNA-Seq vs. Microarrays
RNA-Seq compared with microarrays:
- covers more dynamic range
- allows discovery of novel transcripts
- is able to detect SNPs
- is more costly ($300-$1000/sample) than microarray ($100-$200/sample)
- generates a much larger dataset than microarray (uncompressed RNA-Seq raw files: >5GB)
Sources: Riki Kawaguchi’s Blog: Zhao S, Fung-Leung WP, Bittner A, Ngo K, Liu X. Comparison of RNA-Seq and microarray in transcriptome profiling of activated T cells. PLoS ONE Jan 16;9(1):e78644.

35 convert to cDNA fragments
adaptors ligation short seq reads align reads to reference genome

36

37 Bulk RNA-Seq
Fragmentation of RNA before cDNA synthesis was shown to reduce 3ʹ:5ʹ bias (ref. 4), and strand-specific library preparation methods, which allow sense and antisense transcripts to be differentiated, were shown to provide a more accurate estimate of transcript abundance.

38 Bulk RNA-Seq Data Analysis Workflow

39 Bulk RNA-Seq Data Analysis Steps
Command Line Tools Graphical User Interface
In workflow A, aligners such as TopHat (ref. 112), STAR (ref. 113) or HISAT2 (ref. 114) use a reference genome to map reads to genomic locations, and then quantification tools, such as HTSeq (ref. 133) and featureCounts (ref. 134), assign reads to features. After normalization (usually using methods embedded in the quantification or expression modelling tools, such as trimmed mean of M-values (TMM) (ref. 142)), gene expression is modelled using tools such as edgeR (ref. 143), DESeq2 (ref. 155) and limma+voom (ref. 156), and a list of differentially expressed genes or transcripts is generated for further visualization and interpretation.
In workflow B, newer, alignment-free tools, such as Kallisto (ref. 119) and Salmon (ref. 120), assemble a transcriptome and quantify abundance in one step. The output from these tools is usually converted to count estimates (using tximport (ref. 130) (TXI)) and run through the same normalization and modelling used in workflow A, to output a list of differentially expressed genes or transcripts.
Alternatively, workflow C begins by aligning the reads (typically performed with TopHat (ref. 112), although STAR (ref. 113) and HISAT (ref. 114) can also be used), followed by the use of CuffLinks (ref. 131) to process raw reads and the CuffDiff2 package to output transcript abundance estimates and a list of differentially expressed genes or transcripts.
Other tools in common use include StringTie (ref. 116), which assembles a transcriptome model from TopHat (ref. 112) (or similar tools) before the results are passed through to RSEM (ref. 105) or MMSEQ (ref. 132) to estimate transcript abundance, and then to Ballgown (ref. 157) to identify differentially expressed genes or transcripts, and SOAPdenovo-trans (ref. 117), which simultaneously aligns and assembles reads for analysis via the path of choice.
Taken from Stark, R., Grzelak, M. & Hadfield, J. RNA sequencing: the teenage years. Nat. Rev. Genet. (2019). doi: /s

40 Command Line vs Graphical User Interface
CLI GUI

41 CLC Genomics Software User Interface

42 Contact CLCBio Support Team
CLCGX 12.0 User Manual:

43 Create a Folder in CRC-HTC Cluster

44 Create Workshop Folder@ HTC-CLC Server

45 CLCGX Tools for RNA-Seq Data Analysis

46 Import FASTQ Reads to CLCGx

47 Import FASTQ Reads to CLCGx
Import your saved data from local computer or from CRC servers NCBI SRA download in CLC

48 NGS Technologies
NCBI Seq Read Archive: Illumina 6,235591; ABI SOLiD 27,315; Ion Torrent 88,946; PacBio ,538; MinION ,404

49 Import Reads Stored in Local Computer Files to CLCGx

50 Import Reads to CLC 3 4 5

51 Import Reads from CRC Server
Select Grid option – HTC Data
CRC can assign each group (faculty) an import/export directory on the server. Members of the group share this import/export directory with read/write permissions. Please open a support ticket on the CRC website if you do not find a folder matching your group.

52 Download Reads from NCBI SRA database

53 NCBI SRA download in CLC

54 Download FASTQ Reads from EBI ENA

55 Help : Import Illumina Reads

56 FASTQ Format

57 Results By CLC : Imported Illumina Reads
Folders: TrainingMaterials, Workshops, CBF_AMLeukemiaProject, RNASeq _GSE101788, RNASeq_DifferentialExpression, Reads
Reads are already downloaded. You can find the reads in Server Folder – TrainingMaterials – Pre-analyzed Result_RNA-Seq

58 Imported Illumina Reads

59 A Preprocessing includes experimental design, sequencing design, and quality control steps.

60 Number of Replicates Filtering out genes that are expressed at low levels prior to differential expression analysis reduces the severity of the correction and may improve the power of detection [20]. Increasing sequencing depth also can improve statistical power for lowly expressed genes.
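The low-expression filtering mentioned above can be sketched in a few lines of Python. This is only an illustration, not what CLC Genomics Workbench does internally: the function name `filter_low_expression`, the counts-per-million (CPM) cutoff of 1, the requirement of at least 2 samples, and the gene names and counts are all assumptions chosen for the example.

```python
def filter_low_expression(counts, min_cpm=1.0, min_samples=2):
    """Keep genes whose CPM reaches min_cpm in at least min_samples samples.

    counts maps a gene name to a list of raw read counts, one per sample.
    The thresholds are illustrative assumptions, not recommended values.
    """
    n_samples = len(next(iter(counts.values())))
    # Library size = total counts in each sample (column sums).
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    kept = {}
    for gene, row in counts.items():
        cpm = [1e6 * c / lib_sizes[j] for j, c in enumerate(row)]
        if sum(v >= min_cpm for v in cpm) >= min_samples:
            kept[gene] = row
    return kept


# Hypothetical toy counts for three genes across four samples.
toy_counts = {
    "GENE_A": [500, 620, 480, 700],
    "GENE_B": [0, 1, 0, 0],
    "GENE_C": [3000, 2800, 3500, 3100],
}
filtered = filter_low_expression(toy_counts)
print(sorted(filtered))
```

Fewer genes tested means fewer hypotheses to correct for, which is why the filter can soften the multiple-testing correction.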

61 QC for Sequencing Reads

62 https://galaxyproject. github

63 FASTQC Project

64 Phred Score wikipedia

65 Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training

66 Taken from Introduction to ChIP-Seq by HPC Tutorial by HBC Training
Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30% disagreement be discarded. As a general rule, read quality decreases towards the 3′ end of reads, and if it becomes too low, bases should be removed to improve mappability.

67 Assessing Sequence Data Quality (led by Dawei Lin and Simon Andrews)

68 Create a Seq QC Report 1 2

69 Results By CLC: Read QC Report

70 Acceptable duplication, k-mer or GC content levels are experiment- and organism-specific, but these values should be homogeneous for samples in the same experiment. We recommend that outliers with over 30% disagreement be discarded. As a general rule, read quality decreases towards the 3′ end of reads, and if it becomes too low, bases should be removed to improve mappability.
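To make the idea of removing low-quality bases from the 3′ end concrete, here is a minimal Python sketch. It assumes Phred+33 encoded quality strings and a cutoff of Q20, and it simply strips trailing bases below the cutoff; it is not the trimming algorithm used by CLC Genomics Workbench, and the function names and example read are hypothetical.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def trim_3prime(sequence, quality_string, min_quality=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_quality."""
    scores = phred_scores(quality_string, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality_string[:end]


# Hypothetical read: eight high-quality bases followed by two poor ones.
read_seq = "ACGTACGTAC"
read_qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(read_seq, read_qual)
print(trimmed_seq, trimmed_qual)
```

A Phred score of 20 corresponds to an error probability of 1 in 100, which is why it is a common (but not universal) cutoff.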

71 Read Trimming (based on quality of reads or adapters)

72 Trim Reads

73 Read Trimming

74 Annotate Reads: Create a Metadata Table

75 Create and Import a Metadata Table
Spread Sheet

76 Import Metadata

77 Import Metadata

78 Read Mapping

79 Read Mapping Wikipedia

80 Read Mapping Ozsolak et al. Nature Review Genetics

81 CLC Read Mapper Documentation

82 Read Mapping 5

83 Reads Mapping 7

84 Reads Mapping 8

85 Reference Genome

86 Reference Genomes https://www.ncbi.nlm.nih.gov/grc

87 Reference Genome
Human: GRCh38
Mouse: mm10 (C57BL/6J); 16 other mouse strains are now available

88 Read Mapping

89 Read Mapping 9

90 Reads Mapping 10

91 Reads Mapping

92 Mapping Result GE : Gene Expression; TE: Transcript Expression; FG: Fusion Gene

93 Reads Mapping

94 Normalization and Expression Values
TMM: weighted trimmed mean of the log expression ratios (trimmed mean of M values, TMM), used by edgeR and CLCGx

95 Normalization Methods

96 Reads Mapping GE

97 Transcript Expression

98 Read Mapping Report – SRR5861494
An important mapping quality parameter is the percentage of mapped reads, which is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. For example, we expect between 70 and 90 % of regular RNA-seq reads to map onto the human genome (depending on the read mapper used) [15], with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’).

99 Transcript Level Expression

100 The percentage of mapped reads is a global indicator of the overall sequencing accuracy and of the presence of contaminating DNA. We expect between 70 and 90% of regular RNA-seq reads to map onto the human genome (depending on the read mapper used), with a significant fraction of reads mapping to a limited number of identical regions equally well (‘multi-mapping reads’). Other important parameters are the uniformity of read coverage on exons and the mapped strand. If reads primarily accumulate at the 3′ end of transcripts in poly(A)-selected samples, this might indicate low RNA quality in the starting material. The GC content of mapped reads may reveal PCR biases.
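As a small worked example of the mapped-read percentage, the Python sketch below computes the value from two read counts and flags samples that fall under the lower end of the 70 to 90% range quoted above. The function names and the read counts are hypothetical; in practice the numbers would be read off the mapping report.

```python
def mapped_percentage(mapped_reads, total_reads):
    """Percentage of reads that mapped to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


def below_expected(pct, lower_bound=70.0):
    """True when the mapping rate is under the lower end of the expected range."""
    return pct < lower_bound


# Hypothetical counts for two samples.
good_pct = mapped_percentage(8_500_000, 10_000_000)
poor_pct = mapped_percentage(5_500_000, 10_000_000)
print(good_pct, below_expected(good_pct))
print(poor_pct, below_expected(poor_pct))
```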

101 Create a Combined RNA-Seq Report

102

103 The biotypes are "as a percentage of all transcripts" or "as a percentage of all genes". For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein-coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issues in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied.

104 For a poly-A enrichment experiment, it is expected that the majority of reads correspond to protein coding regions. For an rRNA depletion protocol, a variety of non-coding RNA regions may also be observed. The percentage of reads mapping to rRNA should usually be <15%. If over 15% of the reads mapped to rRNA, it could be that the poly-A enrichment/rRNA depletion protocol failed. The sample can still be used for differential expression and variant calling, but expression values such as TPM and RPKM may not be comparable to those of other samples. To troubleshoot the issues in future experiments, check for rRNA depletion prior to library preparation. Also, if an rRNA depletion kit was used, check that the kit matches the species being studied. CLC Gx Manual
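The 15% rRNA rule of thumb can be checked with a short Python sketch. The biotype labels, read counts and function names below are hypothetical; the real values would come from the biotype table in the RNA-Seq report.

```python
def biotype_percentages(reads_by_biotype):
    """Convert read counts per biotype into percentages of all counted reads."""
    total = sum(reads_by_biotype.values())
    return {biotype: 100.0 * n / total for biotype, n in reads_by_biotype.items()}


def rrna_exceeds_threshold(reads_by_biotype, threshold_pct=15.0):
    """True when more than threshold_pct of reads map to rRNA."""
    pct = biotype_percentages(reads_by_biotype).get("rRNA", 0.0)
    return pct > threshold_pct


# Hypothetical samples: one typical poly-A library, one with heavy rRNA carry-over.
sample_ok = {"protein_coding": 800, "lncRNA": 100, "rRNA": 100}
sample_bad = {"protein_coding": 700, "rRNA": 300}
print(rrna_exceeds_threshold(sample_ok), rrna_exceeds_threshold(sample_bad))
```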

105 A Preprocessing includes experimental design, sequencing design, and quality control steps.

106 Create a PCA Plot - QC at the sample level

107 Differential Expression
- Differential expression between two groups, e.g. Treated vs Untreated, KO vs WT
- Differential expression between multiple groups

108 Differential Expressions Between Two Groups – Treated vs Untreated
First, select mapped reads from Test Samples.
Then, select mapped reads from Control Samples.

109 Commonly Used Tools for Differential Gene Expression Analysis

110 Differential Gene Expression – Treated vs Untreated
TMM Normalization (Trimmed Mean of M values) calculates effective library sizes, which are then used as part of the per-sample normalization. TMM normalization adjusts library sizes based on the assumption that most genes are not differentially expressed.
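The following Python sketch illustrates how a TMM-style scaling factor leads to an effective library size. It is a deliberately simplified version: it uses an unweighted trimmed mean, whereas the published TMM method uses precision weights and a specific choice of reference sample. The trim fractions (30% of M values, 5% of A values at each end), the function name `tmm_factor` and the toy counts are assumptions for illustration, and this is not the CLC or edgeR implementation.

```python
import math


def tmm_factor(sample, reference, m_trim=0.30, a_trim=0.05):
    """Simplified (unweighted) TMM scaling factor of sample relative to reference."""
    n_s, n_r = sum(sample), sum(reference)
    m_values, a_values = [], []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log ratios are undefined for zero counts
            p_s, p_r = y_s / n_s, y_r / n_r
            m_values.append(math.log2(p_s / p_r))        # log fold change (M)
            a_values.append(0.5 * math.log2(p_s * p_r))  # mean log abundance (A)
    n = len(m_values)

    def central(values, trim):
        # Indices of genes left after trimming a fraction from both tails.
        order = sorted(range(n), key=lambda i: values[i])
        lo = int(math.floor(n * trim))
        return set(order[lo:n - lo])

    keep = central(m_values, m_trim) & central(a_values, a_trim)
    if not keep:
        raise ValueError("no genes left after trimming")
    mean_m = sum(m_values[i] for i in keep) / len(keep)
    return 2.0 ** mean_m


# Toy data: identical counts except one gene that is strongly up in the sample.
reference = [100] * 10
sample = [100] * 9 + [1000]
factor = tmm_factor(sample, reference)
effective_library_size = sum(sample) * factor
print(round(factor, 4), round(effective_library_size, 1))
```

In this toy case the single highly expressed gene inflates the raw library size to 1900, and the trimmed mean ignores it, so the effective library size comes back to about 1000, matching the reference. That is the sense in which TMM assumes most genes are not differentially expressed.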

111 Differential Expression - Treated vs Untreated
Use the metadata table to define groups

112 Differential Gene Expression

113 Differential Expression – Gene level

114 Fold Change in Natural Scale vs Log Scale
GraphPad Statistics Guide :
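The relationship between the two scales can be shown with a few lines of Python. On the natural scale, up- and down-regulation are asymmetric (a 4-fold increase is 4, a 4-fold decrease is 0.25), while on the log2 scale they are symmetric around zero. The function names and the "signed fold change" convention shown here are illustrative assumptions.

```python
import math


def log2_fold_change(ratio):
    """Convert a natural-scale expression ratio (treated / control) to log2."""
    return math.log2(ratio)


def signed_fold_change(ratio):
    """Report ratios below 1 as negative reciprocals (e.g. 0.25 becomes -4)."""
    return ratio if ratio >= 1 else -1.0 / ratio


for ratio in (4.0, 1.0, 0.25):
    print(ratio, log2_fold_change(ratio), signed_fold_change(ratio))
```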

115 Data Visualization

116 Differential Expression - Volcano Plot

117 Create a HeatMap

118 Create a HeatMap

119 Create a HeatMap

120 Running CLC Genomics software on CRC HTC Cluster

121 Create a Track

122 Expression Browser – all in one large spread sheet

123 Downstream Analysis

124 Downstream Analysis DEG Annotates differentially expressed genes from
an RNA-seq experiment, using the curated public data from GEO

125 NextBio Research

126 Export Data from CLC

127 Find Correlated Gene Expression Studies from GEO

128 Find Correlated Gene Expression Studies from GEO

129 Ingenuity IPA Analysis

130 Suggested MBIS Workshops

