Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo.

Presentation transcript:

Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis Yan Guo

What is Sequencing? Sequencing is the process of determining the precise order of nucleotides. Non-high-throughput sequencing: Sanger sequencing, the basic chain-termination method developed by Frederick Sanger in 1977. It generates all possible single-stranded DNA molecules complementary to a given template, each beginning at a common 5' base.

The Pros and Cons of Sanger Sequencing Pros: highly accurate; targetable. Cons: costs about $15 per 1,000 base pairs, so sequencing a whole human genome (about 3 billion base pairs) would cost roughly 3 billion / 1,000 x $15 = $45 million; low detection rate for alternative alleles.

Current Generation Sequencing Illumina ABI SOLiD 454 Life Sciences Price Low Medium High Read Length 50-100 400-1000 Read Depth Difficulty Easy

Sequencing Type By Source RNA: mRNA, Small RNA, Total RNA DNA: Whole Genome or targeted (Exome, mitochondrial, genes of interest, etc)

Sequencing Data Raw image data is more than 2 TB per sample. Raw sequence data is about 5-15 GB per single-end sample or 10-30 GB per paired-end sample for RNAseq or exome sequencing. Whole genome data can easily exceed 200 GB per sample. In general, about 5x the raw data size is needed as working space to finish processing. Raw data is usually in FASTQ format, with base quality on the Phred scale. The older Illumina pipeline uses Phred+64 encoding; the newer CASAVA 1.8 pipeline uses Sanger (Phred+33) encoding.

Single end vs Paired end Paired-end sequencing produces twice as much data as single-end. Paired-end is more expensive than single-end. Paired-end data is easier to quality control (insert size, duplicate removal). Paired-end data provides more opportunities to detect structural variants.

What can you obtain from DNAseq SNPs (require only a normal or a tumor sample). Somatic mutations (require a tumor-normal pair). Copy number variation (works best with whole genome sequencing). Small structural variants: insertions, deletions. Large structural variants: translocations, inversions.

What can you obtain from RNAseq Gene expression. SNPs (only for expressed genes). Novel splicing variants. Gene fusions. RNAseq has been used primarily as a replacement for microarrays.

How does RNAseq compare to Microarray? Since 2008, people have been saying that RNAseq will replace microarrays for gene expression profiling. VANTAGE stopped offering microarray service earlier this year. 1. Wang, Z., M. Gerstein, and M. Snyder, RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet, 2009. 10(1): p. 57-63. 2. Shendure, J., The beginning of the end for microarrays? Nat Methods, 2008. 5(7): p. 585-7.

Data Distribution Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

Result Consistency Guo, Y., et al., Large Scale Comparison of Gene Expression Levels by Microarrays and RNAseq Using TCGA Data. PLoS One, 2013. 8(8): p. e71462.

RNAseq vs Microarray - advantages
Result type: RNAseq: rich, not limited to expression. Microarray: limited to expression only.
Expression: RNAseq: can quantify expression on exon and gene level. Microarray: can quantify expression on exon or gene level.
Novel discovery: RNAseq: can be used for novel discovery. Microarray: can only detect what is on the chip.
Analysis, interpretation: RNAseq: difficult. Microarray: easy.
Price for assay: RNAseq: price has become comparable to microarray, however the analysis hardware and analysis time may increase the final cost. Microarray: price is stable.

Processing RNA

Raw data
@HWI-ST508:203:D078GACXX:8:1101:1296:1011 1:N:0:ATCACG
NTGGAGTCCTAGGCACAGCTCTAAGCCTCCTTATTCGAGCCGAGCTGGGCC
+
#4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@

@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139    the unique instrument name
136       the run id
FC706VJ   the flowcell id
2         flowcell lane
2104      tile number within the flowcell lane
15343     'x'-coordinate of the cluster within the tile
197393    'y'-coordinate of the cluster within the tile
1         the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y         Y if the read fails filter (read is bad), N otherwise
18        0 when none of the control bits are on, otherwise it is an even number
ATCACG    index sequence
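
The field layout above can be turned into a small parser. The following is a minimal sketch, assuming the header always has the two colon-separated groups shown on the slide; the names `IlluminaReadId` and `parse_read_id` are illustrative and not part of any pipeline mentioned in the talk.

```python
from typing import NamedTuple


class IlluminaReadId(NamedTuple):
    """Fields of a CASAVA 1.8 style FASTQ header (hypothetical container)."""
    instrument: str
    run_id: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int
    pair_member: int
    fails_filter: bool  # True when the flag is "Y" (read is bad)
    control_bits: int
    index_sequence: str


def parse_read_id(header: str) -> IlluminaReadId:
    """Split a FASTQ header line into its named fields."""
    if not header.startswith("@"):
        raise ValueError("a FASTQ header must start with '@'")
    # The header has two groups separated by a single space.
    location, description = header[1:].strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = location.split(":")
    member, filtered, control, index = description.split(":")
    return IlluminaReadId(
        instrument, int(run_id), flowcell, int(lane), int(tile),
        int(x), int(y), int(member), filtered == "Y", int(control), index,
    )


read_id = parse_read_id("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(read_id)
```

Older Illumina headers use a different layout, so a real pipeline would need to check the format before applying this split.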

Phred Score
Phred quality score   Probability of incorrect base call   Base call accuracy
10                    1 in 10                              90 %
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
50                    1 in 100000                          99.999 %
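
The table follows from the definition Q = -10 log10(P), where P is the probability of an incorrect base call. The sketch below reproduces the table and decodes the quality line from the raw data slide, assuming Sanger (Phred+33) encoding; the function names are illustrative.

```python
from typing import List


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Decode a FASTQ quality string; offset is 33 for Sanger, 64 for old Illumina."""
    return [ord(ch) - offset for ch in quality]


# Reproduce the rows of the Phred table.
for q in (10, 20, 30, 40, 50):
    p = phred_to_error_probability(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):.3f} %")

# Quality line from the example read (assumed to be Phred+33 encoded).
quality_line = "#4=DDDDDDDDDDE<DAEEEIDFEIEIEIEIIIIIIDEDDDDA@DDDDII@"
scores = decode_quality(quality_line)
print(scores)
```

Decoding with the wrong offset shifts every score by 31, which is why the encoding has to be known before any quality filtering.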

Quality Control Quality control should be conducted at multiple stages of sequencing data processing: (1) raw data, (2) alignment, (3) results (expression for RNA; SNPs/mutations for DNA). Guo, Y., et al., Three-stage quality control strategies for DNA re-sequencing data. Brief Bioinform, 2013.

Raw Data QC - Tools
FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
FASTX-Toolkit http://hannonlab.cshl.edu/fastx_toolkit/
QC3 https://github.com/slzhao/QC3
NGS QC Toolkit http://59.163.192.90:8080/ngsqctoolkit/

Raw Data QC - What to Look For

Alignment QC - Tools
QC3 https://github.com/slzhao/QC3
QPLOT http://genome.sph.umich.edu/wiki/QPLOT
SAMStat http://samstat.sourceforge.net/

Alignment QC - What to Look For

Expression QC - Tools MultiRankSeq https://github.com/slzhao/MultiRankSeq

Clustering Algorithms Start with a collection of n objects, each represented by a p-dimensional feature vector x_i, i = 1, ..., n. The goal is to divide these n objects into k clusters so that objects within a cluster are more "similar" to each other than to objects in other clusters. k is usually unknown. Popular methods: hierarchical clustering, k-means, SOM, mixture models, etc.

Distance Calculation in Sequencing Smith-Waterman algorithm Sequence 1 = ACACACTA Sequence 2 = AGCACACA w(gap) = 0 w(match) = +2 w(a, − ) = w( − ,b) = w(mismatch) = − 1
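
As a rough illustration of the Smith-Waterman scoring above, the sketch below fills the local-alignment score matrix for the two example sequences. It assumes a match score of +2 and a penalty of -1 for both mismatches and gaps (reading w(a, -) = w(-, b) = -1 as the gap cost); traceback of the alignment itself is omitted.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    # h[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = h[i - 1][j] + gap      # gap in sequence b
            left = h[i][j - 1] + gap    # gap in sequence a
            # Scores never drop below zero, which is what makes the alignment local.
            h[i][j] = max(0, diagonal, up, left)
            best = max(best, h[i][j])
    return best


score = smith_waterman_score("ACACACTA", "AGCACACA")
print(score)
```

A higher score means more similar sequences, so a distance for clustering would have to be derived from the score rather than read off directly.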

Distance Calculation in Microarray Pearson correlation between two expression profiles (vectors) x and y: -1 ≤ Pearson correlation ≤ +1
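
A minimal sketch of the Pearson correlation between two expression profiles, written in plain Python. The two profiles are the DDR1 and TTLL12 rows from the data slide later in this deck; the choice of genes is arbitrary and only for illustration.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


ddr1 = [9.376298, 8.961996, 9.271935, 8.968211, 8.663588, 9.214028]
ttll12 = [9.346916, 8.955574, 8.868433, 9.825905, 9.387397, 9.1008]
r = pearson(ddr1, ttll12)
print(r)
```

For clustering, 1 - r is a common way to turn the correlation into a dissimilarity.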

Similarity Measurements Euclidean Distance

Linkage Single linkage: D(X, Y) = min d(x, y), x ∈ X, y ∈ Y. Complete linkage: D(X, Y) = max d(x, y), x ∈ X, y ∈ Y. Average linkage: D(X, Y) = (1 / (|X| |Y|)) Σ d(x, y), summed over all x ∈ X, y ∈ Y.
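
The linkage rules can be sketched directly from their definitions, using Euclidean distance as d(x, y). The clusters below are made-up points, and average linkage is taken in its usual form as the mean of all pairwise distances.

```python
import math
from itertools import product
from typing import Sequence

Point = Sequence[float]


def euclidean(x: Point, y: Point) -> float:
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """Distance between the closest pair of points across the two clusters."""
    return min(euclidean(x, y) for x, y in product(X, Y))


def complete_linkage(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """Distance between the farthest pair of points across the two clusters."""
    return max(euclidean(x, y) for x, y in product(X, Y))


def average_linkage(X: Sequence[Point], Y: Sequence[Point]) -> float:
    """Mean of all pairwise distances across the two clusters."""
    return sum(euclidean(x, y) for x, y in product(X, Y)) / (len(X) * len(Y))


# Hypothetical clusters, chosen only to make the three rules differ.
cluster_x = [(0.0, 0.0), (0.0, 1.0)]
cluster_y = [(3.0, 0.0), (4.0, 0.0)]
print(single_linkage(cluster_x, cluster_y))
print(complete_linkage(cluster_x, cluster_y))
print(average_linkage(cluster_x, cluster_y))
```

Hierarchical clustering repeatedly merges the two clusters with the smallest linkage distance, so the choice of rule changes the shape of the resulting tree.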

Expression QC - What to Look For

Batch Effect

Correction of Batch Effect Guo, Y., et al., Statistical strategies for microRNAseq batch effect reduction. Translational Cancer Research, 2014. 3(3): p. 260-265.

Normalization of RNAseq Reads Per Kilobase per Million mapped reads (RPKM)
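
RPKM is commonly computed as 10^9 x C / (N x L), where C is the number of reads mapped to the gene, N is the total number of mapped reads in the sample, and L is the gene length in base pairs. A minimal sketch, with a hypothetical gene length:

```python
def rpkm(read_count: int, gene_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of gene per million mapped reads."""
    if gene_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("gene length and total read count must be positive")
    # Dividing by (length / 1e3) and by (total / 1e6) gives the 1e9 factor.
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 2,000 bp long with 500 reads in a sample of 25 million mapped reads.
value = rpkm(read_count=500, gene_length_bp=2000, total_mapped_reads=25_000_000)
print(value)
```

Because both gene length and sequencing depth are divided out, RPKM values can be compared between genes within a sample and, more cautiously, between samples.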

RNAseq Data Alignment
TopHat2 http://ccb.jhu.edu/software/tophat/index.shtml
MapSplice http://www.netlab.uky.edu/p/bioinfo/MapSplice

Gene Quantification
Cufflinks for RPKM http://cufflinks.cbcb.umd.edu/
HTSeq for read count http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html

Data
Gene Symbol   1          2          3          4          5          6
DDR1          9.376298   8.961996   9.271935   8.968211   8.663588   9.214028
RFC2          7.950475   7.795976   7.124782   8.156603   7.821047   6.613421
HSPA6         5.584798   5.12491    5.77907    5.849914   5.593596   5.042853
PAX8          6.355186   6.245788   6.388794   6.737545   6.662428   6.279758
GUCA1A        2.961001   3.226968   3.092915   3.187618   3.067353   3.159364
UBE1L         7.437969   7.422707   8.298944   6.124551   6.263097   7.548323
THRA          6.606546   6.687768   6.910623   7.166293   6.711748   6.632955
PTPN21        7.392678   6.772702   6.834253   6.840313   6.813115   6.68312
CCL5          2.710744   2.479818   2.51898    2.61285    2.885117   2.668616
CYP2E1        3.871231   4.085553   5.031865   5.053069   5.080394   5.557095
EPHB3         4.289411   3.771091   3.798425   3.893421   4.01667    4.200385
ESRRA         7.151026   7.219117   6.900173   7.841436   7.254173   7.119073
CYP2A6        4.568492   4.33565    4.5123     4.672211   4.587597   4.561608
SCARB1        6.134823   6.440855   5.739945   6.269867   5.534482   5.281546
TTLL12        9.346916   8.955574   8.868433   9.825905   9.387397   9.1008
C2orf59       4.42666    5.219388   4.799542   5.204245   4.846079   3.934838
WFDC2         4.706794   4.974295   5.149892   4.417064   4.273504   4.638822
MAPK1         4.777312   4.797072   4.249238   4.252584   3.687591   4.412024
              7.875045   7.902457   7.572943   8.10576    7.793828   7.635768
ADAM32        4.629726   5.27395    4.351249   5.249061   5.204216   5.412291

Example of Quantile Normalization Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5. [Tables: the original matrix of genes G1-G5 (rows) by samples S1-S3 (columns), followed by the same matrix with each sample column sorted independently.]

Take Average for Each Row [Tables: the sorted matrix, with each row replaced step by step by the average of that row across S1, S2 and S3.]

Reorder Red = G1; Green = G2; Blue = G3; Yellow = G4; Black = G5. [Tables: the averaged values returned, one gene at a time, to the positions those genes occupied in the original matrix.]
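
The three steps (sort each sample, average each row of the sorted matrix, put the averages back in the original order) can be sketched in plain Python as follows. The input matrix is hypothetical and has no tied values; ties would need an explicit rule, such as averaging the tied ranks, which this sketch does not implement.

```python
from typing import List


def quantile_normalize(matrix: List[List[float]]) -> List[List[float]]:
    """Quantile-normalize a genes (rows) by samples (columns) matrix."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_cols)]

    # Step 1: sort every sample column independently.
    sorted_columns = [sorted(col) for col in columns]

    # Step 2: average across samples at each rank (each row of the sorted matrix).
    rank_means = [sum(col[i] for col in sorted_columns) / n_cols
                  for i in range(n_rows)]

    # Step 3: give each gene the mean that belongs to its rank within its own sample.
    normalized = [[0.0] * n_cols for _ in range(n_rows)]
    for j, col in enumerate(columns):
        order = sorted(range(n_rows), key=lambda i: col[i])
        for rank, gene_index in enumerate(order):
            normalized[gene_index][j] = rank_means[rank]
    return normalized


# Hypothetical expression values: 4 genes (rows) by 3 samples (columns).
data = [
    [2.0, 6.0, 10.0],
    [5.0, 3.0, 14.0],
    [4.0, 7.0, 8.0],
    [3.0, 5.0, 9.0],
]
normalized = quantile_normalize(data)
for row in normalized:
    print(row)
```

After normalization every sample column contains exactly the same set of values, only in a different gene order.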

Differential Expression Analysis
Cuffdiff from Cufflinks package: Trapnell, C., et al., Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc, 2012. 7(3): p. 562-78.
DESeq http://bioconductor.org/packages/release/bioc/html/DESeq.html
edgeR http://www.bioconductor.org/packages/release/bioc/html/edgeR.html
NBPSeq http://cran.r-project.org/web/packages/NBPSeq/index.html
TSPM http://omictools.com/sequencing/rna-seq/normalization-de/tspm-r-s2496.html
baySeq http://www.bioconductor.org/packages/release/bioc/html/baySeq.html

Which Method Is the Best? Guo, Y., et al., Evaluation of read count based RNAseq analysis methods. BMC Genomics, 2013. 14 Suppl 8: p. S2.

Consistency

Consistency

Inconsistency
Method     Adj p-value   Log2FC   Rank
DESeq      0.278         3.00     2572
edgeR      0.047         2.92     712
baySeq     0.907         NA       24962
Cuffdiff   <0.001        5.83     13

Sample     Read count (IGHG2)   Total read count   Adjusted read count
Disease1   391                  49870084           78
Disease2   2038                 65550902           311
Disease3   338                  71454121           47
Control1   634                  35641084           178
Control2   10282                44863975           2292
Control3   1764                 49052840           360
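
The adjusted read counts in the table are consistent with scaling each raw IGHG2 count to a common library size of 10 million reads. That scale factor is inferred from the numbers, not stated on the slide, so treat the sketch below as an assumption that happens to reproduce the table.

```python
SCALE = 10_000_000  # assumed common library size, inferred from the table

# (IGHG2 read count, total read count) per sample, copied from the slide.
samples = {
    "Disease1": (391, 49870084),
    "Disease2": (2038, 65550902),
    "Disease3": (338, 71454121),
    "Control1": (634, 35641084),
    "Control2": (10282, 44863975),
    "Control3": (1764, 49052840),
}


def adjusted_read_count(read_count: int, total_reads: int, scale: int = SCALE) -> int:
    """Scale a raw read count to a library of `scale` reads, rounded to an integer."""
    return round(read_count * scale / total_reads)


adjusted = {name: adjusted_read_count(count, total)
            for name, (count, total) in samples.items()}
for name, value in adjusted.items():
    print(name, value)
```

Even after adjustment, Control2 is several times larger than every other sample, which is the kind of single-sample outlier that makes the methods disagree on this gene.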

Combined Approach
Columns: log2FoldChange(DESeq2) pValue(DESeq2) pAdj(DESeq2) log2FoldChange(edgeR) pValue(edgeR) pAdj(edgeR) log2FoldChange(raw) 1-Likelihood(baySeq) AdjLikelihood(baySeq) rank(DESeq) rank(edgeR) rank(baySeq) rankMethod1
ENSMUSG00000090862_Rps13 -5.86335 7.02E-210 1.07E-205 -6.19904 1.80E-109 4.01E-105 -6.02447 4.21E-07 1.81E-07 1 4 6
ENSMUSG00000058546_Rpl23a -3.67515 3.27E-140 2.49E-136 -3.75807 2.14E-58 2.38E-54 -3.57503 2.53E-05 5.21E-06 2 5 9
ENSMUSG00000091957_Gm8841 -4.68658 1.91E-72 7.27E-69 -5.33723 1.60E-50 8.90E-47 -5.14873 4.71E-05 1.70E-05 8 16
ENSMUSG00000062683_Atp5g2 -4.62274 4.86E-69 1.48E-65 -5.27352 4.06E-50 1.81E-46 -5.06178 7.54E-05 2.83E-05 10 20
ENSMUSG00000082697_Gm12913 -3.94956 8.13E-80 4.13E-76 -4.21738 4.01E-52 2.98E-48 -4.10532 0.000296 9.17E-05 3 14
ENSMUSG00000060128_Gm10075 -4.59774 7.59E-64 1.65E-60 -5.34137 2.30E-42 6.41E-39 -5.13809 2.68E-05 8.81E-06 7 21
ENSMUSG00000058558_Rpl5 -3.06317 8.73E-68 2.22E-64 -3.1805 1.29E-42 4.12E-39 -2.99193 0.000313 0.000119 29
ENSMUSG00000063316_Rpl27 -3.26559 2.86E-62 5.45E-59 -3.4434 2.48E-43 9.20E-40 -3.27313 0.000409 0.000151 18 32
ENSMUSG00000073702_Rpl31 -2.71836 4.03E-41 4.72E-38 -2.8753 7.47E-33 1.67E-29 -2.70413 0.000453 0.000167 13 19 42
ENSMUSG00000085279_Gm15965 -4.21114 1.41E-33 1.34E-30 -6.49343 3.47E-24 4.29E-21 -6.53916 7.18E-05 2.31E-05 43
ENSMUSG00000078686_Mup9 -3.77538 8.13E-32 6.19E-29 -4.7123 5.65E-22 5.47E-19 -4.58514 1.71E-13 23 44
ENSMUSG00000093337_Mir5109 2.810771 9.06E-36 9.20E-33 3.112215 3.04E-32 6.16E-29 3.311894 0.000751 0.000244 15 11 22 48
ENSMUSG00000049517_Rps23 -2.38227 3.15E-44 4.37E-41 -2.45472 9.73E-29 1.67E-25 -2.25997 0.001089 0.000337 25 49
Guo, Y., et al., MultiRankSeq: Multiperspective Approach for RNAseq Differential Expression Analysis and Quality Control. BioMed Research International, 2014. 2014: p. 8.

Presentation Using Heatmap and Cluster Zhao, S., et al., Advanced Heat Map and Clustering Analysis Using Heatmap3. BioMed Research International, 2014. 2014: p. 6.

Difference Between Heatmaps

Questions We Can Answer with Cluster Microarray data quality checking: Do replicates cluster together? Do similar conditions, time points, and tissue types cluster together?

Presentation Using Volcano Plot

Presentation Using Circos Plot

Test Your Hypothesis Without Performing Any Analysis GEO http://www.ncbi.nlm.nih.gov/geo/

Test Your Hypothesis Without Performing Any Analysis

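Functional Analysis Suppose that in a study we are trying to find out whether the proportion of smoking individuals is significantly different between men and women. [Figure: sample space of n individuals divided into M and F, with a Smoking region; the four resulting parts are labelled a, b, c, d.]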

Fisher’s Exact Test
             Male    Female   Total
Smoking      a       b        a + b
Nonsmoking   c       d        c + d
Total        a + c   b + d    a + b + c + d = n
H0: the proportion of smokers among males equals the proportion of smokers among females.
H1: the proportion of smokers among males differs from the proportion of smokers among females.
http://www.graphpad.com/quickcalcs/contingency1.cfm
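
For a 2x2 table with fixed margins, the count a follows a hypergeometric distribution under H0, and the two-sided p-value sums the probabilities of all tables that are no more likely than the observed one. The sketch below implements that rule with the standard library only; the counts in the example are made up.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1 = a + b   # smokers
    col1 = a + c   # males
    total_tables = comb(n, col1)

    def probability(k: int) -> float:
        # Probability that exactly k of the col1 males are smokers under H0.
        return comb(row1, k) * comb(n - row1, col1 - k) / total_tables

    lowest = max(0, col1 - (n - row1))
    highest = min(row1, col1)
    observed = probability(a)
    # Sum every table at least as extreme as the observed one
    # (small tolerance guards against floating-point ties).
    p_value = sum(p for p in (probability(k) for k in range(lowest, highest + 1))
                  if p <= observed * (1 + 1e-7))
    return min(1.0, p_value)


# Hypothetical counts: 3 smoking males, 1 smoking female, 1 and 3 nonsmokers.
p = fisher_exact_two_sided(3, 1, 1, 3)
print(p)
```

Statistical packages offer the same test with more options (one-sided alternatives, odds ratio estimates); this version is only meant to show where the p-value comes from.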

Fisher’s Exact Test – in Functional Analysis
All Genes                  Winner Genes   Non Winner Genes
Breast Cancer Genes        a              b
Non Breast Cancer Genes    c              d
[Figure: Venn diagram of Winner Genes and Breast Cancer Genes; the parts are labelled a, b, c, d.]

Analogy There are 18000 Balls: 200 + 17800 in a box. Blindfolded, you randomly draw 100 balls. What is the probability that you draw less than 50
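
Drawing balls without replacement is described by the hypergeometric distribution, which is also what Fisher's exact test uses for gene lists. The sketch below assumes the question is about how many of the 100 drawn balls come from the group of 200; that reading is an assumption, since the slide text is cut off.

```python
from math import comb


def hypergeom_pmf(k: int, population: int, marked: int, draws: int) -> float:
    """Probability of drawing exactly k marked balls in `draws` draws without replacement."""
    if k < 0 or k > draws or k > marked or draws - k > population - marked:
        return 0.0
    return (comb(marked, k) * comb(population - marked, draws - k)
            / comb(population, draws))


def hypergeom_cdf_below(limit: int, population: int, marked: int, draws: int) -> float:
    """Probability of drawing fewer than `limit` marked balls."""
    return sum(hypergeom_pmf(k, population, marked, draws) for k in range(limit))


# Numbers from the analogy: 18000 balls, 200 of one kind, 100 drawn.
p_fewer_than_50 = hypergeom_cdf_below(50, population=18000, marked=200, draws=100)
print(p_fewer_than_50)

# Expected number of marked balls in the draw, for comparison.
print(100 * 200 / 18000)
```

In the functional analysis setting, the box is the set of all genes, the marked balls are the genes in a pathway or disease set, and the draw is the winner list.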

WebGestalt http://bioinfo.vanderbilt.edu/webgestalt/

Gene Set Enrichment Analysis KS test based analysis (Ref) GSEA does not need a winner list first http://www.broadinstitute.org/gsea/index.jsp

SNV and Indel Difficult due to a high false positive rate. RNAMapper (Miller, et al. Genome Research, 2013), SNVQ (Duitama, et al. BMC Genomics, 2013), FX (Hong, et al. Bioinformatics, 2012), OSA (Hu, et al. Bioinformatics, 2012)

Microsatellite instability Examples: Yoon, et al. Genome Research 2013 Zheng, et al. BMC Genomics, 2013

RNA Editing and Allele-specific expression RNA editing tools and databases: DARNED, REDidb, dbRES, RADAR. Allele-specific expression: asSeq (Sun, et al. Biometrics, 2012), AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)

Exogenous RNA Virus (Same as DNA) Food RNA (you are what you eat) Wang, et al. PLOS ONE, 2012

Noncoding RNA