Visualising and Exploring BS-Seq Data

Slides:

Advertisements

Similar presentations

Differential Methylation Analysis

Advertisements

Authors: David N.C. Tse, Ofer Zeitouni. Presented By Sai C. Chadalapaka.

Visualising and Exploring BS-Seq Data

Statistics 100 Lecture Set 7. Chapters 13 and 14 in this lecture set Please read these, you are responsible for all material Will be doing chapters

Simon v2.3 RNA-Seq Analysis Simon v2.3.

Peter Tsai Bioinformatics Institute, University of Auckland

Lecture 4 Measurement Accuracy and Statistical Variation.

SeqMonk tools for methylation analysis

Committee Meeting April 24 th 2014 Characterizing epigenetic variation in the Pacific oyster (Crassostrea gigas) Claire Olson School of Aquatic and Fishery.

A P STATISTICS LESSON 9 – 1 ( DAY 1 ) SAMPLING DISTRIBUTIONS.

NSW Curriculum and Learning Innovation Centre Tinker with Tinker Plots Elaine Watkins, Senior Curriculum Officer, Numeracy.

LECTURE UNIT 7 Understanding Relationships Among Variables Scatterplots and correlation Fitting a straight line to bivariate data.

Cryptic Variation in the Human mutation rate Alan Hodgkinson Adam Eyre-Walker, Manolis Ladoukakis.

M11-Normal Distribution 1 1  Department of ISM, University of Alabama, Lesson Objective  Understand what the “Normal Distribution” tells you.

RNA-Seq Analysis Simon V4.1.

Presenting Results Laura Biggins v1.0 1.

Data Collection and Processing (DCP) 1. Key Aspects (1) DCPRecording Raw Data Processing Raw Data Presenting Processed Data CompleteRecords appropriate.

Agresti/Franklin Statistics, 1 of 88 Chapter 11 Analyzing Association Between Quantitative Variables: Regression Analysis Learn…. To use regression analysis.

Chapter 14: Inference for Regression. A brief review of chapter 4... (Regression Analysis: Exploring Association BetweenVariables )  Bi-variate data.

Applied Quantitative Analysis and Practices LECTURE#30 By Dr. Osman Sadiq Paracha.

Biostatistics Regression and Correlation Methods Class #10 April 4, 2000.

Lecture 3 – Sep 3. Normal quantile plots are complex to do by hand, but they are standard features in most statistical software. Good fit to a straight.

Describing Data Week 1 The W’s (Where do the Numbers come from?) Who: Who was measured? By Whom: Who did the measuring What: What was measured? Where:

Looking Within Human Genome King abdulaziz university Dr. Nisreen R Tashkandy GENOMICS ; THE PIG PICTURE.

Plot type specific considerations

Differential Methylation Analysis

Simon v RNA-Seq Analysis Simon v

SeqMonk tools for methylation analysis

SeqMonk tools for methylation analysis

Anticipating Patterns Statistical Inference

Simon v1.0 Motif Searching Simon v1.0.

Clinical Calculation 5th Edition

Supplementary Figure 1 A B C.

Biases and their Effect on Biological Interpretation

1. Data Processing Sci Info Skills.

Multiple Regression Prof. Andy Field.

Results for all features Results for the reduced set of features

Statistics 200 Lecture #5 Tuesday, September 6, 2016

Chapter 2: Describing Location in a Distribution

Understanding and Validating Experimental Expectations

Genome Wide Association Studies using SNP

Very important to know the difference between the trees!

How could data be used in an EPQ?

Describing Relationships

Analysing ChIP-Seq Data

Artefacts and Biases in Gene Set Analysis

Honors Statistics Chapter 4 Part 4

CHAPTER 29: Multiple Regression*

Simon v ChIP-Seq Analysis Simon v

IB BIOLOGY INTERNAL ASSESSMENT

Exploring and Understanding ChIP-Seq data

CHAPTER 7 Sampling Distributions

Misleading Bioinformatics Mistakes, Biases, Mis-Interpretations and how to avoid them Festival of Genomics 2017.

Volume 9, Issue 1, Pages (July 2017)

Eric Samorodnitsky, Jharna Datta, Benjamin M

Business and Management Research

STAT 111 Introductory Statistics

Principal Component Analysis

What Is a Sampling Distribution?

Simon V Motif Searching Simon V

Do Now In BIG CLEAR numbers, please write your height in inches on the index card.

Artefacts and Biases in Gene Set Analysis

Honors Statistics Review Chapters 4 - 5

Normalization for cDNA Microarray Data

Exploring and Presenting Results

Figure 1. The 12 species in this study and details of the improved G4-seq method. (A) Phylogenetic representation of ... Figure 1. The 12 species in this.

Figure 1. Nanopore methylation calls are consistent with expected results and established technologies. (A) Metaplot of ... Figure 1. Nanopore methylation.

Supplementary Figure 1 A B C.

Sequence Analysis - RNA-Seq 2

Presentation transcript:

Visualising and Exploring BS-Seq Data Simon Andrews simon.andrews@babraham.ac.uk @simon_andrews v2018-04

Starting Data L001_bismark_bt2_pe.deduplicated.bam Read 1 Read 2 Read 3 Genome L001_bismark_bt2_pe.deduplicated.bam CHG_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CHG_OT_L001_bismark_bt2_pe.deduplicated.txt.gz CHH_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CHH_OT_L001_bismark_bt2_pe.deduplicated.txt.gz CpG_OB_L001_bismark_bt2_pe.deduplicated.txt.gz CpG_OT_L001_bismark_bt2_pe.deduplicated.txt.gz L001_bismark_bt2_pe.deduplicated.cov.gz

Decide early on which data to use Methylation contexts CpG: Only generally relevant context for mammals CHG: Only known to be relevant in plants CHH: Generally unmethylated Methylation strands CpG methylation is generally symmetric Normally makes sense to merge OT / OB strands

Always start by looking at your data. Think about what you expect Reads (red=for blue=rev) Calls (red=meth blue=unmeth)

Try to understand anything unusual Reduced Representation Library

Try to understand anything unusual Very messed up cDNA contaminated library

Try to understand anything unusual Around 600x average genome density Coverage Outliers

Coverage Outliers

Coverage Outliers Normally the result of mis-mapping repetitive sequences not in the genome assembly Centromeric / telomeric sequences are common Can be a significant proportion of all data Can throw off calculations of overall methylation Should be flagged and hits in those regions ignored

GC Content is most likely but Coverage Bias GC Content is most likely but others could exist

Coverage bias can lead to apparent methylation bias

Quantitating your methylation data

Where to make measures Per base Unbiased windows Targeted regions Very large number of measures Poor accuracy for individual bases Unbiased windows Tiled over whole genome Need to decide how they will be defined Targeted regions Which regions What context

Accuracy and Power Region A Region B Variation in CpG density 50% Methylation 50% Methylation Variation in CpG density Variation in coverage depth

Try to make comparable measures Observation level correlates with stability. Want to try to have similar amounts of data in each measurement window. Equalises noise for visualisation and power for analysis.

Unbiased Analysis Fix the amount of data in each window Fixed number of CpGs per window Allow the resolution to vary 50 CpG window lengths

Targeted Quantitation Measure over features CpG islands Be careful where you get your locations Try to fix sizes Promoters Should probably split into CpG island and non-CpG island Gene bodies Filter by biotype to remove small RNA genes?

How to Quantitate methylation calls Percentage methylation (Methylated calls / Total Calls) * 100 = meth = unmeth (6/10) * 100 = 60% methylated

Assigning a % methylation value to a region can be difficult. Total methylated calls = 15 Total unmethylated calls = 10 Methylation level = (15/(15+10))*100 = 60%

You get different answers quantitating per base or per region Percentage methylation from all calls independently = 46% Percentage methylation from mean methylation per base = 80%

Coverage differences can look like methylation differences 67% 43% Common = 60% in both

Coverage differences aren’t just a theoretical concern – they affect real data p=3.2-7

Coverage differences aren’t just a theoretical concern – they affect real data 47% 94%

Levels of Complexity Simple Complex Percentage of all calls which are methylated Per base methylation, averaged over a region Bases excluded because of low coverage As above, but requiring the same bases to be used in each sample Doesn't scale well Complex

(Even) More Complex Methods Smoothing or regression of actual measures along a chromosome. Aims to reduce noise from sampling variation Relies on consistent linear patterns Imputation of missing values Additional normalisation or correction Will be discussed later…

Visualisation and Exploration

Use visualisation to understand the basic structure of your data before asking questions Patterning What sorts of changes in methylation do I observe along a chromosome Distributions What are the overall levels and distributions of methylation values in my samples Relationships On a global scale what is the overall relationship between methylation levels in different conditions

Visualise your quantitated data alongside the raw methylation calls.

Different representations scale to different numbers of samples.

Understand and compare your methylation distributions before formulating a question.

Plotting comparisons will identify global differences which might be interesting

Look at the data underneath and around potentially interesting points

Different representations might make the picture clearer

Different representations might make the picture clearer

Large global changes might mean that local analysis is no longer relevant

Small differences in distribution can be normalised to improve comparisons

Summary Visualisations

Trend Plots Effects at individual loci can be subtle Want to find more generalised effect Collate information across whole genome Look at the general trends Relies on the effect being consistent

Trend plot considerations Axis Scaling What measures How much context Features to use Fixed vs relative scale

Clustering

Clustering Correlation Clustering Euclidean Clustering Focusses on the differences between conditions Absolute values not important Look for similar trends Show median normalised values Euclidean Clustering Focusses on absolute differences between conditions Look for similar levels Show raw values

Clustering

Exploration Summary (1) Look at the distribution of your raw reads/calls Match expectations to the type of library Always start with an unbiased quantitation Fix the amount of data in each window Think about how to best quantitate Check the quantitation matches the raw data

Exploration Summary (2) Check the distributions of methylation values in your samples Directly compare your values to look for global differences They might be the source of the interesting biology Might spot small global differences which require normalisation Summarise trends around features Might justify targeted quantitation