Data Analysis for High-Throughput Sequencing

Data Analysis for High-Throughput Sequencing
Mark Reimers, Tobias Guennel
Department of Biostatistics

Unto the Frontiers of Ignorance
“I love the way this workshop starts off with things we understand fairly well and works up to the cutting edge of things we don’t understand at all” - Mike Neale, Oct 14, 2010

The New Boyfriend/Girlfriend

Where Does HTS Really Make the Difference?
Sequencing for novel variants
ChIP-Seq for DNA-binding proteins or less common histone marks
Allele-specific expression
COMING SOON
DNA methylation

Outline
Biases in reads
RNA-Seq: normalization, basic tests, differential splicing
Finding peaks in ChIP-Seq

Technical Biases – Sequence Start
The initial bases of reads are highly biased, and the bias depends on the RNA/DNA preparation.

Sequence Biases – K-mers Differ
Schroeder et al. (PLoS One, 2010) calculated proportions of words (k-mers) starting at various positions.
[Figure: k-mer proportions by position, with the expected frequencies if bases were random]
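
The tabulation described on this slide can be sketched as follows. This is a minimal illustration, assuming reads are available as plain strings; the function name `kmer_proportions_at` and the toy reads are hypothetical and are not taken from Schroeder et al.

```python
from collections import Counter

def kmer_proportions_at(reads, k, position):
    """Proportion of each k-mer that starts at a 0-based position across reads."""
    counts = Counter(
        read[position:position + k]
        for read in reads
        if len(read) >= position + k  # skip reads too short to contain the k-mer
    )
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# Toy reads, made up for illustration only.
reads = ["ACGTAC", "ACGGTA", "TCGTAC", "ACTTAG"]
observed = kmer_proportions_at(reads, k=2, position=0)

# If bases were uniform and independent, every 2-mer would have this frequency.
expected_if_random = 1 / 4 ** 2
```

Comparing `observed` with `expected_if_random` at each start position shows where the read starts depart from random composition.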

Position of single mismatch in uniquely mapped tags
Courtesy Jean & Danielle Thierry-Mieg

Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased.
Courtesy Jean & Danielle Thierry-Mieg

Technical Biases – Initiation Sites
COX1

Different Platforms Have Different Biases
Harismendy et al. (Genome Biology, 2009) sequenced a genomic region from 4 HapMap individuals on Roche 454, on Illumina, and on SOLiD.
454 had the most even coverage.

Initiation Biases Dwarf Splicing
Counts of reads along the gene APOE in different tissues, from Wold lab data: (a) brain, (b) liver, (c) skeletal muscle.

Variation in Technical Biases
Sometimes the initial base biases change substantially.
Most base proportions change together: one principal component (PC) explains 95% of the variation.
In most preparations the initiation site biases change by a few percent.
In a few preparations the initiation site biases change by ~20%-30%.
This may have consequences for representation in ChIP-Seq assays.

RNA-Seq Data Analysis

Biases in Proportions
Fragments compete for real estate on the lane.
If a few dozen genes are highly expressed in one tissue, they will competitively inhibit the sequencing of other genes, resulting in what appears to be lower expression of those genes.
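
A small worked example may make the competition effect concrete. The gene names and molecule numbers below are invented for illustration; the only assumption is that reads are sampled in proportion to the molecules present.

```python
def proportions(molecules_by_gene):
    """Share of the lane each gene receives if reads are sampled proportionally."""
    total = sum(molecules_by_gene.values())
    return {gene: n / total for gene, n in molecules_by_gene.items()}

# Hypothetical molecule numbers: genes A and B are unchanged between tissues,
# while gene H is highly expressed only in tissue 2.
tissue1 = {"A": 100, "B": 100, "H": 0}
tissue2 = {"A": 100, "B": 100, "H": 800}

p1 = proportions(tissue1)
p2 = proportions(tissue2)

# Gene A has the same true abundance in both tissues, yet its share of reads drops.
apparent_fold_change_A = p2["A"] / p1["A"]
```

Here gene A goes from half of the reads to a tenth of them, so it appears five-fold lower in tissue 2 although its abundance did not change.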

Effects of Competition (Robinson & Oshlack, Genome Biology, 2010)

A Simple Normalization
Align the medians of the housekeeping genes, or of the genes that are not expressed at very high levels in any sample, across the samples.
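
The median alignment above might be sketched like this. The function name `median_scale_factors`, the `max_count` cutoff for "very high" expression, and the toy counts are all assumptions made for the example, not values from the slides.

```python
from statistics import median

def median_scale_factors(counts, max_count=1000):
    """counts: dict of sample -> dict of gene -> read count.

    Returns one multiplicative factor per sample that aligns the medians
    of the reference genes across samples.
    """
    samples = list(counts)
    genes = set.intersection(*(set(counts[s]) for s in samples))
    # Reference set: detected in every sample, never above the (assumed) cutoff.
    reference = [
        g for g in genes
        if all(0 < counts[s][g] <= max_count for s in samples)
    ]
    medians = {s: median(counts[s][g] for g in reference) for s in samples}
    target = median(medians.values())
    return {s: target / medians[s] for s in samples}

# Toy counts: "hi" is very highly expressed in brain and is excluded from the reference set.
counts = {
    "brain": {"g1": 10, "g2": 20, "g3": 30, "hi": 50000},
    "liver": {"g1": 20, "g2": 40, "g3": 60, "hi": 10},
}
factors = median_scale_factors(counts)
```

Multiplying each sample's counts by its factor brings the reference medians to a common value. If no gene passes the filter, `median` raises `StatisticsError`, which is a reasonable signal that the cutoff needs revisiting.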

A Simple Model for Counts
Poisson distribution of counts within a gene, with mean proportional to Np.
SD of variation equal to the square root of Np.
Problem: the actual variation of counts between replicate samples is significantly higher than root Np, probably reflecting systematic biases.
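
One quick way to see the problem is to compare the variance of replicate counts with their mean, since a Poisson model implies the two are equal (both Np, reading N as the total reads in a lane and p as the gene's share, an interpretation the slide does not spell out). The replicate counts below are made up, and the sketch assumes lanes of equal depth.

```python
from statistics import mean, variance

def dispersion_index(replicate_counts):
    """Sample variance divided by mean; close to 1 under a Poisson model."""
    return variance(replicate_counts) / mean(replicate_counts)

# Made-up counts for one gene in four replicate lanes of equal depth.
replicates = [90, 110, 70, 130]
index = dispersion_index(replicates)

# SD predicted by the Poisson model for this mean, for comparison.
poisson_sd = mean(replicates) ** 0.5
```

For these numbers the index is well above 1, which is the over-dispersion pattern the slide describes.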

Hacks for Over-Dispersion
Like the λ fudge-factor in GWAS.
Use a negative binomial model.
There is no relation to the meaning of the distribution (numbers of nulls until something happens).
It is a convenient way to parametrise over-dispersion.
Bioconductor package edgeR estimates parameters by Maximum Likelihood.
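
To illustrate how the negative binomial parametrises over-dispersion, the sketch below uses the common mean-variance form Var = mu + phi * mu^2 and estimates phi by the method of moments. This is deliberately not edgeR's maximum-likelihood procedure; it is a simpler stand-in, and the function name and counts are hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(replicate_counts):
    """Method-of-moments estimate of phi in Var = mu + phi * mu**2.

    phi = 0 recovers the Poisson case; estimates are floored at 0.
    """
    mu = mean(replicate_counts)
    s2 = variance(replicate_counts)
    return max(0.0, (s2 - mu) / mu ** 2)

# Made-up counts for one gene in four replicate lanes of equal depth.
replicates = [90, 110, 70, 130]
phi = moment_dispersion(replicates)
```

With only a few replicates, a per-gene moment estimate like this is very noisy, which is one reason to prefer a package that estimates the parameters more carefully.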

Alternate Transcripts: Splicing Index
For each exon, the proportion of transcripts in which the exon appears.
Hard to estimate because different exons have different representation probabilities.
Use ratios of exons.
Use constitutive exons (if known) as baseline: for them SI=1.
From Wang et al, Nature, 2008
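
A rough sketch of the ratio idea, assuming read density (reads per base) is a usable proxy for exon representation. That assumption is exactly what the slide warns about, so this is an illustration of the arithmetic rather than the estimator used by Wang et al. The function name and numbers are hypothetical.

```python
def splicing_index(exon_reads, exon_length, constitutive):
    """Read density of an exon relative to the pooled density of constitutive exons.

    constitutive: list of (reads, length) pairs for the baseline exons.
    A constitutive exon with average density gets an index near 1.
    """
    baseline_reads = sum(reads for reads, _ in constitutive)
    baseline_length = sum(length for _, length in constitutive)
    baseline_density = baseline_reads / baseline_length
    return (exon_reads / exon_length) / baseline_density

# Toy numbers: two constitutive exons and one candidate alternative exon.
constitutive = [(200, 100), (400, 200)]
si = splicing_index(exon_reads=50, exon_length=100, constitutive=constitutive)
```

Here the candidate exon has a quarter of the baseline density, suggesting it is included in roughly a quarter of transcripts, subject to the representation biases discussed earlier.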

Detecting Alternate Splicing – I
Wang et al. (Nature, 2008) measured splicing index for several tissues.

Splicing: Junction Reads
Some reads will span two different exons.
Need long enough reads to be able to reliably map both sides.
Can use information from one exon to identify the gene and restrict possibilities for the 5’ end of the other exon.
From Wang et al, NAR, 2010
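
One common way to make junction reads mappable is to align against synthetic junction sequences built from pairs of exons. The sketch below assumes that approach, which is not necessarily the one in Wang et al; `junction_sequence` and `min_overhang` are hypothetical names. Trimming each flank to `read_length - min_overhang` bases guarantees that any read lying entirely within the junction sequence covers at least `min_overhang` bases on both sides.

```python
def junction_sequence(upstream_exon, downstream_exon, read_length, min_overhang):
    """Synthetic reference spanning an exon-exon junction.

    Every read of length read_length that fits inside the result has at
    least min_overhang bases in each exon.
    """
    if not 0 < min_overhang <= read_length // 2:
        raise ValueError("min_overhang must be between 1 and read_length // 2")
    flank = read_length - min_overhang
    # Slicing simply returns the whole exon when it is shorter than the flank.
    return upstream_exon[-flank:] + downstream_exon[:flank]

# Toy exon sequences, made up for illustration.
junction = junction_sequence("AAAAACCCCC", "GGGGGTTTTT", read_length=6, min_overhang=2)
```

Longer reads allow a larger `min_overhang`, which is the sense in which read length determines how reliably both sides can be placed.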

ChIP-Seq

Courtesy Raphael Gottardo

A View of ChIP-Seq Data
Typically reads are quite sparsely distributed over the genome.
Controls (i.e. no pull-down by antibody) often show smaller peaks at the same locations, probably due to open chromatin at promoters.
Rozowsky et al, Nature Methods, 2009

Always Have a Control
High correlation between peaks in control samples and peaks in the ChIP sample.
Must subtract an estimate of background derived from control tags.
From Zhang et al, Genome Biol, 2008
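
A very simple version of background subtraction, assuming tags have already been counted in matching genomic windows for ChIP and control. Scaling the control by total depth is a crude assumption made for this sketch and is not the method of Zhang et al; the function name and counts are hypothetical.

```python
def background_subtracted(chip_counts, control_counts):
    """Per-window ChIP counts minus depth-scaled control counts, floored at 0."""
    # Scale the control so both libraries have the same total tag count.
    scale = sum(chip_counts) / sum(control_counts)
    return [
        max(0.0, chip - scale * control)
        for chip, control in zip(chip_counts, control_counts)
    ]

# Toy window counts: the second window has a ChIP peak and a smaller control peak.
chip = [2, 30, 4, 4]
control = [1, 5, 2, 2]
signal = background_subtracted(chip, control)
```

Only the second window retains signal after subtraction, even though the control also peaks there.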

Locating Binding Sites
Use the fact that reads on opposite sides of the site are sequenced in opposite senses.
From Zhao et al, NAR, 2009
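
The strand asymmetry can be exploited with a simple shift search: forward-strand tags pile up on one side of the site and reverse-strand tags on the other, so the offset that best aligns the two profiles estimates the fragment span, and the site lies about halfway between the two peaks. This is a minimal sketch under those assumptions, not the algorithm of Zhao et al; the names and binned counts are hypothetical.

```python
def best_strand_shift(forward, reverse, max_shift):
    """Shift d (in bins) that maximizes sum(forward[i] * reverse[i + d])."""
    def score(d):
        return sum(
            f * reverse[i + d]
            for i, f in enumerate(forward)
            if i + d < len(reverse)
        )
    return max(range(max_shift + 1), key=score)

# Toy binned tag counts around one site: forward tags upstream, reverse tags downstream.
forward = [0, 5, 9, 5, 0, 0, 0, 0, 0, 0]
reverse = [0, 0, 0, 0, 0, 0, 5, 9, 5, 0]

shift = best_strand_shift(forward, reverse, max_shift=8)

# Place the site halfway between the forward peak and the reverse peak.
site = forward.index(max(forward)) + shift / 2
```

In practice the shift would usually be estimated from many strong peaks at once and then applied genome-wide, since a single sparse peak gives a noisy estimate.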