High-Throughput Sequencing


High-Throughput Sequencing. Advanced Microarray Analysis, BIOS 691-803, 2008. Dr. Mark Reimers, VCU

Quantitative HTS - Outline: Technology. Preprocessing. Quantitative analysis. Applications: ChIP-Seq, RNA-Seq, Methyl-Seq.

The Technology: Most sequencing proceeds by addition of fluor-labeled bases. Do this in parallel on a flat surface. Capture each stage with a good camera. Align the images.

Roche - 454 Parallel Pyrosequencing on beads

Mardis, Trends in Genetics

454 Sequencing Operation

Illumina - Solexa

ABI SOLiD: Resequencing each fragment with different primers. Reconstruct each fragment separately.

Paired-End Reads

Issues: Pre-processing (base calling, mapping reads, QA). Quantitative analysis (variation and noise, biases, models, accuracy and validation).

Pre-processing – Base Calling: Not all steps are completed properly. The sequence can lag behind or skip ahead, hence most light spots are a mixture of different colors. Simple rule: use the brightest signal.
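
The "brightest signal" rule can be sketched in a few lines. This is only an illustration: the four-channel intensity tuples, the A/C/G/T channel order and the purity score are assumptions made for the example, not the vendor's base-calling pipeline.

```python
# Hypothetical input: one tuple of four channel intensities (A, C, G, T) per cycle.
BASES = "ACGT"

def call_bases(intensities):
    """Call one base per cycle by taking the brightest channel.

    Returns the called sequence and a crude purity score per cycle
    (brightest intensity divided by the total signal in that cycle).
    """
    calls = []
    purity = []
    for cycle in intensities:
        total = sum(cycle)
        best = max(range(4), key=lambda i: cycle[i])
        calls.append(BASES[best])
        purity.append(cycle[best] / total if total > 0 else 0.0)
    return "".join(calls), purity

spots = [
    (900.0, 50.0, 30.0, 20.0),     # clean A signal
    (100.0, 700.0, 150.0, 50.0),   # mostly C
    (300.0, 280.0, 250.0, 170.0),  # mixed colors, as when strands lag or skip
]
seq, purity = call_bases(spots)
print(seq, [round(p, 2) for p in purity])
```

A low purity score flags cycles where the spot is a mixture of colors and the call is least trustworthy.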

Types of mismatches in uniquely mapped tags with a single mismatch are profoundly asymmetric and biased Courtesy Thierry-Mieg

Typical Errors in Base-Calling

Position of single mismatch in uniquely mapped tags Courtesy Thierry-Mieg

Improving Base-Calling with SVM

Pre-processing – Mapping Reads: Huge numbers of reads (10M – 70M). BLAT (2002 high-speed method). Eland (proprietary, Illumina). Other new methods: MAQ, SOAP.

Quality Assessment: Fraction of reads mapping to targets. Typically 5-10M reads per lane, and 60-80% map to targets. Some repetitive sequence.

Comparing Samples - A Simple Normalization: Lanes yield different numbers of counts. Divide the counts in a region of interest (a genomic region, a gene, or an exon) by the total count for the lane, in millions (total per million reads, TPM). To compare genomic regions of different lengths, divide also by the length of the region in kilobases: TPKM (total per kilobase per million).
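
The arithmetic of this normalization, as a minimal sketch. The function names and the example numbers are made up for illustration.

```python
def per_million(count, total_reads):
    """Counts in a region scaled to a lane of one million reads (the slide's TPM)."""
    return count * 1e6 / total_reads

def per_kilobase_per_million(count, region_length_bp, total_reads):
    """Per-million counts further divided by region length in kilobases (the slide's TPKM)."""
    return per_million(count, total_reads) * 1000.0 / region_length_bp

# Hypothetical gene: 500 reads over a 2,000 bp region in a lane of 10 million reads.
tpm = per_million(500, 10_000_000)
tpkm = per_kilobase_per_million(500, 2000, 10_000_000)
print(tpm, tpkm)
```

With these numbers the gene gets 50 counts per million and 25 per kilobase per million, so a region twice as long with the same read density would score the same.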

Quant. Analysis - Variation: A Poisson model is often used for random variation. Most HTS data are ‘over-dispersed’ relative to Poisson, so the Negative Binomial is often used, with a fitted dispersion parameter.
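
One quick way to see over-dispersion is to compare the sample variance with the mean across replicates. The sketch below assumes the common Negative Binomial parameterization variance = mean + dispersion × mean² and uses a method-of-moments estimate; dedicated packages fit the dispersion more carefully, and the replicate counts are invented.

```python
from statistics import mean, variance

def dispersion_index(counts):
    """Variance-to-mean ratio: about 1 for Poisson counts, larger if over-dispersed."""
    return variance(counts) / mean(counts)

def nb_dispersion(counts):
    """Method-of-moments estimate of phi in var = mu + phi * mu**2 (floored at 0)."""
    m = mean(counts)
    v = variance(counts)
    return max(0.0, (v - m) / (m * m))

# Hypothetical counts for one gene in five replicate lanes.
replicates = [10, 30, 20, 40, 50]
print(dispersion_index(replicates), nb_dispersion(replicates))
```

Here the variance (250) is far above the mean (30), so a Poisson model would understate the noise.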

Quantitative Analysis - Biases: Not all regions are represented equally. GC-rich regions are represented more. Independent of GC, some chromosome regions are represented more (euchromatin bias). Sequence initiation site biases. ‘Mapability’ biases: some regions won’t have any uniquely mapped tags.

GC Bias Density of reads depends strongly on GC content of regions
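
A simple diagnostic for this bias is to bin genomic windows by GC fraction and compare mean read counts per bin. The window sequences, counts and the number of bins below are assumptions chosen to keep the sketch small.

```python
def gc_fraction(seq):
    """Fraction of G or C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    gc = sum(1 for b in seq if b in "GC")
    acgt = sum(1 for b in seq if b in "ACGT")
    return gc / acgt if acgt else 0.0

def mean_count_by_gc_bin(windows, n_bins=5):
    """windows: iterable of (sequence, read_count). Returns mean count per GC bin.

    Bins with no windows are reported as None.
    """
    sums = [0.0] * n_bins
    sizes = [0] * n_bins
    for seq, count in windows:
        b = min(int(gc_fraction(seq) * n_bins), n_bins - 1)
        sums[b] += count
        sizes[b] += 1
    return [s / n if n else None for s, n in zip(sums, sizes)]

# Toy windows (real windows would be hundreds of bases long).
toy_windows = [("AAAATTTT", 4), ("GGCCAATT", 10), ("GGGGCCCC", 30), ("ATATATAT", 6)]
print(mean_count_by_gc_bin(toy_windows))
```

A steady rise of the mean count with GC bin, as in the toy data, is the pattern the slide describes.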

Genomic Position Biases Count tags from randomly sheared DNA in red with GC content in blue

Start Position Bias

Consistent Start Position Bias Counts per start site in lane 1 vs lane 2

RNA-Seq

RNA-Seq Data Gene Model Kidney Reads Liver Reads From Marioni et al 2008

Accuracy of Illumina RNA-Seq

Comparing RNA-Seq & Affy: Issues: How replicable is RNA-Seq? How consistent are the two technologies? Which is better? Marioni et al, Genome Research, 2008

Comparing Fold-Changes: D.E. by ILM: red >250, green <250. Black: not D.E. by ILM.

Model for Variation: Poisson counts, hypergeometric comparison. Make p-values uniform by adding a random term. Use lower tails only.
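
One way to read this slide, offered as a sketch and not as the exact procedure used in the lecture: pool the reads from two lanes with totals n1 and n2, and ask whether the k reads of one gene split between the lanes as a random draw without replacement would. The lower-tail p-value is discrete, so a uniform random fraction of the probability at the observed count is added to make p-values uniform under the null. All names and numbers are illustrative.

```python
import random
from math import comb

def hypergeom_pmf(x, n1, n2, k):
    """P(X = x): x of a gene's k reads fall in lane 1, given lane totals n1 and n2."""
    return comb(n1, x) * comb(n2, k - x) / comb(n1 + n2, k)

def lower_tail_p(x, n1, n2, k, rng=None):
    """Lower-tail p-value P(X <= x).

    If rng is given, return the randomized version P(X < x) + U * P(X = x),
    with U uniform on [0, 1), which is uniformly distributed under the null.
    """
    below = sum(hypergeom_pmf(i, n1, n2, k) for i in range(x))
    at = hypergeom_pmf(x, n1, n2, k)
    if rng is None:
        return below + at
    return below + rng.random() * at

# Hypothetical gene: 10 reads in total, 5 of them in lane 1; both lanes have 1000 reads.
p_plain = lower_tail_p(5, 1000, 1000, 10)
p_random = lower_tail_p(5, 1000, 1000, 10, rng=random.Random(0))
print(p_plain, p_random)
```

The exact integer arithmetic is fine for small examples; for lanes with millions of reads a log-scale implementation would be more practical.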

False Positive Rates QQ-plots of p-values between tech. reps

Different Concentrations are NOT Comparable! QQ-plots of p-values between 3pM and 1.5 pM

Normalization of RNA-Seq: Robinson et al noticed that most genes appeared less expressed in liver. Fig 1 from Robinson & Oshlack, Genome Biology 2010

A Better Normalization for RNA-Seq - TMM: Drop extremes of ratios. Drop very high count genes. Compute trimmed means of samples. Center log-ratios between samples.
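
A simplified sketch of the trimmed-mean idea follows. It is an assumption-laden illustration: the trimmed mean is unweighted, the trim fractions (30% of log-ratios and 5% of abundances at each end) are defaults picked for the example, and the function name is hypothetical. It is not a reimplementation of any package.

```python
from math import log2

def tmm_factor(sample, reference, trim_m=0.3, trim_a=0.05):
    """Scaling factor for `sample` relative to `reference` (lists of gene counts).

    M = log-ratio of library-size-scaled counts, A = mean log abundance.
    Genes in the extreme tails of M or A are dropped before averaging M.
    """
    n_s, n_r = sum(sample), sum(reference)
    stats = []
    for y_s, y_r in zip(sample, reference):
        if y_s > 0 and y_r > 0:  # log-ratios need non-zero counts in both samples
            p_s, p_r = y_s / n_s, y_r / n_r
            stats.append((log2(p_s / p_r), 0.5 * log2(p_s * p_r)))
    n = len(stats)
    by_m = sorted(range(n), key=lambda i: stats[i][0])
    by_a = sorted(range(n), key=lambda i: stats[i][1])
    cut_m, cut_a = int(n * trim_m), int(n * trim_a)
    keep = set(by_m[cut_m:n - cut_m]) & set(by_a[cut_a:n - cut_a])
    if not keep:
        raise ValueError("no genes left after trimming")
    mean_m = sum(stats[i][0] for i in keep) / len(keep)
    return 2.0 ** mean_m

# Toy data: nine genes are unchanged, one gene is hugely expressed in the sample.
reference = [100] * 10
sample = [100] * 9 + [1000]
factor = tmm_factor(sample, reference)
print(factor, sum(sample) * factor)
```

In the toy data the single dominant gene inflates the sample's library size to 1900. The factor of about 0.53 brings the effective library size back to 1000, so the nine unchanged genes no longer look under-expressed, which is the liver effect the previous slide describes.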

New Things to Do with RNA-Seq: Allele-specific expression. Splice variation (between tissues, in disease). Alternate initiation sites (select 5’ capped RNA fragments). Alternate termination.

Allelic Comparison: It is possible to compare allele-specific expression counts. Sample from VCU, replicate samples. P-values for binomial tests of equality: about half show differential expression!
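
A binomial test of equal allelic expression can be written directly from its definition. This sketch assumes reads have already been assigned to a reference or alternate allele at a heterozygous site, uses the usual two-sided rule (sum the probabilities of all outcomes no more likely than the observed one), and suits modest read counts only.

```python
from math import comb

def binomial_two_sided_p(ref_count, alt_count, p=0.5):
    """Exact two-sided binomial test that the reference allele fraction equals p."""
    n = ref_count + alt_count
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_count]
    # The small tolerance guards against floating-point ties.
    total = sum(q for q in pmf if q <= observed * (1 + 1e-9))
    return min(1.0, total)

print(binomial_two_sided_p(5, 5))    # balanced alleles
print(binomial_two_sided_p(10, 0))   # all reads from one allele
```

Because over-dispersion and mapping bias toward the reference allele both inflate such tests, a large fraction of small p-values, as reported on the slide, deserves a check against replicates.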

Detecting Splice Variation Deep sequencing shows up clear variation in exon usage Wang et al Nature 2008

Tissue Map of Splice Variation (from Wang et al): Brain is most distinctive. Individuals seem to differ. Cell lines seem to have distinct splice patterns.

Splicing is Complex Many different splice operations exist Only some of these characterized by counting exon reads

Issues in Detecting Splice Variants: Counts in exons reflect biases (as yet uncharacterized) as well as actual abundance. Reads that bridge splice junctions would be definitive, but mapping is very dubious with short (<40 base) reads. Not all possible splice junctions are known, and it is hard even to search through the known ones.

Methodology for Splice Variants: Count reads mapped to exons and compare ratios across samples (Wang et al, and most others). Count reads that cross splice junctions.

Methodology for Finding Junctions

ChIP-Seq

Chromatin Immuno-precipitation

ChIP-Seq Workflow: Cross-link proteins to DNA. Fragment DNA. Extract with antibody. Reverse cross-links. Sequence fragments. DO CONTROLS!

ChIP-Seq Data From Rozowsky et al, Nature Biotech 2009

ChIP-Seq vs ChIP-chip

Peak-Finding - Simple: Extend tags and count overlap. How much to extend?
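
The simple approach can be sketched as follows: extend each tag in its strand direction by an assumed fragment length, accumulate coverage, and report runs where coverage reaches a threshold. The tag representation (0-based 5' position plus strand), the extension length and the threshold are all assumptions for illustration.

```python
def coverage_from_tags(tags, genome_length, extend=200):
    """Per-base coverage after extending each tag `extend` bases from its 5' end.

    tags: iterable of (start, strand), where start is the 0-based 5' end of the read.
    """
    diff = [0] * (genome_length + 1)
    for start, strand in tags:
        if strand == "+":
            lo, hi = start, start + extend
        else:
            lo, hi = start - extend + 1, start + 1
        lo, hi = max(lo, 0), min(hi, genome_length)
        if lo < hi:
            diff[lo] += 1   # coverage rises at the interval start
            diff[hi] -= 1   # and falls just past its end
    cov, depth = [], 0
    for d in diff[:genome_length]:
        depth += d
        cov.append(depth)
    return cov

def call_peaks(cov, min_depth):
    """Half-open intervals (start, end) where coverage is at least min_depth."""
    peaks, start = [], None
    for pos, depth in enumerate(cov):
        if depth >= min_depth and start is None:
            start = pos
        elif depth < min_depth and start is not None:
            peaks.append((start, pos))
            start = None
    if start is not None:
        peaks.append((start, len(cov)))
    return peaks

toy_tags = [(10, "+"), (15, "+"), (44, "-"), (80, "+")]
cov = coverage_from_tags(toy_tags, genome_length=100, extend=20)
print(call_peaks(cov, min_depth=2))
```

The answer depends directly on `extend`, which is the slide's question: too short and tags from the two strands never overlap, too long and neighboring sites merge.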

Peak Finding – Better: Tags starting on opposite strands are likely to start at opposite ends. Identifying the cross-over point leads to improved accuracy.

The Value of Controls: ChIP vs. Control Reads. Red dots are windows containing ChIP peaks; black dots are windows containing control peaks, used for FDR calculation.

Cause of Variation in Read Density: In a study of FoxA1 binding, even control reads were enriched near FoxA1 binding sites! Probably due to open chromatin near the FoxA1 binding sites. Figure: density of control channel reads around FoxA1 site. Courtesy Shirley Liu

ChIP-Seq – MACS Key Ideas: Smart peak imputation estimate (uses read directions; empirical estimate of fragment length). Local frequency estimate (using control, if available; using wide estimate, otherwise). Not using sequence.
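
The local frequency idea can be illustrated with a Poisson tail test whose expected count is the largest of several background estimates. The window widths, rates and function names here are assumptions for the sketch. It is not the MACS code.

```python
from math import exp

def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam), by summing the lower terms."""
    term = exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - cdf)

def local_lambda(control_counts, window_bp, genome_bg_rate):
    """Expected reads in a candidate window, taking the largest background estimate.

    control_counts: {width_bp: control reads in a region of that width around the window}
    genome_bg_rate: genome-wide reads per base pair.
    """
    rates = [genome_bg_rate * window_bp]
    for width, count in control_counts.items():
        rates.append(count * window_bp / width)
    return max(rates)

# Hypothetical candidate: 8 ChIP reads in a 200 bp window.
lam = local_lambda({1000: 5, 5000: 10, 10000: 12}, window_bp=200, genome_bg_rate=0.001)
print(lam, poisson_upper_tail(8, lam), poisson_upper_tail(8, 0.001 * 200))
```

Using the local maximum makes the test conservative where the control is locally enriched, for example in open chromatin, at the cost of some sensitivity.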

Read Lengths and Directions: Some clear clusters, even before stats. Reads on opposite sides of a peak map to opposite strands, hence fragments have opposite directions. Can estimate apparent fragment length.
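
The apparent fragment length can be estimated as the shift that best lines up plus-strand tag starts with minus-strand tag starts. This brute-force sketch assumes tag positions are 0-based 5' ends and scans every shift up to a chosen maximum; the positions are invented.

```python
from collections import Counter

def estimate_fragment_length(plus_starts, minus_starts, max_shift=500):
    """Shift (in bases) that maximizes matches between shifted plus and minus starts."""
    plus = Counter(plus_starts)
    minus = Counter(minus_starts)
    best_shift, best_score = 0, -1
    for shift in range(max_shift + 1):
        score = sum(n * minus.get(pos + shift, 0) for pos, n in plus.items())
        if score > best_score:
            best_shift, best_score = shift, score
    return best_shift

# Toy data: two binding sites, minus-strand tags sit 100 bases downstream of plus-strand tags.
plus_tags = [100, 102, 105, 300, 301]
minus_tags = [200, 202, 205, 400, 401]
print(estimate_fragment_length(plus_tags, minus_tags))
```

Shifting each strand's tags by half this distance toward the other strand places both sets of tags near the cross-over point.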

Fragment Lengths: Puzzle: fragments from sonication are expected to be between 200 – 500 bp, but the estimated fragment size is ~ 100 bp. Shirley Liu’s explanation: preferential cutting near to TF ??

Comparison to ChIP-chip: Broad correlation. Not a dramatic improvement in precision!

Methyl-Seq

Methylation Assays: Affinity purification, e.g. MeDIP-Seq (methylated DNA immunoprecipitation). Methylation-specific cleavage by endonucleases, e.g. Methyl-Seq, which cleaves with HpaII to identify unmethylated CCGG sites. Bisulphite conversion: WGBS (Whole-Genome Bisulphite Sequencing), and RRBS (Reduced Representation Bisulphite Sequencing), which cleaves with MspI to reduce complexity.

Affinity: MeDIP-Seq & MBD-Seq

Issues with Affinity Methods: Analysis is essentially like ChIP-Seq, BUT the sequence count reflects both the density of CpG’s and the proportions of methylation. No individual CpG-level information. Advantage: no conversion, so sequence tags are easily mappable.

Methyl-Seq: Use HpaII to cleave only at unmethylated CCGG sites. Size-select fragments (50-300). Sequence fragment ends, always starting at a CCGG. Easy to map: few possible loci (<1M). Paired ends give the actual fragment.
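
An in-silico digest shows why the mappable loci are so limited. The sketch assumes every CCGG site is unmethylated and fully cut, that the enzyme cuts between the two Cs (C^CGG), and that the 50-300 size window is in base pairs; the toy sequence is invented.

```python
def ccgg_cut_positions(seq):
    """0-based cut positions for every CCGG site (cut after the first C)."""
    seq = seq.upper()
    cuts = []
    i = seq.find("CCGG")
    while i != -1:
        cuts.append(i + 1)
        i = seq.find("CCGG", i + 1)
    return cuts

def size_selected_fragments(seq, min_len=50, max_len=300):
    """Half-open (start, end) fragments between adjacent cuts within the size window."""
    cuts = ccgg_cut_positions(seq)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if min_len <= b - a <= max_len]

toy_genome = "A" * 10 + "CCGG" + "T" * 96 + "CCGG" + "A" * 500 + "CCGG" + "T" * 20
fragments = size_selected_fragments(toy_genome)
print(ccgg_cut_positions(toy_genome), fragments)
```

Only the 100 bp fragment survives size selection. The 504 bp fragment is lost, which is also how a methylated (uncut) site would remove the fragments on either side of it from the library.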

Schematic Here

Issues for Methyl-Seq: It is a computational problem to re-assemble the actual proportions of methylation at each locus from counts. Prone to false positives because of incomplete digestion (for reasons other than methylation of the CCGG site), e.g. insufficient time … rates vary by 50-fold depending on sequence context.

WGBS: Bisulphite conversion, fragmentation and shotgun sequencing. Requires very many reads! Use of capture arrays reduces work… BUT different sequences have different capture efficiencies!

WGBS Data (from capture array): top, CHP-SKN-1; bottom, MDA-MB-231. NB: inconsistent tag numbers.

Issues with WGBS: Many C’s are lost, so reads are hard to map to the genome. The mapping strategy depends on a reduced penalty for mapping T to C. Too many loci!
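
A common way to give T-to-C mismatches no penalty is to convert every C to T in both read and reference before matching, then compare the original read with the original reference to call methylation. The sketch below handles exact matches on the forward strand only and uses an invented reference; real aligners also handle the reverse strand, mismatches and indels.

```python
def c_to_t(seq):
    """In-silico bisulphite conversion: every C becomes T."""
    return seq.upper().replace("C", "T")

def map_bisulphite_read(read, reference):
    """All 0-based positions where the converted read matches the converted reference."""
    conv_ref, conv_read = c_to_t(reference), c_to_t(read)
    hits = []
    i = conv_ref.find(conv_read)
    while i != -1:
        hits.append(i)
        i = conv_ref.find(conv_read, i + 1)
    return hits

def methylation_calls(read, reference, pos):
    """For each reference C covered by the read: True if read has C, False if T."""
    calls = {}
    window = reference.upper()[pos:pos + len(read)]
    for offset, (r, g) in enumerate(zip(read.upper(), window)):
        if g == "C" and r in "CT":
            calls[pos + offset] = (r == "C")
    return calls

toy_reference = "AACGTTCGGACCTA"
toy_read = "CGTTTGGA"  # the C at position 2 stayed C, the C at position 6 was converted
hits = map_bisulphite_read(toy_read, toy_reference)
print(hits, methylation_calls(toy_read, toy_reference, hits[0]))
```

Conversion reduces the reference to an effectively three-letter alphabet, so more reads match several loci, which is one reason mapping is harder than for unconverted tags.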

RRBS: Too many methylation sites in the genome. Cleave with MspI and size-select in order to reduce the number of fragments. Convert C (but not mC) to T with bisulphite. Then sequence fragments: 1.4 M fragments.

Issues with RRBS: Fairly broad but not complete coverage of ‘interesting’ regions of the genome. Bisulphite conversion of limited regions means mapping is fairly easy. Bisulphite conversion is not always complete.

Meta-Genomics

What is Meta-genomics? Sequencing random fragments of DNA from all microbial denizens of a community (and traces of a few others). Sometimes used more broadly for surveys of microbial diversity based on sequencing all 16S rRNA genes present.

Kinds of Questions: What is out there? Most microbial species are not known. What metabolic fluxes occur in any environment? What microbes are associated with specific conditions, including disease or health? Human Microbiome Project.

Environmental Meta-Genomics

Human Microbiome Project

Data Analysis Issues – 16S rRNA: Identification of microbes: most are unknown and un-culturable. Distinguishing errors in sequencing from novel microbes. Biases in sequencing.

Data Analysis Issues - Metagenomics: Mapping and characterizing unknown protein sequences (usually assume conservation). Full coverage allows assembly of genomes. Counting: biases probably smaller (Bork).