Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor Bioinformatics Team CSIRO.

Presentation transcript:

Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor, Bioinformatics Team, CSIRO Plant Industry

CSIRO. Newton Meeting July. Sequence coverage. Assumptions: every k-mer has an equal chance of being sequenced.
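
The equal-chance assumption can be made concrete. If read starts are uniform across the genome, mean per-base depth is c = N * L / G (N reads of length L, genome of length G) and depth is approximately Poisson distributed. The sketch below is only an illustration of that model; the function names and the toy numbers are assumptions, not values from the talk.

```python
import math
import random

def expected_depth(n_reads, read_len, genome_len):
    """Mean per-base depth under uniform sampling: c = N * L / G."""
    return n_reads * read_len / genome_len

def poisson_zero_fraction(mean_depth):
    """Fraction of bases expected to be uncovered if depth is Poisson."""
    return math.exp(-mean_depth)

def simulate_depth(n_reads, read_len, genome_len, seed=0):
    """Place reads at uniformly random starts and return per-base depth."""
    rng = random.Random(seed)
    depth = [0] * genome_len
    for _ in range(n_reads):
        start = rng.randrange(genome_len - read_len + 1)
        for pos in range(start, start + read_len):
            depth[pos] += 1
    return depth

# Toy numbers (assumed): 5000 reads of 36 bp on a 100 kb sequence.
c = expected_depth(n_reads=5000, read_len=36, genome_len=100_000)
depth = simulate_depth(5000, 36, 100_000)
observed_mean = sum(depth) / len(depth)
uncovered = sum(1 for d in depth if d == 0) / len(depth)
```

Comparing `uncovered` with `poisson_zero_fraction(c)` gives a baseline: real data that shows far more uncovered or over-covered bases than this departs from the assumption.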

Read density

Deviations from assumptions?

Impacts on read coverage: outline. Sample preparation (MNase digestion). Alignment parameter choices (mismatches, multiple read mappings). Hamming edit distances and k-mer space.

Assumptions: digestion. Illumina, SOLiD.

ChIPSeq: MNase, Linker, Digest, Sequence & Align, Remove Nucleosomes

ChIPSeq: nucleosome. Sample: MNase digested, size fractionated. Control: MNase digested, random sizes.

araTha9 aligned reads: 36-mer monomer composition

araTha9 aligned reads: 5' +/- 16 bp monomer composition
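
A monomer composition profile of this kind can be tabulated by counting bases at each read position. The sketch below is a minimal illustration on made-up reads; `positional_composition` is a hypothetical helper, not the code used for the slides.

```python
from collections import Counter

def positional_composition(reads):
    """For each read position, return the fraction of A, C, G and T."""
    length = min(len(r) for r in reads)
    table = []
    for i in range(length):
        counts = Counter(r[i].upper() for r in reads)
        total = sum(counts[b] for b in "ACGT")  # ignore N and other symbols
        table.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return table

reads = ["ACGT", "ACGA", "TCGA", "ACTA"]  # toy reads, not data from the talk
comp = positional_composition(reads)
```

A flat profile across positions is what the equal-chance assumption predicts; a skew concentrated near the 5' end points at the preparation step rather than the genome.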

MNase site preferencing (Flick et al., J. Mol. Biology 1986)

araTha9 control: MNase site preferencing. Table columns: Sequence, Occurrences, Sequence Starts, Preference (%). Sequences listed: ctataggg, taataggg, gtattagg, tctttgct, cacattac, tcccagac, aaacaaca, acacgagc, tttgtttt, tttgcata, ttggttta, gaggtttt.

ChIPSeq: MNase, Digest, Sequence & Align, Remove Nucleosomes

Nucleosome potentials: read density [plot: normalised read density against base coordinate; 1 kb scale]

Nucleosome potentials [plot: MNase potential and normalised read density]

Nucleosome potential

MNase biases aiding interpretation? Can aid identification in a local sequence? Dependent upon local sequence context. Cautionary tale about analysing sequence contexts of ChIP-Seq data: nucleotide composition analyses must take digestion preferencing into account.

Hamming edit distances. Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k. Computed for all possible k-mers (36, 65) in the Arabidopsis genome, all vs. all, both strands, taking the minimum HE distance.
Target sequence: CGTACATGC
Probe sequence: CGTTCAGGC
Substitution required: NNNYNNYNN
Hamming: 2
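
The definition above maps directly onto a position-by-position comparison. The sketch below reproduces the slide's worked example; `hamming` and `substitution_mask` are illustrative names, not functions from any tool mentioned in the talk.

```python
def hamming(a, b):
    """Number of substitutions needed to turn sequence a into sequence b."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))

def substitution_mask(a, b):
    """Mark each position Y if a substitution is required, N otherwise."""
    return "".join("Y" if x != y else "N" for x, y in zip(a, b))

target = "CGTACATGC"
probe = "CGTTCAGGC"
distance = hamming(target, probe)
mask = substitution_mask(target, probe)
```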

Arabidopsis minimum Hamming edit distances: 36-mer

Alignment issues: hg18, dm3, araTha9, ce6, sacCer6

Alignment artefacts: aligner properties. Table columns: Mismatch, Read length, Genome pre-processing, Reads pre-processing, Uses quality score, Reports unmapped reads, Multithread. Rows as extracted: SOAP 0-5, 60; SOAP ?; Maq 1-3, 2, ?; Bowtie 0-3, 3, 1024; Ubsalign 0-20, 1024; 4, 4, 5.

Breakdown of sequencing run

Hamming edits and Ubsalign: HE difference
Read: AGATTAGCCTGGTACTGCTA
Site 1: ...AGCTTAGCCTGGTACTGGTA... HE 2
Site 2: ...AGCTTAGCCGGGTACTGGTA... HE 3
HE difference 1: no alignment

Hamming edits and Ubsalign: HE difference
Read: AGATTAGCCTGGTACTGCTA
Site 1: ...AGATTAGCCTGGTACTGCTA... HE 0
Site 2: ...AGCTTAGCCGGGTACTGCTA... HE 2
HE difference 2: no alignment

Hamming edits and Ubsalign: HE difference
Read: AGATTAGCCTGGTACTGCTA
Site 1: ...AGCTTAGCCTGGTACTGCTA... HE 1
Site 2: ...AGCTTAGCCGGGTTCTGGTA... HE 4
HE difference 3: alignment!
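
The three Ubsalign examples suggest a rule based on the gap between the best and second-best Hamming edit distance. The sketch below is a guess at that rule, not Ubsalign's implementation: `call_alignment` is a hypothetical function, and the threshold of 3 is inferred only from these examples (differences of 1 and 2 give no alignment, a difference of 3 gives an alignment).

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))

def call_alignment(read, candidate_sites, min_differential=3):
    """Return the index of the best site, or None if the call is ambiguous.

    The best site is accepted only when the runner-up is at least
    min_differential edits worse (assumed threshold, see lead-in).
    """
    ranked = sorted((hamming(read, site), i) for i, site in enumerate(candidate_sites))
    best_dist, best_index = ranked[0]
    if len(ranked) > 1 and ranked[1][0] - best_dist < min_differential:
        return None
    return best_index

read = "AGATTAGCCTGGTACTGCTA"
example_1 = call_alignment(read, ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"])
example_2 = call_alignment(read, ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"])
example_3 = call_alignment(read, ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"])
```

Under this reading the absolute number of mismatches matters less than how clearly the best site is separated from the alternatives, which is why a read with a perfect match can still go unaligned.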

Testing aligner accuracy. Simulated reads with known correct location: 25 million, 50 million. Perfect match, up to 5 mismatches, up to 10 mismatches. Error 3' bias. Measured: numbers of correctly aligned reads, incorrectly aligned reads and unalignable reads; speed.
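
A test of this kind can be sketched as follows: draw reads from known positions, add substitutions that fall more often towards the 3' end, then score an aligner's calls against the truth. Everything here is an assumption for illustration, including the linear 3' weighting, the function names, and the toy genome; the talk does not describe its simulator.

```python
import random

def simulate_read(genome, read_len, max_mismatches, rng):
    """Sample a read from a known start and add 3'-biased substitutions."""
    start = rng.randrange(len(genome) - read_len + 1)
    bases = list(genome[start:start + read_len])
    n_mm = rng.randint(0, min(max_mismatches, read_len))
    # Assumed error model: weight grows linearly towards the 3' end.
    weights = [i + 1 for i in range(read_len)]
    positions = set()
    while len(positions) < n_mm:
        positions.add(rng.choices(range(read_len), weights=weights)[0])
    for p in positions:
        bases[p] = rng.choice([b for b in "ACGT" if b != bases[p]])
    return start, "".join(bases)

def score_aligner(true_starts, called_starts):
    """Count correct, incorrect and unaligned reads (None means unaligned)."""
    tally = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for truth, call in zip(true_starts, called_starts):
        if call is None:
            tally["unaligned"] += 1
        elif call == truth:
            tally["correct"] += 1
        else:
            tally["incorrect"] += 1
    return tally

rng = random.Random(1)
genome = "".join(rng.choice("ACGT") for _ in range(1000))  # toy genome
start, read = simulate_read(genome, read_len=36, max_mismatches=5, rng=rng)
mismatches = sum(x != y for x, y in zip(read, genome[start:start + 36]))
tally = score_aligner([1, 2, 3], [1, 5, None])
```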

Alignment artefacts: managing mismatch thresholds

How does this affect interpretation? Incorporation of edit differentials: leads to gains in the number of alignable reads; increased information; determination of the alignment; gains of % in mappable sites. Hamming edit distributions provide useful information. Impact of MNase digestion on short read sequence coverage.

Hamming distance variability

Read deserts

Sequence deserts

Impacts on read coverage: conclusions. Sample preparation: MNase digestion introduces local biases. Alignment parameter choices: mismatch thresholds are generally too low relative to the uniqueness of k-mers in the genome; multiple read mappings can drive 'absence' of mapped reads. Hamming edit distances and k-mer space: k-mers have unique, genome-specific properties that can be used to inform the results of alignment.

Acknowledgements. CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James. CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis. CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison. CMIS / TBCP: Paul Greenfield.

Paired end data: sample preparation [diagram labels: C, G, A, T, insert]

Control and sample read density [plot panels: Control, Sample]