Published by Kimberly Cannon. Modified over 8 years ago.
1
Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation?
Jen Taylor
Bioinformatics Team, CSIRO Plant Industry
2
CSIRO. Newton Meeting July 2010 - Sequence coverage
Assumptions
  Every k-mer has an equal chance of being sequenced
3
Read density
4
Deviations from Assumptions?
5
Impacts on read coverage - Outline
  Sample preparation
    MNase digestion
  Alignment
    Parameter choices
      Mismatches
      Multiple read mappings
  Hamming edit distances and k-mer space
6
Assumptions: Digestion
Panels: Illumina; SOLiD
http://seq.molbiol.ru/sch_lib_fr.html
7
ChIPSeq
Diagram labels: MNase; Linker; Digest; Sequence & Align; Remove Nucleosomes
8
ChIPSeq - Nucleosome
  Sample: MNase digested, size fractionated
  Control: MNase digested, random sizes
9
araTha9 aligned reads: 36-mer monomer composition
10
araTha9 aligned reads: monomer composition at the 5' end +/- 16 bp
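A per-position monomer composition of this kind could be computed roughly as follows. This is only a sketch: the function name positional_composition is hypothetical, and it assumes the windows around each read's 5' end (for example 16 bp either side) have already been extracted from the reference as equal-length strings. The slides do not say how the original plots were produced.

```python
from collections import Counter


def positional_composition(windows):
    """For each position, return the fraction of A, C, G and T across all windows.

    `windows` is a list of equal-length sequences, e.g. reference sequence
    spanning the 5' end of each aligned read (an assumption of this sketch).
    """
    if not windows:
        return []
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    profile = []
    for i in range(length):
        counts = Counter(w[i].upper() for w in windows)
        # Ignore ambiguous bases such as N when normalising.
        total = sum(counts[b] for b in "ACGT")
        profile.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return profile
```

A flat profile would be expected if read starts were independent of sequence; a skew near the 5' end is what a digestion preference would look like.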
11
MNase Site Preferencing
Flick et al., J. Mol. Biology 1986
12
araTha9 Control MNase Site Preferencing

Sequence   Occurrences   Sequence starts   Preference (%)
ctataggg           499               245            49.10
taataggg           864               424            49.10
gtattagg          1044               253            24.23
tctttgct          4902               425             8.67
cacattac          1807                52             2.88
tcccagac           695                20             2.88
aaacaaca         10083               159             1.58
acacgagc           810                 2             0.25
tttgtttt         32186                35             0.19
tttgcata          4602                 5             0.11
ttggttta          7671                 1             0.01
gaggtttt          3926                 0             0
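For most rows of the table, the Preference (%) column is consistent with sequence starts divided by occurrences, times 100. The slide does not state the formula, so the sketch below treats that reading as an assumption; site_preference is a hypothetical helper name.

```python
def site_preference(occurrences, sequence_starts):
    """Percentage of a k-mer's genomic occurrences at which a read starts.

    Assumed definition: 100 * sequence_starts / occurrences.
    """
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return 100.0 * sequence_starts / occurrences


# A few rows from the table: (sequence, occurrences, sequence starts)
rows = [
    ("ctataggg", 499, 245),
    ("gtattagg", 1044, 253),
    ("tctttgct", 4902, 425),
    ("gaggtttt", 3926, 0),
]
for sequence, occurrences, starts in rows:
    print(sequence, round(site_preference(occurrences, starts), 2))
```

Under the uniform-coverage assumption every 8-mer would show a similar preference, so a spread from near 50% to 0% is the deviation of interest.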
13
ChIPSeq
Diagram labels: MNase; Digest; Sequence & Align; Remove Nucleosomes
14
araTha9 Control MNase Site Preferencing
(same table as on slide 12)
15
Nucleosome potentials - Read Density
Plot: normalised read density against base coordinate (scale bar: 1 Kb)
16
Nucleosome potentials
Plot: MNase potential and normalised read density
17
Nucleosome potentials
Plot: MNase potential and normalised read density
18
Nucleosome potential
19
MNase biases aiding interpretation?
  Can aid identification in a local sequence?
  Dependent upon local sequence context
  Cautionary tale about analysing sequence contexts of ChIPSeq data
    Nucleotide composition analyses must take digestion preferencing into account
20
Impacts on read coverage - Outline
  Sample preparation
    MNase digestion
  Alignment
    Parameter choices
      Mismatches
      Multiple read mappings
  Hamming edit distances and k-mer space
21
Hamming Edit Distances
  Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k
  For all possible k-mers (36, 65) in the Arabidopsis genome
    All vs. all, both strands
    Minimum HE distance

Target sequence          CGTACATGC
Probe sequence           CGTTCAGGC
Substitution required    NNNYNNYNN
Hamming                  2
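The definition and the all-vs-all, both-strand minimum can be sketched as below. This is a brute-force illustration for toy sequences only, not the method used for the Arabidopsis genome (which the slides do not describe). The function names are hypothetical, and excluding a k-mer's own locus on both strands is an assumption of this sketch.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an upper-case A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_distances(genome, k):
    """For each k-mer locus, the minimum Hamming distance to any other locus.

    Every other locus is compared on both strands. Quadratic in the number
    of k-mers, so only suitable for small examples.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    revcomps = [reverse_complement(s) for s in kmers]
    result = []
    for i, query in enumerate(kmers):
        best = None  # stays None if there is no other locus to compare with
        for j in range(len(kmers)):
            if j == i:
                continue  # assumption: skip the k-mer's own locus
            d = min(hamming(query, kmers[j]), hamming(query, revcomps[j]))
            if best is None or d < best:
                best = d
        result.append(best)
    return result


# The example from the slide: two substitutions are required.
print(hamming("CGTACATGC", "CGTTCAGGC"))  # 2
```

A minimum distance of 0 marks a k-mer that occurs more than once; larger minima indicate how many mismatches a read could carry before another locus becomes equally plausible.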
22
Arabidopsis minimum Hamming edit distances: 36-mer
23
Alignment issues
Plot: hg18, dm3, araTha9, ce6, sacCer6 (axis from 0 to 14)
24
Alignment artefacts: aligner properties
Columns: Mismatch; Read length; Genome pre-processing; Reads pre-processing; Uses quality score; Reports unmapped reads; Multithread

Aligner    Mismatch   Remaining entries
SOAP       0-5        60
SOAP2      0-5        1 ?
Maq        1-3        2 ?
Bowtie     0-3        3 1024
Ubsalign   0-20       1024 4 4 5
25
Breakdown of sequencing run
26
Hamming edits and Ubsalign: HE difference
Read:        AGATTAGCCTGGTACTGCTA
Site 1: .....AGCTTAGCCTGGTACTGGTA....   Hamming distance 2 (H)
Site 2: .....AGCTTAGCCGGGTACTGGTA....   Hamming distance 3
Result: No alignment
27
Hamming edits and Ubsalign: HE difference
Read:        AGATTAGCCTGGTACTGCTA
Site 1: .....AGATTAGCCTGGTACTGCTA....   Hamming distance 0 (H)
Site 2: .....AGCTTAGCCGGGTACTGCTA....   Hamming distance 2
Result: No alignment
28
Hamming edits and Ubsalign: HE difference
Read:        AGATTAGCCTGGTACTGCTA
Site 1: .....AGCTTAGCCTGGTACTGCTA....   Hamming distance 1 (H)
Site 2: .....AGCTTAGCCGGGTTCTGGTA....   Hamming distance 4
Result: Alignment!
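The three examples suggest a rule in which a read is placed only when its best site is clearly better than the next best one. The sketch below is one way to express that; it is not Ubsalign's actual implementation. The function accept_alignment is hypothetical, and the threshold of 3 is an assumption: the slides do not state it, and 3 is simply a value that reproduces the three outcomes shown (differences of 1 and 2 rejected, 3 accepted).

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


def accept_alignment(read, candidate_sites, min_difference=3):
    """Accept the best site only if the runner-up is sufficiently worse.

    min_difference=3 is an assumed threshold, not a documented one.
    """
    distances = sorted(hamming(read, site) for site in candidate_sites)
    if not distances:
        return False
    if len(distances) == 1:
        return True
    return distances[1] - distances[0] >= min_difference


read = "AGATTAGCCTGGTACTGCTA"
examples = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # distances 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # distances 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # distances 1 and 4
]
for sites in examples:
    print(accept_alignment(read, sites))
```

Note that under such a rule a perfect match (distance 0) can still be rejected when a second site lies only two substitutions away, while a read with one mismatch is accepted when the alternative is much further off.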
29
Testing Aligner Accuracy
  Simulated reads
    Known correct location
    25 million, 50 million
    Perfect match, up to 5 mismatches, up to 10 mismatches
    Error 3' bias
  Numbers of:
    correctly aligned reads
    incorrectly aligned reads
    unalignable reads
  Speed
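A small-scale version of this test design might look like the sketch below. Everything here is illustrative: simulate_reads and score_alignments are hypothetical names, the linear weighting of errors towards the 3' end is an assumed error model (the slides only say the errors are 3' biased), and real runs would use millions of reads and an actual aligner in place of the reported positions.

```python
import random


def simulate_reads(genome, n_reads, read_len, max_mismatches, rng):
    """Sample reads with known start positions and 3'-biased substitutions.

    Assumes an upper-case A/C/G/T genome. Returns (true_start, sequence) pairs.
    """
    if max_mismatches > read_len:
        raise ValueError("max_mismatches cannot exceed read_len")
    # Assumed error model: substitution probability grows linearly towards the 3' end.
    weights = [i + 1 for i in range(read_len)]
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        seq = list(genome[start:start + read_len])
        n_errors = rng.randint(0, max_mismatches)
        positions = set()
        while len(positions) < n_errors:
            positions.add(rng.choices(range(read_len), weights=weights)[0])
        for p in positions:
            seq[p] = rng.choice([b for b in "ACGT" if b != seq[p]])
        reads.append((start, "".join(seq)))
    return reads


def score_alignments(true_starts, reported_starts):
    """Count correct, incorrect and unaligned reads (None means unaligned)."""
    correct = sum(1 for t, r in zip(true_starts, reported_starts) if r == t)
    unaligned = sum(1 for r in reported_starts if r is None)
    incorrect = len(true_starts) - correct - unaligned
    return {"correct": correct, "incorrect": incorrect, "unaligned": unaligned}
```

Because the true location of every read is recorded at simulation time, the three counts listed on the slide follow directly from comparing it with whatever position the aligner reports.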
30
Alignment artefacts: Managing mismatch thresholds
31
Alignment artefacts: Managing mismatch thresholds
32
How does this affect interpretation?
  Incorporation of edit differentials
    Leads to gains in the number of alignable reads
    Increased information
    Determination of the alignment
    Gains of 5-10% in mappable sites
  Hamming edit distributions provide useful information
  Impact of MNase digestion on short read sequence coverage
33
Hamming distance variability
34
Read Deserts
35
Read Deserts
36
Sequence deserts
37
Impacts on read coverage - Conclusions
  Sample preparation
    MNase digestion
      Local biases present
  Alignment
    Parameter choices
      Mismatches: thresholds are generally too low relative to the uniqueness of k-mers in the genome
      Multiple read mappings: can drive 'absence' of mapped reads
  Hamming edit distances and k-mer space
    K-mers have unique and genome-specific properties
    Can be used to inform results of alignment
38
Acknowledgements
  CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James
  CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis
  CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison
  CMIS / TBCP: Paul Greenfield
39
Paired end data - sample preparation
Diagram labels: C, G, A, T, insert
40
Control and sample read density
Panels: Control; Sample