Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor, Bioinformatics Team, CSIRO Plant Industry
CSIRO. Newton Meeting July Sequence coverage. Assumptions: every k-mer has an equal chance of being sequenced.
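The equal-chance assumption can be made concrete with a small sketch. It assumes read starts are uniform over the genome and uses the standard Poisson approximation for per-base depth. The function names and the example numbers (10 million 36-bp reads, a 120 Mb genome) are illustrative and not taken from the talk.

```python
import math


def expected_depth(n_reads: int, read_len: int, genome_len: int) -> float:
    """Mean per-base depth if every position is equally likely to start a read."""
    return n_reads * read_len / genome_len


def prob_uncovered(depth: float) -> float:
    """Poisson approximation: chance that a given base is covered by no read."""
    return math.exp(-depth)


# Illustrative numbers only (assumptions, not from the slides).
depth = expected_depth(10_000_000, 36, 120_000_000)
print(depth, prob_uncovered(depth))
```

Observed read density that departs strongly from this expectation is what the following slides treat as a deviation from the assumption.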
Read density
Deviations from assumptions?
Impacts on read coverage: outline. Sample preparation: MNase digestion. Alignment: parameter choices; mismatches; multiple read mappings; Hamming edit distances and k-mer space.
Assumptions: digestion. [Figure panels: Illumina, SOLiD]
ChIPSeq. [Diagram labels: MNase; linker digest; sequence and align; remove nucleosomes]
ChIPSeq, nucleosome. Sample: MNase digested, size fractionated. Control: MNase digested, random sizes.
araTha9 aligned reads: 36-mer monomer composition
araTha9 aligned reads: monomer composition, 5’ end +/- 16 bp
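A per-position monomer composition of the kind these two slides plot could be tabulated along the following lines. This is a minimal sketch: `monomer_composition` is a hypothetical helper, it works on read sequences only, and extending it to the bases upstream of the 5’ end would need the flanking genome sequence, which is not handled here.

```python
from collections import Counter


def monomer_composition(reads):
    """For each read position, return the fraction of A, C, G and T observed."""
    if not reads:
        return []
    length = len(reads[0])
    counts = [Counter() for _ in range(length)]
    for read in reads:
        # Reads longer than the first one are truncated to its length.
        for i, base in enumerate(read[:length].upper()):
            counts[i][base] += 1
    table = []
    for c in counts:
        total = sum(c[b] for b in "ACGT")  # ignore N and other symbols
        table.append({b: (c[b] / total if total else 0.0) for b in "ACGT"})
    return table


comp = monomer_composition(["ACGT", "AAGT"])
print(comp[1])
```

Under the equal-chance assumption each position would show roughly the genome-wide base frequencies, so a skew near the 5’ end points to digestion preference.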
MNase site preferencing (Flick et al., J. Mol. Biology 1986)
araTha9 control: MNase site preferencing. Table columns: Sequence | Occurrences | Sequence starts | Preference (%). Sequences listed: ctataggg, taataggg, gtattagg, tctttgct, cacattac, tcccagac, aaacaaca, acacgagc, tttgtttt, tttgcata, ttggttta, gaggtttt.
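One plausible reading of the table columns is sketched below. It rests on assumptions: that "Occurrences" counts how often an 8-mer appears in the genome, that "Sequence starts" counts the occurrences at which at least one read begins, and that preference is their ratio as a percentage. It also takes the 8-mer to begin at the read start on the forward strand only. `site_preference` is a hypothetical helper, not the speaker's pipeline.

```python
from collections import Counter


def site_preference(genome, read_starts, k=8):
    """Return {kmer: (occurrences, sequence_starts, preference_percent)}."""
    last = len(genome) - k
    occurrences = Counter(genome[i:i + k] for i in range(last + 1))
    # Count each genomic position once, however many reads start there.
    distinct_starts = {s for s in read_starts if 0 <= s <= last}
    starts = Counter(genome[s:s + k] for s in distinct_starts)
    return {
        kmer: (n, starts[kmer], 100.0 * starts[kmer] / n)
        for kmer, n in occurrences.items()
    }


genome = "ctataggg" + "aaa" + "ctataggg"
table = site_preference(genome, read_starts=[0, 0, 3])
print(table["ctataggg"])
```

Counting distinct start positions keeps the preference at or below 100%; counting every read instead would measure pile-up depth rather than site usage.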
ChIPSeq. [Diagram labels: MNase; digest; sequence and align; remove nucleosomes]
Nucleosome potentials: read density. [Figure: normalised read density against base coordinate; scale bar 1 kb]
Nucleosome potentials. [Figure: MNase potential and normalised read density]
Nucleosome potential
MNase biases: aiding interpretation? Can aid identification within a local sequence? Dependent upon local sequence context. A cautionary tale about analysing sequence contexts of ChIPSeq data: nucleotide composition analyses must take digestion preferencing into account.
Impacts on read coverage: outline. Sample preparation: MNase digestion. Alignment: parameter choices; mismatches; multiple read mappings; Hamming edit distances and k-mer space.
Hamming edit distances. Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k. Computed for all possible k-mers (36, 65) in the Arabidopsis genome: all vs. all, both strands, minimum HE distance. Example: target sequence CGTACATGC; probe sequence CGTTCAGGC; substitution required NNNYNNYNN; Hamming distance 2.
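The definition can be checked against the slide's own example with a few lines of Python. The function names are illustrative; the Y/N mask follows the "substitution required" row on the slide.

```python
def hamming(a: str, b: str) -> int:
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def substitution_mask(a: str, b: str) -> str:
    """Y where a substitution is required, N where the bases already agree."""
    return "".join("Y" if x != y else "N" for x, y in zip(a, b))


target, probe = "CGTACATGC", "CGTTCAGGC"
print(substitution_mask(target, probe), hamming(target, probe))
# prints: NNNYNNYNN 2
```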
Arabidopsis minimum Hamming edit distances, 36-mer
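For a toy sequence, the minimum Hamming edit distance of every k-mer to any other k-mer on either strand can be computed by brute force, as sketched below. This is quadratic in genome length and only meant to illustrate the quantity; a genome-scale run such as the one summarised on this slide would need an indexed or otherwise optimised method, which the slides do not describe. Excluding a k-mer's own position on both strands is an assumption.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def min_hamming_distances(genome, k):
    """Minimum distance from each k-mer to any k-mer at another position."""
    fwd = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rev = [reverse_complement(s) for s in fwd]
    result = []
    for i, query in enumerate(fwd):
        best = k  # k substitutions always suffice
        for j in range(len(fwd)):
            if j == i:
                continue  # skip the k-mer's own site on both strands
            best = min(best, hamming(query, fwd[j]), hamming(query, rev[j]))
        result.append(best)
    return result


distances = min_hamming_distances("ACGGACGG", 4)
print(Counter(distances))  # distribution of minimum distances
```

A k-mer with minimum distance 0 occurs more than once, so a read from it cannot be placed uniquely even with no mismatches allowed.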
Alignment issues. [Figure panels: hg18, dm3, araTha9, ce6, sacCer6]
Alignment artefacts: aligner properties. Table columns: Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread. Entries: SOAP 0-5 60; SOAP ?; Maq 1-3 2 ?; Bowtie 0-3 3 1024; Ubsalign 0-20 1024; 4 4 5.
Breakdown of sequencing run
Hamming edits and Ubsalign. Read AGATTAGCCTGGTACTGCTA against …..AGCTTAGCCTGGTACTGGTA….: HE 2. Against …..AGCTTAGCCGGGTACTGGTA….: HE 3. HE difference 1: no alignment.
Hamming edits and Ubsalign. Read AGATTAGCCTGGTACTGCTA against …..AGATTAGCCTGGTACTGCTA….: HE 0. Against …..AGCTTAGCCGGGTACTGCTA….: HE 2. HE difference 2: no alignment.
Hamming edits and Ubsalign. Read AGATTAGCCTGGTACTGCTA against …..AGCTTAGCCTGGTACTGCTA….: HE 1. Against …..AGCTTAGCCGGGTTCTGGTA….: HE 4. HE difference 3: alignment!
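The three examples suggest a rule in which a read is placed only when its best candidate site beats the second-best by a sufficient margin of Hamming edits. The sketch below illustrates that idea and is not Ubsalign's actual implementation. The default threshold of 3 is an assumption, chosen only because it reproduces the three outcomes shown on these slides.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def edit_differential(read, sites):
    """Return (best HE, HE difference between second-best and best site)."""
    if len(sites) < 2:
        raise ValueError("need at least two candidate sites")
    distances = sorted(hamming(read, site) for site in sites)
    return distances[0], distances[1] - distances[0]


def accept_alignment(read, sites, min_differential=3):
    """Accept the best site only if it is clearly better than the runner-up."""
    _, differential = edit_differential(read, sites)
    return differential >= min_differential  # threshold is an assumption


read = "AGATTAGCCTGGTACTGCTA"
slides = [
    ["AGCTTAGCCTGGTACTGGTA", "AGCTTAGCCGGGTACTGGTA"],  # HE 2 and 3
    ["AGATTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTACTGCTA"],  # HE 0 and 2
    ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"],  # HE 1 and 4
]
for sites in slides:
    print(edit_differential(read, sites), accept_alignment(read, sites))
```

Note that under this rule an exact match (HE 0) is still rejected when a second site lies only two edits away, which is the second slide's case.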
Testing aligner accuracy. Simulated reads with known correct location: 25 million, 50 million. Perfect match, up to 5 mismatches, up to 10 mismatches. Error 3’ bias. Numbers of: correctly aligned reads, incorrectly aligned reads, unalignable reads. Speed.
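A test set of this kind could be generated roughly as follows. The sketch draws reads from known positions, introduces up to a chosen number of substitutions, and places errors preferentially toward the 3’ end. The linear position weighting, the function names and the scoring helper are all assumptions; the slides do not give the error model or the scoring rules that were actually used.

```python
import random


def simulate_reads(genome, n_reads, read_len, max_mismatches, seed=0):
    """Return a list of (true_start, read_sequence) pairs."""
    if max_mismatches > read_len:
        raise ValueError("max_mismatches cannot exceed read_len")
    rng = random.Random(seed)
    positions = list(range(read_len))
    weights = [i + 1 for i in positions]  # assumed linear 3' bias
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        bases = list(genome[start:start + read_len])
        n_errors = rng.randint(0, max_mismatches)
        sites = set()
        while len(sites) < n_errors:
            sites.add(rng.choices(positions, weights=weights)[0])
        for i in sorted(sites):
            bases[i] = rng.choice([b for b in "ACGT" if b != bases[i]])
        reads.append((start, "".join(bases)))
    return reads


def score_alignments(truth, reported):
    """Tally reported starts (None means unaligned) against the true starts."""
    correct = sum(1 for (start, _), pos in zip(truth, reported) if pos == start)
    unaligned = sum(1 for pos in reported if pos is None)
    incorrect = len(reported) - correct - unaligned
    return {"correct": correct, "incorrect": incorrect, "unaligned": unaligned}


reads = simulate_reads("ACGT" * 50, n_reads=5, read_len=36, max_mismatches=5)
print(score_alignments(reads, [start for start, _ in reads]))
```

Because each read carries its true start, an aligner's output can be classified into the three counts listed on the slide. Speed would be measured separately.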
Alignment artefacts: managing mismatch thresholds
How does this affect interpretation? Incorporation of edit differentials: leads to gains in the number of alignable reads; increased information; determination of the alignment; gains of % in mappable sites. Hamming edit distributions provide useful information. Impact of MNase digestion on short read sequence coverage.
Hamming distance variability
Read deserts
Sequence deserts
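Read deserts can be located from a per-base coverage vector by scanning for runs of zero depth. The sketch below assumes a desert is simply a run of uncovered bases at least `min_length` long; the slides do not give the definition or threshold that was used, and `find_deserts` is a hypothetical helper.

```python
def find_deserts(coverage, min_length):
    """Return half-open (start, end) intervals of zero coverage."""
    deserts = []
    run_start = None
    for i, depth in enumerate(coverage):
        if depth == 0:
            if run_start is None:
                run_start = i  # a new uncovered run begins here
        else:
            if run_start is not None and i - run_start >= min_length:
                deserts.append((run_start, i))
            run_start = None
    # Close a run that extends to the end of the sequence.
    if run_start is not None and len(coverage) - run_start >= min_length:
        deserts.append((run_start, len(coverage)))
    return deserts


print(find_deserts([1, 0, 0, 0, 2, 0, 1, 0, 0], min_length=2))
# prints: [(1, 4), (7, 9)]
```

Comparing such intervals with the positions of k-mers whose minimum Hamming distance is 0 would indicate which deserts come from multiple read mappings rather than from a true absence of reads.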
Impacts on read coverage: conclusions. Sample preparation: MNase digestion; local biases present. Alignment: parameter choices. Mismatches: generally too low relative to the uniqueness of k-mers in the genome. Multiple read mappings: can drive ‘absence’ of mapped reads. Hamming edit distances and k-mer space: k-mers have unique and genome-specific properties; these can be used to inform the results of alignment.
Acknowledgements. CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James. CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis. CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison. CMIS / TBCP: Paul Greenfield.
Paired end data: sample preparation. [Figure labels: C, G, A, T, insert]
Control and sample read density. [Figure panels: control, sample]