Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor Bioinformatics Team CSIRO.


1 Genome-wide characteristics of sequence coverage by next-generation sequencing: how does this impact interpretation? Jen Taylor, Bioinformatics Team, CSIRO Plant Industry

2 CSIRO. Newton Meeting July 2010 - Sequence coverage. Assumptions: every k-mer has an equal chance of being sequenced.
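As a rough illustration of what the equal-chance assumption implies: if read starts are spread uniformly over all k-mer positions, the number of reads starting at any one position is approximately Poisson-distributed with mean (reads / positions). The sketch below is not from the talk; the function names and the numbers in the usage note are hypothetical, and it counts positions on one strand only.

```python
import math


def expected_starts_per_site(num_reads: int, genome_length: int, k: int) -> float:
    """Mean number of reads starting at one k-mer position under uniform sampling."""
    num_sites = genome_length - k + 1  # k-mer start positions on one strand
    if num_sites <= 0:
        raise ValueError("genome must be at least k bases long")
    return num_reads / num_sites


def prob_site_unsampled(num_reads: int, genome_length: int, k: int) -> float:
    """Poisson approximation to the chance that no read starts at a given position."""
    return math.exp(-expected_starts_per_site(num_reads, genome_length, k))
```

For example, `expected_starts_per_site(1000, 135, 36)` gives 10.0 read starts per position. Observed read densities that depart strongly from this kind of expectation are the deviations the following slides examine.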

3 Read density

4 Deviations from assumptions?

5 Impacts on read coverage - Outline. Sample preparation: MNase digestion. Alignment: parameter choices (mismatches, multiple read mappings). Hamming edit distances and k-mer space.

6 Assumptions: Digestion. Illumina, SOLiD. http://seq.molbiol.ru/sch_lib_fr.html

7 ChIPSeq. Diagram labels: MNase; Linker; Digest; Sequence & Align; Remove Nucleosomes.

8 ChIPSeq - Nucleosome. Sample: MNase digested, size fractionated. Control: MNase digested, random sizes.

9 araTha9 aligned reads: 36-mer monomer composition

10 araTha9 aligned reads: 5’ +/- 16 bp monomer composition

11 MNase site preferencing (Flick et al., J. Mol. Biology 1986)

12 araTha9 Control: MNase Site Preferencing
Sequence    Occurrences    Sequence Starts    Preference (%)
ctataggg    499            245                49.10
taataggg    864            424                49.10
gtattagg    1044           253                24.23
tctttgct    4902           425                8.67
cacattac    1807           52                 2.88
tcccagac    695            20                 2.88
aaacaaca    10083          159                1.58
acacgagc    810            2                  0.25
tttgtttt    3218635 [digits fused in source]  0.19
tttgcata    4602           5                  0.11
ttggttta    7671           1                  0.01
gaggtttt    3926           0                  0
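In the rows of this table where the digits separate cleanly, the Preference (%) column appears consistent with (sequence starts / occurrences) x 100. A minimal sketch of that calculation follows; the function name is hypothetical, and the column split of the flattened digits is a reading of the transcript rather than something the slide states.

```python
def preference_percent(occurrences: int, starts: int) -> float:
    """Percentage of a motif's genomic occurrences at which a read starts."""
    if occurrences <= 0:
        raise ValueError("occurrences must be positive")
    return 100.0 * starts / occurrences


# Rows as read from the slide: (8-mer, occurrences, sequence starts).
rows = [
    ("ctataggg", 499, 245),
    ("gtattagg", 1044, 253),
    ("acacgagc", 810, 2),
]

for motif, occurrences, starts in rows:
    print(motif, round(preference_percent(occurrences, starts), 2))
```

A motif with many occurrences but few read starts (for example acacgagc) is cut rarely, which is the kind of digestion bias that nucleotide composition analyses need to account for.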

13 ChIPSeq. Diagram labels: MNase; Digest; Sequence & Align; Remove Nucleosomes.

14 araTha9 Control: MNase Site Preferencing (table repeated from slide 12)

15 Nucleosome potentials – Read Density. [Figure: normalised read density against base coordinate; scale bar 1 Kb]

16 Nucleosome potentials. [Figure: MNase potential and normalised read density]

17 Nucleosome potentials. [Figure: MNase potential and normalised read density]

18 Nucleosome potential

19 MNase biases aiding interpretation? Can aid identification in a local sequence? Dependent upon local sequence context. A cautionary tale about analysing sequence contexts of ChIPSeq data: nucleotide composition analyses must take digestion preferencing into account.

20 Impacts on read coverage - Outline. Sample preparation: MNase digestion. Alignment: parameter choices (mismatches, multiple read mappings). Hamming edit distances and k-mer space.

21 Hamming Edit Distances
Defined as the number of substitution edit operations required to transform one sequence of length k into another of length k.
For all possible k-mers (36, 65) in the Arabidopsis genome: all vs. all, both strands, minimum HE distance.
Target sequence:        CGTACATGC
Probe sequence:         CGTTCAGGC
Substitution required:  NNNYNNYNN
Hamming:                2
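The definition and the all-vs-all, both-strand minimum can be sketched as follows. This is a brute-force illustration for toy sequences only, not the method used for the genome-scale analysis in the talk; all function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of substitutions needed to turn a into b (equal lengths only)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def substitution_mask(target: str, probe: str) -> str:
    """Y where a substitution is required, N where the bases already match."""
    return "".join("Y" if x != y else "N" for x, y in zip(target, probe))


_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def min_hamming_by_position(genome: str, k: int) -> list:
    """For each forward-strand k-mer, the minimum Hamming distance to a k-mer
    at any other locus on either strand (quadratic time, toy inputs only)."""
    fwd = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    rc = reverse_complement(genome)
    rev = [rc[i:i + k] for i in range(len(rc) - k + 1)]
    n = len(fwd)
    result = []
    for i, kmer in enumerate(fwd):
        best = k  # upper bound; stays k if there is nothing to compare against
        for j, other in enumerate(fwd):
            if j != i:
                best = min(best, hamming(kmer, other))
        for j, other in enumerate(rev):
            # rev[j] is the reverse complement of fwd[n - 1 - j]; skip own locus.
            if j != n - 1 - i:
                best = min(best, hamming(kmer, other))
        result.append(best)
    return result


print(hamming("CGTACATGC", "CGTTCAGGC"))            # the slide's example pair
print(substitution_mask("CGTACATGC", "CGTTCAGGC"))
```

A minimum distance of 0 marks a k-mer that is repeated elsewhere in the sequence; larger minimum distances mean a read from that locus can tolerate more mismatches before it becomes ambiguous.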

22 Arabidopsis minimum Hamming edit distances, 36-mer

23 Alignment issues. [Figure: axis ticks 0 to 14; genomes hg18, dm3, araTha9, ce6, sacCer6]

24 Alignment artefacts: aligner properties
Columns: Mismatch | Read length | Genome pre-processing | Reads pre-processing | Uses quality score | Reports unmapped reads | Multithread
SOAP: 0-5  60 
SOAP2: 0-5 1 ? 
Maq: 1-3 2 ?  
Bowtie: 0-3 3  1024
Ubsalign: 0-20  1024  4 4 5

25 Breakdown of sequencing run

26 Hamming edits and Ubsalign: HE difference
Read:         AGATTAGCCTGGTACTGCTA
Reference 1:  ...AGCTTAGCCTGGTACTGGTA...   H = 2
Reference 2:  ...AGCTTAGCCGGGTACTGGTA...   H = 3
Result:       No alignment

27 Hamming edits and Ubsalign: HE difference
Read:         AGATTAGCCTGGTACTGCTA
Reference 1:  ...AGATTAGCCTGGTACTGCTA...   H = 0
Reference 2:  ...AGCTTAGCCGGGTACTGCTA...   H = 2
Result:       No alignment

28 Hamming edits and Ubsalign: HE difference
Read:         AGATTAGCCTGGTACTGCTA
Reference 1:  ...AGCTTAGCCTGGTACTGCTA...   H = 1
Reference 2:  ...AGCTTAGCCGGGTTCTGGTA...   H = 4
Result:       Alignment!
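The three examples on slides 26 to 28 are consistent with a rule that reports the best site only when the next-best site is several more edits away: differences of 1 and 2 give no alignment, a difference of 3 gives an alignment. The sketch below assumes a threshold of 3 purely because it reproduces those examples; the slides do not state the actual Ubsalign rule, and the function and parameter names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def align_with_differential(read, candidate_sites, min_differential=3):
    """Return the index of the best candidate site, or None if the call is ambiguous.

    min_differential is an assumed threshold, chosen only to match the slides.
    """
    if not candidate_sites:
        return None
    scored = sorted((hamming(read, site), idx) for idx, site in enumerate(candidate_sites))
    best_distance, best_index = scored[0]
    if len(scored) == 1:
        return best_index  # no competing site
    second_distance = scored[1][0]
    if second_distance - best_distance >= min_differential:
        return best_index
    return None


read = "AGATTAGCCTGGTACTGCTA"
# Slide 28: best site has 1 mismatch, next-best has 4, so the read is placed.
print(align_with_differential(read, ["AGCTTAGCCTGGTACTGCTA", "AGCTTAGCCGGGTTCTGGTA"]))
```

Under this reading, a perfect match (slide 27) can still be rejected when a second site lies only two edits away, while a read with one mismatch (slide 28) is accepted because nothing else is close.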

29 Testing aligner accuracy. Simulated reads: known correct location; 25 million, 50 million; perfect match, up to 5 mismatches, up to 10 mismatches; error 3’ bias. Measured: numbers of correctly aligned reads, incorrectly aligned reads and unalignable reads; speed.
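A test of this kind can be sketched as below: reads are drawn from known positions, substitutions are placed preferentially towards the 3' end, and aligner calls are tallied against the truth. The linear position weighting, the forward-strand-only sampling and every name here are assumptions for illustration; the talk does not describe how its simulation was implemented.

```python
import random


def simulate_read(genome, read_len, max_mismatches, rng):
    """Return (true_start, read) with 0..max_mismatches substitutions, 3'-biased."""
    if not 0 <= max_mismatches <= read_len <= len(genome):
        raise ValueError("need 0 <= max_mismatches <= read_len <= len(genome)")
    start = rng.randrange(len(genome) - read_len + 1)
    bases = list(genome[start:start + read_len])
    n_errors = rng.randint(0, max_mismatches)
    # Assumed error model: weight grows linearly from the 5' end to the 3' end.
    weights = [i + 1 for i in range(read_len)]
    error_positions = set()
    while len(error_positions) < n_errors:
        error_positions.add(rng.choices(range(read_len), weights=weights)[0])
    for pos in sorted(error_positions):
        bases[pos] = rng.choice([b for b in "ACGT" if b != bases[pos]])
    return start, "".join(bases)


def tally_calls(pairs):
    """Count outcomes from (true_start, called_start) pairs; None means unaligned."""
    counts = {"correct": 0, "incorrect": 0, "unaligned": 0}
    for true_start, called_start in pairs:
        if called_start is None:
            counts["unaligned"] += 1
        elif called_start == true_start:
            counts["correct"] += 1
        else:
            counts["incorrect"] += 1
    return counts


rng = random.Random(0)  # fixed seed so the simulation is repeatable
toy_genome = "".join(rng.choice("ACGT") for _ in range(500))
true_start, read = simulate_read(toy_genome, 36, 5, rng)
```

Each simulated read would then be passed to the aligner under test, and the resulting `(true_start, called_start)` pairs fed to `tally_calls` to obtain the three counts listed on the slide.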

30 Alignment artefacts: managing mismatch thresholds

31 Alignment artefacts: managing mismatch thresholds

32 How does this affect interpretation? Incorporation of edit differentials leads to gains in the number of alignable reads: increased information; determination of the alignment; gains of 5-10% in mappable sites. Hamming edit distributions provide useful information. Impact of MNase digestion on short read sequence coverage.

33 Hamming distance variability

34 Read deserts

35 Read deserts

36 Sequence deserts

37 Impacts on read coverage - Conclusions. Sample preparation: MNase digestion (local biases present). Alignment: parameter choices. Mismatches – generally too low relative to uniqueness of k-mers in the genome. Multiple read mappings – can drive ‘absence’ of mapped reads. Hamming edit distances and k-mer space: k-mers have unique and genome-specific properties; can be used to inform results of alignment.

38 Acknowledgements. CSIRO PI Bioinformatics Team: Andrew Spriggs, Stuart Stephen, Emily Ying, Jose Robles, Michael James. CSIRO Prog X: Chris Helliwell, Frank Gubler, Liz Dennis. CSIRO Transformational Biology Capability Platform: David Lovell, Mark Morrison. CMIS / TBCP: Paul Greenfield.

39 Paired end data – sample preparation. [Diagram labels: C, G, A, T, insert]

40 Control and sample read density. [Figure panels: Control, Sample]

