Presentation is loading. Please wait.

Presentation is loading. Please wait.

Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009.

Similar presentations


Presentation on theme: "Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009."— Presentation transcript:

1 Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009

2 New sequencing technologies…

3 … offer vast throughput … & many applications read length bases per machine run 10 bp1,000 bp100 bp 1 Gb 100 Mb 10 Mb 10 Gb Illumina/Solexa, AB/SOLiD sequencers ABI capillary sequencer Roche/454 pyrosequencer (100Mb-1Gb in 200-450bp reads) (10-50Gb in 25-100 bp reads) 1 Mb 100 Gb

4 Genome resequencing for variation discovery SNPs short INDELs structural variations the most immediate application area

5 Genome resequencing for mutational profiling Organismal reference sequence likely to change “classical genetics” and mutational analysis

6 De novo genome sequencing Lander et al. Nature 2001 difficult problem with short reads promising, especially as reads get longer

7 Identification of protein-bound DNA Chromatin structure (CHIP-SEQ) (Mikkelsen et al. Nature 2007) Transcription binding sites. (Robertson et al. Nature Methods, 2007) DNA methylation. (Meissner et al. Nature 2008) natural applications for next-gen. sequencers

8 Transcriptome sequencing: transcript discovery Mortazavi et al. Nature Methods 2008 Ruby et al. Cell, 2006 high-throughput, but short reads pose challenges

9 Transcriptome sequencing: expression profiling Jones-Rhoads et al. PLoS Genetics, 2007 Cloonan et al. Nature Methods, 2008 high-throughput, short-read sequencing should make a major impact, and potentially replace expression microarrays

10 … & enable personal genome sequencing

11 The re-sequencing informatics pipeline REF (ii) read mapping IND (i) base calling IND (iii) SNP and short INDEL calling (v) data viewing, hypothesis generation (iv) SV calling

12 The variation discovery “toolbox” base callers read mappers SNP callers SV callers assembly viewers

13 Base error characteristics vary Illumina 454

14 Read lengths vary read length [bp] 0 100200300 ~200-450 (variable) 25-100(fixed) 25-50 (fixed) 25-60 (variable) 400

15 Sequence traces are machine-specific Base calling is increasingly left to machine manufacturers

16 Representational biases

17 Fragment duplication

18 Read mapping is like a jigsaw … and they give you the picture on the box 2. Read mapping …you get the pieces… Unique pieces are easier to place than others…

19 Multiply-mapping reads Reads from repeats cannot be uniquely mapped back to their true region of origin “Traditional” repeat masking does not capture repeats at the scale of the read length

20 Dealing with multiple mapping

21 Paired-end (PE) reads fragment length: 100 – 600bp Korbel et al. Science 2007 fragment length: 1 – 10kb PE reads are now the standard for whole-genome short-read sequencing

22 Gapped alignments (for INDELs)

23 The MOSAIK read mapper gapped mapper option to report multiple map locations aligns 454, Illumina, SOLiD, Helicos reads works with standard file formats (SRF, FASTQ, SAM/BAM) Michael Strömberg

24 Alignment post-processing quality value re-calibration duplicate fragment removal

25 Data storage requirements

26 Alignment visualization too much data – indexed browsing too much detail – color coding, show/hide

27 SNP calling: old problem, new data sequencing errorpolymorphism

28 Allele calling in next-gen data SNP INS New technologies are perfectly suitable for accurate SNP calling, and some also for short-INDEL detection

29 SNP calling in multi-sample read sets P(G 1 =aa|B 1 =aacc; B i =aaaac; B n = cccc) P(G 1 =cc|B 1 =aacc; B i =aaaac; B n = cccc) P(G 1 =ac|B 1 =aacc; Bi=aaaac; B n = cccc) P(G i =aa|B 1 =aacc; B i =aaaac; B n = cccc) P(G i =cc|B 1 =aacc; B i =aaaac; B n = cccc) P(G i =ac|B 1 =aacc; Bi=aaaac; B n = cccc) P(G n =aa|B 1 =aacc; B i =aaaac; B n = cccc) P(G n =cc|B 1 =aacc; B i =aaaac; B n = cccc) P(G n =ac|B 1 =aacc; Bi=aaaac; B n = cccc) P(SNP) “genotype probabilities” P(B 1 =aacc|G 1 =aa) P(B 1 =aacc|G 1 =cc) P(B 1 =aacc|G 1 =ac) P(B i =aaaac|G i =aa) P(B i =aaaac|G i =cc) P(B i =aaaac|G i =ac) P(B n =cccc|G n =aa) P(B n =cccc|G n =cc) P(B n =cccc|G n =ac) “genotype likelihoods” Prior(G 1,..,G i,.., G n ) -----a----- -----c----- -----a----- -----c-----

30 Trio sequencing the child inherits one chromosome from each parent there is a small probability for a de novo (germ-line or somatic) mutation in the child

31 Alignment visualization too much data – indexed browsing too much detail – color coding, show/hide

32 Standard data formats SRF/FASTQ SAM/BAM GLF/VCF

33 Human genome polymorphism projects common SNPs

34 Human genome polymorphism discovery

35 The 1000 Genomes Project Pilot 1 Pilot 2 Pilot 3

36 1000G Pilot 3 – exon sequencing Targets: 1K genes / 10K targets Capture: Solid / liquid phase Sequencing: 454 / Illumina SE / PE Data producers: Baylor Broad Sanger Wash. U. Informatics methods: Multiple read mapping & SNP calling programs

37 Coverage varies

38 On/off target capture ref allele*:45% non-ref allele*:54% Target region SNP (outside target region)

39 Fragment duplication – revisited

40 Reference allele bias (*) measured at 450 het HapMap 3 sites overlapping capture target regions in sample NA07346 ref allele*:54% non-ref allele*:45% ref allele*:54% non-ref allele*:45%

41 SNP calling findings BCM/454BI/SLXWUGSC/SLXSC/SLX # Samples 32231611 per sample 35 X62 X117 X51 X # SNPs called 7,200 – 8,4004,500 – 4,7003,7003,500 – 3,700 % dbSNPs 39 - 5565 - 726875 - 85 Ts/Tv(#SNP) 1.7 – 2.61.9 – 2.32.32.5 – 2.6 # Novel SNPs 3,9981,5501,947892 based on a method comparison / testing exercise 80 samples drawn from the 4 Centers read mapping / SNP calling by the Baylor pipeline (BCM/454 data); the Broad and the BC pipelines (all 80 samples)

42 Overlap between call sets 452 172 38.05% 1.21 2,296 1,862 81.10% 3.40 413 24 5.81% 0.23 Broad calls BC calls # SNP calls: # dbSNPs: % dbSNPs: Ts/Tv ratio:

43 The 1000G Structural Variation Discovery Effort

44 Structural variation detection Feuk et al. Nature Reviews Genetics, 2006

45 SV detection – resolution Expected CNVs Karyotype Micro-array Sequencing Relative numbers of events CNV event length [bp]

46 Detection Approaches Read Depth: good for big CNVs Sample Reference Lmap read contig Paired-end: all types of SV Split-Reads good break-point resolution deNovo Assembly ~ the future SV slides courtesy of Chip Stewart, Boston College

47 47 Read depth (RD)

48 Statistical & systematic biases

49 Single molecule sequencing? GC Bias Coverage bias

50 RD resolution density (log 10 ) Illumina observed read counts (per kb) expected read counts ( per kb) CN = 2 CN = 1 CN = 3

51 CNV events detected with RD Reads/2k b log 2 (o/e) Chromosome 2 Position [Mb] Reads/5k b log 2 (o/e) Reads/kb log 2 (o/e) Helicos 40 kb deletion 454 Illumina individual “ NA12878 ”

52 SV detection with PE read map positions Deletion Individual Reference LM L del Tandem duplication L dup Inversion L inv

53 Fragment length distributions long fragments ~ better fragment coverage and sensitivity to large events (454) tighter distributions ~ better breakpoint resolution and sensitivity for shorter events (Illumina)

54 The SV/CNV event display chromosome overview fragment lengths read depth event track 300 bp deletion in chromosome of NA12878 by Illumina paired-end data from the 1000 Genomes project Chip Stewart

55 Deletion event lengths ALUY L1

56 Mobile element insertions: PE reads Used with short-read data (Illumina, in our case) Detect clusters of 5’ & 3’ read pairs with one end mapping to a mobile element Clusters far from annotated elements are candidate insertion events ALU / L1 / SVA 5’ end 3’ end

57 Mobile element insertions: Split reads Requires longer reads (454) Reads “mapping into” mobile element not present in the reference genome sequence are candidate insertion events

58 Mobile element insertions: trio members 356 185 182 NA12891 684 NA12892 664 ALU insertions NA12878 733 47 79 96 10 Detection in 1000 G Pilot 2 CEU trio PE Illumina data

59 Mobile element insertions: PE vs. Split-reads 569 247163 454 Split-Read 817 SLX PE 733 ALU insertions in NA12878

60 BC event lists in 1000 Genomes data SV type Pilot 1 140 samples low coverage Pilot 2 6 samples high coverage deletions5,5554,718 tandem duplications 540406 mobile element insertions 3,2762,013

61 SV calls / validation in 1000G datasets

62 Software access http://bioinformatics.bc.edu/marthlab/Software_Release

63 Credits Elaine Mardis Andy Clark Aravinda Chakravarti Michael Egholm Scott Kahn Francisco de la Vega Patrice Milos John Thompson R01 HG004719 R01 HG003698 R21 AI081220 RC2 HG5552

64 Lab Several positions are available: grad students / postodocs / programmers

65 Can we find exon CNVs in Pilot 3 data?

66 SVs in exon sequencing data


Download ppt "Next-generation sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department Harvard Nanocourse October 7, 2009."

Similar presentations


Ads by Google