High throughput sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013.


1 High throughput sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013

2 Traditional DNA sequencing

3 Genetics of living organisms: DNA, chromosomes

4 Radioactive label gel sequencing

5 Four-color capillary sequencing [Figure: ABI 3700 four-color sequence trace; size labels: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb]

6 Individual human resequencing

7 Next-generation DNA sequencing

8 New sequencing technologies…

9 … vast throughput, many applications [Plot: read length (10 bp to 1,000 bp) versus bases per machine run (1 Mb to 1 Tb); platforms shown: Illumina, SOLiD, 454, ABI / capillary]

10 Sequencing chemistries: DNA ligation; DNA base extension (Church, 2005)

11 Template clonal amplification Church, 2005

12 Massively parallel sequencing Church, 2005

13 Chemistry of paired-end sequencing: double-stranded DNA is folded into a bridge shape, then separated into single strands. The end of each strand is then sequenced. (Figure courtesy of Illumina)

14 Paired-end reads (Korbel et al. Science 2007)
  - Fragment amplification: fragment length 100 - 600 bp; fragment length limited by amplification efficiency
  - Circularization: 500 bp - 10 kb (sweet spot ~3 kb); fragment length limited by library complexity

15 Features of NGS data
  - Short sequence reads: 100-200 bp; 25-35 bp (micro-reads)
  - Huge amount of sequence per run: up to gigabases per run
  - Huge number of reads per run: up to hundreds of millions
  - Higher error rate than Sanger sequencing
  - Error profile different from that of Sanger sequencing

16 Application areas of next-gen sequencing

17 Application areas
  - Genome resequencing: variant discovery, somatic mutation detection, mutational profiling
  - De novo assembly
  - Identification of protein-bound DNA: chromatin structure, methylation, transcription binding sites
  - RNA-Seq: expression, transcript discovery
  (Mikkelsen et al. Nature 2007; Cloonan et al. Nature Methods, 2008)

18 SNP and short-INDEL discovery

19 Structural variation detection
  - Structural variations (deletions, insertions, inversions and translocations): from paired-end read map locations
  - Copy number (for amplifications, deletions): from depth of read coverage
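The first bullet above, reading structural variation off paired-end map locations, can be sketched as a rule table. This is an illustrative simplification, not the method of any particular SV caller: the `ReadPair` type, the category labels and the default 100 to 600 bp concordant range (taken from the fragment lengths quoted on the paired-end slides) are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class ReadPair:
    """Mapped locations of the two mates of one fragment (hypothetical type)."""
    chrom1: str
    start1: int
    strand1: str  # "+" or "-"
    chrom2: str
    start2: int
    strand2: str
    read_length: int


def classify_pair(pair: ReadPair, min_fragment: int = 100, max_fragment: int = 600) -> str:
    """Label a read pair by how its mapping departs from the expected library layout."""
    # Mates on different chromosomes suggest a translocation.
    if pair.chrom1 != pair.chrom2:
        return "translocation candidate"

    # Order the mates by map position so orientation can be read left to right.
    left, right = sorted([(pair.start1, pair.strand1), (pair.start2, pair.strand2)])

    # Expected layout: left mate on "+", right mate on "-".
    if left[1] == right[1]:
        return "inversion candidate"
    if left[1] == "-":
        return "everted pair"

    # Apparent fragment length as seen on the reference.
    fragment = right[0] + pair.read_length - left[0]
    if fragment > max_fragment:
        # Mates map too far apart: sequence is missing in the sample.
        return "deletion candidate"
    if fragment < min_fragment:
        # Mates map too close together: extra sequence in the sample.
        return "insertion candidate"
    return "concordant"
```

In practice a single discordant pair is weak evidence; callers typically require several pairs supporting the same event before reporting it.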

20 Identification of protein-bound DNA
  - Chromatin structure (ChIP-Seq) (Mikkelsen et al. Nature 2007)
  - Transcription binding sites (Robertson et al. Nature Methods, 2007)
  [Figure labels: genome sequence; aligned reads]

21 Novel transcript discovery (genes) (Mortazavi et al. Nature Methods)
  - Novel exons
  - Novel transcripts containing known exons

22 Novel transcript discovery (miRNAs) Ruby et al. Cell, 2006

23 Expression profiling (Jones-Rhoads et al. PLoS Genetics, 2007)
  - Gene tag counting (e.g. SAGE, CAGE)
  - Shotgun transcript sequencing
  [Figure label: aligned reads]

24 De novo genome sequencing (Lander et al. Nature 2001) [Figure labels: short reads; longer reads; read pairs; assembled sequence contigs]

25 The informatics of sequencing

26 Re-sequencing informatics pipeline
  (i) base calling
  (ii) read mapping
  (iii) SNP and short INDEL calling
  (iv) SV calling
  (v) data viewing, hypothesis generation
  [Figure labels: REF; IND]

27 The variation discovery toolbox: base callers, read mappers, SNP callers, SV callers, assembly viewers

28 Raw data processing / base calling
  - Trace extraction
  - Base calling
  - These steps are usually handled well by the machine manufacturers’ software
  - What most analysts want to see is base calls and well-calibrated base quality values
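To make "well-calibrated base quality values" concrete, here is a minimal sketch assuming the qualities are on the Phred scale, Q = -10 log10(P_error), which the slide does not state explicitly. The function names are made up for this example. A quality value is well calibrated when bases reported at quality Q are wrong at roughly the rate that Q predicts.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred-scaled quality into the implied probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability (0 < p <= 1) into a Phred-scaled quality."""
    return -10 * math.log10(p)


def empirical_quality(mismatches: int, total: int) -> float:
    """Observed quality of a group of bases that share one reported quality value.

    One pseudo-count is added to both numerator and denominator so that a group
    with zero observed mismatches still yields a finite value (an assumption of
    this sketch, not a standard required by any tool).
    """
    return error_prob_to_phred((mismatches + 1) / (total + 1))
```

Comparing `empirical_quality` against the reported quality for each quality bin, using mismatches to a trusted reference at known non-variant sites, is one simple way to check calibration.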

29 Sequence traces are machine-specific. Base calling is increasingly left to machine manufacturers.

30 Read mapping… is like a jigsaw puzzle… where they give you the cover on the box

31 Some pieces are easier to place than others… pieces with unique features… pieces that look like each other…

32 Repeats → multiple mapping problem (Lander et al. 2001)

33 Paired-end (PE) reads (Korbel et al. Science 2007)
  - Fragment length: 100 – 600 bp
  - Fragment length: 1 – 10 kb
  - PE reads are now the standard for whole-genome short-read sequencing

34 Mapping quality values [Figure labels: 0.8, 0.19, 0.01]
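One plausible reading of the three numbers on this slide is that they are the probabilities of three candidate placements of the same read. Under that assumption, a mapping quality can be sketched as the Phred-scaled probability that the best placement is wrong. The function name and the cap of 60 are illustrative choices, not taken from the slides or from any specific mapper.

```python
import math
from typing import Sequence


def mapping_quality(placement_weights: Sequence[float], cap: float = 60.0) -> float:
    """Phred-scaled probability that the best-scoring placement is not the true one.

    placement_weights holds one non-negative weight per candidate location;
    the weights are normalized here, so they need not sum to 1.
    """
    total = sum(placement_weights)
    if total <= 0:
        raise ValueError("at least one placement must have positive weight")
    p_wrong = 1.0 - max(placement_weights) / total
    if p_wrong <= 0.0:
        # A unique placement would give infinite quality; cap it instead.
        return cap
    return min(cap, -10 * math.log10(p_wrong))


# The three values shown on the slide, read as placement probabilities.
slide_example = mapping_quality([0.8, 0.19, 0.01])
```

A read in a repeat, with several near-equal placements, gets a quality near zero, while a read with one dominant placement gets a high quality.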

35 SNP calling

36 SNP calling: what goes into it? Sequencing error vs. true polymorphism
  - Base qualities
  - Base coverage
  - Prior expectation

37 Bayesian SNP calling
  - Inputs: base call + base quality; expected polymorphism rate; base composition; depth of coverage
  - Output: Bayesian posterior probability
  [Figure labels: polymorphic permutation; monomorphic permutation; read columns AAAAAAAAAA, CCCCCCCCCC, TTTTTTTTTT, GGGGGGGGGG]
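The inputs listed on this slide can be combined in a toy single-site model. The sketch below is a deliberate simplification and is not the PolyBayes algorithm. It assumes Phred-scaled base qualities, a uniform base composition, a site that is either monomorphic or carries exactly two alleles sampled with equal probability, and a hypothetical default polymorphism rate of 0.001.

```python
from itertools import combinations
from typing import Sequence

BASES = "ACGT"


def p_call_given_true(call: str, true_base: str, err: float) -> float:
    """Probability of observing `call` when the underlying base is `true_base`."""
    return 1.0 - err if call == true_base else err / 3.0


def snp_posterior(calls: str, quals: Sequence[float], poly_rate: float = 0.001) -> float:
    """Posterior probability that an alignment column is polymorphic."""
    errs = [10 ** (-q / 10) for q in quals]

    # Monomorphic hypotheses: every read derives from the same base.
    mono = 0.0
    for base in BASES:
        lik = 1.0
        for call, err in zip(calls, errs):
            lik *= p_call_given_true(call, base, err)
        mono += (1.0 - poly_rate) / len(BASES) * lik

    # Polymorphic hypotheses: each read derives from one of two alleles.
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for x, y in pairs:
        lik = 1.0
        for call, err in zip(calls, errs):
            lik *= 0.5 * p_call_given_true(call, x, err) + 0.5 * p_call_given_true(call, y, err)
        poly += poly_rate / len(pairs) * lik

    return poly / (mono + poly)
```

With this model a single low-quality mismatch barely moves the posterior, while several high-quality reads supporting a second allele push it toward 1, which is the interplay of base quality, coverage and prior that the slide describes.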

38 The PolyBayes software (Marth et al., Nature Genetics, 1999) http://bioinformatics.bc.edu/~marth/PolyBayes
  - First statistically rigorous SNP discovery tool
  - Correctly analyzes alternative cDNA splice forms

39 SNP calling (continued)
  Example data for samples 1, i and n: $B_1 = aacc$, $B_i = aaaac$, $B_n = cccc$
  “Genotype likelihoods”, for each genotype $g \in \{aa, cc, ac\}$:
    $P(B_1 = aacc \mid G_1 = g)$, $P(B_i = aaaac \mid G_i = g)$, $P(B_n = cccc \mid G_n = g)$
  Prior: $\mathrm{Prior}(G_1, \ldots, G_i, \ldots, G_n)$
  “Genotype probabilities”, for each sample $k \in \{1, i, n\}$ and each genotype $g \in \{aa, cc, ac\}$:
    $P(G_k = g \mid B_1 = aacc;\ B_i = aaaac;\ B_n = cccc)$
  Output: $P(\mathrm{SNP})$
  [Figure: aligned reads carrying allele a or allele c at the site]

40 Insertion/deletion (INDEL) variants
  - These variants have been on the “radar screen” for decades
  - Accurate automated detection is difficult:
    - Different mutation mechanisms
    - Often appear in repetitive sequence and are therefore difficult to align
    - Often multi-allelic
    - Deleted allele has no base quality values

41 Alignment methods became more refined [Figure panels: original alignment; after left realignment; after haplotype-aware realignment]
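The "left realignment" panel refers to placing an INDEL at its leftmost equivalent position, so that the same event in a repeat is represented the same way in every read. A minimal sketch of that idea follows; the function names and the 0-based coordinate convention are assumptions of this example, and haplotype-aware realignment is not covered.

```python
from typing import Tuple


def left_align_deletion(ref: str, pos: int, length: int) -> int:
    """Return the leftmost start of a deletion equivalent to ref[pos:pos + length].

    The deletion can move one base left whenever the base entering the deleted
    window on the left equals the base leaving it on the right.
    """
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos


def left_align_insertion(ref: str, pos: int, seq: str) -> Tuple[int, str]:
    """Return the leftmost (position, sequence) equivalent to inserting seq before ref[pos]."""
    while pos > 0 and ref[pos - 1] == seq[-1]:
        # Rotate the inserted sequence by one base as it moves left.
        seq = seq[-1] + seq[:-1]
        pos -= 1
    return pos, seq


def apply_deletion(ref: str, pos: int, length: int) -> str:
    """Sequence that results from deleting ref[pos:pos + length]."""
    return ref[:pos] + ref[pos + length:]


def apply_insertion(ref: str, pos: int, seq: str) -> str:
    """Sequence that results from inserting seq before ref[pos]."""
    return ref[:pos] + seq + ref[pos:]
```

For example, deleting the last "CA" unit of a CA repeat and deleting the first one produce the same sequence; left alignment reports both as the first one.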

42 Medium-length INDELs are still a problem (Guillermo Angel)

43 Structural variation detection Feuk et al. Nature Reviews Genetics, 2006

44 Structural variant detection (cont’d)

45 Detection approaches
  - Read depth: good for big CNVs
  - Paired-end: all types of SV
  - Split reads: good break-point resolution
  - De novo assembly: ~ the future
  [Figure labels: Sample; Reference; Lmap; read; contig]
  SV slides courtesy of Chip Stewart, Boston College
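The read-depth approach can be sketched as counting reads in fixed windows and scaling each count against the typical window. The example assumes a diploid sample, uses the median window count as the two-copy baseline, and ignores GC content and mappability corrections that real callers need; the function names are hypothetical.

```python
from statistics import median
from typing import List, Optional, Sequence


def window_counts(read_starts: Sequence[int], region_length: int, window: int) -> List[int]:
    """Count reads whose 0-based start falls in each fixed-size window of the region."""
    n_windows = (region_length + window - 1) // window
    counts = [0] * n_windows
    for start in read_starts:
        if 0 <= start < region_length:
            counts[start // window] += 1
    return counts


def call_copy_number(counts: Sequence[int], ploidy: int = 2) -> List[Optional[int]]:
    """Estimate an integer copy number per window relative to the median depth."""
    baseline = median(counts)
    if baseline <= 0:
        # No usable baseline: report every window as undetermined.
        return [None] * len(counts)
    return [round(ploidy * c / baseline) for c in counts]
```

Because counts in small windows are noisy, this only works for events spanning many reads, which is consistent with the slide's note that read depth is good for big CNVs.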

46 SV detection – resolution [Plot: relative numbers of events versus CNV event length [bp]; series: expected CNVs, karyotype, micro-array, sequencing]

47 Standard data formats
  - Reads: FASTQ
  - Alignments: SAM/BAM
  - Variants: VCF
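As an illustration of the simplest of these formats, the sketch below reads FASTQ records and decodes their quality strings. It assumes the common four-line record layout and the Phred+33 quality encoding; multi-line FASTQ and older Phred+64 files are not handled. The `FastqRecord` type and `read_fastq` function are names invented for this example.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        # Each quality character encodes one Phred score as an ASCII code minus the offset.
        yield FastqRecord(header[1:], sequence, [ord(ch) - offset for ch in quality])


# Usage with an in-memory example record.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
```

SAM/BAM and VCF are considerably richer formats and are best read with the dedicated tools rather than hand-written parsers.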

48 Tools for analyzing & manipulating 1000G data
  - samtools: http://samtools.sourceforge.net/
  - BamTools: http://sourceforge.net/projects/bamtools/
  - GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
  - VCFTools: http://vcftools.sourceforge.net/
  - VcfCTools: https://github.com/AlistairNWard/vcfCTools
  Formats: alignments in SAM/BAM; variants in VCF

