Canadian Bioinformatics Workshops www.bioinformatics.ca.


1 Canadian Bioinformatics Workshops www.bioinformatics.ca

2

3 Module 3 Genomic Variation Discovery

4 Genetic Variation Discovery bioinformatics.ca Topics Introduction Interpreting raw data Read alignment (MOSAIK) SNP Discovery (GigaBayes) Visualization (Gambit) 1000 Genomes Project

5 Genetic Variation Discovery bioinformatics.ca Genetic Variations: Why? Inherited diseases Ancestral history Phenotypic differences

6 Genetic Variation Discovery bioinformatics.ca Genetic Variations: SNPs & INDELs

7 Genetic Variation Discovery bioinformatics.ca Structural Variations Paul Medvedev review in prep July 2009

8 Genetic Variation Discovery bioinformatics.ca Epigenetic Variations: ChIPSeq Anjali Shah (AB) Nature Methods April 2009

9 Genetic Variation Discovery bioinformatics.ca Interpreting raw data

10 Genetic Variation Discovery bioinformatics.ca Basecalling: Intro how do we translate the machine readouts to base calls? how do we estimate and represent sequencing errors (base quality values)?

11 Genetic Variation Discovery bioinformatics.ca What is a base quality?
Base Quality    P error (obs. base)
3               50.12%
5               31.62%
10              10.00%
15              3.16%
20              1.00%
25              0.32%
30              0.10%
35              0.03%
40              0.01%
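The values in this table are consistent with the standard Phred relationship P = 10^(-Q/10). The slide does not name the formula, so the sketch below assumes it; the function names are illustrative only.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a base quality Q, assuming the Phred scale."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Inverse mapping: base quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)

# Reproduce the rows of the table above.
for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(f"{q:<4} {phred_to_error_prob(q):.2%}")
```

Under this assumption, every increase of 10 in base quality corresponds to a tenfold drop in the error probability.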

12 Genetic Variation Discovery bioinformatics.ca Calculating Error Rates [Figure: reads aligned against a reference sequence, with labels: substitution, insertion, deletion, paralogous alignment, local misalignment, polymorphic test data, unaligned reads]

13 Genetic Variation Discovery bioinformatics.ca Error Profile: Roche 454 error rate is low (< 0.5%) most errors are INDELs insertions deletions substitutions

14 Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina

15 Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina (36 bp)

16 Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina Variability

17 Genetic Variation Discovery bioinformatics.ca Logistic Regression “Original” Recalibrated Mark DePristo Broad Institute June 2009

18 Genetic Variation Discovery bioinformatics.ca Read Alignment

19 Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly

20 Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly

21 Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly

22 Genetic Variation Discovery bioinformatics.ca Sequencing Technologies future

23 Genetic Variation Discovery bioinformatics.ca

24 Genetic Variation Discovery bioinformatics.ca Pipeline Snapshot

25 Genetic Variation Discovery bioinformatics.ca How Does It Work?

26 Genetic Variation Discovery bioinformatics.ca How Does It Work?

27 Genetic Variation Discovery bioinformatics.ca

28 Genetic Variation Discovery bioinformatics.ca Functionality: Platform Specific

29 Genetic Variation Discovery bioinformatics.ca Enabling INDEL Discovery INDEL validation rate: 89.3% (216) SNP validation rate: 97.8% (229)

30 Genetic Variation Discovery bioinformatics.ca Combining Read Technologies

31 Genetic Variation Discovery bioinformatics.ca Paired-End Reads Jarvie & Harkins (454) Nature Methods May 2008

32 Genetic Variation Discovery bioinformatics.ca Resolving Paired-End Reads

33 Genetic Variation Discovery bioinformatics.ca MosaikCoverage

34 Genetic Variation Discovery bioinformatics.ca MosaikText
1 12 807910 807945 O_5_1_907_1935 1 36 + 0 ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
2 7 1019133 1019168 O_5_1_853_1522 1 36 - 0 ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
3 4 952257 952292 O_5_1_742_688 1 36 - 0 TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
4 8 176976 177011 O_5_1_1892_1827 1 36 - 0 TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
5 5 516470 516505 O_5_1_753_575 1 36 + 0 GAATGACACAATATCATTAGTGGTCCCTCAGTTATA
6 2 561582 561617 O_5_1_824_756 1 36 + 0 ACCTTACAACAGTGCTAAAGTAGTTACAGTAAACCA

42 O_5_1_1132_922 GAAATCTCATCTCAAGGAGAAGGAAACAGCAGATCC U0
43 O_5_1_499_472 GCAATTATTATAGCTTTGTCCGATTGTTCTCTCCCT U1
44 O_5_1_1161_922 GTTTATGATTTATCTGGTACAAGTCAGGCTGTTGTC U0
45 O_5_1_848_673 ACTAATTCATTCGTTTACGTCTCAAATGATTAATAA U0
46 O_5_1_887_756 AATATAACGGCCAGGTATATCATTGGATCTCCTTCA U0
47 O_5_1_987_943 GATATATACAGTGTTCTTGCCGACATAACGGCTTAG U0
48 O_5_1_1131_902 GTGATAAAAGAATGTAGGATTATTTATAAGTCTGTA U0
49 O_5_1_785_360 ATATGGATGAATATAAATACAAGGACAAAAAACGTG U0
50 O_5_1_908_742 AAATATATATCAGAATTCACATTAGACAGGGCACTG U0
51 O_5_1_813_721 ATCTTCGATAATAGCAGCCTCAATTTCAGCGGTAGA U0
52 O_5_1_1671_688 GGTTTTCAAAGGCAATTTTTGAGCAATATGGGTTTC U0
53 O_5_1_912_527 GATGGAGAAAGCTGCCTATAACTTTATGGTAAGGAG U0
54 O_5_1_224_721 GAAGTACAAAATGTTTTCAGCATGTTCTTTCATAAC U0
55 O_5_1_847_99 AATAGATGCGCCATCTCCGAGAAAAAGTCTAGACAA U1

Current formats: bed, eland, axt, sam/bam (more added upon request)

35 Genetic Variation Discovery bioinformatics.ca The Love of Ambiguity
IUPAC Code  Meaning  Complement    IUPAC Code  Meaning   Complement
A           A        T             S           C/G       S
C           C        G             Y           C/T       R
G           G        C             K           G/T       M
T           T        A             V           A/C/G     B
M           A/C      K             H           A/C/T     D
R           A/G      Y             D           A/G/T     H
W           A/T      W             B           C/G/T     V
X           null     X             N           A/C/G/T   N
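As a small illustration of how the complement column can be used, the sketch below reverse-complements a sequence that may contain ambiguity codes. The mapping is taken directly from the table; the names `IUPAC_COMPLEMENT` and `reverse_complement` are illustrative and not part of any tool named in these slides.

```python
# Complement of each IUPAC code, as listed in the table above.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "X": "X", "N": "N",
}

def reverse_complement(seq):
    """Reverse-complement a sequence, preserving IUPAC ambiguity codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(seq.upper()))

print(reverse_complement("AMRN"))
```

Because the table is symmetric (the complement of a complement is the original code), applying the function twice returns the input sequence.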

36 Genetic Variation Discovery bioinformatics.ca Aligners: Feature Set
Aligners: ELAND, MAQ, Newbler, SHRiMP, SOAP
Sequencing Platforms: Illumina 454 SOLiD Helicos Capillary Illumina SOLiD 454 Illumina SOLiD Helicos Illumina
Alignment Algorithm: Smith-Waterman Hash-based FlowMapper Smith-Waterman Hash-based
Co-assembly Creation
Gapped Alignments
Paired-end Reads
Platform Binaries: Windows, Mac, Linux, Solaris, iPhone Mac, Linux | Linux | Mac, Linux, Solaris Mac, Linux

37 Genetic Variation Discovery bioinformatics.ca Accuracy: Classification

38 Genetic Variation Discovery bioinformatics.ca Accuracy: Unique Read Alignment

39 Genetic Variation Discovery bioinformatics.ca Alignment Qualities mismatch BQs / total BQs information content (bits) actual alignment quality

40 Genetic Variation Discovery bioinformatics.ca SNP Discovery Gabor Marth

41 Genetic Variation Discovery bioinformatics.ca Genetic Variations: SNPs & INDELs

42 Genetic Variation Discovery bioinformatics.ca SNP Discovery: Goal sequencing errors SNP

43 Genetic Variation Discovery bioinformatics.ca SNP Discovery: Base Qualities High quality Low quality

44 Genetic Variation Discovery bioinformatics.ca SNPs & Bayesian Statistics base quality, # of individuals, allele call in read
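The slide lists the inputs to the Bayesian calculation (base quality and allele call in each read) but not the model itself. The following is a minimal sketch of one common formulation for a single diploid individual, not the method used by GigaBayes: the error model (an erroneous base is equally likely to be any of the other three), the Phred interpretation of base quality, and the prior values are all assumptions made for illustration.

```python
def base_likelihood(observed, allele, q):
    """P(observed base | true allele), assuming Phred quality q."""
    e = 10 ** (-q / 10)
    return 1 - e if observed == allele else e / 3

def genotype_posteriors(reads, ref, alt, prior_het=0.001):
    """Posterior over diploid genotypes at one site.

    reads: list of (base, quality) pairs covering the site.
    prior_het: assumed prior probability of a heterozygous site
               (hypothetical value; hom-alt is given half of it).
    """
    priors = {
        (ref, ref): 1 - 1.5 * prior_het,
        (ref, alt): prior_het,
        (alt, alt): prior_het / 2,
    }
    joint = {}
    for (a1, a2), prior in priors.items():
        likelihood = 1.0
        for base, q in reads:
            # Each read samples one of the two chromosomes with equal chance.
            likelihood *= 0.5 * base_likelihood(base, a1, q) \
                        + 0.5 * base_likelihood(base, a2, q)
        joint[f"{a1}/{a2}"] = prior * likelihood
    total = sum(joint.values())
    return {g: v / total for g, v in joint.items()}

reads = [("A", 30)] * 5 + [("C", 30)] * 5
print(genotype_posteriors(reads, ref="A", alt="C"))
```

With five high-quality reads supporting each allele, the heterozygous genotype dominates despite its small prior. A production caller would work in log space and pool evidence across individuals.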

45 Genetic Variation Discovery bioinformatics.ca SNP Discovery AACGTTAGCATA strain 1 strain 2 strain 3 haploid individual 1 individual 3 individual 2 diploid AACGTTCGCATA AACGTTAGCATA AACGTTCGCATA AACGTTAGCATA

46 Genetic Variation Discovery bioinformatics.ca Genotyping & Consensus Generation AACGTTAGCATA strain 1 [A] strain 2 [C] strain 3 [A] haploid individual 1 [A/C] individual 3 [A/A] individual 2 [C/C] diploid AACGTTCGCATA AACGTTAGCATA AACGTTCGCATA AACGTTAGCATA

47 Genetic Variation Discovery bioinformatics.ca Handling Trios Take advantage of duplicate data De novo mutation rate

48 Genetic Variation Discovery bioinformatics.ca QC: Coverage Auton & Hernandez Cornell University June 2009

49 Genetic Variation Discovery bioinformatics.ca QC: Inter-SNP Distance

50 Genetic Variation Discovery bioinformatics.ca QC: Hardy-Weinberg Violations Auton & Hernandez Cornell University June 2009 HapMap sites in red, other sites in blue. CEU, P(seg)>0.5, coverage 2-5x
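A Hardy-Weinberg check compares observed genotype counts at a site with the counts expected from the allele frequencies. The sketch below uses a plain chi-square statistic for a biallelic site; the slide does not say which test was used for the plot, so this is an assumed, illustrative choice, and the function name is hypothetical.

```python
def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions.

    n_aa, n_ab, n_bb: observed counts of the three genotypes at a
    biallelic site.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p                         # frequency of allele B
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

print(hwe_chi_square(25, 50, 25))   # counts in exact HW proportions
print(hwe_chi_square(50, 0, 50))    # no heterozygotes at all
```

A large statistic at a candidate SNP, such as a site with no heterozygotes, can indicate an alignment or genotyping artifact rather than a real variant.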

51 Genetic Variation Discovery bioinformatics.ca QC: Other metrics P(SNP) – Determining the optimal P(SNP) threshold Transitions:transversions – Adjusting filters so that the ratio approaches 2
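The transitions:transversions ratio can be computed directly from a list of called SNPs. This is a minimal sketch that assumes each SNP is given as a (reference base, alternate base) pair; the input representation and function name are illustrative.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T).
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ts_tv_ratio(snps):
    """Ratio of transitions to transversions for (ref, alt) base pairs."""
    ts = sum(1 for ref, alt in snps if (ref, alt) in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")

calls = [("A", "G"), ("C", "T"), ("A", "C")]
print(ts_tv_ratio(calls))
```

Recomputing this ratio after each change to the filters shows whether the call set is moving toward the target of about 2 mentioned on the slide.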

52 Genetic Variation Discovery bioinformatics.ca Visualisation Derek Barnett

53 Genetic Variation Discovery bioinformatics.ca Visualization: Consed

54 Genetic Variation Discovery bioinformatics.ca Visualization: Gambit Data validation Hypothesis generation Software development aid BAM support Firefox-like plugins

55 Genetic Variation Discovery bioinformatics.ca 1000 Genomes Project

56 Genetic Variation Discovery bioinformatics.ca 1000G: Goals Discover genetic variations – 1 % minor allele frequencies across genome – 0.1 – 0.5 % MAF across gene regions Variant alleles – Estimate frequencies – Identify haplotype background – Characterize linkage disequilibrium

57 Genetic Variation Discovery bioinformatics.ca 1000G: Pilot Projects
Pilot 1: Low coverage
– 180 samples (70 samples @ 4X, 110 samples @ 2X)
– 2.7 Tbp total (202 Gbp 454, 1.8 Tbp Illumina, 640 Gbp AB SOLiD)
Pilot 2: Deep trios (CEU & YRI)
– 6 samples
– 1.1 Tbp total (87 Gbp 454, 773 Gbp Illumina, 270 Gbp AB SOLiD)
Pilot 3: Exon capture
– 607 samples
– 2.2 Mbp of targets (8800 targets)
– 10 – 20x coverage

58 Genetic Variation Discovery bioinformatics.ca Pilot 2: Chr1 SNP Concordance

59 Genetic Variation Discovery bioinformatics.ca Pilot 2: Chr1 SNP Concordance

60 Genetic Variation Discovery bioinformatics.ca Pilot 2: INDEL Validation 1.0 = 100 % in one category 4.0 = 100 % in all categories

61 Genetic Variation Discovery bioinformatics.ca What have we learned? Garbage In, Garbage Out – SNP calls depend on the alignments – Alignments depend on the base calls – Base calls depend on accurate interpretation of machine readouts Choose the right tools Population genetics seems to be the ultimate quality control for SNP calls

62 Genetic Variation Discovery bioinformatics.ca The Usual Suspects L to R: Jiantao, Tony, Michele, Chip, Amit, Wen Fung, Deniz, Michael, Maddy, Gábor Derek

