1
Canadian Bioinformatics Workshops www.bioinformatics.ca
3
Module 3 Genomic Variation Discovery
4
Genetic Variation Discovery bioinformatics.ca Topics
Introduction
Interpreting raw data
Read alignment (MOSAIK)
SNP Discovery (GigaBayes)
Visualization (Gambit)
1000 Genomes Project
5
Genetic Variation Discovery bioinformatics.ca Genetic Variations: Why?
Inherited diseases
Ancestral history
Phenotypic differences
6
Genetic Variation Discovery bioinformatics.ca Genetic Variations: SNPs & INDELs
7
Genetic Variation Discovery bioinformatics.ca Structural Variations (Paul Medvedev, review in prep, July 2009)
8
Genetic Variation Discovery bioinformatics.ca Epigenetic Variations: ChIPSeq (Anjali Shah (AB), Nature Methods, April 2009)
9
Genetic Variation Discovery bioinformatics.ca Interpreting raw data
10
Genetic Variation Discovery bioinformatics.ca Basecalling: Intro
How do we translate the machine readouts to base calls?
How do we estimate and represent sequencing errors (base quality values)?
11
Genetic Variation Discovery bioinformatics.ca What is a base quality?
Base Quality    P error (obs. base)
3               50.12%
5               31.62%
10              10.00%
15              3.16%
20              1.00%
25              0.32%
30              0.10%
35              0.03%
40              0.01%
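The percentages on this slide match the standard Phred relationship, P(error) = 10^(-Q/10). A minimal sketch of that conversion follows; the function names are illustrative and not taken from the workshop software.

```python
import math


def phred_to_error_probability(quality: float) -> float:
    """Convert a Phred-scaled base quality Q to P(error) = 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def error_probability_to_phred(p_error: float) -> float:
    """Inverse conversion: Q = -10 * log10(P(error))."""
    return -10 * math.log10(p_error)


# Print the error probability for each quality value listed on the slide.
for q in (3, 5, 10, 15, 20, 25, 30, 35, 40):
    print(f"Q{q:<3} {phred_to_error_probability(q) * 100:.2f}%")
```

Each step of 10 in base quality corresponds to a tenfold drop in the estimated error probability.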
12
Genetic Variation Discovery bioinformatics.ca Calculating Error Rates
Error types: substitution, insertion, deletion
ref   atggat*agtataacgtcaggctaaactgtagtatatggataaaatgaccatacgattaca
read  tggatgagtataa*gtcaag
read  gtatatggataaaatcaccata
Figure labels: paralogous alignment, local misalignment, polymorphic test data, unaligned reads
ref   atggattagtataacgtcaggctaaactgtagtatatggataaaatgaccatacgattaca
read  tggattagtataacgtcagc
read  tatatggctaaaatgaccata
unaligned read  attagtactaccatgtagtac
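As a rough illustration of how the three error types can be tallied, the sketch below walks a gapped pairwise alignment column by column. It assumes that `*` marks a gap, and that the first read on the slide aligns to the reference starting at its second base; that pairing is a reading of the flattened figure, not something the slide states. Normalizing by the number of read bases is also an assumption.

```python
from collections import Counter

GAP = "*"  # assumed gap character, as it appears in the slide's sequences


def count_errors(ref: str, read: str) -> Counter:
    """Classify each column of a gapped pairwise alignment."""
    if len(ref) != len(read):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter()
    for ref_base, read_base in zip(ref, read):
        if ref_base == GAP and read_base == GAP:
            continue  # column carries no information
        if ref_base == GAP:
            counts["insertion"] += 1  # read has a base the reference lacks
        elif read_base == GAP:
            counts["deletion"] += 1  # read is missing a reference base
        elif ref_base != read_base:
            counts["substitution"] += 1
        else:
            counts["match"] += 1
    return counts


# Reference window and read taken from the slide (alignment offset assumed).
ref = "tggat*agtataacgtcagg"
read = "tggatgagtataa*gtcaag"

counts = count_errors(ref, read)
read_bases = sum(1 for base in read if base != GAP)
errors = counts["substitution"] + counts["insertion"] + counts["deletion"]
print(dict(counts), f"errors per read base = {errors / read_bases:.3f}")
```

A tally like this is only as good as the alignment behind it: paralogous alignments, local misalignments, and true polymorphisms in the test data would all be counted as sequencing errors.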
13
Genetic Variation Discovery bioinformatics.ca Error Profile: Roche 454
Error rate is low (< 0.5%)
Most errors are INDELs
Plot legend: insertions, deletions, substitutions
14
Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina
15
Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina (36 bp)
16
Genetic Variation Discovery bioinformatics.ca Error Profile: Illumina Variability
17
Genetic Variation Discovery bioinformatics.ca Logistic Regression: “Original” vs. Recalibrated (Mark DePristo, Broad Institute, June 2009)
18
Genetic Variation Discovery bioinformatics.ca Read Alignment
19
Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly
20
Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly
21
Genetic Variation Discovery bioinformatics.ca Crash Course: Reference-guided Assembly
22
Genetic Variation Discovery bioinformatics.ca Sequencing Technologies future
23
Genetic Variation Discovery bioinformatics.ca
24
Genetic Variation Discovery bioinformatics.ca Pipeline Snapshot
25
Genetic Variation Discovery bioinformatics.ca How Does It Work?
26
Genetic Variation Discovery bioinformatics.ca How Does It Work?
27
Genetic Variation Discovery bioinformatics.ca
28
Genetic Variation Discovery bioinformatics.ca Functionality: Platform Specific
29
Genetic Variation Discovery bioinformatics.ca Enabling INDEL Discovery
INDEL validation rate: 89.3 % (216)
SNP validation rate: 97.8 % (229)
30
Genetic Variation Discovery bioinformatics.ca Combining Read Technologies
31
Genetic Variation Discovery bioinformatics.ca Paired-End Reads (Jarvie & Harkins (454), Nature Methods, May 2008)
32
Genetic Variation Discovery bioinformatics.ca Resolving Paired-End Reads
33
Genetic Variation Discovery bioinformatics.ca MosaikCoverage
34
Genetic Variation Discovery bioinformatics.ca MosaikText
1 12 807910 807945 O_5_1_907_1935 1 36 + 0 ACCCTTGAAAAATGTTCGTTGACTCTAAATGAAATA
2 7 1019133 1019168 O_5_1_853_1522 1 36 - 0 ATCGAAAGCCCGCATCATTTTGATCTGCATCCTCAC
3 4 952257 952292 O_5_1_742_688 1 36 - 0 TGGATCTCTCTTGAATACGTACAATGATACTGTTAT
4 8 176976 177011 O_5_1_1892_1827 1 36 - 0 TGTGCGTTCTTTGCGGATATGGAAAATCTTGATATC
5 5 516470 516505 O_5_1_753_575 1 36 + 0 GAATGACACAATATCATTAGTGGTCCCTCAGTTATA
6 2 561582 561617 O_5_1_824_756 1 36 + 0 ACCTTACAACAGTGCTAAAGTAGTTACAGTAAACCA

42 O_5_1_1132_922 GAAATCTCATCTCAAGGAGAAGGAAACAGCAGATCC U0
43 O_5_1_499_472 GCAATTATTATAGCTTTGTCCGATTGTTCTCTCCCT U1
44 O_5_1_1161_922 GTTTATGATTTATCTGGTACAAGTCAGGCTGTTGTC U0
45 O_5_1_848_673 ACTAATTCATTCGTTTACGTCTCAAATGATTAATAA U0
46 O_5_1_887_756 AATATAACGGCCAGGTATATCATTGGATCTCCTTCA U0
47 O_5_1_987_943 GATATATACAGTGTTCTTGCCGACATAACGGCTTAG U0
48 O_5_1_1131_902 GTGATAAAAGAATGTAGGATTATTTATAAGTCTGTA U0
49 O_5_1_785_360 ATATGGATGAATATAAATACAAGGACAAAAAACGTG U0
50 O_5_1_908_742 AAATATATATCAGAATTCACATTAGACAGGGCACTG U0
51 O_5_1_813_721 ATCTTCGATAATAGCAGCCTCAATTTCAGCGGTAGA U0
52 O_5_1_1671_688 GGTTTTCAAAGGCAATTTTTGAGCAATATGGGTTTC U0
53 O_5_1_912_527 GATGGAGAAAGCTGCCTATAACTTTATGGTAAGGAG U0
54 O_5_1_224_721 GAAGTACAAAATGTTTTCAGCATGTTCTTTCATAAC U0
55 O_5_1_847_99 AATAGATGCGCCATCTCCGAGAAAAAGTCTAGACAA U1

Current formats: bed, eland, axt, sam/bam (more added upon request)
35
Genetic Variation Discovery bioinformatics.ca The Love of Ambiguity
IUPAC Code   Meaning    Complement
A            A          T
C            C          G
G            G          C
T            T          A
M            A/C        K
R            A/G        Y
W            A/T        W
S            C/G        S
Y            C/T        R
K            G/T        M
V            A/C/G      B
H            A/C/T      D
D            A/G/T      H
B            C/G/T      V
X            null       X
N            A/C/G/T    N
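The code, meaning, and complement columns on this slide translate directly into lookup tables. The sketch below is one way to use them for reverse-complementing a sequence that contains ambiguity codes; the dictionary and function names are illustrative.

```python
# Complement of each IUPAC code, as listed on the slide.
IUPAC_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A",
    "M": "K", "R": "Y", "W": "W", "S": "S",
    "Y": "R", "K": "M", "V": "B", "H": "D",
    "D": "H", "B": "V", "X": "X", "N": "N",
}

# Bases represented by each code (X is "null", i.e. no base).
IUPAC_MEANING = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG",
    "Y": "CT", "K": "GT", "V": "ACG", "H": "ACT",
    "D": "AGT", "B": "CGT", "N": "ACGT", "X": "",
}


def reverse_complement(sequence: str) -> str:
    """Reverse-complement a sequence that may contain ambiguity codes."""
    return "".join(IUPAC_COMPLEMENT[base] for base in reversed(sequence.upper()))


def table_is_consistent() -> bool:
    """Check that complementing a code's bases gives its complement's bases."""
    for code, bases in IUPAC_MEANING.items():
        complemented = {IUPAC_COMPLEMENT[base] for base in bases}
        if complemented != set(IUPAC_MEANING[IUPAC_COMPLEMENT[code]]):
            return False
    return True


print(reverse_complement("ACGTMRN"))
print(table_is_consistent())
```

The consistency check confirms, for example, that M (A/C) complements to K (G/T) because A pairs with T and C pairs with G.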
36
Genetic Variation Discovery bioinformatics.ca Aligners: Feature Set
Aligners: ELAND, MAQ, Newbler, SHRiMP, SOAP
Sequencing Platforms: Illumina 454 SOLiD Helicos Capillary Illumina SOLiD 454 Illumina SOLiD Helicos Illumina
Alignment Algorithm: Smith-Waterman, Hash-based, FlowMapper, Smith-Waterman, Hash-based
Co-assembly Creation
Gapped Alignments
Paired-end Reads
Platform Binaries: Windows, Mac, Linux, Solaris, iPhone; Mac, Linux; Linux; Mac, Linux, Solaris; Mac, Linux
37
Genetic Variation Discovery bioinformatics.ca Accuracy: Classification
38
Genetic Variation Discovery bioinformatics.ca Accuracy: Unique Read Alignment
39
Genetic Variation Discovery bioinformatics.ca Alignment Qualities
Plot labels: mismatch BQs / total BQs; information content (bits); actual alignment quality
40
Genetic Variation Discovery bioinformatics.ca SNP Discovery Gabor Marth
41
Genetic Variation Discovery bioinformatics.ca Genetic Variations: SNPs & INDELs
42
Genetic Variation Discovery bioinformatics.ca SNP Discovery: Goal
Figure labels: sequencing errors, SNP
43
Genetic Variation Discovery bioinformatics.ca SNP Discovery: Base Qualities
Figure labels: High quality, Low quality
44
Genetic Variation Discovery bioinformatics.ca SNPs & Bayesian Statistics
Figure labels: base quality, # of individuals, allele call in read
45
Genetic Variation Discovery bioinformatics.ca SNP Discovery
Haploid: strain 1, strain 2, strain 3
Diploid: individual 1, individual 2, individual 3
Sequences shown: AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA
46
Genetic Variation Discovery bioinformatics.ca Genotyping & Consensus Generation
Haploid: strain 1 [A], strain 2 [C], strain 3 [A]
Diploid: individual 1 [A/C], individual 2 [C/C], individual 3 [A/A]
Sequences shown: AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA, AACGTTCGCATA, AACGTTAGCATA
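To make the Bayesian genotyping idea concrete, here is a minimal sketch of a diploid genotype posterior computed from allele calls and their base qualities. It is a generic textbook-style model, not GigaBayes's actual algorithm: it assumes independent reads, a uniform genotype prior by default, and that a miscalled base is equally likely to be any of the other three bases.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(calls, priors=None):
    """Posterior over diploid genotypes.

    calls  -- list of (base, phred_quality) pairs observed at one site
    priors -- optional dict mapping genotype tuples to prior probabilities
              (uniform if omitted; an assumption of this sketch)
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        priors = {g: 1 / len(genotypes) for g in genotypes}

    unnormalized = {}
    for genotype in genotypes:
        likelihood = 1.0
        for base, quality in calls:
            p_error = 10 ** (-quality / 10)
            # Each read samples one of the two alleles with probability 1/2.
            likelihood *= sum(
                0.5 * ((1 - p_error) if base == allele else p_error / 3)
                for allele in genotype
            )
        unnormalized[genotype] = priors[genotype] * likelihood

    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


def call_genotype(calls):
    """Return the most probable genotype and its posterior probability."""
    posteriors = genotype_posteriors(calls)
    best = max(posteriors, key=posteriors.get)
    return best, posteriors[best]


# Four A reads and four C reads at Q30, as for a heterozygous individual.
het_calls = [("A", 30)] * 4 + [("C", 30)] * 4
print(call_genotype(het_calls))
```

A site would then be reported as a candidate SNP when the posterior mass on non-reference genotypes exceeds a chosen threshold.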
47
Genetic Variation Discovery bioinformatics.ca Handling Trios
Take advantage of duplicate data
De novo mutation rate
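One simple way trio data can be exploited is a Mendelian consistency check: a child's genotype should take one allele from each parent. The sketch below is an illustrative check under that assumption, not the workshop's method; a violation may indicate either a genotyping error or a candidate de novo mutation.

```python
def is_mendelian_consistent(father, mother, child):
    """True if the child genotype can take one allele from each parent.

    Each genotype is a pair of alleles, e.g. ("A", "C").
    """
    first, second = child
    return (first in father and second in mother) or (
        second in father and first in mother
    )


trios = [
    (("A", "A"), ("C", "C"), ("A", "C")),  # consistent
    (("A", "A"), ("C", "C"), ("A", "A")),  # violation: no A from the mother
    (("A", "C"), ("A", "C"), ("C", "C")),  # consistent
]
for father, mother, child in trios:
    print(father, mother, child, is_mendelian_consistent(father, mother, child))
```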
48
Genetic Variation Discovery bioinformatics.ca QC: Coverage (Auton & Hernandez, Cornell University, June 2009)
49
Genetic Variation Discovery bioinformatics.ca QC: Inter-SNP Distance
50
Genetic Variation Discovery bioinformatics.ca QC: Hardy-Weinberg Violations (Auton & Hernandez, Cornell University, June 2009)
HapMap sites in red, other sites in blue. CEU, P(seg) > 0.5, coverage 2-5x
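A common way to flag Hardy-Weinberg violations is a chi-square comparison of observed genotype counts against the counts expected from the estimated allele frequency. The sketch below shows that textbook test; it is not necessarily the statistic used for the plot on this slide.

```python
def hardy_weinberg_chi_square(n_aa: int, n_ab: int, n_bb: int) -> float:
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("at least one genotype is required")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


# Genotype counts in perfect Hardy-Weinberg proportions.
print(hardy_weinberg_chi_square(25, 50, 25))
# A complete absence of heterozygotes.
print(hardy_weinberg_chi_square(50, 0, 50))
```

With one degree of freedom, a statistic above about 3.84 corresponds to p < 0.05. A deficit of heterozygotes at a site is one pattern that can arise when low coverage causes one allele to be missed.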
51
Genetic Variation Discovery bioinformatics.ca QC: Other metrics
P(SNP)
– Determining the optimal P(SNP) threshold
Transitions:transversions
– Adjusting filters so that the ratio approaches 2
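The transition:transversion ratio can be computed directly from a list of called SNPs. A minimal sketch, assuming each SNP is given as a (reference allele, alternate allele) pair; the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition stays within purines (A<->G) or pyrimidines (C<->T)."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv_ratio(snps) -> float:
    """Ratio of transitions to transversions for (ref, alt) allele pairs."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions; ratio is undefined")
    return transitions / transversions


snps = [("A", "G"), ("C", "T"), ("G", "A"), ("T", "C"), ("A", "C"), ("G", "T")]
print(ti_tv_ratio(snps))
```

Random errors yield roughly twice as many transversions as transitions (eight possible transversions versus four transitions), so a ratio well below 2 suggests that false positives are diluting the call set.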
52
Genetic Variation Discovery bioinformatics.ca Visualization (Derek Barnett)
53
Genetic Variation Discovery bioinformatics.ca Visualization: Consed
54
Genetic Variation Discovery bioinformatics.ca Visualization: Gambit
Data validation
Hypothesis generation
Software development aid
BAM support
Firefox-like plugins
55
Genetic Variation Discovery bioinformatics.ca 1000 Genomes Project
56
Genetic Variation Discovery bioinformatics.ca 1000G: Goals
Discover genetic variations
– 1 % minor allele frequency (MAF) across the genome
– 0.1 – 0.5 % MAF across gene regions
Variant alleles
– Estimate frequencies
– Identify haplotype background
– Characterize linkage disequilibrium
57
Genetic Variation Discovery bioinformatics.ca 1000G: Pilot Projects
Pilot 1: Low coverage
– 180 samples (70 samples @ 4X, 110 samples @ 2X)
– 2.7 Tbp total (202 Gbp 454, 1.8 Tbp Illumina, 640 Gbp AB SOLiD)
Pilot 2: Deep trios (CEU & YRI)
– 6 samples
– 1.1 Tbp total (87 Gbp 454, 773 Gbp Illumina, 270 Gbp AB SOLiD)
Pilot 3: Exon capture
– 607 samples
– 2.2 Mbp of targets, 8800 targets
– 10 – 20x coverage
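The per-platform figures on this slide can be cross-checked against the stated totals with a little arithmetic. The sketch below uses only the numbers from the slide; the variable names are illustrative, and the Illumina figure for Pilot 1 is taken as 1800 Gbp, which is itself a rounded value.

```python
# Sequence yield per platform in Gbp, as stated on the slide.
pilot1_gbp = {"454": 202, "Illumina": 1800, "AB SOLiD": 640}
pilot2_gbp = {"454": 87, "Illumina": 773, "AB SOLiD": 270}

pilot1_total_gbp = sum(pilot1_gbp.values())
pilot2_total_gbp = sum(pilot2_gbp.values())

# Pilot 1 sample counts by coverage tier.
pilot1_samples = 70 + 110

# Pilot 3: average target length from total target size and target count.
pilot3_target_bp = 2_200_000
pilot3_targets = 8800
pilot3_mean_target_bp = pilot3_target_bp / pilot3_targets

print(f"Pilot 1: {pilot1_samples} samples, {pilot1_total_gbp / 1000:.2f} Tbp")
print(f"Pilot 2: {pilot2_total_gbp / 1000:.2f} Tbp")
print(f"Pilot 3: {pilot3_mean_target_bp:.0f} bp per target on average")
```

The Pilot 2 sum agrees with the stated 1.1 Tbp. The Pilot 1 sum comes to about 2.64 Tbp, slightly under the stated 2.7 Tbp, which is plausible given that the Illumina yield is quoted to only two significant figures.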
58
Genetic Variation Discovery bioinformatics.ca Pilot 2: Chr1 SNP Concordance
59
Genetic Variation Discovery bioinformatics.ca Pilot 2: Chr1 SNP Concordance
60
Genetic Variation Discovery bioinformatics.ca Pilot 2: INDEL Validation
1.0 = 100 % in one category; 4.0 = 100 % in all categories
61
Genetic Variation Discovery bioinformatics.ca What have we learned?
Garbage In, Garbage Out
– SNP calls depend on the alignments
– Alignments depend on the base calls
– Base calls depend on accurate interpretation of machine readouts
Choose the right tools
Population genetics seems to be the ultimate quality control for SNP calls
62
Genetic Variation Discovery bioinformatics.ca The Usual Suspects L to R: Jiantao, Tony, Michele, Chip, Amit, Wen Fung, Deniz, Michael, Maddy, Gábor Derek