Published by Dina Wilcox. Modified over 9 years ago.
1
Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo
3
Alignment. ATCGGGAATGCCGTTAACGGTTGGCGT Reference genome: the human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability of unique alignment? The chance that a random 100 bp sequence matches one given position is 1/(4 x 4 x 4 … 4) = 1/4^100 ≈ 1/(1.60694 x 10^60).
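The arithmetic on this slide can be checked with a short sketch. It assumes every base is drawn independently and uniformly from A, C, G, T, which is a simplification: the real genome contains repeats, so actual reads are not guaranteed to align uniquely. The constant and variable names are illustrative only.

```python
from fractions import Fraction

READ_LENGTH = 100               # read length in bp, from the slide
GENOME_LENGTH = 3_000_000_000   # approximate human genome length, from the slide

# Number of distinct sequences of length 100 over a 4-letter alphabet.
possible_reads = 4 ** READ_LENGTH

# Probability that a random 100-mer matches one fixed genome position
# (assumption: independent, uniformly distributed bases).
p_match = Fraction(1, possible_reads)

# Expected number of chance matches over all start positions in the genome.
expected_chance_matches = GENOME_LENGTH * p_match

print(f"4^{READ_LENGTH} = {float(possible_reads):.5e}")
print(f"expected chance matches = {float(expected_chance_matches):.3e}")
```

Under this simplified model the expected number of chance matches is vanishingly small, which is why a 100 bp read usually identifies a single location outside repetitive regions.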
4
Alignment Tools: BWA (http://bio-bwa.sourceforge.net/); Bowtie (http://bowtie-bio.sourceforge.net/index.shtml). Both are based on the Burrows-Wheeler transform. Doing exhaustive alignment for 30 million reads would take about 30 million x 3 billion = 9 x 10^16 time units.
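To make the Burrows-Wheeler idea concrete, here is a naive sketch of the transform and its inverse built from sorted rotations. It is for illustration only: BWA and Bowtie use an FM-index over the transformed reference rather than this quadratic construction, and the function names here are not taken from either tool.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Return the Burrows-Wheeler transform of text using sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    s = text + terminator
    # All cyclic rotations, sorted lexicographically ('$' sorts before letters).
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Recover the original text by repeatedly prepending the column and sorting."""
    n = len(transformed)
    table = [""] * n
    for _ in range(n):
        table = sorted(transformed[i] + table[i] for i in range(n))
    # The row that ends with the terminator is the original text plus terminator.
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


read = "ATCGGGAATGCCGTTAACGGTTGGCGT"  # example sequence from the alignment slide
transformed = bwt(read)
print(transformed)
print(inverse_bwt(transformed) == read)
```

The transform is reversible and groups identical characters together, which is what lets an index over it answer "where does this read occur?" without scanning all 3 billion positions for each read.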
5
Alignment Results: BAM files. SAM: uncompressed; BAM: compressed (format specification: http://samtools.github.io/hts-specs/SAMv1.pdf). Sort and index before performing analysis. Don’t forget to perform QC on the alignment.
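As a rough illustration of what a SAM record holds and of one simple alignment QC number, the sketch below parses the mandatory tab-separated fields and computes the fraction of mapped reads. The two example records are made up for this sketch; the flag bits used (0x4 unmapped, 0x10 reverse strand) come from the SAM specification. Real pipelines would normally read BAM files through a dedicated library rather than parse text by hand.

```python
from typing import Iterable, NamedTuple

UNMAPPED = 0x4          # SAM flag bit: segment unmapped
REVERSE_STRAND = 0x10   # SAM flag bit: read aligned to the reverse strand


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    seq: str


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line; SAM has 11 mandatory tab-separated fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5], seq=fields[9],
    )


def mapped_fraction(records: Iterable[SamRecord]) -> float:
    """Fraction of records whose unmapped flag bit is not set."""
    records = list(records)
    if not records:
        raise ValueError("no records given")
    mapped = sum(1 for r in records if not r.flag & UNMAPPED)
    return mapped / len(records)


# Hypothetical records: one reverse-strand alignment and one unmapped read.
seq = "ATCGGGAATGCCGTTAACGGTTGGCGT"
example_lines = [
    "\t".join(["read1", "16", "chr1", "1000", "60", "27M", "*", "0", "0", seq, "I" * 27]),
    "\t".join(["read2", "4", "*", "0", "0", "*", "*", "0", "0", seq, "I" * 27]),
]
records = [parse_sam_line(line) for line in example_lines]
print(mapped_fraction(records))
```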
6
How to call SNPs http://www.broadinstitute.org/igv/
7
Local Realignment
8
Recalibration Why do we need realignment and recalibration for DNA but not RNA?
9
SNP calling: GATK (https://www.broadinstitute.org/gatk/); VarScan (http://varscan.sourceforge.net/)
10
VCF files
11
Annotation using ANNOVAR http://www.openbioinformatics.org/annovar/
12
Somatic Mutation: different from SNP (not germline). Both tumor and normal samples are needed to accurately define a somatic mutation. A tumor sample is almost never 100% tumor.
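Why tumor purity matters can be shown with a small worked formula. The sketch below computes the expected variant allele fraction (VAF) of a somatic mutation in a mixed tumor/normal sample. It assumes the mutation is clonal and that copy numbers are known; the function and parameter names are illustrative and not taken from any caller mentioned in these slides.

```python
def expected_vaf(
    purity: float,
    mutant_copies: int = 1,
    tumor_copy_number: int = 2,
    normal_copy_number: int = 2,
) -> float:
    """Expected fraction of reads carrying a clonal somatic mutation.

    purity is the fraction of cells in the sample that are tumor cells.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    # Mutant alleles contributed per average cell in the sample.
    mutant_alleles = purity * mutant_copies
    # All alleles at the locus per average cell (tumor plus normal cells).
    total_alleles = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant_alleles / total_alleles


for purity in (1.0, 0.6, 0.2):
    print(purity, round(expected_vaf(purity), 3))
```

For a heterozygous mutation in a diploid region this reduces to purity / 2, so a sample that is 20% tumor is expected to show the mutation in only about 10% of reads, which is easy to confuse with sequencing error without a matched normal.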
13
Somatic mutation callers: MuTect (http://www.broadinstitute.org/cancer/cga/mutect); VarScan (http://varscan.sourceforge.net/)
14
Quality Control on SNPs: number of novel non-synonymous SNPs (~100 to 200); transition / transversion ratio; heterozygous / non-reference homozygous ratio; heterozygous consistency; strand bias; cycle bias.
15
Ti/Tv ratio
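A minimal sketch of how a Ti/Tv ratio can be computed from a list of (reference, alternate) base pairs. Transitions are purine-to-purine or pyrimidine-to-pyrimidine changes (A<->G, C<->T); every other single-base change is a transversion. The input list here is made up for illustration.

```python
BASES = set("ACGT")
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ti_tv_ratio(snps):
    """Return transitions / transversions for an iterable of (ref, alt) pairs."""
    transitions = transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        # Skip anything that is not a single-base substitution.
        if ref not in BASES or alt not in BASES or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions found; ratio is undefined")
    return transitions / transversions


# Hypothetical calls: three transitions, one transversion, one indel (ignored).
example_snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "C"), ("AT", "A")]
print(ti_tv_ratio(example_snps))
```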
16
Heterozygous / non reference homozygous ratio
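The ratio on this slide can be sketched from VCF-style genotype strings. The sketch assumes diploid genotypes written as "0/1" or "0|1", where 0 is the reference allele; missing calls and homozygous reference calls are not counted. The example genotypes are made up.

```python
def het_hom_ratio(genotypes):
    """Return heterozygous / non-reference homozygous counts for diploid calls."""
    het = hom_alt = 0
    for genotype in genotypes:
        alleles = genotype.replace("|", "/").split("/")
        # Skip non-diploid or missing genotypes such as "./.".
        if len(alleles) != 2 or "." in alleles:
            continue
        if alleles[0] != alleles[1]:
            het += 1
        elif alleles[0] != "0":
            hom_alt += 1
    if hom_alt == 0:
        raise ValueError("no non-reference homozygous calls; ratio is undefined")
    return het / hom_alt


# Hypothetical sample: three heterozygous calls and one non-reference homozygous call.
example_genotypes = ["0/1", "0/1", "1|0", "1/1", "0/0", "./."]
print(het_hom_ratio(example_genotypes))
```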
17
Ti/Tv ratio by race and regions
18
Heterozygous / non reference homozygous ratio by race and regions
19
Heterozygous Genotype Consistency
20
Strand Bias
Table 1. Strand bias examples from real data
ChrPos      depth  a   b   c   d  Forward Strand Genotype  Reverse Strand Genotype
632975014   21     5   5   10  1  Heterozygous             Homozygous
181967962   38     20  11  7   0  Heterozygous             Homozygous
1210215654  31     15  9   7   0  Heterozygous             Homozygous
a: forward strand reference allele; b: forward strand non-reference allele; c: reverse strand reference allele; d: reverse strand non-reference allele
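One common way to quantify strand bias is a two-sided Fisher's exact test on the 2x2 table of allele counts by strand. The sketch below is an illustration of that test and not necessarily the score used in these slides. The example counts a=5, b=5, c=10, d=1 follow our reading of the first row of Table 1 (depth 21), which is an assumption because the table digits run together in the source.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    a, b: forward strand reference / non-reference allele counts
    c, d: reverse strand reference / non-reference allele counts
    """
    forward, reverse, reference = a + b, c + d, a + c
    total = forward + reverse
    denominator = comb(total, reference)

    def probability(x: int) -> float:
        # Hypergeometric probability of x reference alleles on the forward strand.
        return comb(forward, x) * comb(reverse, reference - x) / denominator

    lowest = max(0, reference - reverse)
    highest = min(forward, reference)
    observed = probability(a)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    p_value = sum(
        p for p in (probability(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


print(round(fisher_exact_two_sided(5, 5, 10, 1), 4))
```

A small p-value suggests that the two strands disagree about the allele balance, as in the table rows where the forward strand looks heterozygous and the reverse strand looks homozygous.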
21
Cycle Bias
22
Pooled Analysis: pool samples together without barcodes. Saves money. Can only be used to evaluate allele frequency.
23
Pooled Analysis - Conclusion
24
Advanced Data Mining
25
The known and unknown of sequencing data
28
Known: things we have always known that sequencing data can do. SNV, mutation; CNV (Xie et al. BMC Bioinformatics, 2009); structural variants (Alkan et al. Nature Reviews Genetics, 2011)
29
Known Unknown – Other information we found that sequencing data contain
30
How is additional data mining possible? Data mining is possible because capture techniques are not perfect.
31
Capture Efficiency of The Three Major Capture Kits
32
Potential Functions of Intron and Intergenic Regions: ENCODE suggested that over 80% of the human genome may be functional. The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic).
33
Coverage of the Unintended Regions: coverage does not drop off suddenly where the capture region ends. Capture region example: chr1 1000 1500
34
Reads Aligned to Non-Target Regions Can Be Used to Detect SNPs. Tibetan exome study: through exome sequencing of 50 Tibetan subjects, 2 intronic SNPs were identified as associated with high altitude (Yi, et al. Science, 2010). Non-capture region study: reads from non-capture regions were studied to show that they can be used to infer reliable SNPs (Guo, et al. BMC Genomics).
35
Known unknown: Mitochondria. However, the mitochondrial genome is only 16,569 bp. Assumptions: 40 million reads, 100 bp read length.
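With the assumptions on this slide, the average depth over the mitochondrial genome follows from simple arithmetic once a fraction of reads mapping to it is chosen. The fractions in the sketch below are hypothetical values picked for illustration; the slides do not state the actual fraction.

```python
MITO_LENGTH = 16_569       # mitochondrial genome length in bp, from the slide
TOTAL_READS = 40_000_000   # assumption from the slide
READ_LENGTH = 100          # assumption from the slide, in bp


def mito_depth(fraction_of_reads: float) -> float:
    """Average depth over the mitochondrial genome for a given read fraction."""
    if not 0.0 <= fraction_of_reads <= 1.0:
        raise ValueError("fraction_of_reads must be between 0 and 1")
    mito_reads = TOTAL_READS * fraction_of_reads
    # Depth = total sequenced bases on the region / region length.
    return mito_reads * READ_LENGTH / MITO_LENGTH


# Hypothetical fractions of reads that align to the mitochondrial genome.
for fraction in (0.0001, 0.001, 0.01):
    print(f"{fraction:.2%} of reads -> about {mito_depth(fraction):.0f}x depth")
```

Because the target is so short, even a very small share of off-target reads can give usable depth over the whole mitochondrial genome.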
36
Dealing with nuMTs
37
Alignment Results
38
Extract mitochondria from exome sequencing. Tools: Picardi et al. Nature Methods, 2012; Guo et al. Bioinformatics, 2013 (MitoSeek). Diagnosis: Dinwiddie et al. Genomics, 2013; Nemeth et al. Brain, 2013
39
Virus: virus sequences can be captured through high-throughput sequencing of human samples. HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012). HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)
40
HPV Alignment Example
41
Tools for Detecting Virus from Sequencing Data: PathSeq (Kostic, et al. Nature Biotechnology, 2011); VirusSeq (Chen, et al. Bioinformatics, 2012); ViralFusionSeq (Li, et al. Bioinformatics, 2012); VirusFinder (Wang, et al. PLOS ONE, 2013)
42
The Data Mining Ideas Applied to RNA: RNAseq has been used as a replacement for microarrays. Other applications of RNAseq include detection of alternative splicing and fusion genes. Additional data mining opportunities are also available for RNAseq data.
43
SNV and Indel: difficult due to high false positive rate. RNAMapper (Miller, et al. Genome Research, 2013); SNVQ (Duitama, et al. BMC Genomics, 2013); FX (Hong, et al. Bioinformatics, 2012); OSA (Hu, et al. Bioinformatics, 2012)
44
Microsatellite instability Examples: Yoon, et al. Genome Research 2013 Zheng, et al. BMC Genomics, 2013
45
RNA Editing and Allele-specific Expression. RNA editing tools and databases: DARNED, REDIdb, dbRES, RADAR. Allele-specific expression: asSeq (Sun, et al. Biometrics, 2012); AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)
46
Exogenous RNA Virus (Same as DNA) Food RNA (you are what you eat) Wang, et al. PLOS ONE, 2012
47
nonCoding RNA
48
Unknown
49
Exome Samuels, et al. Trends in Genetics, 2013
50
RNAseq
51
Quality Control: Quality, Quantity. Guo et al. Briefings in Bioinformatics, 2013