Published by Clara Rolston. Modified over 9 years ago.
1
Finding the Lost Treasure of NGS Data Yan Guo, PhD
3
VANGARD https://medschool.vanderbilt.edu/cqs/vangard
5
Modules Overview for DNA-sequence (Exome / whole-Genome)
Diagram labels: fastq files, FastQC, bwa alignment, Bam files, bamQC, GATK refinement (realignment, recalibration, mark-duplication), SNP/INDEL (best practice filter, dbsnp / indel resources), vcf files, somatic mutation, gene-level analysis (gene associates, gene coding changes), structural variant analysis (translocation, inversion, copy number variants).
6
RNAseq
Diagram labels: fastq files, FastQC, tophat alignment, Bam files, SeQC, refinement, cufflinks (annotations), cuffmerge, cuffdiff (comparisons), gene quantification, gene list, cluster, functional / pathway, gene-fusion analysis, discovery (identifying novel genes).
7
What do you expect to find in NGS data?
DNAseq: SNPs, somatic mutations, small indels, large structural changes, CNV
RNAseq: gene expression differences, splicing variants, fusion genes
8
What don’t you expect to find in NGS data?
Exome sequencing reads
Is mapped? Yes: mapped reads. No: unmapped DNA reads (virus/microbe DNA, contamination).
Is targeted? Yes: targeted DNA. No: untargeted DNA (intronic DNA, intergenic DNA, mitochondrial DNA).
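The decision tree on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the presenter's pipeline: the function names, the half-open `(chrom, start, end)` interval convention, and the mitochondrial contig names `chrM` / `MT` are all hypothetical choices made for the example.

```python
def build_target_index(targets):
    """Group capture-target intervals (chrom, start, end) by chromosome."""
    index = {}
    for chrom, start, end in targets:
        index.setdefault(chrom, []).append((start, end))
    return index


def overlaps_target(index, chrom, start, end):
    """True if the half-open interval [start, end) overlaps any target."""
    return any(t_start < end and start < t_end
               for t_start, t_end in index.get(chrom, []))


def classify_read(mapped, chrom, start, end, index,
                  mito_names=("chrM", "MT")):
    """Follow the slide's two questions: is the read mapped, is it targeted?"""
    if not mapped:
        # Candidates for virus/microbe DNA or contamination
        return "unmapped"
    if overlaps_target(index, chrom, start, end):
        return "targeted"
    if chrom in mito_names:
        # Assumed contig names; check the reference genome actually used
        return "mitochondrial"
    # Remaining off-target reads: intronic or intergenic DNA
    return "untargeted"
```

A linear scan over the targets keeps the sketch short; a real capture kit has hundreds of thousands of intervals, so a sorted index or an interval tree would be the practical choice.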
9
Exome Capture
11
Why do we care about intron and intergenic regions?
Some introns can encode specific proteins and can be processed after splicing to form noncoding RNA molecules (Rearick, Prakash et al. 2011).
The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic).
The ENCODE Project: ENCyclopedia Of DNA Elements
12
GWAS catalog SNPs
Kit               Target total bases  Missing exon SNPs  Missing intron SNPs  Missing intergenic SNPs
SureSelect (v2)   37627747            387                3946                 3323
TrueSeq           62085286            206                3980                 3320
SeqCap EZ (v3.0)  64190747            326                3880                 3317
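As a worked check on these figures, the short Python sketch below derives how many GWAS catalog SNPs each kit's target region covers. It assumes the fused digits on this slide read as one target size plus three missing-SNP counts per kit, and it takes the catalog totals (706 exon, 3986 intron, 3323 intergenic) from the earlier slide. The variable and function names are illustrative only.

```python
# Catalog totals quoted on the earlier slide
GWAS_TOTALS = {"exon": 706, "intron": 3986, "intergenic": 3323}

# Missing-SNP counts per kit, as read from this slide (assumed parsing)
MISSING = {
    "SureSelect (v2)": {"exon": 387, "intron": 3946, "intergenic": 3323},
    "TrueSeq": {"exon": 206, "intron": 3980, "intergenic": 3320},
    "SeqCap EZ (v3.0)": {"exon": 326, "intron": 3880, "intergenic": 3317},
}


def captured(kit):
    """SNPs inside the kit's target region = catalog total - missing."""
    return {region: GWAS_TOTALS[region] - MISSING[kit][region]
            for region in GWAS_TOTALS}


for kit in MISSING:
    print(kit, captured(kit))
```

Under that reading, each kit's targets contain at most a few hundred of the 706 exonic GWAS SNPs and very few of the intronic or intergenic ones, so any coverage of the non-coding SNPs has to come from off-target reads.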
17
Columns: Samples, Average depth, Intronic, Splicing [1], ncRNA [2], Intergenic, Exonic, Non-synonymous, Stopgain, Stoploss
Agilent (N=22)
≥ 221741489129914801431386
≥ 57362395794442691142295
≥ 10476637439328673892194
1000G (N=6)
≥ 24561196484658491101
≥ 5278412360281533761
≥ 1014199194162423351
Illumina (N=6)
≥ 26114098596592500
≥ 5240805015344000
≥ 10105803273498000
[1] Variant is within 2 bp of a splicing junction.
[2] Variant overlaps a transcript without coding annotation in the gene definition.
19
Mitochondria
Mitochondria play an important role in cellular energy metabolism, free radical generation, and apoptosis (Andrews, Kubacka et al. 1999; Verma and Kumar 2007).
Mitochondrial DNA (mtDNA) is a maternally inherited, 16,569-bp closed-circle genome that encodes two rRNAs, 22 tRNAs, and 13 polypeptides.
Mitochondrial dysfunction is an important cause of many neurological diseases (Fernandez-Vizarra, Bugiani et al. 2007) and drug toxicities (Lemasters, Qian et al. 1999; Wallace and Starkov 2000), and may contribute to carcinogenesis and tumor progression (Modica-Napolitano and Singh 2004; Chen 2012).
21
Mitochondria Extraction Strategy
22
Results
24
Virus
Known oncogenic viruses are estimated to cause 15 to 20 percent of all cancers in humans (Parkin 2006).
Understanding the integration patterns of cancer-associated viruses may uncover novel oncogenes and tumor suppressors that are associated with cellular transformation.
Viral genomes have been detected using off-target exome sequencing reads (Barzon, Lavezzo et al. 2011; Li and Delwart 2011; Chevaliez, Rodriguez et al. 2012; Radford, Chapman et al. 2012; Capobianchi, Giombini et al. 2013).
25
One example using HNSCC
26
Virus Detection in HNSCC in TCGA
Site           clin_hpv_ish  clin_hpv_p16  ExomeSeq  low_pass  RNAseq  HPV
Buccal Mucosa  0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
               0             0             0         0         0       0
Oropharynx     0             0             1         0         0       1
               0             0             0         0         0       0
               0             0             0         0         0       0
Tonsil         1             1             1         0         1       4
               1             1             1         0         1       4
               1             1             1         0         1       4
               0             0             1         1         1       4
               0             0             1         1         1       4
               0             1             1         0         1       3
               1             0             1         0         1       3
               0             0             0         1         1       3
               0             0             1         0         1       2
               0             0             1         0         1       2
               0             0             1         0         1       2
               0             0             1         0         1       2
               0             0             1         0         1       2
               0             0             1         0         1       2
               0             0             1         0         0       1
27
Existing Tools
PathSeq (Kostic, Ojesina et al. 2011)
VirusSeq (Chen, Yao et al. 2012)
ViralFusionSeq (Li, Wan et al. 2013)
28
SNP and Somatic Mutation Identification using RNAseq Data
Traditionally, somatic mutations are detected using Sanger sequencing or RT-PCR by comparing paired tumor and normal samples. One obvious limitation of such methods is that the search must be limited to a certain genomic region of interest. With the maturity of next-generation sequencing, we can now screen all coding genes, or even the whole genome, for somatic mutations at a reasonable cost.
29
Why do we want to detect mutations in RNAseq data?
You don’t have DNA sequencing data.
Detecting mutations was not the original goal, but why not?
There is much more RNAseq data than DNAseq data.
A mutation in RNA is more relevant than a mutation in DNA.
30
Difficulties
There is not enough depth in non-expressed genes to detect mutations.
Reverse transcription of RNA to cDNA introduces additional errors.
It is hard to distinguish mutations from RNA editing.
In summary, somatic mutation detection using RNAseq data yields many more false positives.
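The difficulties listed above suggest the kind of hard filters a candidate RNAseq mutation might have to pass. The Python sketch below is a generic illustration, not the caller presented in these slides: the field names, the thresholds (`min_depth=10`, `min_alt_reads=3`), and the idea of supplying a set of known RNA-editing sites are all assumptions made for the example.

```python
def passes_filters(candidate, editing_sites,
                   min_depth=10, min_alt_reads=3, max_normal_alt=0):
    """Apply simple hard filters to one tumor/normal candidate site.

    candidate: dict with chrom, pos, tumor_depth, tumor_alt,
               normal_depth, normal_alt (hypothetical field names).
    editing_sites: set of (chrom, pos) known RNA-editing positions.
    """
    site = (candidate["chrom"], candidate["pos"])
    # Known RNA-editing sites cannot be told apart from mutations
    if site in editing_sites:
        return False
    # Weakly expressed genes give too little depth in either sample
    if (candidate["tumor_depth"] < min_depth
            or candidate["normal_depth"] < min_depth):
        return False
    # Require several supporting reads to guard against cDNA/sequencing errors
    if candidate["tumor_alt"] < min_alt_reads:
        return False
    # A somatic call should show no (or almost no) support in the normal
    return candidate["normal_alt"] <= max_normal_alt
```

Filters like these trade sensitivity for specificity, and the thresholds would need tuning against validated calls for any real data set.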
31
Somatic Mutation Caller Designed Specifically for RNAseq Data
32
Other ways you can mine your data
33
Summary
Get your priorities right: never design a study just for secondary analysis targets.
If you have old data, think about what else you can do with it, and try to realize its full potential.
At VANGARD, we help you with your basic genomic data analysis needs.
Advanced data analysis can be done through collaboration.
34
Acknowledgement
Yu Shyr, Tiger Sheng, Chung-I Li, Jiang Li, Mike Guo, David Samuels, Chun Li