Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo.

Presentation transcript:

Vanderbilt Center for Quantitative Sciences Summer Institute Sequencing Analysis (DNA) Yan Guo

Alignment ATCGGGAATGCCGTTAACGGTTGGCGT Reference genome The human genome is about 3 billion base pairs (3,000,000,000) in length. If a read is 100 bp long, what is the probability of unique alignment? The chance that a random 100 bp sequence matches the read is 1/(4 x 4 x 4 … 4) = 1/4^100 ≈ 1/1.6E+60
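The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch: it assumes the four bases are independent and equally likely, which real genomes (with their repeats) are not, and the function names are made up for illustration.

```python
# Chance that a random sequence matches a read at one fixed genome position,
# ASSUMING independent, equally likely bases (a simplification; repeats are ignored).
READ_LENGTH = 100
GENOME_LENGTH = 3_000_000_000  # approximate human genome size from the slide


def chance_match_probability(read_length: int) -> float:
    """Probability of a chance match at one position: 1 / 4^read_length."""
    return 1.0 / (4 ** read_length)


def expected_chance_matches(read_length: int, genome_length: int) -> float:
    """Expected number of chance matches over all start positions in the genome."""
    positions = genome_length - read_length + 1
    return positions * chance_match_probability(read_length)


if __name__ == "__main__":
    print(f"4^100 = {float(4 ** READ_LENGTH):.2e}")
    print(f"P(chance match at one position) = {chance_match_probability(READ_LENGTH):.2e}")
    print(f"Expected chance matches genome-wide = "
          f"{expected_chance_matches(READ_LENGTH, GENOME_LENGTH):.2e}")
```

Even after multiplying by the 3 billion possible start positions, the expected number of chance matches stays vanishingly small under these assumptions.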

Alignment Tools BWA Bowtie http://bowtie-bio.sourceforge.net/index.shtml Aligning 30 million reads by exhaustive comparison would take about 30 million x 3 billion time units. Both tools are based on the Burrows-Wheeler transform
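To give a feel for why a Burrows-Wheeler index avoids scanning the whole genome for every read, here is a toy exact-match search in Python. It is a teaching sketch, not how BWA or Bowtie are implemented: it builds the suffix array naively, stores uncompressed count tables, and does not handle mismatches or gaps. The function names are hypothetical.

```python
def build_index(text: str):
    """Build a toy Burrows-Wheeler index (suffix array, first-column offsets, occurrence counts)."""
    text += "$"  # sentinel, sorts before A/C/G/T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)  # character preceding each sorted suffix
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort before c
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (1 if ch == c else 0)
    return suffix_array, first, occ


def find(pattern: str, index) -> list:
    """Backward search: one step per pattern character, independent of reference length."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[lo:hi])


if __name__ == "__main__":
    reference = "ATCGGGAATGCCGTTAACGGTTGGCGT"  # sequence shown on the alignment slide
    index = build_index(reference)
    print(find("AACGG", index))
    print(find("GG", index))
```

The index is built once per reference; after that, each read costs a number of steps proportional to its own length rather than to the 3 billion bases of the genome.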

Alignment Results – BAM files SAM – uncompressed BAM – compressed specs/SAMv1.pdf Sort and index before performing analysis Don’t forget to perform QC on alignment
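As a rough illustration of what an alignment record looks like, the sketch below parses one SAM line and decodes its FLAG field, which is where basic QC information (unmapped, duplicate, strand) lives. The field order and flag bits follow the SAM specification; the example record and the helper names are invented for illustration.

```python
# FLAG bits as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def parse_sam_line(line: str) -> dict:
    """Parse the mandatory columns of one SAM alignment line (header lines start with '@')."""
    fields = line.rstrip("\n").split("\t")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),   # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
    }


if __name__ == "__main__":
    # Hypothetical record, made up for this example.
    record = "read1\t99\tchr1\t10468\t60\t8M\t=\t10600\t140\tATCGGGAA\tIIIIIIII"
    aln = parse_sam_line(record)
    print(aln["rname"], aln["pos"], decode_flag(aln["flag"]))
```

Counting how many reads carry the unmapped or duplicate bits is one simple form of the alignment QC the slide asks for.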

How to call SNPs

Local Realignment

Recalibration Why do we need realignment and recalibration for DNA but not RNA?

SNP calling GATK VarScan

VCF files

Annotation using ANNOVAR

Somatic Mutation Different from SNP (not germline) Both tumor and normal samples are needed to accurately define a somatic mutation Tumor sample is almost never 100% tumor
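One way to see why tumor purity matters: the fraction of reads carrying a somatic mutation shrinks as normal cells contaminate the sample. The sketch below computes the expected variant allele fraction under simplifying assumptions that are not from the slides (a clonal mutation, known copy numbers, no sequencing error); the function and parameter names are hypothetical.

```python
def expected_variant_allele_fraction(
    purity: float,
    mutant_copies: int = 1,
    tumor_copy_number: int = 2,
    normal_copy_number: int = 2,
) -> float:
    """Expected fraction of reads showing a clonal somatic mutation.

    ASSUMES every tumor cell carries `mutant_copies` mutated copies out of
    `tumor_copy_number`, normal cells carry none, and reads sample alleles evenly.
    """
    if not 0.0 <= purity <= 1.0:
        raise ValueError("purity must be between 0 and 1")
    mutant_alleles = purity * mutant_copies
    total_alleles = purity * tumor_copy_number + (1.0 - purity) * normal_copy_number
    return mutant_alleles / total_alleles


if __name__ == "__main__":
    for purity in (1.0, 0.7, 0.4, 0.1):
        vaf = expected_variant_allele_fraction(purity)
        print(f"purity {purity:.0%}: expected variant allele fraction {vaf:.2f}")
```

Under these assumptions a heterozygous somatic mutation in a 40% pure sample appears in only about one read in five, which is why callers tuned for germline heterozygous sites can miss it.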

Somatic mutation callers MuTect VarScan

Quality Control on SNPs Number of Novel Non-synonymous SNP ~ 100 – 200 Transition / transversion ratio Heterozygous / non reference homozygous ratio Heterozygous consistency Strand Bias Cycle Bias
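Two of the metrics in this list are simple counts and can be sketched in a few lines of Python. The input format here (tuples of reference and alternate base, and VCF-style genotype strings) is an assumption for the example, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """A transition swaps purine<->purine or pyrimidine<->pyrimidine; anything else is a transversion."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps) -> float:
    """Transition / transversion ratio over (ref, alt) single-base substitutions."""
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    return transitions / transversions if transversions else float("inf")


def het_to_nonref_hom_ratio(genotypes) -> float:
    """Heterozygous / non-reference homozygous ratio from VCF-style genotype strings."""
    normalized = [g.replace("|", "/") for g in genotypes]
    het = sum(1 for g in normalized if g in ("0/1", "1/0"))
    hom_alt = sum(1 for g in normalized if g == "1/1")
    return het / hom_alt if hom_alt else float("inf")


if __name__ == "__main__":
    # Made-up calls for illustration only.
    snps = [("A", "G"), ("C", "T"), ("G", "A"), ("A", "T")]
    genotypes = ["0/1", "0/1", "1/1", "0/0"]
    print("Ti/Tv:", titv_ratio(snps))
    print("het / non-ref hom:", het_to_nonref_hom_ratio(genotypes))
```

Both ratios are computed per sample and compared against the values expected for the sequenced region and population, which is what the following slides plot.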

Ti/Tv ratio

Heterozygous / non reference homozygous ratio

Ti/Tv ratio by race and regions

Heterozygous / non reference homozygous ratio by race and regions

Heterozygous Genotype Consistency

Strand Bias Table 1. Strand bias examples from real data. Columns: Chr, Pos, depth, a, b, c, d, Forward Strand Genotype, Reverse Strand Genotype. In each of the three example rows the forward strand genotype is heterozygous and the reverse strand genotype is homozygous (numeric values not recoverable from the slide). a = forward strand reference allele; b = forward strand non-reference allele; c = reverse strand reference allele; d = reverse strand non-reference allele
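One common way to quantify strand bias from the four counts a, b, c and d is a two-sided Fisher's exact test on the 2x2 table of strand by allele. The slides do not say which score was used, so treat this as one possible approach rather than the presenter's method; the counts in the example are invented.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a, b: forward-strand reference / non-reference allele counts
    c, d: reverse-strand reference / non-reference allele counts
    """
    forward, reverse = a + b, c + d
    ref_total = a + c
    n = forward + reverse

    def prob(x: int) -> float:
        # Hypergeometric probability of x reference reads on the forward strand.
        return comb(forward, x) * comb(reverse, ref_total - x) / comb(n, ref_total)

    observed = prob(a)
    lowest = max(0, ref_total - reverse)
    highest = min(forward, ref_total)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Hypothetical counts: heterozygous on the forward strand, homozygous on the reverse.
    print(fisher_exact_two_sided(10, 10, 0, 20))
    # Hypothetical balanced site: both strands agree.
    print(fisher_exact_two_sided(10, 10, 10, 10))
```

A small p-value means the allele balance differs between strands, which is the pattern the table illustrates.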

Cycle Bias

Pooled Analysis Pool samples together without barcode Save money Can only be used to evaluate allele frequency
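Because reads in an unbarcoded pool cannot be traced back to individuals, the only estimate available at a site is the pooled allele frequency. A minimal sketch follows, assuming each read is an independent draw from the pool and every sample contributes equally (neither is guaranteed in practice); the function name is hypothetical.

```python
from math import sqrt


def pooled_allele_frequency(ref_reads: int, alt_reads: int):
    """Estimate the non-reference allele frequency in a pool and its binomial standard error.

    ASSUMES reads are independent draws and samples contribute equal amounts of DNA.
    """
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    frequency = alt_reads / depth
    standard_error = sqrt(frequency * (1.0 - frequency) / depth)
    return frequency, standard_error


if __name__ == "__main__":
    # Made-up read counts for illustration.
    freq, se = pooled_allele_frequency(ref_reads=75, alt_reads=25)
    print(f"allele frequency = {freq:.2f} +/- {se:.3f}")
```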

Pooled Analysis - Conclusion

Advanced Data Mining

The known and unknown of sequencing data

Known – Things we have always known that sequencing data can do: SNV, mutation CNV (Xie et al. BMC Bioinformatics, 2009) Structural variants (Alkan et al. Nature Reviews Genetics, 2011)

Known Unknown – Other information we found that sequencing data contain

How is additional data mining possible? Data mining is possible because capture techniques are not perfect.

Capture Efficiency of The Three Major Capture Kits

Potential Functions of Intron and Intergenic Regions ENCODE suggested that over 80% of the human genome may be functional. The majority of GWAS SNPs are not in coding regions (706 exon, 3986 intron, 3323 intergenic)

Coverage of the Unintended Regions Coverage does not drop off suddenly where the capture region ends. Capture region example: chr

Reads Aligned to Non-Target Regions Can Be Used to Detect SNPs Tibetan exome study: through exome sequencing of 50 Tibetan subjects, 2 intronic SNPs were found to be associated with high altitude. (Yi, et al. Science, 2010) Non-capture region study: reads from non-capture regions were shown to yield reliable SNPs. (Guo, et al. BMC Genomics)

Known unknown - Mitochondria However, mitochondria is only BP Assumptions: 40 mil reads 100BP long read

Dealing with nuMTs

Alignment Results

Extract mitochondria from exome sequencing Tools: Picardi et al. Nature Methods, 2012 Guo et al. Bioinformatics, 2013 (MitoSeek) Diagnosis: Dinwiddie et al. Genomics, 2013 Nemeth et al. Brain, 2013

Virus Virus sequences can be captured through high throughput sequencing of human samples HBV in liver cancer samples (Sung, et al. Nature Genetics, 2012) (Jiang, et al. Genome Research, 2012) HPV in head and neck cancer (Chen, et al. Bioinformatics, 2012)

HPV Alignment Example

Tools for Detecting Virus from Sequencing Data PathSeq (Kostic, et al. Nature Biotechnology, 2011) VirusSeq (Chen, et al. Bioinformatics, 2012) ViralFusionSeq (Li, et al. Bioinformatics, 2012) VirusFinder (Wang, et al. PLOS ONE, 2013)

The Data Mining Ideas Applied to RNA RNAseq has been used as a replacement for microarrays. Other applications of RNAseq include detection of alternative splicing and fusion genes. Additional data mining opportunities are also available for RNAseq data

SNV and Indel Difficult due to high false positive rate RNAMapper (Miller, et al. Genome Research, 2013) SNVQ (Duitama, et al. BMC Genomics, 2013) FX (Hong, et al. Bioinformatics, 2012) OSA (Hu, et al. Bioinformatics, 2012)

Microsatellite instability Examples: Yoon, et al. Genome Research 2013 Zheng, et al. BMC Genomics, 2013

RNA Editing and Allele-specific Expression RNA editing tools and databases: DARNED, REDIdb, dbRES, RADAR Allele-specific expression: asSeq (Sun, et al. Biometrics, 2012) AlleleSeq (Rozowsky, et al. Molecular Systems Biology, 2011)

Exogenous RNA Virus (Same as DNA) Food RNA (you are what you eat) Wang, et al. PLOS ONE, 2012

nonCoding RNA

Unknown

Exome Samuels, et al. Trends in Genetics, 2013

RNAseq

Quality Control Quality Quantity Guo et al. Briefings in Bioinformatics, 2013