Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data Jorge Duitama 1, Pramod Srivastava 2, and Ion.

Slides:



Advertisements
Similar presentations
Towards Personalized Genomics-Guided Cancer Immunotherapy Ion Mandoiu Department of Computer Science & Engineering Joint work with Sahar Al Seesi (CSE)
Advertisements

Marius Nicolae Computer Science and Engineering Department
RNA-Seq based discovery and reconstruction of unannotated transcripts
 Experimental Setup  Whole brain RNA-Seq Data from Sanger Institute Mouse Genomes Project [Keane et al. 2011]  Synthetic hybrids with different levels.
RNAseq.
Reference mapping and variant detection Peter Tsai Bioinformatics Institute, University of Auckland.
(A) Mutations within neoepitopes lead to structural alterations across the peptide backbone, as illustrated with structural snapshots from the simulations.
RNA-seq: the future of transcriptomics ……. ?
University of Connecticut
Presented by: Pham Kien Cuong NUS Graduate School for Integrative Sciences and Engineering.
Computational Challenges in Whole-Genome Association Studies Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
RNA-Seq based discovery and reconstruction of unannotated transcripts in partially annotated genomes 3 Serghei Mangul*, Adrian Caciula*, Ion.
Bioinformatics pipeline for detection of immunogenic cancer mutations by high throughput mRNA sequencing Jorge Duitama 1, Ion Mandoiu 1, and Pramod Srivastava.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Bioinformatics Methods for Diagnosis and Treatment of Human Diseases Jorge Duitama Dissertation Defense for the Degree of Doctorate in Philosophy Computer.
Ion Mandoiu Computer Science and Engineering Department
General methods of SNP discovery: PolyBayes Gabor T. Marth Department of Biology Boston College Chestnut Hill, MA
Linkage Disequilibrium-Based Single Individual Genotyping from Low-Coverage Short Sequencing Reads Justin Kennedy 1 Joint work with Sanjiv Dinakar 1, Yozen.
Mining SNPs from EST Databases Picoult-Newberg et al. (1999)
The Extraction of Single Nucleotide Polymorphisms and the Use of Current Sequencing Tools Stephen Tetreault Department of Mathematics and Computer Science.
Bioinformatics Tools for Personalized Cancer Immunotherapy
Marius Nicolae Computer Science and Engineering Department University of Connecticut Joint work with Serghei Mangul, Ion Mandoiu and Alex Zelikovsky.
Genotyping of James Watson’s genome from Low-coverage Sequencing Data Sanjiv Dinakar and Yözen Hernández.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Algorithms for Genotype and Haplotype Inference from Low- Coverage Short Sequencing Reads Ion Mandoiu Computer Science and Engineering Department University.
Imputation-based local ancestry inference in admixed populations Ion Mandoiu Computer Science and Engineering Department University of Connecticut Joint.
Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Estimation of alternative splicing isoform frequencies from RNA-Seq data Ion Mandoiu Computer Science and Engineering Department University of Connecticut.
Bioinformatics Pipelines for RNA-Seq Data Analysis
Bioinformatics Methods for Diagnosis and Treatment of Human Diseases Jorge Duitama Dissertation Proposal for the Degree of Doctorate in Philosophy Computer.
Polymorphism discovery informatics Gabor T. Marth Department of Biology Boston College Chestnut Hill, MA
Sequence Variation Informatics Gabor T. Marth Department of Biology, Boston College BI420 – Introduction to Bioinformatics.
High Throughput Sequencing
Reconstruction of Haplotype Spectra from NGS Data Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science & Engineering.
Next generation sequencing Xusheng Wang 4/29/2010.
Li and Dewey BMC Bioinformatics 2011, 12:323
A cell and its population of genes :. DNA forms double strands by a process called hybridization:
Todd J. Treangen, Steven L. Salzberg
Transcriptome analysis With a reference – Challenging due to size and complexity of datasets – Many tools available, driven by biomedical research – GATK.
Variables: – T(p) - set of candidate transcripts on which pe read p can be mapped within 1 std. dev. – y(t) -1 if a candidate transcript t is selected,
©Edited by Mingrui Zhang, CS Department, Winona State University, 2008 Identifying Lung Cancer Risks.
Adrian Caciula Department of Computer Science Georgia State University Joint work with Serghei Mangul (UCLA) Ion Mandoiu (UCONN) Alex Zelikovsky (GSU)
Identification of Cancer-Specific Motifs in
RNA-Seq Assembly 转录组拼接 唐海宝 基因组与生物技术研究中心 2013 年 11 月 23 日.
© 2010 by The Samuel Roberts Noble Foundation, Inc. 1 The Samuel Roberts Noble Foundation, 2510 Sam Noble Parkway, Ardmore, OK, 73401, USA 2 National Center.
National Taiwan University Department of Computer Science and Information Engineering Pattern Identification in a Haplotype Block * Kun-Mao Chao Department.
Serghei Mangul Department of Computer Science Georgia State University Joint work with Irina Astrovskaya, Marius Nicolae, Bassam Tork, Ion Mandoiu and.
Sahar Al Seesi and Ion Măndoiu Computer Science and Engineering
RNA-Seq Primer Understanding the RNA-Seq evidence tracks on the GEP UCSC Genome Browser Wilson Leung08/2014.
Introduction to RNAseq
Geuvadis Analysis Meeting 16/02/2012 Micha Sammeth CNAG – Barcelona.
GigAssembler. Genome Assembly: A big picture
Computational methods for genomics-guided immunotherapy Sahar Al Seesi Computer Science & Engineering Department, UCONN Immunology Department, UCONN Health.
Imputation-based local ancestry inference in admixed populations
TOX680 Unveiling the Transcriptome using RNA-seq Jinze Liu.
P.M. VanRaden and D.M. Bickhart Animal Genomics and Improvement Laboratory, Agricultural Research Service, USDA, Beltsville, MD, USA
No reference available
Lecture 12 RNA – seq analysis.
Scalable Algorithms for Next-Generation Sequencing Data Analysis Ion Mandoiu UTC Associate Professor in Engineering Innovation Department of Computer Science.
Reliable Identification of Genomic Variants from RNA-seq Data Robert Piskol, Gokul Ramaswami, Jin Billy Li PRESENTED BY GAYATHRI RAJAN VINEELA GANGALAPUDI.
Canadian Bioinformatics Workshops
071126_EAS56_0057_FC – lanes 1-8 read 2 b a _EAS56_0057_FC – lanes 1-8 read 1 Table S1. Summary tables for a read 1 and b read 2 of a.
Cancer Vaccine Design Ion Mandoiu
Computational methods for genomics-guided immunotherapy
Imputation-based local ancestry inference in admixed populations
Sahar Al Seesi University of Connecticut CANGS 2017
Discovery tools for human genetic variations
Genome organization and Bioinformatics
Dec. 22, 2011 live call UCONN: Ion Mandoiu, Sahar Al Seesi
Presentation transcript:

Towards accurate detection and genotyping of expressed variants from whole transcriptome sequencing data Jorge Duitama 1, Pramod Srivastava 2, and Ion Mandoiu 1 1 University of Connecticut. Department of Computer Sciences & Engineering 2 University of Connecticut Health Center

Introduction RNA-Seq is the method of choice for studying functional effects of genetic variability RNA-Seq poses new computational challenges compared to genome sequencing In this paper we present: – a strategy to map transcriptome reads using both the genome reference sequence and the CCDS database. – a novel Bayesian model for SNV discovery and genotyping based on quality scores

Read Mapping Reference genome sequence >ref|NT_ |Mm19_82865_37: Mus musculus chromosome 19 genomic contig, strain C57BL/6J GATCATACTCCTCATGCTGGACATTCTGGTTCCTA GTATATCTGGAGAGTTAAGATGGGGAATTATGTCA ACTTTCCCTCTTCCTATGCCAGTTATGCATAATGCA CAAATATTTCCACGCTTTTTCACTACAGATAAAG AACTGGGACTTGCTTATTTACCTTTAGATGAACAG ATTCAGGCTCTGCAAGAAAATAGAATTTTCTTCAT ACAGGGAAGCCTGTGCTTTGTACTAATTTCTTCATT GGGATGTCAGGATTCACAATGACAGTGCTGGATGAG +HWI-EAS299_2:2:1:1536:631 ATTACACCACCTTCAGCCCAGGTGGTTGGAGTACTC +HWI-EAS299_2:2:1:771:94 :::::::::::::::::::::::::::2:: Read sequences & quality scores SNP calling G T C A T A T A A C T C 7 1 SNP Calling from Genomic DNA Reads

Mapping mRNA Reads

C. Trapnell, L. Pachter, and S.L. Salzberg. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 25(9):1105–1111, 2009.

Mapping and Merging Strategy Tumor mRNA reads CCDS Mapping Genome Mapping Read Merging CCDS mapped reads Genome mapped reads Mapped reads

Read Merging GenomeCCDSAgree?Hard MergeSoft Merge Unique YesKeep Unique NoThrow UniqueMultipleNoThrowKeep UniqueNot MappedNoKeep MultipleUniqueNoThrowKeep Multiple NoThrow MultipleNot MappedNoThrow Not mappedUniqueNoKeep Not mappedMultipleNoThrow Not mappedNot MappedYesThrow

SNV Detection and Genotyping AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC Reference Locus i RiRi r(i) : Base call of read r at locus i ε r(i) : Probability of error reading base call r(i) G i : Genotype at locus i

SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

Current Models Maq: – Keep just the alleles with the two largest counts – Pr (R i | G i =H i H i ) is the probability of observing k alleles r(i) different than H i – Pr (R i | G i =H i H’ i ) is approximated as a binomial with p=0.5 SOAPsnp – Pr (r i | G i =H i H’ i ) is the average of Pr(r i |H i ) and Pr(r i |G i =H’ i ) – A rank test on the quality scores of the allele calls is used to confirm heterozygocity

SNV Detection and Genotyping Calculate conditional probabilities by multiplying contributions of individual reads

Accuracy Assessment of Variants Detection 113 million Illumina mRNA reads generated from blood cell tissue of Hapmap individual NA12878 (NCBI SRA database accession numbers SRX and SRX000566) – We tested genotype calling using as gold standard 3.4 million SNPs with known genotypes for NA12878 available in the database of the Hapmap project – True positive: called variant for which Hapmap genotype coincides – False positive: called variant for which Hapmap genotype does not coincide

Comparison of Mapping Strategies

Comparison of Variant Calling Strategies

Data Filtering

Allow just x reads per start locus to eliminate PCR amplification artifacts Chepelev et. al. algorithm: – For each locus groups starting reads with 0, 1 and 2 mismatches – Choose at random one read of each group

Comparison of Data Filtering Strategies

Accuracy per RPKM bins

Conclusions We presented a new strategy to map mRNA reads using both the reference genome and the CCDS database and a new bayesian model for SNV detection and genotyping Experiments on publicly available datasets show that our methods outperform widely used SNV detection methods Future Work: – Improve genotype calling by adapting our model to differential allelic expression – Use our methods on RNA-Seq data from cancer tumor data

Acknowledgments Brent Graveley and Duan Fei (UCHC) NSF awards IIS , IIS , and DBI UCONN Research Foundation UCIG grant