Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut.

Presentation transcript:

Next-Generation Sequencing: Challenges and Opportunities Ion Mandoiu Computer Science and Engineering Department University of Connecticut

Outline
– Background on high-throughput sequencing
– Identification of tumor-specific epitopes
– Estimation of gene and isoform expression levels
– Viral quasispecies reconstruction
– Future work

Advances in High-Throughput Sequencing (HTS)
– Roche/454 FLX Titanium: million reads/run, 400bp avg. length
– Illumina HiSeq 2000: up to 6 billion PE reads/run, bp read length
– SOLiD: billion PE reads/run, 35-50bp read length

Illumina Workflow – Library Preparation Genomic DNA mRNA

Illumina Workflow – Cluster Generation

Illumina Workflow – Sequencing by Synthesis

Cost of Whole Genome Sequencing (figure; genomes marked: C. Venter, J. Watson, NA18507)

HTS is a transformative technology
Numerous applications besides de novo genome sequencing:
– RNA-Seq
– Non-coding RNAs
– ChIP-Seq
– Epigenetics
– Structural variation
– Metagenomics
– Paleogenomics
– …

Genomics-Guided Cancer Immunotherapy
Tumor mRNA Sequencing:
CTCAATTGATGAAATTGTTCTGAAACT
GCAGAGATAGCTAAAGGATACCGGGTT
CCGGTATCCTTTAGCTATCTCTGCCTC
CTGACACCATCTGTGTGGGCTACCATG
…
AGGCAAGCTCATGGCCAAATCATGAGA
Tumor Specific Epitopes: SYFPEITHI ISETDLSLL CALRRNESL …
Peptide Synthesis → Immune System Stimulation → Tumor Remission
Mouse Image Source:

Bioinformatics Pipeline
Tumor mRNA reads
→ CCDS Mapping and Genome Mapping (CCDS mapped reads, Genome mapped reads)
→ Read Merging (Mapped reads)
→ SNVs Detection (Tumor-specific SNVs)
→ Haplotyping (Close SNV Haplotypes)
→ Epitope Prediction (Tumor specific epitopes)
→ Primers Design (Primers for Sanger Sequencing)

Mapping mRNA Reads

Read Merging
Genome      CCDS        Agree?  Hard Merge  Soft Merge
Unique      Unique      Yes     Keep        Keep
Unique      Unique      No      Throw       Throw
Unique      Multiple    No      Throw       Keep
Unique      Not mapped  No      Keep        Keep
Multiple    Unique      No      Throw       Keep
Multiple    Multiple    No      Throw       Throw
Multiple    Not mapped  No      Throw       Throw
Not mapped  Unique      No      Keep        Keep
Not mapped  Multiple    No      Throw       Throw
Not mapped  Not mapped  Yes     Throw       Throw
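The hard and soft merge rules in the table above can be sketched as a small decision function. This is only an illustration: the function name `keep_read`, the status strings, and the `mode` parameter are hypothetical and not taken from the pipeline's actual code.

```python
STATUSES = {"unique", "multiple", "unmapped"}

def keep_read(genome: str, ccds: str, agree: bool, mode: str = "hard") -> bool:
    """Return True if a read is kept after merging its genome and CCDS mappings.

    genome, ccds: mapping status of the read against each reference.
    agree: whether the two alignments agree (only meaningful when both are unique).
    mode: "hard" or "soft" merge, following the table on the slide.
    """
    if genome not in STATUSES or ccds not in STATUSES:
        raise ValueError("status must be one of %s" % sorted(STATUSES))
    if mode not in ("hard", "soft"):
        raise ValueError("mode must be 'hard' or 'soft'")

    pair = {genome, ccds}
    if genome == "unique" and ccds == "unique":
        # Two unique alignments are kept only when they agree.
        return agree
    if pair == {"unique", "unmapped"}:
        # A single unique alignment is kept by both strategies.
        return True
    if pair == {"unique", "multiple"}:
        # Soft merge trusts the unique alignment; hard merge discards the read.
        return mode == "soft"
    # Multiple/multiple, multiple/unmapped, unmapped/unmapped are discarded.
    return False
```

For example, `keep_read("unique", "multiple", False, mode="soft")` keeps the read, while the same call with `mode="hard"` discards it.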

SNV Detection and Genotyping
Reference:
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Mapped reads:
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Notation (figure labels: Locus i, R_i):
– r(i): base call of read r at locus i
– ε_r(i): probability of error reading base call r(i)
– G_i: genotype at locus i

SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one

SNV Detection and Genotyping Calculate conditional probabilities by multiplying contributions of individual reads
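The two steps above (per-read conditional probabilities multiplied together, then Bayes' rule) can be sketched as follows. The formulas themselves are not in the transcript, so this sketch makes its own assumptions: a diploid genotype whose two alleles are sampled with probability 1/2 each, an erroneous call that is equally likely to be any of the three other bases, error probabilities strictly between 0 and 1, and a uniform prior over the ten unordered genotypes. The function names are hypothetical.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"

def read_likelihood(base, err, genotype):
    """P(observed base | genotype) for one read under the assumed error model."""
    return sum(
        0.5 * ((1.0 - err) if base == allele else err / 3.0)
        for allele in genotype
    )

def genotype_posteriors(calls, priors=None):
    """Posterior over diploid genotypes at one locus.

    calls: list of (base call r(i), error probability eps_r(i)) pairs,
           with 0 < eps < 1 so that every likelihood is positive.
    priors: optional dict genotype -> prior; uniform if omitted (assumption).
    """
    genotypes = list(combinations_with_replacement(BASES, 2))
    if priors is None:
        priors = {g: 1.0 / len(genotypes) for g in genotypes}

    # Work in log space: multiply contributions of individual reads.
    log_post = {}
    for g in genotypes:
        total = math.log(priors[g])
        for base, err in calls:
            total += math.log(read_likelihood(base, err, g))
        log_post[g] = total

    # Normalize (Bayes' rule), subtracting the max for numerical stability.
    top = max(log_post.values())
    unnorm = {g: math.exp(v - top) for g, v in log_post.items()}
    z = sum(unnorm.values())
    return {g: v / z for g, v in unnorm.items()}

def call_genotype(calls, priors=None):
    """Pick the genotype with the largest posterior probability."""
    post = genotype_posteriors(calls, priors)
    return max(post, key=post.get)
```

With five reads calling A and five calling C at the same locus, each with error probability 0.01, `call_genotype` selects the heterozygous genotype `("A", "C")`.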

Data Filtering

Accuracy per RPKM bins

Haplotyping
Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.
ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA
Heterozygous variants

Haplotyping
Locus  Event      Alleles
1      SNV        C,T
2      Deletion   C,-
3      SNV        A,G
4      Insertion  -,GC
Locus  Event      Alleles Hap 1  Alleles Hap 2
1      SNV        T              C
2      Deletion   C              -
3      SNV        A              G
4      Insertion  -              GC

RefHap Algorithm
– Reduce the problem to Max-Cut
– Solve Max-Cut
– Build haplotypes according to the cut
(Figure: fragment matrix over loci 1-5 with fragment rows f1-f4 and haplotype rows h)
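A toy version of the three steps above is sketched below. It is not RefHap's actual heuristic: the edge weight (disagreements minus agreements on shared loci), the simple flip-based local search, and the majority-vote consensus are illustrative assumptions, alleles are assumed biallelic and coded 0/1, and all function names and the example fragments are hypothetical.

```python
def fragment_score(f, g):
    """Edge weight: +1 per shared locus where fragments disagree, -1 where they agree."""
    shared = f.keys() & g.keys()
    return sum(1 if f[locus] != g[locus] else -1 for locus in shared)

def local_search_max_cut(frags):
    """Heuristic Max-Cut: flip a fragment's side while that increases the cut weight."""
    n = len(frags)
    w = [[fragment_score(frags[i], frags[j]) if i != j else 0 for j in range(n)]
         for i in range(n)]
    side = [0] * n
    improved = True
    while improved:
        improved = False
        for v in range(n):
            # Flipping v cuts its same-side edges and un-cuts its cross edges.
            gain = sum(w[v][u] * (1 if side[u] == side[v] else -1)
                       for u in range(n) if u != v)
            if gain > 0:
                side[v] ^= 1
                improved = True
    return side

def build_haplotypes(frags, side):
    """Majority vote per locus; fragments on side 1 vote with their alleles flipped."""
    loci = sorted({locus for f in frags for locus in f})
    hap1 = {}
    for locus in loci:
        votes = 0
        for f, s in zip(frags, side):
            if locus in f:
                allele = f[locus] if s == 0 else 1 - f[locus]
                votes += 1 if allele == 1 else -1
        hap1[locus] = 1 if votes > 0 else 0
    hap2 = {locus: 1 - allele for locus, allele in hap1.items()}
    return hap1, hap2

# Hypothetical fragments (locus -> allele) sampled from haplotypes 00100 and 11011.
fragments = [
    {1: 0, 2: 0, 3: 1},
    {2: 1, 3: 0, 4: 1},
    {3: 1, 4: 0, 5: 0},
    {4: 1, 5: 1},
]
sides = local_search_max_cut(fragments)
hap_a, hap_b = build_haplotypes(fragments, sides)
```

On this example the cut separates fragments 1 and 3 from fragments 2 and 4, and the two consensus haplotypes are complementary. Each flip strictly increases the cut weight, so the local search terminates, but it may stop at a local optimum.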

Immunology Background
J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3: , 2003

Epitope Prediction C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239: , 2004

Results on Tumor Data

Experimental Validation
– Mutations reported by [Noguchi et al 94] found by the pipeline
– Confirmed with Sanger sequencing 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5
– Immunogenic potential under experimental validation in the Srivastava lab at UCHC

RNA-Seq
Make cDNA & shatter into fragments → Sequence fragment ends → Map reads
– Gene Expression (GE)
– Isoform Discovery (ID)
– Isoform Expression (IE)
(Figure labels: A B C D E; ABC, AC, DE)

Alternative Splicing [Griffith and Marra 07]

Challenges to Accurate Estimation of Gene Expression Levels
– Read ambiguity (multireads)
– What is the gene length?
(Figure labels: A B C D E)

Previous approaches to GE
– Ignore multireads
– [Mortazavi et al. 08] – Fractionally allocate multireads based on unique read estimates
– [Pasaniuc et al. 10] – EM algorithm for solving ambiguities
– Gene length: sum of lengths of exons that appear in at least one isoform
  → Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]

Read Ambiguity in IE ABCDE AC

Previous approaches to IE
– [Jiang&Wong 09] – Poisson model + importance sampling, single reads
– [Richard et al. 10] – EM Algorithm based on Poisson model, single reads in exons
– [Li et al. 10] – EM Algorithm, single reads
– [Feng et al. 10] – Convex quadratic program, pairs used only for ID
– [Trapnell et al. 10] – Extends Jiang’s model to paired reads
  – Fragment length distribution

Our contribution
Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores

Read-Isoform Compatibility

Fragment length distribution: Paired reads
(Figure labels: ABC, AC; i, j; F_a(i), F_a(j))

Fragment length distribution: Single reads
(Figure labels: ABC, AC; i, j; F_a(i), F_a(j))

IsoEM algorithm E-step M-step
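As a rough illustration of the E-step/M-step loop, here is a minimal sketch of an EM iteration for isoform frequencies. It is not the IsoEM implementation: the per-read compatibility weights are taken as given (in a full model they would combine fragment length probability, strand, and base qualities), the length normalization uses plain isoform lengths, and the names `isoform_em`, `read_weights`, and `lengths` are hypothetical.

```python
def isoform_em(read_weights, lengths, iterations=1000, tol=1e-9):
    """Estimate isoform frequencies from read-isoform compatibility weights.

    read_weights: one dict per read, mapping isoform name -> weight w(r, j) > 0
                  for every isoform the read is compatible with.
    lengths: dict isoform name -> length used for normalization.
    """
    isoforms = sorted(lengths)
    f = {j: 1.0 / len(isoforms) for j in isoforms}  # uniform start

    for _ in range(iterations):
        # E-step: expected number of reads n(j) sampled from each isoform,
        # assuming the current frequencies f are correct.
        n = dict.fromkeys(isoforms, 0.0)
        for w in read_weights:
            z = sum(f[j] * wj for j, wj in w.items())
            if z == 0.0:
                continue  # read incompatible with every expressed isoform
            for j, wj in w.items():
                n[j] += f[j] * wj / z

        # M-step: frequencies proportional to expected reads per unit length.
        density = {j: n[j] / lengths[j] for j in isoforms}
        total = sum(density.values())
        if total == 0.0:
            break
        new_f = {j: density[j] / total for j in isoforms}

        change = max(abs(new_f[j] - f[j]) for j in isoforms)
        f = new_f
        if change < tol:
            break
    return f

# Hypothetical example: isoform ABC (exons A, B, C) and isoform AC (exon B skipped).
lengths = {"ABC": 300, "AC": 200}
reads = [{"ABC": 1.0}] * 40 + [{"ABC": 1.0, "AC": 1.0}] * 100
freqs = isoform_em(reads, lengths)
```

In this toy data set 40 reads fall in exon B (compatible only with ABC) and 100 reads are compatible with both isoforms; the iteration settles at a frequency of about 0.8 for ABC and 0.2 for AC.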

Error Fraction Curves - Isoforms 30M single reads of length 25 (simulated)

Error Fraction Curves - Genes 30M single reads of length 25 (simulated)

Validation on MAQC Samples

Viral Quasispecies
– RNA viruses (HIV, HCV)
– Many replication mistakes
– Quasispecies (qsps) = co-existing closely related variants
– Variants differ in
  – virulence
  – ability to escape the immune system
  – resistance to antiviral therapies
  – tissue tropism
– How do qsps contribute to viral persistence and evolution?

454 Pyrosequencing
– Pyrosequencing = Sequencing by Synthesis
– GS FLX Titanium: reads
  – Fragments (reads): bp
  – Sequence of the reads
– System software assembles reads into a single genome
– We need software that assembles reads into multiple genomes!

Quasispecies Spectrum Reconstruction (QSR) Problem
Given: pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct the quasispecies spectrum:
– sequences
– frequencies

ViSpA Viral Spectrum Assembler

454 Sequencing Errors
– Error rate ~0.1%
– Fixed number of incorporated bases vs. light intensity value
– Incorrect resolution of homopolymers =>
  – over-calls (insertions): 65-75% of errors
  – under-calls (deletions): 20-30% of errors

Preprocessing of Aligned Reads
1. Deletions in reads (D): replace a deletion that is confirmed by a single read with either the allele value present in all other reads or N.
2. Insertions into reference (I): remove insertions that are confirmed by a single read.
3. Imputation of missing values (N)

Read Graph: Vertices
– Subread = completely contained in some read with ≤ n mismatches
– Superread = not a subread => the vertex in the read graph
ACTGGTCCCTCCTGAGTGT
GGTCCCTCCT
TGGTCACTCGTGAG
ACCTCATCGAAGCGGCGTCCT

Read Graph: Edges
– An edge between two vertices exists if the two superreads overlap and agree on their overlap with ≤ m mismatches
– Auxiliary vertices: source and sink
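The vertex and edge definitions of the read graph can be sketched for reads that are already aligned to a reference. This is an illustrative sketch only: reads are represented as hypothetical `(start, sequence)` pairs, indels are ignored, and the start offsets assigned to the slide's first three example reads, as well as the fourth read, are made up for the example.

```python
def mismatches(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def is_subread(read, other, n):
    """True if `read` lies completely inside `other` with at most n mismatches."""
    (s1, q1), (s2, q2) = read, other
    if s1 < s2 or s1 + len(q1) > s2 + len(q2):
        return False
    offset = s1 - s2
    return mismatches(q1, q2[offset:offset + len(q1)]) <= n

def superreads(reads, n):
    """Keep reads that are not subreads of any other read (the graph vertices)."""
    kept = []
    for i, read in enumerate(reads):
        contained = False
        for j, other in enumerate(reads):
            if i == j or not is_subread(read, other, n):
                continue
            # Reads covering the same span contain each other; keep the first one.
            same_span = read[0] == other[0] and len(read[1]) == len(other[1])
            if not same_span or j < i:
                contained = True
                break
        if not contained:
            kept.append(read)
    return kept

def has_edge(u, v, m):
    """Edge u -> v: v starts inside u and they agree on the overlap (<= m mismatches)."""
    (s1, q1), (s2, q2) = u, v
    end1 = s1 + len(q1)
    if not (s1 < s2 < end1):
        return False
    overlap = min(end1, s2 + len(q2)) - s2
    shift = s2 - s1  # the overhang between the two superreads
    return mismatches(q1[shift:shift + overlap], q2[:overlap]) <= m

# Offsets are assumptions chosen so the reads line up.
read_a = (0, "ACTGGTCCCTCCTGAGTGT")
read_b = (3, "GGTCCCTCCT")         # exact subread of read_a
read_c = (2, "TGGTCACTCGTGAG")     # inside read_a with 2 mismatches
read_d = (12, "TGAGTGTACGA")       # hypothetical read extending past read_a
reads = [read_a, read_b, read_c, read_d]
```

With `n = 0` only `read_b` is absorbed; with `n = 2` `read_c` becomes a subread as well, so the choice of n directly controls how many vertices the graph has.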

Read Graph: Edge Cost
– The most probable source-sink path through each vertex
– Cost: uncertainty that two superreads are from the same qsps
– Overhang Δ is the shift in start positions of two overlapping superreads

Contig Assembling
– Max Bandwidth Path through vertex: path minimizing maximum edge cost for the path and each subpath
– Consensus of path’s superreads
  – Each position: >70%-majority or N
  – Weighted consensus obtained on all reads
– Remove duplicates
  – Duplicated sequences = statistical evidence
Notation: r_l is a read r of length l; s_L is a qsps s of length L; k is #mismatches; t/L is a mutation rate
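The idea of a path that minimizes the maximum edge cost can be sketched with a Dijkstra-style search in which a path's cost is the largest edge cost on it rather than the sum. This is a generic bottleneck-path sketch, not ViSpA's code: it assumes a directed acyclic graph with non-negative edge costs stored as nested dicts, it does not enforce the extra "each subpath" condition from the slide, and the function names and example graph are hypothetical.

```python
import heapq

def minimax_path(graph, source, target):
    """Path from source to target whose largest edge cost is as small as possible.

    graph: dict node -> dict of successor -> non-negative edge cost.
    Returns (bottleneck cost, list of nodes) or None if target is unreachable.
    """
    best = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        cost, u = heapq.heappop(heap)
        if cost > best.get(u, float("inf")):
            continue  # stale heap entry
        if u == target:
            break
        for v, edge_cost in graph.get(u, {}).items():
            candidate = max(cost, edge_cost)
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    if target not in best:
        return None
    path, node = [], target
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[target], path[::-1]

def minimax_path_through(graph, source, sink, vertex):
    """Best source-sink path forced through `vertex` (graph assumed acyclic)."""
    head = minimax_path(graph, source, vertex)
    tail = minimax_path(graph, vertex, sink)
    if head is None or tail is None:
        return None
    return max(head[0], tail[0]), head[1] + tail[1][1:]

# Hypothetical read graph with auxiliary source "s" and sink "t".
graph = {
    "s": {"a": 0.2, "b": 0.5},
    "a": {"c": 0.9, "d": 0.3},
    "b": {"d": 0.1},
    "c": {"t": 0.1},
    "d": {"t": 0.4},
}
```

Running `minimax_path_through` once per superread yields one candidate path per vertex; the consensus and duplicate-removal steps would then operate on those paths.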

Expectation Maximization
Bipartite graph:
– Q_q is a candidate with frequency f_q
– R_r is a read with observed frequency o_r
– Weight h_{q,r} = probability that read r is produced by qsps q with j mismatches
E step: M step:
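The E-step and M-step formulas are not present in this transcript. Under the notation of this slide, a standard mixture-model EM would take the following form; this is an assumed reconstruction for orientation, not necessarily the exact update used by ViSpA. The binomial form of the weight, with read length l and per-base error rate ε, is likewise an assumption, and p_{q,r} is a symbol introduced here for the posterior probability that read r came from candidate q.

```latex
% Assumed weight: read r of length l differs from candidate q in j positions
h_{q,r} = \binom{l}{j}\,(1-\varepsilon)^{\,l-j}\,\varepsilon^{\,j}

% E step: posterior probability that read r was produced by candidate q
p_{q,r} = \frac{f_q\, h_{q,r}}{\sum_{q'} f_{q'}\, h_{q',r}}

% M step: re-estimate candidate frequencies from the observed read frequencies
f_q = \frac{\sum_{r} p_{q,r}\, o_r}{\sum_{q'} \sum_{r} p_{q',r}\, o_r}
```

Because the posteriors p_{q,r} sum to 1 over q for each read, the denominator of the M step equals the sum of the observed read frequencies, so the updated f_q always sum to 1.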

HCV Qsps (P. Balfe)
– reads from 5.2Kb-long region of HCV-1a genomes
– intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]
– reads average length = 292bp
– Indels: ~77% of reads
  – Insertions length: 1 (86%), 3 (9.8%)
  – Deletions length: 1 (98%)
– N: ~7% of reads

HCV Data Statistics

NJ Tree for 12 Most Frequent Qsps (No Insertions)
– The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.
– In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.
– Reconstructed sequence with highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.

Conclusions & Future Work
Implementations of these methods are freely available at
Ongoing work
– Monitoring immune responses by TCR sequencing
– Isoform discovery
– Computational deconvolution of heterogeneous samples
– Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads

Acknowledgments
Immunogenomics
– Jorge Duitama (KU Leuven)
– Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
– Matt Alessandri and Kelly Gonzalez (Ambry Genetics)
IsoEM
– Marius Nicolae (UConn)
– Alex Zelikovsky, Serghei Mangul (GSU)
ViSpA
– Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
– Peter Balfe (Birmingham University, UK)
Funding
– NSF awards IIS , IIS , and DBI
– UCONN Research Foundation UCIG grant