1
Next-Generation Sequencing: Challenges and Opportunities
Ion Mandoiu
Computer Science and Engineering Department
University of Connecticut
2
Outline
– Background on high-throughput sequencing
– Identification of tumor-specific epitopes
– Estimation of gene and isoform expression levels
– Viral quasispecies reconstruction
– Future work
3
Advances in High-Throughput Sequencing (HTS)
– Roche/454 FLX Titanium: 400-600 million bases/run, 400bp avg. read length
– Illumina HiSeq 2000: up to 6 billion PE reads/run, 35-100bp read length
– SOLiD 4: 1.4-2.4 billion PE reads/run, 35-50bp read length
http://www.economist.com/node/16349358
4
Illumina Workflow – Library Preparation Genomic DNA mRNA
5
Illumina Workflow – Cluster Generation
6
Illumina Workflow – Sequencing by Synthesis
7
Cost of Whole Genome Sequencing
[Figure labels: C. Venter, Sanger @ 7.5x; J. Watson, 454 @ 7.4x; NA18507, Illumina @ 36x; SOLiD @ 12x]
8
HTS applications
HTS is a transformative technology
Numerous applications besides de novo genome sequencing:
– RNA-Seq
– Non-coding RNAs
– ChIP-Seq
– Epigenetics
– Structural variation
– Metagenomics
– Paleogenomics
– …
10
Genomics-Guided Cancer Immunotherapy
Tumor mRNA Sequencing => Tumor Specific Epitopes => Peptide Synthesis => Immune System Stimulation => Tumor Remission
Example reads:
CTCAATTGATGAAATTGTTCTGAAACT
GCAGAGATAGCTAAAGGATACCGGGTT
CCGGTATCCTTTAGCTATCTCTGCCTC
CTGACACCATCTGTGTGGGCTACCATG
…
AGGCAAGCTCATGGCCAAATCATGAGA
Example epitopes: SYFPEITHI, ISETDLSLL, CALRRNESL, …
Mouse image source: http://www.clker.com/clipart-simple-cartoon-mouse-2.html
11
Bioinformatics Pipeline
– Tumor mRNA reads => CCDS Mapping and Genome Mapping => CCDS mapped reads and genome mapped reads
– Read Merging => mapped reads
– SNVs Detection => tumor-specific SNVs
– Haplotyping => close SNV haplotypes
– Epitope Prediction => tumor-specific epitopes
– Primers Design => primers for Sanger sequencing
13
Mapping mRNA Reads
[Figure: RNA-Seq alignment, http://en.wikipedia.org/wiki/File:RNA-Seq-alignment.png]
14
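Read Merging
Genome     | CCDS       | Agree? | Hard Merge | Soft Merge
Unique     | Unique     | Yes    | Keep       | Keep
Unique     | Unique     | No     | Throw      | Throw
Unique     | Multiple   | No     | Throw      | Keep
Unique     | Not mapped | No     | Keep       | Keep
Multiple   | Unique     | No     | Throw      | Keep
Multiple   | Multiple   | No     | Throw      | Throw
Multiple   | Not mapped | No     | Throw      | Throw
Not mapped | Unique     | No     | Keep       | Keep
Not mapped | Multiple   | No     | Throw      | Throw
Not mapped | Not mapped | Yes    | Throw      | Throw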
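The merge table can be written as a small decision function. This is a minimal sketch of the rules as transcribed from the slide, not the authors' implementation; the function name `merge_read`, its arguments, and the status strings are hypothetical.

```python
# Mapping status of a read against each reference: "unique", "multiple",
# or "unmapped". All names here are illustrative assumptions.

def merge_read(genome, ccds, agree, mode="hard"):
    """Return "keep" or "throw" for one read.

    genome, ccds: mapping status of the read against each reference.
    agree: True if the two unique alignments point to the same locus
           (only consulted when both statuses are "unique").
    mode: "hard" or "soft" merging.
    """
    if genome == "unique" and ccds == "unique":
        return "keep" if agree else "throw"
    if "unmapped" in (genome, ccds):
        # With one alignment missing, keep the read only if the other is unique.
        return "keep" if "unique" in (genome, ccds) else "throw"
    if genome == "multiple" and ccds == "multiple":
        return "throw"
    # One alignment is unique and the other is multiple:
    # hard merging discards the read, soft merging keeps it.
    return "keep" if mode == "soft" else "throw"


print(merge_read("unique", "multiple", False, mode="hard"))
print(merge_read("unique", "multiple", False, mode="soft"))
```

The only rows where the two modes differ are those pairing a unique alignment with a multiple one, which is why `mode` is consulted only in the last line.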
15
SNV Detection and Genotyping
Reference:
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Reads:
AACGCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAG
CGCGGCCAGCCGGCTTCTGTCGGCCAGCAGCCCGGA
GCGGCCAGCCGGCTTCTGTCGGCCAGCCGGCAGGGA
GCCAGCCGGCTTCTGTCGGCCAGCAGCCAGGAATCT
GCCGGCTTCTGTCGGCCAGCAGCCAGGAATCTGGAA
CTTCTGTCGGCCAGCCGGCAGGAATCTGGAAACAAT
CGGCCAGCAGCCAGGAATCTGGAAACAATGGCTACA
CCAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
CAAGCAGCCAGGAATCTGGAAACAATGGCTACAGCG
GCAGCCAGGAATCTGGAAACAATGGCTACAGCGTGC
Notation:
– R_i: set of reads covering locus i
– r(i): base call of read r at locus i
– ε_r(i): probability of error reading base call r(i)
– G_i: genotype at locus i
16
SNV Detection and Genotyping Use Bayes rule to calculate posterior probabilities and pick the genotype with the largest one
17
SNV Detection and Genotyping Calculate conditional probabilities by multiplying contributions of individual reads
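The two steps above (per-read contributions multiplied into a likelihood, then Bayes' rule) can be sketched as follows. This is an illustrative model under stated assumptions, not necessarily the exact formula on the slides: each read is assumed to sample either allele of a diploid genotype with probability 1/2, an erroneous call is assumed equally likely to be any of the three other bases, and the prior is uniform unless supplied. The function names are hypothetical, and every error probability must lie strictly between 0 and 1.

```python
import math
from itertools import combinations_with_replacement

BASES = "ACGT"


def genotype_posteriors(calls, prior=None):
    """calls: list of (base, error_prob) pairs, one per read covering the locus.

    Returns a dict mapping each diploid genotype (e.g. "AC") to its
    posterior probability.
    """
    genotypes = ["".join(g) for g in combinations_with_replacement(BASES, 2)]
    if prior is None:
        # Uniform prior over the 10 unordered genotypes (an assumption).
        prior = {g: 1.0 / len(genotypes) for g in genotypes}

    log_joint = {}
    for g in genotypes:
        total = math.log(prior[g])
        for base, eps in calls:
            # P(call | allele) is 1 - eps on a match and eps / 3 otherwise;
            # average over the two alleles of the genotype.
            p = sum((1 - eps) if base == allele else eps / 3 for allele in g) / 2
            total += math.log(p)
        log_joint[g] = total

    # Normalize in log space to avoid underflow at high coverage.
    peak = max(log_joint.values())
    weights = {g: math.exp(v - peak) for g, v in log_joint.items()}
    norm = sum(weights.values())
    return {g: w / norm for g, w in weights.items()}


def call_genotype(calls, prior=None):
    """Pick the genotype with the largest posterior probability."""
    post = genotype_posteriors(calls, prior)
    return max(post, key=post.get)


het_calls = [("A", 0.01)] * 5 + [("C", 0.01)] * 5
print(call_genotype(het_calls))
```

Working with log-likelihoods keeps the product over reads numerically stable; the posteriors are recovered only at the end.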
18
Data Filtering
19
Accuracy per RPKM bins
21
Haplotyping
Human somatic cells are diploid, containing two sets of nearly identical chromosomes, one set derived from each parent.
ACGTTACATTGCCACTCAATC--TGGA
ACGTCACATTG-CACTCGATCGCTGGA
Heterozygous variants
22
Haplotyping
Locus | Event     | Alleles
1     | SNV       | C,T
2     | Deletion  | C,-
3     | SNV       | A,G
4     | Insertion | -,GC

Locus | Event     | Alleles Hap 1 | Alleles Hap 2
1     | SNV       | T             | C
2     | Deletion  | C             | -
3     | SNV       | A             | G
4     | Insertion | -             | GC
23
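RefHap Algorithm
– Reduce the problem to Max-Cut.
– Solve Max-Cut.
– Build haplotypes according to the cut.
Fragment matrix:
Locus | 1 2 3 4 5
f1    | - 0 1 1 0
f2    | 1 1 0 - 1
f3    | 1 - - 0 -
f4    | - 0 0 - 1
[Figure: fragment graph; extracted labels: 3 1 1 1 4 2 3]
Resulting haplotypes:
h1 = 00110
h2 = 11001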
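The reduction can be illustrated on the four fragments from this slide. The sketch below is a simplification, not RefHap itself: the edge weight (mismatches minus matches over shared loci) is an assumed scoring, and the cut is found by brute force, which is only feasible for a handful of fragments, whereas a practical tool would use a Max-Cut heuristic. The function names are hypothetical.

```python
from itertools import product


def pair_weight(f, g):
    """Mismatches minus matches over loci covered by both fragments."""
    w = 0
    for a, b in zip(f, g):
        if a != "-" and b != "-":
            w += 1 if a != b else -1
    return w


def haplotypes_by_max_cut(fragments):
    """Return (h1, h2, cut_value) for equal-length fragments over {0, 1, -}."""
    n = len(fragments)
    best_cut, best_side = None, None
    # Fix fragment 0 on side 0 so mirrored cuts are not enumerated twice.
    for rest in product((0, 1), repeat=n - 1):
        side = (0,) + rest
        cut = sum(
            pair_weight(fragments[i], fragments[j])
            for i in range(n)
            for j in range(i + 1, n)
            if side[i] != side[j]
        )
        if best_cut is None or cut > best_cut:
            best_cut, best_side = cut, side

    # Majority vote per locus: side-0 fragments vote for their allele,
    # side-1 fragments vote for the complementary allele.
    h1 = []
    for locus in range(len(fragments[0])):
        votes = {"0": 0, "1": 0}
        for frag, s in zip(fragments, best_side):
            allele = frag[locus]
            if allele == "-":
                continue
            if s == 1:
                allele = "1" if allele == "0" else "0"
            votes[allele] += 1
        h1.append("0" if votes["0"] >= votes["1"] else "1")
    h1 = "".join(h1)
    h2 = "".join("1" if a == "0" else "0" for a in h1)
    return h1, h2, best_cut


fragments = ["-0110", "110-1", "1--0-", "-00-1"]
h1, h2, cut = haplotypes_by_max_cut(fragments)
print(h1, h2, cut)  # 00110 11001 5
```

On this input the best cut separates f1 from the other three fragments, and the consensus reproduces the haplotypes shown on the slide.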
25
Immunology Background
J.W. Yewdell, E. Reits and J. Neefjes. Making sense of mass destruction: quantitating MHC class I antigen presentation. Nature Reviews Immunology, 3:952-961, 2003
26
Epitope Prediction
C. Lundegaard et al. MHC Class I Epitope Binding Prediction Trained on Small Data Sets. In Lecture Notes in Computer Science, 3239:217-225, 2004
27
Results on Tumor Data
28
Experimental Validation
– Mutations reported by [Noguchi et al 94] found by the pipeline
– Confirmed with Sanger sequencing 18 out of 20 mutations for MethA and 26 out of 28 mutations for CMS5
– Immunogenic potential under experimental validation in the Srivastava lab at UCHC
30
RNA-Seq
– Make cDNA & shatter into fragments
– Sequence fragment ends
– Map reads
Analyses: Gene Expression (GE), Isoform Discovery (ID), Isoform Expression (IE)
[Figure labels: ABCDE; ABC, AC, DE]
31
Alternative Splicing [Griffith and Marra 07]
32
Challenges to Accurate Estimation of Gene Expression Levels
– Read ambiguity (multireads)
– What is the gene length?
[Figure labels: ABCDE]
33
Previous approaches to GE
– Ignore multireads
– [Mortazavi et al. 08] – Fractionally allocate multireads based on unique read estimates
– [Pasaniuc et al. 10] – EM algorithm for solving ambiguities
– Gene length: sum of lengths of exons that appear in at least one isoform
  – Underestimates expression levels for genes with 2 or more isoforms [Trapnell et al. 10]
34
Read Ambiguity in IE ABCDE AC
35
Previous approaches to IE
– [Jiang&Wong 09] – Poisson model + importance sampling, single reads
– [Richard et al. 10] – EM Algorithm based on Poisson model, single reads in exons
– [Li et al. 10] – EM Algorithm, single reads
– [Feng et al. 10] – Convex quadratic program, pairs used only for ID
– [Trapnell et al. 10] – Extends Jiang’s model to paired reads
  – Fragment length distribution
36
Our contribution
Unified probabilistic model and Expectation-Maximization Algorithm for IE considering
– Single and/or paired reads
– Fragment length distribution
– Strand information
– Base quality scores
37
Read-Isoform Compatibility
38
Fragment length distribution: Paired reads
[Figure labels: isoforms ABC and AC; positions i, j; F_a(i), F_a(j)]
39
Fragment length distribution: Single reads
[Figure labels: isoforms ABC and AC; positions i, j; F_a(i), F_a(j)]
40
IsoEM algorithm E-step M-step
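The E-step/M-step alternation can be sketched on a toy input. This is a simplified illustration, not the IsoEM implementation: the read-isoform compatibility weights are taken as given (in the model described above they would combine fragment length, strand, and base quality information), isoform frequencies are obtained by dividing read fractions by raw isoform length rather than an adjusted length, and all names are hypothetical.

```python
def isoform_em(compat, lengths, iterations=200):
    """Estimate isoform frequencies with a basic EM loop.

    compat: one dict per read, mapping isoform index -> compatibility weight.
    lengths: isoform lengths in bases.
    Returns isoform frequencies that sum to 1.
    """
    k = len(lengths)
    theta = [1.0 / k] * k  # fraction of reads originating from each isoform
    for _ in range(iterations):
        counts = [0.0] * k
        # E-step: split each read among its compatible isoforms in
        # proportion to theta[j] * weight.
        for weights in compat:
            total = sum(theta[j] * w for j, w in weights.items())
            if total == 0:
                continue
            for j, w in weights.items():
                counts[j] += theta[j] * w / total
        # M-step: re-estimate read fractions from the expected counts.
        n = sum(counts)
        theta = [c / n for c in counts]
    # Convert read fractions to isoform frequencies: longer isoforms
    # emit proportionally more reads per molecule.
    per_base = [t / length for t, length in zip(theta, lengths)]
    s = sum(per_base)
    return [p / s for p in per_base]


# Toy data: 6 reads compatible only with isoform 0, 2 only with isoform 1,
# and 2 reads compatible with both.
reads = [{0: 1.0}] * 6 + [{1: 1.0}] * 2 + [{0: 1.0, 1: 1.0}] * 2
print(isoform_em(reads, lengths=[1000, 500]))
```

On this toy input the read fractions converge to 0.75 and 0.25; after length normalization the isoform frequencies are approximately 0.6 and 0.4.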
41
Error Fraction Curves - Isoforms 30M single reads of length 25 (simulated)
42
Error Fraction Curves - Genes 30M single reads of length 25 (simulated)
43
Validation on MAQC Samples
45
Viral Quasispecies
– RNA viruses (HIV, HCV): many replication mistakes
– Quasispecies (qsps) = co-existing closely related variants
– Variants differ in
  – virulence
  – ability to escape the immune system
  – resistance to antiviral therapies
  – tissue tropism
– How do qsps contribute to viral persistence and evolution?
46
454 Pyrosequencing
– Pyrosequencing = sequencing by synthesis.
– GS FLX Titanium:
  – Fragments (reads): 300-800 bp
  – Sequence of the reads
  – System software assembles reads into a single genome
– We need software that assembles reads into multiple genomes!
47
Quasispecies Spectrum Reconstruction (QSR) Problem
Given: pyrosequencing reads from a quasispecies population of unknown size and distribution
Reconstruct: the quasispecies spectrum
– sequences
– frequencies
48
ViSpA Viral Spectrum Assembler
49
454 Sequencing Errors
– Error rate ~0.1%.
– Fixed number of incorporated bases vs. light intensity value.
– Incorrect resolution of homopolymers =>
  – over-calls (insertions): 65-75% of errors
  – under-calls (deletions): 20-30% of errors
50
Preprocessing of Aligned Reads
1. Deletions in reads (D): replace a deletion confirmed by a single read with either the allele value that is present in all other reads or N.
2. Insertions into reference (I): remove insertions confirmed by a single read.
3. Imputation of missing values (N).
51
Read Graph: Vertices
– Subread = a read completely contained in some other read with ≤ n mismatches.
– Superread = not a subread => a vertex in the read graph.
Example reads:
ACTGGTCCCTCCTGAGTGT
GGTCCCTCCT
TGGTCACTCGTGAG
ACCTCATCGAAGCGGCGTCCT
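The subread/superread definitions can be sketched as a filter over aligned reads. This is an illustrative sketch rather than the ViSpA code: reads are assumed to be `(start, sequence)` pairs in reference coordinates with indels already resolved, the toy reads are made up for the example, and the tie-breaking rule for reads covering the same interval is an assumption.

```python
def mismatches_if_contained(inner, outer):
    """Return the mismatch count if `inner` lies entirely within `outer`.

    Reads are (start, sequence) pairs in reference coordinates.
    Returns None when `inner` is not contained in `outer`.
    """
    (s1, q1), (s2, q2) = inner, outer
    if s1 < s2 or s1 + len(q1) > s2 + len(q2):
        return None
    offset = s1 - s2
    return sum(a != b for a, b in zip(q1, q2[offset:offset + len(q1)]))


def superreads(reads, n):
    """Keep the reads that are not subreads of another read (≤ n mismatches)."""
    kept = []
    for i, read in enumerate(reads):
        is_subread = False
        for j, other in enumerate(reads):
            if i == j:
                continue
            d = mismatches_if_contained(read, other)
            if d is None or d > n:
                continue
            # Assumed tie-break: of two reads spanning the same interval,
            # the earlier one in the list is kept as the superread.
            same_span = read[0] == other[0] and len(read[1]) == len(other[1])
            if same_span and i < j:
                continue
            is_subread = True
            break
        if not is_subread:
            kept.append(read)
    return kept


reads = [(0, "ACTGGTCCCT"), (2, "TGGTC"), (2, "TGATC"), (5, "TCCCTAA")]
print(superreads(reads, n=0))
print(superreads(reads, n=1))
```

With `n=0` only the exact substring `(2, "TGGTC")` is absorbed; raising `n` to 1 also absorbs `(2, "TGATC")`, which shows how the mismatch threshold controls the size of the read graph.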
52
Read Graph: Edges
An edge between two vertices exists if
– there is an overlap between the superreads, and
– they agree on their overlap with ≤ m mismatches.
Auxiliary vertices: source and sink
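The edge rule can be sketched as a pairwise test on superreads. As before, this is an illustration under assumptions: superreads are `(start, sequence)` pairs in reference coordinates, an edge is directed from the read that starts earlier to the one that starts later, the toy reads are made up, and the auxiliary source and sink vertices are omitted. The function names are hypothetical.

```python
def edge(u, v, m):
    """Return (overhang, mismatches) if there is an edge u -> v, else None.

    An edge requires that v starts after u, overlaps it, extends past its
    end, and agrees with it on the overlap with at most m mismatches.
    """
    (su, qu), (sv, qv) = u, v
    eu, ev = su + len(qu), sv + len(qv)
    if not (su < sv < eu < ev):
        return None
    delta = sv - su  # shift in start positions of the two superreads
    overlap_u = qu[delta:]
    overlap_v = qv[:eu - sv]
    d = sum(a != b for a, b in zip(overlap_u, overlap_v))
    return (delta, d) if d <= m else None


def build_edges(superreads, m):
    """Map (i, j) index pairs to (overhang, mismatches) for every edge."""
    edges = {}
    for i, u in enumerate(superreads):
        for j, v in enumerate(superreads):
            if i != j:
                e = edge(u, v, m)
                if e is not None:
                    edges[(i, j)] = e
    return edges


reads = [(0, "ACTGGTCCCT"), (5, "TCCCTAA"), (6, "CGCTTAAG")]
print(build_edges(reads, m=0))
print(build_edges(reads, m=1))
```

Because every edge points from an earlier start to a later start, the resulting graph is acyclic, which is what makes source-to-sink path computations straightforward.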
53
Read Graph: Edge Cost
– Goal: the most probable source-sink path through each vertex.
– Cost: uncertainty that two superreads are from the same qsps.
– Overhang Δ is the shift in start positions of two overlapping superreads.
54
Contig Assembling
– Max Bandwidth Path through vertex: path minimizing maximum edge cost for the path and each subpath
– Consensus of path’s superreads
  – Each position: >70%-majority or N
  – Weighted consensus obtained on all reads
– Remove duplicates
  – Duplicated sequences = statistical evidence
Notation: r_l is a read r of length l; s_L is a qsps s of length L; k is #mismatches; t/L is a mutation rate
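A path that minimizes its maximum edge cost can be found with a small change to Dijkstra's algorithm: a path is scored by its largest edge instead of the sum of its edges. The sketch below shows that idea on a toy graph; it covers only a single source-to-sink query, not the per-vertex and per-subpath requirements stated above, and the graph encoding and function name are assumptions.

```python
import heapq


def min_bottleneck_path(graph, source, sink):
    """Find a source-sink path whose largest edge cost is as small as possible.

    graph: dict mapping node -> list of (neighbor, cost) pairs.
    Returns (bottleneck, path), or None if the sink is unreachable.
    """
    best = {source: 0.0}      # smallest known bottleneck to reach each node
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        bottleneck, u = heapq.heappop(heap)
        if bottleneck > best.get(u, float("inf")):
            continue  # stale queue entry
        if u == sink:
            break
        for v, cost in graph.get(u, []):
            candidate = max(bottleneck, cost)
            if candidate < best.get(v, float("inf")):
                best[v] = candidate
                parent[v] = u
                heapq.heappush(heap, (candidate, v))
    if sink not in best:
        return None
    path = []
    node = sink
    while node is not None:
        path.append(node)
        node = parent[node]
    return best[sink], path[::-1]


toy_graph = {
    "source": [("a", 5), ("b", 2)],
    "a": [("sink", 1)],
    "b": [("sink", 3)],
}
print(min_bottleneck_path(toy_graph, "source", "sink"))
```

In the toy graph the route through `a` has a cheaper last edge but a worse bottleneck (5), so the route through `b` (bottleneck 3) is returned.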
55
Expectation Maximization
Bipartite graph:
– Q_q is a candidate with frequency f_q
– R_r is a read with observed frequency o_r
– Weight h_{q,r} = probability that read r is produced by qsps q with j mismatches
[Equations: E step, M step]
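The frequency estimation on the bipartite graph can be sketched as a standard mixture-model EM. This is an illustration under assumptions, not a transcription of the slide's equations: the E-step assigns each read to candidates in proportion to f_q · h_{q,r}, the M-step sets f_q to the candidate's share of the observed reads, and the binomial form of h_{q,r} in `read_given_candidate` is an assumed error model. All function names are hypothetical.

```python
from math import comb


def read_given_candidate(length, mismatches, error_rate):
    """Assumed binomial model for h_{q,r}: probability of seeing `mismatches`
    errors in a read of `length` bases at a per-base `error_rate`."""
    return (comb(length, mismatches)
            * (1 - error_rate) ** (length - mismatches)
            * error_rate ** mismatches)


def em_frequencies(h, o, iterations=200):
    """Estimate candidate frequencies f_q.

    h: matrix with h[q][r] = weight between candidate q and read r.
    o: observed frequency (count) of each distinct read r.
    """
    nq, nr = len(h), len(o)
    f = [1.0 / nq] * nq
    for _ in range(iterations):
        expected = [0.0] * nq
        for r in range(nr):
            denom = sum(f[q] * h[q][r] for q in range(nq))
            if denom == 0:
                continue  # read incompatible with every candidate
            # E-step: responsibility of candidate q for read r,
            # weighted by how often the read was observed.
            for q in range(nq):
                expected[q] += o[r] * f[q] * h[q][r] / denom
        # M-step: renormalize expected read counts into frequencies.
        total = sum(expected)
        f = [x / total for x in expected]
    return f


# Toy graph: read 0 matches only candidate 0, read 2 only candidate 1,
# and read 1 matches both equally well.
h = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 1.0]]
o = [3, 2, 1]
print(em_frequencies(h, o))
```

On this toy input the estimates converge to roughly 0.75 and 0.25: the shared read is divided in the same proportion as the reads that are unique to each candidate.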
56
HCV Qsps (P. Balfe)
– 30927 reads from 5.2Kb-long region of HCV-1a genomes
– Intravenous drug user infected for less than 3 months => mutation rate is in [1.75%, 8%]
– 27764 reads, average length = 292bp
– Indels: ~77% of reads
  – Insertions length: 1 (86%), 3 (9.8%)
  – Deletions length: 1 (98%)
– N: ~7% of reads
57
HCV Data Statistics
58
NJ Tree for 12 Most Frequent Qsps (No Insertions)
– The top sequence: 26.9% (no mismatches) and 50.4% (≤1 mismatch) of the reads.
– In sum: 35.6% (no mismatches) and 64.5% (≤1 mismatch) of the reads.
– Reconstructed sequence with highest frequency is 99% identical to one of the ORFs obtained by cloning the quasispecies.
59
Conclusions & Future Work
Freely available implementations of these methods are at http://dna.engr.uconn.edu/software/
Ongoing work
– Monitoring immune responses by TCR sequencing
– Isoform discovery
– Computational deconvolution of heterogeneous samples
– Reconstruction & frequency estimation of virus quasispecies from Ion Torrent reads
60
Acknowledgments
Immunogenomics
– Jorge Duitama (KU Leuven)
– Pramod K. Srivastava, Adam Adler, Brent Graveley, Duan Fei (UCHC)
– Matt Alessandri and Kelly Gonzalez (Ambry Genetics)
IsoEM
– Marius Nicolae (UConn)
– Alex Zelikovsky, Serghei Mangul (GSU)
ViSpA
– Alex Zelikovsky, Irina Astrovskaya, Bassam Tork, Serghei Mangul (GSU), and Kelly Westbrooks (Life Technologies)
– Peter Balfe (Birmingham University, UK)
Funding
– NSF awards IIS-0546457, IIS-0916948, and DBI-0543365
– UCONN Research Foundation UCIG grant