Reconstruction of infectious bronchitis virus quasispecies from 454 pyrosequencing reads CAME 2011 Ion Mandoiu Computer Science & Engineering Dept. University.


Reconstruction of infectious bronchitis virus quasispecies from 454 pyrosequencing reads CAME 2011 Ion Mandoiu Computer Science & Engineering Dept. University of Connecticut

Infectious Bronchitis Virus (IBV)
- Group 3 coronavirus
- Biggest single cause of economic loss in US poultry farms
- Young chickens: coughing, tracheal rales, dyspnea
- Broiler chickens: reduced growth rate
- Layers: egg production drops 5-50%; eggs are thin-shelled with watery albumen
- Worldwide distribution, with dozens of serotypes in circulation
- Co-infection with multiple serotypes is not uncommon, creating conditions for recombination

IBV
[Figure panels: healthy chicks; IBV-infected embryo next to a normal embryo; IBV-infected egg defect]

IBV Vaccination
- Broadly used, most commonly with attenuated live vaccine
- Short-lived protection
- Layers need to be re-vaccinated multiple times during their lifespan
- Vaccines might undergo selection in vivo and regain virulence [Hilt, Jackwood, and McKinley 2008]

Evolution of IBV
Quasispecies have been identified by cloning and Sanger sequencing in both IBV-infected poultry and commercial vaccines [Jackwood, Hilt, and Callison 2003; Hilt, Jackwood, and McKinley 2008]

[Figure taken from Rev. Bras. Cienc. Avic. vol. 12 no. 2, Campinas, Apr./June 2010]

S1 Gene RT-PCR
[Figure panels: published primers; primers redesigned using PrimerHunter]

ViSpA: Viral Spectrum Assembler [Astrovskaya et al. 2011]
Input: shotgun 454 reads
1. Error Correction
2. Read Alignment
3. Preprocessing of Aligned Reads
4. Read Graph Construction
5. Contig Assembly
6. Frequency Estimation
Output: quasispecies sequences with frequencies

k-mer Error Correction [Skums et al.]
1. Calculate k-mers and their frequencies kc(s) (k-counts). Assume that k-mers with high k-counts (“solid” k-mers) are correct, while k-mers with low k-counts (“weak” k-mers) contain errors.
2. Determine the threshold k-count (error threshold), which distinguishes solid k-mers from weak k-mers.
3. Find error regions.
4. Correct the errors in error regions.
Zhao X et al. 2010
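
Steps 1 to 3 can be sketched roughly as follows. This is a simplified illustration, not the corrector used in the talk: the function names `kmer_counts` and `weak_positions` are hypothetical, the error threshold is passed in by hand rather than determined from the k-count distribution, and an "error region" is assumed to mean the read positions that no solid k-mer covers.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Step 1: count every k-mer over all reads (the k-counts)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_positions(read, counts, k, threshold):
    """Step 3 (assumed definition): positions not covered by any solid k-mer.

    A k-mer is treated as solid when its k-count is >= threshold.
    """
    covered = [False] * len(read)
    for i in range(len(read) - k + 1):
        if counts[read[i:i + k]] >= threshold:
            for pos in range(i, i + k):
                covered[pos] = True
    return [pos for pos, ok in enumerate(covered) if not ok]

# Toy data: five copies of one read plus one read with a substitution.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = kmer_counts(reads, k=4)
print(weak_positions("ACGTTCGT", counts, k=4, threshold=2))
```

In this toy input the substituted base sits at the start of the flagged region; step 4 would then search for the correction that turns the weak k-mers in that region into solid ones.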

Iterated Read Alignment
1. Read alignment vs. reference
2. Build consensus
3. Read re-alignment vs. consensus
4. More reads aligned? Yes: return to step 2. No: continue to post-processing.

Read Coverage
145K 454 reads of avg. length 400 bp (~60 Mb) sequenced from 2 samples (M41 vaccine and M42 isolate)

Post-processing of Aligned Reads
1. Deletions in reads: D
2. Insertions into reference: I
3. Additional error correction:
   - Replace deletions supported by a single read with either the allele present in all other reads or N
   - Remove insertions supported by a single read
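
The additional error-correction step can be illustrated with a small sketch. The data layout is an assumption made for the example: each aligned read is a string in reference coordinates, with `-` for a deletion and a space where the read does not cover the column, and insertions are listed separately as hypothetical `(read_id, reference_position, inserted_bases)` tuples.

```python
from collections import Counter

def correct_single_deletions(rows):
    """Replace a deletion seen in exactly one read with the allele shared
    by all other covering reads, or with N if those reads disagree."""
    rows = [list(r) for r in rows]
    for col in range(len(rows[0])):
        column = [r[col] for r in rows if r[col] != ' ']
        if column.count('-') != 1:
            continue  # no deletion here, or it is supported by several reads
        others = {base for base in column if base != '-'}
        replacement = others.pop() if len(others) == 1 else 'N'
        for r in rows:
            if r[col] == '-':
                r[col] = replacement
    return [''.join(r) for r in rows]

def drop_single_insertions(insertions):
    """Keep only insertions that at least two reads agree on."""
    support = Counter((pos, seq) for _, pos, seq in insertions)
    return [ins for ins in insertions if support[(ins[1], ins[2])] > 1]

print(correct_single_deletions(["ACGT", "A-GT", "ACGT"]))
print(correct_single_deletions(["ACGT", "A-GT", "AGGT"]))
```

The first call fills the lone deletion with the allele shared by the other reads; the second falls back to N because the other reads disagree at that column.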

Read Graph: Vertices
- Subread = a read completely contained in some other read with ≤ n mismatches.
- Superread = a read that is not a subread; superreads are the vertices of the read graph.
[Figure: example superreads and subreads with n mismatches, drawn from the reads ACTGGTCCCTCCTGAGTGT, GGTCCCTCCT, TGGTCACTCGTGAG, ACCTCATCGAAGCGGCGTCCT]
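
A minimal sketch of the subread/superread split, under stated assumptions: reads are `(start, sequence)` pairs already aligned to the reference, the start offsets below are invented for illustration (the slide does not give them), and reads are processed greedily from longest to shortest so that identical reads do not eliminate each other. The function names are hypothetical.

```python
def mismatches(a_start, a_seq, b_start, b_seq):
    """Count mismatches of read a against the part of read b it lies on."""
    offset = a_start - b_start
    return sum(1 for i, ch in enumerate(a_seq) if ch != b_seq[offset + i])

def is_subread(a, b, n):
    """True if a is completely contained in b with at most n mismatches."""
    (a_start, a_seq), (b_start, b_seq) = a, b
    contained = (b_start <= a_start and
                 a_start + len(a_seq) <= b_start + len(b_seq))
    return contained and mismatches(a_start, a_seq, b_start, b_seq) <= n

def split_reads(reads, n):
    """Greedy split: longest reads first; a read becomes a subread if an
    already accepted superread contains it."""
    superreads, subreads = [], []
    for read in sorted(reads, key=lambda r: -len(r[1])):
        if any(is_subread(read, s, n) for s in superreads):
            subreads.append(read)
        else:
            superreads.append(read)
    return superreads, subreads

# The four reads from the slide; start positions are assumed.
reads = [(0, "ACTGGTCCCTCCTGAGTGT"), (3, "GGTCCCTCCT"),
         (2, "TGGTCACTCGTGAG"), (10, "ACCTCATCGAAGCGGCGTCCT")]
superreads, subreads = split_reads(reads, n=2)
print(len(superreads), len(subreads))
```

With these assumed offsets, GGTCCCTCCT matches the first read exactly and TGGTCACTCGTGAG differs from it at two positions, so lowering n to 1 promotes the latter to a superread.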

Read Graph: Edges
- There is an edge between two vertices if their superreads overlap and agree on the overlap with ≤ m mismatches.
- Several paths may represent the same sequence.
- Transitive reduction
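
Edge construction followed by transitive reduction might look like the sketch below. It assumes superreads are `(start, sequence)` pairs in reference coordinates and that edges point from the superread that starts earlier to the one that extends further; both function names are hypothetical. Because every edge goes toward a later start position the graph is acyclic, so an edge is redundant exactly when its endpoint stays reachable without it.

```python
def overlap_mismatches(u, v):
    """Mismatches in the region where u's suffix overlaps v's prefix."""
    (us, useq), (vs, vseq) = u, v
    length = us + len(useq) - vs
    return sum(1 for i in range(length) if useq[vs - us + i] != vseq[i])

def build_edges(superreads, m):
    """Edge (i, j) if j starts inside i, extends past it, and the two
    agree on the overlap with at most m mismatches."""
    edges = set()
    for i, u in enumerate(superreads):
        for j, v in enumerate(superreads):
            u_end, v_end = u[0] + len(u[1]), v[0] + len(v[1])
            if u[0] < v[0] < u_end < v_end and overlap_mismatches(u, v) <= m:
                edges.add((i, j))
    return edges

def transitive_reduction(edges):
    """Drop every edge whose endpoint is reachable by another path."""
    succ = {}
    for a, b in edges:
        succ.setdefault(a, set()).add(b)

    def reachable(src, dst, skip):
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            for nxt in succ.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == dst:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(a, b) for a, b in edges if not reachable(a, b, skip=(a, b))}

ref = "ACGTACGTACGTACGTAC"
superreads = [(0, ref[0:10]), (4, ref[4:14]), (8, ref[8:18])]
edges = build_edges(superreads, m=0)
print(sorted(transitive_reduction(edges)))
```

Here the direct edge from the first to the third superread is removed because the path through the second one spells the same sequence.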

Edge Cost
- Cost measures the uncertainty that two superreads belong to the same quasispecies.
- Overhang Δ is the shift in start positions of two overlapping superreads.
- The cost depends on Δ, on the number of mismatches j in the overlap o, and on the 454 error rate ε.
[Equation: edge cost in terms of Δ, j, o, and ε]

Contig Assembly - Path to Sequence
The s-t max-bandwidth path per vertex (maximizing minimum edge cost)
1. Build a coarse sequence out of the path's superreads. For each position: the >70%-majority base if it exists, otherwise N.
2. Replace N's in the coarse sequence with the weighted consensus obtained on all reads.
3. Select unique sequences out of the constructed sequences. A sequence that is constructed repeatedly is evidence of a real quasispecies sequence.
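
Step 1 can be sketched as a per-column majority vote. This is an illustration under assumptions: the superreads on a path are given as `(start, sequence)` pairs in reference coordinates, columns with no coverage are emitted as N, and the function name `coarse_sequence` is hypothetical.

```python
from collections import Counter

def coarse_sequence(superreads, threshold=0.7):
    """Per position, emit the base held by more than `threshold` of the
    covering superreads; emit N when no such majority exists."""
    start = min(s for s, _ in superreads)
    end = max(s + len(seq) for s, seq in superreads)
    columns = [Counter() for _ in range(end - start)]
    for s, seq in superreads:
        for i, base in enumerate(seq):
            columns[s - start + i][base] += 1
    out = []
    for col in columns:
        total = sum(col.values())
        if total == 0:
            out.append('N')  # assumption: uncovered columns become N
            continue
        base, count = col.most_common(1)[0]
        out.append(base if count / total > threshold else 'N')
    return ''.join(out)

path = [(0, "ACGT"), (2, "GTAA"), (2, "CTAA")]
print(coarse_sequence(path))
```

At the third position two of three superreads agree, which is below the 70% bar, so that column is left as N for step 2 to resolve from all reads.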

Frequency Estimation – EM Algorithm
Bipartite graph between candidates and reads:
- Q: each q is a candidate with frequency f_q
- R: each r is a read with observed frequency o_r
- Weight h_{q,r} = probability that read r is produced by quasispecies q with j mismatches
E step: [equation]
M step: [equation]
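
The E-step and M-step formulas are not preserved in this transcript. As a hedged illustration, the sketch below uses the textbook EM updates for a mixture model, which may differ in detail from the slide: the E step splits each read among candidates in proportion to f_q · h_{q,r}, and the M step sets f_q to the candidate's share of the o_r-weighted read mass. The binomial form of `read_likelihood` is likewise an assumption, and both function names are hypothetical.

```python
from math import comb

def read_likelihood(length, j, eps):
    """Assumed form of h_{q,r}: probability of exactly j mismatches in a
    read of the given length under a per-base error rate eps."""
    return comb(length, j) * eps ** j * (1 - eps) ** (length - j)

def em_frequencies(h, observed, iterations=50):
    """h[q][r] is the weight h_{q,r}; observed[r] is o_r.
    Returns the estimated candidate frequencies f_q."""
    n_q, n_r = len(h), len(observed)
    f = [1.0 / n_q] * n_q
    for _ in range(iterations):
        # E step: posterior share of read r assigned to candidate q
        p = [[0.0] * n_r for _ in range(n_q)]
        for r in range(n_r):
            denom = sum(f[q] * h[q][r] for q in range(n_q))
            if denom == 0:
                continue  # read not explained by any candidate
            for q in range(n_q):
                p[q][r] = f[q] * h[q][r] / denom
        # M step: re-estimate frequencies from the weighted assignments
        weights = [sum(p[q][r] * observed[r] for r in range(n_r))
                   for q in range(n_q)]
        total = sum(weights)
        if total == 0:
            break
        f = [w / total for w in weights]
    return f

# Two candidates, two read types that each match exactly one candidate.
print(em_frequencies([[1.0, 0.0], [0.0, 1.0]], observed=[3, 1]))
```

When every read is compatible with only one candidate, the estimate reduces to simple read-count proportions; the iteration matters once reads are shared between similar candidates.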

User-Specified Parameters
1. Number of mismatches allowed to cluster reads around superreads. Usually a small integer in the range [0,6]. The lower the expected genomic diversity, the smaller the value should be. If reads are corrected by read correction software, it should be in the range [0,2].
2. Mutation-based range. Its value depends on the expected underlying genomic diversity. In general, the value varies over [80, 450]. If reads are corrected by read correction software, the value varies over the range [0,20].
Number of reconstructed quasispecies varies between … for M41 vaccine, and between … for M42 isolate

Reconstructed Quasispecies Variability
* IonSample42RL1.fas_KEC_corrected_I_2_20_CNTGS_DIST0_EM20.txt
Sequencing primer: ATGGTTTGTGGTTTAATTCACTTTC
122 clones of avg. length 500 bp sequenced using Sanger

M42 Sanger Clones NJ Tree

M42 ViSpA Quasispecies NJ Tree

M42 Sanger + ViSpA NJ Tree

M41 Vaccine Sanger Clones

Summary
- Viral Spectrum Assembler (ViSpA) tool
  - Error correction both pre-alignment (based on k-mers) and post-alignment (unique indels)
  - Quasispecies assembly based on maximum-bandwidth paths in weighted read graphs
  - Frequency estimation via EM on all reads
  - Freely available
- Currently under validation on IBV samples

Ongoing Work
- Correction for coverage bias
- Comparison of shotgun and amplicon-based reconstruction methods
- Quasispecies reconstruction from Ion Torrent reads
- Combining long and short read technologies
- Study of quasispecies persistence and evolution in layer flocks following administration of modified live IBV vaccine
- Optimization of vaccination strategies

Longitudinal Sampling Amplicon / shotgun sequencing

Acknowledgements
- University of Connecticut: Rachel O’Neill, Ph.D.; Mazhar Khan, Ph.D.; Hongjun Wang, Ph.D.; Craig Obergfell; Andrew Bligh
- Georgia State University: Alex Zelikovsky, Ph.D.; Bassam Tork; Serghei Mangul
- University of Maryland: Irina Astrovskaya, Ph.D.