Population sequencing using short reads: HIV as a case study. Vladimir Jojic et al. PSB 13:114-125 (2008). Presenter: Yong Li.



Overview
Background
–Population sequencing & metagenomics
–Pyrosequencing & classical sequencing
The problem and the challenge
–Low concentration; short reads; sequencing errors
The model
–Sequence & frequency → reads
The EM algorithm
Validation

Background
Population sequencing & metagenomics
–Multiple strains vs. multiple species
–HIV drug resistance from rare variants
Pyrosequencing vs. classical (chromatogram-based) sequencing
–Ultra-deep sequencing, 454 sequencing
–Short reads; high error rate; homopolymers
–Sensitivity: variants detectable at 0.1% (pyrosequencing) vs. 20% (classical)
To clone or not to clone?
–Two protocols to detect mutational variants
–Cloning bias; stoichiometry

Clonal amplification (Wang et al., Genome Res. 17, 2007)

Wang et al., Genome Res. 17, 2007

The computational problem
Given:
–454 sequencing reads
Get:
–The reconstructed population sequences (epitome)
–An estimate of their relative quantities
Statistical model

The statistical model (1)
–Indel frequency
–Sequencing error parameter

The statistical model (2)
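
The equations on these two slides did not survive in the transcript. As a rough orientation only, a mixture model of the kind the overview describes (sequence & frequency → reads) generally has the form below. Here e_s is the sequence of strain s and x_t is read t, as on the later slides; the frequency symbol \pi_s and the per-read term p(x_t | e_s) are assumptions introduced for illustration. The paper's actual model also has to account for where each read falls and for indels, which this sketch leaves inside p(x_t | e_s).

```latex
\[
p(x_1,\dots,x_T \mid e, \pi)
  \;=\; \prod_{t=1}^{T} \sum_{s=1}^{S} \pi_s \, p(x_t \mid e_s),
\qquad \pi_s \ge 0,\quad \sum_{s=1}^{S} \pi_s = 1 .
\]
```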

The hidden variable:
Model parameters:
Observed variable: , t = 1…T
EM algorithm?
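
To make the EM idea concrete, here is a deliberately simplified sketch, not the method from the paper. It assumes the strain sequences are already known, that every read is pre-aligned and spans the same region as the strains, and that the only errors are substitutions at rate `eps`. Under those assumptions the hidden variable is which strain each read came from, and EM only has to estimate the strain frequencies. The function names and the toy data are hypothetical.

```python
def read_likelihood(read, strain, eps):
    """P(read | strain) under independent per-base substitution errors."""
    mismatches = sum(a != b for a, b in zip(read, strain))
    matches = len(read) - mismatches
    return (1 - eps) ** matches * (eps / 3) ** mismatches


def em_frequencies(reads, strains, eps=0.01, n_iter=100):
    """Estimate strain frequencies by EM, holding the strain sequences fixed."""
    n_strains = len(strains)
    freqs = [1.0 / n_strains] * n_strains
    # The likelihoods do not depend on the frequencies, so compute them once.
    lik = [[read_likelihood(r, s, eps) for s in strains] for r in reads]
    for _ in range(n_iter):
        counts = [0.0] * n_strains
        for row in lik:
            # E-step: posterior probability that each strain produced this read.
            weights = [f * l for f, l in zip(freqs, row)]
            total = sum(weights)
            for s in range(n_strains):
                counts[s] += weights[s] / total
        # M-step: new frequencies are the normalized expected read counts.
        freqs = [c / len(reads) for c in counts]
    return freqs


# Toy usage: two made-up strains differing at three positions, mixed 80/20.
strains = ["ACGTACGTAC", "ACCTAGGTTC"]
reads = [strains[0]] * 8 + [strains[1]] * 2
estimated = em_frequencies(reads, strains)
```

The full problem on the slides is harder because the strain sequences themselves are unknown and reads are short, unaligned, and contain indels; this sketch covers only the frequency-estimation step.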

Computational tricks
One tau
Clustering of reads
Initialization
Determining the number of strains: S
–Trials

Validation
Data are partially simulated
–e is composed of real HIV variants
–Artificial values for
–x generated from the same probabilistic model, with 1% substitution, 2% insertion, 0.5% deletion
Two datasets
–1. Varied strain frequencies and coverage
–2. Varied mutation density
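
The simulation setup can be sketched roughly as follows. This is an illustration under assumptions, not the authors' generator: the error rates are the ones quoted on the slide, but the read-sampling scheme (a uniformly placed fixed-length window), the function names, and the toy strains are all made up here. The slide's validation used real HIV variants for e.

```python
import random


def simulate_read(strain, read_len, rng, p_sub=0.01, p_ins=0.02, p_del=0.005):
    """Copy a random window of `strain`, adding per-base sequencing errors."""
    start = rng.randrange(len(strain) - read_len + 1)
    out = []
    for base in strain[start:start + read_len]:
        if rng.random() < p_ins:      # insertion: extra random base before this one
            out.append(rng.choice("ACGT"))
        r = rng.random()
        if r < p_del:                 # deletion: skip this base
            continue
        if r < p_del + p_sub:         # substitution: emit a different base
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def simulate_population(strains, freqs, n_reads, read_len, seed=0):
    """Draw reads from a mixture of strains; returns (strain_index, read) pairs."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        s = rng.choices(range(len(strains)), weights=freqs)[0]
        reads.append((s, simulate_read(strains[s], read_len, rng)))
    return reads


# Toy usage with made-up strains: one dominant strain and one rare variant.
toy_strains = ["ACGT" * 30, "TTGCA" * 24]
toy_reads = simulate_population(toy_strains, [0.9, 0.1], n_reads=200, read_len=50)
```

Keeping the true strain index with each read makes it possible to check a reconstruction method against known frequencies, which is the point of a partially simulated dataset.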

Discussion
High sensitivity compared with the classical chromatogram-based approach
–0.1% relative abundance
May be applied to metagenomic sequencing
Needs validation using real data
Needs comparison with other methods

Questions?