

Presented by Samuel Chapman

Pyrosequencing-Intro The core idea behind pyrosequencing is that it exploits complementary DNA extension on a single strand. A run can produce about 400,000 reads of around 250 bp each. The process takes half a day and costs only several thousand dollars.

Pyrosequencing-Intro To sequence bacterial communities, the reads are often generated by using known, conserved flanking regions as primers for a homologous region. The middle of the region is where the variation among the population lies. PCR is used to amplify the desired region: the number of copies of each sequence increases, but the proportions for each species stay the same.

Pyrosequencing-Intro These regions are homologous, but only the conserved primer regions are identical; the middle areas can differ. These regions will be our sequencing reads.

Pyrosequencing-Methods Each separate DNA sample is put onto a bead. PCR is then performed, so that each bead carries copies of only one kind of sample. Each bead is put into one of hundreds of thousands of separate wells, so that each well has a distinct sample (although two wells may have identical samples). The DNA on the beads is single-stranded, and the primer is attached, allowing for extension. Enzymes and chemicals are added so that light is released every time a new base is incorporated.

Pyrosequencing-Methods

Bases are added to the sequences by covering the well plate with one nucleotide, washing it away, doing the same with the other three nucleotides, and then starting over. Ex: A..T..C..G | A..T..C..G | A..T..C..G, where ‘..’ represents washing and ‘|’ denotes a new cycle. NOTICE: if a sequence has two or more of the same base in a row, all of them are added in one step. When more than one base is added at once, that well emits more light.
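The flow cycle above can be sketched in a few lines of Python. This is a minimal illustration, not the instrument's software: the function name `ideal_flowgram` and the example read `GATTC` are made up for this sketch, and the output is the noise-free number of bases added at each nucleotide step.

```python
def ideal_flowgram(sequence, flow_order="ATCG"):
    """Noise-free count of bases incorporated at each nucleotide flow.

    Hypothetical helper for illustration; flows continue in full cycles
    until the whole sequence has been extended.
    """
    unknown = set(sequence) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    flows = []
    pos = 0
    while pos < len(sequence):
        for base in flow_order:  # one cycle: each nucleotide is flowed once
            run = 0
            # A homopolymer run is incorporated entirely in a single flow.
            while pos < len(sequence) and sequence[pos] == base:
                run += 1
                pos += 1
            flows.append(run)
    return flows


# Two cycles of A, T, C, G are needed to extend the read GATTC.
print(ideal_flowgram("GATTC"))  # [0, 0, 0, 1, 1, 2, 1, 0]
```

The value 2 in the second cycle is the "TT" run: both bases are added during one T flow, which is why that flow would emit roughly twice the light.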

Pyrosequencing-Methods Each well can be monitored for the amount of light it emits at each nucleotide step, which indicates how long the “homopolymer” is. The sequence of emissions is called a flowgram. Naively, an intensity of 0 means a homopolymer of length 0, an intensity of 1 a homopolymer of length 1, an intensity of 2 a homopolymer of length 2, and so on. HOWEVER, the intensity for a given length follows a distribution, so naive rounding can lead to errors such as insertions and deletions.

Example from paper Consider a known sequence, ACTGGGG. The order of nucleotide addition is T..A..C..G. The intensities “should” be 0, 1, 1, 0 | 1, 0, 0, 4. The observed flowgram was 0.18, 1.03, 1.02, 0.70 | 1.12, 0.07, 0.14, 4.65. Rounding suggests the sequence ACGTGGGGG, because 0.70 and 4.65 round up to 1 and 5. Therefore, it is better to use distributions to predict the sequence more accurately.
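The naive rounding in this example can be reproduced with a short sketch. The function name `call_bases` is hypothetical, and the last intensity (4.65) is taken from the rounding remark on the slide.

```python
def call_bases(flowgram, flow_order="TACG"):
    """Naive base calling: round each intensity to a homopolymer length.

    Hypothetical helper for illustration only; it ignores the intensity
    distributions, which is exactly the weakness the slide points out.
    """
    calls = []
    for i, intensity in enumerate(flowgram):
        base = flow_order[i % len(flow_order)]    # nucleotide flowed at step i
        length = max(0, int(intensity + 0.5))     # round to nearest integer
        calls.append(base * length)
    return "".join(calls)


observed = [0.18, 1.03, 1.02, 0.70, 1.12, 0.07, 0.14, 4.65]
print(call_bases(observed))  # ACGTGGGGG, not the true ACTGGGG
```

Rounding turns the 0.70 signal into a spurious G and the 4.65 signal into five Gs instead of four, giving two insertion errors in one short read.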

Intensity distribution created using known sequences (from paper)

Dealing with the noisy data Using the intensity data, a “distance” measure was defined, which reflected the probability that each flowgram represented a particular sequence. These distances were used in a mixture model, and an iterative expectation-maximization algorithm was employed to gradually bring the flowgrams into agreement with the “true” data. Artifacts such as PCR chimeras were dealt with using the Mallard algorithm.
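One way to picture such a distance is as an average negative log-probability of the observed intensities given the homopolymer lengths a candidate sequence would produce. The sketch below is an assumption-laden stand-in: it uses a Gaussian with a fixed standard deviation, whereas the paper built its intensity distributions from known sequences. The names `flow_distance` and `sd` are hypothetical.

```python
import math


def log_gaussian(x, mean, sd):
    """Log density of a normal distribution (stand-in for the empirical one)."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mean) ** 2 / (2 * sd * sd)


def flow_distance(flowgram, ideal, sd=0.15):
    """Average negative log-probability of a flowgram given ideal flow lengths.

    Assumption: every intensity is Gaussian around its homopolymer length.
    A smaller value means the flowgram is a more plausible reading of the sequence.
    """
    if len(flowgram) != len(ideal):
        raise ValueError("flowgram and ideal signal must have the same length")
    total = sum(-log_gaussian(x, n, sd) for x, n in zip(flowgram, ideal))
    return total / len(flowgram)


observed = [0.1, 1.0, 2.1, 0.0]
print(flow_distance(observed, [0, 1, 2, 0]) < flow_distance(observed, [0, 1, 1, 0]))  # True
```

With realistic, length-dependent distributions the ranking of candidate sequences can differ from what simple rounding suggests, which is the point of using distributions at all.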

Flowgram preclustering Assumption: the likelihood of the flowgrams is represented by a mixture model. Each sequence is a different component of the mixture and has its own probability. σ is the cluster size of flowgrams around a sequence; f_i is an observed flowgram, which has a density about each sequence; S_j is a particular sequence.

Flowgram preclustering The likelihood is defined for the dataset, D, of N flowgrams indexed by i, where τ_j is each sequence’s relative frequency.
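A mixture likelihood consistent with these definitions has the standard form below. The symbol F, standing for the density of flowgram f_i about sequence S_j with cluster size σ, is notation assumed here for illustration; the paper's exact expression may differ.

```latex
L(D) = \prod_{i=1}^{N} \sum_{j} \tau_j \, F(f_i \mid S_j, \sigma),
\qquad \sum_{j} \tau_j = 1
```

Each flowgram contributes one factor, and each factor sums over all candidate sequences, weighted by how common each sequence is.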

Preclustering analogy The flowgrams are clustered, with the size of each cluster, σ, being 5 flowgrams. We guess that each cluster represents one sequence. This is just an analogy, because the mixture is not two-dimensional like this.

Expectation maximization Assume a matrix Z, with rows representing flowgrams and columns representing sequences: z_i,j = δ_j,m(i), where m(i) is the sequence that generated flowgram i, so z_i,j is 1 if sequence j generated flowgram i and 0 otherwise. The complete data likelihood, L_C, is written in terms of Z.
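For a mixture model of this kind, the complete data likelihood usually takes the form below. It is a sketch under the assumption that F denotes the density of flowgram f_i about sequence S_j (notation introduced here, not taken from the slide).

```latex
L_C = \prod_{i=1}^{N} \prod_{j} \left[ \tau_j \, F(f_i \mid S_j, \sigma) \right]^{z_{i,j}}
```

Because z_{i,j} is 1 only for the sequence that generated flowgram i, each flowgram contributes exactly one term to the product.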

Expectation maximization Define z’_i,j as the expected value of z_i,j given the model parameters. Expectation step: calculate z’_i,j given the current model parameters. Maximization step: calculate new parameters such that L_C is maximized according to z’_i,j. Stop when the improvement between steps falls below a cutoff, c.

Expectation maximization analogy Choose a beginning sequence (red square) in each cluster. There are many such clusters in the model. The black circles are flowgrams in the cluster. Expectation: calculate how likely it is that each of these flowgrams was generated by the sequence. Maximization: calculate a new sequence that is closer to the “real” sequence based on the flowgrams. You can see here that the sequence moves to a position that is more likely given the flowgrams. In the paper, the aggregate distance is calculated for all sequences.

Expectation maximization E step: calculating new z’_i,j. M step: calculating new relative frequencies, τ_j, and then new sequences.
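The two steps can be sketched for a fixed set of candidate sequences. This is a simplified illustration, not the paper's implementation: it assumes an exponential density exp(-d/σ)/σ over precomputed flowgram-to-sequence distances, it only updates the relative frequencies τ_j, and it leaves out the sequence update. The names `em_mixture`, `dist`, and `cutoff` are hypothetical.

```python
import math


def em_mixture(dist, sigma=1.0, cutoff=1e-6, max_iter=1000):
    """EM for mixture weights, given dist[i][j] = distance of flowgram i to sequence j.

    Assumption: density of a flowgram about a sequence is exp(-d / sigma) / sigma.
    Returns (tau, z) where z[i][j] is the expected membership z'_i,j.
    """
    n, k = len(dist), len(dist[0])
    dens = [[math.exp(-d / sigma) / sigma for d in row] for row in dist]
    tau = [1.0 / k] * k          # start with equal relative frequencies
    prev_loglik = -math.inf
    z = []
    for _ in range(max_iter):
        # E step: expected membership of each flowgram in each sequence.
        z = []
        loglik = 0.0
        for row in dens:
            weighted = [t * f for t, f in zip(tau, row)]
            total = sum(weighted)
            loglik += math.log(total)
            z.append([w / total for w in weighted])
        # M step: new relative frequencies from the expected memberships.
        tau = [sum(z[i][j] for i in range(n)) / n for j in range(k)]
        # Stop when the log-likelihood improvement falls below the cutoff.
        if loglik - prev_loglik < cutoff:
            break
        prev_loglik = loglik
    return tau, z


# Three flowgrams sit close to sequence 0, one sits close to sequence 1.
dist = [[0.1, 5.0], [0.2, 4.0], [0.3, 6.0], [5.0, 0.1]]
tau, z = em_mixture(dist)
print([round(t, 2) for t in tau])
```

In a fuller version, the M step would also replace each sequence with the candidate that minimizes the membership-weighted distance to its flowgrams, which is the "moving red square" in the analogy.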

A visual example of the process

Testing the algorithm The algorithm was tested on pyrosequenced 16S rRNA from 90 known microbial clones. After sequencing, the samples were grouped phylogenetically into operational taxonomic units (OTUs), and the accuracy was assessed against the known clones. The sequence difference threshold for creating a separate OTU had to be larger than the noise (see next slide).

OTU assignment The assignment of OTUs depends on the threshold of difference required for a separate OTU. A higher threshold results in fewer OTUs, because species become clustered together. A threshold that is below the noise level could result in the same species being split into two different OTUs.
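The effect of the threshold can be seen on a toy distance matrix. The sketch below uses a plain-Python average-linkage clustering written for this illustration (the function name `average_linkage_otus` and all distances are invented): reads 0 and 1 are noisy copies of one species, read 2 is a close relative, and read 3 is distant.

```python
def average_linkage_otus(dist, threshold):
    """Merge clusters by average linkage until the closest pair exceeds threshold."""
    clusters = [[i] for i in range(len(dist))]

    def average_distance(a, b):
        # Mean pairwise distance between members of two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break                      # remaining clusters are separate OTUs
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy pairwise sequence differences (fractions); values are made up.
dist = [
    [0.00, 0.01, 0.05, 0.20],
    [0.01, 0.00, 0.05, 0.20],
    [0.05, 0.05, 0.00, 0.20],
    [0.20, 0.20, 0.20, 0.00],
]
for threshold in (0.005, 0.03, 0.10):
    print(threshold, len(average_linkage_otus(dist, threshold)))
# 0.005 4
# 0.03 3
# 0.1 2
```

At 0.005 the threshold sits below the 0.01 noise between reads 0 and 1, so one species is split into two OTUs; at 0.10 the close relative is absorbed into the same OTU.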

Results

Take-home message The noise reduction algorithm employed by this paper resulted in more accurate sequence assignment. Average linkage clustering is better at handling noise.

Questions?

Acknowledgments Pyrosequence pic: e/JEB001370F2.jpeg