Naotoshi Seo, Hiroshi Toyoizumi Performance Evaluation Laboratory

Presentation transcript:

Repeat finding by normal approximation on whole genome shotgun assembling Naotoshi Seo, Hiroshi Toyoizumi Performance Evaluation Laboratory University of Aizu

Abstract The purpose of this thesis is to verify that repeat finding by normal approximation is more effective than repeat finding by the traditional Poisson approximation. We first estimated the distribution of subsequence redundancy stochastically, and then verified the estimate with our simulator programs.

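What is whole genome shotgun assembling? DNA is too long to be read in a single pass. Therefore, the following procedure is required: copy the DNA, cut the copies with a restriction enzyme, scan (read) the resulting fragments, and assemble the fragments by their overlaps. For example, the fragments AGCTGTGGAG and TGGAGCTTGA share the overlap TGGAG, and shotgun assembling joins them into AGCTGTGGAGCTTGA.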
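The overlap step can be illustrated with a minimal sketch. It assumes error-free reads and joins two fragments on the longest suffix of one that is a prefix of the other; the function name `merge_by_overlap` and the `min_overlap` parameter are hypothetical and are not taken from the slides or from the authors' simulator.

```python
from typing import Optional


def merge_by_overlap(left: str, right: str, min_overlap: int = 3) -> Optional[str]:
    """Join two reads on the longest suffix of `left` that is a prefix of `right`.

    Returns None when no overlap of at least `min_overlap` bases exists.
    """
    max_len = min(len(left), len(right))
    # Try the longest candidate overlap first, then shrink.
    for k in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    return None


# The two fragments from the slide.
assembled = merge_by_overlap("AGCTGTGGAG", "TGGAGCTTGA")
print(assembled)
```

A real assembler must also handle read errors and many fragments at once; this sketch only shows the pairwise overlap idea.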

Repeat A repeat is a subsequence that occurs more than once, with the same arrangement of bases, in one genome sequence (for example, ATTGAC). A repeated subsequence must not be used for overlap detection, because its original location cannot be determined. Methods for finding repeats are therefore needed.

How to find repeats If the # of copies is 3, the redundancy of a subsequence (the number of times it is observed) is ordinarily 3. If the genome contains another subsequence with the same arrangement, in short a repeat, the redundancy becomes 6. In practice these numbers are smaller (4 in the slide's example) because the DNA is fragmented, and a cut inside the subsequence prevents it from being counted.
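The counting idea can be sketched as follows, for the idealized case in which the copies are not cut at all. The toy genome and the helper names `count_occurrences` and `redundancy` are made up for illustration.

```python
def count_occurrences(word: str, text: str) -> int:
    """Count occurrences of `word` in `text`, allowing overlapping matches."""
    count = 0
    start = text.find(word)
    while start != -1:
        count += 1
        start = text.find(word, start + 1)
    return count


def redundancy(word: str, reads: list) -> int:
    """Total number of times `word` is observed across all reads."""
    return sum(count_occurrences(word, read) for read in reads)


# Hypothetical toy genome in which ATTGAC appears twice (a repeat).
genome = "ATTGAC" + "GGCT" + "ATTGAC"
reads = [genome] * 3  # 3 uncut copies

print(redundancy("GACGGC", reads))  # unique word
print(redundancy("ATTGAC", reads))  # repeated word
```

With uncut copies the unique word is seen once per copy and the repeated word twice per copy. Once the copies are fragmented, both counts drop, which is why a statistical model of the redundancy is needed.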

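Estimation of the redundancy of subsequences with same arrangement Parameters: Pc, the cut probability; w, the word (subsequence) length; Pm, the probability of misreading a fragment; n, the # of copies. From these we obtain the probability that a word of length w is not cut at all, and from that the probability that a copy yields the complete subsequence. The redundancy of a subsequence then follows a binomial distribution.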
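The slide's formulas did not survive in this transcript, so the following is only a sketch under stated assumptions: cuts are independent at each of the w positions, a copy contributes the complete word with probability p = (1 - Pm)(1 - Pc)^w, and the redundancy over n copies is Binomial(n, p). The exact exponent and the function names are assumptions, not the authors' definitions.

```python
from math import comb


def complete_word_probability(p_cut: float, p_miss: float, word_len: int) -> float:
    """ASSUMED model: word survives uncut at every position and is read correctly."""
    p_not_cut = (1.0 - p_cut) ** word_len
    return (1.0 - p_miss) * p_not_cut


def redundancy_pmf(n_copies: int, p: float) -> list:
    """Binomial(n_copies, p) probabilities for redundancy 0..n_copies."""
    return [
        comb(n_copies, k) * p**k * (1.0 - p) ** (n_copies - k)
        for k in range(n_copies + 1)
    ]


# Word length 100 and cut probability 1/500 as in the experiment slide;
# p_miss = 0 is chosen here purely for illustration.
p = complete_word_probability(p_cut=1 / 500, p_miss=0.0, word_len=100)
pmf = redundancy_pmf(10, p)
print(round(p, 3), [round(x, 3) for x in pmf])
```

Under these assumptions p comes out close to 0.8 for a word length of 100 and a cut probability of 1/500, which is the regime (n about 10, p about 0.8) that the later slides discuss.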

Comparing with our simulator [Figure: estimated distribution and simulator result] The simulator result agrees with the estimate, which suggests that the estimation is correct.

Approximation by another distribution The redundancy distribution is binomial. A binomial distribution takes a long time to calculate, so it is better to approximate it by another distribution. Traditionally, it is approximated by a Poisson distribution.

A problem with the traditional approximation and a remedy The Poisson approximation is valid when n is sufficiently large and p is small. It does not hold in this case, because n is small (about 10) and p is large (about 0.8). However, a binomial distribution can be approximated not only by a Poisson distribution but also by a normal distribution.

Comparing a Poisson distribution and a normal distribution [Figure: binomial distribution plotted against its Poisson and normal approximations] The binomial distribution resembles the normal distribution more closely than the Poisson distribution.
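The comparison can be checked numerically. The sketch below uses n = 10 and p = 0.8, the values quoted in the slides, and measures each approximation by the summed absolute difference from the binomial probabilities at k = 0..n. That error measure is a choice made here for illustration, not one stated in the slides.

```python
from math import comb, exp, factorial, pi, sqrt


def binomial_pmf(k: int, n: int, p: float) -> float:
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    return exp(-lam) * lam**k / factorial(k)


def normal_pdf(x: float, mu: float, var: float) -> float:
    return exp(-((x - mu) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)


def approximation_errors(n: int, p: float):
    """Summed absolute error of the Poisson and normal approximations at k = 0..n."""
    mean = n * p
    var = n * p * (1.0 - p)
    poisson_err = sum(abs(binomial_pmf(k, n, p) - poisson_pmf(k, mean)) for k in range(n + 1))
    normal_err = sum(abs(binomial_pmf(k, n, p) - normal_pdf(k, mean, var)) for k in range(n + 1))
    return poisson_err, normal_err


poisson_err, normal_err = approximation_errors(10, 0.8)
print(f"Poisson error: {poisson_err:.3f}, normal error: {normal_err:.3f}")
```

The Poisson distribution forces the variance to equal the mean np, whereas the binomial variance np(1 - p) is much smaller when p is large. The normal approximation matches both the mean and the variance, so it follows the binomial shape more closely.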

Necessary copy number for repeat finding The right peak is the distribution of redundancies for a subsequence with a double repeat. The left peak is the distribution for a normal subsequence with no repeat. When the # of copies is small, the two peaks overlap, so there is a large probability of misjudging whether a subsequence is a repeat or not. We assumed that the error is acceptable if its ratio is less than 0.05. [Figure labels: big error, subtle error]

Necessary copy number for each distribution With a good threshold, the error falls below 0.05 at n = 4 for the binomial distribution, at n = 5 for the normal approximation, and at n = 28 for the Poisson approximation. In this case, the normal approximation needs only about 1/6 as many copies as the traditional Poisson approximation.
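A search for the necessary copy number can be sketched for the exact binomial case. The slides do not define the error ratio precisely, so this sketch makes several assumptions: the no-repeat redundancy is Binomial(n, p), the double-repeat redundancy is Binomial(2n, p), a count at or below the threshold t is judged "no repeat", and the error is the sum of the two misjudgment probabilities, minimized over t. All function names are hypothetical.

```python
from math import comb
from typing import Optional


def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    k = min(k, n)
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k + 1))


def best_error(n: int, p: float) -> float:
    """Smallest total misjudgment probability over all integer thresholds t."""
    return min(
        (1.0 - binom_cdf(t, n, p))  # no-repeat word judged as repeat
        + binom_cdf(t, 2 * n, p)    # double-repeat word judged as no repeat
        for t in range(2 * n + 1)
    )


def min_copies(p: float, max_error: float = 0.05, limit: int = 200) -> Optional[int]:
    """Smallest n whose best-threshold error is below max_error, or None."""
    for n in range(1, limit + 1):
        if best_error(n, p) < max_error:
            return n
    return None


# ASSUMED p: word length 100, cut probability 1/500, no misreading.
print(min_copies(0.998**100))
```

The same search could be repeated with the normal or Poisson approximation in place of `binom_cdf` to compare the copy numbers each one requires.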

The effective threshold value for repeat finding The effective threshold value is the x-coordinate of the intersection of the no-repeat distribution's curve and the double-repeat distribution's curve.
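Under the normal approximation, this intersection has a closed form, because equating two normal log-densities gives a quadratic in x. The sketch below assumes the no-repeat curve is N(np, np(1 - p)) and the double-repeat curve is N(2np, 2np(1 - p)); the second assumption and the function names are illustrative and are not confirmed by the slides.

```python
from math import log, sqrt


def normal_intersection(mu1: float, var1: float, mu2: float, var2: float) -> float:
    """x between mu1 and mu2 at which the two normal densities are equal."""
    if var1 == var2:
        return (mu1 + mu2) / 2.0
    # Equal log-densities give a*x^2 + b*x + c = 0.
    a = 1.0 / (2.0 * var1) - 1.0 / (2.0 * var2)
    b = mu2 / var2 - mu1 / var1
    c = mu1**2 / (2.0 * var1) - mu2**2 / (2.0 * var2) - 0.5 * log(var2 / var1)
    disc = b * b - 4.0 * a * c
    if disc < 0:
        raise ValueError("the curves do not intersect")
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    for root in ((-b + sqrt(disc)) / (2.0 * a), (-b - sqrt(disc)) / (2.0 * a)):
        if lo <= root <= hi:
            return root
    raise ValueError("no intersection between the two means")


def repeat_threshold(n_copies: int, p: float) -> float:
    """Threshold separating no-repeat from double-repeat redundancy (ASSUMED model)."""
    mu = n_copies * p
    var = n_copies * p * (1.0 - p)
    return normal_intersection(mu, var, 2.0 * mu, 2.0 * var)


print(repeat_threshold(5, 0.998**100))
```

With the doubled mean and variance, the quadratic reduces to x = sqrt(2μ² + 2σ² ln 2), where μ = np and σ² = np(1 - p), so the threshold lies somewhat below the midpoint of the two means.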

Experimental proof Word length = 100. Columns: cut probability Pc, misreading probability Pm, # of copies n, threshold value, error ratio. Rows (values as extracted): 1/500, 5, 5.8, 0.0; 1/5, 12, 11.4, 0.0196; 1/200, 15, 13.1, 0.0035. We assumed that an error ratio of less than 0.05 is acceptable. Every error ratio in the table is below 0.05, which shows that repeat finding by normal approximation works well.

Conclusion The normal approximation reduces the # of copies needed compared with the traditional Poisson approximation. In the experiments, repeat finding with this small number of copies achieved good results. Therefore, we verified that repeat finding by normal approximation is more effective than repeat finding by the traditional Poisson approximation.