
Theoretical Analysis

Objective Our algorithm uses a hashing technique called random projection. In these slides, we show how many non-motif pairs will collide during the hashing process when a user wants to find the motif with high probability. We assume there are many potential windows and exactly 1 motif (a pair of very similar images) among these windows.

Basic Notation A motif is a pair of similar images; say, the difference is less than 10%. A non-motif is a pair of any images which are not the motif; usually the difference is not small. The window size S x by S y is user-defined. Let N = S x * S y.

Assumptions / Notations
- The user defines the window size N (N = S x * S y ).
- There is only 1 motif in the dataset.
- The user wants the motif to collide in the same bucket with confidence conf (conf ≥ 90%).
- The distance of the motif, i.e., the number of differing black pixels, is small.
- The distance of every pair of images except the motif follows a distribution with mean µ and stdev σ.
- Images have the same number of black pixels. (Not required.)
- The distance distribution is a normal distribution. (Not required.)
- Black pixels inside the windows are uniformly distributed for every window except the motif.

Other Notations The other notations used in our analysis are the following: s is the masking ratio, i.e., the fraction of black pixels removed in the hashing process. t is the number of iterations, i.e., how many times we perform random projection. d is the distance between a pair of images.
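The hashing process these notations describe can be sketched roughly as follows. This is an illustrative reconstruction, not the authors' code: we assume a window is represented as a set of black-pixel positions, and that each iteration masks out a random fraction s of the N positions and buckets windows by their surviving black pixels (the slides speak of removing black pixels; masking positions is a simplification we assume here, and the function names are ours).

```python
import random
from collections import defaultdict

def random_projection_buckets(windows, n, s, t, seed=0):
    """One run of the hashing process: in each of t iterations, randomly
    mask a fraction s of the n pixel positions and bucket each window by
    its surviving black pixels.

    windows: dict mapping window id -> frozenset of black-pixel positions
    Returns a list of t bucket tables (signature -> list of window ids)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(t):
        # Keep a random (1 - s) fraction of the n positions.
        keep = frozenset(rng.sample(range(n), round((1 - s) * n)))
        buckets = defaultdict(list)
        for wid, black in windows.items():
            signature = black & keep  # masked view of the window
            buckets[signature].append(wid)
        tables.append(dict(buckets))
    return tables

def collided(tables, a, b):
    """True if windows a and b share a bucket in at least one iteration."""
    return any(a in ids and b in ids
               for buckets in tables for ids in buckets.values())
```

Two windows collide in an iteration exactly when they agree on every unmasked position, so the motif's few differing pixels are likely to be masked away in at least one of the t iterations.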

*Theoretical Statement* If the motif collides with confidence at least conf, any other non-motif pair has a chance to collide of at most 1-(1-Q)^t, where Q is the per-iteration collision probability of a non-motif pair. Notations: the distance distribution of the non-motif pairs has mean µ and stdev σ. The two hidden parameters are the masking ratio s and the number of iterations t. Note that the parameter conf is used to find the best s or t (by fixing the other).
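The bound itself is easy to evaluate once Q is known; the closed form of Q (equation (**)) was an image on the slide and is not reproduced in this transcript, so it is left as a parameter in this small sketch (function name ours):

```python
def false_positive_bound(q, t):
    """Upper bound 1 - (1 - Q)^t on the probability that a given
    non-motif pair collides in at least one of t iterations, where
    q is the per-iteration collision probability of that pair."""
    return 1.0 - (1.0 - q) ** t

# The bound is monotone in both arguments: more iterations or a larger
# per-iteration collision probability can only increase false positives,
# and it never exceeds the union bound t * q.
```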

Detail: How confident will the motif collide? (1)

Detail: How confident will the motif collide? (2) Next, we will use the parameters s and t to find the probability that any non-motif pair of windows accidentally collides (a false positive).

Detail: How about non-motif? (1)

Detail: How about non-motif? (2)

Detail: How about non-motif? (3)

Detail: How about non-motif? (4) Note that this step is loose.

Detail: Conclusion The theoretical statement is proved. User-defined parameters: - conf: the confidence with which the motif will collide. - N: size of the windows (width*height). - the distribution of the distance of non-motif windows (µ and σ). A hidden parameter: - either s, the masking ratio, or t, the number of iterations. After this proof: - we can easily modify our algorithm to set the parameters automatically by trying the best value of one parameter. It can guarantee the number of false positives and find the motif with high probability.
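The automatic parameter setting can be sketched as follows. Assuming the per-iteration collision probability p of the motif has already been computed from s and the motif distance (via the equations on the earlier slides, which are not transcribed here), the smallest t reaching confidence conf solves 1-(1-p)^t ≥ conf; the function name is ours:

```python
import math

def min_iterations(p_motif, conf):
    """Smallest number of iterations t such that the motif collides at
    least once with probability >= conf, given its per-iteration
    collision probability p_motif.  Solves 1 - (1 - p)^t >= conf."""
    if not 0.0 < p_motif < 1.0:
        raise ValueError("p_motif must be in (0, 1)")
    return math.ceil(math.log(1.0 - conf) / math.log(1.0 - p_motif))
```

For example, a motif with per-iteration collision probability 0.1 needs 44 iterations to reach 99% confidence; fixing t instead and searching for the best s works symmetrically.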

The probability of the collision of non-motif [Figure: the probability that a non-motif pair collides, plotted against the masking ratio (left) and against the number of iterations (right), with one curve for each motif distance = 1, 2, 4, 8.] Settings: the confidence to find the correct motif is fixed at 99%; the window size is fixed at 400, i.e., 20x20; the distance distribution is fixed at µ=100 and σ=10. When varying the masking ratio, for each ratio we use the best number of iterations giving a 99% chance to find the motif; when varying the number of iterations, for each number we use the best masking ratio. The minima are 6.4% and 6.6%. Plotted using the closed form from equation (**).

The probability of the collision of non-motif [Figure: the same two plots, against the masking ratio and against the number of iterations, with one curve for each motif distance = 1, 2, 4, 8.] Same settings as the previous slide: confidence fixed at 99%, window size fixed at 400 (20x20), µ=100 and σ=10, optimizing the other parameter in each plot. A better result is obtained by using the summation form from equation (*): the minimum is 0.3% (at d=1).

More Explanation In a real book, each page contains only a small number of images (usually fewer than 10 on average). Some pages may contain a hundred images, but many contain only 1-2 images or none. Hence, there are not many potential windows in the book; for example, a 100-page book may have 1,000 potential windows or fewer, i.e., about 1,000*999/2 ≈ 500,000 candidate pairs. From our theoretical upper bound using the closed form, false positives occur at a rate of about 6.4%, i.e., 500,000*0.064 = 32,000 pairs. If we use the summation-form upper bound, the false-positive rate is 0.3%, i.e., 500,000*0.003 = 1,500 pairs. The result is not great, but it is reasonable for finding a motif in a 100-page book. Note that we use only a few assumptions to derive the upper bound.
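The counts above follow from simple arithmetic; this sketch just reproduces them (1,000 windows yield roughly half a million pairs, which the slides round to 500,000):

```python
windows = 1000
pairs = windows * (windows - 1) // 2   # 499,500 candidate pairs, ~500,000

# Expected false positives under each upper bound from the plots:
fp_closed_form = 500_000 * 0.064       # closed form (**): 32,000 pairs
fp_summation = 500_000 * 0.003         # summation form (*): 1,500 pairs
```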

Further Analysis Possible Solution 1: Make the upper bound tighter by tightening equation (**). Possible Solution 2: Use more assumptions, such as a normal (Gaussian) distance distribution. – For the previous case, we set µ=100 and σ=10 with motif distance d < 10; there is only a very small chance that any non-motif distance is less than 10 (more than 9σ below µ). Possible Solution 3: Use a stronger bound such as Chernoff's inequality, but this also requires more assumptions.
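The remark under Solution 2 can be made precise even without the normality assumption: Cantelli's one-sided inequality states P(X ≤ µ - kσ) ≤ 1/(1 + k²) using only the mean and standard deviation. A small sketch (function name ours):

```python
def cantelli_lower_tail(mu, sigma, x):
    """Cantelli's one-sided bound on P(X <= x) for x < mu, using only
    the mean and standard deviation of the distribution of X."""
    if x >= mu:
        raise ValueError("bound applies only below the mean")
    k = (mu - x) / sigma  # number of standard deviations below the mean
    return 1.0 / (1.0 + k * k)

# With mu=100, sigma=10, x=10 we get k=9 and a bound of 1/82 (about 1.2%).
bound = cantelli_lower_tail(100, 10, 10)
```

So even distribution-free, a non-motif distance below 10 has probability at most about 1.2%; under the additional normal assumption of Solution 2 it would be astronomically smaller.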