Distributional Property Estimation Past, Present, and Future Gregory Valiant (Joint work w. Paul Valiant)

Presentation transcript:

Distributional Property Estimation
Given a property of interest, and access to independent draws from a fixed distribution D, how many draws are necessary to estimate the property accurately?
We will focus on symmetric properties.
Definition: Let Δ be the set of distributions over {1,2,…,n}. A property π : Δ → R is symmetric if it is invariant to relabeling the support: for every permutation σ of the support, π(D) = π(D∘σ).
e.g. entropy, support size, distance to uniformity, etc. For properties of pairs of distributions: distance metrics, etc.

Symmetric Properties
'Histogram' of a distribution: given a distribution D, the histogram h_D : (0,1] → N is defined by h_D(x) := # domain elements of D that occur with probability x.
e.g. Unif[n] has h(1/n) = n, and h(x) = 0 for all x ≠ 1/n.
Fact: any "symmetric" property is a function of the histogram h alone.
e.g. H(D) = −Σ_{x : h(x)≠0} h(x)·x·log x,   Support(D) = Σ_{x : h(x)≠0} h(x).
'Fingerprint' of a set of samples [aka profile, collision statistics]: f = f_1, f_2, …, f_k, where f_i := # domain elements seen exactly i times in the sample.
Fact: for estimating symmetric properties, the fingerprint contains all the useful information.
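
To make these definitions concrete, here is a minimal sketch (in Python, not from the talk; function names are illustrative) that computes the fingerprint of a sample and the plug-in (empirical) entropy from it:

```python
from collections import Counter
from math import log

def fingerprint(sample):
    """f[i] = number of domain elements seen exactly i times in the sample."""
    counts = Counter(sample)            # element -> number of occurrences
    return Counter(counts.values())     # multiplicity -> number of elements

def empirical_entropy(sample):
    """Plug-in estimate: entropy of the empirical distribution, in nats."""
    k = len(sample)
    f = fingerprint(sample)
    # An element seen i times gets empirical probability i/k; there are f[i] such elements.
    return -sum(f_i * (i / k) * log(i / k) for i, f_i in f.items())

sample = ['a', 'b', 'a', 'c', 'c']
print(fingerprint(sample))        # Counter({2: 2, 1: 1}): two elements seen twice, one seen once
print(empirical_entropy(sample))  # ~1.055 nats
```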

The Empirical Estimate
[Figure: the fingerprint of a sample, e.g. roughly 120 domain elements seen once, 75 seen twice, …. The empirical estimate plugs the empirical probabilities 1/k, 2/k, 3/k, … into the property: for entropy, H(D) = −Σ_{x:h(x)≠0} h(x)·x·log x, each element seen i times contributes −(i/k)·log(i/k).]
Better estimates? Apply some function z other than log to the empirical distribution, i.e. evaluate z(1/k), z(2/k), z(3/k), … instead.

Linear Estimators
Most estimators considered over the past 100+ years are "linear estimators": the output is a linear combination of the fingerprint entries,
Output: c_1·f_1 + c_2·f_2 + c_3·f_3 + …
What richness of algorithmic machinery is necessary to effectively estimate these properties?
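
For instance, the plug-in entropy estimate from the previous slide is itself a linear estimator: its coefficients are c_i = −(i/k)·log(i/k). A minimal sketch (illustrative, not from the talk):

```python
from math import log

def linear_estimate(f, coeffs):
    """Generic linear estimator: a fixed linear combination of fingerprint entries."""
    return sum(coeffs[i] * f_i for i, f_i in f.items())

def plugin_entropy_coeffs(k):
    """Coefficients that recover the plug-in entropy estimate from the fingerprint."""
    return {i: -(i / k) * log(i / k) for i in range(1, k + 1)}

# fingerprint of the sample ['a','b','a','c','c'] (size k = 5)
f = {1: 1, 2: 2}
print(linear_estimate(f, plugin_entropy_coeffs(5)))  # same ~1.055 nats as before
```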

Searching for Better Estimators / Finding the Best Estimator
Find coefficients z = (z_1, z_2, …) minimizing ε,
s.t. for all distributions p over [n]: |E_p[Σ_i z_i f_i] − H(p)| ≤ ε
("the expectation of estimator z applied to k samples from p is within ε of H(p)").
This constraint controls the bias; the variance of such estimators can be bounded separately.

Surprising Theorem [VV'11]
Thm (roughly): Given parameters n, k, ε, and a linear symmetric property π, either there is a linear estimator that estimates π to within error O(ε) using O(k) samples, or no estimator of any form can estimate π to within error ε using k samples.
In other words, for such properties linear estimators are near-optimal.

Proof Idea: Duality!!
Find estimator z: minimize ε, s.t. for all distributions p over [n], |E_p[Σ_i z_i f_i] − H(p)| ≤ ε
("the expectation of estimator z applied to k samples from p is within ε of H(p)").
Find lower-bound instance y⁺, y⁻: maximize H(y⁺) − H(y⁻), s.t. y⁺, y⁻ are distributions over [n] and for all i, |E[f_i⁺] − E[f_i⁻]| ≤ k^{1−c}
("the expected fingerprint entries of k samples from y⁺ and y⁻ match to within k^{1−c}").
The two programs are, essentially, linear-programming duals of each other.
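
Both programs involve the expected fingerprint of k samples from a distribution. A minimal sketch of that computation, using the standard Poissonization approximation E[f_i] ≈ Σ_x Pr[Poisson(k·p_x) = i] (illustrative, not from the talk):

```python
from math import exp, factorial

def poisson_pmf(lam, i):
    return exp(-lam) * lam**i / factorial(i)

def expected_fingerprint(p, k, max_i=20):
    """Approximate E[f_i] for i = 1..max_i when k samples are drawn from distribution p.

    Under Poissonization, each domain element x is seen Poisson(k * p[x]) times
    independently, so E[f_i] = sum_x Pr[Poisson(k * p[x]) = i].
    """
    return {i: sum(poisson_pmf(k * px, i) for px in p) for i in range(1, max_i + 1)}

uniform = [1 / 100] * 100
print(expected_fingerprint(uniform, k=50, max_i=3))
# With k/n = 0.5, most elements are unseen; E[f_1] ≈ 100 * 0.5 * e^-0.5 ≈ 30.3
```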

So… do these estimators work in practice? Not especially well, which is perhaps unsurprising: these estimators are defined via worst-case instances. Next part of the talk: a more robust approach.

Estimating the Unseen
Given independent samples from a distribution (of discrete support):
- The empirical distribution optimally approximates the seen portion of the distribution.
- What can we infer about the unseen portion?
- How can inferences about the unseen portion yield better estimates of distribution properties?

Some concrete problems
Q1: Given a length-n vector (e.g. a b a c c), how many indices must we look at to estimate the # of distinct elements to within ±εn (w.h.p.)? [distinct elements problem]
Q2: Given a sample from D supported on {1,…,n}, how large a sample is required to estimate entropy(D) to within ±ε (w.h.p.)?
Q3: Given samples from D1 and D2 supported on {1,…,n}, what sample size is required to estimate Dist(D1, D2) to within ±ε (w.h.p.)?
All three are trivially solvable with O(n log n) samples. Previous sublinear bounds: distinct elements [Bar-Yossef et al. '01], [P. Valiant '08], [Raskhodnikova et al. '09]; entropy [Batu et al. '01,'02], [Paninski '03,'04], [Dasgupta et al. '05]; distance [Goldreich et al. '96], [Batu et al. '00,'01].
Answer: Θ(n / log n) samples suffice, for all three problems [VV'11/'13].

Fisher’s Butterflies / Turing’s Enigma Codewords
Fisher: how many new butterfly species would I see if I observed for another period? Answer, in terms of the "fingerprint" of the samples: f_1 − f_2 + f_3 − f_4 + f_5 − …
Turing: what is the probability mass of the unseen Enigma codewords? Good–Turing estimate: f_1 / (number of samples).
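
A minimal sketch of these two classical estimators, computed from the fingerprint (illustrative; the alternating series is the Good–Toulmin-style estimate for one additional observation period of equal length):

```python
def unseen_mass_good_turing(f, k):
    """Good-Turing estimate of the total probability mass of unseen elements."""
    return f.get(1, 0) / k

def expected_new_species(f):
    """Alternating-series estimate of # new species seen in one more observation period."""
    return sum(((-1) ** (i + 1)) * f_i for i, f_i in sorted(f.items()))

f = {1: 120, 2: 75, 3: 40}     # e.g. 120 elements seen once, 75 twice, 40 three times
k = 120 * 1 + 75 * 2 + 40 * 3  # sample size
print(unseen_mass_good_turing(f, k))  # ~0.31
print(expected_new_species(f))        # 120 - 75 + 40 = 85
```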

Reasoning Beyond the Empirical Distribution
[Figures: the fingerprint based on a sample of size k, and the fingerprint based on a sample of size 10000]

Linear Programming
“Find distributions whose expected fingerprint is close to the observed fingerprint of the sample.”
[Figure: the feasible region of this program]
Must show the diameter of the feasible region is small!!
Entropy, distinct elements, other properties: Θ(n / log n) samples, and this is OPTIMAL.

Linear Programming (revisited)
“Find distributions (histograms) whose expected fingerprint is close to the observed fingerprint of the sample.”
Thm: For sufficiently large n and any constant c > 1, given c·n/log n independent draws from D, of support at most n, with probability > 1 − exp(−Ω(n)) our algorithm returns a histogram h’ such that R(h_D, h’) < O(1/c^{1/2}). Additionally, our algorithm runs in time linear in the number of samples.
R(h, h’): relative Wasserstein metric [sup over functions f s.t. |f’(x)| < 1/x, …].
Corollary: For any ε > 0, given O(n / (ε² log n)) draws from a distribution D of support at most n, with probability > 1 − exp(−Ω(n)) our algorithm returns v s.t. |v − H(D)| < ε.
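
A heavily simplified, illustrative sketch of the linear-programming idea on this slide (not the paper's actual algorithm): discretize the possible probability values into a grid, and find nonnegative counts of domain elements at each grid probability whose expected fingerprint (under the Poissonization approximation) matches the observed fingerprint as closely as possible; the frequently-seen elements and the remaining probability mass, which the real algorithm handles separately, are ignored here. Names and the grid are assumptions.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.stats import poisson

def unseen_lp(observed_f, k, grid):
    """Toy version of the 'estimating the unseen' LP.

    Variables: h[j] = hypothesized number of domain elements with probability grid[j].
    Objective: minimize the total mismatch between the expected fingerprint of k samples
    from the hypothesized histogram and the observed fingerprint.
    """
    m, I = len(grid), max(observed_f)
    F = np.array([observed_f.get(i, 0) for i in range(1, I + 1)], dtype=float)
    # P[i-1, j] = Pr[Poisson(k * grid[j]) = i]: expected contribution of one element at grid[j] to f_i
    P = np.array([[poisson.pmf(i, k * x) for x in grid] for i in range(1, I + 1)])

    # variables: h (m entries) followed by slacks s (I entries); minimize sum of slacks
    c = np.concatenate([np.zeros(m), np.ones(I)])
    A_ub, b_ub = [], []
    for i in range(I):
        A_ub.append(np.concatenate([P[i], -np.eye(I)[i]]));  b_ub.append(F[i])    #   P h - F_i <= s_i
        A_ub.append(np.concatenate([-P[i], -np.eye(I)[i]])); b_ub.append(-F[i])   # -(P h - F_i) <= s_i
    A_ub.append(np.concatenate([np.array(grid), np.zeros(I)])); b_ub.append(1.0)  # total mass <= 1
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=(0, None), method="highs")
    return dict(zip(grid, res.x[:m]))  # recovered histogram: probability value -> # elements

# usage: the fingerprint of 500 samples from Unif[1000] looks roughly like {1: 300, 2: 75}
h = unseen_lp({1: 300, 2: 75}, k=500, grid=[i / 5000 for i in range(1, 50)])
entropy_estimate = -sum(cnt * x * np.log(x) for x, cnt in h.items() if cnt > 0)
print(entropy_estimate)  # close to log(1000) ~ 6.9 nats, despite seeing only ~375 of 1000 elements
```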

So…do the estimators work in practice? YES!!

Performance in Practice (entropy)
[Plots] Zipf: power-law distribution, p_j ∝ 1/j (or 1/j^c).
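
A minimal sketch of the kind of experiment behind these plots (illustrative, not the talk's actual benchmark): draw k samples from a Zipf distribution p_j ∝ 1/j with k much smaller than the support size, and note how badly the plug-in estimate undershoots the true entropy.

```python
import numpy as np

n, k = 10_000, 1_000
p = 1.0 / np.arange(1, n + 1)
p /= p.sum()                                   # Zipf: p_j proportional to 1/j
true_entropy = -np.sum(p * np.log(p))

sample = np.random.choice(n, size=k, p=p)
counts = np.bincount(sample)
emp = counts[counts > 0] / k
plugin_entropy = -np.sum(emp * np.log(emp))    # plug-in estimate misses the unseen mass

print(true_entropy, plugin_entropy)
```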

Performance in Practice (entropy)

Performance in Practice (support size) Task: Pick a (short) passage from Hamlet, then estimate # distinct words in Hamlet

The Big Picture [Estimating Symmetric Properties]
“Linear estimators”: output c_1·f_1 + c_2·f_2 + c_3·f_3 + …
“Estimating the unseen”: linear programming; substantially more robust.
Both are optimal (to constant factors) in the worst case, but the “unseen” approach seems better for most inputs (it does not require knowledge of an upper bound on the support size, is not defined via worst-case inputs, …).
Can one prove something beyond the “worst-case” setting?

Back to Basics
Hypothesis testing for distributions: given ε > 0, an explicit distribution P = (p_1, p_2, …), and samples from an unknown distribution Q, decide: P = Q versus ||P − Q||_1 > ε.

Prior Work
Data needed to answer “Is it P, or > ε-far from P?” for a distribution over [n]:
- Pearson’s chi-squared test: > n
- P = uniform distribution: Goldreich–Ron O(n^{1/2}/ε⁴); Paninski O(n^{1/2}/ε²)
- Arbitrary known P: Batu et al. O(n^{1/2} polylog n / ε⁴)
- The right bound for arbitrary P: ???

Theorem: Instance Optimal Testing [VV’14]
There is a fixed function f(P, ε) and constants c, c’ such that:
- Our tester can distinguish Q = P from ||Q − P||_1 > ε using f(P, ε) samples (w. prob > 2/3).
- No tester can distinguish Q = P from ||Q − P||_1 > c·ε using c’·f(P, ε) samples (w. prob > 2/3).
f(P, ε) = max( 1/ε , ||P^{−ε}_{−max}||_{2/3} / ε² ), where P^{−ε}_{−max} denotes P with its largest element removed and its smallest elements (of total mass ε) removed.
If P is supported on ≤ n elements, ||P||_{2/3} ≤ n^{1/2}, recovering the worst-case O(n^{1/2}/ε²) bound.
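
A minimal sketch of computing the sample-complexity function f(P, ε) as reconstructed above (illustrative; the removal of the lightest ε mass is implemented greedily from the tail):

```python
import numpy as np

def f_instance(P, eps):
    """max(1/eps, ||P_{-max}^{-eps}||_{2/3} / eps^2), per the statement on this slide."""
    q = np.sort(np.asarray(P, dtype=float))[::-1]
    q = q[1:]                                   # remove the single largest element
    tail_mass = np.cumsum(q[::-1])              # cumulative mass of the lightest elements
    q = q[::-1][tail_mass > eps][::-1]          # drop the lightest elements of total mass <= eps
    norm_23 = np.sum(q ** (2 / 3)) ** (3 / 2)   # the 2/3-(quasi)norm
    return max(1 / eps, norm_23 / eps ** 2)

P_uniform = np.ones(10_000) / 10_000
print(f_instance(P_uniform, 0.1))   # on the order of sqrt(n)/eps^2 for the uniform distribution
```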

The Algorithm (intuition)
Given P = (p_1, p_2, …) and Poi(k) samples from Q, let X_i = # times the i-th element occurs.
Pearson’s chi-squared test: Σ_i [ (X_i − k·p_i)² − k·p_i ] / p_i
Our test: Σ_i [ (X_i − k·p_i)² − X_i ] / p_i^{2/3}
Replacing “k·p_i” with “X_i” does not significantly change the expectation, but reduces the variance for elements seen once. Normalizing by p_i^{2/3} makes us more tolerant of errors in the light elements…
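
A minimal sketch of the two statistics (illustrative; turning the statistic into an actual tester requires the thresholds and analysis from the paper):

```python
import numpy as np

def chi_squared_stat(X, p, k):
    """Centered chi-squared-style statistic: sum_i ((X_i - k p_i)^2 - k p_i) / p_i."""
    return np.sum(((X - k * p) ** 2 - k * p) / p)

def vv_identity_stat(X, p, k):
    """Statistic from this slide: sum_i ((X_i - k p_i)^2 - X_i) / p_i^(2/3)."""
    return np.sum(((X - k * p) ** 2 - X) / p ** (2 / 3))

# Poissonized sampling: draw X_i ~ Poisson(k * q_i) independently for each element
p = np.ones(1000) / 1000           # the known distribution P (uniform, for illustration)
q = p.copy()                        # replace with a Q far from P to see the statistic grow
k = 2000
X = np.random.poisson(k * q)
print(vv_identity_stat(X, p, k))   # each term has expectation 0 when Q = P
```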

Future Directions
- Instance-optimal property estimation/learning in other settings. This is harder than identity testing, where we leveraged knowledge of P to build the tester. It still might be possible, and if so, it is likely to yield a rich theory and algorithms that work extremely well in practice.
- We still don’t really understand many basic property estimation questions, and lack good algorithms (even/especially in practice!).
- Many tools and anecdotes, but the big picture is still hazy.