So Much Data, So Little Time. Bernard Chazelle, Princeton University
So Many Slides, So Little Time (before lunch)
computation math experimentation algorithms
Computers have two problems
1. They don’t have steering wheels
2. End of Moore’s Law: party’s over!
computation algorithms experimentation
32 x = 544 This is not me
FFT RSA
noisy low entropy uncertain unevenly priced big
Biomedical imaging Sloan Digital Sky Survey 4 petabytes (~1MG) 10 petabytes/yr 150 petabytes/yr
Collected works of Micha Sharir My A(9,9)-th paper
Sublinear Algorithms: massive input, output. Sample tiny fraction
Shortest Paths [C-Liu-Magen ’03] New York Delphi
Ray Shooting Volume Intersection Point location
Approximate MST [C-Rubinfeld-Trevisan ’01]
Reduces to counting connected components
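One way to see the reduction: for a connected graph on n vertices with integer edge weights in {1, ..., w}, the MST weight equals n - w + c_1 + ... + c_(w-1), where c_i is the number of connected components of the subgraph that keeps only edges of weight at most i. The sketch below checks that identity against Kruskal's algorithm on a small example. It uses exact component counts (union-find); a sublinear algorithm would substitute sampled estimates for them. Function names and the example graph are illustrative, not from the slides.

```python
def find(parent, u):
    """Union-find root lookup with path halving."""
    while parent[u] != u:
        parent[u] = parent[parent[u]]
        u = parent[u]
    return u


def count_components(n, edges):
    """Exact number of connected components of a graph on vertices 0..n-1."""
    parent = list(range(n))
    components = n
    for u, v in edges:
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            components -= 1
    return components


def mst_weight_via_components(n, weighted_edges, w):
    """MST weight from component counts: n - w + sum of c_i for i = 1..w-1."""
    total = n - w
    for i in range(1, w):
        light = [(u, v) for u, v, wt in weighted_edges if wt <= i]
        total += count_components(n, light)
    return total


def mst_weight_kruskal(n, weighted_edges):
    """Reference MST weight via Kruskal's algorithm."""
    parent = list(range(n))
    total = 0
    for u, v, wt in sorted(weighted_edges, key=lambda e: e[2]):
        ru, rv = find(parent, u), find(parent, v)
        if ru != rv:
            parent[ru] = rv
            total += wt
    return total


# Illustrative connected graph: (u, v, weight), weights in {1, 2, 3}.
EDGES = [(0, 1, 1), (1, 2, 2), (2, 3, 3), (0, 3, 1), (0, 2, 3)]
print(mst_weight_via_components(4, EDGES, 3), mst_weight_kruskal(4, EDGES))
```

Both functions agree on this graph; the identity assumes the graph is connected and the weights are integers between 1 and w.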
E[estimate] = no. connected components; var[estimate] << (no. connected components)^2; so whp the estimate is a good estimator of # connected components
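A minimal sketch of a sampling estimator of this kind, assuming the graph is given as adjacency lists. It relies on the fact that summing 1/(size of u's component) over all vertices u gives exactly the number of components, so n times the average of that quantity over random vertices is an unbiased estimate. To keep the work per sample bounded, the search is cut off at a size cap and oversized components contribute 0; since there are at most n/cap such components, the cap biases the estimate downward by at most n/cap. The names `cap` and `num_samples` are illustrative parameters, not values from the slides.

```python
import random
from collections import deque


def component_size_capped(adj, start, cap):
    """BFS from start; return the component size, or None if it exceeds cap."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                if len(seen) > cap:
                    return None  # component too large: stop exploring
                queue.append(v)
    return len(seen)


def estimate_components(adj, num_samples, cap, rng):
    """Estimate the number of connected components from sampled vertices."""
    n = len(adj)
    total = 0.0
    for _ in range(num_samples):
        u = rng.randrange(n)
        size = component_size_capped(adj, u, cap)
        if size is not None:
            total += 1.0 / size  # each vertex carries 1/size of its component
    return n * total / num_samples


# Illustrative input: 100 disjoint edges, i.e. 200 vertices in 100 components.
adj = [[i ^ 1] for i in range(200)]
print(estimate_components(adj, num_samples=50, cap=10, rng=random.Random(0)))
```

On this input every sampled vertex contributes exactly 1/2, so the estimate is the true count regardless of which vertices are drawn; on general graphs the estimate fluctuates with the sample.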
worst case input space average case (uniform)
worst case
average case = actuarial view
“ OK, if you elect NOT to have the surgery, the insurance company offers 6 days and 7 nights in Barbados. “
Self-Improving Algorithms: arbitrary, unknown random source
Yes ! This could be YOU, too !
time T1, time T2, time T3, time T4, ... E[Tk] → optimal expected time for random source
Clustering [ Ailon-C-Liu-Comandur ’05 ] K-median over Hamming cube
minimize sum of distances
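As a small illustration of the objective, the sketch below computes the k-median cost of a set of points in the Hamming cube: each point pays its Hamming distance to the nearest center. It also includes the k = 1 case, where the coordinate-wise majority vote is an optimal center. Points are represented as bit strings, and centers are assumed to be arbitrary points of the cube; both are choices made for this example, not taken from the slides.

```python
def hamming(a, b):
    """Number of coordinates in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))


def k_median_cost(points, centers):
    """Sum over points of the Hamming distance to the nearest center."""
    return sum(min(hamming(p, c) for c in centers) for p in points)


def majority_center(points):
    """Optimal single center (k = 1): coordinate-wise majority bit."""
    d = len(points[0])
    bits = []
    for j in range(d):
        ones = sum(p[j] == "1" for p in points)
        bits.append("1" if 2 * ones > len(points) else "0")
    return "".join(bits)


points = ["0000", "0001", "1111", "1110"]
print(k_median_cost(points, ["0000", "1111"]))  # two clusters, cost 2
print(majority_center(["0000", "0001", "0011"]))
```

For k = 1 the coordinates decouple, which is why a per-coordinate majority suffices; for larger k the assignment of points to centers is what makes the problem hard.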
[ Kumar-Sabharwal-Sen ’04 ] COST ≤ (1 + ε) OPT
How to achieve linear limiting time? Input space {0,1}^{dn} prob < O(dn)/KSS Identify core Tail: Use KSS
Store sample of precomputed KSS Nearest neighbor Incremental algorithm
Main difficulty: How to spot the tail?
encode
decode
Data inaccessible before noise What makes you think it’s wrong?
Data inaccessible before noise: must satisfy some property (e.g., convex, bipartite) but does not quite
f(x) = ? x f(x) data f = access function
f(x) = ? x f(x) But life being what it is…
Humans Define distance from any object to data class
f(x) = ? x g(x); filter: x1, x2, … → f(x1), f(x2), …; g is access function for:
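To make the filter picture concrete, here is a deliberately naive stand-in, assuming the target property is "nondecreasing" on the domain 0..n-1. The client asks for g(x); the filter answers by looking up values of f and returning the running maximum, so every answer is consistent with one monotone function, and g equals f whenever f is already monotone. This is not the filter from the cited work: it spends a linear number of lookups rather than polylog(n), and a single early outlier can force it to change many values. The class name and lookup counter are illustrative.

```python
class PrefixMaxFilter:
    """Naive online filter: g(x) = max of f(0), ..., f(x)."""

    def __init__(self, f):
        self.f = f          # access function for the noisy data
        self.lookups = 0    # number of calls made to f so far
        self.known = []     # known[i] = max(f(0), ..., f(i))

    def g(self, x):
        """Answer a query; earlier answers are never contradicted."""
        while len(self.known) <= x:
            i = len(self.known)
            self.lookups += 1
            value = self.f(i)
            if self.known and self.known[-1] > value:
                value = self.known[-1]  # repair a dip by raising it
            self.known.append(value)
        return self.known[x]


data = [1, 2, 9, 3, 4]  # monotone except for the outlier 9
filt = PrefixMaxFilter(lambda i: data[i])
print([filt.g(x) for x in range(len(data))], filt.lookups)
```

On this input the outlier 9 drags the last two answers up to 9, although changing the single value 9 would already make the data monotone. Avoiding that kind of blow-up with few lookups is the hard part of the problem.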
Online Data Reconstruction
Monotone function: [n]^d → R Filter requires polylog(n) lookups [ Ailon-C-Liu-Comandur ’04 ]
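For the one-dimensional case, the distance from a data sequence to the class of monotone (nondecreasing) functions can be taken as the minimum number of values that must change. That number equals the sequence length minus the length of a longest nondecreasing subsequence. The sketch below computes it offline in O(n log n) time; it reads the whole input, so it only illustrates the quantity a filter's output is measured against, not a sublinear method. The function name is illustrative.

```python
from bisect import bisect_right


def distance_to_monotone(values):
    """Minimum number of entries to change so the sequence is nondecreasing."""
    # tails[k] = smallest possible last value of a nondecreasing
    # subsequence of length k + 1 seen so far.
    tails = []
    for v in values:
        pos = bisect_right(tails, v)
        if pos == len(tails):
            tails.append(v)   # extends the longest subsequence found so far
        else:
            tails[pos] = v    # same length, smaller last value
    return len(values) - len(tails)


print(distance_to_monotone([1, 2, 9, 3, 4]))  # only the 9 has to change
```

Using `bisect_right` rather than `bisect_left` is what allows repeated values to count as nondecreasing.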
Convex polygon Filter requires : lookups [C-Comandur ’06 ]
Convex terrain Filter requires : lookups
Iterated planar separator theorem
Iterated (weak) planar separator theorem in sublinear time!
Using epsilon-nets in spaces of unbounded VC dimension reconstruct
bipartite graph k-connectivity expander
denoising low-dim attractor sets
Priced computation & accuracy spectrometry/cloning/gene chip PCR/hybridization/chromatography gel electrophoresis/blotting Linear programming
Pricing data Factoring is easy. Here’s why… Gaussian mixture sample: ….
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu, Avner Magen, Ronitt Rubinfeld, Luca Trevisan