Massive data: Sloan Digital Sky Survey (4 petabytes, ~10 petabytes/yr); biomedical imaging (~150 petabytes/yr)
Massive input, small output: sublinear algorithms sample only a tiny fraction of the input.
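To make the sampling paradigm concrete, here is a minimal sketch (Python, hypothetical names): estimating the fraction of marked items in a massive 0/1 array while reading only a constant number of entries.

```python
import random

def estimate_fraction(data, num_samples=1000):
    """Estimate the fraction of 1s in a massive 0/1 sequence by sampling.

    Only num_samples entries are read, regardless of len(data) -- the
    hallmark of a sublinear algorithm. Standard Chernoff bounds give an
    additive error of O(1/sqrt(num_samples)) with high probability.
    """
    hits = sum(data[random.randrange(len(data))] for _ in range(num_samples))
    return hits / num_samples
```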
Approximate MST [CRT ’01]
Reduces to counting connected components
E[estimator] = no. of connected components;  var[estimator] < (no. of connected components)²
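A hedged sketch of this kind of estimator, not the exact [CRT ’01] procedure: it uses the identity that summing 1/|component(u)| over all vertices u counts the components exactly, and truncates each BFS so the work per sample is independent of the graph size. Vertex numbering 0..n−1 and the parameter choices are assumptions for illustration.

```python
import random
from collections import deque

def estimate_components(adj, eps=0.1, num_samples=500):
    """Estimate the number of connected components in sublinear time.

    adj: adjacency lists, vertices numbered 0..n-1.
    Since sum over all u of 1/|component(u)| equals the number of
    components, averaging 1/min(|component(u)|, cap) over random
    vertices u gives an estimate with O(eps * n) additive error.
    """
    n = len(adj)
    cap = int(2 / eps) + 1           # truncation threshold
    total = 0.0
    for _ in range(num_samples):
        u = random.randrange(n)
        seen = {u}                   # truncated BFS from u
        queue = deque([u])
        while queue and len(seen) < cap:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
                    if len(seen) >= cap:
                        break
        total += 1.0 / len(seen)     # = 1 / min(|component(u)|, cap)
    return n * total / num_samples
```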
Shortest Paths [CLM ’03]
Ray shooting, volume intersection, point location [CLM ’03]
Low-entropy data: Takens embeddings, Markov models (speech)
Self-Improving Algorithms (arbitrary, unknown random source): sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering
Self-Improving Algorithms (arbitrary, unknown random source):
1. Run the algorithm tuned for the best worst-case behavior, or the best behavior under the uniform distribution or some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.
[Figure: successive running times T1, T2, T3, … decrease with repeated use, settling at E[Tk] = optimal expected time for the random source]
Sorting (x1, x2, …, xn): each xi drawn independently from a distribution Di; H = entropy of the rank distribution; limiting expected time O(n + H).
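A minimal sketch of the two-phase idea for sorting (Python, hypothetical names). The real construction uses a per-coordinate optimal search tree over learned bucket boundaries to reach the O(n + H) limiting time; the most-likely-bucket probing below is a simplification of that step.

```python
import bisect
from collections import defaultdict

class SelfImprovingSorter:
    """Sketch of a self-improving sorter: train() must run before sort()."""

    def __init__(self, n):
        self.n = n
        self.boundaries = None                              # bucket boundaries
        self.freq = defaultdict(lambda: defaultdict(int))   # freq[i][bucket]

    def train(self, inputs):
        """Learning phase: bucket boundaries from pooled samples, plus the
        empirical bucket distribution of each coordinate x_i."""
        pooled = sorted(x for inp in inputs for x in inp)
        step = max(1, len(pooled) // self.n)
        self.boundaries = pooled[step::step]                # ~n buckets
        for inp in inputs:
            for i, x in enumerate(inp):
                self.freq[i][bisect.bisect_left(self.boundaries, x)] += 1

    def sort(self, inp):
        """Steady state: route each x_i to its bucket, trying the few most
        likely buckets first; then sort buckets and concatenate."""
        buckets = defaultdict(list)
        for i, x in enumerate(inp):
            b = None
            likely = sorted(self.freq[i].items(), key=lambda kv: -kv[1])[:3]
            for cand, _ in likely:
                lo = self.boundaries[cand - 1] if cand > 0 else float('-inf')
                hi = (self.boundaries[cand]
                      if cand < len(self.boundaries) else float('inf'))
                if lo < x <= hi:                            # matches bisect_left
                    b = cand
                    break
            if b is None:                                   # rare: full search
                b = bisect.bisect_left(self.boundaries, x)
            buckets[b].append(x)
        out = []
        for b in sorted(buckets):
            out.extend(sorted(buckets[b]))
        return out
```

In the steady state, when each Di is concentrated, most elements land in one of their top learned buckets in O(1) probes, which is where the entropy savings come from.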
Clustering: k-median (k = 2)
Minimize the sum of distances over the Hamming cube {0,1}^d [KSS]
How to achieve linear limiting expected time? Input space {0,1}^{dn}. Identify a core of the input space; on the tail (probability < O(dn) / cost of KSS), just run KSS.
On the core: store a sample of precomputed KSS solutions; solve a new input by finding its nearest neighbor among the stored ones and repairing with an incremental algorithm, so input vicinity becomes algorithmic vicinity (cf. NP vs. P, where nearby inputs need not have nearby solutions). A sketch follows.
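A hedged sketch of this core/tail dispatch for k-median on the Hamming cube (Python; `exact_kmedian`, the cache contents, and the threshold are placeholders). The incremental repair is a Lloyd-style alternation, using the fact that the coordinate-wise majority point is the exact 1-median under Hamming distance.

```python
def hamming(a, b):
    """Hamming distance between two 0/1 tuples."""
    return sum(u != v for u, v in zip(a, b))

def majority_point(points):
    """Coordinatewise majority vote: the exact 1-median of `points`
    under Hamming distance (the coordinates decouple)."""
    return tuple(int(2 * sum(p[j] for p in points) > len(points))
                 for j in range(len(points[0])))

def local_improve(points, centers, rounds=5):
    """Incremental repair: Lloyd-style alternation for k-median on the
    Hamming cube, warm-started from a nearby precomputed solution."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda c: hamming(p, centers[c]))
            clusters[i].append(p)
        centers = [majority_point(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers

def cluster(x, cache, exact_kmedian, tail_threshold):
    """Core/tail dispatch. `cache` holds (input, solution) pairs solved by
    the expensive exact algorithm (KSS's role in the talk) during the
    learning phase. Core: warm-start from the nearest cached input.
    Tail: pay for the exact algorithm; its low probability keeps the
    expected cost linear."""
    dist = lambda a, b: sum(hamming(p, q) for p, q in zip(a, b))
    inp, sol = min(cache, key=lambda pair: dist(x, pair[0]))
    if dist(x, inp) <= tail_threshold:
        return local_improve(x, sol)
    return exact_kmedian(x)
```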
Main difficulty: How to spot the tail?
Two settings: 1. Data is accessible before noise. 2. Or it's not.
1. Data is accessible before noise
Pipeline: encode → (noise) → decode
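When the data can be processed before the noise arrives, classical coding does the job. A toy repetition code (illustrative only, not a construction from any cited work) makes the encode/decode pipeline concrete:

```python
def encode(bits, r=3):
    """Add redundancy before the noise: repeat each bit r times."""
    return [b for b in bits for _ in range(r)]

def decode(noisy, r=3):
    """Majority vote within each block of r copies recovers the data
    whenever fewer than half the copies in a block were flipped."""
    return [int(2 * sum(noisy[i:i + r]) > r)
            for i in range(0, len(noisy), r)]
```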
Data inaccessible before noise: assumptions are necessary!
Data inaccessible before noise — assumed structure:
1. Sorted sequence
2. Bipartite graph, expander
3. Solid with angular constraints
4. Low-dimensional attractor set
Data inaccessible before noise: the data must satisfy some property P, but does not quite.
Access model: f = access function to the data; a query x returns f(x). But life being what it is, the stored data — and hence f — is noisy.
Define a distance from any object to the data class (e.g., the class "humans").
Filter: on query x, return g(x) after probing f(x1), f(x2), …; g is the access function for a function that satisfies the property and is close to f.
Similar to self-correction [RS96, BLR ’93], except: about data, not functions; error-free; allows O(distance to property).
Monotone functions [n]^d → R: the filter requires polylog(n) queries per lookup.
Offline reconstruction
Online reconstruction
monotone function
Frequency of a point x: defined via the smallest interval I containing x with > |I|/2 violations involving f(x); the frequency is zero if no such interval exists.
Given x:
1. Estimate its frequency.
2. If nonzero, find the "smallest" interval around x with both endpoints having zero frequency.
3. Interpolate between f(endpoints) — see the sketch below.
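A hedged sketch of the three steps in Python, for the one-dimensional case f : [n] → R. The sample sizes, the dyadic scan, and the linear search for zero-frequency endpoints are simplifications; the actual filter uses doubling searches and Chernoff-calibrated sampling to stay within polylog(n) queries per lookup.

```python
import random

def violation_fraction(f, x, lo, hi, samples=50):
    """Estimate the fraction of y in [lo, hi] violating monotonicity
    with x: y < x with f(y) > f(x), or y > x with f(y) < f(x)."""
    bad = 0
    for _ in range(samples):
        y = random.randint(lo, hi)
        if (y < x and f(y) > f(x)) or (y > x and f(y) < f(x)):
            bad += 1
    return bad / samples

def frequency_is_zero(f, x, n, samples=50):
    """Sampled proxy for 'x has zero frequency': no interval around x,
    at any dyadic scale, has more than half its points in violation
    with x. Checks O(log n) scales with O(1) samples each."""
    size = 1
    while size <= n:
        lo, hi = max(0, x - size), min(n - 1, x + size)
        if violation_fraction(f, x, lo, hi, samples) > 0.5:
            return False
        size *= 2
    return True

def filter_query(f, x, n):
    """Online monotone reconstruction following the 3-step recipe."""
    # 1. estimate the frequency of x
    if frequency_is_zero(f, x, n):
        return f(x)                      # x already looks consistent
    # 2. grow an interval around x to zero-frequency endpoints
    #    (a doubling search would keep this polylogarithmic;
    #     boundary cases are glossed over here)
    l, r = x, x
    while l > 0 and not frequency_is_zero(f, l, n):
        l -= 1
    while r < n - 1 and not frequency_is_zero(f, r, n):
        r += 1
    # 3. interpolate between f(endpoints); any value in between
    #    preserves monotonicity, so take the midpoint
    return (f(l) + f(r)) / 2
```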
To prove:
1. Frequencies can be estimated in polylog time.
2. The function is monotone over the zero-frequency domain.
3. The zero-frequency domain occupies a (1 − 2ε) fraction of the domain.
Bivariate concave functions: the filter requires polylog(n) queries.
Bipartite graphs, k-connectivity, expanders
Denoising low-dimensional attractor sets
Priced computation & accuracy: spectrometry / cloning / gene chips; PCR / hybridization / chromatography; gel electrophoresis / blotting. Linear programming.
computation ↔ experimentation
Pricing data: ongoing project with Nir Ailon. Examples: "Factoring is easy. Here's why…"; a Gaussian mixture sample: ….
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu