2
Biomedical imaging, Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr, 150 petabytes/yr
5
Massive input → output. Sublinear algorithms: sample a tiny fraction.
7
Approximate MST [CRT ’01]
8
Reduces to counting connected components
9
E[estimator] = no. of connected components; Var[estimator] ≪ (no. of connected components)²
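A minimal Python sketch of such an estimator, assuming the graph is given as adjacency lists; the sample size and BFS cap are illustrative constants, not the tuned parameters of [CRT ’01]. Each sampled vertex u contributes 1/|C_u| (the contributions of one component sum to 1), and the BFS is truncated so every probe stays sublinear:

    import random
    from collections import deque

    def estimate_components(adj, samples=500, cap=100):
        # adj: adjacency lists of an n-vertex undirected graph.
        n = len(adj)
        total = 0.0
        for _ in range(samples):
            u = random.randrange(n)
            seen = {u}
            queue = deque([u])
            while queue and len(seen) < cap:    # truncated BFS from u
                v = queue.popleft()
                for w in adj[v]:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
            total += 1.0 / len(seen)            # ~ 1 / |component of u|
        # Scaled sample mean: expectation = no. of components,
        # up to the error introduced by the truncation at `cap`.
        return n * total / samples

The MST application runs such estimates on each threshold subgraph (edges of weight at most i) and combines them into an approximation of the MST weight.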
10
Shortest Paths [CLM ’03]
11
Ray shooting, volume intersection, point location [CLM ’03]
13
Low-entropy data: Takens embeddings, Markov models (speech)
14
Self-Improving Algorithms. Arbitrary, unknown random source. Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering.
15
Self-Improving Algorithms. Arbitrary, unknown random source.
1. Run the algorithm tuned for the best worst-case behavior, or the best under the uniform distribution, or the best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles to a stationary status: optimal expected complexity under the (still unknown) random source.
(A skeleton of this protocol is sketched below.)
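A minimal Python skeleton of the three-phase protocol; the class and method names are illustrative, not from the talk:

    class SelfImprovingAlgorithm:
        # Phase 1: safe solver with the best worst-case
        # (or postulated-prior) behavior.
        def solve_conservatively(self, instance):
            raise NotImplementedError

        # Phase 2: update internal statistics about the unknown source.
        def learn(self, instance):
            raise NotImplementedError

        # Phase 3: solver tuned to the (still unknown) source.
        def solve_tuned(self, instance):
            raise NotImplementedError

    def run(alg, instance_stream, training_rounds):
        # Serve every instance; switch to the tuned solver once the
        # learning phase has seen enough of the random source.
        for t, instance in enumerate(instance_stream):
            if t < training_rounds:
                alg.learn(instance)
                yield alg.solve_conservatively(instance)
            else:
                yield alg.solve_tuned(instance)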
16
Self-Improving Algorithms: successive running times T1, T2, T3, T4, T5, … converge to E[Tk] = optimal expected time for the random source.
17
Sorting (x1, x2, …, xn): each xi independent, drawn from Di; H = entropy of the rank distribution.
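A minimal sketch of a self-improving sorter in this spirit: the learning phase collects values, then keeps roughly n evenly spaced sample values as bucket boundaries, so that each bucket holds O(1) elements of a fresh instance in expectation. This is a simplification: locating a bucket below costs O(log n) per element via binary search, whereas the actual algorithm uses per-coordinate search structures to reach O(n + H) limiting expected time.

    import bisect

    class SelfImprovingSorter:
        def __init__(self, n):
            self.n = n
            self.pool = []           # values seen during the learning phase
            self.boundaries = []

        def learn(self, instance):
            self.pool.extend(instance)

        def finish_training(self):
            # Approximate quantiles of the learned source serve as
            # bucket boundaries.
            self.pool.sort()
            step = max(1, len(self.pool) // self.n)
            self.boundaries = self.pool[step - 1 :: step]

        def sort(self, instance):
            buckets = [[] for _ in range(len(self.boundaries) + 1)]
            for x in instance:
                buckets[bisect.bisect_left(self.boundaries, x)].append(x)
            out = []
            for b in buckets:
                b.sort()             # tiny buckets: cheap in expectation
                out.extend(b)
            return out

Usage (hypothetical constants): call learn() on the first training instances, then finish_training(); afterwards sort() runs in the tuned regime, with a worst-case-optimal sort serving as the conservative phase.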
18
Clustering: k-median (k = 2)
19
Minimize sum of distances. Hamming cube {0,1}^d.
20
Minimize sum of distances. Hamming cube {0,1}^d.
21
Minimize sum of distances. Hamming cube {0,1}^d. [KSS]
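For intuition about this objective, a hedged sketch of the k = 1 case: on the Hamming cube, the sum of distances to a single center is minimized coordinatewise by the majority bit (a standard fact; the k = 2 problem on the slide requires the full [KSS]-style machinery):

    def hamming_median(points):
        # points: nonempty list of equal-length 0/1 tuples in {0,1}^d.
        # Each coordinate contributes independently to the sum of
        # Hamming distances, so the per-coordinate majority is optimal.
        d = len(points[0])
        return tuple(int(2 * sum(p[j] for p in points) > len(points))
                     for j in range(d))

    def total_distance(points, center):
        return sum(sum(a != b for a, b in zip(p, center)) for p in points)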
22
How to achieve linear limiting expected time? Input space {0,1}^{dn}. Identify the core; on the tail, fall back to KSS. Since the tail has probability < O(dn)/KSS (KSS = the cost of running [KSS]), its expected contribution stays linear.
23
How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; answer via nearest neighbor plus an incremental algorithm. NP vs. P: input vicinity vs. algorithmic vicinity.
24
Main difficulty: How to spot the tail?
26
1. Data is accessible before noise. 2. Or it's not.
27
1. Data is accessible before noise
28
encode decode
29
Data inaccessible before noise: assumptions are necessary!
30
Data inaccessible before noise. 1. Sorted sequence. 2. Bipartite graph, expander. 3. Solid w/ angular constraints. 4. Low-dim attractor set.
31
Data inaccessible before noise: the data must satisfy some property P, but does not quite.
32
f(x) = ? Send x, receive f(x): f is the access function to the data. But life being what it is…
33
f(x) = ? Send x, receive f(x) from the data.
34
Humans: define a distance from any object to the data class.
35
f(x) = ? The user sends x and receives g(x); behind the scenes the filter queries x1, x2, … and receives f(x1), f(x2), …. The filter g is the access function for data satisfying the property P.
36
Similar to self-correction [RS ’96, BLR ’93], except: about data, not functions; error-free; allows O(distance to property).
37
Monotone function f: [n] → R^d. The filter requires polylog(n) queries.
38
Offline reconstruction
40
Online reconstruction
44
monotone function
46
Frequency of a point x: determined by the smallest interval I around x containing > |I|/2 violations involving f(x).
47
Frequency of a point
48
Given x: 1. Estimate its frequency. 2. If nonzero, find the "smallest" interval around x with both endpoints having zero frequency. 3. Interpolate between f(endpoints). (See the sketch below.)
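A minimal Python sketch of these three steps, under stated assumptions: d = 1, so f maps [n] to the reals; the violation test is a sampling estimate with illustrative constants; the outward walk to zero-frequency endpoints is linear for clarity (the actual filter stays within polylog(n) queries by searching over dyadic window sizes); and light endpoints are assumed to exist on both sides, which holds with high probability when f is close to monotone.

    import random

    def violation_fraction(f, x, lo, hi, trials=200):
        # Estimated fraction of y in [lo, hi] violating monotonicity
        # against x: y < x with f(y) > f(x), or y > x with f(y) < f(x).
        bad = 0
        for _ in range(trials):
            y = random.randint(lo, hi)
            if (y < x and f(y) > f(x)) or (y > x and f(y) < f(x)):
                bad += 1
        return bad / trials

    def frequency_is_zero(f, x, n):
        # Step 1: zero frequency = no window around x holds a majority
        # of violations involving f(x).
        w = 1
        while w < n:
            if violation_fraction(f, x, max(0, x - w), min(n - 1, x + w)) > 0.5:
                return False
            w *= 2
        return True

    def filter_query(f, x, n):
        # Steps 2-3: answer f(x) if x is light; otherwise interpolate
        # between the nearest light endpoints on either side.
        if frequency_is_zero(f, x, n):
            return f(x)
        lo = next((y for y in range(x - 1, -1, -1) if frequency_is_zero(f, y, n)), 0)
        hi = next((y for y in range(x + 1, n) if frequency_is_zero(f, y, n)), n - 1)
        return (f(lo) + f(hi)) / 2

When lo and hi are genuinely light, any answer between f(lo) and f(hi) keeps the filtered function monotone; the midpoint is one arbitrary choice.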
49
To prove: 1. Frequencies can be estimated in polylog time. 2. The function is monotone over the zero-frequency (ZF) domain. 3. The ZF domain occupies a (1 − 2ε) fraction of [n] (ε = distance to monotonicity).
50
Bivariate concave function: the filter requires polylog(n) queries.
51
Bipartite graph: k-connectivity, expander.
52
denoising low-dim attractor sets
54
Priced computation & accuracy: spectrometry/cloning/gene chip; PCR/hybridization/chromatography; gel electrophoresis/blotting. (Figure: binary data matrix.) Linear programming.
55
computation ↔ experimentation
56
Pricing data: ongoing project w/ Nir Ailon. Factoring is easy. Here's why… Gaussian mixture sample: 00100101001001101010101…
57
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu