2
Biomedical imaging, Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr, 150 petabytes/yr
5
Massive input → output. Sublinear algorithms: sample a tiny fraction.
7
Approximate MST [CRT ’01]
8
Reduces to counting connected components
9
E[estimator] = no. of connected components; Var[estimator] ≪ (no. of connected components)²
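A minimal Python sketch of such an estimator, assuming the graph is given as adjacency lists; the sample size and BFS cap are illustrative constants, not the tuned parameters of [CRT ’01]. Each sampled vertex u contributes 1/|C_u| (the contributions of one component sum to 1), and the BFS is truncated so every probe stays sublinear:

    import random
    from collections import deque

    def estimate_components(adj, samples=500, cap=100):
        # adj: adjacency lists of an n-vertex undirected graph.
        n = len(adj)
        total = 0.0
        for _ in range(samples):
            u = random.randrange(n)
            seen = {u}
            queue = deque([u])
            while queue and len(seen) < cap:    # truncated BFS from u
                v = queue.popleft()
                for w in adj[v]:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
            total += 1.0 / len(seen)            # ~ 1 / |component of u|
        # Scaled sample mean: expectation = no. of components,
        # up to the error introduced by the truncation at `cap`.
        return n * total / samples

The MST application runs such estimates on each threshold subgraph (edges of weight at most i) and combines them into an approximation of the MST weight.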
10
Shortest Paths [CLM ’03]
11
Ray shooting, volume intersection, point location [CLM ’03]
13
Low-entropy data: Takens embeddings, Markov models (speech)
14
Self-Improving Algorithms. Arbitrary, unknown random source. Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering.
15
Self-Improving Algorithms. Arbitrary, unknown random source.
1. Run the algorithm tuned for the best worst-case behavior, or the best under the uniform distribution, or the best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles to a stationary status: optimal expected complexity under the (still unknown) random source.
(A skeleton of this protocol is sketched below.)
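A minimal Python skeleton of the three-phase protocol; the class and method names are illustrative, not from the talk:

    class SelfImprovingAlgorithm:
        # Phase 1: safe solver with the best worst-case
        # (or postulated-prior) behavior.
        def solve_conservatively(self, instance):
            raise NotImplementedError

        # Phase 2: update internal statistics about the unknown source.
        def learn(self, instance):
            raise NotImplementedError

        # Phase 3: solver tuned to the (still unknown) source.
        def solve_tuned(self, instance):
            raise NotImplementedError

    def run(alg, instance_stream, training_rounds):
        # Serve every instance; switch to the tuned solver once the
        # learning phase has seen enough of the random source.
        for t, instance in enumerate(instance_stream):
            if t < training_rounds:
                alg.learn(instance)
                yield alg.solve_conservatively(instance)
            else:
                yield alg.solve_tuned(instance)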
16
Self-Improving Algorithms: successive running times T1, T2, T3, T4, T5, … converge to E[Tk] = optimal expected time for the random source.
17
Sorting (x1, x2, …, xn): each xi independent, drawn from Di; H = entropy of the rank distribution.
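A minimal sketch of a self-improving sorter in this spirit: the learning phase collects values, then keeps roughly n evenly spaced sample values as bucket boundaries, so that each bucket holds O(1) elements of a fresh instance in expectation. This is a simplification: locating a bucket below costs O(log n) per element via binary search, whereas the actual algorithm uses per-coordinate search structures to reach O(n + H) limiting expected time.

    import bisect

    class SelfImprovingSorter:
        def __init__(self, n):
            self.n = n
            self.pool = []           # values seen during the learning phase
            self.boundaries = []

        def learn(self, instance):
            self.pool.extend(instance)

        def finish_training(self):
            # Approximate quantiles of the learned source serve as
            # bucket boundaries.
            self.pool.sort()
            step = max(1, len(self.pool) // self.n)
            self.boundaries = self.pool[step - 1 :: step]

        def sort(self, instance):
            buckets = [[] for _ in range(len(self.boundaries) + 1)]
            for x in instance:
                buckets[bisect.bisect_left(self.boundaries, x)].append(x)
            out = []
            for b in buckets:
                b.sort()             # tiny buckets: cheap in expectation
                out.extend(b)
            return out

Usage (hypothetical constants): call learn() on the first training instances, then finish_training(); afterwards sort() runs in the tuned regime, with a worst-case-optimal sort serving as the conservative phase.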
18
Clustering: k-median (k = 2)
19
Minimize sum of distances. Hamming cube {0,1}^d.
20
Minimize sum of distances. Hamming cube {0,1}^d.
21
Minimize sum of distances. Hamming cube {0,1}^d. [KSS]
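For intuition about this objective, a hedged sketch of the k = 1 case: on the Hamming cube, the sum of distances to a single center is minimized coordinatewise by the majority bit (a standard fact; the k = 2 problem on the slide requires the full [KSS]-style machinery):

    def hamming_median(points):
        # points: nonempty list of equal-length 0/1 tuples in {0,1}^d.
        # Each coordinate contributes independently to the sum of
        # Hamming distances, so the per-coordinate majority is optimal.
        d = len(points[0])
        return tuple(int(2 * sum(p[j] for p in points) > len(points))
                     for j in range(d))

    def total_distance(points, center):
        return sum(sum(a != b for a, b in zip(p, center)) for p in points)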
22
How to achieve linear limiting expected time? Input space {0,1}^{dn}. Identify the core; on the tail, fall back to KSS. Since the tail has probability < O(dn)/KSS (KSS = the cost of running [KSS]), its expected contribution stays linear.
23
How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; answer via nearest neighbor plus an incremental algorithm. NP vs. P: input vicinity vs. algorithmic vicinity.
24
Main difficulty: How to spot the tail?
26
1. Data is accessible before noise. 2. Or it's not.
27
1. Data is accessible before noise
28
encode decode
29
Data inaccessible before noise: assumptions are necessary!
30
Data inaccessible before noise. 1. Sorted sequence. 2. Bipartite graph, expander. 3. Solid w/ angular constraints. 4. Low-dim attractor set.
31
Data inaccessible before noise: the data must satisfy some property P, but does not quite.
32
f(x) = ? Send x, receive f(x): f is the access function to the data. But life being what it is…
33
f(x) = ? Send x, receive f(x) from the data.
34
Humans: define a distance from any object to the data class.
35
f(x) = ? The user sends x and receives g(x); behind the scenes the filter queries x1, x2, … and receives f(x1), f(x2), …. The filter g is the access function for data satisfying the property P.
36
Similar to self-correction [RS ’96, BLR ’93], except: about data, not functions; error-free; allows O(distance to property).
37
Monotone function f: [n] → R^d. The filter requires polylog(n) queries.
38
Offline reconstruction
40
Online reconstruction
44
monotone function
46
Frequency of a point x: determined by the smallest interval I around x containing > |I|/2 violations involving f(x).
47
Frequency of a point
48
Given x: 1. Estimate its frequency. 2. If nonzero, find the "smallest" interval around x with both endpoints having zero frequency. 3. Interpolate between f(endpoints). (See the sketch below.)
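A minimal Python sketch of these three steps, under stated assumptions: d = 1, so f maps [n] to the reals; the violation test is a sampling estimate with illustrative constants; the outward walk to zero-frequency endpoints is linear for clarity (the actual filter stays within polylog(n) queries by searching over dyadic window sizes); and light endpoints are assumed to exist on both sides, which holds with high probability when f is close to monotone.

    import random

    def violation_fraction(f, x, lo, hi, trials=200):
        # Estimated fraction of y in [lo, hi] violating monotonicity
        # against x: y < x with f(y) > f(x), or y > x with f(y) < f(x).
        bad = 0
        for _ in range(trials):
            y = random.randint(lo, hi)
            if (y < x and f(y) > f(x)) or (y > x and f(y) < f(x)):
                bad += 1
        return bad / trials

    def frequency_is_zero(f, x, n):
        # Step 1: zero frequency = no window around x holds a majority
        # of violations involving f(x).
        w = 1
        while w < n:
            if violation_fraction(f, x, max(0, x - w), min(n - 1, x + w)) > 0.5:
                return False
            w *= 2
        return True

    def filter_query(f, x, n):
        # Steps 2-3: answer f(x) if x is light; otherwise interpolate
        # between the nearest light endpoints on either side.
        if frequency_is_zero(f, x, n):
            return f(x)
        lo = next((y for y in range(x - 1, -1, -1) if frequency_is_zero(f, y, n)), 0)
        hi = next((y for y in range(x + 1, n) if frequency_is_zero(f, y, n)), n - 1)
        return (f(lo) + f(hi)) / 2

When lo and hi are genuinely light, any answer between f(lo) and f(hi) keeps the filtered function monotone; the midpoint is one arbitrary choice.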
49
To prove: 1. Frequencies can be estimated in polylog time. 2. The function is monotone over the zero-frequency (ZF) domain. 3. The ZF domain occupies a (1 − 2ε) fraction of [n] (ε = distance to monotonicity).
50
Bivariate concave function: the filter requires polylog(n) queries.
51
Bipartite graph: k-connectivity, expander.
52
denoising low-dim attractor sets
54
Priced computation & accuracy: spectrometry/cloning/gene chip; PCR/hybridization/chromatography; gel electrophoresis/blotting. (Figure: binary data matrix.) Linear programming.
55
computation ↔ experimentation
56
Pricing data: ongoing project w/ Nir Ailon. Factoring is easy. Here's why… Gaussian mixture sample: 00100101001001101010101…
57
Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu