Presentation transcript:

Biomedical imaging, Sloan Digital Sky Survey: 4 petabytes (~1MG), 10 petabytes/yr, 150 petabytes/yr

Sublinear algorithms: sample a tiny fraction of a massive input to produce the output.
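As an illustration of the paradigm (my own toy sketch, not from the talk; the names are illustrative), the simplest sublinear algorithm estimates a global statistic of a massive input from a small random sample:

```python
import random

def estimate_fraction(data, predicate, sample_size=2000, seed=None):
    """Estimate the fraction of items satisfying `predicate` by reading only
    `sample_size` randomly chosen items; the cost is independent of len(data),
    which is the hallmark of a sublinear algorithm."""
    rng = random.Random(seed)
    n = len(data)
    hits = sum(predicate(data[rng.randrange(n)]) for _ in range(sample_size))
    return hits / sample_size

# Usage: fraction of even numbers in a large list, from a tiny sample.
huge = list(range(1_000_000))
print(estimate_fraction(huge, lambda x: x % 2 == 0))
```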

Approximate MST (minimum spanning tree weight) [CRT ’01]

Reduces to counting connected components

E[estimate] = no. of connected components; Var[estimate] << (no. of connected components)^2
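A hedged sketch of the component-counting step (my own code, using the standard truncated-BFS variant of the estimator rather than the exact [CRT ’01] procedure, which also handles the reduction from MST weight via edge-weight thresholds): the identity c = sum over vertices v of 1/|C(v)| is estimated from a few sampled vertices, each explored for only O(1/eps) steps.

```python
import random
from collections import deque

def estimate_components(adj, eps=0.1, samples=500, seed=0):
    """Estimate the number of connected components of a graph given as
    adjacency lists, reading only a tiny portion of it.

    Sample vertices, explore each by BFS truncated at ~2/eps vertices, and
    average 1/|explored set|.  Truncation adds at most eps*n/2 bias, and the
    variance is much smaller than c^2, as quoted on the slide above."""
    rng = random.Random(seed)
    n = len(adj)
    cap = max(2, int(2 / eps))            # truncation threshold
    total = 0.0
    for _ in range(samples):
        v = rng.randrange(n)
        seen = {v}
        queue = deque([v])
        while queue and len(seen) < cap:  # stop early on big components
            u = queue.popleft()
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
                    if len(seen) >= cap:
                        break
        total += 1.0 / len(seen)          # 1/|C(v)|, clipped for big components
    return n * total / samples

# Usage: two triangles, so the estimate should be close to 2.
adj = [[1, 2], [0, 2], [0, 1], [4, 5], [3, 5], [3, 4]]
print(estimate_components(adj, eps=0.5, samples=2000))
```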

Shortest Paths [CLM ’03]

Ray shooting, volume, intersection, point location [CLM ’03]

Low-entropy data: Takens embeddings, Markov models (speech)

Self-Improving Algorithms: arbitrary, unknown random source. Sorting, matching, MaxCut, all-pairs shortest paths, transitive closure, clustering.

Self-Improving Algorithms: arbitrary, unknown random source.
1. Run the algorithm tuned for the best worst-case behavior, or the best under the uniform distribution, or the best under some postulated prior.
2. Learning phase: the algorithm fine-tunes itself as it learns about the random source through repeated use.
3. The algorithm settles into a stationary state: optimal expected complexity under the (still unknown) random source.

Self-Improving Algorithms: running times T1, T2, T3, T4, T5, …; E[Tk] → optimal expected time for the random source.

Sorting: input (x1, x2, …, xn), each xi drawn independently from Di; H = entropy of the rank distribution.
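A toy sketch of the two phases for sorting (illustrative code and class name; the actual self-improving sorter additionally builds a near-optimal search tree per coordinate to reach O(n + H) limiting expected time):

```python
import bisect
import random

class SelfImprovingSorter:
    """Toy two-phase sketch: a learning phase remembers landmarks drawn from
    past inputs, and the limiting phase buckets each new input around those
    landmarks before sorting the (typically tiny) buckets."""

    def __init__(self):
        self.landmarks = None

    def learn(self, training_instances):
        # Learning phase: pool the training inputs and keep n roughly
        # evenly spaced order statistics as bucket boundaries.
        pooled = sorted(x for inst in training_instances for x in inst)
        n = len(training_instances[0])
        step = max(1, len(pooled) // n)
        self.landmarks = pooled[::step]

    def sort(self, xs):
        if self.landmarks is None:            # not trained yet: fall back
            return sorted(xs)
        buckets = [[] for _ in range(len(self.landmarks) + 1)]
        for x in xs:
            buckets[bisect.bisect(self.landmarks, x)].append(x)
        out = []
        for b in buckets:                     # buckets are small on average
            out.extend(sorted(b))
        return out

# Usage: each x_i drawn independently from its own distribution D_i.
rng = random.Random(0)

def make_instance():
    return [rng.gauss(mu, 1.0) for mu in range(20)]

sorter = SelfImprovingSorter()
sorter.learn([make_instance() for _ in range(50)])
print(sorter.sort(make_instance()))
```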

Clustering: k-median (k = 2)

Minimize the sum of distances; Hamming cube {0,1}^d [KSS]

How to achieve linear limiting expected time? Input space {0,1}^{dn}: identify the core; on the tail (probability < O(dn) / cost of KSS), use KSS.

How to achieve linear limiting expected time? Store a sample of precomputed KSS solutions; nearest neighbor; incremental algorithm. NP vs P: input vicinity → algorithmic vicinity.

Main difficulty: How to spot the tail?

1. Data is accessible before noise. 2. Or it’s not.

1. Data is accessible before noise

encode → decode

Data inaccessible before noise: assumptions are necessary!

Data inaccessible before noise: 1. Sorted sequence 2. Bipartite graph, expander 3. Solid w/ angular constraints 4. Low-dim attractor set

Data inaccessible before noise: the data must satisfy some property P, but does not quite.

Query x, answer f(x); f = access function to the data. But life being what it is…

Query x, answer f(x) from the data.

Humans: define a distance from any object to the data class.

Query x, answer g(x): the filter issues queries f(x1), f(x2), … to the data; g is the access function for the reconstructed data.
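In code, the filter is just a wrapper around the raw access function (a sketch with hypothetical names; the property-specific reconstruction is what the following slides develop for monotone functions):

```python
class Filter:
    """Access-function wrapper: the user only ever calls g(x); internally the
    filter issues a few raw queries f(x_1), f(x_2), ... and answers as if the
    data satisfied the target property P."""

    def __init__(self, f):
        self.f = f                # access function to the raw (noisy) data
        self.raw_queries = 0      # budget: should stay polylog(n) per call to g

    def query_raw(self, x):
        self.raw_queries += 1
        return self.f(x)

    def g(self, x):
        # Property-specific reconstruction goes here (e.g., the monotone
        # filter sketched later in this transcript).
        raise NotImplementedError
```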

Similar to self-correction [RS96, BLR’93], except: about data, not functions; error-free; allows O(distance to property).

Monotone function: [n] → R^d. The filter requires polylog(n) queries.

Offline reconstruction

Online reconstruction

monotone function

Frequency of a point x: smallest interval I containing x with > |I|/2 violations involving f(x).

Frequency of a point

Given x: 1. Estimate its frequency. 2. If nonzero, find the “smallest” interval around x with both endpoints having zero frequency. 3. Interpolate between f(endpoints).
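A brute-force sketch of these three steps for a real-valued f on [n] (my own illustrative code: it computes frequencies exactly by scanning all intervals, whereas the actual filter estimates them with polylog(n) sampled queries; linear interpolation is one valid way to pick a value between the two endpoint values):

```python
def violating(f, x, y):
    """x and y violate monotonicity if the smaller index holds the larger value."""
    return (x < y and f[x] > f[y]) or (x > y and f[x] < f[y])

def frequency_is_zero(f, x):
    """True iff no interval I containing x has more than |I|/2 points violating
    monotonicity with f(x).  Exhaustive check for clarity; the real filter
    estimates this with polylog(n) sampled queries to f."""
    n = len(f)
    for lo in range(0, x + 1):
        for hi in range(x, n):
            bad = sum(violating(f, x, y) for y in range(lo, hi + 1))
            if bad > (hi - lo + 1) / 2:
                return False
    return True

def filter_g(f, x):
    """Answer g(x) so that the answers across all queries are consistent with
    a single monotone function close to f (steps 1-3 from the slide)."""
    n = len(f)
    if frequency_is_zero(f, x):                    # 1. estimate the frequency
        return f[x]
    lo = next((y for y in range(x - 1, -1, -1) if frequency_is_zero(f, y)), None)
    hi = next((y for y in range(x + 1, n) if frequency_is_zero(f, y)), None)
    if lo is None and hi is None:                  # no zero-frequency anchors
        return f[x]
    if lo is None:                                 # 2. zero-frequency endpoints
        return f[hi]
    if hi is None:
        return f[lo]
    t = (x - lo) / (hi - lo)                       # 3. interpolate between f(endpoints)
    return f[lo] + t * (f[hi] - f[lo])

# Usage: a mostly sorted array with one corrupted entry.
f = [1, 2, 3, 100, 5, 6, 7, 8]
print([filter_g(f, x) for x in range(len(f))])     # -> [1, 2, 3, 4.0, 5, 6, 7, 8]
```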

To prove: 1. Frequencies can be estimated in polylog time. 2. The function is monotone over the zero-frequency domain. 3. The zero-frequency domain occupies a (1 - 2ε) fraction of the domain (ε = distance to monotonicity).

Bivariate concave function: the filter requires polylog(n) queries.

Bipartite graph, k-connectivity, expander.

Denoising low-dimensional attractor sets.

Priced computation & accuracy: spectrometry / cloning / gene chip; PCR / hybridization / chromatography; gel electrophoresis / blotting. Linear programming.

Computation ↔ experimentation.

Pricing data: ongoing project w/ Nir Ailon. Factoring is easy. Here’s why… Gaussian mixture sample: ….

Collaborators: Nir Ailon, Seshadri Comandur, Ding Liu