Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos
Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Resources USC 2001 C. Faloutsos
Problem: Given a large collection of (multimedia) records, find similar/interesting things, i.e.: allow fast, approximate queries, and find rules/patterns
Sample queries: Similarity search: find pairs of branches with similar sales patterns; find medical cases similar to Smith's; find pairs of sensor series that move in sync; find shapes like a spark-plug
Sample queries (cont’d): Rule discovery: clusters (of branches; of sensor data; ...); forecasting (total sales for next year?); outliers (e.g., unexpected part failures; fraud detection)
Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Related projects @ CMU and resources
Indexing - Multimedia Problem: given a set of (multimedia) objects, find the ones similar to a desired query object
Distance function: by expert [figure: three time sequences, $price vs. day (1 to 365)]
‘GEMINI’ - Pictorially [figure: each sequence S1 ... Sn ($price vs. day, 1 to 365) is mapped to a point F(S1) ... F(Sn) in feature space; feature axes e.g. avg and std]
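The picture above can be turned into a small filter-and-refine sketch. It assumes Euclidean distance between sequences and uses the two features named on the slide (avg and std); the function names, the random-walk data and the threshold `eps` are illustrative assumptions, not part of the original slides. The filter step relies on the fact that sqrt(n) times the distance in (avg, std) space never exceeds the true Euclidean distance between two length-n sequences, so no qualifying sequence is dismissed.

```python
import numpy as np

def extract_features(series):
    """Map each sequence (one row) to the point F(S) = (avg, std)."""
    return np.column_stack([series.mean(axis=1), series.std(axis=1)])

def range_query(series, feats, query, eps):
    """Return indices of sequences within Euclidean distance eps of query."""
    n_days = series.shape[1]
    q_feat = np.array([query.mean(), query.std()])
    # Filter: cheap lower bound on the true distance, computed in feature space.
    lower_bound = np.sqrt(n_days) * np.linalg.norm(feats - q_feat, axis=1)
    candidates = np.flatnonzero(lower_bound <= eps + 1e-9)
    # Refine: compute the true distance only for the surviving candidates.
    true_dist = np.linalg.norm(series[candidates] - query, axis=1)
    return candidates[true_dist <= eps], len(candidates)

# Hypothetical data: 200 random-walk "price" sequences of 365 days each.
rng = np.random.default_rng(0)
series = 100 + np.cumsum(rng.normal(size=(200, 365)), axis=1)
feats = extract_features(series)
query = series[0] + rng.normal(scale=0.1, size=365)
hits, n_candidates = range_query(series, feats, query, eps=50.0)
```

Here the feature points are scanned linearly; in a real system they would presumably be stored in a spatial access method, so the filter step avoids touching most of the data.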
Remaining issues: how to extract features automatically? how to merge similarity scores from different media?
Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search Visualization: FastMap Relevance feedback: FALCON Data Mining / Fractals Conclusions
FastMap [figure: distance table for objects O1 ... O5, with distances from ~1 to ~100; the unknown point coordinates are marked ‘??’]
FastMap: Multi-dimensional scaling (MDS) can do that, but in O(N**2) time. We want an algorithm that is linear in N: FastMap [SIGMOD95]
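A minimal sketch of the FastMap idea, assuming only a distance function between object indices: pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances. The function name, the pivot heuristic details and the 5-object distance table below are illustrative assumptions, not the published code.

```python
import numpy as np

def fastmap(n, dist, k, seed=0):
    """Map n objects to k-d points using only dist(i, j); O(n * k) distance calls."""
    rng = np.random.default_rng(seed)
    coords = np.zeros((n, k))

    def d2(i, j, col):
        # Squared distance left over after the first `col` coordinates are removed.
        diff = coords[i, :col] - coords[j, :col]
        return max(dist(i, j) ** 2 - float(diff @ diff), 0.0)

    for col in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object, twice.
        b = int(rng.integers(n))
        a = max(range(n), key=lambda j: d2(b, j, col))
        b = max(range(n), key=lambda j: d2(a, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero; nothing left to embed
        for i in range(n):
            # Cosine law: position of object i along the pivot line a-b.
            coords[i, col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * np.sqrt(dab2))
    return coords

# Hypothetical distance table: O1, O2 are ~1 apart; O3, O4, O5 are ~1 apart;
# the two groups are ~100 apart.
D = np.array([[0, 1, 100, 100, 100],
              [1, 0, 100, 100, 100],
              [100, 100, 0, 1, 1],
              [100, 100, 1, 0, 1],
              [100, 100, 1, 1, 0]], dtype=float)
coords = fastmap(5, lambda i, j: D[i, j], k=2)
```

The first output coordinate already separates the two groups, which is what makes the result usable for visualization.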
Applications: time sequences: given n co-evolving time sequences, visualize them + find rules [ICDE00] [figure: exchange rate vs. time for DEM, JPY, HKD]
Applications - financial: currency exchange rates [ICDE00] [figure: scatter plot with axes USD(t) and USD(t-5); points for USD, HKD, JPY, FRF, DEM, GBP]
Application: VideoTrails [ACM MM97]
VideoTrails - usage: scene-cut detection (about 10% errors); scene classification (e.g., dialogue vs. action)
Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search Visualization: FastMap Relevance feedback: FALCON Data Mining / Fractals Conclusions
Merging similarity scores: e.g., video: text, color, motion, audio; the weights change with the query! Solution 1: user specifies weights. Solution 2: user gives examples and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader). But: how about disjunctive queries?
‘FALCON’: Trader wants only ‘unstable’ stocks [figure: ‘V’-shaped and inverted-‘V’-shaped price sequences]
“Single query point” methods: Rocchio [figure: ‘+’ marks the good examples, ‘x’ the single query point]
“Single query point” methods: Rocchio, MindReader, MARS. The averaging effect in action... [figure: two groups of ‘+’ examples with the ‘x’ query points between them]
Main idea: FALCON Contours [Wu+, vldb2000] [figure: ‘+’ good examples plotted on feature1 (e.g., temperature) vs. feature2 (e.g., frequency)]
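One way to obtain contours that hug several separate groups of good examples is a power mean of the distances to the examples with a negative exponent, which acts like a soft "OR". The slide does not give a formula, so the function below, its name and the exponent `alpha = -5` are assumptions made for illustration only.

```python
import numpy as np

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Soft-OR dissimilarity of x to a set of good examples (alpha < 0)."""
    d = np.linalg.norm(np.asarray(good, dtype=float) - np.asarray(x, dtype=float), axis=1)
    if np.any(d == 0):
        return 0.0  # x coincides with a good example
    # Negative alpha lets the nearest good example dominate the result.
    return float(np.mean(d ** alpha) ** (1.0 / alpha))

# Two good examples far apart: a disjunctive query ("like this OR like that").
good = [(0.0, 0.0), (10.0, 10.0)]
near_first = aggregate_dissimilarity((0.1, 0.0), good)   # close to one example
centroid = aggregate_dissimilarity((5.0, 5.0), good)     # the "average" point
```

A single-query-point method would rank the centroid (5, 5) best; with the soft-OR aggregate the point next to an actual good example scores far lower (better) than the centroid.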
Conclusions for indexing + visualization: GEMINI: fast indexing, exploiting off-the-shelf SAMs; FastMap: automatic feature extraction in O(N) time; FALCON: relevance feedback for disjunctive queries
Outline Goal: ‘Find similar / interesting things’ Problem - Applications Indexing - similarity search New tools for Data Mining: Fractals Conclusions Resources
Data mining & fractals – Road map: Motivation – problems / case study; Definition of fractals and power laws; Solutions to posed problems; More examples
Problem #1 - spatial data mining (d.m.): Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies (stores & households; mpg & MTBF...) - patterns? (not Gaussian; not uniform) attraction/repulsion? separability??
Problem #2: dim. reduction: given attributes x1, ... xn, possibly non-linearly correlated, drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’)
Answer: Fractals / self-similarities / power laws
What is a fractal? = self-similar point set, e.g., Sierpinski triangle: zero area; infinite length! ...
Definitions (cont’d): Paradox: infinite perimeter; zero area! ‘Dimensionality’: between 1 and 2; actually: log(3)/log(2) = 1.58… (long story)
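The ‘long story’ can be sketched in a few lines. The symbols below are assumptions introduced here: a starting triangle of side s and area A_0, with step k replacing every triangle by 3 copies of half the side.

```latex
\begin{align*}
P_k &= 3^k \cdot 3 \cdot \frac{s}{2^k} = 3s \left(\tfrac{3}{2}\right)^{k} \to \infty
  && \text{(total perimeter)} \\
A_k &= 3^k \cdot \frac{A_0}{4^k} = A_0 \left(\tfrac{3}{4}\right)^{k} \to 0
  && \text{(total area)} \\
3 &= 2^{D} \;\Rightarrow\; D = \frac{\log 3}{\log 2} \approx 1.585
  && \text{(3 copies at scale } 1/2\text{)}
\end{align*}
```

The last line is the self-similarity argument: a set made of N copies of itself, each shrunk by a factor 1/r, has dimension log(N)/log(1/r).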
Intrinsic (‘fractal’) dimension: E.g.: #cylinders; miles / gallon. Q: fractal dimension of a line? [figure: points along a line in the x-y plane]
Intrinsic (‘fractal’) dimension: Q: fractal dimension of a line? A: nn(<= r) ~ r^1, where nn(<= r) is the number of neighbors within distance r (‘power law’: y = x^a). Q: fd of a plane? A: nn(<= r) ~ r^2. So: fd == slope of log(nn) vs. log(r)
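The slope definition above can be checked numerically. The sketch below counts pairs of points within distance r for several radii and fits a line in log-log scale; the function name, sample sizes and radii are illustrative assumptions. The fitted slope comes out slightly below the ideal value because points near the border of the sample have fewer neighbors.

```python
import numpy as np

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), fitted by least squares."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]
    # All pairwise Euclidean distances (fine for a few thousand points).
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dist = dist[np.triu_indices(len(pts), k=1)]
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    slope, _intercept = np.polyfit(np.log(radii), np.log(counts), 1)
    return slope

rng = np.random.default_rng(0)
radii = np.logspace(np.log10(0.02), np.log10(0.2), 10)
t = rng.random(1000)
fd_line = correlation_dimension(np.column_stack([t, t]), radii)   # points on a line
fd_plane = correlation_dimension(rng.random((1000, 2)), radii)    # points filling a square
```

For the line the estimate is close to 1 and for the square close to 2, matching nn(<= r) ~ r^1 and nn(<= r) ~ r^2.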
Sierpinski triangle: == ‘correlation integral’ [figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Road map: Motivation – problems / case studies; Definition of fractals and power laws; Solutions to posed problems; More examples; Conclusions
Solution#1: spatial d.m.: Galaxies (Sloan Digital Sky Survey w/ B. Nichol - ‘BOPS’ plot - [sigmod2000]): clusters? separable? attraction/repulsion? data ‘scrubbing’ – duplicates?
Solution#1: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] [figure: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs] - 1.8 slope - plateau! - repulsion!
spatial d.m.: Heuristic on choosing # of clusters [figure: radii r1 and r2]
Solution#1: spatial d.m. (same plot, cont’d): - 1.8 slope - plateau! - repulsion!! - duplicates
Problem #2: Dim. reduction
Solution: drop the attributes that don’t increase the ‘partial f.d.’ (PFD). Definition: the PFD of an attribute set A is the f.d. of the cloud of points projected onto A [w/ Traina, Traina, Wu, SBBD00]
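A rough sketch of the idea, under stated assumptions: a greedy forward pass that keeps an attribute only if adding it raises the partial fractal dimension by more than a threshold. The pass is order-dependent and is not claimed to be the procedure of the cited paper; the function names, the threshold `min_gain` and the synthetic data are hypothetical.

```python
import numpy as np

def fractal_dim(pts, radii):
    """Slope of log(#pairs within <= r) vs. log(r) for the given point cloud."""
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    pair_dist = dist[np.triu_indices(len(pts), k=1)]
    counts = np.array([(pair_dist <= r).sum() for r in radii])
    return np.polyfit(np.log(radii), np.log(counts), 1)[0]

def select_attributes(data, radii, min_gain=0.3):
    """Keep attribute j only if it increases the PFD of the kept set."""
    kept, current_fd = [], 0.0
    for j in range(data.shape[1]):
        pfd = fractal_dim(data[:, kept + [j]], radii)  # PFD of the projection
        if pfd - current_fd > min_gain:
            kept.append(j)
            current_fd = pfd
    return kept

# Hypothetical data: y is a non-linear function of x (redundant), z is independent.
rng = np.random.default_rng(0)
x = rng.random(1000)
z = rng.random(1000)
data = np.column_stack([x, x ** 2, z])
radii = np.logspace(np.log10(0.02), np.log10(0.2), 10)
kept = select_attributes(data, radii)
```

In this example the attribute y = x^2 is dropped even though it is correlated with x only non-linearly, because the cloud (x, y) is still a curve with f.d. close to 1.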
Problem #2: dim. reduction [figure: example point sets, labelled global FD=1; PFD=1; PFD~1; PFD=0; PFD=1; PFD~1]
Problem #2: dim. reduction (same figure). Notice: ‘max variance’ would fail here
Problem #2: dim. reduction (same figure). Notice: SVD would fail here
Road map: Motivation – problems / case studies; Definition of fractals and power laws; Solutions to posed problems; More examples: fractals, power laws; Conclusions
Disk traffic [figure: #bytes vs. time]: Not Poisson, not(?) iid - BUT: self-similar. How to model it?
Traffic: disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02]) [figure: #bytes vs. time, split into 20% and 80%]
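The 80-20 ‘law’ can be illustrated with a recursive split: give 80% of the volume to one half of the time interval and 20% to the other, then repeat inside each half. This is a sketch of that idea only; the function name, the bias parameter `b` and the random choice of which half gets the larger share are assumptions, not details taken from the cited paper.

```python
import numpy as np

def bursty_traffic(total, levels, b=0.8, seed=0):
    """Generate 2**levels time bins by recursive b / (1 - b) splits."""
    rng = np.random.default_rng(seed)
    series = np.array([float(total)])
    for _ in range(levels):
        # For every current bin, decide at random which half gets the share b.
        left_big = rng.random(len(series)) < 0.5
        left = np.where(left_big, b, 1.0 - b) * series
        right = series - left
        # Interleave so that each bin is replaced by its (left, right) halves.
        series = np.column_stack([left, right]).ravel()
    return series

traffic = bursty_traffic(total=1000.0, levels=10)
```

The total volume is conserved at every level, while a handful of bins end up holding most of the bytes, which gives the bursty look of real traces.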
Traffic: Many other time-sequences are bursty/clustered: (such as?)
Tape accesses: # tapes needed, to retrieve n records? (# days down, due to failures / hurricanes / communication noise...) [figure: Tape#1 ... Tape#N vs. time]
Tape accesses [figure: # tapes retrieved vs. # qual. records; curves ‘50-50 = Poisson’ and ‘real’]
More apps: Brain scans: Oct-trees; brain-scans [figure: log(#octants) vs. octree levels; slope 2.63 = fd]
GIS points: Cross-roads of Montgomery county: any rules?
GIS: A: self-similarity: intrinsic dim. = 1.51; avg#neighbors(<= r) ~ r^D [figure: log(#pairs(within <= r)) vs. log(r); slope 1.51]
Examples: LB county: Long Beach county of CA (road end-points)
More fractals: cardiovascular system: 3 (!); stock prices (LYCOS) - random walks: 1.5 [figure: 1 year, 2 years]; coastlines: 1.2-1.58 (?)
Road map: Motivation – problems / case studies; Definition of fractals and power laws; Solutions to posed problems; More examples: fractals, power laws; Conclusions
Fractals <-> Power laws: self-similarity <=> fractals <=> scale-free <=> power-laws (y=x^a, F=C*r^(-2)) [figure: log(#pairs within <= r) vs. log(r); slope 1.58]
Zipf’s law: RANK-FREQUENCY plot (in log-log scales) [figure: log(freq) vs. log(rank) for the Bible; top-ranked words “the”, “and”] Zipf’s (first) Law:
Zipf’s law: similarly for first names (slope ~ -1), last names (~ -0.7), etc.
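The rank-frequency slope quoted above can be measured with a short script: count how often each item occurs, sort the counts in decreasing order, and fit a line to log(frequency) vs. log(rank). The function name and the synthetic token list are illustrative assumptions; real text or name lists would be substituted for `tokens`.

```python
from collections import Counter

import numpy as np

def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic vocabulary of 50 words where the i-th word occurs about 1000 / i times.
tokens = [f"w{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
slope = rank_frequency_slope(tokens)
```

For this synthetic input the slope is close to -1; a slope near -0.7, as quoted for last names, would indicate a less skewed distribution.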
More power laws: Energy of earthquakes (Gutenberg-Richter law) [simscience.org] [figure: amplitude vs. day; log(count) vs. magnitude]
Clickstream data <url, u-id, ....>: Web Site Traffic [figure: log(count) vs. log(freq); Zipf]
Lotka’s law: library science (Lotka’s law of publication count); and citation counts (citeseer.nj.nec.com 6/2001) [figure: log(count) vs. log(#citations); J. Ullman marked]
Korcak’s law: Scandinavian lakes; area vs. complementary cumulative count (log-log axes) [figure: log(count( >= area)) vs. log(area)]
More power laws: Korcak: Japan islands; area vs. cumulative count (log-log axes) [figure: log(count( >= area)) vs. log(area)]
(Korcak’s law: Aegean islands)
Olympic medals [figure: log(# medals) vs. log rank; Russia, China, USA marked]
SALES data – store#96 [figure: count of products vs. # units sold]
TELCO data [figure: count of customers vs. # of service units]
More power laws on the Internet: degree vs. rank, for Internet domains (log-log) [sigcomm99] [figure: log(degree) vs. log(rank); slope -0.82]
Even more power laws: income distribution (Pareto’s law); duration of UNIX jobs [Harchol-Balter]; distribution of UNIX file sizes; Web graph [CLEVER-IBM; Barabasi]
Overall Conclusions: ‘Find similar/interesting things’ in multimedia databases. Indexing: feature extraction (‘GEMINI’); automatic feature extraction: FastMap; relevance feedback: FALCON
Conclusions - cont’d: New tools for Data Mining: Fractals/power laws: appear everywhere; lead to skewed distributions (so the usual assumptions of Gaussian, Poisson, uniformity, independence do not hold); ‘correlation integral’ for separability/cluster detection; PFD for dimensionality reduction
Resources: Software and papers: www.cs.cmu.edu/~christos; Fractal dimension (FracDim); Separability (sigmod 2000, kdd2001); Relevance feedback for query by content (FALCON – vldb 2000)
Resources: Manfred Schroeder, “Fractals, Chaos, Power Laws”