Indexing and Data Mining in Multimedia Databases

1 Indexing and Data Mining in Multimedia Databases
Christos Faloutsos CMU

2 Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources
USC 2001 C. Faloutsos

3 Problem
Given a large collection of (multimedia) records, find similar/interesting things, i.e.:
Allow fast, approximate queries, and
Find rules/patterns

4 Sample queries
Similarity search
Find pairs of branches with similar sales patterns
Find medical cases similar to Smith's
Find pairs of sensor series that move in sync
Find shapes like a spark-plug

5 Sample queries – cont’d
Rule discovery
Clusters (of branches; of sensor data; ...)
Forecasting (total sales for next year?)
Outliers (e.g., unexpected part failures; fraud detection)

6 Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
related CMU and resources

7 Indexing - Multimedia
Problem: given a set of (multimedia) objects, find the ones similar to a desired query object

8 distance function: by expert
[Figure: three time series, $price vs. day (day 1 to 365); the distance function between series is supplied by a domain expert]

9 ‘GEMINI’ - Pictorially
[Figure: each sequence S1, ..., Sn (value vs. day, day 1 to 365) is mapped to a point F(S1), ..., F(Sn) in feature space; example features: e.g., avg and e.g., std]
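
The mapping F() on slide 9 can be sketched in a few lines. This is a minimal illustration, not the GEMINI implementation: the function names (`features`, `feature_distance`, `range_query`) are hypothetical, the sequences are assumed to have equal length and to be compared with Euclidean distance, and the sqrt(n) scaling is added here so that the feature-space distance never exceeds the true distance (so the quick filter cannot discard a true match).

```python
import math

def features(seq):
    """Map a sequence to the 2-d point (avg, std), the slide's example features."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)  # population std
    return avg, std

def euclidean(a, b):
    """Exact distance between two equal-length sequences (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature_distance(fa, fb, n):
    """Distance between feature points, scaled by sqrt(n) so that it
    lower-bounds the Euclidean distance of the original sequences."""
    return math.sqrt(n) * math.hypot(fa[0] - fb[0], fa[1] - fb[1])

def range_query(query, sequences, eps):
    """Filter in feature space first, then verify survivors exactly."""
    n = len(query)
    fq = features(query)
    candidates = [s for s in sequences
                  if feature_distance(fq, features(s), n) <= eps]
    return [s for s in candidates if euclidean(query, s) <= eps]
```

In this sketch the filter is a linear scan; in a real system the feature points would presumably be precomputed and stored in an off-the-shelf spatial index, which is where the speed-up would come from.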

10 Remaining issues
How to extract features automatically?
How to merge similarity scores from different media?

11 Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
Visualization: FastMap
Relevance feedback: FALCON
Data Mining / Fractals
Conclusions

12 FastMap
[Figure: a distance matrix over objects O1, O2, O3, O4, O5 (entries such as 1 and 100) and, marked ‘??’, the wanted plot of the objects as points, with distances of ~1 and ~100]

13 FastMap
Multi-dimensional scaling (MDS) can do that, but in O(N**2) time
We want a linear algorithm: FastMap [SIGMOD95]
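
A compact sketch of the FastMap idea follows: pick two far-apart pivot objects, place every object on the line through them using only distances, then repeat on the residual distances. The function name `fastmap` and its signature are hypothetical, and the pivot choice uses a single pass of the farthest-point heuristic, which is a simplification made here.

```python
import math

def fastmap(objects, dist, k):
    """Return k coordinates per object, using only the distance function."""
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, dim):
        # Squared distance left over after the first `dim` coordinates.
        val = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            val -= (coords[i][c] - coords[j][c]) ** 2
        return max(val, 0.0)

    for dim in range(k):
        # Pivot heuristic: farthest from object 0, then farthest from that.
        a = max(range(n), key=lambda i: d2(0, i, dim))
        b = max(range(n), key=lambda i: d2(a, i, dim))
        dab2 = d2(a, b, dim)
        if dab2 == 0.0:
            break  # nothing left to explain; remaining coordinates stay 0
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line (a, b).
            coords[i][dim] = (d2(a, i, dim) + dab2 - d2(b, i, dim)) / (2 * dab)
    return coords
```

Each output dimension needs a number of distance computations proportional to N, which is the sense in which the method is linear in the number of objects.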

14 Applications: time sequences
Given n co-evolving time sequences, visualize them + find rules [ICDE00]
[Figure: exchange rate vs. time for DEM, JPY, HKD]

15 Applications - financial
Currency exchange rates [ICDE00]
[Figure: plot with points labeled FRF, GBP, JPY, HKD, USD(t), USD(t-5)]

16 Applications - financial
Currency exchange rates [ICDE00]
[Figure: plot with points labeled USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]

17 Application: VideoTrails
[ACM MM97]

18 VideoTrails - usage
Scene-cut detection (about 10% errors)
Scene classification (e.g., dialogue vs. action)

19 Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
Visualization: FastMap
Relevance feedback: FALCON
Data Mining / Fractals
Conclusions

20 Merging similarity scores
E.g., video: text, color, motion, audio
Weights change with the query!
Solution 1: user specifies weights
Solution 2: user gives examples, and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader)
But: how about disjunctive queries?

21 ‘FALCON’
Trader wants only ‘unstable’ stocks
[Figure: series shaped like Vs and like inverted Vs]

22 “Single query point” methods
[Figure: positive examples (+) and a single query point (x); method: Rocchio]

23 “Single query point” methods
[Figure: positive examples (+) in separate groups and the single query points (x) of Rocchio, MindReader, MARS]
The averaging effect in action...

24 Main idea: FALCON Contours
[Wu+, vldb2000]
[Figure: positive examples (+) on axes feature1 (e.g., temperature) and feature2 (e.g., frequency), with contours around them]
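
The contrast between a single query point and contours around several groups of good examples can be illustrated with a small sketch. It is an assumption-laden illustration, not the published FALCON method: the scoring below is a power mean of distances with a negative exponent, which behaves like a soft minimum, and the exponent value of -5 and both function names are choices made here.

```python
import math

def centroid_distance(x, good):
    """'Single query point' scoring: distance to the average good example."""
    dims = len(x)
    center = [sum(g[d] for g in good) / len(good) for d in range(dims)]
    return math.dist(x, center)

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Soft-minimum scoring: small if x is close to ANY good example.
    alpha < 0 is an assumed, illustrative setting."""
    dists = [math.dist(x, g) for g in good]
    if any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_pow = sum(d ** alpha for d in dists) / len(dists)
    return mean_pow ** (1.0 / alpha)
```

With two well-separated groups of good examples, the centroid falls in the empty space between them, so the first function favors points that resemble neither group; the second keeps low scores near each group, which is what a disjunctive query needs.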

25 Conclusions for indexing + visualization
GEMINI: fast indexing, exploiting off-the-shelf spatial access methods (SAMs)
FastMap: automatic feature extraction in O(N) time
FALCON: relevance feedback for disjunctive queries

26 Outline
Goal: ‘Find similar / interesting things’
Problem - Applications
Indexing - similarity search
New tools for Data Mining: Fractals
Conclusions
Resources

27 Data mining & fractals – Road map
Motivation – problems / case study
Definition of fractals and power laws
Solutions to posed problems
More examples

28 Problem #1 - spatial d.m.
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol)
- ‘spiral’ and ‘elliptical’ galaxies (stores & households; mpg & MTBF...)
- patterns? (not Gaussian; not uniform)
- attraction/repulsion?
- separability??

29 Problem #2: dim. reduction
Given attributes x1, ... xn, possibly non-linearly correlated, drop the useless ones
(Q: why? A: to avoid the ‘dimensionality curse’)

30 Answer: Fractals / self-similarities / power laws

31 What is a fractal?
= self-similar point set, e.g., Sierpinski triangle: zero area; infinite length! ...

32 Definitions (cont’d)
Paradox: infinite perimeter; zero area!
‘Dimensionality’: between 1 and 2; actually: log(3)/log(2) = 1.58… (long story)

33 Intrinsic (‘fractal’) dimension
E.g.: #cylinders; miles / gallon
Q: fractal dimension of a line?
[Figure: table of sample x, y values]

34 Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn(<= r) ~ r^1 (‘power law’: y = x^a)

35 Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn(<= r) ~ r^1 (‘power law’: y = x^a), where nn(<= r) is the number of neighbors within distance r
Q: fd of a plane?
A: nn(<= r) ~ r^2
fd == slope of (log(nn) vs. log(r))

36 Sierpinski triangle
[Figure: the Sierpinski triangle and its ‘correlation integral’, log(#pairs within <= r) vs. log(r); slope 1.58]
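
The ‘slope of log(#pairs within <= r) vs. log(r)’ recipe can be tried directly. The sketch below is a brute-force illustration under assumptions made here: points of the Sierpinski triangle are sampled with the ‘chaos game’, all pairs are compared (quadratic cost, fine for a demo only), and the slope is a plain least-squares fit. The function names and the choice of radii are hypothetical.

```python
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points of the Sierpinski triangle via the 'chaos game':
    repeatedly jump halfway towards a randomly chosen corner."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.0, 0.0
    pts = []
    for _ in range(n):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts

def correlation_dimension(points, radii):
    """Least-squares slope of log(#pairs within <= r) vs. log(r)."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points) for q in points[i + 1:]]
    xs, ys = [], []
    for r in radii:
        count = sum(1 for d in dists if d <= r)
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    if len(xs) < 2:
        return 0.0  # not enough non-empty radii to fit a line
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
```

For example, `correlation_dimension(sierpinski_points(1000), [2 ** -k for k in range(3, 7)])` should give an estimate roughly in the neighborhood of log(3)/log(2), while evenly spaced points on a line should give a value close to 1; the exact numbers depend on the sample and the radii.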

37 Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples
Conclusions

38 Solution #1: spatial d.m.
Galaxies (Sloan Digital Sky Survey, w/ B. Nichol; ‘BOPS’ plot [sigmod2000])
clusters?
separable?
attraction/repulsion?
data ‘scrubbing’ – duplicates?

39 Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

40 Solution #1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

41 spatial d.m.
Heuristic on choosing # of clusters
[Figure: radii r1 and r2 marked]

42 Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!

43 Solution #1: spatial d.m.
[Figure: log(#pairs within <= r) vs. log(r), with curves for ell-ell, spi-spi and spi-ell pairs]
- 1.8 slope
- plateau!
- repulsion!!
- duplicates

44 Problem #2: Dim. reduction

45 Solution:
Drop the attributes that don’t increase the ‘partial f.d.’ (PFD)
Dfn: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00]
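
One way to read the rule on slide 45 is as a greedy loop: project the points on a growing attribute set and keep an attribute only if the partial fractal dimension goes up noticeably. The sketch below is a simplified illustration under assumptions made here (forward selection in the given attribute order, a brute-force correlation dimension as the f.d. estimate, and an arbitrary `min_gain` threshold); it is not claimed to be the procedure of the cited paper, and all names are hypothetical.

```python
import math

def fractal_dim(points, radii):
    """Slope of log(#pairs within <= r) vs. log(r), by least squares."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points) for q in points[i + 1:]]
    xs, ys = [], []
    for r in radii:
        count = sum(1 for d in dists if d <= r)
        if count > 0:
            xs.append(math.log(r))
            ys.append(math.log(count))
    if len(xs) < 2:
        return 0.0
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def select_attributes(rows, names, radii, min_gain=0.2):
    """Keep an attribute only if adding it raises the partial f.d.
    of the projected points by more than min_gain (assumed threshold)."""
    kept, current = [], 0.0
    for j in range(len(names)):
        trial = kept + [j]
        projected = [tuple(row[c] for c in trial) for row in rows]
        pfd = fractal_dim(projected, radii)
        if pfd - current > min_gain:
            kept, current = trial, pfd
    return [names[j] for j in kept]
```

With rows of the form (x, x^2, constant), the second attribute is a non-linear function of the first and the third carries no information, so neither should raise the partial f.d. and both should be dropped.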

46 Problem #2: dim. reduction
[Figure: example point sets with global FD=1 and the PFD of each projection: PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]

47 Problem #2: dim. reduction
[Figure: example point sets with global FD=1 and the PFD of each projection: PFD=1, PFD=1, PFD=0, PFD=1, PFD~1]
Notice: ‘max variance’ would fail here

48 Problem #2: dim. reduction
[Figure: example point sets with global FD=1 and the PFD of each projection: PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]
Notice: SVD would fail here

49 Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples: fractals; power laws
Conclusions

50 disk traffic
Not Poisson, not(?) iid - BUT: self-similar
How to model it?
[Figure: #bytes vs. time]

51 traffic
Disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02])
[Figure: #bytes vs. time, with the 20% and 80% shares marked]
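
A recursive reading of the 80-20 ‘law’ gives a very short generator of bursty traffic: split the total volume so that one half of the time interval gets 80% and the other 20%, then repeat inside each half. Whether this matches the cited model in every detail is an assumption; the function name, the random choice of which half is the busy one, and the parameters are illustrative.

```python
import random

def b_model(total, levels, bias=0.8, seed=0):
    """Generate 2**levels traffic volumes by recursive biased splitting."""
    rng = random.Random(seed)
    series = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in series:
            hi, lo = volume * bias, volume * (1.0 - bias)
            # Put the larger share in the left or right half at random.
            nxt.extend((hi, lo) if rng.random() < 0.5 else (lo, hi))
        series = nxt
    return series
```

For instance, `b_model(1024, 5)` yields 32 time slots whose volumes add up to the original total, with a few slots carrying most of the bytes; with `bias=0.5` the output is flat, which is the non-bursty extreme.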

52 Traffic
Many other time-sequences are bursty/clustered: (such as?)

53 Tape accesses
# tapes needed, to retrieve n records?
(# days down, due to failures / hurricanes / communication noise...)
[Figure: accesses over time for Tape#1 ... Tape#N]

54 Tape accesses
[Figure: # tapes retrieved vs. # qual. records, with curves labeled ‘50-50 = Poisson’ and ‘real’; accesses over time for Tape#1 ... Tape#N]

55 More apps: Brain scans
Oct-trees; brain-scans
[Figure: log(#octants) vs. octree levels; 2.63 = fd]

56 GIS points
Cross-roads of Montgomery county: any rules?

57 GIS
A: self-similarity: intrinsic dim. = 1.51
avg #neighbors(<= r) ~ r^D
[Figure: log(#pairs within <= r) vs. log(r); slope 1.51]

58 Examples: LB county
Long Beach county of CA (road end-points)

59 More fractals:
cardiovascular system: 3 (!)
stock prices (LYCOS) - random walks: 1.5
coastlines: (?)
[Figure: stock price over 1 year and over 2 years]

61 Road map
Motivation – problems / case studies
Definition of fractals and power laws
Solutions to posed problems
More examples: fractals; power laws
Conclusions

62 Fractals <-> Power laws
self-similarity <=> fractals <=> scale-free <=> power laws (y = x^a, F = C*r^(-2))
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]

63 Zipf’s law
Bible: RANK-FREQUENCY plot (in log-log scales)
[Figure: log(freq) vs. log(rank); “the” and “and” are the top-ranked words]
Zipf’s (first) Law:

64 Zipf’s law
Similarly for first names (slope ~ -1), last names (~ -0.7), etc.
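
The rank-frequency plot and its slope are easy to reproduce. The sketch below counts words, ranks them by frequency, and fits a straight line to log(freq) vs. log(rank). The function names are hypothetical, splitting on whitespace is a crude tokenizer chosen for brevity, and a plain least-squares fit on log-log axes is only a rough estimator of a power-law exponent.

```python
import math
from collections import Counter

def word_frequencies(text):
    """Count lower-cased, whitespace-separated words."""
    return Counter(text.lower().split())

def rank_frequency_slope(frequencies):
    """Least-squares slope of log(freq) vs. log(rank), most frequent first."""
    freqs = sorted((f for f in frequencies if f > 0), reverse=True)
    if len(freqs) < 2:
        return 0.0  # a line needs at least two points
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
```

A typical call would be `rank_frequency_slope(word_frequencies(text).values())` on a long text; frequencies that fall off exactly as 1/rank give a slope of -1.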

65 More power laws
Energy of earthquakes (Gutenberg-Richter law) [simscience.org]
[Figure: amplitude vs. day; log(count) vs. magnitude]

66 Clickstream data
<url, u-id, ....>
Web Site Traffic
[Figure: log(count) vs. log(freq) plots; one labeled ‘Zipf’]

67 Lotka’s law
Library science (Lotka’s law of publication count); and citation counts (citeseer.nj.nec.com 6/2001)
[Figure: log(count) vs. log(#citations); J. Ullman marked]

68 Korcak’s law
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]

69 More power laws: Korcak
Japan islands: area vs. cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]

70 (Korcak’s law: Aegean islands)

71 Olympic medals:
[Figure: log(# medals) vs. log rank; Russia, China, USA marked]

72 SALES data – store#96
[Figure: count of products vs. # units sold]

73 TELCO data
[Figure: count of customers vs. # of service units]

74 More power laws on the Internet
Degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank); slope -0.82]

75 Even more power laws:
Income distribution (Pareto’s law)
Duration of UNIX jobs [Harchol-Balter]
Distribution of UNIX file sizes
Web graph [CLEVER-IBM; Barabasi]

76 Overall Conclusions:
‘Find similar/interesting things’ in multimedia databases
Indexing: feature extraction (‘GEMINI’)
Automatic feature extraction: FastMap
Relevance feedback: FALCON

77 Conclusions - cont’d
New tools for Data Mining: Fractals/power laws:
appear everywhere
lead to skewed distributions (as opposed to Gaussian, Poisson, uniformity, independence)
‘correlation integral’ for separability/cluster detection
PFD for dimensionality reduction

78 Resources:
Software and papers: www.cs.cmu.edu/~christos
Fractal dimension (FracDim)
Separability (sigmod 2000, kdd2001)
Relevance feedback for query by content (FALCON – vldb 2000)

79 Resources
Manfred Schroeder, “Fractals, Chaos, Power Laws”

