Indexing and Data Mining in Multimedia Databases

Presentation transcript:

Indexing and Data Mining in Multimedia Databases Christos Faloutsos CMU www.cs.cmu.edu/~christos

Outline. Goal: ‘Find similar / interesting things’. Problem - Applications; Indexing - similarity search; New tools for Data Mining: Fractals; Conclusions; Resources. USC 2001 C. Faloutsos

Problem: given a large collection of (multimedia) records, find similar/interesting things, i.e., allow fast, approximate queries, and find rules/patterns.

Sample queries: similarity search. Find pairs of branches with similar sales patterns; find medical cases similar to Smith's; find pairs of sensor series that move in sync; find shapes like a spark-plug.

Sample queries (cont’d): rule discovery. Clusters (of branches; of sensor data; ...); forecasting (total sales for next year?); outliers (e.g., unexpected part failures; fraud detection).

Outline. Goal: ‘Find similar / interesting things’. Problem - Applications; Indexing - similarity search; New tools for Data Mining: Fractals; Conclusions; related projects @ CMU and resources.

Indexing - Multimedia. Problem: given a set of (multimedia) objects, find the ones similar to a desirable query object.

Distance function: by expert. [Figure: three time sequences, $price vs. day (1 to 365)]

‘GEMINI’ - Pictorially. [Figure: each sequence S1 ... Sn, plotted over days 1 to 365, is mapped to a feature point F(S1) ... F(Sn); features are, e.g., avg and std]
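
A minimal sketch of the mapping pictured above, assuming the expert's distance is the Euclidean distance between equal-length sequences. Each sequence S becomes the feature point F(S) = sqrt(n) * (avg, std). The sqrt(n) scaling is an assumption of this sketch: with it, the distance between feature points never exceeds the true distance, so the cheap filter step cannot dismiss a true match. The function names are illustrative, and a linear scan stands in for the spatial access method.

```python
import math

def features(seq):
    """Map a sequence to the 2-d point F(S) = sqrt(n) * (avg, std)."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)  # population std
    scale = math.sqrt(n)  # assumption: scaling that keeps the lower bound
    return (scale * avg, scale * std)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(database, query, eps):
    """Indices of sequences within eps of the query (filter, then refine)."""
    fq = features(query)
    # Filter: cheap test in feature space (a spatial index would go here).
    candidates = [i for i, s in enumerate(database)
                  if euclidean(features(s), fq) <= eps]
    # Refine: discard false alarms using the full distance.
    return [i for i in candidates if euclidean(database[i], query) <= eps]

database = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [4.0, 3.0, 2.0, 1.0],      # same avg and std as the first, yet far away
    [10.0, 10.0, 10.0, 10.0],
]
query = [1.0, 2.0, 3.0, 4.1]
hits = range_query(database, query, eps=0.5)
```

The reversed sequence passes the filter (it has nearly the same avg and std as the query) and is removed only in the refine step; this is the price of using two features instead of the full sequence.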

Remaining issues: how to extract features automatically? How to merge similarity scores from different media?

Outline. Goal: ‘Find similar / interesting things’. Problem - Applications; Indexing - similarity search; Visualization: FastMap; Relevance feedback: FALCON; Data Mining / Fractals; Conclusions.

FastMap. [Figure: distance matrix for objects O1 ... O5, with entries ~1 and ~100, to be mapped to points (??)]

FastMap. Multi-dimensional scaling (MDS) can do that, but in O(N**2) time. We want a linear algorithm: FastMap [SIGMOD95].
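
A sketch of a FastMap-style embedding, to show why the number of distance computations is linear in N per output dimension. It assumes the usual formulation: pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances. The function names, the pivot heuristic details, and the `seed` parameter are choices made for this illustration only.

```python
import math
import random

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fastmap(objects, dist, k, seed=0):
    """Embed objects into k dimensions using only the distance function."""
    rng = random.Random(seed)
    n = len(objects)
    coords = [[0.0] * k for _ in range(n)]

    def residual(i, j, dim):
        # Squared distance left after removing the first `dim` coordinates.
        d2 = dist(objects[i], objects[j]) ** 2
        for c in range(dim):
            d2 -= (coords[i][c] - coords[j][c]) ** 2
        return max(d2, 0.0)

    for dim in range(k):
        # Pivot heuristic: start anywhere, jump to the farthest object, twice.
        a = rng.randrange(n)
        b = max(range(n), key=lambda j: residual(a, j, dim))
        a = max(range(n), key=lambda j: residual(b, j, dim))
        dab2 = residual(a, b, dim)
        if dab2 == 0.0:
            break  # all remaining distances are zero
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine-law projection of object i onto the line through a and b.
            coords[i][dim] = (residual(a, i, dim) + dab2
                              - residual(b, i, dim)) / (2.0 * dab)
    return coords

points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0), (3.0, 4.0), (1.0, 1.0)]
embedding = fastmap(points, euclid, k=2)
```

Each dimension touches every object a constant number of times, unlike MDS, which works on all N**2 pairs. For these 2-d Euclidean points, two output dimensions reproduce the pairwise distances up to rounding.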

Applications: time sequences. Given n co-evolving time sequences, visualize them and find rules [ICDE00]. [Figure: exchange rate vs. time for DEM, JPY, HKD]

Applications - financial: currency exchange rates [ICDE00]. [Figure: plot with points FRF, GBP, JPY, HKD, USD(t), USD(t-5)]

Applications - financial: currency exchange rates [ICDE00]. [Figure: plot with points USD, HKD, JPY, FRF, DEM, GBP, USD(t), USD(t-5)]

Application: VideoTrails [ACM MM97]

VideoTrails - usage: scene-cut detection (about 10% errors); scene classification (e.g., dialogue vs. action).

Outline. Goal: ‘Find similar / interesting things’. Problem - Applications; Indexing - similarity search; Visualization: FastMap; Relevance feedback: FALCON; Data Mining / Fractals; Conclusions.

Merging similarity scores. E.g., video: text, color, motion, audio; the weights change with the query! Solution 1: the user specifies weights. Solution 2: the user gives examples and we ‘learn’ what he/she wants: relevance feedback (Rocchio, MARS, MindReader). But: how about disjunctive queries?

‘FALCON’. Trader wants only ‘unstable’ stocks. [Figure: example sequences labeled ‘Vs’ and ‘Inverted Vs’]

“Single query point” methods. [Figure: good examples (+) and the single query point (x) used by Rocchio]

“Single query point” methods. [Figure: good examples (+) and the query points (x) used by Rocchio, MindReader, and MARS] The averaging effect in action...

Main idea: FALCON contours [Wu+, vldb2000]. [Figure: good examples (+) plotted by feature1 (e.g., temperature) and feature2 (e.g., frequency)]
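
A sketch contrasting a single-query-point score with an aggregate score over all good examples. The aggregate used here, a power mean of the distances with a negative exponent (a "soft minimum"), and the value alpha = -5 are assumptions of this illustration rather than details given on the slide; the function names are hypothetical.

```python
import math

def euclid(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid_distance(x, good):
    """Single-query-point style: distance to the average of the good examples."""
    dims = len(good[0])
    center = [sum(g[d] for g in good) / len(good) for d in range(dims)]
    return euclid(x, center)

def aggregate_distance(x, good, alpha=-5.0):
    """Soft minimum of the distances from x to every good example."""
    dists = [euclid(x, g) for g in good]
    if any(d == 0.0 for d in dists):
        return 0.0  # x coincides with a good example
    mean_power = sum(d ** alpha for d in dists) / len(dists)
    return mean_power ** (1.0 / alpha)

good = [(0.0, 0.0), (10.0, 10.0)]   # two separate regions of interest
near_good = (1.0, 0.0)              # close to one good example
in_between = (5.0, 5.0)             # the centroid, far from both

by_centroid = sorted([near_good, in_between],
                     key=lambda x: centroid_distance(x, good))
by_aggregate = sorted([near_good, in_between],
                      key=lambda x: aggregate_distance(x, good))
```

Averaging ranks the in-between point first even though it resembles neither good example; the aggregate score ranks the point near a good example first, which is the behavior a disjunctive query needs.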

Conclusions for indexing + visualization. GEMINI: fast indexing, exploiting off-the-shelf spatial access methods (SAMs). FastMap: automatic feature extraction in O(N) time. FALCON: relevance feedback for disjunctive queries.

Outline. Goal: ‘Find similar / interesting things’. Problem - Applications; Indexing - similarity search; New tools for Data Mining: Fractals; Conclusions; Resources.

Data mining & fractals - Road map: Motivation (problems / case study); Definition of fractals and power laws; Solutions to posed problems; More examples.

Problem #1 - spatial d.m. Galaxies (Sloan Digital Sky Survey, w/ B. Nichol): ‘spiral’ and ‘elliptical’ galaxies (stores & households; mpg & MTBF; ...). Patterns? (not Gaussian; not uniform) Attraction/repulsion? Separability?

Problem #2: dim. reduction. Given attributes x1, ..., xn, possibly non-linearly correlated, drop the useless ones. (Q: why? A: to avoid the ‘dimensionality curse’)

Answer: Fractals / self-similarities / power laws

What is a fractal? A self-similar point set, e.g., the Sierpinski triangle: zero area; infinite length!

Definitions (cont’d). Paradox: infinite perimeter; zero area! ‘Dimensionality’: between 1 and 2; actually log(3)/log(2) = 1.58… (long story)

Intrinsic (‘fractal’) dimension. E.g.: #cylinders; miles / gallon. Q: fractal dimension of a line? [Figure: points 1 to 5 along a line in the x-y plane]

Intrinsic (‘fractal’) dimension. Q: fractal dimension of a line? A: nn(<= r) ~ r^1, where nn(<= r) is the number of neighbors within distance r (‘power law’: y = x^a).

Intrinsic (‘fractal’) dimension. Q: fractal dimension of a line? A: nn(<= r) ~ r^1, where nn(<= r) is the number of neighbors within distance r (‘power law’: y = x^a). Q: fd of a plane? A: nn(<= r) ~ r^2. fd == slope of log(nn) vs. log(r).

Sierpinski triangle. The plot of log(#pairs within <= r) vs. log(r) is the ‘correlation integral’; its slope here is 1.58.
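
A sketch of estimating the intrinsic dimension as the slope of the correlation integral. It counts all pairs by brute force, so it is quadratic in the number of points and meant only as an illustration; the function name, the radii, and the two test sets (a line and a square grid) are choices made for this example.

```python
import math

def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs. log(r), fitted by least squares."""
    n = len(points)
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
             for i in range(n) for j in range(i + 1, n)]
    xs, ys = [], []
    for r in radii:
        pairs = sum(1 for d in dists if d <= r)
        if pairs > 0:  # skip radii too small to contain any pair
            xs.append(math.log(r))
            ys.append(math.log(pairs))
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

radii = [0.06, 0.12, 0.24]
# 400 points on a line embedded in 2-d: the intrinsic dimension should be ~1.
line = [(i / 400, 2 * i / 400) for i in range(400)]
# 400 points filling a square: the intrinsic dimension should be ~2.
grid = [(i / 20, j / 20) for i in range(20) for j in range(20)]

fd_line = correlation_dimension(line, radii)
fd_plane = correlation_dimension(grid, radii)
```

On these inputs the estimates come out close to 1 and close to 2. The plane estimate falls slightly short of 2 because points near the border of the square have fewer neighbors.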

Road map: Motivation (problems / case studies); Definition of fractals and power laws; Solutions to posed problems; More examples; Conclusions.

Solution #1: spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol; ‘BOPS’ plot [sigmod2000]). Clusters? Separable? Attraction/repulsion? Data ‘scrubbing’: duplicates?

Solution #1: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]

Solution #1: spatial d.m. [w/ Seeger, Traina, Traina, SIGMOD00] [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]

spatial d.m. Heuristic on choosing # of clusters. [Figure: radii r1 and r2]

Solution #1: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!]

Solution #1: spatial d.m. [Figure: log(#pairs within <= r) vs. log(r) for ell-ell, spi-spi, and spi-ell pairs; annotations: 1.8 slope; plateau!; repulsion!!; duplicates]

Problem #2: Dim. reduction

Solution: drop the attributes that don’t increase the ‘partial f.d.’ (PFD). Definition: the PFD of attribute set A is the f.d. of the projected cloud of points [w/ Traina, Traina, Wu, SBBD00].
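
A sketch of the rule stated above: scan the attributes and keep one only if adding it raises the partial fractal dimension of the projection. This single forward pass is a simplification made for illustration and is not necessarily the procedure of the cited paper; the `min_gain` threshold, the function names, and the synthetic data are assumptions. The fractal dimension is estimated as the slope of log(#pairs within r) vs. log(r).

```python
import math

def fractal_dim(points, radii):
    """Slope of log(#pairs within r) vs. log(r), fitted by least squares."""
    n = len(points)
    dists = [math.sqrt(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
             for i in range(n) for j in range(i + 1, n)]
    xs = [math.log(r) for r in radii]
    # Assumes every radius contains at least one pair (true for the data below).
    ys = [math.log(sum(1 for d in dists if d <= r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

def select_attributes(rows, radii, min_gain=0.3):
    """Keep an attribute only if it raises the PFD by more than min_gain."""
    kept, current = [], 0.0
    for attr in range(len(rows[0])):
        trial = kept + [attr]
        projected = [tuple(row[a] for a in trial) for row in rows]
        pfd = fractal_dim(projected, radii)
        if pfd - current > min_gain:
            kept, current = trial, pfd
    return kept

# Synthetic rows (t, t^2, 1 - t, w): attributes 1 and 2 are functions of
# attribute 0 (one of them non-linear); w is unrelated to t.
n = 400
rows = []
for i in range(n):
    t = i / n
    w = (i * 0.618034) % 1.0  # deterministic, evenly spread values in [0, 1)
    rows.append((t, t * t, 1.0 - t, w))

kept = select_attributes(rows, radii=[0.1, 0.2, 0.4])
```

Attributes 1 and 2 are dropped because the projected cloud stays a curve of dimension about 1 whether or not they are included, even though t^2 is not a linear function of t. The result depends on the scan order, which is one reason to treat this as a sketch only.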

Problem #2: dim. reduction. [Figure: example point sets with global FD=1; partial f.d. labels: PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]

Problem #2: dim. reduction. Notice: ‘max variance’ would fail here. [Figure: example point sets with global FD=1; partial f.d. labels: PFD=1, PFD=1, PFD=0, PFD=1, PFD~1]

Problem #2: dim. reduction. Notice: SVD would fail here. [Figure: example point sets with global FD=1; partial f.d. labels: PFD=1, PFD~1, PFD=0, PFD=1, PFD~1]

Road map: Motivation (problems / case studies); Definition of fractals and power laws; Solutions to posed problems; More examples (fractals; power laws); Conclusions.

Disk traffic. Not Poisson, not(?) iid - BUT: self-similar. How to model it? [Figure: #bytes vs. time]

Traffic: disk traces (80-20 ‘law’ = ‘multifractal’ [ICDE’02]). [Figure: #bytes vs. time, split 20% / 80%]
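
A sketch of how a recursive 80-20 split produces bursty, self-similar traffic. At every level each interval passes 80% of its volume to one half and 20% to the other, with the heavy half chosen at random. This is an illustration of the ‘law’ named above, not necessarily the generator used in the cited work; the function name and parameters are hypothetical.

```python
import random

def eighty_twenty_trace(total, levels, bias=0.8, seed=0):
    """Split `total` bytes over 2**levels time slots by repeated 80-20 cuts."""
    rng = random.Random(seed)
    volumes = [float(total)]
    for _ in range(levels):
        split = []
        for v in volumes:
            heavy, light = v * bias, v * (1.0 - bias)
            # Randomly decide which half of the interval gets the heavy share.
            if rng.random() < 0.5:
                split.extend((heavy, light))
            else:
                split.extend((light, heavy))
        volumes = split
    return volumes

trace = eighty_twenty_trace(total=1_000_000, levels=10)
busiest = max(trace)
quietest = min(trace)
```

With 10 levels the busiest slot holds 0.8**10 of the total (about 10.7%) while a uniform spread would give each of the 1024 slots under 0.1%; the same imbalance reappears at every time scale.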

Traffic. Many other time-sequences are bursty/clustered (such as?)

Tape accesses. # tapes needed to retrieve n records? (# days down, due to failures / hurricanes / communication noise...) [Figure: accesses over time for Tape#1 ... Tape#N]

Tape accesses. [Figure: # tapes retrieved vs. # qual. records, for real data and for 50-50 = Poisson; accesses over time for Tape#1 ... Tape#N]

More apps: brain scans. Oct-trees; brain scans. [Figure: log(#octants) vs. octree levels; slope 2.63 = fd]

GIS points. Cross-roads of Montgomery county: any rules?

GIS. A: self-similarity: intrinsic dim. = 1.51; avg #neighbors(<= r) ~ r^D. [Figure: log(#pairs within <= r) vs. log(r), slope 1.51]

Examples: LB county. Long Beach county of CA (road end-points).

More fractals: cardiovascular system: 3 (!); stock prices (LYCOS) - random walks: 1.5; coastlines: 1.2-1.58 (?). [Figure: stock price over 1 year and over 2 years]


Road map: Motivation (problems / case studies); Definition of fractals and power laws; Solutions to posed problems; More examples (fractals; power laws); Conclusions.

Fractals <-> Power laws: self-similarity <=> fractals <=> scale-free <=> power-laws (y = x^a, F = C*r^(-2)). [Figure: log(#pairs within <= r) vs. log(r), slope 1.58]

Zipf’s law. Bible RANK-FREQUENCY plot (in log-log scales). [Figure: log(freq) vs. log(rank); labeled points: “the”, “and”] Zipf’s (first) Law:

Zipf’s law: similarly for first names (slope ~ -1), last names (~ -0.7), etc.
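
A sketch of measuring the slope of a rank-frequency plot in log-log scales, the quantity quoted above (~ -1, ~ -0.7). The function name and the synthetic word list, built so that the word of rank r occurs exactly 2520 / r times, are made up for this illustration.

```python
import math
from collections import Counter

def rank_frequency_slope(words):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(words).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# Synthetic vocabulary of 10 words; 2520 is divisible by every rank 1..10.
words = []
for rank in range(1, 11):
    words.extend(["w%d" % rank] * (2520 // rank))

slope = rank_frequency_slope(words)
```

Because the synthetic frequencies are exactly proportional to 1 / rank, the fitted slope is -1 up to rounding. Real text gives a noisier line, and the fit is usually dominated by the many rare words at high ranks.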

More power laws: energy of earthquakes (Gutenberg-Richter law) [simscience.org]. [Figure: amplitude vs. day; log(count) vs. magnitude]

Clickstream data <url, u-id, ....>: Web site traffic. [Figure: log(count) vs. log(freq) plots; Zipf]

Lotka’s law: library science (Lotka’s law of publication count); and citation counts (citeseer.nj.nec.com 6/2001). [Figure: log(count) vs. log(#citations); labeled point: J. Ullman]

Korcak’s law: Scandinavian lakes; area vs. complementary cumulative count (log-log axes). [Figure: log(count(>= area)) vs. log(area)]

More power laws: Korcak. Japan islands; area vs. cumulative count (log-log axes). [Figure: log(count(>= area)) vs. log(area)]

(Korcak’s law: Aegean islands)

Olympic medals. [Figure: log(# medals) vs. log rank; labeled points: Russia, China, USA]

SALES data - store #96. [Figure: count of products vs. # units sold]

TELCO data. [Figure: count of customers vs. # of service units]

More power laws on the Internet: degree vs. rank, for Internet domains (log-log) [sigcomm99]. [Figure: log(degree) vs. log(rank), slope -0.82]

Even more power laws: income distribution (Pareto’s law); duration of UNIX jobs [Harchol-Balter]; distribution of UNIX file sizes; Web graph [CLEVER-IBM; Barabasi].

Overall conclusions: ‘Find similar/interesting things’ in multimedia databases. Indexing: feature extraction (‘GEMINI’); automatic feature extraction: FastMap; relevance feedback: FALCON.

Conclusions (cont’d). New tools for Data Mining: fractals/power laws. They appear everywhere; they lead to skewed distributions (so Gaussian, Poisson, uniformity, and independence assumptions do not hold); ‘correlation integral’ for separability/cluster detection; PFD for dimensionality reduction.

Resources. Software and papers: www.cs.cmu.edu/~christos. Fractal dimension (FracDim); separability (sigmod 2000, kdd2001); relevance feedback for query by content (FALCON, vldb 2000).

Resources: Manfred Schroeder, “Fractals, Chaos, Power Laws”.