Published by Brice Mathews. Modified over 8 years ago.
1
Carnegie Mellon
Data Mining – Research Directions
C. Faloutsos, CMU
www.cs.cmu.edu/~christos
2
Carnegie Mellon, NSF-IDM2000, C. Faloutsos
Past
Data mining: ‘find rules / interesting patterns’
- ML -> decision trees, ANN, …
- DB -> association rules (A.R.), DataCubes, OLAP, clustering (BIRCH, BFR, …), decision trees
- Stat: SVD/PCA, …
Most of them: already in commercial products
3
Past
Often, (implicit) assumptions about
- Gaussian distributions (e.g., clustering)
- Poisson arrivals (time series)
- uniformity / independence
Often inadequate, e.g.:
4
Problem #1: GIS - points
Road end-points of Montgomery County:
Q: distribution?
- not uniform
- not Gaussian
- no rules??
5
Problem #2: Internet
Internet routers: how many neighbors within h hops?
6
Problem #3: traffic
Disk trace (from HP); Web traffic: fit a model
[Figure: #bytes vs. time; label: Poisson]
7
Common answer:
Fractals / self-similarities / power laws
Seminal works by Hilbert, Minkowski, Cantor, Mandelbrot (Hausdorff, Lyapunov, Wilson, …)
8
What is a fractal?
= self-similar point set, e.g., Sierpinski triangle
Important: intrinsic, or ‘fractal’, dimension
= log(N) / log(r) = log(3) / log(2) = 1.58 (!)
(N = number of self-similar copies, r = scale factor: the Sierpinski triangle consists of 3 copies of itself, each at half the scale)
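The log(N)/log(r) formula on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `similarity_dimension` is an illustrative choice, not something from the talk, and it assumes an exactly self-similar set.

```python
import math

def similarity_dimension(copies: int, scale: float) -> float:
    """Dimension of a set made of `copies` pieces, each shrunk by a factor `scale`."""
    return math.log(copies) / math.log(scale)

print(round(similarity_dimension(3, 2), 2))  # Sierpinski triangle -> 1.58
print(round(similarity_dimension(2, 2), 2))  # line segment -> 1.0
print(round(similarity_dimension(4, 2), 2))  # filled square -> 2.0
```

The line and the square recover the familiar integer dimensions, which is why the non-integer 1.58 for the Sierpinski triangle stands out.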
9
Intrinsic (‘fractal’) dimension
Q: fractal dimension of a line?
A: nn(r) ~ r^1
Q: fd of a plane?
A: nn(r) ~ r^2
(nn(r) = number of neighbors within distance r)
10
Sierpinski triangle
[Figure: log(#pairs within <= r) vs. log(r); slope 1.58]
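The pair-count plot for the Sierpinski triangle can be reproduced roughly as follows. This is a sketch under stated assumptions, not the code behind the slide: the points are sampled with the well-known 'chaos game', and the sample size, radii, and function names are all illustrative choices.

```python
import bisect
import math
import random

def sierpinski_points(n, seed=0):
    """Sample n points on the Sierpinski triangle with the 'chaos game'."""
    rng = random.Random(seed)
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    points = []
    for i in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2   # jump halfway to a random corner
        if i >= 20:                          # drop the first few transient points
            points.append((x, y))
    return points

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def correlation_dimension(points, radii):
    """Slope of log(#pairs within <= r) versus log(r)."""
    dists = sorted(
        math.dist(p, q)
        for i, p in enumerate(points)
        for q in points[i + 1:]
    )
    counts = [bisect.bisect_right(dists, r) for r in radii]
    return fit_slope([math.log(r) for r in radii], [math.log(c) for c in counts])

radii = [0.01 * 2 ** k for k in range(5)]   # 0.01 ... 0.16
d = correlation_dimension(sierpinski_points(1500), radii)
print(round(d, 2))  # expected to land near log(3)/log(2) = 1.58
```

The all-pairs distance computation is quadratic in the number of points, so this only suits small samples; scaling it up is exactly the kind of tool-building the talk calls for.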
11
Problem #1: GIS points
Cross-roads of Montgomery County: any rules?
12
Solution #1
A: self-similarity
-> fractals
-> scale-free
-> power laws (y = x^a, F = C * r^(-2))
avg#neighbors(<= r) = r^D
[Figure: log(#pairs(within <= r)) vs. log(r)]
13
Solution #2: Internet topology
Internet routers: how many neighbors within h hops?
Reachability function: number of neighbors within r hops, vs. r (log-log). Mbone routers, 1995
[Figure: log(#pairs) vs. log(hops); slope 3.3]
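A reachability function of this kind can be computed with one breadth-first search per node. The sketch below assumes an undirected graph stored as an adjacency dictionary and counts ordered pairs excluding self-pairs; both conventions, the function name `pairs_within_hops`, and the toy ring graph are assumptions for illustration, not the Mbone data from the slide.

```python
from collections import deque

def pairs_within_hops(adj, max_hops):
    """For h = 1..max_hops, count ordered node pairs (u, v), u != v, at most h hops apart."""
    totals = [0] * (max_hops + 1)
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            if dist[u] == max_hops:
                continue                  # do not expand beyond max_hops
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for node, d in dist.items():
            if node != source:
                for h in range(d, max_hops + 1):
                    totals[h] += 1        # reachable within every h >= d
    return totals[1:]

# Toy graph (hypothetical): a ring of 10 routers, each linked to its two neighbours.
ring = {i: [(i - 1) % 10, (i + 1) % 10] for i in range(10)}
print(pairs_within_hops(ring, 3))  # [20, 40, 60]
```

Plotting the logarithm of these counts against log(hops) and reading off the slope gives the exponent; for the ring the count grows linearly with h, so the slope is 1.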
14
Solution #3: traffic
Disk traces (Hurst exponent, variance plot, fractional Gaussian noise, multifractals)
[Figure: #bytes vs. time]
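One standard way to read a Hurst exponent off a variance plot is the aggregated-variance method: average the series over blocks of size m and fit the slope of log(variance of block means) against log(m), which is 2H - 2 for a self-similar series. The sketch below is a hedged illustration on synthetic white noise, not on the HP disk traces; the function names and block sizes are assumptions.

```python
import math
import random
import statistics

def fit_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def hurst_variance_plot(series, block_sizes):
    """Estimate H from the slope of log(variance of block means) vs. log(block size)."""
    log_m, log_var = [], []
    for m in block_sizes:
        n_blocks = len(series) // m
        means = [sum(series[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
        log_m.append(math.log(m))
        log_var.append(math.log(statistics.pvariance(means)))
    slope = fit_slope(log_m, log_var)     # slope = 2H - 2 for a self-similar series
    return 1 + slope / 2

rng = random.Random(0)
white_noise = [rng.gauss(0, 1) for _ in range(2 ** 14)]
h = hurst_variance_plot(white_noise, [2 ** k for k in range(8)])
print(round(h, 2))  # independent samples should give a value close to 0.5
```

Bursty, self-similar traffic would be expected to give H noticeably above 0.5, which is the signature that Poisson-style models miss.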
15
More examples of fractals
Galaxies (Sloan Digital Sky Survey)
16
Brain scans
Oct-trees; brain scans
[Figure: log(#octants) vs. octree level; slope 2.63 = fd]
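The octree plot corresponds to box counting: at each octree level, count the cells that contain at least one point, then take the slope of log2(#occupied cells) against the level. The following sketch assumes points scaled into the unit cube and uses deterministic toy point sets (a diagonal line and a flat sheet) rather than brain-scan data; all names are illustrative.

```python
import math

def occupied_cells(points, level):
    """Number of octree cells at `level` (2**level cells per side) that hold a point."""
    side = 2 ** level
    cells = {tuple(min(int(c * side), side - 1) for c in p) for p in points}
    return len(cells)

def box_counting_dimension(points, levels):
    """Least-squares slope of log2(#occupied cells) versus octree level."""
    ys = [math.log2(occupied_cells(points, lv)) for lv in levels]
    mx, my = sum(levels) / len(levels), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(levels, ys))
    den = sum((x - mx) ** 2 for x in levels)
    return num / den

levels = list(range(1, 7))

# Toy data: points along the main diagonal of the unit cube (a line).
line = [((i + 0.5) / 4096,) * 3 for i in range(4096)]
print(round(box_counting_dimension(line, levels), 2))   # 1.0

# Toy data: a 64 x 64 grid of points on the plane z = 0.5 (a flat sheet).
sheet = [((i + 0.5) / 64, (j + 0.5) / 64, 0.5) for i in range(64) for j in range(64)]
print(round(box_counting_dimension(sheet, levels), 2))  # 2.0
```

A real 3-d dataset with a slope between 2 and 3, such as the 2.63 reported on the slide, fills space more than a sheet but less than a solid.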
17
More fractals and power laws:
- coastlines: 1.2-1.58 (Norway!)
- cardiovascular system: 3 (!)
- stock prices: 1.5
18
More power laws on the Internet
Degree vs. rank, for Internet domains (log-log) [sigcomm99]
[Figure: log(degree) vs. log(rank)]
19
More tools
- ‘fat fractals’ -> islands, lakes, etc.
- Multi-fractals: 80-20 ‘law’ … (multi-fractal spectrum, Hoelder exponent, …)
20
More power laws: GIS areas
Scandinavian lakes: area vs. complementary cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
21
More power laws: GIS areas
Japan islands: area vs. cumulative count (log-log axes)
[Figure: log(count(>= area)) vs. log(area)]
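The count(>= area) plots used for lakes and islands can be built by sorting the areas in decreasing order, so that the rank of each item is the number of items at least as large. The sketch below uses synthetic Pareto-distributed "areas" as a stand-in (the lake and island datasets are not available here) and assumes distinct values; the function names are illustrative.

```python
import math
import random

def ccdf(values):
    """Return (value, count of items >= value) pairs, assuming distinct values."""
    ordered = sorted(values, reverse=True)
    return [(v, rank) for rank, v in enumerate(ordered, start=1)]

def loglog_slope(pairs):
    """Least-squares slope of log(count) versus log(value)."""
    xs = [math.log(v) for v, _ in pairs]
    ys = [math.log(c) for _, c in pairs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

rng = random.Random(0)
areas = [rng.paretovariate(1.0) for _ in range(5000)]   # synthetic, exponent 1.0
slope = loglog_slope(ccdf(areas))
print(round(slope, 2))  # a power law shows up as a straight line; here slope is near -1
```

A plain least-squares fit on the log-log plot is the simplest option and is sensitive to noise in the sparse tail, so it is best treated as a first look rather than a careful estimate.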
22
Multifractals – 80-20 law
80-20 ‘law’, recursively applied; bias: p
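Applying the 80-20 'law' recursively can be sketched as a binary cascade: split an interval in two, give a fraction p of its mass to one half and 1 - p to the other, and repeat inside each half. In the version below the larger share always goes to the left half, which is a simplifying assumption (a traffic generator would typically randomize the side); the function name `biased_cascade` is hypothetical.

```python
def biased_cascade(total, bias, levels):
    """Split `total` recursively: each interval gives `bias` of its mass to its left half."""
    masses = [total]
    for _ in range(levels):
        masses = [part for m in masses for part in (m * bias, m * (1 - bias))]
    return masses

traffic = biased_cascade(total=1000.0, bias=0.8, levels=3)
print(len(traffic))   # 8
print(max(traffic))   # the busiest cell holds about 1000 * 0.8**3 of the mass
```

With p = 0.5 the mass spreads uniformly; the further p moves from 0.5, the burstier the result.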
23
Tape accesses
[Figure: accesses to Tape #1 … Tape #N over time; # tapes retrieved vs. # qual. records, with a curve labeled ‘unif’]
24
More power laws
- distribution of file sizes (‘Zipf’s law’)
- income distribution (Pareto’s law)
- publication counts (Lotka’s law)
- length of articles in a newspaper (Zipf)
- web hit counts [Huberman]
- duration of UNIX jobs [Harchol-Balter]
- length of file transfers [Bestavros+]
25
Conclusions
Real datasets are very often self-similar, in:
- geographic, medical, astrophysics, financial, … settings
- network/web traffic; the Internet topology
Then, we could look for
- fractal/intrinsic dimension
- power laws: y = x^a
26
Therefore:
Need to ‘borrow’ tools and scale them up, or to develop new data mining tools
- tools from physics, math, graphics, …
- beyond Gaussian, Poisson, uniformity, independence
- beyond ‘mean’ and ‘variance’: slopes and exponents instead
27
Resource:
Manfred Schroeder, “Fractals, Chaos, Power Laws”, Freeman and Co., 1991