Presentation is loading. Please wait.

Presentation is loading. Please wait.

School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional.

Similar presentations


Presentation on theme: "School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional."— Presentation transcript:

1 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional Data Christos Faloutsos Carnegie Mellon University

2 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos2 THANK YOU! Prof. Jiawei Han Prof. Kevin Chang

3 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos3 Overview Goals/ motivation: find patterns in large datasets: –(A) Sensor data –(B) network/graph data Solutions: self-similarity and power laws Discussion

4 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos4 Applications of sensors/streams ‘Smart house’: monitoring temperature, humidity etc Financial, sales, economic series

5 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos5 Motivation - Applications Medical: ECGs +; blood pressure etc monitoring Scientific data: seismological; astronomical; environment / anti- pollution; meteorological

6 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos6 Motivation - Applications (cont’d) civil/automobile infrastructure –bridge vibrations [Oppenheim+02] – road conditions / traffic monitoring Automobile traffic 0 200 400 600 800 1000 1200 1400 1600 1800 2000 time # cars

7 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos7 Motivation - Applications (cont’d) Computer systems –web servers (buffering, prefetching) –network traffic monitoring –... http://repository.cs.vt.edu/lbl-conn-7.tar.Z

8 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos8 Problem definition Given: one or more sequences x 1, x 2, …, x t, …; (y 1, y 2, …, y t, …) Find –patterns; clusters; outliers; forecasts;

9 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos9 Problem #1 Find patterns, in large datasets time # bytes

10 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos10 Problem #1 Find patterns, in large datasets time # bytes Poisson indep., ident. distr

11 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos11 Problem #1 Find patterns, in large datasets time # bytes Poisson indep., ident. distr

12 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos12 Problem #1 Find patterns, in large datasets time # bytes Poisson indep., ident. distr Q: Then, how to generate such bursty traffic?

13 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos13 Overview Goals/ motivation: find patterns in large datasets: –(A) Sensor data –(B) network/graph data Solutions: self-similarity and power laws Discussion

14 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos14 Problem #2 - network and graph mining How does the Internet look like? How does the web look like? What constitutes a ‘normal’ social network? What is the ‘market value’ of a customer? which gene/species affects the others the most?

15 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos15 Problem#2 Given a graph: which node to market-to / defend / immunize first? Are there un-natural sub- graphs? (eg., criminals’ rings)? [from Lumeta: ISPs 6/1999]

16 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos16 Solutions New tools: power laws, self-similarity and ‘fractals’ work, where traditional assumptions fail Let’s see the details:

17 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos17 Overview Goals/ motivation: find patterns in large datasets: –(A) Sensor data –(B) network/graph data Solutions: self-similarity and power laws Discussion

18 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos18 What is a fractal? = self-similar point set, e.g., Sierpinski triangle:... zero area: (3/4)^inf infinite length! (4/3)^inf Q: What is its dimensionality??

19 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos19 What is a fractal? = self-similar point set, e.g., Sierpinski triangle:... zero area: (3/4)^inf infinite length! (4/3)^inf Q: What is its dimensionality?? A: log3 / log2 = 1.58 (!?!)

20 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos20 Intrinsic (‘fractal’) dimension Q: fractal dimension of a (finite set of points on a) line? xy 51 42 33 24

21 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos21 Intrinsic (‘fractal’) dimension Q: fractal dimension of a line? A: nn ( <= r ) ~ r^1 (‘power law’: y=x^a) Q: fd of a plane? A: nn ( <= r ) ~ r^2 fd== slope of (log(nn) vs.. log(r) )

22 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos22 Sierpinsky triangle log( r ) log(#pairs within <=r ) 1.58 == ‘correlation integral’ = CDF of pairwise distances

23 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos23 Observations: Fractals power laws Closely related: fractals self-similarity scale-free power laws ( y= x a ; F=K r -2 ) (vs y=e -ax or y=x a +b) log( r ) log(#pairs within <=r ) 1.58

24 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos24 Outline Problems Self-similarity and power laws Solutions to posed problems Discussion

25 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos25 time #bytes Solution #1: traffic disk traces: self-similar: (also: [Leland+94]) How to generate such traffic?

26 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos26 Solution #1: traffic disk traces (80-20 ‘law’) – ‘multifractals’ time #bytes 20% 80%

27 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos27 80-20 / multifractals 20 80

28 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos28 80-20 / multifractals 20 p ; (1-p) in general yes, there are dependencies 80

29 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos29 Overview Goals/ motivation: find patterns in large datasets: –(A) Sensor data –(B) network/graph data Solutions: self-similarity and power laws –sensor/traffic data –network/graph data Discussion

30 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos30 Problem #2 - topology How does the Internet look like? Any rules?

31 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos31 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? degree count avg: 3.3

32 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos32 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? A: 1!! degree count avg: 3.3

33 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos33 Patterns? avg degree is, say 3.3 pick a node at random - what is the degree you expect it to have? A: 1!! A’: very skewed distr. Corollary: the mean is meaningless! (and std -> infinity (!)) degree count avg: 3.3

34 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos34 Solution#2: Rank exponent R A1: Power law in the degree distribution [SIGCOMM99] internet domains log(rank) log(degree) -0.82 att.com ibm.com

35 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos35 Solution#2’: Eigen Exponent E A2: power law in the eigenvalues of the adjacency matrix E = -0.48 Exponent = slope Eigenvalue Rank of decreasing eigenvalue May 2001

36 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos36 Solution#2’’: Hop Exponent H A3: neighborhood function N(h) = number of pairs within h hops or less - power law, too! log(hops) log(#pairs) 2.8 internet ‘hop’ exponent Hop exp. = 1 Hop exp. = 2

37 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos37 Power laws - discussion do they hold, over time? do they hold on other graphs/domains?

38 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos38 Power laws - discussion do they hold, over time? Yes! for multiple years [Siganos+] do they hold on other graphs/domains? Yes! –web sites and links [Tomkins+], [Barabasi+] –peer-to-peer graphs (gnutella-style) –who-trusts-whom (epinions.com)

39 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos39 Time Evolution: rank R The rank exponent has not changed! [Siganos+] Domain level log(rank) log(degree) - 0.82 att.com ibm.com

40 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos40 The Peer-to-Peer Topology Number of immediate peers (= degree), follows a power-law [Jovanovic+] degree count

41 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos41 epinions.com who-trusts-whom [Richardson + Domingos, KDD 2001] (out) degree count

42 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos42 Outline problems Fractals Solutions Discussion –what else can they solve? –how frequent are fractals?

43 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos43 What else can they solve? separability [KDD’02] forecasting [CIKM’02] dimensionality reduction [SBBD’00] non-linear axis scaling [KDD’02] disk trace modeling [PEVA’02] selectivity of spatial queries [PODS’94, VLDB’95, ICDE’00]...

44 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos44 Problem #3 - spatial d.m. Galaxies (Sloan Digital Sky Survey w/ B. Nichol) - ‘spiral’ and ‘elliptical’ galaxies - patterns? (not Gaussian; not uniform) -attraction/repulsion? - separability??

45 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos45 Solution#3: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! CORRELATION INTEGRAL!

46 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos46 Solution#3: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion! [w/ Seeger, Traina, Traina, SIGMOD00]

47 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos47 spatial d.m. r1r2 r1 r2 Heuristic on choosing # of clusters

48 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos48 Solution#3: spatial d.m. log(r) log(#pairs within <=r ) spi-spi spi-ell ell-ell - 1.8 slope - plateau! - repulsion!

49 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos49 Problem#4: dim. reduction given attributes x 1,... x n –possibly, non-linearly correlated drop the useless ones mpg cc

50 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos50 Problem#4: dim. reduction given attributes x 1,... x n –possibly, non-linearly correlated drop the useless ones (Q: why? A: to avoid the ‘dimensionality curse’) Solution: keep on dropping attributes, until the f.d. changes! [SBBD’00] mpg cc

51 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos51 Outline problems Fractals Solutions Discussion –what else can they solve? –how frequent are fractals?

52 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos52 Fractals & power laws: appear in numerous settings: medical geographical / geological social computer-system related

53 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos53 Fractals: Brain scans brain-scans octree levels Log(#octants) 2.63 = fd

54 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos54 More fractals periphery of malignant tumors: ~1.5 benign: ~1.3 [Burdet+]

55 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos55 More fractals: cardiovascular system: 3 (!) lungs: ~2.9

56 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos56 Fractals & power laws: appear in numerous settings: medical geographical / geological social computer-system related

57 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos57 More fractals: Coastlines: 1.2-1.58 1 1.1 1.3

58 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos58

59 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos59 Cross-roads of Montgomery county: any rules? GIS points

60 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos60 GIS A: self-similarity: intrinsic dim. = 1.51 log( r ) log(#pairs(within <= r)) 1.51

61 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos61 Examples:LB county Long Beach county of CA (road end-points) 1.7 log(r) log(#pairs)

62 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos62 More power laws: areas – Korcak’s law Scandinavian lakes Any pattern?

63 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos63 More power laws: areas – Korcak’s law Scandinavian lakes area vs complementary cumulative count (log-log axes) log(count( >= area)) log(area)

64 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos64 More power laws: Korcak Japan islands; area vs cumulative count (log-log axes) log(area) log(count( >= area))

65 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos65 More power laws Energy of earthquakes (Gutenberg-Richter law) [simscience.org] log(count) Magnitude = log(energy)day Energy released

66 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos66 Fractals & power laws: appear in numerous settings: medical geographical / geological social computer-system related

67 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos67 A famous power law: Zipf’s law Bible - rank vs. frequency (log-log) log(rank) log(freq) “a” “the” “Rank/frequency plot”

68 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos68 TELCO data # of service units count of customers ‘best customer’

69 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos69 SALES data – store#96 # units sold count of products “aspirin”

70 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos70 Olympic medals (Sidney 2000): log( rank) log(#medals)

71 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos71 Olympic medals (Athens04): log( rank) log(#medals)

72 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos72 Even more power laws: Income distribution (Pareto’s law) size of firms publication counts (Lotka’s law)

73 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos73 Even more power laws: library science (Lotka’s law of publication count); and citation counts: (citeseer.nj.nec.com 6/2001) log(#citations) log(count) Ullman

74 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos74 Even more power laws: web hit counts [w/ A. Montgomery] Web Site Traffic log(freq) log(count) Zipf “yahoo.com”

75 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos75 Fractals & power laws: appear in numerous settings: medical geographical / geological social computer-system related

76 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos76 Power laws, cont’d In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] log indegree - log(freq) from [Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan, Andrew Tomkins ]

77 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos77 Power laws, cont’d In- and out-degree distribution of web sites [Barabasi], [IBM-CLEVER] length of file transfers [Bestavros+] duration of UNIX jobs [Harchol-Balter]

78 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos78 Conclusions Fascinating problems in Data Mining: find patterns in –sensors/streams –graphs/networks

79 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos79 Conclusions - cont’d New tools for Data Mining: self-similarity & power laws: appear in many cases Bad news: lead to skewed distributions (no Gaussian, Poisson, uniformity, independence, mean, variance) Good news: ‘correlation integral’ for separability rank/frequency plots 80-20 (multifractals) (Hurst exponent, strange attractors, renormalization theory, ++)

80 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos80 Resources Manfred Schroeder “Chaos, Fractals and Power Laws”, 1991 Jiawei Han and Micheline Kamber “Data Mining: Concepts and Techniques”, 2001

81 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos81 References –[ieeeTN94] W. E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the Self-Similar Nature of Ethernet Traffic, IEEE Transactions on Networking, 2, 1, pp 1- 15, Feb. 1994. – [pods94] Christos Faloutsos and Ibrahim Kamel, Beyond Uniformity and Independence: Analysis of R- trees Using the Concept of Fractal Dimension, PODS, Minneapolis, MN, May 24-26, 1994, pp. 4-13

82 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos82 References –[vldb95] Alberto Belussi and Christos Faloutsos, Estimating the Selectivity of Spatial Queries Using the `Correlation' Fractal Dimension Proc. of VLDB, p. 299-310, 1995 –[vldb96] Christos Faloutsos, Yossi Matias and Avi Silberschatz, Modeling Skewed Distributions Using Multifractals and the `80-20 Law’ Conf. on Very Large Data Bases (VLDB), Bombay, India, Sept. 1996.

83 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos83 References –[vldb96] Christos Faloutsos and Volker Gaede Analysis of the Z-Ordering Method Using the Hausdorff Fractal Dimension VLD, Bombay, India, Sept. 1996 –[sigcomm99] Michalis Faloutsos, Petros Faloutsos and Christos Faloutsos, What does the Internet look like? Empirical Laws of the Internet Topology, SIGCOMM 1999

84 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos84 References –[icde99] Guido Proietti and Christos Faloutsos, I/O complexity for range queries on region data stored using an R-tree International Conference on Data Engineering (ICDE), Sydney, Australia, March 23-26, 1999 –[sigmod2000] Christos Faloutsos, Bernhard Seeger, Agma J. M. Traina and Caetano Traina Jr., Spatial Join Selectivity Using Power Laws, SIGMOD 2000

85 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos85 References –[kdd2001] Agma J. M. Traina, Caetano Traina Jr., Spiros Papadimitriou and Christos Faloutsos: Tri-plots: Scalable Tools for Multidimensional Data Mining, KDD 2001, San Francisco, CA.

86 School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos86 Thank you! Contact info: christos @ cs.cmu.edu www. cs.cmu.edu /~christos (w/ papers, datasets, code for fractal dimension estimation, etc)


Download ppt "School of Computer Science Carnegie Mellon UIUC 04C. Faloutsos1 Advanced Data Mining Tools: Fractals and Power Laws for Graphs, Streams and Traditional."

Similar presentations


Ads by Google