
1 Scalable Data Mining. The Auton Lab, Carnegie Mellon University, www.autonlab.org. Brigham Anderson, Andrew Moore, Dan Pelleg, Alex Gray, Bob Nichols, Andy Connolly.

2 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

3 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

4 Nearest Neighbor: the Naïve Approach. Given a query point X, scan through every point Y and compute the distance. Takes O(N) time per query! (Example figure: 33 distance computations.)
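
A minimal sketch of this naïve scan (illustrative Python, not from the slides; the data below are made up):

    import numpy as np

    def naive_nearest_neighbor(query, points):
        """Scan every point and keep the closest: O(N) distance computations per query."""
        best_idx, best_dist = None, float("inf")
        for i, p in enumerate(points):
            d = np.linalg.norm(query - p)          # Euclidean distance to this point
            if d < best_dist:
                best_idx, best_dist = i, d
        return best_idx, best_dist

    # Example: 33 random 2-D points -> 33 distance computations for one query
    points = np.random.rand(33, 2)
    print(naive_nearest_neighbor(np.array([0.5, 0.5]), points))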

5 Speeding Up Nearest Neighbor. We can speed up the search for the nearest neighbor: examine nearby points first, and ignore any points that are farther than the nearest point found so far. Do this using a kd-tree: a tree-based data structure that recursively partitions points into axis-aligned boxes.

6 KD-Tree Construction. We start with a list of n-dimensional points: Pt 1 (0.00, 0.00), Pt 2 (1.00, 4.31), Pt 3 (0.13, 2.85), …

7 KD-Tree Construction. We can split the points into two groups by choosing a dimension X and a value V and separating the points into X > V and X <= V. Here: split on X > 0.5. YES branch: Pt 2 (1.00, 4.31), …; NO branch: Pt 1 (0.00, 0.00), Pt 3 (0.13, 2.85), …

8 KD-Tree Construction. We can then consider each group separately and possibly split again (along the same or a different dimension). Split on X > 0.5. YES branch: Pt 2 (1.00, 4.31), …; NO branch: Pt 1 (0.00, 0.00), Pt 3 (0.13, 2.85), …

9 KD-Tree Construction. We can then consider each group separately and possibly split again (along the same or a different dimension). Split on X > 0.5. YES branch: Pt 2 (1.00, 4.31), …; NO branch splits again on Y > 0.1: YES: Pt 3 (0.13, 2.85), …; NO: Pt 1 (0.00, 0.00), …

10 KD-Tree Construction We can keep splitting the points in each set to create a tree structure. Each node with no children (leaf node) contains a list of points.

11 KD-Tree Construction. We will keep one additional piece of information at each node: the (tight) bounds of the points at or below this node.

12 KD-Tree Construction. Use heuristics to make splitting decisions: Which dimension do we split along? The widest one. Which value do we split at? The median of that dimension over the points. When do we stop? When there are fewer than m points left, OR the box has hit some minimum width.
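
A minimal construction sketch following these heuristics (illustrative Python, not the Auton Lab code): split along the widest dimension at the median, stop at m points or a minimum box width, and store tight bounds at every node.

    import numpy as np

    class Node:
        def __init__(self, points):
            self.lo, self.hi = points.min(axis=0), points.max(axis=0)  # tight bounds
            self.points = points                  # kept only at leaves
            self.split_dim = self.split_val = None
            self.left = self.right = None

    def build_kdtree(points, m=2, min_width=1e-6):
        node = Node(points)
        widths = node.hi - node.lo
        if len(points) <= m or widths.max() <= min_width:
            return node                           # stop: few points or tiny box
        d = int(np.argmax(widths))                # split along the widest dimension
        v = float(np.median(points[:, d]))        # split at the median value
        right_mask = points[:, d] > v
        if right_mask.all() or not right_mask.any():
            return node                           # degenerate split -> make this a leaf
        node.split_dim, node.split_val = d, v
        node.left = build_kdtree(points[~right_mask], m, min_width)   # X <= V
        node.right = build_kdtree(points[right_mask], m, min_width)   # X > V
        node.points = None                        # interior nodes hold no point list
        return node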

13 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

14 Nearest Neighbor with KD Trees We traverse the tree looking for the nearest neighbor of the query point.

15 Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first.

16 Nearest Neighbor with KD Trees Examine nearby points first: Explore the branch of the tree that is closest to the query point first.

17 Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node.

18 Nearest Neighbor with KD Trees When we reach a leaf node: compute the distance to each point in the node.

19 Nearest Neighbor with KD Trees Then we can backtrack and try the other branch at each node visited.

20 Nearest Neighbor with KD Trees. Each time a new closest point is found, we can update the distance bound.

21 Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.

22 Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.

23 Nearest Neighbor with KD Trees Using the distance bounds and the bounds of the data below each node, we can prune parts of the tree that could NOT include the nearest neighbor.
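
A minimal search sketch reusing the Node/build_kdtree sketch above (illustrative Python): explore the closer child first, scan points at leaves, and prune any subtree whose bounding box cannot contain anything closer than the best distance found so far.

    import numpy as np

    def min_dist_to_box(q, lo, hi):
        """Lower bound on the distance from q to any point in the box [lo, hi]."""
        return np.linalg.norm(np.maximum(lo - q, 0) + np.maximum(q - hi, 0))

    def kdtree_nn(node, q, best=(None, float("inf"))):
        if min_dist_to_box(q, node.lo, node.hi) >= best[1]:
            return best                            # prune: box cannot beat current best
        if node.split_dim is None:                 # leaf: check each stored point
            for p in node.points:
                d = np.linalg.norm(q - p)
                if d < best[1]:
                    best = (p, d)
            return best
        if q[node.split_dim] <= node.split_val:
            near, far = node.left, node.right
        else:
            near, far = node.right, node.left
        best = kdtree_nn(near, q, best)            # examine nearby points first
        best = kdtree_nn(far, q, best)             # backtrack; pruned if hopeless
        return best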

24 Metric Trees. Kd-trees are rendered worse-than-useless in higher dimensions. Metric trees only require a metric space (a well-behaved distance function).

25 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

26 What does k-means do?

27 K-means. 1. Ask user how many clusters they'd like (e.g. k=5).

28 K-means. 1. Ask user how many clusters they'd like (e.g. k=5). 2. Randomly guess k cluster Center locations.

29 K-means. 1. Ask user how many clusters they'd like (e.g. k=5). 2. Randomly guess k cluster Center locations. 3. Each datapoint finds out which Center it's closest to. (Thus each Center "owns" a set of datapoints.)

30 K-means. 1. Ask user how many clusters they'd like (e.g. k=5). 2. Randomly guess k cluster Center locations. 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns.

31 K-means. 1. Ask user how many clusters they'd like (e.g. k=5). 2. Randomly guess k cluster Center locations. 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns… 5. …and jumps there. 6. …Repeat until terminated! For a basic tutorial on k-means, see any machine learning textbook, or www.cs.cmu.edu/~awm/tutorials/kmeans.html
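
A minimal sketch of these six steps (illustrative Python, not the Auton Lab implementation):

    import numpy as np

    def kmeans(X, k, n_iter=100, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]       # step 2: random Centers
        for _ in range(n_iter):
            dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
            owner = dists.argmin(axis=1)                             # step 3: closest Center owns the point
            new_centers = np.array([X[owner == j].mean(axis=0) if (owner == j).any()
                                    else centers[j] for j in range(k)])  # steps 4-5: jump to centroid
            if np.allclose(new_centers, centers):                    # step 6: repeat until converged
                break
            centers = new_centers
        return centers, owner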

32 K-means. 1. Ask user how many clusters they'd like (e.g. k=5). 2. Randomly guess k cluster Center locations. 3. Each datapoint finds out which Center it's closest to. 4. Each Center finds the centroid of the points it owns.

33 K-means search guts. I must compute Σx_i over all points I own.

34 K-means search guts. I will compute Σx_i of all points I own in this rectangle.

35 K-means search guts. I will compute Σx_i of all points I own in the left subrectangle, then the right subrectangle, then add 'em.

36 In recursive call… I will compute Σx_i of all points I own in this rectangle.

37 In recursive call… I will compute Σx_i of all points I own in this rectangle. 1. Find the center nearest to the rectangle.

38 In recursive call… I will compute Σx_i of all points I own in this rectangle. 1. Find the center nearest to the rectangle. 2. For each other center: can it own any points in the rectangle?

39 In recursive call… I will compute Σx_i of all points I own in this rectangle. 1. Find the center nearest to the rectangle. 2. For each other center: can it own any points in the rectangle?

40 Pruning. 1. Find the center nearest to the rectangle. 2. For each other center: can it own any points in the rectangle? 3. If not, PRUNE: I will just grab Σx_i from the cached value in the node.

41 Pruning not possible at the previous level… I will compute Σx_i of all points I own in this rectangle. 1. Find the center nearest to the rectangle. 2. For each other center: can it own any points in the rectangle? 3. If yes, recurse... But… maybe there's some optimization possible anyway?

42 Blacklisting. I will compute Σx_i of all points I own in this rectangle. 1. Find the center nearest to the rectangle. 2. For each other center: can it own any points in the rectangle? 3. If yes, recurse... But… maybe there's some optimization possible anyway? A hopeless center never needs to be considered in any recursion below this node.
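
A sketch of one common "hopelessness" test in the spirit of these slides (illustrative Python; the exact geometric reasoning used in the Auton Lab code may differ): given the kd-tree node's bounding box and the center nearest to it, another center can be blacklisted if even the box corner most favorable to it is still closer to the nearest center. A blacklisted center stays blacklisted for every node below, and if only one center survives, the node's cached Σx_i and point count can be credited to that center without visiting its points.

    import numpy as np

    def center_is_hopeless(c_near, c_other, lo, hi):
        """Can c_other own ANY point inside the box [lo, hi], given rival c_near?"""
        direction = c_other - c_near
        corner = np.where(direction > 0, hi, lo)   # box corner pushed furthest toward c_other
        # If c_near is at least as close even to that corner, c_other owns nothing here.
        return np.linalg.norm(corner - c_near) <= np.linalg.norm(corner - c_other)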

43 Example. Generated by: Dan Pelleg and Andrew Moore, "Accelerating Exact k-means Algorithms with Geometric Reasoning," Proc. Conference on Knowledge Discovery in Databases, 1999 (KDD99). (Available on www.autonlab.org)

44 K-means continues …

45

46

47

48

49

50

51

52 K-means terminates

53 Comparison to a linear algorithm: astrophysics data (2 dimensions).

54 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

55 What if we want to do density estimation with multimodal or clumpy data?

56 The GMM assumption. There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector μ_i. (Figure: μ_1, μ_2, μ_3.)

57 The GMM assumption. There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector μ_i. Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: (Figure: μ_1, μ_2, μ_3.)

58 The GMM assumption. There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector μ_i. Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i). (Figure: μ_2.)

59 The GMM assumption. There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector μ_i. Each component generates data from a Gaussian with mean μ_i and covariance matrix σ²I. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i). 2. Datapoint ~ N(μ_i, σ²I). (Figure: μ_2 and the generated datapoint x.)

60 The General GMM assumption. There are k components. The i'th component is called ω_i. Component ω_i has an associated mean vector μ_i. Each component generates data from a Gaussian with mean μ_i and covariance matrix Σ_i. Assume that each datapoint is generated according to the following recipe: 1. Pick a component at random. Choose component i with probability P(ω_i). 2. Datapoint ~ N(μ_i, Σ_i). (Figure: μ_1, μ_2, μ_3.)
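
A minimal sketch of this generative recipe (illustrative Python; the weights, means, and covariances below are made-up examples):

    import numpy as np

    def sample_gmm(n, weights, means, covs, seed=0):
        """Step 1: pick component i with probability P(w_i). Step 2: draw x ~ N(mu_i, Sigma_i)."""
        rng = np.random.default_rng(seed)
        weights, means, covs = map(np.asarray, (weights, means, covs))
        comps = rng.choice(len(weights), size=n, p=weights)                            # step 1
        return np.array([rng.multivariate_normal(means[i], covs[i]) for i in comps])   # step 2

    # Example: 500 points drawn from three 2-D components
    X = sample_gmm(500,
                   weights=[0.5, 0.3, 0.2],
                   means=[[0, 0], [4, 4], [0, 5]],
                   covs=[np.eye(2), 0.5 * np.eye(2), np.diag([2.0, 0.3])])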

61 Gaussian Mixture Example: Start Advance apologies: in Black and White this example will be incomprehensible

62 After first iteration

63 After 2nd iteration

64 After 3rd iteration

65 After 4th iteration

66 After 5th iteration

67 After 6th iteration

68 After 20th iteration

69 Some Bio Assay data

70 GMM clustering of the assay data

71 Resulting Density Estimator. For a basic tutorial on Gaussian Mixture Models, see the Hastie et al. or Duda, Hart & Stork textbooks, or www.cs.cmu.edu/~awm/tutorials/gmm.html
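
The fitted mixture gives the density estimate p(x) = Σ_i P(ω_i) N(x; μ_i, Σ_i). A minimal sketch of evaluating it (illustrative Python, using SciPy for the Gaussian pdf; the parameters are whatever the fitting procedure produced):

    import numpy as np
    from scipy.stats import multivariate_normal

    def gmm_density(x, weights, means, covs):
        """Evaluate p(x) = sum_i P(w_i) * N(x; mu_i, Sigma_i) for the fitted mixture."""
        return sum(w * multivariate_normal.pdf(x, mean=m, cov=c)
                   for w, m, c in zip(weights, means, covs))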

72

73

74

75 Uses for kd-trees and cousins K-Means clustering [Pelleg and Moore, 99], [Moore 2000] Kernel Density Estimation [Deng & Moore, 1995], [Gray & Moore 2001] Kernel-density-based clustering [Wong and Moore, 2002] Gaussian Mixture Model [Moore, 1999] Kernel Regression [Deng and Moore, 1995] Locally weighted regression [Moore, Schneider, Deng, 97] Kernel-based Bayes Classifiers [Moore and Schneider, 97] N-point correlation functions [Gray and Moore 2001] Also work by Priebe, Ramikrishnan, Schaal, D’Souza, Elkan,… Papers (and software): www.autonlab.org

76 Uses for kd-trees and cousins K-Means clustering [Pelleg and Moore, 99], [Moore 2000] Kernel Density Estimation [Deng & Moore, 1995], [Gray & Moore 2001] Kernel-density-based clustering [Wong and Moore, 2002] Gaussian Mixture Model [Moore, 1999] Kernel Regression [Deng and Moore, 1995] Locally weighted regression [Moore, Schneider, Deng, 97] Kernel-based Bayes Classifiers [Moore and Schneider, 97] N-point correlation functions [Gray and Moore 2001] Also work by Priebe, Ramikrishnan, Schaal, D’Souza, Elkan,… Papers (and software): www.autonlab.org

77 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology. (autonlab.org)

78 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA. (autonlab.org)

79

80 Image space (~4096 dim)

81 galaxies

82

83

84 demo

85

86 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA. (autonlab.org)

87 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA. (autonlab.org)

88 Parameter space (12 dim) → Image space (~4096 dim): the Model generates synthetic galaxies.

89 Parameter space (12 dim) → Image space (~4096 dim): the Model, plus a Target Image in image space.

90 Parameter space (12 dim) → Image space (~4096 dim): the Model, plus a Target Image in image space.

91 Parameter space (12 dim) → Image space (~4096 dim): the Model, plus a Target Image in image space.
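
One way to read these slides, sketched below (illustrative Python; GMorph's memory-based method is not spelled out here, and "render" is a hypothetical stand-in for the model that maps a 12-dim parameter vector to a ~4096-dim image): build a library of synthetic galaxies, then answer a target image by nearest-neighbor lookup in image space.

    import numpy as np

    def build_library(render, param_samples):
        """Render a synthetic image for each sampled 12-dim parameter vector."""
        return np.asarray(param_samples), np.array([render(p) for p in param_samples])

    def match_target(target_image, params, images):
        """Return the parameters of the synthetic image closest to the target."""
        d = ((images - target_image) ** 2).sum(axis=1)   # squared distance in image space
        return params[np.argmin(d)]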

92 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA. (autonlab.org)

93

94

95

96 Parameter space (12 dim) → Image space (~4096 dim) → Eigenspace (~16 dim): the Model and the Target Image are compared in the eigenspace.

97 Parameter space (12 dim) → Image space (~4096 dim) → Eigenspace (~16 dim): the Model and the Target Image are compared in the eigenspace.

98 Parameter space (12 dim) → Image space (~4096 dim) → Eigenspace (~16 dim): the Model and the Target Image are compared in the eigenspace.
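
The slides do not spell out how the eigenspace is built; a standard PCA sketch under that assumption (illustrative Python): learn the top ~16 principal directions from a set of images, then compare model and target in that low-dimensional space.

    import numpy as np

    def fit_eigenspace(images, k=16):
        """images: (n, ~4096) array. Keep the mean and the top-k principal directions."""
        mean = images.mean(axis=0)
        # SVD of the centered data gives the eigenvectors of the covariance matrix.
        _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
        return mean, vt[:k]

    def project(image, mean, components):
        """Map a ~4096-dim image to its ~16-dim eigenspace coordinates."""
        return components @ (image - mean)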

99 snapshot

100 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA • Results. (autonlab.org)

101

102

103

104 Outline: Kd-trees • Fast nearest-neighbor finding • Fast K-means clustering • Fast kernel density estimation • Large-scale galactic morphology: GMorph • Memory-based • PCA • Results. Papers: see the AUTON website, www.autonlab.org

