1
Clustering Analysis of Spatial Data Using Peano Count Trees Qiang Ding William Perrizo Department of Computer Science North Dakota State University, USA (Ptree technology is patented by NDSU)
2
Overview Introduction Data Structures Clustering Algorithms based on Partitioning Our Approach Example Conclusion & Discussion
3
Introduction Existing methods are not always suitable for cluster analysis because of the size of spatial datasets. The Peano Count Tree (PC-tree) provides a lossless, compressed, clustering-ready representation of a spatial dataset. We introduce an efficient clustering method that uses this structure.
4
Background on Spatial Data
Band – attribute
Pixel – transaction (tuple)
Value – 0~255 (one byte)
Different kinds of images have different numbers of bands:
–TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
–TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
–TIFF: 3 bands (B, G, R)
–Ground data: individual bands (Yield, Moisture, Nitrate, Temperature)
5
Spatial Data Formats Existing formats –BSQ (Band Sequential) –BIL (Band Interleaved by Line) –BIP (Band Interleaved by Pixel) New format –bSQ (bit Sequential)
6
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19
7
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19 BIL format (1 file) 254 127 37 240 14 193 200 19
8
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19 BIL format (1 file) 254 127 37 240 14 193 200 19 BIP format (1 file) 254 37 127 240 14 200 193 19
9
Spatial Data Formats (Cont.)
BAND-1: 254 (1111 1110), 127 (0111 1111) / 14 (0000 1110), 193 (1100 0001)
BAND-2: 37 (0010 0101), 240 (1111 0000) / 200 (1100 1000), 19 (0001 0011)
BSQ format (2 files) – Band 1: 254 127 14 193; Band 2: 37 240 200 19
BIL format (1 file) – 254 127 37 240 14 193 200 19
BIP format (1 file) – 254 37 127 240 14 200 193 19
bSQ format (16 files) – one file per band and bit position; each file below is the 2×2 bit grid for that bit (MSB first):
B11: 10/01  B12: 11/01  B13: 11/00  B14: 11/00  B15: 11/10  B16: 11/10  B17: 11/10  B18: 01/01
B21: 01/10  B22: 01/10  B23: 11/00  B24: 01/01  B25: 00/10  B26: 10/00  B27: 00/01  B28: 10/01
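The three existing orderings can be reproduced mechanically. A minimal Python sketch (not from the original slides) that serializes the 2×2, 2-band example above under each format:

```python
# Illustrative sketch: serializing the slide's 2x2, 2-band image
# under the BSQ, BIL, and BIP orderings.
band1 = [[254, 127],
         [14, 193]]
band2 = [[37, 240],
         [200, 19]]

# BSQ: one file per band, pixels in raster order.
bsq = {1: [v for row in band1 for v in row],
       2: [v for row in band2 for v in row]}

# BIL: one file, bands interleaved line by line.
bil = [v for r in range(2) for band in (band1, band2) for v in band[r]]

# BIP: one file, bands interleaved pixel by pixel.
bip = [band[r][c] for r in range(2) for c in range(2) for band in (band1, band2)]
```

The outputs match the file layouts listed on this slide.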
10
bSQ Format Reasons for using the bSQ format: –Different bits contribute to the value differently; bSQ facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision). –bSQ facilitates the creation of efficient structures: P-trees and the P-tree algebra. Example: a Landsat Thematic Mapper (TM) satellite image is in BSQ format with 7 bands, B1,…,B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values per band. In this case, the bSQ format consists of 56 separate files, B11,…,B78, each containing ~40,000,000 bits.
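The bSQ decomposition itself is a simple bit-plane split. A hedged sketch (the function name to_bsq_bits is invented for illustration):

```python
# Illustrative sketch: splitting one band of 8-bit values into
# eight bSQ bit files (bit planes), most significant bit first.
def to_bsq_bits(band, nbits=8):
    """Return a list of bit planes; plane j holds bit j+1 (MSB-first) of each value."""
    flat = [v for row in band for v in row]   # raster order
    return [[(v >> (nbits - 1 - j)) & 1 for v in flat] for j in range(nbits)]

# The slide's BAND-1 example: 254, 127 / 14, 193.
planes = to_bsq_bits([[254, 127], [14, 193]])
```

Plane 0 corresponds to file B11, plane 7 to file B18.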
11
Peano Count Tree (PC-tree) P-trees represent spatial data in a bit-by-bit, recursive, quadrant-by-quadrant arrangement. P-trees are lossless representations of the original data. P-trees are compressed structures.
12
An example of a P-tree (terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID))
bSQ bit file arranged as a spatial dataset (2-D raster order):
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
PC-tree (root count 55; fan-out 4; one count per quadrant, in Peano order):
          55
 ______/ / \ \______
 /    /       \    \
16  __8__   _15__  16
   / /|\ \ / /|\ \
   3 0 4 1 4 4 3 4
 //|\    //|\  //|\
 1110    0010  1101
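The construction above can be written as a short recursion. This is an illustrative sketch, not the authors' implementation (the names build_pctree and grid are invented):

```python
# Illustrative sketch: building a Peano Count Tree from a 2^k x 2^k
# bit matrix. Each node stores its quadrant's 1-bit count; pure-0 and
# pure-1 quadrants become leaves, which is what compresses the data.
def build_pctree(bits, r0=0, c0=0, size=None):
    if size is None:
        size = len(bits)
    count = sum(bits[r][c] for r in range(r0, r0 + size)
                           for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, None)                     # pure quadrant: leaf
    h = size // 2
    children = [build_pctree(bits, r, c, h)      # Peano (Z) order
                for r in (r0, r0 + h) for c in (c0, c0 + h)]
    return (count, children)

# The slide's 8x8 example.
grid = [[int(b) for b in row] for row in (
    "11111100", "11111000", "11111100", "11111110",
    "11111111", "11111111", "11111111", "01111111")]
root = build_pctree(grid)   # root count 55; quadrant counts 16, 8, 15, 16
```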
13
An example of a P-tree (cont.): QID (Quadrant ID)
At each level the four quadrants are numbered 0, 1, 2, 3 in Peano (Z) order; a node's QID is the sequence of quadrant numbers along the path from the root (e.g., 2.2.3).
A pixel's QID comes from interleaving the bits of its coordinates: pixel (7, 1) = (111, 001) in binary interleaves to 10.10.11 = 2.2.3, its Peano path.
14
P-tree Algebra: And, Or, Complement, other (XOR, etc.)
PC-tree: 55
 ______/ / \ \______
 /    /       \    \
16  __8__   _15__  16
   / /|\ \ / /|\ \
   3 0 4 1 4 4 3 4
 //|\    //|\  //|\
 1110    0010  1101
Its complement (counts 0’s, not 1’s): 9
 ______/ / \ \______
 /    /       \    \
 0  __8__   __1__   0
   / /|\ \ / /|\ \
   1 4 0 3 0 0 1 0
 //|\    //|\  //|\
 0001    1101  0010
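The complement operation can be sketched directly on (count, children) nodes; AND follows the same recursion, with pure-1 nodes acting as identities and pure-0 nodes as annihilators. This is an illustrative sketch, not the patented P-tree implementation:

```python
# Illustrative sketch: complementing a P-tree without touching the raster.
# A node is (count, children); children is None for a pure quadrant.
# A node covering an n x n quadrant complements to count n*n - count.
def complement(node, size):
    count, children = node
    comp = size * size - count
    if children is None:
        return (comp, None)
    h = size // 2
    return (comp, [complement(ch, h) for ch in children])

# The slide's PC-tree with root count 55 (leaves are 1-bit nodes).
bits = lambda s: [(int(b), None) for b in s]
tree = (55, [
    (16, None),
    (8, [(3, bits("1110")), (0, None), (4, None), (1, bits("0010"))]),
    (15, [(4, None), (4, None), (3, bits("1101")), (4, None)]),
    (16, None),
])
comp = complement(tree, 8)   # root count 9, as on the slide
```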
15
Basic, Value and Tuple P-trees
Basic P-trees: P11, P12, …, P18, P21, …, P28, …, P71, …, P78
Value or interval P-trees: P1,5 = P1,101 = P11 ^ P12’ ^ P13
Tuple P-trees: P(5,2,7) = P(101,010,111) = P1,101 ^ P2,010 ^ P3,111 = P11 ^ P12’ ^ P13 ^ P21’ ^ P22 ^ P23’ ^ P31 ^ P32 ^ P33
Notational alternatives: P1,5 = P1,101 = P(101,,)
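On uncompressed bit planes, a value P-tree's membership mask is just the AND of planes or complemented planes, one per bit of the value; a tuple mask ANDs the per-band value masks. A hedged sketch (value_mask is an invented name):

```python
# Illustrative sketch: deriving a value P-tree's bit mask from basic
# bit planes. planes[j][i] is bit j (MSB-first) of the band value at pixel i.
def value_mask(planes, value_bits):
    """AND together plane (bit 1) or complemented plane (bit 0) per value bit."""
    mask = [1] * len(planes[0])
    for j, b in enumerate(value_bits):
        mask = [m & (p if b else 1 - p) for m, p in zip(mask, planes[j])]
    return mask

# Toy band of 3-bit values: 5=101, 2=010, 7=111, 5=101.
band1_vals = [5, 2, 7, 5]
planes = [[(v >> (2 - j)) & 1 for v in band1_vals] for j in range(3)]
mask5 = value_mask(planes, [1, 0, 1])   # pixels where band 1 == 5
```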
16
Clustering Methods A categorization of major clustering methods:
–Partitioning methods: K-means, K-medoids, …
–Hierarchical methods: agglomerative, divisive, …
–Density-based methods
–Grid-based methods
–Model-based methods
17
The K-Means Clustering Method Given k, the k-means algorithm is implemented as follows: –Partition the objects into k nonempty subsets. 1.Compute the seed points (centroids, or mean points) of the clusters of the current partition. 2.Assign each object to the cluster with the nearest seed point. –Repeat steps 1–2 until some stopping condition is satisfied.
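The loop above, sketched for 1-D points with Euclidean distance (an illustrative toy, not a production k-means; all names invented):

```python
# Illustrative sketch of the k-means loop: assign to nearest centroid,
# recompute means, stop when the centroids no longer change.
def kmeans(points, k, iters=100):
    centroids = points[:k]                      # arbitrary initial seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest seed point
            j = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                    # stopping condition: no change
            break
        centroids = new
    return centroids, clusters
```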
18
The K-Means Clustering Method Strength –Relatively efficient: O(nkt), where n = number of objects, k = number of clusters, and t = number of iterations; normally k, t << n. Weakness –Requires a metric so that the mean is defined. –Needs the number of clusters, k, specified in advance. –Sensitive to noisy data and outliers, since a small number of such points can substantially influence the mean value.
19
The K-Medoids Clustering Method Find representative objects, called medoids, in clusters: often a “middle-ish” or “median” object. PAM (Partitioning Around Medoids, 1987) Pick k medoids; check all (medoid, non-medoid) pairs for improved clustering; if found, replace the medoid. Repeat until some stopping condition. PAM is effective for small data sets but does not scale well to large data sets. CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990) Draw many sample sets; apply PAM to each; return the best clustering. CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng & Han, 1994) Similar to CLARA, except a graph is used to guide replacements.
20
PAM (Partitioning Around Medoids) Use real objects to represent the clusters: –Select k representative objects arbitrarily. –For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih. –For each pair of i and h: if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object. –Repeat steps 2–3 until there is no change.
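The swap test in steps 2–3 can be sketched with a total-cost function (1-D points for brevity; an illustrative sketch only, with invented names):

```python
# Illustrative PAM sketch: the cost of a medoid set is the sum of each
# point's distance to its nearest medoid; a (medoid, non-medoid) swap
# is kept only if it lowers that total cost.
def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                  # arbitrary initial medoids
    improved = True
    while improved:                             # repeat until no change
        improved = False
        for i, _ in enumerate(medoids):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return sorted(medoids)
```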
21
CLARA (Clustering Large Applications) It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: –Efficiency depends on the sample size. –A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
22
CLARANS (A Clustering Algorithm based on Randomized Search) CLARANS draws a sample of neighbors dynamically. The clustering process can be viewed as searching a graph in which every node is a potential solution, that is, a set of k medoids. When a local optimum is found, CLARANS restarts from a new randomly selected node in search of a new local optimum. It is more efficient and scalable than both PAM and CLARA.
23
Our Approach
–Represent the original data set as interval P-trees using the higher-order-bit concept hierarchy (an interval corresponds to a value at a higher level of the hierarchy).
–These P-trees can be viewed as groups of very similar data elements.
–Prune out outliers by disregarding sparse groups:
Input: total number of objects N, all interval P-trees, pruning criteria (e.g., a root-count threshold and an outlier percentage t)
Output: the interval P-trees that survive pruning
(1) Choose the interval P-tree Pv with the smallest root count.
(2) Apply the pruning criteria (e.g., RC(Pv) < threshold and (ol := ol + RC(Pv))/N < t); if they hold, remove Pv and repeat from (1); stop when the criteria fail.
–Find clusters by
–traversing P-tree levels until there are k groups, or
–using PAM, where each interval P-tree is one object.
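The pruning loop might be sketched as follows, assuming each interval's root count RC(Pv) is available as a precomputed number (all names and the exact stopping test are illustrative assumptions, not the authors' code):

```python
# Illustrative sketch of the outlier-pruning step: repeatedly drop the
# sparsest interval group while its root count is below a threshold and
# the total removed fraction stays within the outlier percentage t.
def prune(root_counts, n_total, threshold, t):
    """root_counts: {interval_id: root count of its interval P-tree}."""
    kept, removed = dict(root_counts), 0
    while kept:
        v = min(kept, key=kept.get)             # smallest root count first
        if kept[v] >= threshold or (removed + kept[v]) / n_total > t:
            break                               # pruning criteria fail: stop
        removed += kept.pop(v)
    return kept
```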
24
Example
BSQ (4×4 image, 4 bands; pixel (row,col) | B1 | B2 | B3 | B4):
0,0 | 0011 | 0111 | 1000 | 1011
0,1 | 0011 | 0011 | 1000 | 1111
0,2 | 0111 | 0011 | 0100 | 1011
0,3 | 0111 | 0010 | 0101 | 1011
1,0 | 0011 | 0111 | 1000 | 1011
1,1 | 0011 | 0011 | 1000 | 1011
1,2 | 0111 | 0011 | 0100 | 1011
1,3 | 0111 | 0010 | 0101 | 1011
2,0 | 0010 | 1011 | 1000 | 1111
2,1 | 0010 | 1011 | 1000 | 1111
2,2 | 1010 | 1010 | 0100 | 1011
2,3 | 1111 | 1010 | 0100 | 1011
3,0 | 0010 | 1011 | 1000 | 1111
3,1 | 1010 | 1011 | 1000 | 1111
3,2 | 1111 | 1010 | 0100 | 1011
3,3 | 1111 | 1010 | 0100 | 1011
bSQ (band 1 shown; one 4×4 bit file per bit position):
B11: 0000 / 0000 / 0011 / 0111
B12: 0011 / 0011 / 0001 / 0011
B13: 1111 / 1111 / 1111 / 1111
B14: 1111 / 1111 / 0001 / 0011
(B21, B22, B23, B24 and the band-3 and band-4 files are formed the same way)
Basic P-trees: P11, …, P14, P21, …, P24, P31, …, P34, P41, …, P44 and their complements P11’, …, P44’
Value P-trees (16 per band): P1,0000, P1,0001, …, P1,1111; P2,0000, …, P2,1111; P3,0000, …, P3,1111; P4,0000, …, P4,1111
Tuple P-trees with non-zero root counts (root count; quadrant counts in Peano order; leaf):
P(0010,1011,1000,1111): 3; 0 0 3 0; 1110
P(1010,1010,0100,1011): 1; 0 0 0 1; 1000
P(1010,1011,1000,1111): 1; 0 0 1 0; 0001
P(0011,0011,1000,1011): 1; 1 0 0 0; 0001
P(0011,0011,1000,1111): 1; 1 0 0 0; 0100
P(0011,0111,1000,1011): 2; 2 0 0 0; 1010
P(0111,0010,0101,1011): 2; 0 2 0 0; 0101
P(0111,0011,0100,1011): 2; 0 2 0 0; 1010
P(1111,1010,0100,1011): 3; 0 0 0 3; 0111
25
P-tree Performance Average time required to perform the multi-operand ANDing operation on a TM file (~40 million pixels)
26
Conclusion & Discussion PAM is not efficient for medium and large data sets. CLARA and CLARANS draw samples from the original data randomly. Our algorithm (using P-trees, a lossless, data-mining-ready data structure) does not draw samples; it groups the data first: –each interval P-tree can be viewed as a group; –PAM then only needs to deal with the P-trees, and the number of P-trees is much smaller than the data sets CLARA and CLARANS must handle; –because P-tree ANDing is very fast, our algorithm is very fast.