1
Clustering Analysis of Spatial Data Using Peano Count Trees Qiang Ding William Perrizo Department of Computer Science North Dakota State University, USA (Ptree technology is patented by NDSU)
2
Overview Introduction Data Structures Clustering Algorithms based on Partitioning Our Approach Example Conclusion & Discussion
3
Introduction Existing methods are not always suitable for cluster analysis because of the size of spatial datasets. The Peano Count Tree (PC-tree) provides a lossless, compressed, clustering-ready representation of a spatial dataset. We introduce an efficient clustering method that uses this structure.
4
Background on Spatial Data
Band – attribute
Pixel – transaction (tuple)
Value – 0~255 (one byte)
Different kinds of images have different numbers of bands:
–TM4/5: 7 bands (B, G, R, NIR, MIR, TIR, MIR2)
–TM7: 8 bands (B, G, R, NIR, MIR, TIR, MIR2, PC)
–TIFF: 3 bands (B, G, R)
–Ground data: individual bands (Yield, Moisture, Nitrate, Temperature)
5
Spatial Data Formats Existing formats –BSQ (Band Sequential) –BIL (Band Interleaved by Line) –BIP (Band Interleaved by Pixel) New format –bSQ (bit Sequential)
6
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19
7
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19 BIL format (1 file) 254 127 37 240 14 193 200 19
8
Spatial Data Formats (Cont.) BAND-1 254 127 (1111 1110) (0111 1111) 14 193 (0000 1110) (1100 0001) BAND-2 37 240 (0010 0101) (1111 0000) 200 19 (1100 1000) (0001 0011) BSQ format (2 files) Band 1: 254 127 14 193 Band 2: 37 240 200 19 BIL format (1 file) 254 127 37 240 14 193 200 19 BIP format (1 file) 254 37 127 240 14 200 193 19
9
Spatial Data Formats (Cont.)
BAND-1: 254 (1111 1110), 127 (0111 1111) / 14 (0000 1110), 193 (1100 0001)
BAND-2: 37 (0010 0101), 240 (1111 0000) / 200 (1100 1000), 19 (0001 0011)
BSQ format (2 files) – Band 1: 254 127 14 193; Band 2: 37 240 200 19
BIL format (1 file) – 254 127 37 240 14 193 200 19
BIP format (1 file) – 254 37 127 240 14 200 193 19
bSQ format (16 files) – one file per band and bit position; each file below is the 2×2 bit grid for that bit (MSB first):
B11: 10/01  B12: 11/01  B13: 11/00  B14: 11/00  B15: 11/10  B16: 11/10  B17: 11/10  B18: 01/01
B21: 01/10  B22: 01/10  B23: 11/00  B24: 01/01  B25: 00/10  B26: 10/00  B27: 00/01  B28: 10/01
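The three existing orderings can be reproduced mechanically. A minimal Python sketch (not from the original slides) that serializes the 2×2, 2-band example above under each format:

```python
# Illustrative sketch: serializing the slide's 2x2, 2-band image
# under the BSQ, BIL, and BIP orderings.
band1 = [[254, 127],
         [14, 193]]
band2 = [[37, 240],
         [200, 19]]

# BSQ: one file per band, pixels in raster order.
bsq = {1: [v for row in band1 for v in row],
       2: [v for row in band2 for v in row]}

# BIL: one file, bands interleaved line by line.
bil = [v for r in range(2) for band in (band1, band2) for v in band[r]]

# BIP: one file, bands interleaved pixel by pixel.
bip = [band[r][c] for r in range(2) for c in range(2) for band in (band1, band2)]
```

The outputs match the file layouts listed on this slide.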
10
bSQ Format Reasons for using the bSQ format: –Different bits contribute to the value differently; bSQ facilitates the representation of a precision hierarchy (from 1-bit up to 8-bit precision). –bSQ facilitates the creation of efficient structures: P-trees and the P-tree algebra. Example: a Landsat Thematic Mapper (TM) satellite image is in BSQ format with 7 bands, B1,…,B7 (Landsat-7 has 8), and ~40,000,000 8-bit data values per band. In this case, the bSQ format consists of 56 separate files, B11,…,B78, each containing ~40,000,000 bits.
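The bSQ decomposition itself is a simple bit-plane split. A hedged sketch (the function name to_bsq_bits is invented for illustration):

```python
# Illustrative sketch: splitting one band of 8-bit values into
# eight bSQ bit files (bit planes), most significant bit first.
def to_bsq_bits(band, nbits=8):
    """Return a list of bit planes; plane j holds bit j+1 (MSB-first) of each value."""
    flat = [v for row in band for v in row]   # raster order
    return [[(v >> (nbits - 1 - j)) & 1 for v in flat] for j in range(nbits)]

# The slide's BAND-1 example: 254, 127 / 14, 193.
planes = to_bsq_bits([[254, 127], [14, 193]])
```

Plane 0 corresponds to file B11, plane 7 to file B18.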
11
Peano Count Tree (PC-tree) P-trees represent spatial data in a bit-by-bit, recursive, quadrant-by-quadrant arrangement. P-trees are lossless representations of the original data. P-trees are compressed structures.
12
An example of a P-tree (terms: Peano or Z-ordering; pure (pure-1/pure-0) quadrant; root count; level; fan-out; QID (quadrant ID))
bSQ bit file arranged as a spatial dataset (2-D raster order):
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
PC-tree (root count 55; fan-out 4; one count per quadrant, in Peano order):
          55
 ______/ / \ \______
 /    /       \    \
16  __8__   _15__  16
   / /|\ \ / /|\ \
   3 0 4 1 4 4 3 4
 //|\    //|\  //|\
 1110    0010  1101
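The construction above can be written as a short recursion. This is an illustrative sketch, not the authors' implementation (the names build_pctree and grid are invented):

```python
# Illustrative sketch: building a Peano Count Tree from a 2^k x 2^k
# bit matrix. Each node stores its quadrant's 1-bit count; pure-0 and
# pure-1 quadrants become leaves, which is what compresses the data.
def build_pctree(bits, r0=0, c0=0, size=None):
    if size is None:
        size = len(bits)
    count = sum(bits[r][c] for r in range(r0, r0 + size)
                           for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, None)                     # pure quadrant: leaf
    h = size // 2
    children = [build_pctree(bits, r, c, h)      # Peano (Z) order
                for r in (r0, r0 + h) for c in (c0, c0 + h)]
    return (count, children)

# The slide's 8x8 example.
grid = [[int(b) for b in row] for row in (
    "11111100", "11111000", "11111100", "11111110",
    "11111111", "11111111", "11111111", "01111111")]
root = build_pctree(grid)   # root count 55; quadrant counts 16, 8, 15, 16
```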
13
An example of a P-tree (cont.): QID (Quadrant ID)
At each level the four quadrants are numbered 0, 1, 2, 3 in Peano (Z) order; a node's QID is the sequence of quadrant numbers along the path from the root (e.g., 2.2.3).
A pixel's QID comes from interleaving the bits of its coordinates: pixel (7, 1) = (111, 001) in binary interleaves to 10.10.11 = 2.2.3, its Peano path.
14
P-tree Algebra: And, Or, Complement, other (XOR, etc.)
PC-tree: 55
 ______/ / \ \______
 /    /       \    \
16  __8__   _15__  16
   / /|\ \ / /|\ \
   3 0 4 1 4 4 3 4
 //|\    //|\  //|\
 1110    0010  1101
Its complement (counts 0’s, not 1’s): 9
 ______/ / \ \______
 /    /       \    \
 0  __8__   __1__   0
   / /|\ \ / /|\ \
   1 4 0 3 0 0 1 0
 //|\    //|\  //|\
 0001    1101  0010
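The complement operation can be sketched directly on (count, children) nodes; AND follows the same recursion, with pure-1 nodes acting as identities and pure-0 nodes as annihilators. This is an illustrative sketch, not the patented P-tree implementation:

```python
# Illustrative sketch: complementing a P-tree without touching the raster.
# A node is (count, children); children is None for a pure quadrant.
# A node covering an n x n quadrant complements to count n*n - count.
def complement(node, size):
    count, children = node
    comp = size * size - count
    if children is None:
        return (comp, None)
    h = size // 2
    return (comp, [complement(ch, h) for ch in children])

# The slide's PC-tree with root count 55 (leaves are 1-bit nodes).
bits = lambda s: [(int(b), None) for b in s]
tree = (55, [
    (16, None),
    (8, [(3, bits("1110")), (0, None), (4, None), (1, bits("0010"))]),
    (15, [(4, None), (4, None), (3, bits("1101")), (4, None)]),
    (16, None),
])
comp = complement(tree, 8)   # root count 9, as on the slide
```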
15
Basic, Value and Tuple P-trees
Basic P-trees: P11, P12, …, P18, P21, …, P28, …, P71, …, P78
Value or interval P-trees: P1,5 = P1,101 = P11 ^ P12’ ^ P13
Tuple P-trees: P(5,2,7) = P(101,010,111) = P1,101 ^ P2,010 ^ P3,111 = P11 ^ P12’ ^ P13 ^ P21’ ^ P22 ^ P23’ ^ P31 ^ P32 ^ P33
Notational alternatives: P1,5 = P1,101 = P(101,,)
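On uncompressed bit planes, a value P-tree's membership mask is just the AND of planes or complemented planes, one per bit of the value; a tuple mask ANDs the per-band value masks. A hedged sketch (value_mask is an invented name):

```python
# Illustrative sketch: deriving a value P-tree's bit mask from basic
# bit planes. planes[j][i] is bit j (MSB-first) of the band value at pixel i.
def value_mask(planes, value_bits):
    """AND together plane (bit 1) or complemented plane (bit 0) per value bit."""
    mask = [1] * len(planes[0])
    for j, b in enumerate(value_bits):
        mask = [m & (p if b else 1 - p) for m, p in zip(mask, planes[j])]
    return mask

# Toy band of 3-bit values: 5=101, 2=010, 7=111, 5=101.
band1_vals = [5, 2, 7, 5]
planes = [[(v >> (2 - j)) & 1 for v in band1_vals] for j in range(3)]
mask5 = value_mask(planes, [1, 0, 1])   # pixels where band 1 == 5
```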
16
Clustering Methods A categorization of major clustering methods:
–Partitioning methods: K-means, K-medoids, …
–Hierarchical methods: agglomerative, divisive, …
–Density-based methods
–Grid-based methods
–Model-based methods
17
The K-Means Clustering Method Given k, the k-means algorithm is implemented as follows: –Partition the objects into k nonempty subsets. 1.Compute the seed points (centroids, or mean points) of the clusters of the current partition. 2.Assign each object to the cluster with the nearest seed point. –Repeat steps 1–2 until some stopping condition is satisfied.
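The loop above, sketched for 1-D points with Euclidean distance (an illustrative toy, not a production k-means; all names invented):

```python
# Illustrative sketch of the k-means loop: assign to nearest centroid,
# recompute means, stop when the centroids no longer change.
def kmeans(points, k, iters=100):
    centroids = points[:k]                      # arbitrary initial seeds
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign to nearest seed point
            j = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[j].append(p)
        new = [sum(c) / len(c) if c else centroids[j]
               for j, c in enumerate(clusters)]
        if new == centroids:                    # stopping condition: no change
            break
        centroids = new
    return centroids, clusters
```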
18
The K-Means Clustering Method Strength –Relatively efficient: O(nkt), where n = number of objects, k = number of clusters, and t = number of iterations; normally k, t << n. Weakness –Requires a metric so that the mean is defined. –Needs the number of clusters, k, specified in advance. –Sensitive to noisy data and outliers, since a small number of such points can substantially influence the mean value.
19
The K-Medoids Clustering Method Find representative objects, called medoids, in clusters: often a “middle-ish” or “median” object. PAM (Partitioning Around Medoids, 1987) Pick k medoids; check all (medoid, non-medoid) pairs for improved clustering; if found, replace the medoid. Repeat until some stopping condition. PAM is effective for small data sets but does not scale well to large data sets. CLARA (Clustering LARge Applications) (Kaufmann & Rousseeuw, 1990) Draw many sample sets; apply PAM to each; return the best clustering. CLARANS (Clustering Large Applications based upon RANdomized Search) (Ng & Han, 1994) Similar to CLARA, except a graph is used to guide replacements.
20
PAM (Partitioning Around Medoids) Use real objects to represent the clusters: –Select k representative objects arbitrarily. –For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TCih. –For each pair of i and h: if TCih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object. –Repeat steps 2–3 until there is no change.
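The swap test in steps 2–3 can be sketched with a total-cost function (1-D points for brevity; an illustrative sketch only, with invented names):

```python
# Illustrative PAM sketch: the cost of a medoid set is the sum of each
# point's distance to its nearest medoid; a (medoid, non-medoid) swap
# is kept only if it lowers that total cost.
def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])                  # arbitrary initial medoids
    improved = True
    while improved:                             # repeat until no change
        improved = False
        for i, _ in enumerate(medoids):
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if total_cost(points, trial) < total_cost(points, medoids):
                    medoids, improved = trial, True
    return sorted(medoids)
```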
21
CLARA (Clustering Large Applications) It draws multiple samples of the data set, applies PAM to each sample, and returns the best clustering as the output. Strength: deals with larger data sets than PAM. Weakness: –Efficiency depends on the sample size. –A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the samples are biased.
22
CLARANS (A Clustering Algorithm based on Randomized Search) CLARANS draws a sample of neighbors dynamically. The clustering process can be viewed as searching a graph in which every node is a potential solution, that is, a set of k medoids. When a local optimum is found, CLARANS restarts from a new randomly selected node in search of a new local optimum. It is more efficient and scalable than both PAM and CLARA.
23
Our Approach
–Represent the original data set as interval P-trees using the higher-order-bit concept hierarchy (an interval corresponds to a value at a higher level of the hierarchy).
–These P-trees can be viewed as groups of very similar data elements.
–Prune out outliers by disregarding sparse groups:
Input: total number of objects N, all interval P-trees, pruning criteria (e.g., a root-count threshold and an outlier percentage t)
Output: the interval P-trees that survive pruning
(1) Choose the interval P-tree Pv with the smallest root count.
(2) Apply the pruning criteria (e.g., RC(Pv) < threshold and (ol := ol + RC(Pv))/N < t); if they hold, remove Pv and repeat from (1); stop when the criteria fail.
–Find clusters by
–traversing P-tree levels until there are k groups, or
–using PAM, where each interval P-tree is one object.
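The pruning loop might be sketched as follows, assuming each interval's root count RC(Pv) is available as a precomputed number (all names and the exact stopping test are illustrative assumptions, not the authors' code):

```python
# Illustrative sketch of the outlier-pruning step: repeatedly drop the
# sparsest interval group while its root count is below a threshold and
# the total removed fraction stays within the outlier percentage t.
def prune(root_counts, n_total, threshold, t):
    """root_counts: {interval_id: root count of its interval P-tree}."""
    kept, removed = dict(root_counts), 0
    while kept:
        v = min(kept, key=kept.get)             # smallest root count first
        if kept[v] >= threshold or (removed + kept[v]) / n_total > t:
            break                               # pruning criteria fail: stop
        removed += kept.pop(v)
    return kept
```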
24
Example
BSQ (4×4 image, 4 bands; pixel (row,col) | B1 | B2 | B3 | B4):
0,0 | 0011 | 0111 | 1000 | 1011
0,1 | 0011 | 0011 | 1000 | 1111
0,2 | 0111 | 0011 | 0100 | 1011
0,3 | 0111 | 0010 | 0101 | 1011
1,0 | 0011 | 0111 | 1000 | 1011
1,1 | 0011 | 0011 | 1000 | 1011
1,2 | 0111 | 0011 | 0100 | 1011
1,3 | 0111 | 0010 | 0101 | 1011
2,0 | 0010 | 1011 | 1000 | 1111
2,1 | 0010 | 1011 | 1000 | 1111
2,2 | 1010 | 1010 | 0100 | 1011
2,3 | 1111 | 1010 | 0100 | 1011
3,0 | 0010 | 1011 | 1000 | 1111
3,1 | 1010 | 1011 | 1000 | 1111
3,2 | 1111 | 1010 | 0100 | 1011
3,3 | 1111 | 1010 | 0100 | 1011
bSQ (band 1 shown; one 4×4 bit file per bit position):
B11: 0000 / 0000 / 0011 / 0111
B12: 0011 / 0011 / 0001 / 0011
B13: 1111 / 1111 / 1111 / 1111
B14: 1111 / 1111 / 0001 / 0011
(B21, B22, B23, B24 and the band-3 and band-4 files are formed the same way)
Basic P-trees: P11, …, P14, P21, …, P24, P31, …, P34, P41, …, P44 and their complements P11’, …, P44’
Value P-trees (16 per band): P1,0000, P1,0001, …, P1,1111; P2,0000, …, P2,1111; P3,0000, …, P3,1111; P4,0000, …, P4,1111
Tuple P-trees with non-zero root counts (root count; quadrant counts in Peano order; leaf):
P(0010,1011,1000,1111): 3; 0 0 3 0; 1110
P(1010,1010,0100,1011): 1; 0 0 0 1; 1000
P(1010,1011,1000,1111): 1; 0 0 1 0; 0001
P(0011,0011,1000,1011): 1; 1 0 0 0; 0001
P(0011,0011,1000,1111): 1; 1 0 0 0; 0100
P(0011,0111,1000,1011): 2; 2 0 0 0; 1010
P(0111,0010,0101,1011): 2; 0 2 0 0; 0101
P(0111,0011,0100,1011): 2; 0 2 0 0; 1010
P(1111,1010,0100,1011): 3; 0 0 0 3; 0111
25
P-tree Performance Average time required to perform the multi-operand ANDing operation on a TM file (~40 million pixels)
26
Conclusion & Discussion PAM is not efficient for medium and large data sets. CLARA and CLARANS draw samples from the original data randomly. Our algorithm (using P-trees, a lossless, data-mining-ready data structure) does not draw samples; it groups the data first: –each interval P-tree can be viewed as a group; –PAM then only needs to deal with the P-trees, and the number of P-trees is much smaller than the data sets CLARA and CLARANS must handle; –because P-tree ANDing is very fast, our algorithm is very fast.