Proximal Support Vector Machine for Spatial Data Using P-trees1


1 Proximal Support Vector Machine for Spatial Data Using P-trees1
Fei Pan, Baoying Wang, Dongmei Ren, Xin Hu, William Perrizo
1 Patents are pending on P-tree technology by North Dakota State University

2 OUTLINE
Introduction
Brief Review of SVM
Review of P-tree and EIN-ring
Proximal Support Vector Machine
Performance Analysis
Conclusion

3 Introduction In this research paper, we develop an efficient proximal support vector machine (SVM) for spatial data using P-trees. The central idea is to fit a binary class boundary using piecewise linear segments.

4 Brief Review of SVM In very simple terms, an SVM corresponds to a linear method (a perceptron) applied in a very high-dimensional feature space that is nonlinearly related to the input space. By using kernels, a nonlinear class boundary is transformed into a linear boundary in that high-dimensional feature space, where linear methods apply; the classifier obtained there then yields the classification in the original feature space. Support Vector Machines (SVMs) are competing strongly with neural networks as tools for solving pattern recognition problems. They are based on some beautifully simple ideas and provide a clear intuition of what learning from examples is all about. More importantly, they also show high performance in practical applications.

5 More About SVM The goal of a support vector machine classifier is to find the particular hyperplane in the high-dimensional feature space for which the separation margin between the two classes is maximized. Given a data set containing data belonging to two or more different classes, either linearly separable or non-separable, the problem is to find the optimal separating hyperplane (decision boundary) that separates the data according to their class type.
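As a minimal illustration of this maximum-margin idea (not the paper's method), the sketch below fits a linear SVM on toy 2-D data with scikit-learn and reads off the separating hyperplane, its margin, and the support vectors; the toy data and the use of scikit-learn are assumptions made for the example only.

    # Minimal maximum-margin illustration (assumes scikit-learn is available).
    import numpy as np
    from sklearn.svm import SVC

    # Two toy classes in a 2-D feature space.
    X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class 0
                  [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class 1
    y = np.array([0, 0, 0, 1, 1, 1])

    clf = SVC(kernel="linear", C=1.0).fit(X, y)

    # Fitted hyperplane w.x + b = 0; the separation margin is 2 / ||w||.
    w, b = clf.coef_[0], clf.intercept_[0]
    print("w =", w, " b =", b, " margin =", 2.0 / np.linalg.norm(w))
    print("support vectors:\n", clf.support_vectors_)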

6 More About SVM Recently there has been an explosion of interest in SVMs, which have been shown empirically to give good classification performance on a wide variety of problems. However, training an SVM is extremely slow for large-scale data sets; this lack of scalability is a real problem for SVMs.

7 Our Approach Our approach is a geometric method whose accuracy and efficiency are tuned by using P-trees and EIN-rings (Equal Interval Neighborhood rings). Outliers in the training data are first identified and eliminated. The method is local (proximal), i.e., no training phase is required. Preliminary tests show that the method has promise for both speed and accuracy.

8 Review of P-trees
Current practice: data is stored as sets of horizontal records, which are scanned vertically. With P-trees we instead vertically project each attribute, then vertically project each bit position of each attribute, compress each bit slice into a basic P-tree, and process vertically (vertical scans); queries then horizontally AND the basic P-trees.
For a relation R(A1 A2 A3 A4) with bit slices R11, R12, ..., R43, the 1-dimensional P-tree P11 of bit slice R11 is built by recording the truth of the predicate "pure 1" recursively on halves, until purity is reached:
1. Whole slice pure1? false, so 0
2. 1st half pure1? false, so 0
3. 2nd half pure1? false, so 0
4. 1st half of the 2nd half pure1? false, so 0
5. 2nd half of the 2nd half pure1? true, so 1
6. 1st half of the 1st half of the 2nd half pure1? true, so 1
7. 2nd half of the 1st half of the 2nd half pure1? false, so 0; but it is pure (pure0) so this branch ends.
To count the tuples matching a given value pattern, AND the basic P-trees, complemented wherever the pattern bit is 0, e.g.
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
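For concreteness, the following rough sketch (ours, not NDSU's P-tree implementation) builds a 1-dimensional basic P-tree from a bit slice by exactly this recursive halving on the "pure 1" predicate; the 8-bit slice used is illustrative, not the slide's actual R11.

    # Sketch: build a 1-D P-tree by recursive halving on the "pure 1" predicate.
    def build_ptree(bits):
        """1 = pure-1 segment, 0 = pure-0 segment (branch ends),
        otherwise a (left-half, right-half) pair of sub-trees."""
        if all(bits):
            return 1                      # pure1: branch ends
        if not any(bits):
            return 0                      # pure0: branch ends
        mid = len(bits) // 2
        return (build_ptree(bits[:mid]), build_ptree(bits[mid:]))

    def root_count(node, length):
        """Number of 1-bits covered by a node spanning `length` positions."""
        if node == 1:
            return length
        if node == 0:
            return 0
        half = length // 2
        return root_count(node[0], half) + root_count(node[1], length - half)

    r11 = [0, 0, 0, 0, 1, 1, 0, 1]         # illustrative 8-bit slice
    p11 = build_ptree(r11)
    print(p11, root_count(p11, len(r11)))  # (0, (1, (0, 1))) 3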

9 When is Horizontal Processing of Vertical Structures a good idea?
They're NOT for record-based workloads (e.g., SQL), where the result is a set of records: changing horizontal records to vertical trees and then having to reconstruct horizontal result records may mean excessive post-processing. They ARE for data mining workloads, where the result is often a single bit or count (Yes/No, T/F), so no reconstructive post-processing is needed.

10 2-Dimensional Pure1-trees
A node is 1 iff its quadrant is purely 1-bits. E.g., take a bit-file (say, the high-order bit slice of the RED band of a 2-D image), lay it out in spatial raster order, and run-length compress it into a quadrant tree using Peano order.
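A sketch of the same idea in two dimensions (assuming a 2^k x 2^k bit-file; not the compressed storage format): a node is 1 for an all-1 quadrant, 0 for an all-0 quadrant, and otherwise expands into its four sub-quadrants in Peano (Z) order.

    # Sketch: 2-D Pure1-tree over a square bit-file, quadrants in NW, NE, SW, SE order.
    def build_p1_quadtree(img):
        flat = [b for row in img for b in row]
        if all(flat):
            return 1                      # quadrant is purely 1-bits
        if not any(flat):
            return 0                      # quadrant is purely 0-bits
        n = len(img) // 2
        quadrants = ([row[:n] for row in img[:n]],   # NW
                     [row[n:] for row in img[:n]],   # NE
                     [row[:n] for row in img[n:]],   # SW
                     [row[n:] for row in img[n:]])   # SE
        return tuple(build_p1_quadtree(q) for q in quadrants)

    img = [[1, 1, 0, 0],                  # toy 4x4 bit-file (e.g., one bit slice of a band)
           [1, 1, 0, 1],
           [1, 1, 1, 1],
           [1, 1, 1, 1]]
    print(build_p1_quadtree(img))         # (1, (0, 0, 0, 1), 1, 1)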

11 Count-trees: an alternative to predicate-trees
Counts are needed in data mining. Predicate-trees are very compressed and can still produce counts quickly. Count-trees are an alternative, in which each internal node records the count of 1s in its quadrant. (Figure: an example count-tree with root count 55 at level 3, quadrant counts such as 16, 15 and 16 at level 2, counts such as 3, 4, 1 and 4 at level 1, and leaf counts at level 0; a node is addressed by its quadrant path, e.g., (7, 1) = (111, 001) in binary.)

12 Logical Operations on Ptrees (are used to get counts of any pattern)
The AND operation is faster than a bit-by-bit AND because there are shortcuts: any pure0 operand node means the result node is pure0, and for any pure1 operand node the corresponding subtree of the other operand is simply copied to the result (e.g., only quadrant 2 needs to be loaded to AND Ptree1 and Ptree2). The more operands there are in the AND, the greater the benefit from this shortcut (more pure0 nodes).
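A sketch of that shortcut on the nested-tuple representation used in the earlier sketches (1 = pure1 node, 0 = pure0 node); complementing, used for the primed P-trees such as P'21, is included for completeness.

    # Sketch: P-tree AND with the pure0/pure1 shortcuts, plus complement.
    def ptree_and(a, b):
        if a == 0 or b == 0:
            return 0              # any pure0 operand -> result node is pure0
        if a == 1:
            return b              # pure1 operand -> copy the other operand's subtree
        if b == 1:
            return a
        return tuple(ptree_and(x, y) for x, y in zip(a, b))

    def ptree_not(a):
        """Complement tree (e.g., P'21 from P21)."""
        if a in (0, 1):
            return 1 - a
        return tuple(ptree_not(x) for x in a)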

13 Hilbert Ordering? In 2 dimensions, Peano ordering is the 2x2-recursive z-ordering (raster ordering), while Hilbert ordering is the 4x4-recursive tuning-fork ordering (H-trees have fanout 16). Hilbert ordering has somewhat better continuity characteristics, but a much less usable coordinate-to-quadrant translator.
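For concreteness, a small sketch of the Peano (Z) coordinate-to-quadrant translator mentioned above: the Z index of a cell is obtained simply by interleaving the bits of its row and column coordinates (the helper name is ours).

    # Sketch: Peano / Z-order index of cell (row, col) in a 2^bits x 2^bits grid.
    def peano_index(row, col, bits):
        z = 0
        for i in range(bits - 1, -1, -1):            # MSB first
            z = (z << 2) | (((row >> i) & 1) << 1) | ((col >> i) & 1)
        return z

    # The four cells of a 2x2 grid come out in Z order: (0,0), (0,1), (1,0), (1,1).
    print([peano_index(r, c, 1) for r in (0, 1) for c in (0, 1)])   # [0, 1, 2, 3]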

14 3-Dimensional Ptrees

15 Generalizing Peano compression to any table with numeric attributes
Raster sorting orders the bits by attribute first and bit position second; Peano sorting orders them by bit position first and attribute second. (Figures: an unsorted relation, and the same relation showing its values in binary.)
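A sketch of what generalized Peano sorting could look like for such a table (our reading of "bit position first, attribute second", with an illustrative toy relation): tuples are ordered by a key that interleaves the attribute bits, most significant positions first.

    # Sketch: generalized Peano sort of a relation of small unsigned integers.
    def peano_key(tup, bits):
        key = 0
        for i in range(bits - 1, -1, -1):     # bit position first (MSB down)
            for v in tup:                     # attribute second
                key = (key << 1) | ((v >> i) & 1)
        return key

    relation = [(5, 2), (1, 7), (4, 3), (0, 6)]     # toy 2-attribute, 3-bit tuples
    print(sorted(relation, key=lambda t: peano_key(t, 3)))
    # [(0, 6), (1, 7), (4, 3), (5, 2)]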

16 Generalized Peano sorting can make a big difference in classification speed
(Chart: classification speed improvement for a sample-based classifier on five UCI Machine Learning Repository data sets (adult, spam, mushroom, function, crop), comparing the unsorted, generalized raster, and generalized Peano orderings; time in seconds.)

17 Range predicate tree: Px>v
For identifying/counting tuples satisfying a given range predicate. Let v = bm…bi…b0. Then
Px>v = Pm opm … Pi opi Pi-1 … opk+1 Pk
where 1) opi is ∧ if bi = 1, and ∨ otherwise, and 2) k is the rightmost bit position of v with value 0.
For example: Px>101 = P2 ∧ P1

18 Pxv v=bm…bi…b0 Pxv = P’m opm … P’i opi P’i-1 … opk+1P’k
1) opi is  if bi=0, opi is  otherwise 2) k is rightmost bit position of v with “0” For example: Px  101 = (P’2  P’1) Given a data set with d attributes, X = (An, An-1 … A0), and the binary representation of jth attribute Aj as (bj,mbj,m-1...bj,i… bj,0.)
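A sketch of evaluating these two range predicates on uncompressed bit vectors standing in for the P-trees (so the compression and the tree-level ANDing are not shown; the helper names are ours). Px>v follows the AND/OR recipe above, and Px≤v is simply its complement.

    # Sketch: range predicate P_{x>v} over bit-slice columns; P_{x<=v} is its complement.
    import numpy as np

    def p_greater_than(bit_slices, v):
        """bit_slices[i] is the boolean column for bit position i (index 0 = LSB)."""
        m = len(bit_slices) - 1
        if v == (1 << (m + 1)) - 1:               # v is all 1s: nothing can exceed it
            return np.zeros_like(bit_slices[0], dtype=bool)
        k = next(i for i in range(m + 1) if ((v >> i) & 1) == 0)   # rightmost 0-bit of v
        result = bit_slices[k]
        for i in range(k + 1, m + 1):
            if (v >> i) & 1:
                result = bit_slices[i] & result   # op_i is AND when b_i = 1
            else:
                result = bit_slices[i] | result   # op_i is OR when b_i = 0
        return result

    # Check against a brute-force comparison on 3-bit values.
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7])
    slices = [((x >> i) & 1).astype(bool) for i in range(3)]   # P0, P1, P2
    v = 0b101
    print(p_greater_than(slices, v))    # mask of x > 5
    print(x > v)                        # same mask computed directly
    print(~p_greater_than(slices, v))   # P_{x<=v} as the complement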

19 Equal Interval Neighborhood Rings (EIN-rings) (using the L∞ distance)
The EIN-ring of a data point c with radius r and fixed interval Δ is defined as the neighborhood ring R(c, r, r+Δ) = {x ∈ X | r < |c-x| ≤ r+Δ}, where |c-x| is the distance between x and c. (Diagram: the 1st, 2nd, and 3rd EIN-rings drawn as concentric rings around c.)

20 EIN-ring Based Neighborhood Search Using Range Predicate Tree
To find the neighbors of a point x = (x1, x2) within the EIN-ring R(x, r, r+Δ), build one range-predicate P-tree per box:
P : X1 in (x1-r-Δ, x1+r+Δ], X2 in (x2-r-Δ, x2+r+Δ]  (the enlarged outer box)
P': X1 in (x1-r, x1+r], X2 in (x2-r, x2+r]  (the inner box)
The ring is the outer box with the inner box removed, so P ∧ P' (taking P' as the complement of the inner-box tree) selects exactly the tuples in the ring, and its root count gives the number of neighbors.
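The sketch below mimics this ring construction with boolean masks in place of the range-predicate P-trees (names and data are illustrative): one half-open interval per attribute gives the outer and inner boxes, and the ring is their difference.

    # Sketch: EIN-ring neighborhood search with boolean masks instead of P-trees.
    import numpy as np

    def box_mask(X, center, radius):
        """Points with every attribute value in (c_j - radius, c_j + radius]."""
        lower = X > (center - radius)
        upper = X <= (center + radius)
        return np.all(lower & upper, axis=1)

    def ein_ring_mask(X, center, r, delta):
        outer = box_mask(X, center, r + delta)   # P : the enlarged outer box
        inner = box_mask(X, center, r)           # P': the box to carve out
        return outer & ~inner                    # ring = outer box minus inner box

    X = np.array([[0.0, 0.0], [1.0, 0.5], [2.5, 0.0], [4.0, 4.0]])
    c = np.array([0.0, 0.0])
    print(ein_ring_mask(X, c, r=1.0, delta=2.0))  # only [2.5, 0.0] falls in the ring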

21 Proximal Support Vector Machine (P-SVM)
In this work we propose a proximal Support Vector Machine using P-trees (P-SVM). The overall algorithm of P-SVM is as follows:
1) Finding region components: partition the training data into region components (proximities) using EIN-rings, according to class label.
2) Finding support vectors: calculate the EIN-ring membership of each data point in its region component g, then find support vector pairs based on that membership.
3) Fitting the boundary: if the training space has d feature dimensions, calculate the d nearest boundary sentries of an unclassified sample in the training data; these determine a local boundary hyperplane segment. The class label of the sample is then determined by its location relative to that boundary hyperplane.

22 Step 1: Find region components using EIN-rings (defined above)
Definition: Given a training data set X with C classes, region components are groups of training data points x that have more than half classmates among their ε neighbors, where classmates are data points with the same class label as x and ε is a threshold on the number of neighbors.
Algorithm (assume outliers are eliminated during a data-cleaning pass):
1. Fix the number of neighbors within the EIN-ring, NBRx, at a small constant ε, e.g., 4, 6, or 8; if NBRx < ε, enlarge the ring radius r until NBRx ≥ ε.
2. Check the neighbors of x and mark x with the same group as its classmates within the neighborhood. If none of its classmates within the neighborhood are marked, mark x as starting a new group. If the number of x's classmates is less than ε/2, treat x as an outlier.
3. Merge groups that have the same class label and are reachable from each other within ε neighbors.
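A much-simplified sketch of this step (not the paper's P-tree-based procedure): each point looks at its ε nearest neighbors under the L∞ distance, points with too few classmates are left unassigned as outliers, and the remaining points join a classmate's group or start a new one. The final merge pass over reachable same-label groups (step 3) is omitted here, and all names are ours.

    # Sketch: assign training points to region components / flag outliers.
    import numpy as np

    def region_components(X, y, eps=4):
        n = len(X)
        d = np.abs(X[:, None, :] - X[None, :, :]).max(axis=2)   # L-infinity distances
        np.fill_diagonal(d, np.inf)
        nbrs = np.argsort(d, axis=1)[:, :eps]                   # eps nearest neighbors
        comp = -np.ones(n, dtype=int)                           # -1 = outlier / unassigned
        next_id = 0
        for i in range(n):
            classmates = [j for j in nbrs[i] if y[j] == y[i]]
            if len(classmates) < eps / 2:
                continue                          # too few classmates: treat as outlier
            marked = [comp[j] for j in classmates if comp[j] >= 0]
            if marked:
                comp[i] = marked[0]               # join an already-marked classmate's group
            else:
                comp[i] = next_id                 # none marked: start a new group
                next_id += 1
        return comp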

23 Step 2: Finding support vector pairs; Step 3: Fitting the boundary hyperplane
The EIN-ring membership of data point x in region component g, Mxg, is defined as the normalized sum of the weighted tuple-P-tree root counts within the EIN-rings:
Mxg = (1/Ng) · Σ (r = 1..m) wr · NBRxg,r
where Ng is the number of data points in region component g, wr is the weight of EIN-ring R(x, r-1, r), and NBRxg,r is the number of neighbors in group g within R(x, 0, r).
The EIN-ring membership pair of x in region component g, HMPxg, is defined as HMPxg = (Mxg, Mxg'), where g' is the neighboring region component of x.
A pair of candidate support vectors xi, xj ∈ X, i ≠ j, with xi ∈ g and xj ∈ g', is a support vector pair SVP(xi, xj) iff d(xi, xj) ≤ d(xk, xl) for any xk ∈ g, xl ∈ g'. In other words, xi and xj lie on the right side of the boundary and are the nearest neighbors taken from the two different region components.
The local boundary hyperplane is H(x) = w·x + w0, and a test point's class label is given by which side of H it falls on. (Figure: two region components marked + and -, a support vector pair, a boundary sentry, and the boundary hyperplane.)
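A geometric sketch of Steps 2 and 3 under strong simplifications (the weighted EIN-ring membership is skipped, and the single closest cross-component pair stands in for the support vector pairs and boundary sentries); data and helper names are illustrative only.

    # Sketch: classify a test point by the perpendicular-bisector hyperplane of the
    # closest pair of points drawn from the two region components.
    import numpy as np

    def support_vector_pair(G, Gp):
        """Closest (x_i, x_j) with x_i in component G and x_j in component G'."""
        d = np.linalg.norm(G[:, None, :] - Gp[None, :, :], axis=2)
        i, j = np.unravel_index(np.argmin(d), d.shape)
        return G[i], Gp[j]

    def classify(test, G, Gp):
        xi, xj = support_vector_pair(G, Gp)
        w = xi - xj                        # normal of the local boundary segment
        w0 = -w.dot((xi + xj) / 2.0)       # H(x) = w.x + w0 passes through the midpoint
        return "G" if w.dot(test) + w0 > 0 else "G'"

    G  = np.array([[1.0, 1.0], [1.5, 2.0]])        # toy region component G
    Gp = np.array([[4.0, 4.0], [5.0, 3.5]])        # toy region component G'
    print(classify(np.array([1.2, 1.1]), G, Gp))   # -> G
    print(classify(np.array([4.5, 4.0]), G, Gp))   # -> G'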

24 PRELIMINARY EVALUATION
Experiments were performed on a 1 GHz Pentium PC with 1 GB of main memory, running Debian Linux. The original image size is 1320x1320; for experimental purposes we form 16x16, 32x32, 64x64, 128x128, 256x256 and 512x512 images by choosing pixels uniformly distributed over the original image. (Data: an aerial TIFF image and a yield map from Oaks, North Dakota.)

25 PRELIMINARY EVALUATION
In each experiment run, we randomly select 10% of the data set as test data and use the rest as training data. The table below shows the average testing correctness over 30 runs of P-SVM and C-SVM.

Dataset (n = # of pixels)   P-SVM   C-SVM
n = 16x16                   86.4%   84.9%
n = 32x32                   89.0%   85.2%
n = 64x64                   90.3%   95.4%
n = 128x128                 92.0%   90.5%
n = 256x256                 94.1%   91.1%
n = 512x512

The testing correctness of P-SVM and C-SVM on these data sets is almost identical, indicating that P-SVM has accuracy comparable to C-SVM. P-SVM correctness appears to be comparable to standard SVM.

26 PRELIMINARY EVALUATION
The speed experiments were also performed: the average CPU run time over 30 runs on the five different data set sizes is shown in the accompanying figure. We see that P-SVM is faster than C-SVM on all five sizes, and as the data set size increases, the run time of the P-SVM method grows at a much lower rate than that of C-SVM. The experimental results show that the P-SVM method is more scalable for large spatial data sets. P-SVM speed appears to be superior to standard SVM.

27 CONCLUSION In this paper, we propose an efficient P-tree based proximal Support Vector Machine (P-SVM), which appears to improve speed without sacrificing accuracy. In the future, more extensive experiments and combination of P-SVM with KNN will be explored.

28 THANKS

