Functional Analytic Unsupervised and Supervised data mining Technology

Presentation transcript:

FAUST: Functional Analytic Unsupervised and Supervised data mining Technology
Dr. William Perrizo, University Distinguished Professor of Computer Science, North Dakota State University, Fargo, ND. William.perrizo@ndsu.edu
Visit Treeminer, Inc., 175 Admiral Cochrane Dr., Suite 300, Annapolis, Maryland 21401. 240.389.0750 (T), 301.528.6297 (F), info@treeminer.com

This work involves clustering and classification of very big datasets (so-called "big data") using vertical structuring (pTrees). In a nutshell, our techniques structure the dataset into compressed vertical strips (bit slices) and process across those vertical bit slices using logical operations. This is in contrast to the traditional method of structuring the data into horizontal records and then processing down those records. Thus, we do horizontal processing of vertical data (HPVD) rather than the traditional vertical processing of horizontal data (VPHD). We will take a much more detailed look at the way we structure data and the way we process those structures a little later, but first I want to talk about big data. How big is big data these days, and how big will it get? An example:

The US Library of Congress is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains fifty fields; let's assume a tweet record is 1000 bits in width. The US LoC will record 172 billion tweets from 500 million tweeters in 2013 alone. So let's estimate approximately 1 trillion tweets from 1 billion tweeters to 1 billion tweetees from 2006 to 2016. As a full 3-dimensional matrix (tweet x tweeter x tweetee), that's 10^12 x 10^9 x 10^9 = 10^30 matrix cells. Even if only the sender is recorded, that's still 10^21 cells.

Let's look at how the definition of "big data" has evolved just over my work lifetime. When I started as THE technician at the St. John's University IBM 1620 Computer Center (circa 1964), I did the following:
1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the O/S punch card stack on the card reader (~4 in. high).
4. I put the FORTRAN compiler card stack on the reader (~3 in.).
5. I put the FORTRAN program card stack on the reader (~2 in.).
6. The 1620 produced an object code stack (~1 in.).
7. I read in the object stack and a 1964 big data stack (~40 in.).
The 1st FORTRAN upgrade allowed for a "continue" card so that the data stack could be read in segments (and I could sit down).

How high would a 2013 big data stack of Hollerith punch cards reach? Let's be conservative and just assume an exabyte of data on punch cards. How high is an exabyte on punch cards? We're being conservative because the US LoC full matrix of tweet data would be ~10^30 bytes (and an exabyte is a mere 10^18).

That exabyte stack of punch cards would reach JUPITER! So, in my work lifetime, "big data" has gone from 40 inches high all the way to Jupiter! What will happen to big data over my grandson Will-the-2nd's lifetime?
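A rough sanity check of the Jupiter claim in Python (a sketch; the 80-bytes-per-card capacity and 0.007-inch card thickness are my assumptions, not figures from the slides):

    # Height of an exabyte punched onto Hollerith cards.
    BYTES = 10**18             # one exabyte
    BYTES_PER_CARD = 80        # one character per column, 80 columns (assumed)
    CARD_THICKNESS_IN = 0.007  # inches per card (assumed)

    cards = BYTES / BYTES_PER_CARD
    height_km = cards * CARD_THICKNESS_IN * 0.0254 / 1000
    print(f"{height_km:.1e} km")  # ~2.2e9 km: past Jupiter, whose distance
                                  # from Earth is roughly 0.6e9 to 1.0e9 km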

Will-the-1st (that's me) has to deal with a stack of tweets that will reach Jupiter, but I will replace it losslessly by 1000 extendable vertical pTrees and data mine across those 1000 horizontally. Will-the-2nd will have to deal with a stack of tweets that will reach the end of space, but he will replace it losslessly by 1000 extendable vertical pTrees and data mine across those 1000 horizontally. Will-the-3rd will have to deal with a stack of tweets that will create new space, but he will replace it losslessly by 1000 extendable vertical pTrees and data mine across those 1000 horizontally. Will-the-3rd can use Will-the-2nd's code (which is Will-the-1st's code). Will-the-3rd's data WILL HAVE TO BE COMPRESSED and will have to be VERTICALLY structured (I believe). Let's see how compressed vertical data structuring might be done.

Traditional Vertical Processing of Horizontal Data (VPHD): e.g., find the number of occurrences of (7,0,1,4) in the table T(A1,A2,A3,A4) below. For horizontally structured, record-oriented data, one scans vertically down the records. Imagine an exillion records, not just 8 (we need speed!).

    T(A1 A2 A3 A4)    base 10      base 2
                      2  7  6  1   010 111 110 001
                      3  7  6  0   011 111 110 000
                      2  7  5  1   010 111 101 001
                      2  7  5  7   010 111 101 111
                      3  2  1  4   011 010 001 100
                      2  2  1  5   010 010 001 101
                      7  0  1  4   111 000 001 100
                      7  0  1  4   111 000 001 100

Vertical Data (pTrees): instead, vertically slice off each bit position, giving 12 vertical structures T11, T12, T13, T21, ..., T43, where Tij is the j-th bit slice of attribute Ai (slicing by whole column would give 4 vertical structures). Then compress each bit slice into a tree using a predicate: record the truth of the predicate "purely 1-bits" in a tree, recursively on halves, until the half is pure. We walk through the compression of T11 = 0000 0011 into its pTree, P11:
1. Whole thing pure1? false -> 0
2. Left half (0000) pure1? false -> 0. But it's pure0, so this branch ends.
3. Right half (0011) pure1? false -> 0
4. Left half of right half (00) pure1? false -> 0. But it's pure0, so this branch ends.
5. Right half of right half (11) pure1? true -> 1
So P11 is the tree 0 [ 0 | 0 [ 0 | 1 ] ]. More typically, we compress strings of bits, not single bits (e.g., 64-bit strings or strides).

Using vertical pTrees, the number of occurrences of (7,0,1,4) = 111 000 001 100 is the root count of
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43,
where P'ij is the complement pTree, used wherever the pattern bit is 0. The result is the pTree of 0000 0011, whose root count is 0*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 2.
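The same count can be computed directly against uncompressed bit slices. Below is a minimal sketch in plain Python (names such as bit_slices and count are mine; plain integers stand in for the bit vectors, whereas a real pTree implementation would AND compressed trees and read the count off the root):

    # Count occurrences of a pattern by ANDing vertical bit slices.
    rows = [(2,7,6,1),(3,7,6,0),(2,7,5,1),(2,7,5,7),
            (3,2,1,4),(2,2,1,5),(7,0,1,4),(7,0,1,4)]
    n, width = len(rows), 3  # 8 records, 3-bit attributes

    def bit_slices(col):
        """Return slices [bit 2, bit 1, bit 0] of attribute `col`;
        bit i of each slice corresponds to record i."""
        slices = [0] * width
        for i, r in enumerate(rows):
            for j in range(width):
                if (r[col] >> (width - 1 - j)) & 1:
                    slices[j] |= 1 << i
        return slices

    def count(pattern):
        mask = (1 << n) - 1                      # start with all records
        for col, val in enumerate(pattern):
            for j, s in enumerate(bit_slices(col)):
                bit = (val >> (width - 1 - j)) & 1
                mask &= s if bit else ~s & ((1 << n) - 1)  # P or P'
        return bin(mask).count("1")              # the root count

    print(count((7, 0, 1, 4)))  # -> 2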

FAUST CLUSTERing (unsupervised). FAUST = Functional Analytic Unsupervised and Supervised data mining Technology. Use functional gap analysis to find round clusters. Initially let C = the entire table X, e.g., X ≡ {p1, ..., pf}, an image table of 15 pixels:

    X    x1  x2
    p1    1   1
    p2    3   1
    p3    2   2
    p4    3   3
    p5    6   2
    p6    9   3
    p7   15   1
    p8   14   2
    p9   15   3
    pa   13   4
    pb   10   9
    pc   11  10
    pd    9  11
    pe   11  11
    pf    7   8

(The slide plots these points twice; they form two interlocking horseshoes with an outlier.)

While a cluster C is insufficiently dense: find M = Mean(C) (or Vector_of_Medians?), pick a point f∈C (e.g., the furthest point from M), and let fM denote the vector from f to M. If count(C) / |fM|^2 > DT (a DensityThreshold), C is complete; else split C at each cutpoint where there is a gap larger than GT (a Gap Threshold) in the values of the dot product x o d, where d ≡ fM/|fM|.

On X, the first round yields C1 = {p1,p2,p3,p4}, C2 = {p5}, C3 = {p6,pf}, and C4 = {p7,p8,p9,pa,pb,pc,pd,pe}. C2 = {p5} is complete (a singleton = outlier). C3 = {p6,pf} will split into {p6} and {pf}, which are also outliers (details omitted). That leaves C1 and C4 still incomplete. C1 is complete (with f1 = p3 it doesn't split, and density(C1) = ~4/2^2 = 1 > DT = .5?). Applying the algorithm to C4: {pa} is an outlier, and a subcluster splits into {p9} and {pb,pc,pd}, both complete. In both cases those are probably the best "round" clusters (accurate?).

Speed? What if there are a trillion pixels (not just 15)? All calculations can be done using pTrees, processing across the 8 pTrees regardless of the number of pixels. Finally, there are many variations of this algorithm, e.g., using M = vector_of_medians instead of the mean, which works better for finding outliers!
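A minimal sketch of one gap-split round in plain Python (no pTrees; loop-based where the slides use SPTS/pTree operations; it projects onto fM itself, matching the xofM values and GT = 8 on the next slide):

    def gap_split(points, gt=8):
        """One FAUST round: project every point x onto fM (the vector from
        the furthest point f to the mean M) and cut the sorted projection
        values xofM wherever consecutive values differ by more than gt."""
        n = len(points)
        dims = range(len(points[0]))
        m = [sum(p[i] for p in points) / n for i in dims]         # M = mean
        f = max(points, key=lambda p: sum((p[i] - m[i])**2 for i in dims))
        fm = [m[i] - f[i] for i in dims]                          # fM = M - f
        xofm = lambda p: sum(p[i] * fm[i] for i in dims)          # x o fM
        order = sorted(points, key=xofm)
        clusters, cur = [], [order[0]]
        for a, b in zip(order, order[1:]):
            if xofm(b) - xofm(a) > gt:   # gap > GT: cut the cluster here
                clusters.append(cur)
                cur = []
            cur.append(b)
        clusters.append(cur)
        return clusters

    X = [(1,1),(3,1),(2,2),(3,3),(6,2),(9,3),(15,1),(14,2),
         (15,3),(13,4),(10,9),(11,10),(9,11),(11,11),(7,8)]
    for c in gap_split(X):
        print(c)  # four clusters: {p1..p4}, {p5}, {p6,pf}, {p7..pe}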

Finding the FAUST gaps using pTree HPVD (Horizontal Processing of Vertical Data): for the table X above, take f = p1 and GT = 8 = 2^3, and compute the SPTS xofM = X o fM. Its values (11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ...) are stored as 7 uncompressed vertical bit slices p6, p5, ..., p0, with complements p6', ..., p0' (bit-slice names, not to be confused with the points p1..pf).

A GT-wide gap aligned on a multiple of 2^3 is an empty interval of values sharing their high-order bits, so it can be found with no scan at all: AND the bit slices (or their complements) above position 3 and check for a zero root count. For example, the gap [40,47] = [010 1000, 010 1111] means no value has high-order bits 0101, i.e., RootCount(p6' ^ p5 ^ p4' ^ p3) = 0. The gaps found:
- 2^3 gap [0,7] = [000 0000, 000 0111], but since it is at the front, it doesn't separate clusters.
- 2^3 gap [40,47] = [010 1000, 010 1111]
- 2^3 gap [56,63] = [011 1000, 011 1111], adjacent to the 2^4 gap [64,79] = [100 0000, 100 1111], together one empty region [56,79]
- 2^4 gap [88,103] = [101 1000, 110 0111]

ORing the point masks between consecutive gaps yields the cluster pTrees: between gaps 1 and 2, cluster C1 = {p1,p2,p3,p4}; between gaps 2 and 3, cluster C2 = {p5}; between gaps 3 and 4, cluster C3 = {p6,pf}; and above the last gap, cluster C4 = {p7,p8,p9,pa,pb,pc,pd,pe}.
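The aligned-gap test can be sketched with plain integers as bit vectors (my code, not the slide's; it uses the thirteen xofM values listed above, and adjacent 8-wide gaps merge into the slide's wider gaps):

    xofm = [11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83]
    WIDTH, N = 7, len(xofm)

    # Vertical bit slices of xofM: bit i of slices[j] is bit j of value i.
    slices = [0] * WIDTH
    for i, v in enumerate(xofm):
        for j in range(WIDTH):
            if (v >> j) & 1:
                slices[j] |= 1 << i

    def prefix_count(prefix_bits, k):
        """Count values whose bits above position k equal prefix_bits,
        via AND of slices (or complements). Zero => a 2^k aligned gap."""
        mask = (1 << N) - 1
        for j in range(k, WIDTH):
            bit = (prefix_bits >> (j - k)) & 1
            mask &= slices[j] if bit else ~slices[j] & ((1 << N) - 1)
        return bin(mask).count("1")

    k = 3  # look for 2^3 = 8-wide aligned gaps (GT = 8)
    for prefix in range(1 << (WIDTH - k)):
        if prefix_count(prefix, k) == 0:
            lo = prefix << k
            print(f"gap [{lo}, {lo + (1 << k) - 1}]")
    # prints [0,7], [40,47], [56,63], [64,71], [72,79], [88,95], [96,103];
    # the adjacent ones merge into the slide's gaps [56,79] and [88,103]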

FAUST Classifier (supervised): use a cut hyperplane between classes, cutting at the midpoint between the means. Let mR and mV be the means of classes R and V, let D ≡ the vector from mR to mV, d = D/|D|, and

    a = (mR + (mV - mR)/2) o d = ((mR + mV)/2) o d

Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane cutting the space in two. To classify x: if x o d < a, classify as R, else V. To classify a set, form the set mask pTree for the set of points and apply the predicate: PR = P(X o d < a).

We can improve accuracy, e.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class rather than its mean: vomV ≡ (median{v1 | v∈V}, median{v2 | v∈V}, ...), and likewise vomR.
2. Alternatively, cut using a standard deviation ratio, placing a the fraction std(R)/(std(R)+std(V)) of the way from mR to mV along the d-line, where std(R) is the standard deviation of the class-R distances from the origin along the d-line.

(The slide's figure shows the r and v points in the dim1-dim2 plane with vomR, vomV, and the d-line.)
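A minimal sketch of the basic mean-midpoint cut in Python with NumPy (function names and the toy data are mine; the vom and standard-deviation refinements are omitted):

    import numpy as np

    def faust_train(R, V):
        """Mean-midpoint FAUST cut: unit direction d between the class
        means, cut value a at the midpoint projected onto d."""
        mR, mV = R.mean(axis=0), V.mean(axis=0)
        D = mV - mR
        d = D / np.linalg.norm(D)
        a = ((mR + mV) / 2) @ d
        return d, a

    def faust_classify(X, d, a):
        """Boolean mask: False -> class R (x o d < a), True -> class V."""
        return X @ d >= a

    # toy two-class data (my example, not from the slides)
    rng = np.random.default_rng(0)
    R = rng.normal([2, 2], 0.5, (20, 2))
    V = rng.normal([6, 5], 0.5, (20, 2))
    d, a = faust_train(R, V)
    pred = faust_classify(np.vstack([R, V]), d, a)
    print(pred[:20].sum(), pred[20:].sum())  # ~0 R points and ~20 V points land on the V side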