Slide 1

FAUST: Functional Analytic Unsupervised and Supervised data mining Technology

Dr. William Perrizo, University Distinguished Professor of Computer Science, North Dakota State University, Fargo, ND. Visit Treeminer, Inc., 175 Admiral Cochrane Dr., Suite 300, Annapolis, Maryland. (T) (F)
Slide 2

This work involves clustering and classification of very big datasets (so-called "big data") using vertical structuring (pTrees). In a nutshell, our techniques structure the dataset into compressed vertical strips (bit slices) and process across those vertical bit slices using logical operations. This is in contrast to the traditional method of structuring the data into horizontal records and then processing down those records. Thus, we do horizontal processing of vertical data (HPVD) rather than the traditional vertical processing of horizontal data (VPHD). We will take a much more detailed look at the way we structure data and the way we process those structures a little later, but first, I want to talk about big data. How big is big data these days, and how big will it get? An example:
Slide 3
The US Library of Congress
is storing EVERY tweet sent since Twitter launched in 2006. Each tweet record contains fifty fields; let's assume a tweet record is 1000 bits wide. The US LoC will record 172 billion tweets from 500 million tweeters in 2013 alone. So let's estimate approximately 1 trillion tweets (10^12) from 1 billion tweeters (10^9) to 1 billion tweetees (10^9) over 2006 to 2016. As a full 3-dimensional matrix (tweet x tweeter x tweetee), that's 10^12 * 10^9 * 10^9 = 10^30 matrix cells. Even if only the sender is recorded, that's 10^12 * 10^9 = 10^21 cells.
Slide 4
Let’s look at how the definition of “big data” has evolved just over my work lifetime.
I started as THE technician at the St. John's University IBM 1620 Computer Center (circa 1964). I did the following:
1. I turned the 1620 switch on.
2. I waited for the ready light bulb to come on (~15 minutes).
3. I put the O/S punch card stack on the card reader (~4 in. high).
4. I put the FORTRAN compiler card stack on the reader (~3 in.).
5. I put the FORTRAN program card stack on the reader (~2 in.).
6. The 1620 produced an object code stack (~1 in.).
7. I read in the object stack and a 1964 big data stack (~40 in.).
The 1st FORTRAN upgrade allowed for a "continue" card so that the data stack could be read in segments (and I could sit down).
Slide 5
How high would a 2013 big data stack of Hollerith punch cards reach?
Let's be conservative and just assume an exabyte of data on punch cards. How high is an exabyte on punch cards? We're being conservative because the US LoC full matrix of tweet data would be ~10^30 bytes (and an exabyte is a mere 10^18).
Slide 6

An exabyte of punch cards would reach JUPITER! So, in my work lifetime, "big data" has gone from 40 inches high all the way to Jupiter. What will happen to big data over my grandson, Will-the-2nd's, lifetime?
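As a rough sanity check of the Jupiter claim, here is back-of-envelope arithmetic in Python. The card capacity (80 bytes per 80-column card) and thickness (0.007 in.) are standard Hollerith-card figures, not numbers from the slides, so the result is order-of-magnitude only:

```python
# Back-of-envelope: height of an exabyte stored on Hollerith punch cards.
# Assumed figures (not from the slides): 80 bytes per 80-column card,
# 0.007 inches of thickness per card.
BYTES_PER_CARD = 80
CARD_THICKNESS_IN = 0.007
EXABYTE = 10**18  # bytes

cards = EXABYTE / BYTES_PER_CARD                        # ~1.25e16 cards
height_km = cards * CARD_THICKNESS_IN * 0.0254 / 1000   # inches -> km

print(f"{cards:.2e} cards, stack ~{height_km:.2e} km high")
# ~2e9 km; the Earth-Jupiter distance is roughly 6e8 to 1e9 km, so the
# stack is indeed on the scale of the outer planets.
```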
Slide 7

Will-the-1st (that's me) has to deal with a stack of tweets that will reach Jupiter, but I will replace it losslessly by 1000 extendable vertical pTrees and data mine horizontally across those 1000.

Will-the-2nd will have to deal with a stack of tweets that will reach the end of space, but he will replace it losslessly by 1000 extendable vertical pTrees and data mine horizontally across those 1000.

Will-the-3rd will have to deal with a stack of tweets that will create new space, but he will replace it losslessly by 1000 extendable vertical pTrees and data mine horizontally across those 1000. Will-the-3rd can use Will-the-2nd's code (which is Will-the-1st's code).

Will-the-3rd's DATA WILL HAVE TO BE COMPRESSED and will have to be VERTICALLY structured (I believe). Let's see how compressed vertical data structuring might be done.
Slide 8

Predicate Trees (pTrees): slice the table by column (4 vertical structures), then vertically slice off each bit position (12 vertical structures), and compress each bit slice into a tree using a predicate.

Take T(A1, A2, A3, A4), a table of 8 records with 3-bit attribute values (in base 10 and base 2), giving the bit slices T11, T12, T13, ..., T41, T42, T43. We walk through the compression of T11 into its pTree, P11, recording the truth of the predicate "purely 1-bits" on halves, recursively, until a half is pure:

1. Whole slice pure1? false → 0
2. Left half pure1? false → 0
3. Right half pure1? false → 0
4. Left half of right half pure1? false → 0, but it's pure0, so this branch ends
5. Right half of right half pure1? true → 1

More typically, we compress strings of bits rather than single bits (e.g., 64-bit strings, or "strides").

Example: find the number of occurrences of (7,0,1,4). For horizontally structured, record-oriented data, one scans vertically; imagine an excillion records, not just 8 (we need speed!). Using vertical pTrees, count (7,0,1,4)s with one AND expression:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

where P' denotes the complement pTree. The 1-bit count of the result, accumulated level by level with weights 2^3, 2^2, 2^1, 2^0, gives the answer: 2.
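The following is a minimal Python sketch of these two ideas: compressing one bit column into a pure1-predicate tree, and counting (7,0,1,4) occurrences by ANDing (complemented) bit slices. The names are illustrative (not Treeminer's actual API), and the 8-row table is hypothetical since the slide's table isn't fully recoverable; only rows 0 and 2 equal (7,0,1,4), so the count is 2, as on the slide.

```python
def build_ptree(bits):
    """Compress a bit column: record 'purely 1-bits?' on halves, recursively.
    Pure halves end the recursion (a pure0 branch simply stops)."""
    if all(bits):
        return ("pure1", len(bits))
    if not any(bits):
        return ("pure0", len(bits))
    mid = len(bits) // 2
    return ("mixed", build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(node):
    """1-bit count computed from the tree, without touching raw bits."""
    if node[0] == "pure1":
        return node[1]
    if node[0] == "pure0":
        return 0
    return root_count(node[1]) + root_count(node[2])

# Hypothetical 8-row table T(A1, A2, A3, A4), 3-bit values per attribute.
T = [(7,0,1,4), (3,5,1,4), (7,0,1,4), (2,7,0,1),
     (6,1,6,5), (5,2,3,7), (7,6,2,4), (4,3,5,6)]

def bit_slice(rows, attr, bit):
    return [(r[attr] >> bit) & 1 for r in rows]

# Count occurrences of (7,0,1,4): AND each needed slice or its complement.
target = (7, 0, 1, 4)
mask = [1] * len(T)
for a in range(4):
    for b in (2, 1, 0):
        s = bit_slice(T, a, b)
        want = (target[a] >> b) & 1
        mask = [m & (x if want else 1 - x) for m, x in zip(mask, s)]

print(root_count(build_ptree(mask)))  # -> 2
```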
Slide 9

FAUST = Functional Analytic Unsupervised and Supervised data mining Technology.

CLUSTERing (unsupervised): use functional gap analysis to find round clusters. Initially let C = the entire table, X (e.g., X ≡ {p1, ..., pf}, an image table of 15 pixels). While a cluster C is insufficiently dense:

- Find M = Mean(C) (or Vector_of_Medians?).
- Pick a point f in C (e.g., the furthest point from M) and let fM denote the vector from f to M.
- If count(C) / |fM|^2 > DT (a DensityThreshold), C is complete; else split C at each cutpoint P where there is a GT gap (GT = GapThreshold) in the dot-product values C dot d, where d ≡ fM/|fM|.

On the 15-pixel example: C2 = {p5} is complete (a singleton = outlier). C3 = {p6, pf} will split into {p6} and {pf}, which are also outliers (details omitted). That leaves C1 = {p1, p2, p3, p4} and C4 = {p7, p8, p9, pa, pb, pc, pd, pe} still incomplete. C1 is complete (density(C1) ≈ 4/2^2 = 1 > DT = .5?); with f1 = p3, C1 doesn't split. Applying the algorithm to C4: {pa} is an outlier, and the remainder splits into {p9} and {pb, pc, pd}, both complete.

In both cases these are probably the best "round" clusters (accurate?). Speed? What if there are a trillion pixels (not just 15)? All calculations can be done using pTrees, processing across the 8 pTrees regardless of the number of pixels.

Finally, there are many variations of this algorithm, e.g., using M = vector_of_medians instead of the mean, which works better for finding outliers!

[Figures: the 15 pixels p1...pf plotted on a 2-D grid with clusters C1-C4 marked, and a second example, "interlocking horseshoes with an outlier."]
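A minimal sketch of one round of this gap-splitting, in plain Python over lists of points rather than pTrees (the function name and parameter defaults are illustrative; the slides do all of this with pTree logic):

```python
import math

def faust_split(C, GT=8, DT=0.5):
    """Split cluster C at gaps >= GT in the projections onto d = fM/|fM|.
    Returns [C] unchanged if C is already dense enough (count/|fM|^2 > DT)."""
    M = [sum(coord) / len(C) for coord in zip(*C)]   # mean of C
    f = max(C, key=lambda p: math.dist(p, M))        # furthest point from M
    fM = [m - fi for m, fi in zip(M, f)]             # vector from f to M
    norm = math.hypot(*fM)
    if norm == 0 or len(C) / norm**2 > DT:
        return [C]                                   # complete
    d = [c / norm for c in fM]
    proj = lambda p: sum(pi * di for pi, di in zip(p, d))
    pts = sorted(C, key=proj)
    clusters, current = [], [pts[0]]
    for prev, p in zip(pts, pts[1:]):
        if proj(p) - proj(prev) >= GT:               # a GT gap: cut here
            clusters.append(current)
            current = []
        current.append(p)
    clusters.append(current)
    return clusters
```

Repeatedly applying faust_split to each returned cluster until all are complete would reproduce a decomposition like the C1 through C4 one described above.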
Slide 10

Finding FAUST gaps using pTree HPVD (Horizontal Processing of Vertical Data)

With f = p1 and GT = 8 = 2^3, the projection values xofM for the points p1, ..., pf are 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83, ... Form the (uncompressed) pTrees of xofM: one bit slice per bit position, p6, ..., p0, plus their complements p6', ..., p0'.

Gaps are found by ANDing bit-slice pTrees:
- 2^3 gap at [0,7]: but since it is at the front, it doesn't separate clusters.
- 2^3 gap at [40,47], and 2^3 gap at [56,63].
- 2^4 gap at [64,79], and 2^4 gap at [88,103].

OR the pTrees between gap 1 and gap 2 for cluster C1 = {p1, p3, p2, p4}; between gaps 2 and 3 for cluster C2 = {p5}; between gaps 3 and 4 for cluster C3 = {p6, pf}; and OR the rest for cluster C4 = {p7, p8, p9, pa, pb, pc, pd, pe}.

[Table: the x1, x2 coordinates of p1...pf, e.g., pa = (13,4), pb = (10,9), pc = (11,10), pd = (9,11), pe = (11,11), pf = (7,8).]
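As an illustration of how one such gap test can work with bit slices alone, here is a hedged Python sketch: values in [40,47] share the fixed high bits 0101 (bit positions 6 down to 3), so ANDing p6', p5, p4', p3 and counting 1-bits tells whether any projection lands in that interval. Uncompressed bit lists stand in for compressed pTrees, and the helper names are illustrative:

```python
# The 13 projection values recoverable from the slide.
xofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83]

def bit_slice(vals, b):            # vertical bit slice at position b
    return [(v >> b) & 1 for v in vals]

def complement(bits):              # the primed slice, p'
    return [1 - x for x in bits]

def AND(*slices):
    return [min(col) for col in zip(*slices)]

# [40,47] = binary 0101xxx, i.e., bits 6..3 fixed at 0, 1, 0, 1:
p6, p5, p4, p3 = (bit_slice(xofM, b) for b in (6, 5, 4, 3))
in_interval = AND(complement(p6), p5, complement(p4), p3)
print(sum(in_interval))            # 0 -> [40,47] is empty: a 2^3 gap
```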
Slide 11

FAUST Classifier (supervised): PR = P(X dot d < a)

Use a cut hyperplane between classes; cut at the midpoint between means:
D ≡ mRmV (the vector from mR to mV), d = D/|D|, and
a = (mR + (mV - mR)/2) dot d = ((mR + mV)/2) dot d.

Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane cutting the space in two. Classify x: if x dot d < a, classify x as R, else as V. To classify a set, form the set mask pTree for the set of points and apply PR = P(X dot d < a).

We can improve accuracy, e.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class rather than the mean mV: vomV ≡ (median{v1 | v ∈ V}, median{v2 | v ∈ V}, ...).
2. Alternatively, place the cut using a standard-deviation ratio, a = std(R)/(std(R) + std(V)) of the way along the d-line, where std(R) is the standard deviation of the R-points' distances from the origin along the d-line.

[Figure: R and V point sets in dim 1 x dim 2, with mR, mV, vomR, vomV, the d-line, and the cut point a.]
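A minimal sketch of the mean-midpoint version in plain Python (illustrative names; the slides evaluate X dot d < a with pTree operations rather than per-record loops):

```python
def train(R, V):
    """Fit the cut hyperplane: unit direction d from mean(R) to mean(V),
    cut value a = midpoint of the means dotted with d."""
    mR = [sum(c) / len(R) for c in zip(*R)]
    mV = [sum(c) / len(V) for c in zip(*V)]
    D = [v - r for r, v in zip(mR, mV)]
    norm = sum(c * c for c in D) ** 0.5
    d = [c / norm for c in D]
    a = sum(((r + v) / 2) * di for r, v, di in zip(mR, mV, d))
    return d, a

def classify(x, d, a):
    """x dot d < a -> class R, else class V."""
    return "R" if sum(xi * di for xi, di in zip(x, d)) < a else "V"

# Example: two small 2-D classes.
R = [(1, 1), (2, 1), (1, 2)]
V = [(8, 9), (9, 8), (9, 9)]
d, a = train(R, V)
print(classify((2, 2), d, a), classify((8, 8), d, a))  # R V
```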