Published by Malcolm Sparks. Modified over 6 years ago.
1
1. MapReduce FAUST: Current_Relevancy_Score=9, Killer_Idea_Score=2
Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases. I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.

2. pTree Text Mining: Current_Relevancy_Score=10, Killer_Idea_Score=9
I think Oblique FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture) of a corpus.

3. FAUST CLUSTER/ANOMALIZER: Current_Relevancy_Score=9, Killer_Idea_Score=9
No one has taken up the proof that this is a breakthrough method. The applications are unlimited!

4. Secure pTreeBases: Current_Relevancy_Score=9, Killer_Idea_Score=10
This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!

5. FAUST PREDICTOR/CLASSIFIER: Current_Relevancy_Score=9, Killer_Idea_Score=
No one has done a complete analysis showing that this is a breakthrough method. The applications are unlimited here too!

6. pTree Algorithmic Tools: Current_Relevancy_Score=10, Killer_Idea_Score=
This is Md's work. Expanding the algorithmic tool set to include quadratic tools and even higher-degree tools is very powerful. It helps us all!

7. pTree Alternative Algorithm Impl: Current_Relevancy_Score=9, Killer_Idea_Score=
This is Bryan's work. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs) - orders of magnitude performance improvement?

8. pTree O/S Infrastructure: Current_Relevancy_Score=10, Killer_Idea_Score=
This is Matt's work. I don't yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic: such changes as very large page sizes, cache sizes, prefetching,... I give it a 10/10 because I know the people. They always do double-digit work!

From: Sent: Thurs, Aug
Dear Dr. Perrizo, Do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, could modification of existing P-tree based FAUST algorithms on the Hadoop framework be something to look into? I am myself not sure how far I can go, but if you approve, then I can work on it.

From: Mark to: Arjun, Aug 9
From an industry perspective, Hadoop is king (at least at this point in time). I believe vertical data organization maps really well onto a map/reduce approach. These are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I'd vote Hadoop... from the Treeminer side, text (although we are very interested in both).

From: Sent: Friday, Aug 10
I'm working through a list of what we need to get done. It will include implementing anomaly detection, which has been on my list for some time. I tried to establish a number of things such that even if we had some difficulties with some parts we could show others (w/o digging us too deep). Once I get this I'll get a call going. I have another programming resource down here who's been working with me on our production code and who will also be picking up some of the work to get this across the finish line, and I also have someone who was previously a director at our customer assisting us in packaging it all up so the customer will perceive value received... I think Dale sounded happy yesterday.
2
pTree Text Mining data Cube layout

[Figure: Corpus pTreeSet data cube with three axes: Vocab Terms (a, again, all, always, an, and, apple, April, are, ...), document (JSE, HHS, LMM, ...) and reading Position (1 2 3 4 5 6 7 ...).]

lev0: corpusP (len=MaxDocLen*DocCt*VocabLen), one bit per (term, document, reading position), e.g. t=a d=1, t=a d=2, t=a d=3, t=again d=1, ...
lev1 (len=DocCt*VocabLen): te (term existence) pTrees tePt=a, tePt=again, tePt=all, ... and tf (term frequency) bit slices tfPk (tfP0, tfP1, ...). E.g., pred for tfP0: mod(sum(mdl-stride),2)=1.
lev2 (len=VocabLen): df count, held as bit slices dfP0 ... dfP3; hdfP with pred=pure1 on tfP1, stride 1.

Masks:
Math book mask, Library of Congress masks (document categories) move us up the document semantic hierarchy.
Reading position masks (position categories, e.g. Preface, commas, References) move us up the position semantic hierarchy (and allow placement of punctuation, etc.).

ptf: positional term frequency. The frequency of each term in each position across all documents (Is this any good?).
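The roll-up from lev0 to lev1 to lev2 can be sketched with plain arrays. This is only an illustration under stated assumptions: the three-term vocabulary and the three tiny documents are hypothetical, and uncompressed NumPy arrays stand in for real pTrees. It shows te, tf, the tf bit slices (with slice 0 equal to the parity of the per-document sum, as in the tfP0 predicate) and df.

```python
import numpy as np

# Hypothetical toy corpus: every word is already a vocabulary term.
vocab = ["a", "again", "all"]
docs = {
    "d1": ["a", "a", "all"],
    "d2": ["again", "a"],
    "d3": ["all", "all", "all", "a"],
}
max_len = max(len(words) for words in docs.values())

# lev0: one bit per (term, doc, reading position).
corpus = np.zeros((len(vocab), len(docs), max_len), dtype=np.int64)
for j, words in enumerate(docs.values()):
    for p, w in enumerate(words):
        corpus[vocab.index(w), j, p] = 1

# lev1: roll up the position axis.
tf = corpus.sum(axis=2)                     # term frequency per (term, doc)
te = (tf > 0).astype(np.int64)              # term existence per (term, doc)
n_bits = int(tf.max()).bit_length()
tf_slices = [(tf >> k) & 1 for k in range(n_bits)]   # tf_slices[k] plays the role of tfPk

# lev2: roll up the document axis.
df = te.sum(axis=1)                         # document frequency per term
```

Storing tf as bit slices rather than integers is what lets later steps work with AND/OR style operations on the slices; here the slices are simply derived from the integer counts.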
3
SVD of the term-by-document matrix X

Documents:
c1 Human machine interface for Lab ABC computer apps
c2 A survey of user opinion of comp system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey

Terms (rows of X): human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors. Columns of X: c1 c2 c3 c4 c5 m1 m2 m3 m4.

$X = T_0 S_0 D_0^T$, with $T_0$, $D_0$ column-orthonormal.
Singular values on the diagonal of $S_0$: 3.34 2.54 2.35 1.64 1.50 1.31 0.85 0.56 0.36.
$\hat{X}$ keeps only the first 2 singular values (S = diag(3.34, 2.54)). The corresponding T, D columns give term and doc coordinates in 2D, e.g. human (.22, -.11), interface (.2, -.07).
$X \approx \hat{X} = T S D^T$

[Figure: the matrices T0, S0, D0, the reduced T, S, D, and the resulting X and X^ tables.]
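The decomposition on this slide can be reproduced with NumPy as a rough sketch. The count matrix below is an assumption: it was transcribed by hand from the nine titles on this slide (one count per title word that is in the term list, reading "comp" as computer), because the slide's own X table is not legible here. Variable names are illustrative.

```python
import numpy as np

terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

# Rows = terms, columns = docs (hand-transcribed counts, see lead-in).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Thin SVD: X = T0 @ diag(s0) @ D0t, singular values in descending order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)

k = 2                                  # keep the two largest singular values
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T                    # rank-k approximation of X

print(np.round(s0, 2))                 # compare with the S0 values on the slide
```

The signs of the columns of T and D returned by a numerical SVD are arbitrary, so 2D coordinates may come out mirrored relative to a published figure without changing any distances or dot products.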
4
Classifying a query q in the full term space (X, no SVD)

[Table: doc\term matrix X with term columns human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; rows c1 ... c5, m1 ... m4, the class means mc and mm, and the query q.]

O'FAUST: D = mc→mm, d = D/|D|, a = (mc+mm)/2 ∘ d, and q ∘ d is compared with a. Since .65 is far less than a =~ 205, q is clearly in the c class.

Distances d(doc,q), from (c1-q)^2, (c2-q)^2, ..., (c5-q)^2, (m1-q)^2, ..., (m4-q)^2: this tells us c1 is closest to q in the full space, but that the other c documents are no closer than the m documents. q is probably classified c (one voter in the 1.5 nbhd), but it's not clear. This shows the need for SVD or Oblique FAUST!
5
Classifying the query in the SVD-transformed space (X^)

[Table: doc\term matrix X^ with term columns human, interface, computer, user, system, response, time, EPS, survey, trees, graph, minors; rows c1 ... c5, m1 ... m4, the c mean mc, the m mean mm, and q^ = TSDq'.]

d(doc,q^):
c1 0.02
c2 1.47
c3 0.90
c4 1.23
c5 0.50
m1 0.92
m2 1.40
m3 1.82
m4 1.55

Using knn, this SVD-transformed picture puts q clearly in the c class:
dis=.25 nbrs {c1}
dis=.5 nbrs {c1,c5}
dis=.9 nbrs {c1,c3,c5}
dis=1 nbrs {c1,c3,c5, m1}
dis=1.25 nbrs {c1,c3,c4,c5, m1}
dis=1.5 nbrs {c1,c2,c3,c4,c5, m1,m2}
Un-SVD transformed, it's not conclusive (i.e., using X instead of X^).

O'FAUST: D = mc→mm, d = D/|D|, a = (mc+mm)/2 ∘ d, and q ∘ d is compared with a. (2.61 << a, so q is classified as c. And we note that O'FAUST is more conclusive with X than it is with X^.)
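The neighbor sets on this slide follow directly from the nine distances. The sketch below assumes each distance belongs to the document label that follows it in the extracted text (0.02 for c1, 1.47 for c2, and so on), which is the pairing that reproduces the listed neighborhoods; the function names are illustrative only.

```python
# Distances d(doc, q^) as read from the slide (pairing is an assumption, see above).
dist = {"c1": 0.02, "c2": 1.47, "c3": 0.90, "c4": 1.23, "c5": 0.50,
        "m1": 0.92, "m2": 1.40, "m3": 1.82, "m4": 1.55}

def neighbors(radius):
    """Documents whose distance to q^ is at most `radius`."""
    return sorted(doc for doc, d in dist.items() if d <= radius)

def vote(radius):
    """Majority class in the neighborhood, with the c and m vote counts."""
    nbrs = neighbors(radius)
    c_votes = sum(doc.startswith("c") for doc in nbrs)
    m_votes = len(nbrs) - c_votes
    winner = "c" if c_votes > m_votes else "m" if m_votes > c_votes else "tie"
    return winner, c_votes, m_votes

for r in (0.25, 0.5, 0.9, 1.0, 1.25, 1.5):
    print(r, neighbors(r), vote(r))
```

At every radius on the slide the c documents outvote the m documents, which is the sense in which the transformed picture is conclusive.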
6
I have put together a pBase of 75 Mother Goose Rhymes or Stories. Created a pBase of the 15 documents with 30 words (Universal Document Length, UDL), using as vocabulary all white-space separated strings.

[Figure: Lev-0 pTrees over reading positions (pos 1 2 3 ... 182) for two documents, "Little Miss Muffet sat on a tuffet eating of curds and whey. There came a big spider and sat down..." and 05HDS "Humpty Dumpty sat on a wall. Humpty Dumpty ...", each with its Lev1 (term freq/exist) pTrees te, tf, tf1, tf0. The VOCAB axis runs: a, again, all, always, an, and, apple, April, are, around, ashes,, away, baby, bark!, beans, beat, bed,, Beggars, begins, beside, between, ..., your. Level-2 pTrees (document frequency): df with slices df3, df2, df1, df0, built from the te columns te04, te05, te08, te09, te27, te29, te34.]

Next we look at using only content words (reduces VocabSize=8 and CorpusSize=11).
7
In this slide section, the vocabulary is reduced to content words (8 of them): mdl=5, vocab={baby, cry, dad, eat, man, mother, pig, shower}, VocabLen=8, and there are 11 docs of the 15 (11 survivors of the content word reduction).

[Figure: First Content Word Mask (FCWM) and the pTree levels for doc=04, 05, 08, 09, 27, 29, 46, 53, 54, 71, 73 over the VOCAB axis (baby, cry, dad, eat, man, mother, pig, shower). Panel labels: Level-0 (POSITION); Level-1 (rolled vocab of level-0); Level-1 (roll up position of level-0) with te, tf, tf1, tf0; Level-2 (roll up document of level-1) with df, df1, df0.]
8
Level-0 (ordered by position, document, then vocab)

[Figure: the three pTree levels for the 8 content words (baby, cry, dad, eat, man, mother, pig, shower) and the 11 documents 04LMM, 05HDS, 08JSC, 09HBD, 27CBC, 29LFW, 46TTP, 53NAP, 54BOF, 71MWA, 73SSW.
Level-0 (ordered by position, document, then vocab): 5 reading positions per document, e.g. for doc=04LMM (Little Miss Muffet). Position labels shown: 04LMM 2 3 4 5; 05HDS 7 8 9 10; 08JSC 12 13 14 15; 09HBD 17 18 19 20; 27CBC 22 23 24 25; 29LFW 27 28 29 30; 46TTP 32 33 34 35; 53NAP 37 38 39 40; 54BOF 42 43 44 45; 71MWA 47 48 49 50; 73SSW 52 53 54 55.
Level-1 (roll up pos): te, tf, tf1, tf0 for each (term, doc) pair.
Level-2 (roll up doc): df, df1, df0 for each term.]
9
Masking FCW

Take a very simple task: clustering the vocabulary by document frequency. Each cluster contains the words that are of relatively equal importance, assuming that the more frequently a term occurs, the less important the term.

In the original data, the clusters in decreasing order of importance are:
{mother, eat}
{shower, pig, man, dad, cry, baby}

In FCW filtered data, the clusters in decreasing order of importance are:
{mother, cry, baby}
{shower, pig, man, eat}

One could argue that the latter is a better clustering. Crying, babies and mothers are strongly associated? Men, pigs, eating and needing a shower are strongly associated? The point of this is to demonstrate (suggest?) that there may be new information in the expanded view of the text corpus that we take (not just starting from the tf matrix but including the reading sequences as well). I'm sure others have considered an "abstract only" or "executive summary only" data mine, but the horizontal structuring does not yield that input readily. Our pTree approach does (just by applying the "abstract" mask).

In the general case, an additional weighting (other than the usual inverse-of-df type weightings of term importance within the corpus) could include the (inverse of the) position number of the first occurrence of the term (normalized), or even the (inverse of the) weighted average of the position numbers (or relative position numbers, since documents have different lengths).

[Figure: te bit columns for the 11 documents (04LMM ... 73SSW) and the df, df1, df0 pTrees over baby, cry, dad, eat, man, mother, pig, shower, before and after applying the FCW mask.]
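A small sketch of the masking idea: document frequency is computed once over all reading positions and once through a position mask that keeps only the first content word of each document, and terms are then grouped by df. The five toy documents are hypothetical (they are not the Mother Goose data), and `doc_freq` and `clusters_by_df` are illustrative names, not part of any pTree library.

```python
from collections import defaultdict

# Hypothetical documents, each a list of content words in reading order.
docs = {
    "d1": ["baby", "cry", "mother"],
    "d2": ["pig", "eat"],
    "d3": ["pig", "baby"],
    "d4": ["man", "eat", "pig"],
    "d5": ["man", "shower"],
}

def doc_freq(docs, position_mask=lambda pos: True):
    """Number of documents containing each term at some unmasked position."""
    df = defaultdict(int)
    for words in docs.values():
        kept = {w for pos, w in enumerate(words) if position_mask(pos)}
        for w in kept:
            df[w] += 1
    return dict(df)

def clusters_by_df(df):
    """Group terms by df, lowest df (assumed most important) first."""
    groups = defaultdict(list)
    for term, count in df.items():
        groups[count].append(term)
    return [(count, sorted(groups[count])) for count in sorted(groups)]

df_full = doc_freq(docs)                                   # all reading positions
df_fcw = doc_freq(docs, position_mask=lambda pos: pos == 0)  # first content word only

print(clusters_by_df(df_full))
print(clusters_by_df(df_fcw))
```

Any other position mask (for example one selecting an abstract or a preface) can be passed in the same way, which is the point the slide makes about masks being cheap to apply.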
10
[Figure: Level-0 position pTrees, Level-1 pTrees (te, tf, tf1, tf0) and Level-2 pTrees (df, df1, df0) for the content words baby, cry, dad, eat, man, mother, pig, shower over the 11 documents 04LMM, 05HDS, 08JSC, 09HBD, 27CBC, 29LFW, 46TTP, 53NAP, 54BOF, 71MWA, 73SSW, with the same position labels as the previous slide (04LMM 2 3 4 5 ... 73SSW 52 53 54 55). The terms dad and shower are drawn in a separate panel from baby, cry, eat, man, mother, pig.]
11
APPENDIX: Latent semantic indexing (LSI)

Latent semantic indexing (LSI) is an indexing and retrieval method that uses singular value decomposition (SVD) to find patterns in the terms and concepts in text. LSI is based on the principle that words that are used in the same contexts tend to have similar meanings. A key LSI feature is its ability to extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts.[1]

LSI overcomes synonymy and polysemy, which cause mismatches in information retrieval [3] and cause Boolean keyword queries to fail.

LSI performs automatic document categorization (assignment of docs to predefined categories based on similarity to the conceptual content of the categories).[5] LSI uses example docs as the conceptual basis for the categories: the concepts in the docs being categorized are compared to the concepts contained in the example items, and a category (or categories) is assigned to the docs based on the similarities between the concepts they contain and the concepts contained in the example docs.

Mathematics of LSI (linear algebra techniques to learn the conceptual correlations in a collection of text): construct a weighted term-document matrix, do a Singular Value Decomposition on it, and use that to identify the concepts contained in the text.

Term Document Matrix, A: each (of m) terms is represented by a row and each (of n) docs is represented by a column, with each matrix cell, $a_{ij}$, initially representing the number of times the associated term appears in the indicated document, $tf_{ij}$. This matrix is usually large and very sparse.

SVD basically reduces the dimensionality of the matrix to a tractable size by finding the singular values. It involves matrix operations and may not be amenable to pTree operations (i.e., horizontal methods are highly developed and may be best). We should study it, though, to see if we can identify a pTree-based breakthrough for creating the reduction that SVD achieves. Here is a good paper on the subject of LSI and SVD:

SVD: Let X be the t by d TermFrequency (tf) matrix. It can be decomposed as $X = T_0 S_0 D_0^T$, where $T_0$ and $D_0$ have orthonormal columns and $S_0$ has only the singular values on its diagonal, in descending order. Remove from $T_0, S_0, D_0$ the rows and columns of all but the highest k singular values, giving T, S, D. Then $X \approx \hat{X} \equiv T S D^T$ ($\hat{X}$ is the rank-k matrix closest to X). We have reduced the dimension from rank(X) to k, and we note that $\hat{X}\hat{X}^T = T S^2 T^T$ and $\hat{X}^T\hat{X} = D S^2 D^T$.

There are three sorts of comparisons of interest:
1. terms (how similar are terms i and j?) (comparing rows)
2. documents (how similar are documents i and j?) (comparing columns)
3. terms and documents (how associated are term i and doc j?) (examining individual cells)

Comparing terms (how similar are terms i and j?) (comparing rows): the dot product between two rows of $\hat{X}$ reflects their similarity (similar occurrence pattern across the documents). $\hat{X}\hat{X}^T$ is the square t x t symmetric matrix containing all these dot products. $\hat{X}\hat{X}^T = T S^2 T^T$ means the ij cell of $\hat{X}\hat{X}^T$ is the dot product of rows i and j of TS (the rows of TS can be considered coordinates of terms).

Comparing documents (how similar are documents i and j?) (comparing columns): the dot product of two columns of $\hat{X}$ reflects their similarity (the extent to which two documents have a similar profile of terms). $\hat{X}^T\hat{X}$ is the square d x d symmetric matrix containing all these dot products. $\hat{X}^T\hat{X} = D S^2 D^T$ means the ij cell of $\hat{X}^T\hat{X}$ is the dot product of rows i and j of DS (the rows of DS can be considered coordinates of documents).

Comparing a term and a document (how associated are term i and document j?) (analyzing cell i,j of $\hat{X}$): since $\hat{X} = T S D^T$, cell ij is the dot product of the ith row of $T S^{1/2}$ and the jth row of $D S^{1/2}$.
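The three comparison identities can be checked numerically. The sketch below uses a small random count matrix (an arbitrary stand-in, not data from these slides) and verifies that term-term, document-document and term-document comparisons on the rank-k matrix reduce to dot products of coordinate vectors taken from T S, D S, T S^(1/2) and D S^(1/2), one coordinate vector per term or document.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(6, 4)).astype(float)   # t=6 terms, d=4 docs (arbitrary)

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
k = 2
T, D = T0[:, :k], D0t[:k, :].T       # keep the k largest singular values
S = np.diag(s0[:k])
S_half = np.diag(np.sqrt(s0[:k]))
X_hat = T @ S @ D.T

term_coords = T @ S                  # one k-dimensional coordinate vector per term
doc_coords = D @ S                   # one k-dimensional coordinate vector per document

# 1. term-term similarities: X_hat X_hat^T = T S^2 T^T
term_sim = term_coords @ term_coords.T
# 2. doc-doc similarities: X_hat^T X_hat = D S^2 D^T
doc_sim = doc_coords @ doc_coords.T
# 3. term-doc association: cell (i, j) of X_hat
i, j = 3, 1
cell = (T @ S_half)[i] @ (D @ S_half)[j]

print(np.allclose(term_sim, X_hat @ X_hat.T),
      np.allclose(doc_sim, X_hat.T @ X_hat),
      np.isclose(cell, X_hat[i, j]))
```

Because only k-dimensional coordinates are needed, the t x t and d x d similarity matrices never have to be formed from the full matrix.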
12
FAUST = Fast, Accurate Unsupervised and Supervised Teaching (Teaching big data to reveal information)

FAUST CLUSTER-fmg (furthest-to-mean gaps for finding round clusters):
C=X (e.g., X≡{p1, ..., pf} = 15 pix dataset.)
While an incomplete cluster, C, remains:
  find M ≡ Medoid(C) (Mean or Vector_of_Medians or?).
  Pick f∈C furthest from M, from S≡SPTreeSet(D(x,M)) (e.g., HOBbit furthest f, take any from the highest-order S-slice).
  If ct(C)/dis2(f,M) > DT (DensThresh), C is complete,
  else split C where P≡PTreeSet(c∘fM/|fM|) gap > GT (GapThresh).
End While.
Notes: a. Euclidean and HOBbit furthest. b. fM/|fM| and just fM in P. c. Find gaps by sorting P or by an O(log n) pTree method?

Worked example (15 points, interlocking horseshoes with an outlier):
C2={p5} is complete (singleton = outlier). C3={p6,pf} will split (details omitted), so {p6}, {pf} are complete (outliers). That leaves C1={p1,p2,p3,p4} and C4={p7,p8,p9,pa,pb,pc,pd,pe} still incomplete.
C1 is dense (density(C1) = ~4/22=.5 > DT=.3 ?), thus C1 is complete. f1=p3, C1 doesn't split (complete).
Applying the algorithm to C4: {pa} outlier. C2 splits into {p9}, {pb,pc,pd} complete.
In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!

[Figure: two scatter plots of the points p1 ... pf on a hexadecimal grid (axes 1 ... f), with the means M0, M1, M4 and furthest points f marked, and the clusters C1 ... C4. A table lists X (x1, x2) for p1 ... pf with D(x,M0): 2.2 3.9 6.3 5.4 3.2 1.4 0.8 2.3 4.9 7.3 3.8 3.3 1.8 1.5.]
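The loop above can be sketched on ordinary NumPy arrays. This is a rough illustration, not the pTree implementation: it takes M to be the mean, uses Euclidean distance, finds gaps by sorting the projections, and makes one assumption the slide does not state, namely that a cluster with no gap larger than GT is marked complete so the loop always terminates. The nine test points are hypothetical.

```python
import numpy as np

def faust_cluster_fmg(X, dens_thresh, gap_thresh):
    """Return a list of index arrays, one per completed cluster."""
    pending = [np.arange(len(X))]            # incomplete clusters (as row indices)
    done = []
    while pending:
        idx = pending.pop()
        C = X[idx]
        M = C.mean(axis=0)                   # M = Mean(C)
        dist = np.linalg.norm(C - M, axis=1)
        far = dist.argmax()                  # f = point of C furthest from M
        # Complete if a single location or dense: ct(C)/dis^2(f,M) > DT
        if dist[far] == 0 or len(idx) / dist[far] ** 2 > dens_thresh:
            done.append(idx)
            continue
        d = (M - C[far]) / dist[far]         # unit vector fM/|fM|
        proj = C @ d                         # P = c o fM/|fM|
        order = np.argsort(proj)
        cuts = np.flatnonzero(np.diff(proj[order]) > gap_thresh) + 1
        if len(cuts) == 0:                   # assumption: no gap, stop splitting
            done.append(idx)
            continue
        pending.extend(idx[part] for part in np.split(order, cuts))
    return done

# Hypothetical data: two tight squares and one far outlier.
pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                [10, 10], [10, 11], [11, 10], [11, 11],
                [20, 0]], dtype=float)
clusters = faust_cluster_fmg(pts, dens_thresh=1.0, gap_thresh=3.0)
print(sorted(sorted(c.tolist()) for c in clusters))
```

On this toy input the first pass projects everything onto the line from the outlier to the mean, finds two large gaps, and splits into the outlier and the two squares, each of which then passes the density test.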
13
FAUST Oblique

PR = P(X ∘ d) < a, where D ≡ mR→mV is the oblique vector and d = D/|D| gives the d-line.

Separate classR and classV using the midpoints of means (mom) method to calculate a. View mR, mV as vectors (mR ≡ vector from the origin to pt_mR):
$a = (m_R + (m_V - m_R)/2) \circ d = \frac{m_R + m_V}{2} \circ d$
(The very same formula works when D = mV→mR, i.e., points to the left.)

Training ≡ choosing the "cut-hyper-plane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).

Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP. Use
1. the vector_of_medians, vom, to represent each class, rather than the mean mV: vomV ≡ ( median{v1|v∈V}, median{v2|v∈V}, ... );
2. project each class onto the d-line (e.g., the R-class in the figure); then calculate the std of these distances from the origin along the d-line (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[Figure: 2-D scatter (dim 1, dim 2) of r and v points with mR, mV, vomR, vomV, the d-line and the cut point a marked.]
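A minimal sketch of the training and bulk-classification steps, using NumPy arrays in place of pTrees. The midpoint-of-means cut follows the formula on this slide. The std-ratio placement is an assumption: the slide only says the std ratio is used, so the particular interpolation below (cut closer to the class with the smaller spread) is one plausible reading, and the two toy classes are hypothetical.

```python
import numpy as np

def train_oblique(R, V, use_std=False):
    """Return unit vector d and cut value a separating class R from class V."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    D = mV - mR                          # oblique vector from mR to mV
    d = D / np.linalg.norm(D)
    if use_std:
        # Assumed std-ratio rule: split the gap between projected means
        # in proportion sR : sV (undefined if both spreads are zero).
        pR, pV = mR @ d, mV @ d
        sR, sV = (R @ d).std(), (V @ d).std()
        a = pR + (pV - pR) * sR / (sR + sV)
    else:
        a = ((mR + mV) / 2) @ d          # midpoint of means, projected on d
    return d, a

def classify_R(X, d, a):
    """Boolean mask: True where a row of X falls on the R side of the cut."""
    return X @ d < a

# Hypothetical training classes.
R = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
V = np.array([[4, 4], [5, 4], [4, 5], [5, 5]], dtype=float)

d, a = train_oblique(R, V)
d_std, a_std = train_oblique(R, V, use_std=True)
print(a, classify_R(np.vstack([R, V]), d, a))
```

The mask returned by `classify_R` covers all rows at once, which mirrors the bulk classification described on the slide; with equal spreads in both classes the std-ratio cut coincides with the midpoint cut.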