But, depending on definitions, 3 count=1_thin_intervals

Presentation transcript:

The 37-point dataset (p: x, y): p1 (6,36), p2 (7,39), p3 (8,41), p4 (9,34), p5 (9,38), p6 (10,42), p7 (12,34), p8 (12,38), p9 (13,35), p10 (13,40), p11 (19,38), p12 (25,38), p13 (22,22), p14 (26,16), p15 (26,25), p16 (29,11), p17 (31,18), p18 (32,26), p19 (34,11), p20 (34,23), p21 (35,20), p22 (37,10), p23 (37,23), p24 (38,13), p25 (38,21), p26 (39,24), p27 (40,9), p28 (42,9), p29 (38,39), p30 (38,42), p31 (39,44), p32 (41,41), p33 (41,45), p34 (42,39), p35 (42,43), p36 (44,43), p37 (45,40).

So in this case there are zero gaps (count=0_thin_intervals) on the fM line. But, depending on definitions, there are 3 count=1_thin_intervals, allowing us to declare the points p12, p16 and p18 as anomalies in the first round. My VOM is (34, 35) (my MEAN is (28, 30)), using the point values above. Round 2, etc. are straightforward in this example.

So the two questions are:
1. How to determine count=k_thin_intervals, given a projection line.
2. How to pick a productive projection line from the nearly infinite number of possibilities. (I like to always start with fM in case it reveals the one anomaly of interest right away, but then it gets very difficult, especially in high dimensions.)
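The first question can be prototyped directly without pTrees. Below is a minimal Python sketch that projects the points onto the fM line (from the mean M to the furthest point f) and flags points that sit alone in a fixed-width interval of the projection; the interval `width`, the use of the mean instead of the VOM, and the toy subset of points are illustrative assumptions, not the FAUST pTree implementation.

```python
import numpy as np

def thin_interval_anomalies(X, width=8.0):
    """Project points onto the fM line (mean M to furthest point f), bin the
    projections into fixed-width intervals, and return the indices of points
    that sit alone in an interval (count=1 "thin" intervals)."""
    M = X.mean(axis=0)                       # mean (a VOM could be used instead)
    dists = np.linalg.norm(X - M, axis=1)
    f = X[np.argmax(dists)]                  # furthest point from M
    d = (f - M) / np.linalg.norm(f - M)      # unit vector along the fM line
    proj = (X - M) @ d                       # scalar projections onto the fM line
    bins = np.floor((proj - proj.min()) / width).astype(int)
    counts = np.bincount(bins)
    return [i for i, b in enumerate(bins) if counts[b] == 1]

# Toy usage on a few of the slide's points (made-up subset for illustration):
X = np.array([[6, 36], [7, 39], [8, 41], [9, 34], [19, 38], [25, 38], [29, 11]])
print(thin_interval_anomalies(X))
```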

Thin interval finder on the fM line using the scalar pTreeSet PTreeSet(xofM) (the pTree slices of these projection lengths). We are looking for Width=2^4_Count=1_ThinIntervals, i.e., W16_C1_TIs.

The 15-point dataset X (x1, x2): p1 (1,1), p2 (3,1), p3 (2,2), p4 (3,3), p5 (6,2), p6 (9,3), p7 (15,1), p8 (14,2), p9 (15,3), pa (13,4), pb (10,9), pc (11,10), pd (9,11), pe (11,11), pf (7,8).

Projection lengths xofM (as extracted): 11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83 (the values referenced below are p1ofM = 11, p4ofM = 34, p5ofM = 53).

[Figure: the 2-D scatter of p1..pf on a 16x16 grid, and the bit-slice pTrees p6..p0 and their complements p6'..p0' of xofM, from which the interval counts C are read.]

W=2^4_C=1_TI [000 0000, 000 1111] = [0, 16): we check how close p1ofM is to the boundaries 0 and 16 (distance 5, too close), so p1 is not declared an anomaly.
W=2^4_C=1_TI [010 0000, 010 1111] = [32, 48): the distance of p4ofM to a boundary point is 2, so p4 is not declared an anomaly.
W=2^4_C=1_TI [011 0000, 011 1111] = [48, 64): the distance of p5ofM to p4ofM or to the boundary point 64 is 11, so p5 is an anomaly and we cut through p5.
W=2^4_C=1_TI [100 0000, 100 1111] = [64, 80): ordinarily we would cut through the interval midpoint, but in this case it is unnecessary since it would duplicate the p5 cut.
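A simplified reading of this check can be sketched in a few lines: a point alone in its width-2^k interval (the interval index is just the high-order bits of xofM) is declared an anomaly only if its gap to the nearest other projection value is large. The `gap_thresh` of one interval width is an assumption; the slide eyeballs distances to interval boundaries and neighboring points instead.

```python
import numpy as np

def c1_thin_interval_anomalies(xofM, k=4, gap_thresh=16):
    """Flag values that are alone in their width-2^k interval and whose gap
    to the nearest other projection value is at least gap_thresh."""
    x = np.sort(np.asarray(xofM, dtype=int))
    buckets = x >> k                            # high-order bits pick the interval
    anomalies = []
    for i, (v, b) in enumerate(zip(x, buckets)):
        if np.sum(buckets == b) != 1:           # interval is not thin (count != 1)
            continue
        gaps = [abs(v - x[j]) for j in (i - 1, i + 1) if 0 <= j < len(x)]
        if min(gaps) >= gap_thresh:
            anomalies.append(int(v))
    return anomalies

xofM = [11, 27, 23, 34, 53, 80, 118, 114, 125, 110, 121, 109, 83]
print(c1_thin_interval_anomalies(xofM))   # [53] -> p5, as on the slide
```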

1. MapReduce FAUST. Current_Relevancy_Score=9. Killer_Idea_Score=2.
Nothing comes to mind as to what we would do here. MapReduce/Hadoop is a key-value approach to organizing complex BigData. In FAUST PREDICT/CLASSIFY we start with a Training TABLE, and in FAUST CLUSTER/ANOMALIZER we start with a vector space. Mark suggests (my understanding) capturing pTreeBases as Hadoop/MapReduce key-value bases? I suggested to Arjun developing XML to capture Hadoop datasets as pTreeBases. The former is probably wiser. A wish list of great things that might result would be a good start.

2. pTree Text Mining. Current_Relevancy_Score=10. Killer_Idea_Score=9.
I think Oblique FAUST is the way to do this. Also there is the very new idea of capturing the reading sequence, not just the term-frequency matrix (lossless capture) of a corpus.

3. FAUST CLUSTER/ANOMALIZER. Current_Relevancy_Score=9. Killer_Idea_Score=9.
No one has taken up the proof that this is a breakthrough method. The applications are unlimited!

4. Secure pTreeBases. Current_Relevancy_Score=9. Killer_Idea_Score=10.
This seems straightforward and a certainty (to be a killer advance)! It would involve becoming the world expert on what data security really means and how it has been done by others, and then comparing our approach to theirs. Truly a complete career is waiting for someone here!

5. FAUST PREDICTOR/CLASSIFIER. Current_Relevancy_Score=9. Killer_Idea_Score=10.
No one has done a complete analysis showing that this is a breakthrough method. The applications are unlimited here too!

6. pTree Algorithmic Tools. Current_Relevancy_Score=10. Killer_Idea_Score=10.
This is Md's work. Expanding the algorithmic tool set to include quadratic tools and even higher-degree tools is very powerful. It helps us all!

7. pTree Alternative Algorithm Implementations. Current_Relevancy_Score=9. Killer_Idea_Score=8.
This is Bryan's work. Implementing pTree algorithms in hardware/firmware (e.g., FPGAs) - orders of magnitude performance improvement?

8. pTree O/S Infrastructure. Current_Relevancy_Score=10. Killer_Idea_Score=10.
This is Matt's work. I don't yet know the details, but Matt, under the direction of Dr. Wettstein, is finishing up his thesis on this topic - such changes as very large page sizes, cache sizes, prefetching, ... I give it a 10/10 because I know the people - they do double-digit work always!

From: Arjun.Roy@my.ndsu.edu. Sent: Thursday, Aug 09.
Dear Dr. Perrizo, do you think a MapReduce class of FAUST algorithms could be built into a thesis? If the ultimate aim is to process big data, modification of existing pTree-based FAUST algorithms on the Hadoop framework could be something to look at? I am myself not sure how far I can go, but if you approve, then I can work on it.

From: Mark to Arjun. Aug 9.
From an industry perspective, Hadoop is king (at least at this point in time). I believe vertical data organization maps really well with a map/reduce approach - these are complementary, as Hadoop is organized more for unstructured data, so these topics are not mutually exclusive. So from the industry side I'd vote Hadoop... from the Treeminer side, text (although we are very interested in both).

From: msilverman@treeminer.com. Sent: Friday, Aug 10.
I'm working through a list of what we need to get done - it will include implementing anomaly detection, which has been on my list for some time.
I tried to establish a number of things such that even if we had some difficulties with some parts we could show others (without digging ourselves in too deep). Once I get this I'll get a call going. I have another programming resource down here who's been working with me on our production code and who will also be picking up some of the work to get this across the finish line, and I also have someone who was previously a director at our customer assisting us in packaging it all up so the customer will perceive value received... I think Dale sounded happy yesterday.

pTree Text Mining data cube layout. The corpus is stored as a three-level pTree cube over (position, document, term):

Level-0: corpusP, one bit per (position, document, term) triple; length = MaxDocLen * DocCt * VocabLen.
Level-1 (length = DocCt * VocabLen): the term-frequency bit slices tfP0, tfP1, ... (e.g., the predicate for tfP0: mod(sum over an mdl-stride, 2) = 1) and the term-existence pTrees tePt, one per term (e.g., tePt=a, tePt=again, tePt=all), obtained by rolling up the position dimension of level-0.
Level-2 (length = VocabLen): the document-frequency count df with its bit slices dfP0, ..., dfP3, obtained with the predicate pure1 on tfP1 with a document stride.

Other masks and measures noted on the slide:
- ptf, positional term frequency: the frequency of each term in each position across all documents (is this any good?).
- Reading-position masks (e.g., Preface, commas, References for d=1) move us up the position semantic hierarchy and allow punctuation etc. to be placed.
- Document-category masks (e.g., a Math book mask, Library of Congress masks) move us up the document semantic hierarchy.

[Figure: the corpus pTreeSet cube for a toy vocabulary (a, again, all, always, an, and, apple, April, are, ...) across documents JSE, HHS, LMM and positions 1-7, showing the level-0 corpusP bits, the level-1 tf/te pTrees, and the level-2 df slices.]
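The level-0 to level-1 to level-2 rollups are just sums along the position and document dimensions. A minimal sketch, assuming a tiny made-up corpus (the documents and vocabulary below are illustrative, not the slide's data), with pTrees replaced by ordinary bit arrays:

```python
import numpy as np

# Toy corpus (assumed): three short "documents" over a tiny vocabulary.
docs = {"d1": "a again a all", "d2": "all all again", "d3": "a apple"}
vocab = sorted({t for text in docs.values() for t in text.split()})
mdl = max(len(text.split()) for text in docs.values())        # MaxDocLen

# Level-0: corpusP[doc, term, position] = 1 iff that term occupies that position.
corpusP = np.zeros((len(docs), len(vocab), mdl), dtype=np.uint8)
for d, text in enumerate(docs.values()):
    for pos, term in enumerate(text.split()):
        corpusP[d, vocab.index(term), pos] = 1

tf = corpusP.sum(axis=2)        # Level-1: term frequency per (doc, term)
te = (tf > 0).astype(np.uint8)  # Level-1: term existence per (doc, term)
df = te.sum(axis=0)             # Level-2: document frequency per term

print(dict(zip(vocab, df)))     # {'a': 2, 'again': 2, 'all': 2, 'apple': 1}
```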

In this slide section the vocabulary is reduced to content words (8 of them): mdl=5, vocab={baby, cry, dad, eat, man, mother, pig, shower}, VocabLen=8, and 11 of the 15 documents survive the content-word reduction (d=04, 05, 08, 09, 27, 29, 46, 53, 54, 71, 73).

[Figure: the First Content Word Mask (FCWM) and its Level-1 rollup; the Level-1 pTrees (position of level-0 rolled up) for each surviving document over POSITION 1-5; and the Level-2 pTrees (document of level-1 rolled up) giving te, tf with its bit slices tf1, tf0, and df with its bit slices df1, df0 over the 8-term vocabulary.]

Level-0 is ordered by position, then document, then vocab term. The per-(term, document) counts, with the tf bit slices (tf1, tf0) and the term-existence bit (te), as shown on the slide (pairs shown with tf = 0 are omitted here):

term    doc     tf  tf1 tf0 te
baby    09HBD   1   0   1   1
baby    27CBC   1   0   1   1
cry     27CBC   2   1   0   1
cry     46TTP   1   0   1   1
dad     29LFW   1   0   1   1
eat     04LMM   1   0   1   1
eat     08JSC   2   1   0   1
man     05HDS   1   0   1   1
man     53NAP   1   0   1   1
pig     46TTP   2   1   0   1
pig     54BOF   1   0   1   1
shower  71MWA   1   0   1   1
shower  73SSW   1   0   1   1

Each of the 11 documents (04LMM Little Miss Muffet, 05HDS, 08JSC, 09HBD, 27CBC, 29LFW, 46TTP, 53NAP, 54BOF, 71MWA, 73SSW) contributes 5 reading positions at level-0. Rolling up positions gives level-1 (tf and te per document and term); rolling up documents gives level-2 (df per term, with its bit slices df1, df0).
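The te and tf1/tf0 columns are derived from tf alone: te is just tf > 0, and tf1, tf0 are the bits of tf (two slices suffice here since tf is at most 2). A small sketch of that derivation, assuming the table above as plain Python data:

```python
# Derive the term-existence bit and the tf bit slices from a tf count,
# matching the te / tf1 / tf0 columns in the table above.
def tf_slices(tf: int) -> dict:
    return {
        "tf": tf,
        "tf1": (tf >> 1) & 1,   # high-order bit of tf
        "tf0": tf & 1,          # low-order bit of tf
        "te": int(tf > 0),      # term-existence bit
    }

print(tf_slices(2))  # {'tf': 2, 'tf1': 1, 'tf0': 0, 'te': 1}, e.g. (cry, 27CBC)
print(tf_slices(1))  # {'tf': 1, 'tf1': 0, 'tf0': 1, 'te': 1}
```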

FAUST = Fast, Accurate Unsupervised and Supervised Teaching (teaching big data to reveal information).

FAUST CLUSTER-fmg (furthest-to-mean gaps, for finding round clusters):
C = X (e.g., X ≡ {p1, ..., pf}, the 15-pixel dataset).
While an incomplete cluster C remains:
  find M ≡ Medoid(C) (Mean or Vector_of_Medians or ?);
  pick f, the point of C furthest from M, from S ≡ ScalarPTreeSet(D(x,M)) (e.g., HOBbit furthest f: take any point from the highest-order S-slice);
  if ct(C)/dis^2(f,M) > DT (DensityThreshold), C is complete;
  else split C wherever P ≡ PTreeSet((x-M)o(f-M)/|f-M|) has a gap > GT (GapThreshold).
End While.
Notes: a. Euclidean or HOBbit furthest. b. use fM/|fM|, or just fM, in P. c. find gaps by sorting P, or by an O(log n) pTree method?

The 15-point dataset (x1, x2): p1 (1,1), p2 (3,1), p3 (2,2), p4 (3,3), p5 (6,2), p6 (9,3), p7 (15,1), p8 (14,2), p9 (15,3), pa (13,4), pb (10,9), pc (11,10), pd (9,11), pe (11,11), pf (7,8) - two interlocking horseshoes with an outlier.

Result on this dataset: C2 = {p5} is complete (a singleton = outlier). C3 = {p6, pf} will split (details omitted), so {p6} and {pf} are complete (outliers). That leaves C1 = {p1, p2, p3, p4} and C4 = {p7, p8, p9, pa, pb, pc, pd, pe} still incomplete. C1 is dense (density(C1) ≈ .5 > DT = .3?), thus C1 is complete (f1 = p3, and C1 doesn't split). Applying the algorithm to C4: {pa} is an outlier, and the rest splits into {p9} and {pb, pc, pd}, which are complete. In both cases those probably are the best "round" clusters, so the accuracy seems high. The speed will be very high!

[Figure: the 15-point scatter (interlocking horseshoes with an outlier), the medoids M0 = (8.3, 4.2) and M1 = (6.3, 3.5), and the distances D(x, M0) used to pick the furthest points.]
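The control flow of CLUSTER-fmg can be prototyped without pTrees. A minimal sketch, assuming the mean as the medoid, splitting only at the single largest gap per pass (the slide splits at every gap > GT; reprocessing the subclusters gets the same effect), and illustrative thresholds:

```python
import numpy as np

def faust_cluster_fmg(X, dens_thresh=0.3, gap_thresh=4.0):
    """Sketch of FAUST CLUSTER-fmg (furthest-to-mean gaps) over plain arrays."""
    complete, work = [], [np.arange(len(X))]           # clusters as index arrays
    while work:
        C = work.pop()
        M = X[C].mean(axis=0)                          # medoid (mean here)
        d2 = ((X[C] - M) ** 2).sum(axis=1)
        f = X[C[np.argmax(d2)]]                        # furthest point from M
        if len(C) / max(d2.max(), 1e-9) > dens_thresh: # dense enough -> complete
            complete.append(C)
            continue
        u = (f - M) / np.linalg.norm(f - M)
        proj = (X[C] - M) @ u                          # projections on the fM line
        order = np.argsort(proj)
        gaps = np.diff(proj[order])
        cut = np.argmax(gaps)                          # split at the largest gap
        if gaps[cut] <= gap_thresh:                    # no usable gap -> complete
            complete.append(C)
            continue
        work += [C[order[:cut + 1]], C[order[cut + 1:]]]
    return complete

X = np.array([[1,1],[3,1],[2,2],[3,3],[6,2],[9,3],[15,1],[14,2],[15,3],
              [13,4],[10,9],[11,10],[9,11],[11,11],[7,8]], dtype=float)
for c in faust_cluster_fmg(X):
    print(sorted(c.tolist()))
```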

FAUST Oblique: PR = P(X o d) < a, where the d-line D ≡ mRmV is the oblique vector and d = D/|D|.

Separate classR and classV using the midpoint-of-means (mom) method. Calculate a by viewing mR and mV as vectors (mR ≡ the vector from the origin to the point mR):

a = (mR + (mV - mR)/2) o d = ((mR + mV)/2) o d

(The very same formula works when D = mVmR, i.e., when d points to the left.)

Training ≡ choosing the "cut hyperplane" (CHP), which is always an (n-1)-dimensional hyperplane (which cuts the space in two). Classifying is one horizontal program (AND/OR) across pTrees to get a mask pTree for each entire class (bulk classification).

Improve accuracy? E.g., by considering the dispersion within classes when placing the CHP:
1. Use the vector of medians, vom, to represent each class rather than mV: vomV ≡ (median{v1 | v in V}, median{v2 | v in V}, ...).
2. Project each class onto the d-line (e.g., the R class); then calculate the std of those projections (one horizontal formula per class, using Md's method); then use the std ratio to place the CHP (no longer at the midpoint between mR [vomR] and mV [vomV]).

[Figure: the R class (r's) and V class (v's) in the dim 1 x dim 2 plane, with vomR, vomV, the d-line, and the std of the projected distances from the origin along the d-line.]
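The mom training rule and the bulk classification test are one dot product and one comparison each. A minimal NumPy sketch under assumed toy data, with pTrees replaced by ordinary arrays:

```python
import numpy as np

def train_mom(R, V):
    """Midpoint-of-means training: unit vector d along mR->mV and the cut
    value a = ((mR + mV)/2) . d, as in the formula above."""
    mR, mV = R.mean(axis=0), V.mean(axis=0)
    d = (mV - mR) / np.linalg.norm(mV - mR)
    a = ((mR + mV) / 2) @ d
    return d, a

def classify(X, d, a):
    """X . d < a -> class R, else class V (bulk classification in one pass)."""
    return np.where(X @ d < a, "R", "V")

# Toy usage with two made-up blobs (assumed data, not from the slide):
rng = np.random.default_rng(0)
R = rng.normal([1, 1], 0.5, size=(20, 2))
V = rng.normal([4, 3], 0.5, size=(20, 2))
d, a = train_mom(R, V)
print(classify(np.array([[1.0, 1.2], [4.2, 2.8]]), d, a))   # ['R' 'V']
```

Replacing the means with the vectors of medians (vomR, vomV), or shifting `a` by the ratio of the per-class standard deviations of the projections, gives the dispersion-aware variants described in points 1 and 2 above.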