Maximize variance - is it wise?  FAUST CLUSTERING using distance-dominated functional gap analysis.

The workhorse functional is the dot-product projection DPPd(x) = x o d.  Applying d to each row of X gives the projection column X o d = Fd(X) = DPPd(X) = (x1 o d, ..., xN o d).  Writing avg(Xj) for the column average of attribute j, the variance of the projection is

  V(d) ≡ Var(DPPd(X)) = avg((X o d)^2) - (avg(X o d))^2
       = (1/N) Σi=1..N (Σj=1..n xi,j dj)^2  -  (Σj=1..n avg(Xj) dj)^2
       = Σj=1..n (avg(Xj^2) - avg(Xj)^2) dj^2  +  2 Σj<k (avg(XjXk) - avg(Xj)avg(Xk)) dj dk
       = Σj,k ajk dj dk  =  dT o VX o d,   subject to Σi=1..n di^2 = 1,

where VX is the n x n matrix with entries ajk = avg(XjXk) - avg(Xj)avg(Xk).

Which unit vector, d*, provides the best gap(s)?
1. Exhaustive search for d*: step through a grid of d's on the unit sphere (expensive?).
2. Use a heuristic search for d*, avoiding exhaustive search:
   2.1 GRAVY: GRAdient-based Variance optimization, Yet another.
   2.2 XCONDI: maximize the maximum Consecutive Difference (gap).
   2.3 MEDEV: use the d that maximizes |Median(F(X)) - ExpectedValue(F(X))|.  EV is easy to derive as a function of d, but the Median is not (can you do it?).
       MEDEV1: approximate Median(F(X)) by F(VectorOfMedians) = VOM o d.
       MEDEV2: approximate entirely with D = VOM(X) - MEAN(X) and d = D/|D|.

GRAVY gradient ascent: the partial derivatives are dV/d(d1) = 2a11 d1 + Σj≠1 a1j dj, ..., dV/d(dn) = 2ann dn + Σj≠n anj dj; in matrix form, GRADIENT(V) = 2A o d with A = (ajk).  Start with d0 = ek where akk is maximal (or take d0,k = akk, normalized), then iterate d1 ≡ ∇V(d0), d2 ≡ ∇V(d1), ... (renormalized to the unit sphere) until F(dk) stabilizes.

Finding a good unit vector d for the dot-product functional DPP, to maximize gaps: since
  EV(DPPd(X)) = (1/N) Σi=1..N Σj=1..n xi,j dj = Σj=1..n avg(Xj) dj,
maximize |Avg(DPPd(X)) - Median(DPPd(X))| with respect to d, subject to Σi di^2 = 1.  How do we compute Median(DPPd(X))?  We want to use only pTree processing, i.e., a formula in d and precomputed numbers only (like the one above for the mean, which involves only the vector d and the numbers avg(X1), ..., avg(Xn)).  Heuristic: use the Vector of Medians for Median(DPPd(X)).

Example (8 candidate projection sequences of 11 values each; the middle row is the median):

  seq:        1     2     3     4     5     6     7     8
              0     0     0     0     0     0     0     0
              1     0     5     0     0     0     0     0
              2     0     5     2     0     0     0     0
              3     0     5     2     3     0     0     0
              4     0     5     4     3     6     0     0
  median ->   5     0     5     4     3     6     9     0
              6     0     5     6     6     6     9    10
              7     0     5     6     6     6     9    10
              8     0     5     8     6     9     9    10
              9     0     5     8     9     9     9    10
             10    10    10    10    10    10    10    10

  std         3.16  2.87  2.13  3.20  3.35  3.82  4.57  4.98
  variance   10.0   8.3   4.5  10.2  11.2  14.6  20.9  24.8
  mean        5.00  0.91  5.00  4.55  4.18  4.73  5.00  4.55
  avgCD       1.00  1.00  1.00  1.00  1.00  1.00  1.00  1.00   (CD = consecutive difference)
  maxCD       1.00 10.00  5.00  2.00  3.00  6.00  9.00 10.00
  |EV-median| 0.00  0.91  0.00  0.55  1.18  1.27  4.00  4.55

|ExpectedValue - Median| picks out the last two sequences, which have the best gaps (discounting outlier gaps at the extremes), and discards sequences 1, 3 and 4, which are not as good.
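The following is a minimal NumPy sketch (an assumption: the author's implementation works on vertical pTrees, not horizontal arrays) of the two direction heuristics named above; gravy_direction and medev2_direction are illustrative names, not the author's.

```python
# A minimal NumPy sketch of GRAVY (gradient ascent of Var(DPP_d(X))) and MEDEV2.
import numpy as np

def var_dpp(X, d):
    """Variance of the dot-product projection DPP_d(X) = X o d."""
    return (X @ d).var()

def gravy_direction(X, steps=50):
    """GRAVY: gradient ascent of V(d) = d^T A d on the unit sphere,
    where A is the covariance matrix of the columns of X (GRADIENT(V) = 2 A d)."""
    A = np.cov(X, rowvar=False, bias=True)
    d = np.zeros(X.shape[1])
    d[np.argmax(np.diag(A))] = 1.0          # d0 = e_k with a_kk maximal
    for _ in range(steps):
        g = 2.0 * A @ d                     # gradient of the projection variance
        d = g / np.linalg.norm(g)           # renormalize back onto the unit sphere
    return d

def medev2_direction(X):
    """MEDEV2: D = VectorOfMedians(X) - Mean(X), d = D/|D|."""
    D = np.median(X, axis=0) - X.mean(axis=0)
    return D / np.linalg.norm(D)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (70, 4)), rng.normal(5, 1, (30, 4))])
    for name, d in [("GRAVY", gravy_direction(X)), ("MEDEV2", medev2_direction(X))]:
        print(name, "d =", d.round(2), " Var(DPP_d) =", round(var_dpp(X, d), 1))
```

Both heuristics return a unit vector d; the projection X o d is then scanned for gaps exactly as in the slides that follow.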
CONCRETE dataset, GRAVY vs MEDEV2.  [Slide detail: for GRAVY, the successive unit vectors d = (d1,d2,d3,d4) with V(d) at each gradient-ascent step, e.g. an initial akk-based d of (.17, .05, .98, .01) with V(d)=9048 climbing to (-.38, -.14, .79, -.46) with V(d)=11781; for both methods, the F-value histograms ((F-MN)/8 and (F-MN)/4 value/count arrays) of each cluster C1, C2, C11, C23, C231, C232, ... and the gap-cut intervals with their Low/Medium/High class counts, e.g. [0,19) 17L 6M 0H, [19,39) 13L 2M 0H, plus singleton and doubleton outlier intervals.]
Reported here: MEDEV2 accuracy = 123/132 = 93.1%; GRAVY accuracy = 108/137 = 78.8%.
IRIS dataset, GRAVY vs MEDEV2.  [Slide detail: GRAVY gradient-ascent steps per cluster, e.g. for C2 the d-vectors (.84, .18, .51, .06) V(d)=64 -> (.57, .22, .71, .34) V(d)=82 -> (.51, .22, .74, .38) V(d)=83; F-value histograms ((F-MN)*k value/count arrays) for clusters C1, C2, C21, C22, C221, C23, C24, ... and the resulting gap-cut intervals with setosa/versicolor/virginica counts, e.g. [0,23) 50 setosa 1 virginica = C1, [23,61) 50 versicolor 40 virginica = C2, [61,71) 9 virginica = C3.]
Reported here: GRAVY, 16 errors, accuracy 124/150 = 82.7%; MEDEV2, 9 errors, accuracy 141/150 = 94%.
SEEDS dataset, GRAVY vs MEDEV2.  [Slide detail: GRAVY d-vectors and V(d) per step, e.g. (.97, .15, .09, .14) V=0 -> (.00, .07, 1.00, .00) V=4; the 10(F-MN) and 200(F-MN) value/count histograms; and the gap-cut intervals with class counts (k, r, c), e.g. [0,9) 18c = C1, [9,18) 1k 24c = C2, [18,29) 10k 8c = C3, [29,38) 18k = C4, [38,49) 13k 2r = C5, [49,60) 7k 6r = C6, [60,71) 1k 7r = C7, [71,80) 8r = C8, [80,92) 21r = C9, [92,102) 2r = Ca, [102,105) 4r = Cb.]
Reported here: GRAVY accuracy = 141/150 = 94%; MEDEV2 accuracy = 140/150 = 93.3%.
WINE dataset, GRAVY vs MEDEV2.  [Slide detail: F-MN value/count histograms for clusters C1, C11, C12, C121, ... with d-vectors such as (.07, .15, .98, .12) V(d)=588 -> (-.01, .26, .97, .00) V(d)=608, and gap-cut intervals with Low/High class counts, e.g. [0,35) 38L 68H split into C11/C12, [0,18) 29L 46H = C121.]

Summary ACCURACY (%) of the two direction heuristics across the four datasets:

            CONCRETE   IRIS   SEEDS   WINE
  GRAVY       62.7     82.7    94     99.3
  MEDEV2      65.3     94      93.3   99.3
APPENDIX.  Functional Gap Clustering using Fpq(x) = RND[ (x-p)o(q-p)/|q-p| - minF ] on the Spaeth example (p = the average, q = each sample point z1, z2, z3, ... in turn).

The 15 points (x1, x2 coordinates):
  z1=(1,1)   z2=(3,1)   z3=(2,2)   z4=(3,3)   z5=(6,2)   z6=(9,3)   z7=(15,1)  z8=(14,2)
  z9=(15,3)  za=(13,4)  zb=(10,9)  zc=(11,10) zd=(9,11)  ze=(11,11) zf=(7,8)
(The slide also plots these on a 0..f grid, with q = z1 at (1,1) and p marked near the mean.)

The 15 Value_Arrays (one per q = z1, z2, z3, ...):
  z1: 0 1 2 5 6 10 11 12 14        z2: 0 1 2 5 6 10 11 12 14        z3: 0 1 2 5 6 10 11 12 14
  z4: 0 1 3 6 10 11 12 14          z5: 0 1 2 3 5 6 10 11 12 14      z6: 0 1 2 3 7 8 9 10
  z7: 0 1 2 3 4 6 9 11 12          z8: 0 1 2 3 4 6 9 11 12          z9: 0 1 2 3 4 6 7 10 12 13
  za: 0 1 2 3 4 5 7 11 12 13       zb: 0 1 2 3 4 6 8 10 11 12       zc: 0 1 2 3 5 6 7 8 9 11 12 13
  zd: 0 1 2 3 7 8 9 10             ze: 0 1 2 3 5 7 9 11 12 13       zf: 0 1 3 5 6 7 8 9 10 11

The 15 Count_Arrays (same order):
  z1: 2 2 4 1 1 1 1 2 1            z2: 2 2 4 1 1 1 1 2 1            z3: 1 5 2 1 1 1 1 2 1
  z4: 2 4 2 2 1 1 2 1              z5: 2 2 3 1 1 1 1 1 2 1          z6: 2 1 1 1 1 3 3 3
  z7: 1 4 1 3 1 1 1 2 1            z8: 1 2 3 1 3 1 1 2 1            z9: 2 1 1 2 1 3 1 1 2 1
  za: 2 1 1 1 1 1 4 1 1 2          zb: 1 2 1 1 3 2 1 1 1 2          zc: 1 1 1 2 2 1 1 1 1 1 1 2
  zd: 3 3 3 1 1 1 1 2              ze: 1 1 2 1 3 2 1 1 2 1          zf: 1 2 1 1 2 1 2 2 2 1

For Fp=MN,q=z1, the gaps at [F=2, F=5] and [F=6, F=10] cut the Level-0, stride=z1 PointSet (kept as a pTree mask) into three z1_clusters, whose pTree masks z11, z12, z13 are obtained by ORing the F-slice masks on each side of the gaps.

The FAUST algorithm:
1. Project onto each pq line using the dot product with the unit vector from p to q.
2. Generate the ValueArrays (also generate the CountArray and the F-slice mask pTrees).
3. Analyze all gaps and create sub-cluster pTree masks.
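Below is a small sketch of the three FAUST steps just listed, with NumPy boolean masks standing in for pTree masks (an assumption); faust_gap_clusters is an illustrative name.

```python
# Project onto the p->q line, build the Value/Count arrays, cut at every gap.
import numpy as np

def faust_gap_clusters(X, p, q, min_gap=3):
    d = (q - p) / np.linalg.norm(q - p)
    F = np.rint((X - p) @ d).astype(int)                 # F_{p,q}(x), rounded as in the slide
    values, counts = np.unique(F, return_counts=True)    # Value_Array / Count_Array
    clusters, current = [], [values[0]]
    for prev, nxt in zip(values, values[1:]):
        if nxt - prev >= min_gap:                        # a gap: close the current sub-cluster
            clusters.append(current)
            current = []
        current.append(nxt)
    clusters.append(current)
    masks = [np.isin(F, c) for c in clusters]            # one mask per sub-cluster
    return masks, values, counts

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, .5, (5, 2)), rng.normal(8, .5, (5, 2))])
    masks, vals, cts = faust_gap_clusters(X, p=X.mean(0), q=X[0])
    print("Value_Array:", vals, "Count_Array:", cts,
          "cluster sizes:", [int(m.sum()) for m in masks])
```

In the pTree setting the same masks would be produced by ORing the F-slice mask pTrees on each side of every gap.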
Gap Revealer.  Looking for gaps of width >= 2^4 = 16, so compute all pTree combinations down to bit-slices p4 and p4'.  Take d = M - p and F = z o d; the point counts in each width-16 interval are obtained by ANDing the appropriate bit slices p6/p6', p5/p5', p4/p4' of F and taking root counts.

The data Z (x1, x2) and the projection values F = z o d (za = 114 and ze = 125 are given in the interval analysis below):
  z1=(1,1)   F=11     z2=(3,1)   F=27     z3=(2,2)   F=23     z4=(3,3)   F=34     z5=(6,2)   F=53
  z6=(9,3)   F=80     z7=(15,1)  F=118    z8=(14,2)  F=114    z9=(15,3)  F=125    za=(13,4)  F=114
  zb=(10,9)  F=110    zc=(11,10) F=121    zd=(9,11)  F=109    ze=(11,11) F=125    zf=(7,8)   F=83

Interval analysis:
  [000 0000, 000 1111] = [0,16) has 1 point, z1.  This is a 2^4 thinning.  z1 o d = 11 is only 5 units from the right edge, so z1 is not yet declared an outlier.  Next we check the minimum distance from the left edge of the next interval to see whether z1's right-side gap is actually >= 2^4 (this minimum is a pTree computation - no looping over x required!).
  [001 0000, 001 1111] = [16,32): the minimum, z3 o d = 23, is 7 units from the left edge 16, so z1 has only a 5+7 = 12 unit gap on its right (not a 2^4 gap).  So z1 is not declared a 2^4 outlier (it is a 2^4 inlier).
  [010 0000, 010 1111] = [32,48): z4 o d = 34 is within 2 of 32, so z4 is not declared an anomaly.
  [011 0000, 011 1111] = [48,64): z5 o d = 53 is 19 from z4 o d = 34 (> 2^4) but 11 from 64.  But the next interval [64,80) is empty, so z5 is 27 from its right neighbor.  z5 is declared an outlier and we put a sub-cluster cut through z5.
  [100 0000, 100 1111] = [64,80): empty.  This is clearly a 2^4 gap.
  [101 0000, 101 1111] = [80,96): z6 o d = 80, zf o d = 83.  So both of {z6, zf} are declared outliers (a gap of 16 on both sides).
  [110 0000, 110 1111] = [96,112): zb o d = 110, zd o d = 109.
  [111 0000, 111 1111] = [112,128): z7 o d = 118, z8 o d = 114, z9 o d = 125, za o d = 114, zc o d = 121, ze o d = 125.  No 2^4 gaps here.  But we can consult SpS(d^2(x,y)) for the actual distances, which reveals that there are no 2^4 gaps within this sub-cluster.  Incidentally, it also reveals a 5.8 gap between {z7, z8, z9, za} and {zb, zc, zd, ze}, but that analysis is messy and the gap would be revealed by the next round on this sub-cluster anyway.

Pairwise distances used for that check:
  z7-z8 1.4, z7-z9 2.0, z7-za 3.6, z7-zb 9.4, z7-zc 9.8, z7-zd 11.7, z7-ze 10.8,
  z8-z9 1.4, z8-za 2.2, z8-zb 8.1, z8-zc 8.5, z8-zd 10.3, z8-ze 9.5,
  z9-za 2.2, z9-zb 7.8, z9-zc 8.1, z9-zd 10.0, z9-ze 8.9,
  za-zb 5.8, za-zc 6.3, za-zd 8.1, za-ze 7.3,
  zb-zc 1.4, zb-zd 2.2, zb-ze 2.2, zc-zd 2.2, zc-ze 1.0, zd-ze 2.0.
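A simplified horizontal sketch of the same 2^4 gap check follows (an assumption: sorted NumPy values replace the pTree AND/root-count interval arithmetic, and unlike the slide it only flags singleton outliers, not small outlier sets like {z6, zf}).

```python
# Cut wherever consecutive F-values are >= 16 apart; flag values isolated by
# such a gap on both sides.
import numpy as np

def reveal_gaps(F, width=16):
    v = np.unique(np.asarray(F))
    cuts = [(int(v[i]), int(v[i + 1])) for i in np.where(np.diff(v) >= width)[0]]
    outliers = [int(v[i]) for i in range(len(v))
                if (i == 0 or v[i] - v[i - 1] >= width)
                and (i == len(v) - 1 or v[i + 1] - v[i] >= width)]
    return cuts, outliers

if __name__ == "__main__":
    # F = z o d values from this slide (za = 114 and ze = 125 from the prose)
    F = [11, 27, 23, 34, 53, 80, 118, 114, 125, 114, 110, 121, 109, 125, 83]
    print(reveal_gaps(F))   # cuts around F = 53 (z5) and across the empty [84,109) stretch
```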
Barrel Clustering (this method attempts to build barrel-shaped gaps around clusters).  It allows for a better fit around convex clusters that are elongated in one direction (not round): gaps in the dot-product lengths (projections) on the line give the barrel "cap" gap widths, and gaps in the distance from the line give the barrel "radius" gap widths.

Exhaustive search for all barrel gaps takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width):
1. A StartPoint, p (an n-vector, so n-dimensional).
2. A UnitVector, d (an n-direction, so (n-1)-dimensional - a grid on the surface of the unit sphere in R^n).
Then for every choice of (p, d) (e.g., in a grid of points in R^(2n-1)) two functionals are used to enclose sub-clusters in barrel-shaped gaps:
a. the SquareBarrelRadius functional, SBR(y) = (y-p)o(y-p) - ((y-p)od)^2
b. the BarrelLength functional, BL(y) = (y-p)od

Given a p, do we need a full grid of d's (directions)?  No: d and -d give the same BL-gaps.  Given d, do we need a full grid of starting points p?  No: all p' such that p' = p + cd give the same gaps.  So hill-climb the gap width from a good starting point and direction.

MATH: we need the dot-product projection length and the dot-product projection distance.  For a fixed vector f,
  (squared projection distance of y on f) = | y - ((yof)/(fof)) f |^2
                                          = yoy - 2(yof)^2/(fof) + (yof)^2 (fof)/(fof)^2
                                          = yoy - (yof)^2/(fof),
and the dot-product projection length of y on f is (yof)/|f|.  With f = q - p this gives
  (squared y-p on q-p projection distance) = (y-p)o(y-p) - ((y-p)o(q-p))^2 / ((q-p)o(q-p)).
For the dot-product length projections (the caps) we already needed (y-p)o(M-p)/|M-p| = ( yo(M-p) - po(M-p) ) / |M-p|.
That is, we need to compute the constants and the dot-product functionals (color-coded in the slide) in an optimal way, and then do the pTreeSet additions/subtractions/multiplications.  What is optimal?  Minimizing pTreeSet functional creations and pTreeSet operations.
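A small NumPy sketch of the two barrel functionals defined above (not a pTreeSet implementation; the example data are synthetic):

```python
# BL(y) = (y-p) o d  (position along the barrel axis) and
# SBR(y) = (y-p) o (y-p) - ((y-p) o d)^2  (squared distance from the axis).
import numpy as np

def barrel_functionals(Y, p, d):
    d = d / np.linalg.norm(d)                     # make sure d is a unit vector
    Z = Y - p
    BL = Z @ d                                    # BarrelLength (cap) functional
    SBR = np.einsum('ij,ij->i', Z, Z) - BL ** 2   # SquareBarrelRadius functional
    return BL, SBR

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    # one elongated cluster along the x-axis, one offset radially from it
    Y = np.vstack([np.c_[rng.uniform(0, 10, 50), rng.normal(0, .3, 50)],
                   np.c_[rng.uniform(0, 10, 50), rng.normal(5, .3, 50)]])
    BL, SBR = barrel_functionals(Y, p=np.zeros(2), d=np.array([1.0, 0.0]))
    print("BL range:", BL.min().round(1), "to", BL.max().round(1),
          "; SBR values either side of the radial gap:", np.sort(SBR)[48:52].round(1))
```

Gaps in BL give cap cuts; gaps in SBR give radius cuts, which together enclose the barrel.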
Cone Clustering (finding cone-shaped clusters).  Corner points: look for gaps in the dot-product projections onto the corner-point line, restricted to a cosine cone (over some angle) about that line, i.e., F(y) = (y-M)o(x-M)/|x-M| - min, restricted to the points whose cosine with x-M exceeds the cone threshold.  Cosine conical gapping seems quick and easy (cosine = dot product divided by both lengths; the length of the fixed vector x-M is a one-time calculation, while |y-M| changes with y, so build the pTreeSet for it).
[Slide detail: F-value histograms on IRIS for various corner points x and cone widths, with the class each cone picks out, e.g.: x=s1 cone=1/sqrt(2) (50 points); x=s2 cone=1/sqrt(2) (51), cone=.9 (47), cone=.1; x=i1 cone=.707 (75); x=e1 cone=.707 (60); "w maxs" cone=.707 (137), cone=.93 (27/29 are i's), cone=.925 (31/34 are i's); "w maxs-to-mins" cone=.939 (114 points: 14 i and 100 s/e, so it picks out i); "w aaan-aaax" cone=.54 (100/104 s or e, so it picks out i); "w naaa-xaaa" cone=.95 (41/43 e, so it picks e); "w xnnn-nxxx" cone=.95 (43/50 e, so it picks out e).]
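Here is a sketch of the cosine-cone restriction (an assumption: plain NumPy; cone_projection is an illustrative name, and the "max corner" choice below is just one possibility):

```python
# Keep only points y whose cosine with the fixed vector x-M exceeds the cone
# threshold, then look for gaps in F(y) = (y-M) o (x-M)/|x-M| on that cone.
import numpy as np

def cone_projection(Y, M, x, cone=0.707):
    v = (x - M) / np.linalg.norm(x - M)
    Z = Y - M
    F = Z @ v                                           # projection onto the corner-point line
    cosines = F / np.maximum(np.linalg.norm(Z, axis=1), 1e-12)
    keep = cosines >= cone                              # inside the cosine cone
    return F[keep], keep

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    Y = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(6, 1, (50, 4))])
    M = Y.mean(0)
    x = Y[np.argmax(Y.sum(1))]                          # a "max corner" point
    F, keep = cone_projection(Y, M, x, cone=0.9)
    print(int(keep.sum()), "points in the cone; projection histogram:",
          np.histogram(F, bins=8)[0])
```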
FAUST Classifier.  Separate class r from class v using the midpoint of the means: let D ≡ mV - mR and d = D/|D|, and classify with Pr = P(xod < a) and Pv = P(xod >= a), where the cut is set at a = (mR + (mV - mR)/2) o d = ((mR + mV)/2) o d.
Training amounts to choosing the Cut hyperplane, an (n-1)-dimensional hyperplane (which thus cuts the space in two).  Classify with one horizontal program (AND/OR) across the pTrees to get a mask pTree for each class (bulk classification).
Improve accuracy, e.g., by considering the dispersion within the classes:
1. Use the vector_of_medians (vomV ≡ (median(v1), median(v2), ...)) instead of the means; then use the stdev ratio to place the cut.
2. Cut at the midpoint of Max{rod} and Min{vod}.  If there is no gap, move the Cut until r_errors + v_errors is minimized.
3. Hill-climb d to maximize the gap (or to minimize errors when applied to the training set).
4. Replace mR, mV with the averages of the margin points.
5. If round classes are expected, use SDmR < |D|/2 for the r-class and SDmV < |D|/2 for the v-class.
[Slide figure: the r and v points plotted in dim 1 / dim 2 with mR, mV, vomR, vomV, the d-line and the cut at a.]
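A minimal NumPy sketch of the midpoint-of-means variant described above (the pTree bulk classification is replaced by one vectorized comparison; this is an assumption, not the author's pTree code):

```python
import numpy as np

def faust_train(R, V):
    mR, mV = R.mean(0), V.mean(0)
    D = mV - mR
    d = D / np.linalg.norm(D)                 # d = D/|D|
    a = ((mR + mV) / 2) @ d                   # cut at the midpoint of the means
    return d, a

def faust_classify(X, d, a):
    return np.where(X @ d < a, "r", "v")      # P_r: x o d < a,  P_v: x o d >= a

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    R, V = rng.normal(0, 1, (100, 3)), rng.normal(3, 1, (100, 3))
    d, a = faust_train(R, V)
    preds = faust_classify(np.vstack([R, V]), d, a)
    truth = np.array(["r"] * 100 + ["v"] * 100)
    print("training accuracy:", (preds == truth).mean())
```

Improvements 1-5 above only change how d and the cut a are chosen; the classification pass itself stays a single horizontal comparison.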
2/2/13  Datamining Big Data.  Big data: up to trillions of rows (or more) and, possibly, thousands of columns (or many more).  I structure data vertically (pTrees) and process it horizontally.  Looping across thousands of columns can be orders of magnitude faster than looping down trillions of rows, so sometimes a task can be done in human time only if the data is vertically organized.
Data mining is [largely] CLASSIFICATION or PREDICTION (assigning a class label to a row based on a training set of classified rows).  What about clustering and ARM?  They are important and related: roughly, clustering creates/improves training sets, and ARM is used to data mine more complex data (e.g., relationship matrices).  CLASSIFICATION is [largely] case-based reasoning.  To make a decision we typically search our memory for similar situations (near-neighbor cases) and base our decision on the decisions we made in those cases (we do what worked before, for us or for others).  We let near neighbors vote.  "The Magical Number Seven, Plus or Minus Two ... Information"[2] is cited to argue that the number of objects (contexts) an average human can hold in working memory is 7 ± 2.  We can think of classification as providing a better 7 (so it is decision support, not decision making).  One can say that all classification methods (even model-based ones) are a form of near-neighbor classification; e.g., in Decision Tree Induction (DTI) the classes at the bottom of a decision branch ARE the near-neighbor set, due to the fact that the sample arrived at that leaf.
Rows of an entity table (e.g., Iris(SL,SW,PL,PW) or Image(R,G,B)) describe instances of the entity (irises or image pixels).  Columns are descriptive information on the row instances (e.g., Sepal Length, Sepal Width, Petal Length, Petal Width, or Red, Green, Blue photon counts).  If the table consists entirely of real numbers, then the row set can be viewed [as a subset of] a real vector space with dimension = the number of columns.  Then the notion of "near" [in classification and clustering] can be defined using a dissimilarity (~distance) or a similarity: two rows are near if the distance between them is low or their similarity is high.  "Near" for columns can be defined using a correlation (e.g., Pearson's, Spearman's, ...).
If the columns also describe instances of an entity, then the table is really a matrix or relationship between instances of the row entity and the column entity.  Each matrix cell measures some attribute of that relationship pair (the simplest: 1 if that row is related to that column, else 0; the most complex: an entire structure of data describing that pair, i.e., that row instance and that column instance).  In Market Basket Research (MBR), the row entity is customers and the column entity is items; each cell is 1 iff that customer has that item in the basket.  In Netflix Cinematch, the row entity is customers and the column entity is movies, and each cell has the 5-star rating that customer gave to that movie.  In Bioinformatics, the row entity might be experiments and the column entity genes, each cell holding the expression level of that gene in that experiment; or the row and column entities might both be proteins, each cell holding a 1-bit iff the two proteins interact in some way.  In Facebook, the rows might be people and the columns might also be people (and a cell has a one bit iff the row and column persons are friends).  Even when a table appears to be a simple entity table with descriptive feature columns, it may be viewable as a relationship between two entities.
E.g., Image(R,G,B) is a table of pixel instances with columns R, G, B.  The R-values count the photons in a "red" frequency range detected at that pixel over an interval of time.  That red frequency range is determined more by the camera technology than by any scientific definition.  If we had separate CCD cameras that could count photons in each of a million very thin adjacent frequency intervals, we could view the column values of that image as instances of a frequency entity.  Then the image would be a relationship matrix between the pixel and frequency entities.  So an entity table can often be usefully viewed as a relationship matrix.  If so, it can also be rotated, so that the former column entity is viewed as the new row entity and the former row entity is viewed as the new set of descriptive columns.  The bottom line is that we can often do data mining on a table of data in many ways: as an entity table (classification and clustering), as a relationship matrix (ARM), or, upon rotation of that matrix, as another entity table.  For a rotated entity table, the concepts of nearness that can be used also rotate (e.g., the cosine correlation of two columns morphs into the cosine of the angle between two vectors as a row similarity measure).
DBs, DWs are merging as In-memory DBs: SAP® In-Memory Computing Enabling Real-Time Computing SAP® In-Memory enables real-time computing by bringing together online transaction proc. OLTP (DB) and online analytical proc. OLAP (DW). Combining advances in hardware technology with SAP InMemory Computing empowers business – from shop floor to boardroom – by giving real-time bus. proc. instantaneous access to data-eliminating today’s info lag for your business. In-memory computing is already under way. The question isn’t if this revolution will impact businesses but when/ how. In-memory computing won’t be introduced because a co. can afford the technology. It will be because a business cannot afford to allow its competitors to adopt the it first. Total cost is 30% lower than traditional RDBMSs due to: • Leaner hardware, less system capacity req., as mixed workloads of analytics, operations, performance mgmt is in a single system, which also reduces redundant data storage. [[Back to a single DB rather than a DB for TP and a DW for boardroom dec. sup.]] • Less extract transform load (ETL) between systems and fewer prebuilt reports, reducing support required to run sofwr. Report runtime improvements of up to 1000 times. Compression rates of up to a 10 times. Performance improvements expected even higher in SAP apps natively developed for inmemory DBs. Initial results: a reduction of computing time from hours to seconds. However, in-memory computing will not eliminate the need for data warehousing. Real-time reporting will solve old challenges and create new opportunities, but new challenges will arise. SAP HANA 1.0 software supports realtime database access to data from the SAP apps that support OLTP. Formerly, operational reporting functionality was transferred from OLTP applications to a data warehouse. With in-memory computing technology, this functionality is integrated back into the transaction system. Product managers will still look at inventory and point-of-sale data, but in the future they will also receive,eg., tell customers broadcast dissatisfaction with a product over Twitter. Or they might be alerted to a negative product review released online that highlights some unpleasant product features requiring immediate action. From the other side, small businesses running real-time inventory reports will be able to announce to their Facebook and Twitter communities that a high demand product is available, how to order, and where to pick up. Bad movies have been able to enjoy a great opening weekend before crashing 2nd weekend when negative word-of-mouth feedback cools enthusiasm. That week-long grace period is about to disappear for silver screen flops. Consumer feedback won’t take a week, a day, or an hour. The very second showing of a movie could suffer from a noticeable falloff in attendance due to consumer criticism piped instantaneously through the new technologies. It will no longer be good enough to have weekend numbers ready for executives on Monday morning. Executives will run their own reports on revenue, Twitter their reviews, and by Monday morning have acted on their decisions. Adopting in-memory computing results in an uncluttered arch based on a few, tightly aligned core systems enabled by service-oriented architecture (SOA) to provide harmonized, valid metadata and master data across business processes. 
Some of the most salient shifts and trends in future enterprise architectures will be: • A shift to BI self-service apps like data exploration, instead of static report solutions. • Central metadata and masterdata repositories that define the data architecture, allowing data stewards to work across all business units and all platforms Real-time in-memory computing technology will cause a decline Structured Query Language (SQL) satellite databases. The purpose of those databases as flexible, ad hoc, more business-oriented, less IT-static tools might still be required, but their offline status will be a disadvantage and will delay data updates. Some might argue that satellite systems with in-memory computing technology will take over from satellite SQL DBs. SAP Business Explorer tools that use in-memory computing technology represent a paradigm shift. Instead of waiting for IT to work on a long queue of support tickets to create new reports, business users can explore large data sets and define reports on the fly. Here is sample of what in-memory computing can do for you: • Enable mixed workloads of analytics, operations, and performance management in a single software landscape. • Support smarter business decisions by providing increased visibility of very large volumes of business information • Enable users to react to business events more quickly through real-time analysis and reporting of operational data. • Deliver innovative real-time analysis and reporting. • Streamline IT landscape and reduce total cost of ownership. In manufacturing enterprises, in-memory computing tech will connect the shop floor to the boardroom, and the shop floor associate will have instant access to the same data as the board [[shop floor = daily transaction processing. Boardroom = executive data mining]]. The shop floor will then see the results of their actions reflected immediately in the relevant Key Performance Indicators (KPI). SAP BusinessObjects Event Insight software is key. In what used to be called exception reporting, the software deals with huge amounts of realtime data to determine immediate and appropriate action for a real-time situation. The final example is from the utilities industry: The most expensive energy a utilities provides is energy to meet unexpected demand during peak periods of consumption. If the company could analyze trends in power consumption based on real-time meter reads, it could offer – in real time – extra low rates for the week or month if they reduce their consumption during the following few hours. This advantage will become much more dramatic when we switch to electric cars; predictably, those cars are recharged the minute the owners return home from work. Hardware: blade servers and multicore CPUs and memory capacities measured in terabytes. Software: in-memory database with highly compressible row / column storage designed to maximize in-memory comp. tech. [[Both row and column storage! They convert to column-wise storage only for Long-Lived-High-Value data?]] Parallel processing takes place in the database layer rather than in the app layer - as it does in the client-server arch.
IRIS(SL,SW,PL,PW) with DPPMinVec,MaxVec.  [Slide data table: all 150 IRIS samples (s1-s50 setosa, e1-e50 versicolor, i1-i50 virginica) with their four attribute values, the DPPMinVec,MaxVec projection value of each sample, and the bit-slice (pTree) encodings of those values.]
"Gap Hill Climbing": mathematical analysis One way to increase the size of the functional gaps is to hill climb the standard deviation of the functional, F (hoping that a "rotation" of d toward a higher STDev would increase the likelihood that gaps would be larger ( more dispersion allows for more and/or larger gaps). We can also try to grow one particular gap or thinning using support pairs as follows: F-slices are hyperplanes (assuming F=dotd) so it would makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces which is cut by the gap (or thinning), take as p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy since our method produces the pTree mask of each F-slice ordered by increasing F-value (in fact it is the sequence of F-values and the sequence of counts of points that give us those value that we use to find large gaps in the first place.). The d2-gap is much larger than the d1=gap. It is still not the optimal gap though. Would it be better to use a weighted mean (weighted by the distance from the gap - that is weighted by the d-barrel radius (from the center of the gap) on which each point lies?) In this example it seems to make for a larger gap, but what weightings should be used? (e.g., 1/radius2) (zero weighting after the first gap is identical to the previous). Also we really want to identify the Support vector pair of the gap (the pair, one from one side and the other from the other side which are closest together) as p and q (in this case, 9 and a but we were just lucky to draw our vector through them.) We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q??? d1 d1-gap 0 1 2 3 4 5 6 7 8 9 a b c d e f f 1 0 e 2 3 d 4 5 6 c 7 8 b 9 a 9 8 7 6 5 a j k l m n 4 b c q r s 3 d e f o p 2 g h 1 i d1 d1-gap =p q= 0 1 2 3 4 5 6 7 8 9 a b c d e f f 1 e 2 3 d 4 5 6 c 7 8 b 9 a 9 8 7 6 5 a j k 4 b c q 3 d e f 2 1 d2 d2-gap p q d2 d2-gap
HILL CLIMBING GAP WIDTH on IRIS (using the dot-product functional F with p = aaan, q = aaax, and then on sub-clusters).
First pass cuts: CLUS1 < 7 (the 50 setosa); 7 < CLUS2 < 16 (4 virginica, 48 versicolor); CLUS3 > 16 (46 virginica, 2 versicolor).
We then attempt to hill-climb the gap at 16 using the half-space averages (on CLUS2 union CLUS3, p = avg of {F < 16}, q = avg of {F > 16}).  Here the gap between CLUS1 and CLUS2 is made more pronounced (why?), but the thinning between CLUS2 and CLUS3 seems even more obscure.  Although this doesn't prove anything, it is not good news for the method: it did not grow the gap we wanted to grow (between CLUS2 and CLUS3); there is a thinning at 22 and it is the same one, but it is not more prominent.
We also attempt to hill-climb the gap at 16 using the means of the half-space boundaries (i.e., p is the avg of the F = 14 slice, q is the avg of the F = 17 slice, on CL123): no conclusive gaps.
Sparse low-end check [0,9]: the pairwise distances among {i39, e49, e8, e44, e11, e32, e30, e15, e31} show that i39, e49 and e11 are singleton outliers and {e8, e44} is a doubleton outlier set.
Sparse high-end check [38,47]: the pairwise distances among {i31, i8, i36, i10, i6, i23, i32, i18, i19} show that i10, i18, i19, i32 and i36 are singleton outliers and {i6, i23} is a doubleton outlier set.
[The slide also lists the full F-value/count histograms for each pass and the two distance matrices used for the outlier checks.]
CAINE 2013 Call for Papers - 26th International Conference on Computer Applications in Industry and Engineering, September 25-27, 2013, Omni Hotel, Los Angeles, California, USA.  Sponsored by the International Society for Computers and Their Applications (ISCA).  CAINE-2013 will feature contributed papers as well as workshops and special sessions; papers will be accepted into oral presentation sessions.  Topics include, but are not limited to: Agent-Based Systems, Image/Signal Processing, Autonomous Systems, Information Assurance, Big Data Analytics, Information Systems/Databases, Bioinformatics, Biomedical Systems/Engineering, Internet and Web-Based Systems, Computer-Aided Design/Manufacturing, Knowledge-based Systems, Computer Architecture/VLSI, Mobile Computing, Computer Graphics and Animation, Multimedia Applications, Computer Modeling/Simulation, Neural Networks, Computer Security, Pattern Recognition/Computer Vision, Computers in Education, Rough Set and Fuzzy Logic, Computers in Healthcare, Robotics, Computer Networks, Fuzzy Logic Control Systems, Sensor Networks, Data Communication, Scientific Computing, Data Mining, Software Engineering/CASE, Distributed Systems, Visualization, Embedded Systems, Wireless Networks and Communication.  Important dates: workshop/special session proposals May 25, 2013; full paper submission June 5, 2013; notification of acceptance July 5, 2013; pre-registration and camera-ready paper due August 5, 2013; event dates September 25-27, 2013.

SEDE Conference: interested in gathering researchers and professionals in the domains of Software Engineering and Data Engineering to present and discuss high-quality research results and outcomes in their fields.  SEDE 2013 aims at facilitating cross-fertilization of ideas in Software and Data Engineering.  Topics include, but are not limited to: Requirements Engineering for Data Intensive Software Systems; Software Verification and Model Checking; Model-Based Methodologies; Software Quality and Software Metrics; Architecture and Design of Data Intensive Software Systems; Software Testing; Service- and Aspect-Oriented Techniques; Adaptive Software Systems; Information System Development; Software and Data Visualization; Development Tools for Data Intensive Software Systems; Software Processes; Software Project Management; Applications and Case Studies; Engineering Distributed, Parallel, and Peer-to-Peer Databases; Cloud Infrastructure, Mobile, Distributed, and Peer-to-Peer Data Management; Semi-Structured Data and XML Databases; Data Integration, Interoperability, and Metadata; Data Mining: Traditional, Large-Scale, and Parallel; Ubiquitous Data Management and Mobile Databases; Data Privacy and Security; Scientific and Biological Databases and Bioinformatics; Social Networks, Web, and Personal Information Management; Data Grids, Data Warehousing, OLAP; Temporal, Spatial, Sensor, and Multimedia Databases; Taxonomy and Categorization; Pattern Recognition, Clustering, and Classification; Knowledge Management and Ontologies; Query Processing and Optimization; Database Applications and Experiences; Web Data Management and Deep Web.  Important dates: May 23, 2013 paper submission deadline; June 30, 2013 notification of acceptance; July 20, 2013 registration and camera-ready manuscript.  Conference website: http://theory.utdallas.edu/SEDE2013/

ACC-2013 provides an international forum for presentation and discussion of research on a variety of aspects of advanced computing and its applications, and communication and networking systems.
Important dates (ACC-2013): May 5, 2013 - special sessions proposal; June 5, 2013 - full paper submission; July 5, 2013 - author notification; Aug. 5, 2013 - advance registration and camera-ready paper due.
CBR International Workshop on Case-Based Reasoning, CBR-MD 2013, July 19, 2013, New York, USA. Topics of interest include (but are not limited to): CBR for signals, images, video, audio and text; similarity assessment; case representation and case mining; retrieval and indexing; conversational CBR; meta-learning for model improvement and parameter setting for processing with CBR; incremental model improvement by CBR; case-base maintenance for systems; case authoring; lifetime of a CBR system; measuring coverage of case bases; ontology learning with CBR. Submission deadline: March 20, 2013; notification date: April 30, 2013; camera-ready deadline: May 12, 2013.
Workshop on Data Mining in Life Sciences (DMLS): discovery of high-level structures, including e.g. association networks; text mining from biomedical literature; medical image mining; biomedical signal mining; temporal and sequential data mining; mining heterogeneous data; mining data from molecular biology, genomics, proteomics, phylogenetic classification. With regard to different methodologies and case studies: data mining project development methodology for biomedicine; integration of data mining in the clinic; ontology-driven data mining in life sciences; methodology for mining complex data, e.g. a combination of laboratory test results, images, signals, genomic and proteomic samples; data mining for personal disease management; utility considerations in DMLS, including e.g. cost-sensitive learning. Submission deadline: March 20, 2013; notification date: April 30, 2013; camera-ready deadline: May 12, 2013; workshop date: July 19, 2013.
Workshop on Data Mining in Marketing (DMM'2013): in the business environment, data warehousing - the practice of creating huge, central stores of customer data that can be used throughout the enterprise - is becoming more and more common practice, and as a consequence the importance of data mining is growing stronger. Topics: applications in marketing; methods for user profiling; mining insurance data; e-marketing with data mining; logfile analysis; churn management; association rules for marketing applications; online targeting and controlling; behavioral targeting; juridical conditions of e-marketing, online targeting and so on; control of online-marketing activities; new trends in online marketing; aspects of e-mailing activities and newsletter mailing. Submission deadline: March 20, 2013; notification date: April 30, 2013; camera-ready deadline: May 12, 2013; workshop date: July 19, 2013.
Workshop on Data Mining in Agriculture (DMA 2013): data mining on sensor and spatial data from agricultural applications; analysis of remote sensor data; feature selection on agricultural data; evaluation of data mining experiments; spatial autocorrelation in agricultural data. Submission deadline: March 20, 2013; notification date: April 30, 2013; camera-ready deadline: May 12, 2013; workshop date: July 19, 2013.
FAUST Functional-Gap clustering (FAUST = Functional Analytic Unsupervised and Supervised machine Teaching) relies on choosing a distance-dominating functional (a map F to R^1 such that |F(x)-F(y)| <= Dis(x,y) for all x,y), so that any F-gap implies a linear cluster break.
Dot Product Projection: DPP_d(y) = (y-p) o d, where the unit vector d can be obtained as d = (p-q)/|p-q| for points p and q.
Square Distance Functional: SD_p(y) = (y-p) o (y-p).
Coordinate Projection is the simplest DPP: e_j(y) = y_j.
Dot Product Radius: DPR_{p,q}(y) = sqrt( SD_p(y) - DPP_{p,q}(y)^2 ).
Square Dot Product Radius: SDPR_{p,q}(y) = SD_p(y) - DPP_{p,q}(y)^2.
Note: the same DPP_d gaps are revealed by DP_d(y) = y o d, since (y-p) o d = y o d - p o d, so DP just shifts all DPP values by p o d.
Applied to the whole table X (rows x_1,...,x_N; columns X_1,...,X_n), DPP_d(X) is the N-vector X o d = (x_1 o d, ..., x_N o d).
Finding a good unit vector d for the dot-product functional DPP, to maximize gaps, subject to \sum_{i=1..n} d_i^2 = 1.
Method-1: maximize Var(DPP_d(X)) over d. Writing an overbar for the column average,
  Var(DPP_d(X)) = \overline{(X o d)^2} - ( \overline{X o d} )^2
    = (1/N) \sum_{i=1..N} ( \sum_{j=1..n} x_{i,j} d_j )^2  -  ( \sum_{j=1..n} \overline{X_j} d_j )^2
    = (1/N) \sum_{i=1..N} ( \sum_{j} x_{i,j}^2 d_j^2 + 2 \sum_{j<k} x_{i,j} x_{i,k} d_j d_k )  -  ( \sum_{j} \overline{X_j}^2 d_j^2 + 2 \sum_{j<k} \overline{X_j} \overline{X_k} d_j d_k )
    = \sum_{j=1..n} ( \overline{X_j^2} - \overline{X_j}^2 ) d_j^2  +  2 \sum_{j<k} ( \overline{X_j X_k} - \overline{X_j} \overline{X_k} ) d_j d_k
    = d^T o VX o d,   where (VX)_{j,k} = \overline{X_j X_k} - \overline{X_j} \overline{X_k}  (the covariance matrix of X).
So V(d) = Var(DPP_d(X)) = \sum_{j,k} a_{j,k} d_j d_k with A = VX, to be maximized subject to \sum_i d_i^2 = 1.
Alg-1 (heuristic): the unit vector A maximizing Y o A is Y/|Y|; so take D = ( sqrt(\overline{X_1^2} - \overline{X_1}^2), ..., sqrt(\overline{X_n^2} - \overline{X_n}^2) ) and d = D/|D| (handle outliers first?).
Alg-2 (heuristic): find k such that \overline{X_k^2} - \overline{X_k}^2 is maximal and set d_k = 1, d_h = 0 for h != k. (We have already done this: it is just e_k for the column with maximum stdev.)
Alg-3: find d maximizing Var(DPP_d(X)). Viewing VX and the outer product dd^T as n^2-vectors, V = VX o dd^T; over all unit n^2-vectors the maximizer would be VX/|VX|, but dd^T is constrained to be an outer product of a unit vector, so the exact maximizer of d^T VX d subject to |d| = 1 is the principal eigenvector of VX.
Also note |d| = 1 if and only if |dd^T| = 1, since
  |dd^T| = sqrt( \sum_{j=1..n} ( \sum_{i=1..n} d_i^2 ) d_j^2 ) = sqrt( ( \sum_i d_i^2 )( \sum_j d_j^2 ) ) = \sum_i d_i^2 .
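To make the workhorse concrete, here is a minimal NumPy sketch (an illustration under assumed array inputs, not the pTree implementation the notes rely on): it projects X onto d with DPP, takes d as the principal eigenvector of the covariance matrix VX (the exact maximizer of d^T VX d under |d| = 1, standing in for Alg-3), and cuts at the widest consecutive gap of the projected values.

  import numpy as np

  def dpp(X, d, p=None):
      # DPP_d(y) = (y - p) o d applied to every row of X; p defaults to the origin (plain y o d)
      return (X - p if p is not None else X) @ d

  def max_variance_d(X):
      # VX[j,k] = mean(Xj*Xk) - mean(Xj)*mean(Xk); its top eigenvector maximizes Var(DPP_d(X)) over unit d
      VX = np.cov(X, rowvar=False, bias=True)
      vals, vecs = np.linalg.eigh(VX)          # symmetric eigendecomposition, eigenvalues ascending
      return vecs[:, -1]                       # unit eigenvector of the largest eigenvalue

  def widest_gap_cut(F):
      # return (cut value, gap width) for the widest consecutive gap in the projected values
      s = np.sort(F)
      i = int(np.argmax(np.diff(s)))
      return (s[i] + s[i + 1]) / 2.0, s[i + 1] - s[i]

  # toy usage: two well separated blobs split by one linear cut
  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(8, 1, (20, 4))])
  d = max_variance_d(X)
  cut, gap = widest_gap_cut(dpp(X, d))
  in_upper_cluster = dpp(X, d) > cut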
Hierarchical Clustering
[Dendrogram figure over points A-G: the root splits into ABC and DEFG; ABC splits into A and BC; DEFG splits into DE and FG; these then split into the leaves B, C, D, E, F, G.]
Any maximal anti-chain (a maximal set of nodes in which no 2 are directly connected) is a clustering (a dendrogram offers many).
Hierarchical Clustering: But the "horizontal" anti-chains are the clusterings resulting from the top-down (or bottom-up) method(s).
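A small SciPy sketch of the same idea (toy data; the library calls are standard, the data is made up): single-link builds the dendrogram bottom-up, and each horizontal cut of it selects one maximal anti-chain, i.e. one clustering.

  import numpy as np
  from scipy.cluster.hierarchy import linkage, fcluster

  X = np.array([[0.0], [1.0], [1.5], [8.0], [8.3], [15.0]])   # toy points with two wide gaps
  Z = linkage(X, method='single')                       # single-link dendrogram (min pairwise distance)
  labels_k3 = fcluster(Z, t=3, criterion='maxclust')    # horizontal anti-chain with exactly 3 clusters
  labels_h2 = fcluster(Z, t=2.0, criterion='distance')  # cut at height 2: only gaps < 2 get merged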
CONCRETE  GRAVEL  F=(DPP-MN)/4  Concrete(C, W, FA, A)  Accuracy=90%
Round-1 F histogram (value count pairs):
0 1  1 1  5 1  6 1  7 1  8 4  9 1  10 1  11 2  12 1  13 5  14 1  15 3  16 3  17 4  18 1  19 3  20 9  21 4  22 3  23 7  24 2  25 4  26 8  27 7  28 7  29 10  30 3  31 1  32 3  33 6  34 4  35 5  37 2  38 2  40 1  42 3  43 1  44 1  45 1  46 4  49 1  56 1  58 1  61 1  65 1  66 1  69 1  71 1  77 1  80 1  83 1  86 1  100 1  103 1  105 1  108 2  112 1
Round-1 clusters (gap below, F-range, class counts):
  gap=14: CLUS_1 = [90,113)  0L 6M 0H
  gap=6:  CLUS_2 = [74,90)   0L 4M 0H
  gap=7:  CLUS_3 = [52,74)   0L 7M 0H
          CLUS_4 = the remainder below 52  (altogether [0,90) holds 43L 46M 55H)
CLUS 4, round-2 histogram (F=(DPP-MN)/2, F-gap>=2; value count pairs):
0 3  7 4  9 1  10 12  11 8  12 7  15 4  18 10  21 3  22 7  23 2  25 2  26 3  27 1  28 2  29 1  31 3  32 1  34 2  40 4  47 3  52 1  53 3  54 3  55 4  56 2  57 3  58 1  60 2  61 2  62 4  64 4  67 2  68 1  71 7  72 3  79 5  85 1  87 2
CLUS 4 subclusters (F-range: class counts, name, gap, median, avg):
  =0       0L 0M 3H   CLUS 4.4.1  gap=7  Median=0    Avg=0
  =7       0L 0M 4H   CLUS 4.4.2  gap=2  Median=7    Avg=7
  [8,14]   1L 5M 22H  CLUS 4.4.3         Median=11   Avg=10.7  (1L+5M err vs H)
  =15      0L 0M 4H   CLUS 4.3.1  gap=3  Median=15   Avg=15
  =18      0L 0M 10H  CLUS 4.3.2  gap=3  Median=18   Avg=18
  [20,24)  0L 10M 2H  CLUS 4.7.2  gap=2  Median=22   Avg=22    (2H errs in L)
  [24,30)  10L 0M 0H  CLUS 4.7.1  gap=2  Median=26   Avg=26
  [30,33]  0L 4M 0H   CLUS 4.2.1  gap=2  Median=31   Avg=32.3
  =34      0L 2M 0H   CLUS 4.2.2  gap=6  Median=34   Avg=34
  =40      0L 4M 0H   CLUS 4.2.3  gap=7  Median=40   Avg=40
  =47      0L 3M 0H   CLUS 4.2.4  gap=5  Median=47   Avg=47
  [50,59)  12L 1M 4H  CLUS 4.8.1  gap=2  Median=55   Avg=55    (1M+4H errs in L)
  [59,63)  8L 0M 0H   CLUS 4.8.2  gap=2  Median=61.5 Avg=61.3
  =64      2L 0M 2H   CLUS 4.6.1  gap=3  Median=64   Avg=64    (2H errs in L)
  [66,70)  10L 0M 0H  CLUS 4.6.2  gap=3  Median=67   Avg=67.3
  [70,79)  10L 0M 0H  CLUS 4.5    gap=7  Median=71   Avg=71.7
  =79      5L 0M 0H   CLUS 4.1.1  gap=6  Median=79   Avg=79
  [74,90)  2L 0M 1H   CLUS 4.1           Median=87   Avg=86.3  (1M err in L)
[Dendrogram arcs on the slide are labeled with merged-cluster medians (med=9, 10, 14, 17, 18, 21, 23, 33, 34, 40, 56, 57, 61, 62, 71, 86); the drawing itself is omitted here.]
Suppose we know (or want) 3 strength clusters: Low, Medium and High. We can use an anti-chain that gives us exactly 3 subclusters in two ways, one shown in brown and the other in purple on the original slide. Which would we choose? The brown seems to give slightly more uniform subcluster sizes. Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133=72% accurate. Purple error count: Low 2, Medium 22, High 35, so 74/133=56% accurate.
What about agglomerating using single-link agglomeration (minimum pairwise distance)? Agglomerate (build the dendrogram) by iteratively gluing together the clusters with minimum median separation. Should I have normalized the rounds, i.e., used the same F-divisor and made sure the range of values in the 2nd round matched the 1st round (on CLUS 4)? Can I normalize after the fact by multiplying 1st-round values by 100/88=1.76? Or agglomerate the 1st-round clusters and then independently agglomerate the 2nd-round clusters?
At this level, FinalClus1={17M}, 0 errors. C1 C2 C3 C4. CONCRETE
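The brown/purple accuracy figures above are plain majority-class error counts over the chosen anti-chain. A tiny sketch of that bookkeeping (the per-subcluster compositions below are placeholders, not the slide's actual counts):

  # each chosen subcluster is labeled with its majority class (L, M or H);
  # everything else in that subcluster counts as an error
  subclusters = [(40, 3, 2), (5, 38, 4), (1, 5, 35)]   # placeholder (L, M, H) compositions
  total = sum(sum(c) for c in subclusters)
  errors = sum(sum(c) - max(c) for c in subclusters)
  accuracy = (total - errors) / total
  # the slide's brown counts work the same way: errors 11 + 0 + 26 over 133 points -> 96/133 = 72%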
GRAVEL  Agglomerating using single link (minimum pairwise distance = minimum gap size: glue the minimum-gap adjacent clusters first).  Accuracy=90%
(Same CLUS 4 round-2 histogram, F=(DPP-MN)/2 with F-gap>=2, and the same CLUS 4.x.y subcluster breakdown as on the previous slide.)
The first thing we notice is that outliers mess up agglomerations which are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers by backing away from all gap>=5 agglomerations, then looking for 3-subcluster maximal anti-chains. What we have done is to declare F<7 and F>84 as extreme tripleton outlier sets, and F=79, F=40 and F=47 as singleton outlier sets, because they are F-gapped by at least 5 (which is actually 10) on either side.
The brown anti-chain gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133=80% accurate. The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) caused lots of error. C4.7.1 and C4.7.2 are problematic since they separate out, but in increasing F-order the class pattern is H M L M L, so if we suspected this pattern we would look for 5 subclusters. The 5 orange errors in increasing F-order are 6, 2, 0, 0, 8, so 127/133=95% accurate.
If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of H M L M L is just bizarre! So we should expect errors. CONCRETE
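On one F-axis, single-link agglomeration reduces to repeatedly gluing the two adjacent clusters separated by the smallest F-gap until the desired number of clusters remains. A minimal sketch under that reading (the function name and toy values are illustrative):

  import numpy as np

  def min_gap_agglomerate(F, k):
      # split sorted F-values into singletons, then merge smallest-gap neighbors until k clusters remain
      s = np.sort(np.asarray(F, dtype=float))
      clusters = [[v] for v in s]
      while len(clusters) > k:
          gaps = [clusters[i + 1][0] - clusters[i][-1] for i in range(len(clusters) - 1)]
          i = int(np.argmin(gaps))              # smallest separation between adjacent clusters
          clusters[i] = clusters[i] + clusters.pop(i + 1)
      return clusters

  # e.g. three clusters recovered from a toy F-value list with two dominant gaps
  parts = min_gap_agglomerate([0, 1, 2, 9, 10, 11, 30, 31], k=3)

Backing away from any merge across a gap >= 5, as the slide suggests for outlier removal, would just be an extra condition on the argmin step.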
MEDEV2 / GRAVY on WINE.  We will do round-2 on C1, but first analyze the thinnings, TC1k, in C1. We also note that F=13 has all 5 TC12 errors, though there is no gap or thinning that separates [13,14) out (i.e., F fully linearizes L and H). We note that a cut at 12 clusters with 140/150=99.3% accuracy; however, again, there is no algorithm that would cut at 12 (not even a thinning), so we apply a 2nd round to C1.
GRAVY iteration, round 1 (akk = 70 156 1005 123):
  d1    d2    d3    d4    V(d)
  .07   .15   .98   .12    622
  .07   .35   .93   .04    653
  starting with (0,0,1,0):
  .00   .00   1.00  .00    567
  .07   .35   .93   .04    653
F-MN Ct gp2 histogram (first functional):
0 1 1 1 4 1 2 4 1 3 5 1 4 3 1 5 1 1 6 7 1 7 8 1 8 7 1 9 2 1 10 5 1 11 4 1 12 5 1 13 6 1 14 1 1 15 4 1 16 2 1 17 2 2 19 9 1 20 3 1 21 1 1 22 4 1 23 4 1 24 5 1 25 1 1 26 2 1 27 1 2 29 2 1 30 1 1 31 1 1 32 1 3 35 1 1 36 1 1 37 3 1 38 1 1 39 1 1 40 1 1 41 1 2 43 6 1 44 3 1 45 1 1 46 2 1 47 1 1 48 1 3 51 1 1 52 1 4 56 1 4 60 1 3 63 1 2 65 2 2 67 1 5 72 1 2 74 1 9 83 1 1 84 1 1 85 1 1 86 1 1 87 1 1 88 2 11 99 1 6 105 1 8 113 1 6 119 1
F-MN Ct gp2 histogram (second functional):
0 1 1 1 4 1 2 4 1 3 5 1 4 4 1 5 7 1 6 7 1 7 8 1 8 2 1 9 5 1 10 4 1 11 5 1 12 7 1 13 3 1 14 3 2 16 2 1 17 9 1 18 3 1 19 1 1 20 4 1 21 4 1 22 5 1 23 1 1 24 2 1 25 1 2 27 2 1 28 1 1 29 1 1 30 1 1 31 1 2 33 1 1 34 3 1 35 1 1 36 2 1 37 1 2 39 6 1 40 4 1 41 2 1 42 1 1 43 1 2 45 1 2 47 1 3 50 1 4 54 1 2 56 1 2 58 2 2 60 1 4 64 1 1 65 1 9 74 1 1 75 1 1 76 1 1 77 2 1 78 1 1 79 1 10 89 1 6 95 1 7 102 1 5 107 1
Round-1 thinnings and cuts (F-range: class counts):
  [0,5) 17L 0H TC13;  [5,14) 40L 5H TC12;  [5,13) 39L 0H;  [13,14) 1L 5H;  [14,19) 0L 9H TC11;  [0,19) 57L 14H C1.
  [0,12) 56L 0H;  [12,16) 1L 12H;  [0,16) 57L 12H C1;  [16,118) 0L 81H C2.
Fine clusters above the C1 cut:
  [19,29) 0L 30H C2;  [29,35) 0L 5H C3;  [35,43) 0L 9H C4;  [43,51) 0L 14H C5;  [51,56) 0L 2H C6;  [56,60) 0L 1H C7;  [60,63) 0L 1H C8;  [63,72) 0L 4H C9;  [72,83) 0L 2H Ca;  [83,99) 0L 7H Cb;  [99,105) 0L 1H Cc;  [105,113) 0L 1H Cd;  [113,119) 0L 1H Ce;  [119,120) 0L 1H Cf.
Round-2 on C1, GRAVY iteration (akk = 46 32 209 101):
  d1    d2    d3    d4    V(d)
  .19   .14   .87   .42     18
  .18   .38   .90   .14     20
C1: 6*(F-MN) Ct gp6 histogram:
0 1 6 6 3 1 7 1 5 12 4 6 18 1 2 20 3 1 21 1 5 26 1 1 27 2 2 29 1 6 35 7 5 40 5 1 41 2 5 46 1 2 48 1 1 49 6 5 54 2 6 60 5 5 65 4 6 71 2 2 73 3 6 79 4 2 81 2 3 84 1 5 89 3 6 95 3 10 105 2
  Cuts: [0,3) 1L 0H C18;  [3,18) 8L 0H C17;  [18,35) 24L 0H C16;  [35,60) 24L 0H C15;  [60,71) 9L 0H C14;  [71,79) 5L 0H C13;  [79,95) 1L 9H C12;  [95,105) 0L 5H C11.
C1: 8*(F-MN) gp8 histogram:
0 1 9 9 2 1 10 2 8 18 4 8 26 4 1 27 1 8 35 4 8 43 7 8 51 5 1 52 2 7 59 6 1 60 2 7 67 1 1 68 1 7 75 4 1 76 1 7 83 4 8 91 5 8 99 7 7 106 3 8 114 3
  Cuts: [0,99) 56L 0H C11;  [99,115) 1L 12H C12.
ACCURACY   CONCRETE  IRIS  SEEDS  WINE
GRAVEL       78.8     82.7   94    99.3
MEDEX2       93.1     94     93.3  99.3
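The GRAVY iteration tables above (start at the coordinate vector of the largest diagonal entry a_kk, then move d along the gradient of V(d) = d^T A d and renormalize, watching V(d) climb) amount to power iteration on the covariance matrix. A hedged NumPy sketch of that loop (the step count, tolerance and interface are assumptions, not taken from the notes):

  import numpy as np

  def gravy_iterate(X, steps=20, tol=1e-6):
      A = np.cov(X, rowvar=False, bias=True)           # A[j,k] = mean(Xj*Xk) - mean(Xj)*mean(Xk)
      d = np.eye(A.shape[0])[np.argmax(np.diag(A))]    # d0 = e_k with a_kk maximal, e.g. (0,0,1,0)
      v = d @ A @ d                                    # V(d0)
      for _ in range(steps):
          d_new = A @ d                                # gradient direction of V (up to the factor 2)
          d_new /= np.linalg.norm(d_new)               # renormalize to the unit sphere
          v_new = d_new @ A @ d_new
          if abs(v_new - v) < tol:                     # stop when V(d) no longer improves
              break
          d, v = d_new, v_new
      return d, v                                      # unit d and the variance V(d) it achieves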