1 $R = U^T \circ I$, $\hat{r}_{u,i} = u \circ i = \sum_{f=1}^{F} r_{u,f}\, r_{f,i}$
Simon Funk: Netflix provided a database of 100M ratings (1 to 5) of 17K movies by 500K users, each as a triplet of numbers: (User, Movie, Rating). The challenge: for (User, Movie, ?) not in the database, predict how the given User would rate the given Movie.

Think of the data as a big, sparsely filled matrix, with userIDs across the top and movieIDs down the side (or vice versa, then transpose everything). Each cell contains an observed rating (1-5) for that movie (row) by that user (column), or is blank, meaning you don't know. This matrix would have 8.5B entries, but you are given values for only 1/85th of those 8.5B cells (100M of them). The rest are all blank.

Netflix posed a "quiz" of a bunch of question marks plopped into previously blank slots, and your job is to fill in best-guess ratings in their place. Squared error (se) measures accuracy: if your guess is 1.5 and the actual rating is 2, you get docked (2-1.5)^2 = 0.25. (They use root mean squared error (rmse), but rmse and mse are monotonically related.) There is a date for both ratings and question marks, so a cell can potentially have more than one rating in it.

Any movie can be described in terms of some aspects or attributes such as overall quality, action (y/n?), comedy (y/n?), stars, producer, etc. Every user's preferences can be roughly described in terms of whether they tend to rate quality/action/comedy/star/producer/etc. high or low. If true, then ratings ought to be explainable by a lot less than 8.5 billion numbers (e.g., a single number specifying how much action a particular movie has may help explain why a few million action-buffs like that movie).

SVD assumes rating(u,m) is a sum of preferences about the various aspects. E.g., take 40 aspects: a movie, m, is described by 40 values, m(f), saying how much that movie exemplifies each aspect, and a user is described by 40 values, u(f), saying how much they prefer each aspect. A rating(u,m) = u(f) dot m(f). That takes 40*(17K+500K) values =~ 20M << 8.5B.
or: ratingsMatrix[user][movie] = sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40 (or 1 to F in general)

$R = U^T \circ I$, $r_{u,i} = u \circ i = \sum_{f=1}^{F} r_{u,f}\, r_{f,i}$

[Diagram: U^T (rows u1..uTestSizeU, columns f1..fF, entry r_{u,f}) o I (rows f1..fF, columns i1..iTestSizeI, entry r_{f,i}) = R (rows u, columns i, entry r_{u,i}).]

The original matrix has been decomposed into 2 oblong matrices: a 17Kx40 movie aspect matrix and a 500Kx40 user preference matrix. SVD is a trick for finding the 2 smaller matrices which minimize the resulting approximation error, specifically the mean squared error. So, if we take the rank-40 SVD of the 8.5B matrix, we have the best (least error) approximation we can get within the limits of our user-movie-rating model. I.e., the SVD has found the "best" generalizations.

To compute it, take the derivative of the approximation error and follow it. This has the bonus that we can ignore the unknown error on the 8.4B empty slots. Take the derivative of the equations for the error (just the given values, not the empties) with respect to the parameters:

userValue[user] += lrate*err*movieValue[movie];
movieValue[movie] += lrate*err*userValue[user];

With horizontal data, the code is evaluated for each rating. So, to train for one sample:

real *userValue = userFeature[featureBeingTrained];
real *movieValue = movieFeature[featureBeingTrained];
real lrate = 0.001;

More correctly, cache the old user value so that both updates use pre-update values (err here is taken to be pre-multiplied by lrate):

r_{u,f}: uv = userValue[user]; userValue[user] += err * movieValue[movie];
r_{f,i}: movieValue[movie] += err * uv;

This finds the most prominent feature remaining (the one that most reduces error). When it's good, shift it onto the done features and start a new one (cache the residuals of the 100M ratings. "What does that mean for us???"). This gradient descent has no local minima, which means it doesn't really matter how it's initialized.

$u \mathrel{+}= \mathrm{lrate}\,(\epsilon_{u,i}\, i^T - \lambda\, u)$, where $\epsilon_{u,i} = r_{u,i} - \hat{r}_{u,i}$ and $r_{u,i}$ = actual rating

2 Refinements: Prior to starting SVD, note AvgRating(movie) for every movie and AvgOffset(UserRating, MovieAvgRating) for every user. I.e.:

static inline real predictRating_Baseline(int movie, int user) { return averageRating[movie] + averageOffset[user]; }

So, that's the return value of predictRating before the first SVD feature even starts training.

You'd think the avg rating for a movie would just be... its average rating! Alas, Occam's razor was a little rusty that day. If m only appears once, with r(m,u)=1 say, is AvgRating(m)=1? Probably not! View r(m,u)=1 as a draw from a true probability distribution whose average you want, and view that true average itself as a draw from a probability distribution of averages: the histogram of average movie ratings. Assume both distributions are Gaussian; then the best-guess mean should be a linear combination of the observed mean and the a priori mean, with a blending ratio equal to the ratio of variances.

If Ra and Va are the mean and variance (squared standard deviation) of all of the movies' average ratings (which defines your prior expectation for a new movie's average rating before you've observed any actual ratings) and Vb is the average variance of individual movie ratings (which tells you how indicative each new observation is of the true mean; e.g., if the average variance is low, then ratings tend to be near the movie's true mean, whereas if the avg variance is high, ratings tend to be more random and less indicative), then:

BogusMean = sum(ObservedRatings)/count(ObservedRatings)
K = Vb/Va
BetterMean = [GlobalAverage*K + sum(ObservedRatings)] / [K + count(ObservedRatings)]

The point here is simply that any time you're averaging a small number of examples, the true average is most likely nearer the a priori average than the sparsely observed average. Note that if the number of observed ratings for a particular movie is zero, the BetterMean (best guess) above defaults to the global average movie rating, as one would expect.

Moving on: 20M free params is a lot for a 100M TrainSet.
It seems neat to just ignore all the blanks, but we have expectations about them. As-is, this modified SVD algorithm tends to make a mess of sparsely observed movies or users. Say you have a user who has rated only 1 movie, American Beauty=2, while the movie's avg is 4.5, and further that the user's offset is only -1. Prior to SVD, we'd expect them to rate it 3.5, so the error given to the SVD is -1.5 (the true rating is 1.5 less than we expect). Suppose m(Action) is training up to measure the amount of Action, say .01 for American Beauty (just slightly more than avg). SVD optimizes predictions, which it can do by eventually setting our user's preference for Action to a huge negative value (about -150, since -150 * .01 = -1.5). I.e., the alg naively looks at the only example it has of this user's preferences and, in the context of only the one feature it knows about so far (Action), determines that our user so hates action movies that even the tiniest bit of action in American Beauty makes it suck a lot more than it otherwise might. This is not a problem for users we have lots of observations for, because those random apparent correlations average out and the true trends dominate.

We need to account for priors. As with the average movie ratings, we want to blend our sparse observations in with some sort of prior, but it's a little less clear how to do that with this incremental algorithm. But if you look at where the incremental algorithm theoretically converges, you get:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2)]

The numerator there will fall in a roughly zero-mean Gaussian distribution when charted over all users, which, through various gyrations, leads to:

userValue[user] = [sum residual[user,movie]*movieValue[movie]] / [sum (movieValue[movie]^2 + K)]

And finally back to:

userValue[user] += lrate * (err * movieValue[movie] - K * userValue[user]);
movieValue[movie] += lrate * (err * userValue[user] - K * movieValue[movie]);

This is equivalent to penalizing the magnitude of the features. It cuts overfitting, allowing the use of more features.

3 Moving on: Linear models are limiting
We've bastardized the whole matrix analogy so much that we aren't really restricted to linear models: we can add non-linear outputs such that, instead of predicting with:

sum (userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

we can use:

sum G(userFeature[f][user] * movieFeature[f][movie]) for f from 1 to 40

Two choices for G proved useful.

1. Clip the prediction to 1-5 after each component is added. I.e., each feature is limited to only swaying the rating within the valid range, and any excess beyond that is lost rather than carried over. So, if the first feature suggests +10 on a scale of 1-5, and the second feature suggests -1, then instead of getting a 5 for the final clipped score, it gets a 4, because the score was clipped after each stage. The intuitive rationale here is that we tend to reserve the top of our scale for the perfect movie, and the bottom for one with no redeeming qualities whatsoever, and so there's a sort of measuring back from the edges that we do with each aspect independently. More pragmatically, since the target range has a known limit, clipping is guaranteed to improve our performance, and having trained a stage with clipping on, you should use it with clipping on. I did not really play with this extensively enough to determine there wasn't a better strategy.

2. Introduce some functional non-linearity such as a sigmoid, i.e., G(x) = sigmoid(x). Even if G is fixed, this requires modifying the learning rule slightly to include the slope of G, but that's straightforward.

The next question is how to adapt G to the data. I tried a couple of options, including an adaptive sigmoid, but the most general, and the one that worked the best, was to simply fit a piecewise linear approximation to the true output/output curve. That is, if you plot the true output of a given stage vs the average target output, the linear model assumes this is a nice 45 degree line. But in truth, for the first feature for instance, you end up with a kink around the origin such that the impact of negative values is greater than the impact of positive ones. That is, for two groups of users with opposite preferences, each side tends to penalize more strongly than the other side rewards for the same quality. Or put another way, below-average quality (subjective) hurts more than above-average quality helps. There is also a bit of a sigmoid to the natural data beyond just what is accounted for by the clipping. The linear model can't account for these, so it just finds a middle compromise; but even at this compromise, the inherent non-linearity shows through in an actual-output vs. average-target-output plot, and if G is then simply set to fit this, the model can further adapt with this new performance edge, which leads to potentially more beneficial non-linearity, and so on. This introduces new free parameters and encourages overfitting, especially for the later features, which tend to represent small groups. We found it beneficial to use this non-linearity only for the first twenty or so features and to disable it after that.

Moving on: despite the regularization term in the final incremental law above, overfitting remains a problem. Plotting the progress over time, the probe rmse eventually turns upward and starts getting worse (even though the training error is still inching down). We found that simply choosing a fixed number of training epochs appropriate to the learning rate and regularization constant resulted in the best overall performance. I think for the numbers mentioned above it was about 120 epochs per feature, at which point the feature was considered done and we moved on to the next before it started overfitting. Note that now it does matter how you initialize the vectors: since we're stopping the path before it gets to the (common) end, where we started will affect where we are at that point.

I wonder if a better regularization couldn't eliminate overfitting altogether, something like Dirichlet priors in an EM approach, but I tried that and a few others and none worked as well as the above.

[Figure: probe and training rmse for the first few features, with and without the regularization term ("decay") enabled.]
[Figure: the same, just the probe set rmse, further along, where you can see the regularized version pulling ahead.]
[Figure: probe rmse (vertical) against train rmse (horizontal). Note how the regularized version has better probe performance relative to the training performance.]

Anyway, that's about it. I've tried a few other ideas over the last couple of weeks, including a couple of ways of using the date information, and while many of them have worked well up front, none held their advantage long enough to actually improve the final result. If you notice any obvious errors or have reasonably quick suggestions for better notation or whatnot to make this explanation more clear, let me know. And of course, I'd love to hear what y'all are doing and how well it's working, whether it's improvements to the above or something completely different. Whatever you're willing to share,

4 // SVD Sample Code (C) 2007 Timely Development
//=======================================================
// STANDARD DISCLAIMER:
// - THIS CODE AND INFORMATION IS PROVIDED "AS IS" WITHOUT WARRANTY
// - OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING BUT NOT
// - LIMITED TO THE IMPLIED WARRANTIES OF MERCHANTABILITY AND/OR
// - FITNESS FOR A PARTICULAR PURPOSE.
//=======================================================
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <stdio.h>
#include <math.h>
#include <tchar.h>
#include <map>
using namespace std;

//=======================================================
// Constants and Type Declarations
#define TRAINING_PATH   L"C:\\netflix\\training_set\\*.txt"
#define TRAINING_FILE   L"C:\\netflix\\training_set\\%s"
#define FEATURE_FILE    L"C:\\netflix\\features.txt"
#define TEST_PATH       L"C:\\netflix\\%s"
#define PREDICTION_FILE L"C:\\netflix\\prediction.txt"
#define MAX_RATINGS      // Ratings in entire training set (+1)
#define MAX_CUSTOMERS    // Customers in entire training set (+1)
#define MAX_MOVIES       // Movies in entire training set (+1)
#define MAX_FEATURES     // Number of features to use
#define MIN_EPOCHS       // Minimum number of epochs per feature
#define MAX_EPOCHS       // Max epochs per feature
#define MIN_IMPROVEMENT  // Minimum improvement required to continue current feature
#define INIT             // Initialization value for features
#define LRATE            // Learning rate parameter
#define K                // Regularization parameter used to minimize over-fitting

typedef unsigned char BYTE;
typedef map<int, int> IdMap;
typedef IdMap::iterator IdItr;

struct Movie
{
    int    RatingCount;
    int    RatingSum;
    double RatingAvg;
    double PseudoAvg;   // Weighted average used to deal with small movie counts
};

struct Customer
{
    int CustomerId;
    int RatingCount;
    int RatingSum;
};

struct Data
{
    int   CustId;
    short MovieId;
    BYTE  Rating;
    float Cache;
};

class Engine
{
private:
    int      m_nRatingCount;                               // Current number of loaded ratings
    Data     m_aRatings[MAX_RATINGS];                      // Array of ratings data
    Movie    m_aMovies[MAX_MOVIES];                        // Array of movie metrics
    Customer m_aCustomers[MAX_CUSTOMERS];                  // Array of customer metrics
    float    m_aMovieFeatures[MAX_FEATURES][MAX_MOVIES];   // Array of features by movie
    float    m_aCustFeatures[MAX_FEATURES][MAX_CUSTOMERS]; // Array of features by customer
    IdMap    m_mCustIds;                                   // Map for one time translation of ids to compact array index

    inline double PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing=true);
    inline double PredictRating(short movieId, int custId);

    bool ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut);
    bool ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue);
    bool ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue);

public:
    Engine(void);
    ~Engine(void) { };

    void CalcMetrics();
    void CalcFeatures();
    void LoadHistory();
    void ProcessTest(wchar_t* pwzFile);
    void ProcessFile(wchar_t* pwzFile);
};

//=======================================================
// Program Main
int _tmain(int argc, _TCHAR* argv[])
{
    Engine* engine = new Engine();

    engine->LoadHistory();
    engine->CalcMetrics();
    engine->CalcFeatures();
    engine->ProcessTest(L"qualifying.txt");

    wprintf(L"\nDone\n");
    getchar();

    return 0;
}

5 //=====================================
// Engine Class
//=====================================
// Initialization
Engine::Engine(void)
{
    m_nRatingCount = 0;

    for (int f=0; f<MAX_FEATURES; f++)
    {
        for (int i=0; i<MAX_MOVIES; i++) m_aMovieFeatures[f][i] = (float)INIT;
        for (int i=0; i<MAX_CUSTOMERS; i++) m_aCustFeatures[f][i] = (float)INIT;
    }
}

//
// Calculations - This Paragraph contains all of the relevant code
//
// CalcMetrics
// - Loop through the history and pre-calculate metrics used in the training
// - Also re-number the customer id's to fit in a fixed array
void Engine::CalcMetrics()
{
    int i, cid;
    IdItr itr;

    wprintf(L"\nCalculating intermediate metrics\n");

    // Process each row in the training set
    for (i=0; i<m_nRatingCount; i++)
    {
        Data* rating = m_aRatings + i;

        // Increment movie stats
        m_aMovies[rating->MovieId].RatingCount++;
        m_aMovies[rating->MovieId].RatingSum += rating->Rating;

        // Add customers (using a map to re-number id's to array indexes)
        itr = m_mCustIds.find(rating->CustId);
        if (itr == m_mCustIds.end())
        {
            cid = 1 + (int)m_mCustIds.size();

            // Reserve new id and add lookup
            m_mCustIds[rating->CustId] = cid;

            // Store off old sparse id for later
            m_aCustomers[cid].CustomerId = rating->CustId;

            // Init vars to zero
            m_aCustomers[cid].RatingCount = 0;
            m_aCustomers[cid].RatingSum = 0;
        }
        else
        {
            cid = itr->second;
        }

        // Swap sparse id for compact one
        rating->CustId = cid;

        m_aCustomers[cid].RatingCount++;
        m_aCustomers[cid].RatingSum += rating->Rating;
    }

    // Do a follow-up loop to calc movie averages
    for (i=0; i<MAX_MOVIES; i++)
    {
        Movie* movie = m_aMovies+i;
        movie->RatingAvg = movie->RatingSum / (1.0 * movie->RatingCount);
        // BetterMean with GlobalAverage = 3.23 and K = 25
        movie->PseudoAvg = (3.23 * 25 + movie->RatingSum) / (25.0 + movie->RatingCount);
    }
}

// CalcFeatures - Iteratively train each feature on entire data set
// Once sufficient progress has been made, move on
void Engine::CalcFeatures()
{
    int f, e, i, custId, cnt = 0;
    Data* rating;
    double err, p, sq, rmse_last, rmse = 2.0;
    short movieId;
    float cf, mf;

    for (f=0; f<MAX_FEATURES; f++)
    {
        wprintf(L"\n--- Calculating feature: %d ---\n", f);

        // Keep looping until you have passed a minimum number
        // of epochs or have stopped making significant progress
        for (e=0; (e < MIN_EPOCHS) || (rmse <= rmse_last - MIN_IMPROVEMENT); e++)
        {
            cnt++;
            sq = 0;
            rmse_last = rmse;

            for (i=0; i<m_nRatingCount; i++)
            {
                rating = m_aRatings + i;
                movieId = rating->MovieId;
                custId = rating->CustId;

                // Predict rating and calc error
                p = PredictRating(movieId, custId, f, rating->Cache, true);
                err = (1.0 * rating->Rating - p);
                sq += err*err;

                // Cache off old feature values
                cf = m_aCustFeatures[f][custId];
                mf = m_aMovieFeatures[f][movieId];

                // Cross-train the features
                m_aCustFeatures[f][custId] += (float)(LRATE * (err * mf - K * cf));
                m_aMovieFeatures[f][movieId] += (float)(LRATE * (err * cf - K * mf));
            }

            rmse = sqrt(sq/m_nRatingCount);
            wprintf(L"     <set x='%d' y='%f' />\n",cnt,rmse);
        }

6 // Cache off old predictions
        for (i=0; i<m_nRatingCount; i++)
        {
            rating = m_aRatings + i;
            rating->Cache = (float)PredictRating(rating->MovieId, rating->CustId, f, rating->Cache, false);
        }
    }
}

// PredictRating - During training there is no need to loop through all of the features
// - Use a cache for the leading features and do a quick calculation for the trailing
// - The trailing can be optionally removed when calculating a new cache value
double Engine::PredictRating(short movieId, int custId, int feature, float cache, bool bTrailing)
{
    // Get cached value for old features or default to an average
    double sum = (cache > 0) ? cache : 1; //m_aMovies[movieId].PseudoAvg;

    // Add contribution of current feature
    sum += m_aMovieFeatures[feature][movieId] * m_aCustFeatures[feature][custId];
    if (sum > 5) sum = 5;
    if (sum < 1) sum = 1;

    // Add up trailing defaults values
    if (bTrailing) sum += (MAX_FEATURES-feature-1) * (INIT * INIT);

    return sum;
}

// PredictRating - This version is used for calculating the final results
// - It loops through the entire list of finished features
double Engine::PredictRating(short movieId, int custId)
{
    double sum = 1; //m_aMovies[movieId].PseudoAvg;

    for (int f=0; f<MAX_FEATURES; f++)
    {
        sum += m_aMovieFeatures[f][movieId] * m_aCustFeatures[f][custId];
    }

    return sum;
}

//
// Data Loading / Saving
//
// LoadHistory
// - Loop through all of the files in the training directory
void Engine::LoadHistory()
{
    WIN32_FIND_DATA FindFileData;
    HANDLE hFind;
    bool bContinue = true;
    int count = 0; // TEST

    // Loop through all of the files in the training directory
    hFind = FindFirstFile(TRAINING_PATH, &FindFileData);
    if (hFind == INVALID_HANDLE_VALUE) return;

    while (bContinue)
    {
        this->ProcessFile(FindFileData.cFileName);
        bContinue = (FindNextFile(hFind, &FindFileData) != 0);

        //if (++count > 999) break; // TEST: Uncomment to only test with the first X movies
    }

    FindClose(hFind);
}

// ProcessFile - Load a history file in the format:
//   <MovieId>:
//   <CustomerId>,<Rating>
//   <CustomerId>,<Rating>
//   ...
void Engine::ProcessFile(wchar_t* pwzFile)
{
    FILE *stream;
    wchar_t pwzBuffer[1000];
    wsprintf(pwzBuffer,TRAINING_FILE,pwzFile);
    int custId, movieId, rating, pos = 0;

    wprintf(L"Processing file: %s\n", pwzBuffer);

    if (_wfopen_s(&stream, pwzBuffer, L"r") != 0) return;

    // First line is the movie id
    fgetws(pwzBuffer, 1000, stream);
    ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);
    m_aMovies[movieId].RatingCount = 0;
    m_aMovies[movieId].RatingSum = 0;

    // Get all remaining rows
    fgetws(pwzBuffer, 1000, stream);
    while ( !feof( stream ) )
    {
        pos = 0;
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
        ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, rating);

        m_aRatings[m_nRatingCount].MovieId = (short)movieId;
        m_aRatings[m_nRatingCount].CustId = custId;
        m_aRatings[m_nRatingCount].Rating = (BYTE)rating;
        m_aRatings[m_nRatingCount].Cache = 0;
        m_nRatingCount++;

        // Read the next row
        fgetws(pwzBuffer, 1000, stream);
    }

    // Cleanup
    fclose( stream );
}

// ProcessTest - Load a sample set in the following format
//   <Movie1Id>:
//   <CustomerId>
//   <CustomerId>
//   ...
//   <Movie2Id>:
//   <CustomerId>
// - And write results:
//   <Movie1Id>:
//   <Rating>
//   <Rating>
//   ...
void Engine::ProcessTest(wchar_t* pwzFile)
{
    FILE *streamIn, *streamOut;
    wchar_t pwzBuffer[1000];
    int custId, movieId, pos = 0;
    double rating;
    bool bMovieRow;

    wsprintf(pwzBuffer, TEST_PATH, pwzFile);

7 // ProcessTest (continued)
    wprintf(L"\n\nProcessing test: %s\n", pwzBuffer);

    if (_wfopen_s(&streamIn, pwzBuffer, L"r") != 0) return;
    if (_wfopen_s(&streamOut, PREDICTION_FILE, L"w") != 0) return;

    fgetws(pwzBuffer, 1000, streamIn);
    while ( !feof( streamIn ) )
    {
        // A row containing ':' (character code 58) is a movie row
        bMovieRow = false;
        for (int i=0; i<(int)wcslen(pwzBuffer); i++)
        {
            bMovieRow |= (pwzBuffer[i] == 58);
        }

        pos = 0;
        if (bMovieRow)
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, movieId);

            // Write same row to results
            fputws(pwzBuffer,streamOut);
        }
        else
        {
            ParseInt(pwzBuffer, (int)wcslen(pwzBuffer), pos, custId);
            custId = m_mCustIds[custId];
            rating = PredictRating(movieId, custId);

            // Write predicted value
            swprintf(pwzBuffer,1000,L"%5.3f\n",rating);
            fputws(pwzBuffer,streamOut);
        }

        //wprintf(L"Got Line: %d %d %d \n", movieId, custId, rating);
        fgetws(pwzBuffer, 1000, streamIn);
    }

    // Cleanup
    fclose( streamIn );
    fclose( streamOut );
}

//
// Helper Functions
//
bool Engine::ReadNumber(wchar_t* pwzBufferIn, int nLength, int &nPosition, wchar_t* pwzBufferOut)
{
    int count = 0;
    int start = nPosition;
    wchar_t wc = 0;

    // Find start of number (a digit '0'..'9' or a minus sign)
    while (start < nLength)
    {
        wc = pwzBufferIn[start];
        if ((wc >= 48 && wc <= 57) || (wc == 45)) break;
        start++;
    }

    // Copy each character into the output buffer (digits, 'E', 'e', '-', '.')
    nPosition = start;
    while (nPosition < nLength && ((wc >= 48 && wc <= 57) || wc == 69 || wc == 101 || wc == 45 || wc == 46))
    {
        pwzBufferOut[count++] = wc;
        wc = pwzBufferIn[++nPosition];
    }

    // Null terminate and return
    pwzBufferOut[count] = 0;
    return (count > 0);
}

bool Engine::ParseFloat(wchar_t* pwzBuffer, int nLength, int &nPosition, float& fValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    fValue = (bResult) ? (float)_wtof(pwzNumber) : 0;
    return bResult;
}

bool Engine::ParseInt(wchar_t* pwzBuffer, int nLength, int &nPosition, int& nValue)
{
    wchar_t pwzNumber[20];
    bool bResult = ReadNumber(pwzBuffer, nLength, nPosition, pwzNumber);
    nValue = (bResult) ? _wtoi(pwzNumber) : 0;
    return bResult;
}

8 Maximizing the Variance

How do we use this theory? For Dot Product Gap based Clustering, we can hill-climb (starting from the maximal a_kk, below) to a d that gives us the global maximum variance. Heuristically, higher variance means more prominent gaps.

Given any table, X(X_1, ..., X_n), with rows x_1, ..., x_N, and any unit vector, d, in n-space, let X o d = F_d(X) = DPP_d(X) be the column of dot products (x_1 o d, x_2 o d, ..., x_N o d). Then

$V(d) \equiv \mathrm{Var}(X \circ d) = \overline{(X \circ d)^2} - \left(\overline{X \circ d}\right)^2$
$= \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j} d_j\Big)^2 - \Big(\sum_{j=1}^{n} \overline{X_j}\, d_j\Big)^2$
$= \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j} d_j\Big)\Big(\sum_{k} x_{i,k} d_k\Big) - \Big(\sum_{j} \overline{X_j}\, d_j\Big)\Big(\sum_{k} \overline{X_k}\, d_k\Big)$
$= \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}^2 d_j^2 + 2\sum_{j<k} x_{i,j} x_{i,k} d_j d_k\Big) - \Big(\sum_{j} \overline{X_j}^2 d_j^2 + 2\sum_{j<k} \overline{X_j}\,\overline{X_k}\, d_j d_k\Big)$
$= \sum_{j=1}^{n} \big(\overline{X_j^2} - \overline{X_j}^2\big) d_j^2 + 2\sum_{j<k} \big(\overline{X_j X_k} - \overline{X_j}\,\overline{X_k}\big) d_j d_k$

subject to $\sum_{i=1}^{n} d_i^2 = 1$.

With $a_{ij} = \overline{X_i X_j} - \overline{X_i}\,\overline{X_j}$ and $A = (a_{ij})$:

$V(d) = d^T \circ A \circ d = \sum_{i,j} a_{ij} d_i d_j$

We can separate out the diagonal or not:

$V(d) = \sum_{j} a_{jj} d_j^2 + \sum_{j \ne k} a_{jk} d_j d_k$

The gradient is $\nabla V(d) = 2A \circ d$, whose k-th component is $2a_{kk} d_k + 2\sum_{j \ne k} a_{kj} d_j$.

For any d_0, one can hill-climb it to locally maximize the variance, V, as follows: $d_1 \equiv \nabla V(d_0)$, $d_2 \equiv \nabla V(d_1)$, ...

Ubhaya Theorem 1: there exists k in {1,...,n} s.t. d=e_k will hill-climb V to its global maximum.
Theorem 2 (working on it): let d=e_k s.t. a_kk is a maximal diagonal element of A; then d=e_k will hill-climb V to its global maximum.

For Dot Product Gap based Classification, we can start with X = the table of the C Training Set Class Means, M_1, M_2, ..., M_C, where M_k ≡ MeanVectorOfClass_k. Then $\overline{X_i}$ = Mean(X)_i and $\overline{X_i X_j}$ = Mean(M_{i1} M_{j1}, ..., M_{iC} M_{jC}). These computations are O(C) (C = number of classes) and are instantaneous. Once we have the matrix A, we can hill-climb to obtain a d that maximizes the variance of the dot product projections of the class means.

FAUST Classifier MVDI (Maximized Variance Definite Indefinite): build a decision tree.
1. Each round, find the d that maximizes the variance of the dot product projections of the class means.
2. Apply DI each round.

FAUST technology relies on:
1. a distance dominating functional, F;
2. use of gaps in range(F) to separate.

For Unsupervised (Clustering): Hierarchical Divisive? Piecewise Linear? Other? Perf Anal (which approach is best for which type of table?)
For Supervised (Classification): Decision Tree? Nearest Nbr? Piecewise Linear? Perf Anal (which is best for the training set?)

White papers: Terabyte Head Wall. The Only Good Data is Data in Motion. Multilevel pTrees: k=0,1 suffices!

A PTreeSet is defined by specifying a table, an array of stride_lengths (usually equi-length, so just that one length is specified) and a stride_predicate (a T/F condition on a stride; a stride is a bag [or array?] of bits). So the metadata of PTreeSet(T,sl,sp) specifies T, sl and sp. A "raw" PTreeSet has sl=1 and the identity predicate (sl and sp not used). A "cooked" PTreeSet (AKA Level-1 PTreeSet) is one for a table with sl ≠ 1 (main purpose: provide compact summary information on the table). Let PTS(T) be a raw PTreeSet; then it, plus PTS(T,64,p), ..., PTS(T,64^k,p), form a tree of vertical summarizations of T. Note that P(T, 64*64, p) is different from P(P(T,64,p), 64, p), but both make sense, since P(T, 64, p) is a table and P(P(T, 64, p), 64, p) is just a cooked pTree on it.
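The variance hill-climb on this slide can be sketched in C++ as follows. This is a minimal illustration under assumptions the slide does not state: each gradient 2 A o d is rescaled to unit length before the next step (so d stays a unit vector), the climb stops when V(d) no longer increases, and the names `covariance`, `variance`, and `hillClimb` are hypothetical. With that normalization the iteration is the same as power iteration on A.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;

// A with a_jk = mean(X_j * X_k) - mean(X_j) * mean(X_k); X has N >= 1 rows.
inline Mat covariance(const Mat& X) {
    const std::size_t N = X.size(), n = X[0].size();
    Vec mean(n, 0.0);
    Mat A(n, Vec(n, 0.0));
    for (const Vec& row : X)
        for (std::size_t j = 0; j < n; ++j) {
            mean[j] += row[j] / N;
            for (std::size_t k = 0; k < n; ++k) A[j][k] += row[j] * row[k] / N;
        }
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t k = 0; k < n; ++k) A[j][k] -= mean[j] * mean[k];
    return A;
}

// V(d) = d^T A d
inline double variance(const Mat& A, const Vec& d) {
    double v = 0.0;
    for (std::size_t j = 0; j < d.size(); ++j)
        for (std::size_t k = 0; k < d.size(); ++k) v += A[j][k] * d[j] * d[k];
    return v;
}

// Start at e_k with maximal a_kk, then repeat d <- grad / |grad|, grad = 2 A d.
inline Vec hillClimb(const Mat& A, int maxSteps = 1000, double tol = 1e-12) {
    const std::size_t n = A.size();
    std::size_t best = 0;
    for (std::size_t k = 1; k < n; ++k)
        if (A[k][k] > A[best][best]) best = k;
    Vec d(n, 0.0);
    d[best] = 1.0;
    double v = variance(A, d);
    for (int step = 0; step < maxSteps; ++step) {
        Vec g(n, 0.0);
        double norm2 = 0.0;
        for (std::size_t j = 0; j < n; ++j) {
            for (std::size_t k = 0; k < n; ++k) g[j] += 2.0 * A[j][k] * d[k];
            norm2 += g[j] * g[j];
        }
        if (norm2 == 0.0) break;  // zero gradient: nothing to climb
        for (double& x : g) x /= std::sqrt(norm2);
        const double vNew = variance(A, g);
        d = g;
        if (vNew - v <= tol) break;  // no further improvement
        v = vNew;
    }
    return d;
}
```

For a table whose rows all lie on the line x_1 = x_2, the climb ends at d close to (0.707, 0.707), the direction along which the projections are most spread out.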

9 FAUST MVDI on IRIS

15 records from each class were held out for testing (Virg39 was removed as an outlier). Class codes: s = setosa, e = versicolor, i = virginica.

Round 0, d0 = (.33, -.1, .86, .38):
  Definite: (-1, 16.5) s, where 16.5 = avg{23, 10}; (16.5, 38) e; (48, 128) i, iCt=39
  Indefinite: [38, 48] se_i, seCt=26, iCt=13

Round 1 (on the indefinite set), d1 = (-.55, -.33, .51, .57):
  Definite: (-1, 8) e; (10, 128) i, Ct=9
  Indefinite: [8, 10] e_i, eCt=5, iCt=4

In this case, since the indefinite interval is so narrow, we absorb it into the two definite intervals, resulting in the decision tree:

  x o d0 < 16.5: Setosa
  16.5 ≤ x o d0 < 38: Versicolor
  48 < x o d0: Virginica
  38 ≤ x o d0 ≤ 48: use d1 = (-.55, -.33, .51, .57)
      x o d1 < 9: Versicolor
      x o d1 ≥ 9: Virginica

10 FAUST MVDI on SatLog: 413 train, 4 attributes, 6 classes, 127 test

[Tables (values not recoverable): Gradient Hill Climb of Variance(d), with columns d1 d2 d3 d4 V(d), and the resulting F[a,b) class intervals, with columns Ct, min, max, max+1, built using class means and using full data (much better!), for the training subsets t25, t257, t75, t13, t143, t156161, t146156 and t127.]

Surviving entries:
- Gradient Hill Climb of Var(d) on t257: same using class means or training subset.
- d=(-.66, .19, .47, .56); d=(-.81, .17, .45, .33); d=(-.01, -.19, .7, .69)
- t156161: inconclusive both ways, so predict plurality=4(17) (3ct=3, tct=6)
- t146156: inconclusive both ways, so predict plurality=4(17) (7ct=15, 2ct=2)
- t127: inconclusive, predict plurality=7(62); 4(15), 1(5), 2(8), 5(7)

On the 127 sample SatLog TestSet: 4 errors, or 96.8% accuracy.

Speed? With horizontal data, DTI is applied one unclassified sample at a time (per execution thread). With this pTree Decision Tree, we take the entire TestSet (a PTreeSet), create the various dot product SPTS (one for each inode), and create cut SPTS masks. These masks mask the results for the entire TestSet.

For WINE: awful results!

11 FAUST MVDI Concrete Seeds 7 test errors / 30 = 77%
xod0<320 Class=m (test:1/1) d1= xod0>=634 Class=l (test:1/1) 7 test errors / 30 = 77% For Concrete min max+1 train l m h Test l ****** m ****** h ****** 321 l m h 0 l ***** m ***** h 92 ***** xod2<28 Class= l or m d3= d2= xod2>=92 Class=m (test:2/2) d4 = xod3<544 Cl=m *test 0/0) xod2>=662 Cl=h (test:11/12) xod3<969 Cl=l *test 6/9) xod3>=868 Cl=m (test:1/1) xod4<640 Cl=l *test 2/2) xod4>=681 Cl=l (test:0/3) d0 l m h Seeds d3 l m h 0 l ******* m ******* h ******* 8 test errors / 32 = 75% d1 l m h xod<13.2 Class=h errs:0/5) xod>=19.3 Class=m errs0/1) l m h xod<13.2 Class=h errs:0/5) xod>=18.6 Class=m errs0/4) d2 l m h 1 l ****** m ****** h 662 ****** Class=h errs:0/1) l m h Class=m errs0/0) Class=l errs:0/4) Class=m errs8/12)

12 FAUST Classifier

Two classes, R and V, with means mR and mV. P_R = P_{x o d < a}, P_V = P_{x o d ≥ a}.

0. Cut in the middle of the means: D ≡ mR→mV, d = D/|D|, a = (mR + (mV-mR)/2) o d = ((mR+mV)/2) o d.
1. Cut in the middle of the Vectors Of Medians (VOM), not the means. Use the stdev ratio, not the middle, for even better cut placement?
2. Cut in the middle of {Max{R o d}, Min{V o d}} (assuming mR o d ≤ mV o d). If there is no gap, move the cut to minimize R-errors + V-errors.
3. Hill-climb d to maximize the gap, or to minimize training set errors, or (simplest) to minimize dis(max{r o d}, min{v o d}).
4. Replace mR, mV with the avg of the margin points?
5. If Min{V o d} ≥ Max{R o d}, then CutR = CutV = avg{Min{V o d}, Max{R o d}}; else CutR ≡ Min{V o d}, CutV ≡ Max{R o d}. Then P_R = P_{x o d < CutR} and P_V = P_{x o d > CutV}: y in P_R or y in P_V are definite classifications; else re-do on the Indefinite region, P_{CutR ≤ x o d ≤ CutV}, until there is an actual gap (AND with a certain stop condition? E.g., "On the nth round, use definite only (cut at midpt(mR,mV))").

Another way to view FAUST DI is that it is a Decision Tree Method. With each non-empty indefinite set, descend down the tree to a new level. For each definite set, terminate the descent and make the classification.

Each round, it may be advisable to go through an outlier removal process on each class before setting Min{V o d} and Max{R o d} (e.g., iteratively check whether F^-1(Min{V o d}) consists of V-outliers).

[Figure: scatter of r and v points in dimensions dim 1 and dim 2, showing mR, mV, vomR, vomV, the d-line with cut point a, the d2-line, Max{R o d} and Min{V o d}.]
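Step 0 above (cut at the midpoint of the two class means) can be sketched in C++ as follows. This is an illustration only: the names `Point`, `meanOf`, `dot`, `MidpointCut`, `fitMidpointCut`, and `isV` are hypothetical, and the sketch assumes both classes are non-empty and mR differs from mV.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Point = std::vector<double>;

inline double dot(const Point& a, const Point& b) {
    double s = 0.0;
    for (std::size_t j = 0; j < a.size(); ++j) s += a[j] * b[j];
    return s;
}

// Mean vector of a non-empty set of points.
inline Point meanOf(const std::vector<Point>& pts) {
    Point m(pts[0].size(), 0.0);
    for (const Point& p : pts)
        for (std::size_t j = 0; j < m.size(); ++j) m[j] += p[j] / pts.size();
    return m;
}

struct MidpointCut {
    Point d;   // unit vector from mR toward mV
    double a;  // cut value: ((mR + mV) / 2) o d
};

inline MidpointCut fitMidpointCut(const std::vector<Point>& R,
                                  const std::vector<Point>& V) {
    const Point mR = meanOf(R), mV = meanOf(V);
    Point d(mR.size()), mid(mR.size());
    double norm2 = 0.0;
    for (std::size_t j = 0; j < d.size(); ++j) {
        d[j] = mV[j] - mR[j];
        mid[j] = 0.5 * (mR[j] + mV[j]);
        norm2 += d[j] * d[j];
    }
    for (double& x : d) x /= std::sqrt(norm2);  // assumes mR != mV
    return {d, dot(mid, d)};
}

// true -> class V (x o d >= a), false -> class R (x o d < a)
inline bool isV(const MidpointCut& c, const Point& x) {
    return dot(x, c.d) >= c.a;
}
```

With R = {(0,0), (1,0)} and V = {(4,0), (5,0)}, the fit gives d = (1, 0) and a = 2.5, so only the first coordinate decides the class. Steps 1 to 5 refine where the cut is placed; they are not covered by this sketch.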

13 FAUST DI

Given a K-class training set, TK, and a given d (e.g., from D ≡ Mean(TK)→Med(TK)):
Let m_i ≡ mean(C_i), ordered s.t. d o m_1 ≤ d o m_2 ≤ ... ≤ d o m_K
Mn_i ≡ Min{d o C_i}, Mx_i ≡ Max{d o C_i}
Mn_{>i} ≡ Min_{j>i}{Mn_j}, Mx_{<i} ≡ Max_{j<i}{Mx_j}
Definite_i = ( Mx_{<i}, Mn_{>i} )
Indefinite_{i,i+1} = [ Mn_{>i}, Mx_{<i+1} ]
Then recurse on each Indefinite.

For IRIS, 15 records were extracted from each class for testing. The rest are the Training Set, TK. Regions below are listed in increasing order of F, each with its lower bound where the slide gives one.

Training:
1st round, D = Mean_s→Mean_e:
  setosa (35 seto)
  F > 18: versicolor (15 vers)
  F ≥ 37: IndefiniteSet (20 vers, 10 virg)
  F > 48: virginica (25 virg)
IndefSet2 round, D = Mean_e→Mean_i:
  versicolor (17 vers, 0 virg)
  F ≥ 7: IndefSet3 (3 vers, 5 virg)
  F > 10: virginica (0 vers, 5 virg)
IndefSet3 round, D = Mean_e→Mean_i:
  versicolor (2 vers, 0 virg)
  F ≥ 3: IndefSet4 (2 vers, 1 virg). Here we will assign 0 ≤ F ≤ 7 versicolor
  F > 7: virginica (0 vers, 3 virg)

Test:
1st round, D = Mean_s→Mean_e:
  setosa (15 seto)
  F > 15: versicolor (0 vers, 0 virg)
  F ≥ 15: IndefiniteSet (15 vers, 1 virg)
  F > 41: virginica
IndefSet2 round, D = Mean_e→Mean_i:
  versicolor (15 vers, 0 virg)
  F > 20: virginica (0 vers, 1 virg)
100% accuracy.

Options for the sequence of D's:
Option-1: Mean(Class_k)→Mean(Class_{k+1}) for each k (and Mean could be replaced by VOM or?)
Option-2: Mean(Class_k)→Mean(∪_{h=k+1..n} Class_h) for each k (and Mean could be replaced by VOM or?)
Option-3: Mean(Class_k)→Mean(∪_{h not used yet} Class_h), where k is the class with max count in the subcluster (VoM instead?)
Option-2: Mean(Class_k)→Mean(∪_{h=k+1..n} Class_h) (VOM?), where k is the class with max count in the subcluster.
Option-4: always pick the means pair which are furthest separated from each other.
Option-5: start with Median-to-Mean of the IndefiniteSet, then the means pair corresponding to the max separation of F(mean_i), F(mean_j).
Option-6: always use Median-to-Mean of the IndefiniteSet, IS (initially, IS=X).

14 FAUST DI sequential For SEEDS 15 records were extracted from each Class for Testing. Option-4, means pair most separated in X. m d(m1,m2) DEFINITE INDEFINITE m d(m1,m3) inf 0 m d(m2,m3)  F  106, inf so totally non-productive! Option-6: D Median-to-Mean of IndefSet (initially IS=X) m meanF1 DEFINITE Cl= INDEFINITE m meanF2 def3[ -inf 21) m `2.0 meanF3 def1[ ) ind1[ ) On whole TR def2[ 58 inf) ind2[ ) m avF1 DEFINITE INDEFINITE def3[ -inf 0 ) m avF3 def1[ 37 inf ) in11[ ) On Indef-1 Cl= Cls1 outlier(F=54) m avF1 DEFINITE INDEFINITE def3[ -inf 0 ) m avF3 def1[ 13 inf ) in11[ ) On Indef-11 Cl= Cls1 outlier (F=29) m avF1 DEFINITE INDEFINITE def3[ -inf 9 ) m avF3 def1[ 19 inf ) in111[ ) On Indef-111 Cl= Cls3 outlier (F=0) m avF1 DEFINITE INDEFINITE def3[ -inf 9 ) m avF3 def1[ 19 inf ) in1111[ ) On Indef-1111 Cl= done! declare Class=1

15 FAUST DI sequential For SEEDS 15 records were extracted from each Class for Testing. Option-6: D Median-to-Mean of X m meanF1 DEFINITE Cl= INDEFINITE m meanF2 def3[ -inf 21) m `2.0 meanF3 def1[ ) ind31[ ) On whole TR def2[ 58 inf) ind12[ ) m avF1 DEFINITE INDEFINITE m avF3 def1[-inf 18 ) def3[ 55 inf ) in1313[ ) Cl= D Mean(loF)-to-Mean(hiF) of IndefSet31 m avF1 DEFINITE INDEFINITE m avF3 def3[ -inf 10 ) def1[ 20 inf ) in313131[ ) Cl= D Mean(loF)-to-Mean(hiF) of IndefSet1313 m avF1 DEFINITE INDEFINITE m avF3 def1[ -inf 0 ) def3[ 5 inf ) C1= [ ) Cl= The rest, Class=1 D Mean(loF)-to-Mean(hiF) of IndefSet (d repeats after this so=C1 m avF1 DEFINITE INDEFINITE m avF2 def1[ -inf 2 ) def2[ 15 inf ) in1212[ ) Cl= D Mean(loF)-to-Mean(hiF) of IndefSet12 [-inf, 21)class= [28, 49)class= [58.inf) class=3 d=(.,9, -,1, -.2, -.2) [21,28)ind31 d=(-.9, -.1, .14, -.1) [49, 58)ind12 d=(0, .31, -.9, 0) [-inf,18)def [49, 58)ind23

16 Maximize variance - is it wise?
FAUST CLUSTERING: use DPPd(x), but which unit vector, d*, provides the best gap(s)?

1. DPPd exhaustively searches a grid of d's for the best gap provider.
2. Use some heuristic to choose a good d?
   GV: Gradient-optimized Variance.
   MM: Use the d that maximizes |Median(F(X)) - Mean(F(X))|. We have Avg as a function of d. Median? (Can you do it?)
   HMM: Use a heuristic for Median(F(X)): F(VectorOfMedians) = VOM o d.
   MVM: Use D = MEAN(X)→VOM(X), d = D/|D|.

Variance of the projection. Write X o d = Fd(X) = DPPd(X) = (x1 o d, x2 o d, ..., xN o d), and let an overbar denote the average over the N points:

$$
\begin{aligned}
V(d) \equiv \mathrm{Var}\,DPP_d(X) &= \overline{(X\circ d)^2} - \left(\overline{X\circ d}\right)^2 \\
&= \frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{j=1}^{n} x_{i,j}d_j\Big)^2 - \Big(\sum_{j=1}^{n}\overline{X_j}\,d_j\Big)^2 \\
&= \frac{1}{N}\sum_{i}\Big(\sum_{j} x_{i,j}^2 d_j^2 + 2\sum_{j<k} x_{i,j}x_{i,k}d_jd_k\Big) - \Big(\sum_{j}\overline{X_j}^2 d_j^2 + 2\sum_{j<k}\overline{X_j}\,\overline{X_k}\,d_jd_k\Big) \\
&= \sum_{j=1}^{n}\big(\overline{X_j^2}-\overline{X_j}^2\big)d_j^2 + 2\sum_{j<k}\big(\overline{X_jX_k}-\overline{X_j}\,\overline{X_k}\big)d_jd_k \\
&= \sum_{i,j} a_{ij}d_id_j = d^T A\, d, \qquad a_{ij} \equiv \overline{X_iX_j}-\overline{X_i}\,\overline{X_j},
\end{aligned}
$$

subject to $\sum_{i=1}^{n} d_i^2 = 1$.

Gradient: $\nabla V(d) = 2A \circ d$, whose j-th component is $2a_{jj}d_j + 2\sum_{k\ne j} a_{jk}d_k$.

Hill-climb: start at d0 = e_k such that a_kk is max (or d0 with components d0_k = a_kk), then d1 ≡ ∇V(d0), d2 ≡ ∇V(d1), ... til F(dk) ...

Maximize variance - is it wise?
[Table: for each candidate F sequence, the columns median, std, variance, Avg consecutive differences (avgCD), maxCD and |mean-VOM|.]
MEDIAN picks out the last 2 sequences, which have the best gaps (discounting outlier gaps at the extremes), and it discards 1, 3 and 4, which are not so good.

Finding a good unit vector, d, for the Dot Product functional, DPP, to maximize gaps: maximize wrt d |Mean(DPPd(X)) - Median(DPPd(X))| subject to $\sum_i d_i^2 = 1$, where

$$\mathrm{Mean}(DPP_d X) = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{n} x_{i,j}d_j = \sum_{j=1}^{n}\Big(\frac{1}{N}\sum_{i=1}^{N} x_{i,j}\Big)d_j = \sum_{j=1}^{n}\overline{X_j}\,d_j$$

Compute Median(DPPd(X))? Want to use only pTree processing. Want a formula in d and numbers only, like the one above for the mean, which involves only the vector d and the numbers $\overline{X_1},\dots,\overline{X_n}$.
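The GV (Gradient-optimized Variance) idea can be illustrated with a small pure-Python sketch. It assumes that each step replaces d by the unitized gradient 2Ad (the slide does not spell out the normalization, but the constraint Σ d_i² = 1 requires one), and it stops when the variance no longer increases. The function names, tolerance and stopping rule are choices made for this example. With this normalization the update is the same as power iteration on A.

```python
import math

def covariance_matrix(X):
    """a_jk = mean(X_j X_k) - mean(X_j) mean(X_k), as in the derivation above."""
    N, n = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / N for j in range(n)]
    return [[sum(row[j] * row[k] for row in X) / N - mean[j] * mean[k]
             for k in range(n)] for j in range(n)]

def variance_along(A, d):
    """V(d) = sum_jk a_jk d_j d_k."""
    n = len(d)
    return sum(A[j][k] * d[j] * d[k] for j in range(n) for k in range(n))

def gv_hill_climb(X, tol=1e-12, max_iter=1000):
    A = covariance_matrix(X)
    n = len(A)
    # Start at e_k where a_kk is the largest diagonal entry.
    start = max(range(n), key=lambda j: A[j][j])
    d = [1.0 if j == start else 0.0 for j in range(n)]
    v = variance_along(A, d)
    for _ in range(max_iter):
        grad = [2.0 * sum(A[j][k] * d[k] for k in range(n)) for j in range(n)]
        norm = math.sqrt(sum(g * g for g in grad))
        if norm == 0.0:
            break
        d_new = [g / norm for g in grad]   # unitize to keep d o d = 1 (assumption)
        v_new = variance_along(A, d_new)
        if v_new <= v + tol:               # no further improvement
            break
        d, v = d_new, v_new
    return d, v

# Points on the diagonal: the climb should turn d toward (1, 1)/sqrt(2).
d_best, v_best = gv_hill_climb([[0, 0], [1, 1], [2, 2], [3, 3]])
```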

17 FAUST Clustering, simple example: Gd(x)=xod Fd(x)=Gd(x)-MinG on a dataset of 15 image points
[Figure: the point set X = {z1, ..., zf} (15 image points with coordinates x1, x2), the 15 Value_Arrays and the 15 Count_Arrays (one for each q = z1, z2, z3, ...). Level0, stride=z1 PointSet (as a pTree mask). With Fp=MN, q=z1, the gaps found are [F=2, F=5] and [F=6, F=10]. The pTree masks of the 3 z1_clusters (z11, z12, z13) are obtained by ORing.]
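As a rough illustration of the value-array and count-array idea, the sketch below computes Fd(x) = x o d - Min, tabulates the distinct F values with their counts, and reports the gaps between consecutive distinct values. It uses ordinary Python lists rather than pTrees, and the function names, the rounding of F to integers and the `min_gap` threshold are assumptions for the example.

```python
def dot(x, d):
    return sum(a * b for a, b in zip(x, d))

def f_values(X, d):
    """F_d(x) = G_d(x) - min G, with G_d(x) = x o d, rounded to integers."""
    g = [dot(x, d) for x in X]
    mn = min(g)
    return [round(v - mn) for v in g]

def count_array(F):
    """Sorted (value, count) pairs: the value array with its count array."""
    counts = {}
    for v in F:
        counts[v] = counts.get(v, 0) + 1
    return sorted(counts.items())

def find_gaps(F, min_gap):
    """Pairs (lo, hi) of consecutive distinct F values at least min_gap apart."""
    values = sorted(set(F))
    return [(lo, hi) for lo, hi in zip(values, values[1:]) if hi - lo >= min_gap]

# Toy points, projected onto the first coordinate.
F = f_values([[1, 1], [2, 1], [3, 1], [9, 1], [10, 1]], [1, 0])
gaps = find_gaps(F, 3)
```

Each reported gap is a candidate cut: points with F at or below `lo` fall in one subcluster and points with F at or above `hi` in the next.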

18 What is the DPPd FAUST CLUSTER algorithm?
What have we learned? What is the DPPd FAUST CLUSTER algorithm?

[Figure: X split into SubCluster1 and X2=SubCluster2.]

D=Median→Mean, d1≡D/|D| is a good start, but first Variance-Gradient hill-climb it. (Median means Vector of Medians.)

For X2=SubCluster2, use a d2 which is perpendicular to d1? In high dimensions there are many perpendicular directions. One option is to GV hill-climb d2=D2/|D2| (D2=MedianX2-MeanX2) constrained to be perpendicular to d1, i.e., constrained to d2 o d1=0 (in addition to d2 o d2=1). However, we may not want to constrain this second hill-climb to unit vectors perpendicular to d1: the gap might get wider using a d2 which is not perpendicular to d1.

GMP: Gradient hill-climb (wrt d) Variance(DPPd) starting at d2≡Unitized( Vom{x-xod1 | x∈X2} - Mean{x-xod1 | x∈X2} ), subject only to d o d=1. We shouldn't constrain the 2nd hill-climb to d1 o d2=0, nor subsequent hill-climbs to dk o dh=0, h=2...k-1, because the gap could be larger without the constraint.

GCCP: Gradient hill-climb (wrt d) Variance(DPPd) starting at d2=D2/|D2| where D2=CCi(X2)-CCj(X2), subject to d o d=1, where the CCs are two of the Circumscribing rectangle's Corners (the CCs may be faster to calculate than Mean and Vom).

Taking all edges and diagonals of CCR(X) (the Coordinate-wise Circumscribing Rectangle of X) provides a grid of unit vectors. It is an equi-spaced grid iff we use a CCC(X) (Coordinate-wise Circumscribing Cube of X). Note that there may be many CCC(X)s. A canonical one is the one that is furthest from the origin (take the longest side first, then extend each other side the same distance from the origin side of that edge).

A good choice may be to always take the longest side of CR(X) as D, D≡LSCR(X). Should outliers on the (n-1)-dim-faces at the ends of LSCR(X) be removed first? If so, remove all LSCR(X)-endface outliers until, after removal, the same side is still the LSCR(X). Then use that LSCR(X) as D.

19 WINE: GV, MVM and GM cluster trees
[Figure: cluster trees on WINE for GV, MVM and GM, built from F-MN gap cuts (gp2, gp3, gp8); each node is labelled with its F interval and its L/H class counts. ACCURACY table, row WINE, columns GV, MVM, GM.]

20 SEEDS GV MVM GM C3 .97 .15 .09 .14 0 ACCURACY SEEDS WINE GV 94 62.7
akk d1 d2 d3 d4 V(d 10(F-MN) gp6 10(F-MN)gp6 ___ ___ [0,9) 0k 0r 18c C1 ___ ___ [0,9) 0k 0r 18c C1 ___ ___ [9,18) 1k 0r 24c C2 GM ___ ___ [9,18) 1k 0r 24c C2 10(F-MN) gp3 ___ ___ [18,29) 10k 0r 8c C3 ___ ___ [18,29) 10k 0r 8c C3 ___ ___ [29,38) 18k 0r 0c C4 ___ ___ [29,38) 18k 0r 0c C4 ___ ___ [38,49) 13k 2r 0c C5 C2: 10(F-MN) gp10 63 1 ___ ___ [0,22) 0k 0r 42c C1 ___ ___ [38,49) 13k 2r 0c C5 ___ ___ [49,60) 7k 6r 0c C6 ___ ___ [0,31) 9k 0r 0c C21 ___ ___ [49,60) 7k 6r 0c C6 ___ ___ [60,71) 1k 7r 0c C7 ___ ___ [31,41) 1k 0r 4c C22 ___ ___ [60,71) 1k 7r 0c C7 ___ ___ [71,80) 0k 8r 0c C8 ___ ___ [22,33) k 0r 8c C2 ___ ___ [41,64) 0k 0r 4c C23 ___ ___ [71,80) 0k 8r 0c C8 ___ ___ [80,92) 0k 21r 0c C9 ___ ___ [92,102) 0k 2r 0c Ca ___ ___ [80,92) 0k 21r 0c C9 ___ ___ [92,102) 0k 2r 0c Ca ___ ___ [102,105) 0k 4r 0c Cb ___ ___ [102,105) 0k 4r 0c Cb C3 200(F-MN)gp12 ___ ___ [33,57) k 2r 0c C3 C F-M g9 70 2 ___ ___ [0,10) k 0r 0c ___ ___ [0,35) k 0r 0c ___ ___ [10,20) 2k 0r 1c ___ ___ [35,48) 2k 0r 3c C4: 10(F-MN) gp21 99 3 ___ ___ [20,30) 2k 0r 1c ___ ___ [48,72) 0k 0r 2c ___ ___ [57,69) 6k 9r 0c C4 ___ ___ [30,40) 4k 0r 1c ___ ___ [69,76) 1k 4r 0c C6 ___ ___ [40,50) 0k 0r 1c ___ ___ [72,113) 0k 0r 3c ___ ___ [50,61) 0k 0r 1c ___ ___ [61,70) 0k 0r 1c ___ ___ [0,52) 1k 7r C41 ___ ___ [70,71) k 0r 2c ___ ___ [52,79) 1k 2r C42 C6 10(F-M) g12 48 1 akk C6 200(F-MN)gp12 74 2 ___ ___ [79100) 4k 0r C43 ___ ___ [0,50) k 0r 0c ___ ___ [50,60) k 0r 2c ___ ___ [76,103) 0k 26r 0c C7 ___ ___ [0,22) 4k 0r 0c ___ ___ [60,74) k 0r 3c ___ ___ [74,75) k 0r 1c ___ ___ [22,49) 3k 6r 0c ___ ___ [103,109) 0k 6r 0c C8

21 IRIS: GV, MVM and GM cluster trees
[Figure: cluster trees on IRIS for GV, MVM and GM, built from F-MN gap cuts (gp3, gp5, gp8) with rescaled F-M on the subclusters; each node is labelled with its s/e/i class counts. ACCURACY table, rows IRIS, SEEDS, WINE, columns GV, MVM, GM.]

22 CONCRETE: GV, MVM and GM cluster trees
[Figure: cluster trees on CONCRETE for GV, MVM and GM, built from (F-MN)/4, (F-MN)/5 and (F-MN)/8 gap cuts (g2 to g8); each node is labelled with its L/M/H class counts. ACCURACY table, rows CONCRETE, IRIS, SEEDS, WINE, columns GV, MVM, GM.]

23 ABALONE: GV, MVM and GM cluster trees
[Figure: cluster trees on ABALONE for GV, MVM and GM, built from scaled F-M gap cuts (g2, g3, with multipliers from 100 to 1500); each node is labelled with its L/M/H class counts. ACCURACY table, rows CONC, IRIS, SEEDS, WINE, ABAL, columns GV, MVM, GM.]

24 KOSblogs d=e841 (highest STD). d=UnitSTDVec g>6*avg
gp=1 Ct=8 C outliers. Some of them are substantial MVM gaps>6*avg d=e841 (highest STD). DOC W=841 C0 C1 1 2 C2 C3 C4 C5 C6 C7 C8 C9 C10 C13 otlrs otlrs otlrs otlr otlrs otlr otlr otlr otlr otlr otlr otlr C11 C12 AvgGp.0085 gp>6*avg ROW KOS F GAP CT 0.1=AvgGp 64=#gaps Row# Doc# F 28.2=MxGp .6=GapThreshold Gap 0 ___ ___ gap=.65 Ct=9 C1 ___ ___ gap=.6 Ct=2613 C2 ___ ___ gap=.75 Ct= 502 C3 ___ ___ gap=.6 Ct= 87 C4 Doc F=DPPd Gap 24=MxGp GV on 22 highest STD KOS wds d=( ) ___ ___ gap=.61 Ct=30 C5 ___ ___ gap=.73 Ct=45 C6 ___ ___ gap=.89 Ct=8 C7 ___ ___ gap=.65 Ct=8 C8 ___ ___ gp=.72 Ct= 11 C9 ___ ___ gp=.65 Ct=1 outlr ___ ___ gp=.61 Ct=12 C11 ___ ___ gp=1.2 Ct=6 C12 ___ ___ gp=1.1 Ct=11 C13 ___ ___ gap=.67 Ct=1 utlr Cluster size: d=USTD MVM GV 3 4 5 6 10 42 316 3029 ___ ___ gp=1.1 Ct=3 C15 ___ ___ gp=1.8 Ct=4 C16 ___ ___ gp=1.8 Ct=5 otl;r

25 GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians)
[Table, CONC: one row per starting vector, columns d1 d2 d3 d4 VAR. Rows: UCUC(1000), UCUC(0100), UCUC(0010), UCUC(0001), UCUC(1100), UCUC(1010), UCUC(1001), UCUC(0110), UCUC(0101), UCUC(0011), UCUC(1110), UCUC(1101), UCUC(1011), UCUC(0111), UCUC(1111), akk, MVM.]
On these pages we display the variance hill-climb for each of the four datasets (Concrete, IRIS, Seeds, Wine) for a grid of starting unit vectors, d. I took the circumscribing unit non-negative cube and used all the unitized diagonals. In low dimension (the dimension is 4 for every dataset here) this grid is very nearly a uniform grid. Note that this will work less and less well as the dimension grows. In all cases, the same local max and nearly the same unit vector are reached.
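The grid of starting vectors named UCUC(b1 b2 b3 b4) can be generated as below. This is a sketch under the assumption that UCUC(bits) is simply the non-zero 0/1 corner of the unit cube scaled to unit length; the function name is made up for the example.

```python
import itertools
import math

def unitized_cube_corners(n):
    """Map each non-zero 0/1 corner of the unit n-cube to its unit vector."""
    grid = {}
    for bits in itertools.product((0, 1), repeat=n):
        if not any(bits):
            continue  # the origin has no direction
        norm = math.sqrt(sum(bits))
        grid["".join(map(str, bits))] = [b / norm for b in bits]
    return grid

grid = unitized_cube_corners(4)   # the 15 UCUC starting vectors for n = 4
```

For n = 4 this yields the 15 UCUC rows of the tables; the akk and MVM starting vectors would be added separately. The number of corners grows as 2^n - 1, which is one reason the approach becomes impractical as the dimension grows.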

26 GV using a grid (Unitized Corners of Unit Cube + Diagonal of the Variance Matrix + Mean-to-Vector_of_Medians), part 2
[Tables, IRIS, SEEDS and WINE: one row per starting vector, columns d1 d2 d3 d4 VAR. Rows in each table: UCUC(1000), UCUC(0100), UCUC(0010), UCUC(0001), UCUC(1100), UCUC(1010), UCUC(1001), UCUC(0110), UCUC(0101), UCUC(0011), UCUC(1110), UCUC(1101), UCUC(1011), UCUC(0111), UCUC(1111), akk, MVM.]
As we all know, Dr. Ubhaya is the best Mathematician on campus and he is attempting to prove three things:
1. That a GV-hill-climb that does not reach the global max Variance is rare indeed.
2. That one is guaranteed to reach the global maximum with at least one of the coordinate unit vectors (so a 90 degree grid will always suffice).
3. That akk will always reach the global max.

27 Finding round clusters that aren't DPPd separable? (no linear gap)
Find the golf ball? Suppose we have a white mask pTree. No linear gaps exist to reveal it. Search a grid of d-tubes until a DPPd gap is found in the interior of the tube. (Form the mask pTree for the interior of the d-tube, then apply DPPd to that mask to reveal interior gaps.) Look for conical gaps (fix the cone point at the middle of the tube) over all cone angles (look for an interval of angles with no points). Notice that this method includes DPPd, since a gap for a cone angle of 90 degrees is linear.

28 FAUST Gap Revealer
Width ≥ 2^4, so compute all pTree combinations down to p4 and p4'. d=M-p.
[Figure: the 15 points z1, ..., zf plotted on a grid with the mean M, the table Z of their coordinates, and the bit-slice pTrees p6, ..., p0 of F with their complements p6', ..., p0'. Counts shown include C(p6')=5 and C(p6)=10.]
F=zod values as listed on the slide: 11 27 23 34 53 80 118 114 125 110 121 109 83

Interval by interval:
[0,15]=[0,16) has 1 point, z1. This is a 2^4 thinning. z1od=11 is only 5 units from the right edge, so z1 is not yet declared an outlier. Next, we check the min distance from the left edge of the next interval to see if z1's right-side gap is actually ≥ 2^4 (the calculation of the min is a pTree process, no x looping required!).
[16,32). The minimum, z3od=23, is 7 units from the left edge, 16, so z1 has only a 5+7=12 unit gap on its right (not a 2^4 gap). So z1 is not declared a 2^4 outlier (and is declared a 2^4 inlier).
[32,48). z4od=34 is within 2 of 32, so z4 is not declared an anomaly.
[48,64). z5od=53 is 19 from z4od=34 (>2^4) but 11 from 64. But the next interval, [64,80), is empty, so z5 is 27 from its right neighbor. z5 is declared an outlier and we put a subcluster cut through z5.
[64,80). This is clearly a 2^4 gap.
[80,96). z6od=80, zfod=83.
[96,112). zbod=110, zdod=109. So both {z6,zf} are declared outliers (gap ≥ 16 on both sides).
[112,128). z7od=118, z8od=114, z9od=125, zaod=114, zcod=121, zeod=125. No 2^4 gaps. But we can consult SpS(d2(x,y)) for actual distances:
[Table: pairwise distances d(X1,X2) among z7, ..., z13.]
This reveals that there are no 2^4 gaps in this subcluster. Incidentally, it also reveals a 5.8 gap between {7,8,9,a} and {b,c,d,e}, but that analysis is messy and the gap would be revealed by the next xofM round on this sub-cluster anyway.
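The walk-through can be mimicked without pTrees. In the sketch below, the per-interval counts come from dropping the low 4 bits of F (the same information the high-order bit slices p6, p5, p4 carry), and the points are then cut wherever consecutive F values are at least 2^4 apart. The F values are the ones quoted in the slide text, with z2=27 taken from the listed F sequence (an assumption about the ordering); the function names are made up for the example.

```python
F = {"z1": 11, "z2": 27, "z3": 23, "z4": 34, "z5": 53, "z6": 80,
     "z7": 118, "z8": 114, "z9": 125, "za": 114, "zb": 110,
     "zc": 121, "zd": 109, "ze": 125, "zf": 83}

def interval_counts(values, bits, top):
    """Count values in each interval [k*2^bits, (k+1)*2^bits) below top."""
    counts = [0] * (top >> bits)
    for v in values:
        counts[v >> bits] += 1   # dropping the low bits gives the interval index
    return counts

def cut_at_gaps(F, width):
    """Split point names into groups wherever consecutive F values differ by >= width."""
    names = sorted(F, key=F.get)
    groups = [[names[0]]]
    for a, b in zip(names, names[1:]):
        if F[b] - F[a] >= width:
            groups.append([])
        groups[-1].append(b)
    return groups

counts = interval_counts(F.values(), 4, 128)
groups = cut_at_gaps(F, 16)
```

An empty interval (here [64,80)) guarantees a gap of at least 2^4, while a sparse interval only flags candidates whose true gaps still have to be checked against the neighboring intervals, as the slide does for z1 and z5. On these values the cut isolates z5 as a singleton and {z6, zf} as a doubleton.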

29 FAUST Tube Clustering: (This method attempts to build tubular-shaped gaps around clusters)
Allows for a better fit around convex clusters that are elongated in one direction (not round). Gaps are sought in dot product lengths [projections] on the line.

Exhaustive search for all tubular gaps: it takes two parameters for a pseudo-exhaustive search (exhaustive modulo a grid width).
1. A StartPoint, p (an n-vector, so n dimensional).
2. A UnitVector, d (an n-direction, so n-1 dimensional: a grid on the surface of the sphere in R^n).
Then for every choice of (p,d) (e.g., in a grid of points in R^(2n-1)) two functionals are used to enclose subclusters in tubular gaps:
a. SquareTubeRadius functional, STR(y) = (y-p)o(y-p) - ((y-p)od)^2
b. TubeLength functional, TL(y) = (y-p)od

[Figure: a tube along d starting at p, showing a point y, the tube cap gap width and the tube radius gap width.]

Given a p, do we need a full grid of ds (directions)? No! d and -d give the same TL-gaps.
Given d, do we need a full grid of p starting points? No! All p' s.t. p'=p+cd give the same gaps.
Hill climb the gap width from a good starting point and direction.

MATH: we need the dot product projection length and the dot product projection distance. The projection of y on f is $\frac{y\circ f}{f\circ f}\,f$, with dot product projection length $y\circ\frac{f}{|f|}$. The squared projection distance of y on f is

$$\Big|\,y - \frac{y\circ f}{f\circ f}\,f\,\Big|^2 = y\circ y - 2\,\frac{(y\circ f)^2}{f\circ f} + \frac{(y\circ f)^2}{(f\circ f)^2}\,f\circ f = y\circ y - \frac{(y\circ f)^2}{f\circ f}$$

Squared y-p on q-p projection distance:

$$(y-p)\circ(y-p) - \frac{\big((y-p)\circ(q-p)\big)^2}{(q-p)\circ(q-p)} = y\circ y - 2\,y\circ p + p\circ p - \Big(y\circ\frac{q-p}{|q-p|} - p\circ\frac{q-p}{|q-p|}\Big)^2$$

For the dot product length projections (caps) we already needed:

$$(y-p)\circ\frac{M-p}{|M-p|} = y\circ\frac{M-p}{|M-p|} - p\circ\frac{M-p}{|M-p|}$$

That is, we needed to compute the green constants and the blue and red dot product functionals in an optimal way (and then do the PTreeSet additions/subtractions/multiplications). What is optimal? (Minimizing PTreeSet functional creations and PTreeSet operations.)
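The two tube functionals are easy to check numerically. The sketch below evaluates STR and TL for single points with plain Python lists (the slides would compute them with PTreeSet operations); the function names are chosen for the example, and d is assumed to be a unit vector.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def tube_length(y, p, d):
    """TL(y) = (y-p) o d: signed position of y along the tube axis."""
    return dot(sub(y, p), d)

def square_tube_radius(y, p, d):
    """STR(y) = (y-p) o (y-p) - ((y-p) o d)^2: squared distance from the axis."""
    v = sub(y, p)
    return dot(v, v) - dot(v, d) ** 2

y, p, d = [3, 4], [0, 0], [1, 0]
tl, radius_sq = tube_length(y, p, d), square_tube_radius(y, p, d)

# Sliding the start point along the axis (p' = p + 2d) leaves STR unchanged
# and shifts every TL value by the same constant, so the gaps are the same.
p_shifted = [2, 0]
tl_shifted = tube_length(y, p_shifted, d)
radius_sq_shifted = square_tube_radius(y, p_shifted, d)
```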

30 Cone Clustering: (finding cone-shaped clusters)
F=(y-M)o(x-M)/|x-M|-mn restricted to a cosine cone on IRIS.

Cosine conical gapping seems quick and easy: cosine = dot product divided by both lengths. The length of the fixed vector, x-M, is a one-time calculation. The length of y-M changes with y, so build the PTreeSet.

Each run below lists "F-value count" pairs as extracted from the slide. Labels such as i39 or e21 mark individual points, and a trailing number, where present, is the total count.

Corner points (gap in dot product projections onto the cornerpoints line):
x=s1 cone=1/√2: 60 3 61 4 62 3 63 10 64 15 65 9 66 3 67 1 69 2 50
x=s2 cone=1/√2: 47 1 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 51
x=s2 cone=.9: 59 2 60 3 61 3 62 5 63 9 64 10 65 5 66 4 67 4 69 1 70 1 47
x=s2 cone=.1: 39 2 40 1 41 1 44 1 45 1 46 1 47 1 i39 59 2 60 4 61 3 62 6 63 10 64 10 65 5 66 4 67 4 69 1 70 1 59
x=i1 cone=.707: 34 1 35 1 36 2 37 2 38 3 39 5 40 4 42 6 43 2 44 7 45 5 47 2 48 3 49 3 50 3 51 4 52 3 53 2 54 2 55 4 56 2 57 1 58 1 59 1 60 1 61 1 62 1 63 1 64 1 66 1 75
x=e1 cone=.707: 33 1 36 2 37 2 38 3 39 1 40 5 41 4 42 2 43 1 44 1 45 6 46 4 47 5 48 1 49 2 50 5 51 1 52 2 54 2 55 1 57 2 58 1 60 1 62 1 63 1 64 1 65 2 60

Cosine cone gap (over some angle):
w maxs cone=.707: 0 2 8 1 10 3 12 2 13 1 14 3 15 1 16 3 17 5 18 3 19 5 20 6 21 2 22 4 23 3 24 3 25 9 26 3 27 3 28 3 29 5 30 3 31 4 32 3 33 2 34 2 35 2 36 4 37 1 38 1 40 1 41 4 42 5 43 5 44 7 45 3 46 1 47 6 48 6 49 2 51 1 52 2 53 1 55 1 137
w maxs cone=.93: 8 1 i10 13 1 14 3 16 2 17 2 18 1 19 3 20 4 21 1 24 1 25 4 e21 e34 27 2 29 2 i7. 27/29 are i's.
w maxs cone=.925: 8 1 i10 13 1 14 3 16 3 17 2 18 2 19 3 20 4 21 1 24 1 25 5 e21 e34 27 2 28 1 29 2 e35 i7. 31/34 are i's.
w maxs-to-mins cone=.939: i25 i40 i16 i42 i17 i38 i11 i48 22 2 23 1 i34 i50 i24 i28 i27 27 5 28 3 29 2 30 2 31 3 32 4 34 3 35 4 36 2 37 2 38 2 39 3 40 1 41 2 46 1 47 2 48 1 i39 53 1 54 2 55 1 56 1 57 8 58 5 59 4 60 7 61 4 62 5 63 5 64 1 65 3 66 1 67 1 68 1 114. 14 i and 100 s/e, so picks i as 0.
w naaa-xaaa cone=.95: 12 1 13 2 14 1 15 2 16 1 17 1 18 4 19 3 20 2 21 3 22 5 i21 24 5 25 1 27 1 28 1 29 2 i7. 41/43 e, so picks e.
w aaan-aaax cone=.54: 7 3 i27 i28 8 1 9 3 i20 i34 11 7 12 13 13 5 14 3 15 7 19 1 20 1 21 7 22 7 23 28 24 6. 100/104 s or e, so 0 picks i.
w xnnn-nxxx cone=.95: 8 2 i22 i50 10 2 i28 i24 i27 i34 13 2 14 4 15 3 16 8 17 4 18 7 19 3 20 5 21 1 22 1 23 1 i39. 43/50 e, so picks out e.
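The cone-restricted functional can be sketched as follows: keep only the points y whose direction from M makes a cosine of at least `cone` with x-M, compute the projection length (y-M)o(x-M)/|x-M| for those points, and subtract the minimum. Plain lists stand in for the PTreeSet here, and the function name, the index-keyed result and the toy points are assumptions for the example.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sub(a, b):
    return [u - v for u, v in zip(a, b)]

def cone_f_values(Y, M, x, cone):
    """Return {index: F} for points of Y inside the cosine cone around x-M."""
    axis = sub(x, M)
    axis_len = math.sqrt(dot(axis, axis))    # one-time calculation
    kept = {}
    for idx, y in enumerate(Y):
        v = sub(y, M)
        v_len = math.sqrt(dot(v, v))         # changes with y
        if v_len == 0.0:
            continue                          # y == M has no direction
        proj = dot(v, axis) / axis_len        # (y-M) o (x-M) / |x-M|
        if proj / v_len >= cone:              # cosine of the angle at M
            kept[idx] = proj
    if not kept:
        return {}
    mn = min(kept.values())
    return {i: f - mn for i, f in kept.items()}

Y = [[2, 0], [1, 1], [0, 1], [-1, 0]]
narrow = cone_f_values(Y, [0, 0], [1, 0], 0.9)   # only the point on the axis
wide = cone_f_values(Y, [0, 0], [1, 0], 0.7)     # also admits the 45 degree point
```

Gaps are then searched in the returned F values exactly as for the unrestricted dot product functional.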

31 "Gap Hill Climbing": mathematical analysis
Rotate d toward a higher F-STD, or grow one gap using support pairs. F-slices are hyperplanes (assuming F=dot d), so it makes sense to try to "re-orient" d so that the gap grows. Instead of taking the "improved" p and q to be the means of the entire n-dimensional half-spaces cut by the gap (or thinning), take p and q to be the means of the F-slice (n-1)-dimensional hyperplanes defining the gap or thinning. This is easy, since our method produces the pTree mask, the sequence of F-values and the sequence of counts of points that give us those values, which we use to find large gaps in the first place.

Dot F, p=aaan q=aaax (F-value count pairs): 0 6 1 28 2 7 3 7 4 1 5 1 9 7 10 3 11 5 12 13 13 8 14 12 15 4 16 2 17 12 18 5 19 6 20 6 21 3 22 8 23 3 24 3
C1<7 (50 Set); 7<C2<16 (4i, 48e); C3>16 (46i, 2e).

[Figure: points a..s with direction d1 and the d1-gap, then direction d2 with the d2-gap after re-orientation.]
The d2-gap >> the d1-gap (still not optimal). Weight the mean by the distance from the gap? (d-barrel radius.) In this example it seems to make for a larger gap, but what weightings should be used (e.g., 1/radius^2)? (Zero weighting after the first gap is identical to the previous.) Also, we really want to identify the support vector pair of the gap (the pair, one from one side and the other from the other side, which are closest together) as p and q (in this case, 9 and a, but we were just lucky to draw our vector through them). We could check the d-barrel radius of just these gap slice pairs and select the closest pair as p and q???

Hill-climb the gap at 16 with half-space averages.
C2uC3, p=avg<16 q=avg>16 (F-value count pairs): 0 1 1 1 2 2 3 1 7 2 9 2 10 2 11 3 12 3 13 2 14 5 15 1 16 3 17 3 18 2 19 2 20 4 21 5 22 2 23 5 24 9 25 1 26 1 27 3 28 2 29 1 30 3 31 5 32 2 33 3 34 3 35 1 36 2 37 4 38 1 39 1 42 2 44 1 45 2 47 2
No conclusive gaps.
Sparse Lo end: check [0,9] distances among i39 e49 e8 e44 e11 e32 e30 e15 e31. i39, e49, e11 are singleton outliers. {e8,i44} is a doubleton outlier set.
Sparse Hi end: check [38,47] distances among i31 i8 i36 i10 i6 i23 i32 i18 i19. i10, i18, i19, i32, i36 are singleton outliers. {i6,i23} is a doubleton outlier.
There is a thinning at 22 and it is the same one, but it is not more prominent.

Next we attempt to hill-climb the gap at 16 using the mean of the half-space boundary (i.e., p is avg=14; q is avg=17).
C123, p avg=14 q avg=17 (F-value count pairs): 0 1 2 3 3 2 4 4 5 7 6 4 7 8 8 2 9 11 10 4 12 3 13 1 20 1 21 1 22 2 23 1 27 2 28 1 29 1 30 2 31 4 32 2 33 3 34 4 35 1 36 3 37 4 38 2 39 2 40 5 41 3 42 3 43 6 44 8 45 1 46 2 47 1 48 3 49 3 51 7 52 2 53 2 54 3 55 1 56 3 57 3 58 1 61 2 63 2 64 1 66 1 67 1
Here, the gap between C1 and C2 is more pronounced. Why? Is the thinning between C2 and C3 more obscure? It did not grow the gap we wanted to grow (between C2 and C3).
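One re-orientation step of the kind described above can be sketched like this: find the largest gap in the (integer-rounded) F values, take p and q as the means of the two F-slices that bound it, and use the unitized q-p as the new direction. This is only an illustration on toy data. As the slide's own experiments show, the gap need not grow on real data, and the function names are invented for the example.

```python
import math

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def largest_gap(F):
    """Consecutive distinct F values (lo, hi) with the largest difference."""
    values = sorted(set(F))   # needs at least two distinct values
    return max(zip(values, values[1:]), key=lambda pair: pair[1] - pair[0])

def slice_mean(X, F, value):
    """Mean of the points lying in the F-slice F == value."""
    pts = [x for x, f in zip(X, F) if f == value]
    return [sum(col) / len(pts) for col in zip(*pts)]

def reorient(X, d):
    """New unit direction from the means of the two slices bounding the largest gap."""
    F = [round(dot(x, d)) for x in X]
    lo, hi = largest_gap(F)
    p, q = slice_mean(X, F, lo), slice_mean(X, F, hi)
    D = [b - a for a, b in zip(p, q)]
    norm = math.sqrt(dot(D, D))
    return [c / norm for c in D]

X = [[0, 0], [1, 0], [3, 3], [4, 3]]
d1 = [1, 0]
d2 = reorient(X, d1)

def widest(X, d):
    proj = sorted(dot(x, d) for x in X)
    return max(b - a for a, b in zip(proj, proj[1:]))

gap_before, gap_after = widest(X, d1), widest(X, d2)
```

On this toy set the gap widens from 2 to about 3.6 because the new direction runs through the two boundary slices rather than along the first coordinate axis.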

33 Hierarchical Clustering
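[Dendrogram: ABCDEFG splits into ABC and DEFG; ABC into A and BC; DEFG into DE and FG; BC into B and C; DE into D and E; FG into F and G.]
Any maximal anti-chain (a maximal set of nodes such that no 2 are directly connected) is a clustering, so a dendrogram offers many clusterings. But the horizontal anti-chains are the clusterings that result from the top-down (or bottom-up) method(s).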
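To see how many clusterings a dendrogram offers, the sketch below enumerates the maximal anti-chains of the seven-leaf tree, reading "anti-chain" in the usual order-theoretic sense (no node is an ancestor of another) and "maximal" as covering every leaf. The tree dictionary is a reconstruction from the node labels on the slide, and the function name is hypothetical.

```python
from itertools import product

# Dendrogram reconstructed from the slide's node labels (assumed structure).
TREE = {
    "ABCDEFG": ["ABC", "DEFG"],
    "ABC": ["A", "BC"],
    "BC": ["B", "C"],
    "DEFG": ["DE", "FG"],
    "DE": ["D", "E"],
    "FG": ["F", "G"],
}

def clusterings(node, tree):
    """All maximal anti-chains of the subtree rooted at node.
    Either keep the node itself as one cluster, or combine one
    clustering from each of its children."""
    result = [[node]]
    children = tree.get(node, [])
    if children:
        for combo in product(*(clusterings(c, tree) for c in children)):
            result.append([n for part in combo for n in part])
    return result

all_clusterings = clusterings("ABCDEFG", TREE)
```

The horizontal cuts, such as ["ABC", "DEFG"] and ["A", "BC", "DE", "FG"], appear in the list alongside mixed-level ones such as ["ABC", "DE", "F", "G"].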

34 CONCRETE GV F=(DPP-MN)/4 Concrete(C, W, FA, A) Accuracy=90%
First round, F-value count pairs: 0 1 1 1 5 1 6 1 7 1 8 4 9 1 10 1 11 2 12 1 13 5 14 1 15 3 16 3 17 4 18 1 19 3 20 9 21 4 22 3 23 7 24 2 25 4 26 8 27 7 28 7 29 10 30 3 31 1 32 3 33 6 34 4 35 5 37 2 38 2 40 1 42 3 43 1 44 1 45 1 46 4 49 1 56 1 58 1 61 1 65 1 66 1 69 1 71 1 77 1 80 1 83 1 86 1

First-round clusters:
  gap=14: [90,113) 0L 6M 0H CLUS_1
  gap=6: [74,90) 0L 4M 0H CLUS_2
  gap=7: [52,74) 0L 7M 0H CLUS_3
  [0,90) 43L 46M 55H CLUS 4
At this level, FinalClus1={17M}, 0 errors.

CLUS 4 (F=(DPP-MN)/2, Fgap≥2), F-value count pairs: 0 3 7 4 9 1 10 12 11 8 12 7 15 4 18 10 21 3 22 7 23 2 25 2 26 3 27 1 28 2 29 1 31 3 32 1 34 2 40 4 47 3 52 1 53 3 54 3 55 4 56 2 57 3 58 1 60 2 61 2 62 4 64 4 67 2 68 1 71 7 72 3 79 5 85 1 87 2

Subclusters of CLUS 4 in increasing F order (interval, L/M/H counts, median, average, notes; __ marks a value missing from the slide text):
  __       __L 0M 3H    Median=0     Avg=0
  __       __L 0M 4H    Median=7     Avg=7
  [8,14]   1L 5M 22H    Median=11    Avg=10.7   1L+5M err H
  gap=3
  __       __L 0M 4H    Median=15    Avg=15
  __       __L 0M 10H   Median=18    Avg=18
  [20,24)  0L 10M 2H    Median=22    Avg=22     2H errs in L
  [24,30)  __L 0M 0H    Median=26    Avg=26
  gap=2
  [30,33]  0L 4M 0H     Median=31    Avg=32.3
  __       __L 2M 0H    Median=34    Avg=34
  __       __L 4M 0H    Median=40    Avg=40
  __       __L 3M 0H    Median=47    Avg=47
  [50,59)  __L 1M 4H    Median=55    Avg=55     1M+4H errs in L
  [59,63)  __L 0M 0H    Median=61.5  Avg=61.3
  gap=2
  __       __L 0M 2H    Median=64    Avg=__     H errs in L
  [66,70)  10L 0M 0H    Median=67    Avg=67.3
  gap=3
  [70,79)  10L 0M 0H    Median=71    Avg=71.7
  gap=7
  __       __L 0M 0H    Median=79    Avg=79     gap=6
  [74,90)  2L 0M 1H     Median=87    Avg=86.3   Merr in L
Accuracy=90%
[Dendrogram over these subclusters, with node labels med=10, 14, 9, 18, 17, 21, 23, 40, 34, 33, 56, 61, 57, 62, 71, 71, 86.]

Suppose we know that we want 3 strength clusters: Low, Medium and High. We can use an anti-chain that gives us exactly 3 subclusters in two ways, one shown in brown and the other in purple. Which would we choose? The brown seems to give slightly more uniform subcluster sizes.
Brown error count: Low (bottom) 11, Medium (middle) 0, High (top) 26, so 96/133=72% accurate.
Purple error count: Low 2, Medium 22, High 35, so 74/133=56% accurate.

What about agglomerating using single link agglomeration (minimum pairwise distance)? Agglomerate (build the dendrogram) by iteratively gluing together clusters with min Median separation. Should I have normalized the rounds? Should I have used the same F divisor and made sure the range of values was the same in the 2nd round as it was in the 1st round (on CLUS 4)? Can I normalize after the fact, by multiplying 1st round values by 100/88=1.76? Agglomerate the 1st round clusters and then independently agglomerate the 2nd round clusters?
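The "glue together clusters with min Median separation" rule can be sketched on one-dimensional F values as follows. Clusters are kept sorted by median, and at each step the adjacent pair whose medians are closest is merged until k clusters remain. The function name and the toy clusters are assumptions for the example, and ties are broken in favor of the lower-F pair, which the slide does not specify.

```python
import statistics

def agglomerate(clusters, k):
    """Merge adjacent 1-D clusters with the smallest median separation
    until only k clusters remain."""
    clusters = sorted((sorted(c) for c in clusters), key=statistics.median)
    while len(clusters) > k:
        meds = [statistics.median(c) for c in clusters]
        # Index of the adjacent pair with the smallest median separation.
        i = min(range(len(meds) - 1), key=lambda j: meds[j + 1] - meds[j])
        clusters[i:i + 2] = [sorted(clusters[i] + clusters[i + 1])]
    return clusters

# Toy F values grouped into four gap-separated subclusters.
start = [[0, 1], [2, 3], [10, 11], [20]]
three = agglomerate(start, 3)
two = agglomerate(start, 2)
```

Recording the sequence of merges instead of only the final result gives the dendrogram, from which a 3-subcluster anti-chain like the brown or purple one could then be read off.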

35 CONCRETE GV: agglomerating using single link (min pairwise distance = min gap size, so glue min-gap adjacent clusters first)
CLUS 4 (F=(DPP-MN)/2, Fgap≥2): the F-value count pairs and the subcluster table (L/M/H counts, medians, averages, Accuracy=90%) are the same as on the previous slide.

The first thing we can notice is that outliers mess up agglomerations which are supervised by knowledge of the number of subclusters expected. Therefore we might remove outliers by backing away from all gap≥5 agglomerations, then looking for a 3-subcluster max anti-chain.

What we have done is to declare F<7 and F>84 as extreme tripleton outlier sets, and F=79, F=40 and F=47 as singleton outlier sets, because they are F-gapped by at least 5 (which is actually 10) on either side.

The brown gives more uniform sizes. Brown errors: Low (bottom) 8, Medium (middle) 12 and High (top) 6, so 107/133=80% accurate. The one decision to agglomerate C4.7.1 to C4.7.2 (gap=3) instead of C4.3.2 to C4.7.2 (gap=3) introduces a lot of error.

C4.7.1 and C4.7.2 are problematic since they do separate out, but in increasing F order it's H M L M L, so if we suspected this pattern we would look for 5 subclusters. The 5 orange errors in increasing F-order are: 6, 2, 0, 0, 8 so 127/133=95% accurate.

If you have ever studied concrete, you know it is a very complex material. The fact that it clusters out with an F-order pattern of HMLML is just bizarre! So we should expect errors.

