
Check out the ebook on FPGAs and DP (data processing). Two points about the topic:

1. Thinking about FPGA DM (data mining) together with the raging debate about the efficacy of non-SQL, key-value, Hadoop... DP (google: DeWitt, Hadoop), the following success path for pTrees leaps to mind. Hadoop et al. store and catalogue Big Data across massive clusters of computers. SQL DBMSs can't manage data that big. But what the SQL people [DeWitt, Stonebraker et al.] claim is that non-SQL leaves you high and dry in terms of processing that Big Data. The Hadoop response: "Yes, but we never intended otherwise. And BTW, you guys can't even store and catalog big data! And if you don't have it, YOU aren't processing it (and neither is anyone else!)"

Is this an opening for pTrees? YES! Assume a Hadoop, big data, distributed TrainingSet (identified as the Hadoop data for a particular high-value Prediction/Classification task). Assume we have conversion routines to convert it to a single PTS (PTreeSet), possibly multilevel and compressed. Now that we have a PTS for classification, we use FPGA implementations of Md's SPTSA (ScalarPTreeSetAlgebra) and FAUST algorithms (incl. SVD) to [nearly instantaneously] produce answers. Note that each PTS is really the concatenation of many SPTSs, so Md's SPTSA and FAUST apply immediately once we have the PTS. How many SPTSs does a PTS of a distributed Hadoop TrainingSet produce? Exactly as many as the dimension of the Hadoop TrainingSet.

pTree Unbelievers (pTUs) will say: "Sure, but let's see those 'conversion routines' that convert the Hadoop, big data, distributed TrainingSet [needed for a particular high-value Classification task] to a PTS." Our answer: pTrees don't perform magic! pTrees are the best tool for only certain tasks. If the customer can't give us a well-defined description of the Training Table for a classification task (i.e., says "Here's all our Hadoop data. Mine it!"), we should just walk away (as should everyone), because the customer is asking for Mining Magic (MM). And the only bona fide MM is the candy. We'll admit, however, that defining the classification task and producing the TrainingTable for it IS HARD!

Looking for the low-hanging fruit: for which Big Data classification task can we identify a TrainingSet and convert it to a PTS? It only takes 1 big winner! Don't be everything to everybody.

A strategy? Suppose we come upon a Big Data Classification task that the non-SQL people are already doing for big money. That means they are identifying the pieces in Hadoop which make up the TrainingTable and then using a [pathetic] combination of distributed map-reduce routines to mine it, one TrainingTable virtual record at a time. We ought to be able to use their "map" code (to identify the Training Data), strip their "reduce" code down to "gather raw data" into a TT (TrainingTable), and replace their VPHD code with a "produce a PTS" conversion. Then we FAUST it.

2. I'm curious what Dr. Wettstein's reaction is to the opening of the abstract regarding the cause of the industry's move to parallel programming. It seems the author plays down the reaching of the miniaturization limit (paramagnetic limit) and focuses on power consumption and heat dissipation issues as causes. Of course, I suppose, if further breakthroughs in miniaturization had occurred (more elements per die), those breakthroughs would have expressed themselves also as power/heat breakthroughs? So it's like watching a sports game: what you see depends upon where your seat is. Students: pay attention to anything Dr. Wettstein has time to share on this! He was in direct contact with Intel Research during this time, so his "seat" was right behind home plate, while this author's may have been in the cheap seats.

Synthesis Lectures on Data Management, edited by M. Tamer Ozsu of the University of Waterloo. Now available: Data Processing on FPGAs by Jens Teubner (Technical University of Dortmund, Germany) and Louis Woods (ETH Zurich).
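To make the "one SPTS per TrainingSet dimension" claim from point 1 concrete, here is a minimal sketch of a vertical conversion. It assumes a plain in-memory table of small non-negative integers, with no Hadoop, no compression and no multilevel trees. The names to_bit_slices, table_to_pts and from_bit_slices are hypothetical and are not taken from Md's SPTSA or the FAUST code.

```python
from typing import Dict, List


def to_bit_slices(column: List[int], bit_width: int) -> List[int]:
    """Turn one column into bit_width vertical slices.

    Slice j packs bit j of every row into one Python int,
    with row i stored at bit position i of that int.
    """
    slices = [0] * bit_width
    for row, value in enumerate(column):
        if value < 0 or value >= (1 << bit_width):
            raise ValueError(f"value {value} does not fit in {bit_width} bits")
        for j in range(bit_width):
            if (value >> j) & 1:
                slices[j] |= 1 << row
    return slices


def table_to_pts(table: Dict[str, List[int]], bit_width: int) -> Dict[str, List[int]]:
    """One slice set (an SPTS stand-in) per column of the training table."""
    return {name: to_bit_slices(col, bit_width) for name, col in table.items()}


def from_bit_slices(slices: List[int], n_rows: int) -> List[int]:
    """Inverse of to_bit_slices, used here only to check the round trip."""
    return [
        sum(((s >> row) & 1) << j for j, s in enumerate(slices))
        for row in range(n_rows)
    ]


# Hypothetical two-column training table with four rows.
training_table = {"age": [5, 3, 7, 0], "rating": [1, 4, 4, 2]}
pts = table_to_pts(training_table, bit_width=3)
```

The resulting dictionary has exactly one entry per column, which is the sense in which the number of SPTSs equals the dimension of the TrainingSet.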

After thinking about it and discussing it with several people, I think we should code in C++ even though C# got the nod at the last meeting ;-) The reason is that C++ is a research language (while C# and Java may be the best industry production languages). I have had this experience before and I gave in. The result was (in the case of the previous large development project, SMILEY) a system that no one used after the research evolved a bit. There are all sorts of reasons, which we can discuss next week. But also, we have a fantastic resource in Greg Wettstein (and Bryan Mesich) and a foundation of Recommender code in C++ which is absolutely first rate. We should use it! Once the research stabilizes, we can do a production version in C#. A case in point is the SVD recommender: we don't know how it will end up (I don't understand Funk's algs yet. Anybody?).

The bottom line is speed. In Horizontal Processing of Vertical Data (HPVD), we need to be able to control everything, including memory allocation/deallocation, I/O management (including cache), AND/OR/COMP level-0 and level-1 algorithm coding, and HOBBIT-style (bit-slice) shortcutting in Md's ScalarPTreeSet Algebra, just to name a few.

Here's a first test of that assertion: Can you define a level-1 (or even level-0) PTreeSet in C# which takes only 1 memory bit per data bit without compromising bit-level processing speed? Maybe you can, but in SMILEY we ended up using a byte per bit because it was easier coding. Here's a second test: Can you code Dr. Wettstein's Logical Processing and 1-bit-counting speed enhancements (found in the Netflix code) in C#?

I guess what I'm trying to get across is that coding speed is not the issue; execution speed is. If we don't get maximal execution speed, we've got nothing! It isn't that we have a lot of completely new basic algorithms (we do have some, of course); it's that we have an approach (HPVD at the bit-slice level, possibly compressed to a tree) which facilitates orders-of-magnitude speedup of most DM algs. With orders-of-magnitude speedup, we can implement algorithms which will run in acceptable time that used to take too long to be usable (e.g., training up an SVD classifier with 1,000,000 features). I believe, to do that, we need HPVD (pTrees). And I believe to do pTrees right we need to use C++. Prove me wrong by training a 1M-feature SVD using C#! I actually hope I'm wrong ;-)
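For readers new to the AND/OR/COMP and 1-bit-counting operations mentioned above, the sketch below shows what they compute on uncompressed bit slices. It is written in Python only for brevity and says nothing about speed, which is the actual argument of this slide. It is not Dr. Wettstein's Netflix code. The helper names and the tiny 4-row column are assumptions made for illustration.

```python
def count_ones(mask: int) -> int:
    """Number of 1-bits, i.e. the number of rows selected by the mask."""
    return bin(mask).count("1")


def complement(mask: int, n_rows: int) -> int:
    """COMP restricted to the n_rows valid row positions."""
    return ~mask & ((1 << n_rows) - 1)


N_ROWS = 4

# Bit slices of the 3-bit column [5, 3, 7, 0].
# List index = bit position; row i lives at bit i of each packed integer.
slices = [0b0111, 0b0110, 0b0101]

# value >= 4 exactly when the high bit is set.
at_least_4 = slices[2]

# value is odd exactly when the low bit is set.
odd = slices[0]

# AND of two predicate masks: rows that are odd and >= 4 (values 5 and 7).
odd_and_at_least_4 = at_least_4 & odd

# OR of two slices: rows with value >= 2 (values 5, 3 and 7).
at_least_2 = slices[2] | slices[1]

# Equality test value == 3 (binary 011): COMP the high slice, AND the rest.
equals_3 = complement(slices[2], N_ROWS) & slices[1] & slices[0]
```

Counting rows that satisfy a predicate is then a single count_ones call on the resulting mask, with no per-row scan of the original column.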

Using the new dataset with 20 movies and 51 users.

[Table: the initial feature vector (columns a, b, c, ... n, ...) followed by one block per round, each giving the feature vector, LRATE, MSE, line search details and delta mse. A note on the slide gives the multipliers 1.75 and 0.79. The numeric values are not preserved in this transcript.]

Here is the result after 1 round when using a fixed-increment line search to minimize mse with respect to the LRATE used:

[Table: feature vector components a through t, LRATE, MSE. Values are not preserved in this transcript.]

Without line search, using Funk's LRATE=.001, it takes 81 rounds to arrive at about the same mse (and a nearly identical feature vector).

Going from the round 1 result (LRATE=.0525) shown here, we do a second round and again do a fixed-increment line search. We came up with an approximately minimized mse at LRATE=.030. Going from the result of this line search at LRATE=.03, we do another round. Going from the result of that line search at LRATE=.02, we do the same for the next round. Is LRATE=.02 stable and near-optimal? (No further line search.)

After 200 rounds at LRATE=.02 (note that it took ~2000 rounds without line search and ~219 with line search), comparing this feature vector to the one we got with ~2000 rounds at LRATE=.001 (without line search), we see that we arrive at a very different feature vector:

[Table: the two feature vectors, components a through t: LRATE=.001 with no line search, and LRATE=.020 with line search. Values are not preserved in this transcript.]

However, the UserFeatureVector portions differ by a constant multiplier and the MovieFeatureVector portions differ by a different constant. If we divide the LR=.001 vector by the LR=.020 vector, we get the following multiplier vector (one is not a dilation of the other, but if we split the user portion from the movie portion, they are!!! What does that mean!?!?!?):

".001/.020" multipliers: user portion avg 1.80, std 0.04; movie portion avg 0.54, std 0.01.

Another interesting observation is that 1 / 1.8 = .55, that is, 1 / AVGufv = AVGmfv. They are reciprocals of one another!!! This makes sense, since it means that if you double the ufv you have to halve the mfv to get the same predictions. The bottom line is that the predictions are the same!

What is the nature of the set of vectors that [nearly] minimize the mse? It is not a subspace (not closed under scalar multiplication), but it is clearly closed under "reciprocal scalar multiplication" (multiplying the mfv's by the reciprocal of the ufv's multiplier). What else can we say about it?

So, we get an order of magnitude speedup from line search. It may be more than that, since we may be able to do all the LRATE calculations in parallel (without recalculating the error matrix or feature vectors????). Or there may be a better search mechanism than fixed-increment search. A binary-type search? Others?
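The "reciprocal scalar multiplication" observation can be checked directly. The sketch below assumes the single-feature model in which the prediction for user u and movie m is the product f(u) * f(m), and uses made-up feature values; only the multiplier 1.8 echoes the slide. Scaling the user portion by c and the movie portion by 1/c leaves every prediction, and therefore the mse, unchanged.

```python
from typing import List


def predictions(ufv: List[float], mfv: List[float]) -> List[List[float]]:
    """Single-feature predictions: p[u][m] = f(u) * f(m)."""
    return [[u * m for m in mfv] for u in ufv]


# Hypothetical feature values, for illustration only.
ufv = [1.2, 0.8, 1.5]
mfv = [3.0, 2.5]

c = 1.8  # multiplier applied to the user portion
scaled_ufv = [c * u for u in ufv]
scaled_mfv = [m / c for m in mfv]

original = predictions(ufv, mfv)
rescaled = predictions(scaled_ufv, scaled_mfv)

# Largest difference between corresponding predictions (floating-point noise only).
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip(original, rescaled)
    for a, b in zip(row_a, row_b)
)
```

Since (c * f(u)) * (f(m) / c) = f(u) * f(m) for any nonzero c, the near-minimizers of the mse come in whole families of this form, which is consistent with the 1.80 and 0.54 averages being reciprocals.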

Spreadsheet implementation (columns A through AD). Named ranges: L (LRATE), omse, mse, fv (feature vector), nfv (new feature vector), se. Row 21 onward holds the working errors and the new feature vector (nfv); the squared errors (SE) are in a separate block below.

Macro \a (\a=Z):
/rvnfv~fv~{goto}L~{edit}+.005~/XImse<omse ~/xg\a~ ~{goto}se~/rvfv~{end}{down}{down}~ /xg\a~

What each macro step does:
/rvnfv~fv~   "value copy" nfv into fv (the nfv formulas are converted to values).
{goto}L~{edit}+.005~   increments L by .005.
/XImse<omse ~/xg\a~   IF mse is still decreasing, recalculate mse with the new L.
.001~   resets L=.001 for the next round.
/xg\a~   starts over with the next round.
{goto}se~/rvfv~{end}{down}{down}~   "value copy" fv to the output list.

Cell formulas:
A22: +A2-A$10*$U2   /* error for u=a, m=1 */
A30: +A10+$L*(A$22*$U$2+A$24*$U$4+A$26*$U$6+A$29*$U$9)   /* updates f(u=a) */
U29: +U9+$L*(($A29*$A$30+$K29*$K$30+$N29*$N$30+$P29*$P$30)/4)   /* updates f(m=8) */
AB30: +U29   /* copies the f(m=8) feature update into the new feature vector, nfv */
A52: +A22^2   /* squares all the individual errors */
The formulas for three further cells are not preserved; their comments are: /* counts the number of actual ratings (users) for m=1 */, X22: /* adds ratings counts for all 8 movies = training count */, AD30: /* averages se's giving the mse */.

Notes: In 2 rounds mse is as low as Funk gets it in 2000 rounds. After 5 rounds mse is lower than ever before (and appears to be bottoming out). I know I shouldn't hardcode parameters! Experiments should be done to optimize this line search (e.g., with some binary search for a low mse).

Since we have the resulting individual square_errors for each training pair, we could run this, then mask the pairs with se(u,m) > Threshold. Then do it again after masking out those that have already achieved a low se. But what do I do with the two resulting feature vectors? Do I treat it like a two-feature SVD, or do I use some linear combo of the resulting predictions of the two (or it could be more than two)? We need to test out which works best (or other modifications) on Netflix data. Maybe on those test pairs for which the training row and column have some high errors, we apply the second feature vector instead of the first? Maybe we invoke CkNN for test pairs in this case (or use all 3 and a linear combo?). This is powerful! We need to optimize the calculations using pTrees!!!
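The macro's loop (start at L=.001, add .005 while the mse keeps decreasing, then begin the next round from the best result) can be sketched outside the spreadsheet as follows. This is an illustration under assumptions, not a port of the cell formulas: it uses one plain batch gradient step per candidate LRATE for a single-feature model, whereas the spreadsheet averages the movie updates and feeds the updated user features into them. All function names and the toy ratings are hypothetical.

```python
from typing import Dict, List, Tuple

Ratings = Dict[Tuple[int, int], float]  # (user, movie) -> rating


def mse(ratings: Ratings, ufv: List[float], mfv: List[float]) -> float:
    """Mean squared error of the single-feature predictions f(u) * f(m)."""
    return sum((r - ufv[u] * mfv[m]) ** 2 for (u, m), r in ratings.items()) / len(ratings)


def step(ratings: Ratings, ufv: List[float], mfv: List[float], lrate: float):
    """Move both feature vectors a distance lrate along the error-reducing direction.

    The direction is computed from the current (ufv, mfv) only, so varying
    lrate moves along one fixed line, which is what makes this a line search.
    """
    new_u, new_m = list(ufv), list(mfv)
    for (u, m), r in ratings.items():
        err = r - ufv[u] * mfv[m]
        new_u[u] += lrate * err * mfv[m]
        new_m[m] += lrate * err * ufv[u]
    return new_u, new_m


def line_search_round(ratings, ufv, mfv, start=0.001, increment=0.005, max_steps=1000):
    """Fixed-increment search: keep raising lrate while the mse still decreases."""
    lrate = start
    best_u, best_m = step(ratings, ufv, mfv, lrate)
    best_mse, best_lrate = mse(ratings, best_u, best_m), lrate
    for _ in range(max_steps):
        lrate += increment
        cand_u, cand_m = step(ratings, ufv, mfv, lrate)
        cand_mse = mse(ratings, cand_u, cand_m)
        if cand_mse >= best_mse:  # mse stopped decreasing: keep the previous best
            break
        best_u, best_m, best_mse, best_lrate = cand_u, cand_m, cand_mse, lrate
    return best_u, best_m, best_lrate, best_mse


# Toy data: 3 users, 2 movies, 4 known ratings (made up for illustration).
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0, (2, 1): 1.0}
ufv, mfv = [1.0, 1.0, 1.0], [1.0, 1.0]
initial_mse = mse(ratings, ufv, mfv)

history, lrates = [], []
for _ in range(5):
    ufv, mfv, lrate, round_mse = line_search_round(ratings, ufv, mfv)
    history.append(round_mse)
    lrates.append(lrate)
```

Replacing the fixed-increment loop with a bracketing or binary-style search, as the notes suggest, would change only line_search_round; the step and mse functions stay the same.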

[Spreadsheet printout, columns A through BT, with Lrate and MSE columns. Values are not preserved in this transcript.]

A larger example: 20 movies, 51 users (same as last time, except I found errors in my code, which I corrected). The last two red lines are printouts of the two steps in the initial line search (on the way to the first result line at MSE= ). The two vectors should be co-linear (generate the same line) or else I am not doing line search!! They are clearly not co-linear. Thus I have one more code mistake. This is why a C# version is desperately needed!! How is that coming?

Where are we now wrt PSVD?

Clearly line search is a good idea. How good? (Speedup? Accuracy comparisons?)

What about 2nd [3rd?, 4th?, ...] feature vector training? How do we generate those? (Probably just a matter of understanding Funk's code.)

Which "retraining under mask" steps are breakthroughs? Do they improve accuracy markedly? Improve speed markedly?

What speedup shortcuts can we [as mindless engineers ;-)] come up with, including shortcuts for executing Md's PTreeSet Algebra procedures? These speedups can be "mindless" or "magic"; we'll take them anyway! By "mindless" I mean only that trial and error is used to find lucky speedups; unless you can fully understand the mathematics, it's mindless ;-) Maybe Dr. Ubhaya can do the math for us? I will suggest the following: "The more the Mathematics is understood, the better the mindless engineering tricks work!"

In RECOMMENDERs, we have people (users, customers, websearchers...) and things (products, movies, items, documents, webpages or?). We also often have text (product descriptions, movie features, item descriptions, document contents, webpage contents...), which can be handled as entity description columns or by introducing a third entity, terms (content terms, stems of content terms,...). So we have three entities and three relationships in a cyclic 2-hop rolodex structure (or what we called the BUP, "Bi-partite, Uni-partite on Part", structure). A lifetime of fruitful research lurks in this arena.

We can use one relationship to restrict (mask entity instances in) an adjacent relationship. I firmly believe pTree structuring is the way to do this. We can also add a people-to-people relationship (a la Facebook friends) and enrich the information content significantly. We should add tweets to this somehow. Since I don't tweet, I'm probably not the one to suggest how this should fit in, but I will anyway ;-) Tweets (seem to be) mini-documents describing documents, or mini-documents describing people, or possibly even mini-documents describing terms (e.g., if a buzzword becomes hot in the media, people tweet about it????).

Let's call this research arena the VERTICAL RECOMMENDER arena. It's hot! Who's going to be the Master Chef in this Hell's Kitchen?
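As one possible reading of "use one relationship to restrict an adjacent relationship", the sketch below stores each relationship as bitmaps and uses a term-to-movie bitmap to mask the movie-to-user relationship. The data, the names and the uncompressed integer bitmaps are all assumptions made for illustration; real pTrees would be compressed and possibly multilevel.

```python
from typing import Dict, Iterable, List


def bits(indices: Iterable[int]) -> int:
    """Pack a set of entity indices into one integer bitmap."""
    mask = 0
    for i in indices:
        mask |= 1 << i
    return mask


def members(mask: int) -> List[int]:
    """Unpack a bitmap back into a sorted list of entity indices."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


# term -> bitmap over movies whose description contains the term (made-up data).
term_movies: Dict[str, int] = {"space": bits([0, 2]), "comedy": bits([1])}

# movie index -> bitmap over users who rated that movie (made-up data).
movie_users: List[int] = [bits([0, 1]), bits([2]), bits([1, 3])]


def users_reached_by_term(term: str) -> int:
    """Restrict the movie-user relationship to movies selected by the term,
    then OR the surviving user bitmaps together."""
    user_mask = 0
    for movie in members(term_movies[term]):
        user_mask |= movie_users[movie]
    return user_mask


space_users = members(users_reached_by_term("space"))
comedy_users = members(users_reached_by_term("comedy"))
```

A people-to-people relationship such as friendship could be added as one more list of user bitmaps and ANDed against these masks in the same way.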