
EDU = E(S#, C#, SNAME, AGE, CNAME, SITE, GRADE): the enrollment facts, one row per actual enrollment (two grade values were lost in transcription and are shown as ...):

S#  C#  SNAME   AGE  CNAME  SITE  GRADE
17  5   BAID    19   3UA    ND    96
25  6   CLAY    21   3UA    NJ    76
25  7   CLAY    21   CUS    ND    68
32  6   THAISZ  18   3UA    NJ    62
32  7   THAISZ  18   CUS    ND    ...
32  8   THAISZ  18   DSDE   ND    ...
38  6   GOOD    19   3UA    NJ    98

A Data Cube is a table: the Universal Relation, UR, with one row for every (student, course) pair and blanks where there is no enrollment:

S#  C#  SNAME   AGE  CNAME  SITE  GR
17  5   BAID    19   3UA    ND    96
17  6   BAID    19   3UA    NJ
17  7   BAID    19   CUS    ND
17  8   BAID    19   DSDE   ND
25  5   CLAY    21   3UA    ND
25  6   CLAY    21   3UA    NJ    76
25  7   CLAY    21   CUS    ND    68
25  8   CLAY    21   DSDE   ND
32  5   THAISZ  18   3UA    ND
32  6   THAISZ  18   3UA    NJ    62
32  7   THAISZ  18   CUS    ND    ...
32  8   THAISZ  18   DSDE   ND    ...
38  5   GOOD    19   3UA    ND
38  6   GOOD    19   3UA    NJ    98
38  7   GOOD    19   CUS    ND
38  8   GOOD    19   DSDE   ND
57  5   BROWN   20   3UA    ND
57  6   BROWN   20   3UA    NJ
57  7   BROWN   20   CUS    ND
57  8   BROWN   20   DSDE   ND

We can convert this UR to pTrees. For numeric columns, blanks are set to 0, since 0 does not contribute to sums; for averages we need only the nonblank count (metadata). Beyond that, watch out for interval masks (e.g., P_x>a): don't mask zero if blanks are to be excluded.

The same data in Boyce-Codd normal (relational) form:

GRADE = G(S#, C#, GR)    COURSE = C(C#, CNAME, SITE)    STUDENTS = S(S#, SNAME, AGE)
17  5  96                5  3UA   ND                    17  BAID    19
25  6  76                6  3UA   NJ                    25  CLAY    21
25  7  68                7  CUS   ND                    32  THAISZ  18
32  6  62                8  DSDE  ND                    38  GOOD    19
32  7  ...                                              57  BROWN   20
32  8  ...
38  6  98

And in Data Cube form: the grade matrix G (S# rows by C# columns) together with the dimension tables S and C above. IT'S ALL TABLES!

G  S#\C#   5    6    7    8
   17     96
   25          76   68
   32          62   ...  ...
   38          98
   57

For AGE (column 4), the level-0 bit-slice pTrees are P_4,j. For level-1 pTrees use a predicate with a stride, P(gte50%,4)_4,j: one bit per stride of 4 UR rows, i.e., one bit per student. These level-1 UR pTrees are exactly the basic S.AGE pTrees.

What about GRADE (column 7, pTrees P_7,j)? Since 0 is a legitimate grade, use a GR+1 column, so that a stored 0 always means blank. Even so, level-1 pTrees P(gte50%,4) for GR+1 are not as useful: the strides for S# = 17, 38, 57 are dominated by blanks (GR = 0). The level-1 average is still close to the true average (truAvgGR = 82.8 vs. L1AvgGR = 84.5), but the better approach is to build the level-1 pTrees for GR+1 by applying a nonblank mask prior to evaluating the predicate. So the best use is P(gte50%(nbMask),4).

Conclusion: pTreeize the Data Cube as a rotatable table. Create pTrees for both rotations, and include the pTrees of the appropriate entity table with each rotation. Since the level-0 pTrees of the UR hold no information (they are pure strides) and its level-1 pTrees are exactly the entity pTrees above, all useful pTrees are contained in the Data Cube set.
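A minimal runnable sketch of these ideas, assuming a plain list-of-bits representation (real pTrees are compressed, tree-structured bit vectors) and hypothetical helper names. The GRADE column below contains only the values recoverable above (the two lost grades are treated as blanks), so the printed averages illustrate the level-1 vs. true gap but do not reproduce the slide's 82.8 / 84.5:

```python
# Bit-slice pTrees over the UR GRADE column, stride 4 (one stride per student),
# with the nonblank mask applied before the gte50% predicate, as argued above.

def bit_slices(col, nbits):
    """Level-0 pTrees: one 0/1 vector per bit position, high bit first."""
    return [[(v >> b) & 1 for v in col] for b in range(nbits - 1, -1, -1)]

def level1_gte50(slice_bits, mask, stride):
    """Level-1 pTree: 1 iff at least half of the masked bits in a stride are 1."""
    out = []
    for i in range(0, len(slice_bits), stride):
        bits = [b for b, m in zip(slice_bits[i:i+stride], mask[i:i+stride]) if m]
        out.append(1 if bits and 2 * sum(bits) >= len(bits) else 0)
    return out

# UR GRADE column (row-major by S#, then C#; blanks stored as 0) and its nonblank mask.
grade  = [96, 0, 0, 0,  0, 76, 68, 0,  0, 62, 0, 0,  0, 98, 0, 0,  0, 0, 0, 0]
nbmask = [ 1, 0, 0, 0,  0,  1,  1, 0,  0,  1, 0, 0,  0,  1, 0, 0,  0, 0, 0, 0]

slices = bit_slices(grade, 7)                          # P_7,6 ... P_7,0
lev1 = [level1_gte50(s, nbmask, 4) for s in slices]    # P(gte50%(nbMask),4)_7,j

# Approximate average from the level-1 pTrees vs. the true nonblank average.
nonblank_strides = sum(1 for i in range(0, len(nbmask), 4) if any(nbmask[i:i+4]))
l1avg = sum(sum(t) << (6 - j) for j, t in enumerate(lev1)) / nonblank_strides
truavg = sum(g for g, m in zip(grade, nbmask) if m) / sum(nbmask)
print(l1avg, truavg)   # 83.0 80.0 on this partial data
```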

3D Social Media Communications Analytics (prediction and anomaly detection for emails, tweets, phone calls, SMS texting, etc.), distinguishing Senders and Receivers; i.e., replacing the 2-dimensional DU (Document User) matrix with a 3-dimensional DSR (Document Sender Receiver) matrix. Email is the emphasized case, but the same analytics and data structures apply to phone records, tweets and SMS texts, all of which also distinguish senders from receivers.

We use the 3-dimensional DSR matrix together with the 2-dimensional TD (Term Document) and UT (User Term) matrixes. [Figure: the DSR cube alongside TD and UT, with their feature-vector factors f_D, f_T, f_U, f_S, f_R.]

The pSVD trick is to replace these massive relationship matrixes with small feature matrixes: replace UT with f_U and f_T feature matrixes (2 features), replace TD with f_T and f_D, and replace DSR with f_D, f_S, f_R. Using just one feature, replace the matrixes with vectors: f = (f_D, f_T, f_U, f_S, f_R) or f = (f_D, f_T, f_U). Use gradient descent plus line search to minimize the sum of squared errors, sse, where sse is summed over all nonblanks in TD, UT and DSR.

Should we train the User feature segments separately, keeping f_S and f_R distinct from f_U (train f_U with UT only, and train f_S and f_R with DSR only)? This will be called "3D f". Or train the User feature segment just once, with both UT and DSR, letting f_S = f_R = f_U so that f = (f_T, f_D, f_U)? This will be called "3DTU f".

We do the pTree conversions and train f in the CLOUD; then we download the resulting f to users' personal devices for predictions and anomaly detection. The same setup should work for phone-record Documents, tweet Documents (in the US Library of Congress), text Documents, etc.
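The objective can be made concrete with a small sketch. This is a minimal single-feature version (one number per document, term, user, sender and receiver, matching the slide's vector case f = (f_D, f_T, f_U, f_S, f_R)), not the group's pTree-based implementation; blanks are NaN and are skipped, so sse sums over nonblank cells only:

```python
import numpy as np

def sse(fD, fT, fU, fS, fR, TD, UT, DSR):
    """Sum of squared errors over the nonblank cells of TD, UT and DSR."""
    err = 0.0
    for t, d in zip(*np.where(~np.isnan(TD))):      # nonblank Term x Doc cells
        err += (fT[t] * fD[d] - TD[t, d]) ** 2
    for u, t in zip(*np.where(~np.isnan(UT))):      # nonblank User x Term cells
        err += (fU[u] * fT[t] - UT[u, t]) ** 2
    for d, s, r in zip(*np.where(~np.isnan(DSR))):  # nonblank Doc x Sender x Receiver cells
        err += (fD[d] * fS[s] * fR[r] - DSR[d, s, r]) ** 2
    return err
```

The 3DTU variant is the same function called with fS = fR = fU.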

3DTU: Structure each relationship as a rotatable matrix, then create PTreeSets for each rotation (attaching each entity table's PTreeSet to its rotation).

Always treat an entity as an attribute of another entity if possible, rather than adding it as a new dimension of a matrix. E.g., treat Sender as a Document attribute instead of as the 3rd dimension of the matrix DSR. The reason: Sender is a candidate key for Document (every document has exactly one sender), while Receiver is not. (A problem to solve: a mechanism for SVD prediction of Sender?)

[Figure: the DT/TD, UT/TU and DR/RD rotations of the toy data as level-0 bit-slice pTrees, pDT_Ti,j, pTD_Di,j, pUT_Ti,j, pTU_Ui,j, pDR_Ri, pRD_Di, each with its blank mask; blank masks are provided only where the rotation actually has blanks. pTrees are also provided for the Document attributes Sender, SendTime (D.ST) and Length (D.LN).]

Next: create the scalar trees as well (sDT_T1, sDT_T2 = 5, sDT_T3 = 3, sTD_D1, sTD_D2, sTU_U1, sTU_U2, sUT_T1, ...)? Next: train feature vectors from these pTrees?
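A minimal sketch of "pTreeize both rotations", under the assumption that a PTreeSet is simply, per column, a blank mask plus one bit vector per bit position (hypothetical names; real pTrees are compressed):

```python
def ptreeset(matrix, nbits):
    """One blank mask and nbits bit slices per column of the matrix."""
    cols = {}
    for j in range(len(matrix[0])):
        col = [row[j] for row in matrix]
        mask = [0 if v is None else 1 for v in col]
        slices = [[((v or 0) >> b) & 1 for v in col]
                  for b in range(nbits - 1, -1, -1)]
        cols[j] = {"mask": mask, "slices": slices}
    return cols

DT = [[1, None, 3], [4, 5, None]]            # toy Doc x Term counts, None = blank
TD = [list(row) for row in zip(*DT)]         # the rotation
pDT, pTD = ptreeset(DT, 3), ptreeset(TD, 3)  # one PTreeSet per rotation
```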

Gradient 3D f: f = (f_T, f_D, f_U, f_S, f_R), with gradient descent to minimize sse taken over the 2D matrixes only: DT, UT, DS, DR [em6].

[Five rounds of a numeric walkthrough followed. Each round tabulates f, the line-search step t, the segment components (T1, T2, T3, D1, D2, U1, U2, S1, S2, R1, R2), the sse, and, for each of DT, UT, DS and DR, the error matrix e and the products e*f_D, e*f_T (resp. e*f_U, e*f_T; e*f_S, e*f_D; e*f_R, e*f_D) that assemble the gradient. The recoverable step sizes are t = 0.09 and t = 0.04 in the middle rounds; the other numeric entries were lost in transcription.]
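A minimal sketch of one round of this scheme, assuming the single-feature case, NaN blanks and hypothetical names. Each 2D matrix contributes to the gradient of its row and column segments exactly as in the walkthrough's e*f products; the slides fold the step's sign into the line-searched t, here it is written as an explicit descent step:

```python
import numpy as np

def grad_step(f, mats, t):
    """f: {entity: float feature vector}; mats: {(row_entity, col_entity): matrix}."""
    g = {k: np.zeros_like(v) for k, v in f.items()}
    for (a, b), M in mats.items():
        e = np.nan_to_num(np.outer(f[a], f[b]) - M)  # errors, 0 at blanks
        g[a] += 2 * e @ f[b]                         # d(sse)/d f[a]
        g[b] += 2 * e.T @ f[a]                       # d(sse)/d f[b]
    return {k: f[k] - t * g[k] for k in f}           # one descent round

# Layout for this slide's variant: sse over the 2D matrixes only, e.g.
# mats = {("D","T"): DT, ("U","T"): UT, ("D","S"): DS, ("D","R"): DR}
```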

Here we try a comprehensive comparison of the 3 alternatives: 3D (one DSR matrix), 2D (separate DS and DR matrixes), and DTU (2D, with S = R = U) [em9, em10].

[The walkthrough runs the same toy data (DT and UT matrixes with entries around 5 for u1 and u2, and a DSR cube over d1, d2 / s1, s2 / r1, r2 with 1s marking sends) through several rounds of each method, tabulating sseDTU, sse2D and sse3D together with their line-search steps tDSU, t2D and t3D. Recoverable values include tDSU = 1.1 and 1.12 in opening rounds, later steps of 0.1, 0.09, 0.06 and -0.02, and one sse3D value of 89; the remaining numeric entries were lost in transcription.]

DTU f: training the User feature segment just once makes sense, f = (f_T, f_D, f_U), assuming DSR(d,u,u) = 0 always (no one sends to self). [Figure: the DSR cube with TD and UT and the feature vectors f_D, f_T, f_U, f_S, f_R.]

The predictions are p_DSR = d·s·r, p_TD = t·d, p_UT = u·t, where d = f_D(d), s = f_S(s), r = f_R(r), t = f_T(t), u = f_U(u).

sse = Σ_{nonblank DSR} (dsr − DSR_dsr)² + Σ_{nonblank TD} (td − TD_td)² + Σ_{nonblank UT} (ut − UT_ut)²

∂sse/∂d = 2[ Σ_{(s,r) ∈ Supp_DSR(d)} s·r·(dsr − DSR_dsr) + Σ_{t ∈ Supp_TD(d)} t·(td − TD_td) ]

∂sse/∂t = 2[ Σ_{d ∈ Supp_TD(t)} d·(td − TD_td) + Σ_{u ∈ Supp_UT(t)} u·(ut − UT_ut) ]

∂sse/∂u = 2[ Σ_{(d,r) ∈ Supp_DSR(s=u)} d·r·(dur − DSR_dur) + Σ_{(d,s) ∈ Supp_DSR(r=u)} d·s·(dsu − DSR_dsu) + Σ_{t ∈ Supp_UT(u)} t·(ut − UT_ut) ]
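A minimal sketch of these partials, assuming NaN blanks so that each loop runs exactly over the support (the nonblank cells) of its matrix; S = R = U is realized by using f_U in both the sender and receiver slots of DSR:

```python
import numpy as np

def gradients(fT, fD, fU, TD, UT, DSR):
    """d(sse)/d(fT), d(sse)/d(fD), d(sse)/d(fU) for the 3DTU model."""
    gT, gD, gU = np.zeros_like(fT), np.zeros_like(fD), np.zeros_like(fU)
    for t, d in zip(*np.where(~np.isnan(TD))):
        e = fT[t] * fD[d] - TD[t, d]
        gT[t] += 2 * fD[d] * e
        gD[d] += 2 * fT[t] * e
    for u, t in zip(*np.where(~np.isnan(UT))):
        e = fU[u] * fT[t] - UT[u, t]
        gU[u] += 2 * fT[t] * e
        gT[t] += 2 * fU[u] * e
    for d, s, r in zip(*np.where(~np.isnan(DSR))):
        e = fD[d] * fU[s] * fU[r] - DSR[d, s, r]  # s = fU(s), r = fU(r)
        gD[d] += 2 * fU[s] * fU[r] * e
        gU[s] += 2 * fD[d] * fU[r] * e            # sender occurrences of u
        gU[r] += 2 * fD[d] * fU[s] * e            # receiver occurrences of u
    return gT, gD, gU
```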

3DTU: f = (f_T, f_D, f_U), using gradient descent to minimize sse taken over DT, UT and DSR (equating S = R = U).

In this tiny example we walk through the training process when S = R = U. There are 2 documents, 3 terms and 2 users, so f = (fT1, fT2, fT3, fD1, fD2, fU1, fU2). The nonblanks are DT(doc1, term1) = 1, DT(doc1, term3) = 3, DT(doc2, term1) = 4, DT(doc2, term2) = 5, a small UT matrix over user1 and user2, and a DSR cube with a 1 wherever a document was sent (and DSR(d,u,u) = 0: no one sends to self).

[Four rounds of the walkthrough followed. Each round tabulates f, the step (recoverable values include grad t = 0.02 and 0), the error matrixes for DT, UT and the DSR slices D(US)(UR1) and D(US)(UR2), the products e*f_D, e*f_T, e*f_U that assemble the gradient g of sse/2, and the updated f + g·t. The numeric entries were lost in transcription.]

For this data, f does not train up well (it does not come to represent the matrixes) when equating S = R = U. The data is random, i.e., the S and R portions are not necessarily reflective of U. In real data they may be more so, and the training may then be more successful.

3D f: f = (f_T, f_D, f_U, f_S, f_R), with gradient descent to minimize sse taken over DT, UT and DSR, not equating S = R = U. The training is much more successful!

The line-search objective sse(f + t·G) is a polynomial of degree 6 in t, with a derivative of degree 5. Since it is known that there is no closed-form quintic formula (no formula for the roots of a general degree-5 polynomial), we find the t that minimizes sse by line search.

[Five rounds of the walkthrough followed, computing f1 = f·t1 and then f_{k+1} = f_k + t_{k+1}·G. Each round tabulates the segment components (T1, T2, T3, D1, D2, U1, U2, S1, S2, R1, R2), the sse, the error matrixes for DT, UT and the DSR slices DSR1 and DSR2, the products e*f_D, e*f_T, e*f_U, e*f_S·f_R, e*f_D·f_R and e*f_D·f_S, and the gradient of sse/2. The numeric entries were lost in transcription.]
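Since the derivative is quintic, a simple numeric line search suffices. A minimal sketch, assuming f and the direction g are flat numpy vectors and sse_fn is any callable like the sse sketch earlier; a grid scan is the simplest choice (one could instead hand the quintic's coefficients to a numeric root finder such as np.roots, since "no closed form" does not rule out numeric roots):

```python
import numpy as np

def line_search(sse_fn, f, g, t_max=2.0, steps=801):
    """Return the t in [-t_max, t_max] minimizing sse(f + t*g) on a grid."""
    ts = np.linspace(-t_max, t_max, steps)
    return ts[int(np.argmin([sse_fn(f + t * g) for t in ts]))]
```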

Using just DT, train f = (f_T, f_D) by gradient descent minimizing sse over DT, but this time with a vector of t values rather than a single t: t' = (t_T1, t_T2, t_T3, t_D1, t_D2). After many ordinary rounds we optimize the t_i's one at a time, according to a sequencing of the nonblanks. (This approach still needs to be formulated mathematically.) In this simple example we are able to zero out all squared errors: first e(T1,D1), then e(T1,D2), then e(T2,D2), then e(T3,D1).

[26 rounds of standard gradient descent + line search came first; recoverable steps include t = 1.8, 0.12, 0.38, 0.08 and 0.69. The per-round tables were lost in transcription.] Then, one cell at a time: t' = t_T1 zeroes e(T1,D1) only (note that sse shoots up here, but be patient!); next t' = t_D2 zeroes e(T1,D2); next t' zeroes e(T2,D2); next t' zeroes e(T3,D1).

We zero out all error using t' = (4.2, 0.074, 1.183, 1, 0.86); the result f = (1.088, 1.361, 3.258, 0.920, 3.672) is a lossless representation of DT. Do we need the initial 26 rounds at all? No! (Next slide.) Comparing the f reached after the 26 gd+ls rounds with the f reached by the t'-vector method suggests that we would probably have reached zero sse eventually with more gd+ls rounds, since both paths seem to be heading toward the same vector.
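A minimal sketch of one such zeroing step, with hypothetical names: move a single coordinate its own distance t'_i along the gradient direction G so that one nonblank cell's error becomes exactly zero. The sequencing matters: a later step must never tune a coordinate of an already-zeroed cell, and the G component divided by must be nonzero:

```python
def zero_cell(fT, fD, G_T, G_D, TD, t, d, which):
    """Zero the error at nonblank cell (t, d) by tuning one coordinate along G."""
    if which == 'T':                               # tune fT[t]
        tp = (TD[t][d] / fD[d] - fT[t]) / G_T[t]
        fT[t] += tp * G_T[t]                       # now fT[t]*fD[d] == TD[t][d]
    else:                                          # tune fD[d]
        tp = (TD[t][d] / fT[t] - fD[d]) / G_D[d]
        fD[d] += tp * G_D[d]
    return tp                                      # this cell's t'_i
```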

Using just TD, train f = (f_T, f_D) using a vector of t' values right away (not after 26 standard gd+ls rounds). We optimize the t_i's one at a time according to a sequencing of the nonblanks, and we are able to zero out all squared errors, one round per nonblank cell: zero the squared error at e(D1,T1), then at e(D2,T1), then at e(D2,T2), then at e(D2,T3). [The per-round tables of G, t', sse and the error products were lost in transcription.]

There seems to be something fishy here. We use the same gradient in every round, so we aren't really doing gradient descent; we could start with any vector instead of G, then just tune one t_i at a time to drive that cell's error to zero (we don't even need to square the errors). This requires a round for every nonblank cell, which looks fine when the data is toy-small, but what about Netflix-sized data? Whenever a sequencing of the nonblank cells exists such that the i-th cell can be zeroed by the right choice of t_i, we can find an f with sse = 0. Netflix is mostly blanks (98%), so it may be possible.

It seems productive to explore running standard gradient descent until it converges and then introducing this t'-vectorized method to further reduce only the high-error individual cells. Another thought: for a particular difficult prediction, delete all but the "pertinent" cells, so that it IS possible to find a t' that zeroes out the sse.
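A minimal sketch of exactly that simplification, using the toy DT of the previous slides (nonblanks (T1,D1)=1, (T1,D2)=4, (T2,D2)=5, (T3,D1)=3): start from any f, visit the nonblank cells in a suitable sequencing, and solve one free coordinate per cell:

```python
import numpy as np

def tune_to_zero(fT, fD, TD, order):
    """order: (t, d, which) triples; 'T' solves for fT[t], 'D' for fD[d]."""
    for t, d, which in order:
        if which == 'T':
            fT[t] = TD[t, d] / fD[d]   # makes fT[t]*fD[d] == TD[t,d] exactly
        else:
            fD[d] = TD[t, d] / fT[t]
    return fT, fD

TD = np.array([[1., 4.], [np.nan, 5.], [3., np.nan]])  # terms x docs
fT, fD = np.ones(3), np.ones(2)
tune_to_zero(fT, fD, TD, [(0, 0, 'T'), (0, 1, 'D'), (1, 1, 'T'), (2, 0, 'T')])
print(np.nan_to_num(np.outer(fT, fD) - TD))            # all zeros: sse == 0
```

Each tuned coordinate here touches no earlier-zeroed cell, which is the sequencing condition described above.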

[Figure: the DT/TD, UT/TU and DR/RD rotations as bit-slice pTrees with blank masks, plus pTrees for the Document attributes Sender, Time and Length.]

f = (f_D, f_T, f_U, f_S, f_R), with Sender treated as a Document attribute D.S:

sse(f) = Σ_{dr ∈ DR} (f_d·f_r − DR_dr)² + Σ_{td ∈ TD} (f_t·f_d − TD_td)² + Σ_{ut ∈ UT} (f_u·f_t − UT_ut)² + Σ_{s ∈ D.S} (f_s − D_s)²

sse(f + xG) = Σ_{dr ∈ DR} ((f_d + xG_d)(f_r + xG_r) − DR_dr)² + Σ_{td ∈ TD} ((f_t + xG_t)(f_d + xG_d) − TD_td)² + Σ_{ut ∈ UT} ((f_u + xG_u)(f_t + xG_t) − UT_ut)² + Σ_{s ∈ D.S} (f_s + xG_s − D_s)²

G = (∂sse/∂f_d, ∂sse/∂f_t, ∂sse/∂f_u, ∂sse/∂f_s, ∂sse/∂f_r)
  = 2·( Σ_{dr ∈ DR} f_r·e_dr + Σ_{dt ∈ DT} f_t·e_dt ;  Σ_{td ∈ TD} f_d·e_td + Σ_{ut ∈ UT} f_u·e_ut ;  Σ_{ut ∈ UT} f_t·e_ut ;  Σ_{s ∈ D.S} e_s ;  Σ_{rd ∈ RD} f_d·e_rd )

But what is f_S here? There is no such thing! An alternative possibility is to keep DR and a separate DS matrix. Next slide.
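Rendered literally as code, this is a sketch under the slide's own questionable assumption: fS is compared directly with the stored D.S attribute, which is exactly the objection raised above (hypothetical names, NaN blanks):

```python
import numpy as np

def sse_attr(fD, fT, fU, fS, fR, DR, TD, UT, DS_attr):
    """sse with Sender as a per-document attribute rather than a matrix factor."""
    err  = np.nansum((np.outer(fD, fR) - DR) ** 2)  # Doc x Receiver
    err += np.nansum((np.outer(fT, fD) - TD) ** 2)  # Term x Doc
    err += np.nansum((np.outer(fU, fT) - UT) ** 2)  # User x Term
    err += np.nansum((fS - DS_attr) ** 2)           # per-doc Sender "attribute" term
    return err
```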

[Figure: the same rotations as the previous slide, now with a separate DS (Document Sender) matrix alongside DR, with its pTrees pDS_Si and pSD_Si next to pRD_Ri, plus Document attribute pTrees for Time and Length.]

f = (f_D, f_T, f_U, f_S, f_R)

sse(f) = Σ_{dr ∈ DR} (f_d·f_r − DR_dr)² + Σ_{td ∈ TD} (f_t·f_d − TD_td)² + Σ_{ut ∈ UT} (f_u·f_t − UT_ut)² + Σ_{ds ∈ DS} (f_d·f_s − DS_ds)²

G = (∂sse/∂f_d, ∂sse/∂f_t, ∂sse/∂f_u, ∂sse/∂f_s, ∂sse/∂f_r)
  = 2·( Σ_{dr ∈ DR} f_r·e_dr + Σ_{dt ∈ DT} f_t·e_dt + Σ_{ds ∈ DS} f_s·e_ds ;  Σ_{td ∈ TD} f_d·e_td + Σ_{tu ∈ TU} f_u·e_tu ;  Σ_{ut ∈ UT} f_t·e_ut ;  Σ_{sd ∈ SD} f_d·e_sd ;  Σ_{rd ∈ RD} f_d·e_rd )
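A minimal sketch of this G, with NaN blanks zeroed out of the error matrixes; note that the d-component also picks up the DS term, which follows from differentiating the sse above (hypothetical names):

```python
import numpy as np

def grad(fD, fT, fU, fS, fR, DR, TD, UT, DS):
    """G = (gD, gT, gU, gS, gR) for sse over DR, TD, UT and the separate DS."""
    eDR = np.nan_to_num(np.outer(fD, fR) - DR)  # Doc x Receiver errors
    eTD = np.nan_to_num(np.outer(fT, fD) - TD)  # Term x Doc errors
    eUT = np.nan_to_num(np.outer(fU, fT) - UT)  # User x Term errors
    eDS = np.nan_to_num(np.outer(fD, fS) - DS)  # Doc x Sender errors
    gD = 2 * (eDR @ fR + eTD.T @ fT + eDS @ fS)
    gT = 2 * (eTD @ fD + eUT.T @ fU)
    gU = 2 * (eUT @ fT)
    gS = 2 * (eDS.T @ fD)
    gR = 2 * (eDR.T @ fD)
    return gD, gT, gU, gS, gR
```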