Rocchio's Algorithm
Motivation
Naïve Bayes is unusual as a learner:
– Only one pass through the data
– Order of the examples doesn't matter
Rocchio's algorithm
Rocchio, "Relevance Feedback in Information Retrieval", in The SMART Retrieval System: Experiments in Automatic Document Processing, Prentice-Hall, 1971.
Rocchio's algorithm
Many variants of these formulae exist
…as long as u(w, d) = 0 for words not in d!
Store only the non-zeros in u(d), so its size is O(|d|)
But the size of u(y) is O(|V|)
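The formulae themselves were slide images and did not survive the capture; for reference, the standard Rocchio prototype in TF-IDF space is usually written as below. Treat the exact α, β weighting as one of the "many variants" the slide mentions (the later slides use only the positive α term, i.e. β = 0):

```latex
v(y) \;=\; \alpha \, \frac{1}{|C_y|} \sum_{d \in C_y} v(d)
      \;-\; \beta \, \frac{1}{|D \setminus C_y|} \sum_{d \in D \setminus C_y} v(d)
```

Here C_y is the set of training documents with label y, D is the whole corpus, and v(d) is the (normalized) TF-IDF vector of document d.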
Rocchio's algorithm
Given a table mapping w to DF(w), we can compute v(d) from the words in d… and the rest of the learning algorithm is just adding.
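As a concrete sketch of this step (plain Python with hypothetical names; the slide itself gives no code): given the DF table and the corpus size, building v(d) is a single pass over the words of d, storing only the non-zero entries.

```python
import math
from collections import Counter

def doc_vector(words, df, n_docs):
    """TF-IDF vector for one document, given a table mapping w -> DF(w).

    Only non-zero entries are stored, so the size is O(|d|).
    """
    tf = Counter(words)
    v = {w: tf[w] * math.log(n_docs / df[w]) for w in tf if w in df}
    # length-normalize so documents of different sizes are comparable
    norm = math.sqrt(sum(x * x for x in v.values()))
    return {w: x / norm for w, x in v.items()} if norm > 0 else v
```
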
Rocchio v Bayes
[Figure: the Naïve Bayes pipeline, streaming a table of training rows (id, label, words) into event counts such as C[X=w1,1 ∧ Y=sports] = 5245, C[X=w2,1 ∧ Y=…] = 1054, ….]
Recall the Naïve Bayes process? Imagine a similar process, but for labeled documents…
Rocchio…
[Figure: the same training rows, now streamed into Rocchio's DF counts (e.g. DF(aardvark) = 12, DF(agent) = 1054) and per-word weights v(w1,1, id1), v(w1,2, id1), … for each document.]
Rocchio…
[Figure: the DF counts are then used to turn each training row into a single sparse document vector – v(id1), v(id2), ….]
Rocchio…
v(w1,1 w1,2 w1,3 …. w1,k1) is the document vector for id1; v(w2,1 w2,2 w2,3 ….) = v(w2,1, d), v(w2,2, d), …
For each (y, v), go through the non-zero values in v – one for each w in the document d – and increment a counter for that dimension of v(y):
– Message: increment v(y1)'s weight for w1,1 by α·v(w1,1, d)/|Cy|
– Message: increment v(y1)'s weight for w1,2 by α·v(w1,2, d)/|Cy|
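The per-message updates above can be sketched as follows (a hypothetical Python implementation, not from the slides; accumulating first and dividing by |C_y| at the end is equivalent to scaling each message by α/|C_y|):

```python
from collections import defaultdict

def train_rocchio(labeled_vectors, alpha=1.0):
    """Add up document vectors to get one prototype v(y) per class.

    labeled_vectors: iterable of (label, sparse_vector) pairs, where a
    sparse vector is a dict mapping word -> v(w, d).
    """
    centroids = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for y, v in labeled_vectors:
        counts[y] += 1
        for w, x in v.items():
            centroids[y][w] += x          # one "message" per non-zero w
    # divide by |C_y| once at the end instead of per message: same result
    return {y: {w: alpha * x / counts[y] for w, x in vec.items()}
            for y, vec in centroids.items()}
```
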
Rocchio at Test Time
[Figure: training produces per-class weights v(y, w) (e.g. v(y1, w) = 0.0012, v(y1, w) = 0.013, v(y2, w) = …); at test time each document's words are joined against these weights, yielding the terms needed to score each class for that document.]
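At test time, scoring a document is just a sparse inner product of its vector against each class prototype v(y). A minimal sketch (hypothetical names, assuming the dict-based vectors used above):

```python
def classify(v_doc, centroids):
    """Predict the class whose prototype v(y) has the largest inner
    product with the test document's sparse vector."""
    def dot(u, v):
        # iterate over the smaller dict for sparse efficiency
        small, big = (u, v) if len(u) <= len(v) else (v, u)
        return sum(x * big.get(w, 0.0) for w, x in small.items())
    return max(centroids, key=lambda y: dot(v_doc, centroids[y]))
```
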
Rocchio Summary
– Compute DF: one scan through the docs; time O(n), n = corpus size (like NB event counts)
– Compute v(idi) for each document: output size O(n); time O(n), one scan if the DFs fit in memory (like the first part of the NB test procedure otherwise)
– Add up vectors to get v(y): time O(n), one scan if the v(y)'s fit in memory (like NB training otherwise)
– Classification ≈ disk NB
Rocchio results…
Joachims '98, "A Probabilistic Analysis of the Rocchio Algorithm…"
Variant TF and IDF formulas; Rocchio's method (with linear TF)
Rocchio results…
Schapire, Singer & Singhal, "Boosting and Rocchio Applied to Text Filtering", SIGIR '98
Reuters-21578 – all classes (not just the frequent ones)
A hidden agenda
Part of machine learning is a good grasp of theory; part of ML is a good grasp of which hacks tend to work. These are not always the same
– Especially in big-data situations
Catalog of useful tricks so far:
– Brute-force estimation of a joint distribution
– Naïve Bayes
– Stream-and-sort, request-and-answer patterns
– BLRT and KL-divergence (and when to use them)
– TF-IDF weighting – especially IDF, which is often useful even when we don't understand why
One more Rocchio observation
Rennie et al., ICML 2003, "Tackling the Poor Assumptions of Naïve Bayes Text Classifiers"
NB + a cascade of hacks
One more Rocchio observation
Rennie et al., ICML 2003, "Tackling the Poor Assumptions of Naïve Bayes Text Classifiers"
"In tests, we found the length normalization to be most useful, followed by the log transform… these transforms were also applied to the input of SVM."
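A sketch of the two transforms the quote singles out, applied to raw term counts (hypothetical function name; Rennie et al. combine these with further smoothing not shown here):

```python
import math

def tfidf_rennie(tf, df, n_docs):
    """Log TF transform + IDF weighting + length normalization.

    tf: dict mapping word -> raw count in the document.
    """
    v = {w: math.log(1 + c) * math.log(n_docs / df[w])
         for w, c in tf.items() if w in df}
    # length normalization: scale to a unit vector
    norm = math.sqrt(sum(x * x for x in v.values())) or 1.0
    return {w: x / norm for w, x in v.items()}
```
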
One? more Rocchio observation
[Diagram: split the documents/labels into subsets 1–3, compute partial DF counts for each subset, then sort and add the counts to get the full DFs.]
One?? more Rocchio observation
[Diagram: split the documents/labels into subsets 1–3, compute partial v(y)'s (v-1, v-2, v-3) for each subset using the DFs, then sort and add the vectors to get the full v(y)'s.]
O(1) more Rocchio observation
[Diagram: as before, partial v(y)'s are computed per subset and merged – but every subset's worker needs the DF table.]
We have shared access to the DFs, but only shared read access – we don't need shared write access. So we only need to copy the DF information across to the different processes.
Abstract Implementation: TFIDF (1/2)
data = pairs (docid, term) where term is a word that appears in the document with id docid
operators: DISTINCT, MAP, JOIN, GROUP BY … [RETAINING …] REDUCING TO (a reduce step)

docFreq = DISTINCT data | GROUP BY λ(docid,term): term REDUCING TO count
  /* (term, df) */
docIds = MAP data BY λ(docid,term): docid | DISTINCT
numDocs = GROUP docIds BY λ docid: 1 REDUCING TO count
  /* (1, numDocs) */
dataPlusDF = JOIN data BY λ(docid,term): term, docFreq BY λ(term,df): term
  | MAP λ((docid,term),(term,df)): (docid, term, df)
  /* (docid, term, document-freq) */
unnormalizedDocVecs = JOIN dataPlusDF BY λ row: 1, numDocs BY λ row: 1
  | MAP λ((docid,term,df),(dummy,numDocs)): (docid, term, log(numDocs/df))
  /* (docid, term, weight-before-normalizing) : u */
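The same dataflow can be written directly in plain Python (a hypothetical sketch: DISTINCT becomes set(), GROUP BY … REDUCING TO count becomes a Counter, and the JOINs become dictionary lookups):

```python
import math
from collections import Counter

def unnormalized_doc_vecs(data):
    """data: list of (docid, term) pairs, as on the slide.

    Returns (docid, term, weight-before-normalizing) triples.
    """
    doc_freq = Counter(term for docid, term in set(data))   # (term, df)
    num_docs = len({docid for docid, term in data})         # (1, numDocs)
    return [(docid, term, math.log(num_docs / doc_freq[term]))
            for docid, term in data]
```
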
Abstract Implementation: TFIDF (2/2)
normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w): docid
  RETAINING λ(docid,term,w): w²
  REDUCING TO sum
  /* (docid, sum-of-square-weights) */
docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w): docid, normalizers BY λ(docid,norm): docid
  | MAP λ((docid,term,w),(docid,norm)): (docid, term, w/sqrt(norm))
  /* (docid, term, weight) */
[Example: for d1234, unnormalized weights such as (d1234, found, 1.542) and (d1234, aardvark, 13.23) are joined with its normalizer 37.234.]
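And the normalization step, again as a plain-Python sketch with hypothetical names: group by docid reducing to the sum of squared weights, then join back and divide by the square root.

```python
import math
from collections import defaultdict

def normalize_doc_vecs(unnormalized):
    """unnormalized: (docid, term, weight) triples; returns the same
    triples rescaled so each document's vector has unit length."""
    norms = defaultdict(float)
    for docid, term, w in unnormalized:
        norms[docid] += w * w          # (docid, sum-of-square-weights)
    return [(docid, term, w / math.sqrt(norms[docid]))
            for docid, term, w in unnormalized if norms[docid] > 0]
```
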
GuineaPig: demo
Pure Python (< 1500 lines)
Streams Python data structures
– strings, numbers, tuples (a,b), lists [a,b,c]
– no records: operations are defined functionally
Compiles to a Hadoop streaming pipeline
– optimizes sequences of MAPs
Runs locally without Hadoop
– compiles to a stream-and-sort pipeline
– intermediate results can be viewed
Can easily run parts of a pipeline
http://curtis.ml.cmu.edu/w/courses/index.php/Guinea_Pig
Actual Implementation
Full Implementation