SLIDE 1: IS 202 – FALL 2004
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall SIMS 202: Information Organization and Retrieval
Math Tutorial
SLIDE 2: Summation
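SLIDE 3: Program of that summation

public class Sumup {
    public static void main(String[] args) {
        int n = 10;
        int s = 0;
        int i = 0;
        // Add up i for i = 0 .. n-1
        while (i <= (n - 1)) {
            s = s + i;
            i = i + 1;
        }
        System.out.println(s); // prints 45
    }
}

or…

public class Sumup2 {
    public static void main(String[] args) {
        int n = 10;
        int s = 0;
        int i;
        // Same sum, written as a for loop
        for (i = 0; i <= n - 1; i++)
            s = s + i;
        System.out.println(s); // prints 45
    }
}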
SLIDE 4
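SLIDE 5

public class multup {
    public static void main(String[] args) {
        int n = 10;
        int s = 0;
        int i = 1;
        int a[] = {0,1,2,3,4,5,6,7,8,9,10,11};
        // Add up a[i] * a[i+1] for i = 1 .. n
        while (i <= n) {
            s = s + (a[i] * a[i+1]);
            i = i + 1;
        }
        System.out.println(s); // prints 440 for this array
    }
}

or…

public class multup2 {
    public static void main(String[] args) {
        int n = 10;
        int s = 0;
        int i;
        int a[] = {0,1,2,3,4,5,6,7,8,9,10,11};
        // Same sum, written as a for loop
        for (i = 1; i <= n; i++)
            s = s + (a[i] * a[i+1]);
        System.out.println(s); // prints 440 for this array
    }
}

The value of S depends on the values in the array “a”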
SLIDE 6: Simple tf*idf
SLIDE 7: Inverse Document Frequency
IDF provides high values for rare words and low values for common words
For a collection of documents (N = 10000)
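To make the rare-versus-common behavior concrete, here is a minimal sketch. It assumes the common definition idf = log10(N / n), where n is the number of documents containing the term; the base-10 logarithm, the class and method names, and the sample document frequencies are all assumptions for illustration, not taken from the slide.

```java
// Sketch only: assumes idf = log10(N / n); the log base is an assumption.
class IdfSketch {
    /** Inverse document frequency of a term that occurs in docFreq of totalDocs documents. */
    static double idf(int totalDocs, int docFreq) {
        return Math.log10((double) totalDocs / docFreq);
    }

    /** Simple (unnormalized) tf*idf weight: term frequency times idf. */
    static double tfIdf(int termFreq, int totalDocs, int docFreq) {
        return termFreq * idf(totalDocs, docFreq);
    }

    public static void main(String[] args) {
        int totalDocs = 10000;                  // N from the slide
        int[] docFreqs = {1, 20, 5000, 10000};  // hypothetical document frequencies
        for (int docFreq : docFreqs) {
            System.out.printf("docFreq = %5d  idf = %.3f%n", docFreq, idf(totalDocs, docFreq));
        }
    }
}
```

Under this assumption a term found in a single document gets the highest idf, and a term found in every document gets an idf of zero.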
SLIDE 8: Similarity Measures
Simple matching (coordination level match)
Dice’s Coefficient
Jaccard’s Coefficient
Cosine Coefficient
Overlap Coefficient
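The five measures can be sketched for the simplest case, where a query and a document are each just a set of terms. The set-based formulas below are the usual textbook definitions and are an assumption here, since the slide's own formulas are not reproduced; the class name, method names, and sample terms are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Sketch only: set-based (binary term) versions of the five coefficients.
// Both sets are assumed to be non-empty.
class CoefficientSketch {
    /** Number of terms the two sets share. */
    static int common(Set<String> x, Set<String> y) {
        Set<String> shared = new HashSet<>(x);
        shared.retainAll(y);
        return shared.size();
    }

    static double simpleMatching(Set<String> x, Set<String> y) {
        return common(x, y);
    }

    static double dice(Set<String> x, Set<String> y) {
        return 2.0 * common(x, y) / (x.size() + y.size());
    }

    static double jaccard(Set<String> x, Set<String> y) {
        int c = common(x, y);
        return (double) c / (x.size() + y.size() - c); // denominator is |X union Y|
    }

    static double cosine(Set<String> x, Set<String> y) {
        return common(x, y) / Math.sqrt((double) x.size() * y.size());
    }

    static double overlap(Set<String> x, Set<String> y) {
        return (double) common(x, y) / Math.min(x.size(), y.size());
    }

    public static void main(String[] args) {
        // Hypothetical query and document term sets
        Set<String> q = new HashSet<>(Arrays.asList("information", "retrieval", "math"));
        Set<String> d = new HashSet<>(Arrays.asList("information", "retrieval", "vector", "space"));
        System.out.printf("simple=%.0f dice=%.3f jaccard=%.3f cosine=%.3f overlap=%.3f%n",
                simpleMatching(q, d), dice(q, d), jaccard(q, d), cosine(q, d), overlap(q, d));
    }
}
```

All but simple matching divide the shared-term count by some measure of the set sizes, so they fall between 0 and 1.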
SLIDE 9: tf*idf Normalization
Normalize the term weights (so longer vectors are not unfairly given more weight)
–To normalize usually means to force all values to fall within a certain range, usually between 0 and 1, inclusive
Additional parentheses added to clarify the order of operations
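A minimal sketch of one way to normalize, assuming each raw weight is tf * log10(N / n) and the vector is divided by its Euclidean length (cosine normalization). The log base, the names, and the sample counts are assumptions for illustration.

```java
// Sketch only: cosine (unit-length) normalization of tf*idf weights.
class TfIdfNormalizeSketch {
    /**
     * termFreqs[k] is the frequency of term k in the document,
     * docFreqs[k] is the number of documents containing term k,
     * totalDocs is the collection size N.
     */
    static double[] normalizedWeights(int[] termFreqs, int[] docFreqs, int totalDocs) {
        double[] w = new double[termFreqs.length];
        double sumOfSquares = 0.0;
        for (int k = 0; k < w.length; k++) {
            w[k] = termFreqs[k] * Math.log10((double) totalDocs / docFreqs[k]);
            sumOfSquares += w[k] * w[k];
        }
        double length = Math.sqrt(sumOfSquares);
        if (length == 0.0) {
            return w; // all-zero vector: nothing to scale
        }
        for (int k = 0; k < w.length; k++) {
            w[k] = w[k] / length;
        }
        return w;
    }

    public static void main(String[] args) {
        int[] termFreqs = {3, 1, 0};     // hypothetical term frequencies
        int[] docFreqs = {100, 1000, 10}; // hypothetical document frequencies
        for (double weight : normalizedWeights(termFreqs, docFreqs, 10000)) {
            System.out.printf("%.4f%n", weight);
        }
    }
}
```

After this step the squared weights of a document sum to 1, so every non-negative weight lies between 0 and 1 regardless of document length.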
SLIDE 10: Vector Space Similarity
Now, the similarity of two documents is:
$sim(D_i, D_j) = \sum_{k=1}^{t} w_{d_{ik}} \times w_{d_{jk}}$
This is also called the cosine normalized inner product
–The normalization was done when weighting the terms
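When the term weights are already normalized to unit length, the similarity reduces to a plain inner product. A minimal sketch, with hypothetical names and sample vectors:

```java
// Sketch only: inner product of two weight vectors that are assumed to be
// unit-normalized already, so no further division is needed.
class InnerProductSketch {
    static double similarity(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same number of terms");
        }
        double sum = 0.0;
        for (int k = 0; k < a.length; k++) {
            sum += a[k] * b[k];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] d1 = {0.6, 0.8}; // hypothetical unit-length vector
        double[] d2 = {0.8, 0.6}; // hypothetical unit-length vector
        System.out.printf("%.2f%n", similarity(d1, d2));
    }
}
```

For the two sample vectors the similarity is about 0.96; a vector compared with itself gives 1.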
SLIDE 11: Vector Space Similarity Measure
Combine tf and idf into a similarity measure
SLIDE 12: All in one equation is…
Extra parentheses added to clarify order of operations
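One common way to write the combined measure is the cosine form with tf*idf weights. The block below is a sketch of that standard form, not a recovery of the slide's own equation; the symbols $tf_{ij}$, $N$, $n_j$, and $w_{q_j}$ are assumptions.

```latex
\[
\mathrm{sim}(Q, D_i) =
\frac{\sum_{j=1}^{t} \left( w_{q_j} \times w_{d_{ij}} \right)}
     {\sqrt{\sum_{j=1}^{t} \left( w_{q_j} \right)^2} \times \sqrt{\sum_{j=1}^{t} \left( w_{d_{ij}} \right)^2}},
\qquad
w_{d_{ij}} = tf_{ij} \times \log\left( \frac{N}{n_j} \right)
\]
```

Here $t$ is the number of terms, $N$ the number of documents in the collection, and $n_j$ the number of documents containing term $j$.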
SLIDE 13: Computing Similarity Scores
SLIDE 14: What’s Cosine Anyway?
“One of the basic trigonometric functions encountered in trigonometry. Let theta be an angle measured counterclockwise from the x-axis along the arc of the unit circle. Then cos(theta) is the horizontal coordinate of the arc endpoint. As a result of this definition, the cosine function is periodic with period 2pi.”
SLIDE 15: Cosine vs. Degrees
[Figure: plot of cosine against degrees]
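The relationship between angle and cosine can be tabulated with a short sketch. The class and method names are hypothetical, and the 15-degree step is an arbitrary choice.

```java
// Sketch only: cosine of an angle given in degrees.
class CosineDegreesSketch {
    static double cosDegrees(double degrees) {
        return Math.cos(Math.toRadians(degrees));
    }

    public static void main(String[] args) {
        // Cosine falls from 1 at 0 degrees to (approximately) 0 at 90 degrees
        for (int degrees = 0; degrees <= 90; degrees += 15) {
            System.out.printf("%3d degrees  cos = %.4f%n", degrees, cosDegrees(degrees));
        }
    }
}
```

A small angle between two vectors therefore means a cosine near 1 (very similar), and a right angle means a cosine near 0 (nothing in common).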
SLIDE 16: Computing a Similarity Score
SLIDE 17: Vector Space Matching
[Figure: vectors Q, D1, and D2 plotted against axes Term A and Term B]
$D_i = (d_{i1}, w_{d_{i1}}; d_{i2}, w_{d_{i2}}; \ldots; d_{it}, w_{d_{it}})$
$Q = (q_{i1}, w_{q_{i1}}; q_{i2}, w_{q_{i2}}; \ldots; q_{it}, w_{q_{it}})$
Q = (0.4, 0.8)
D1 = (0.8, 0.3)
D2 = (0.2, 0.7)
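The three vectors above can be scored with a short sketch. It assumes the cosine measure (inner product divided by the product of the vector lengths); the class and method names are hypothetical.

```java
// Sketch only: cosine similarity of the query Q against D1 and D2.
class VectorMatchSketch {
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double sumSqA = 0.0;
        double sumSqB = 0.0;
        for (int k = 0; k < a.length; k++) {
            dot += a[k] * b[k];
            sumSqA += a[k] * a[k];
            sumSqB += b[k] * b[k];
        }
        if (sumSqA == 0.0 || sumSqB == 0.0) {
            return 0.0; // a zero vector has no direction to compare
        }
        return dot / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
    }

    public static void main(String[] args) {
        double[] q = {0.4, 0.8};
        double[] d1 = {0.8, 0.3};
        double[] d2 = {0.2, 0.7};
        System.out.printf("cos(Q, D1) = %.2f%n", cosine(q, d1));
        System.out.printf("cos(Q, D2) = %.2f%n", cosine(q, d2));
    }
}
```

With these values D2 scores about 0.98 and D1 about 0.73, so D2 is the closer match to Q.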