Information Retrieval Basic Document Scoring
Similarity between binary vectors. A document is a binary vector: X, Y in $\{0,1\}^v$, one component per vocabulary term. Score: the overlap measure, $|X \cap Y|$. What's wrong?
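Normalization.
  Dice coefficient (wrt avg #terms): $2\,|X \cap Y| \,/\, (|X| + |Y|)$. NO, triangular.
  Jaccard coefficient (wrt possible terms): $|X \cap Y| \,/\, |X \cup Y|$. OK, triangular.
  Cosine measure: $|X \cap Y| \,/\, \sqrt{|X| \cdot |Y|}$.
Here "triangular" refers to whether the induced distance (1 minus the coefficient) satisfies the triangle inequality. The cosine measure is less sensitive to docs' sizes.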
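The measures on this slide can be tried on small examples. The sketch below assumes each document is represented as a non-empty Python set of its terms (equivalent to a binary vector) and uses the standard set-based definitions; the two sample documents are made up for illustration.

```python
import math

def overlap(x, y):
    """Number of terms the two documents share."""
    return len(x & y)

def dice(x, y):
    """Overlap normalized by the average number of terms."""
    return 2 * len(x & y) / (len(x) + len(y))

def jaccard(x, y):
    """Overlap normalized by the number of distinct terms in either doc."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """Overlap normalized by the geometric mean of the two sizes."""
    return len(x & y) / math.sqrt(len(x) * len(y))

# Hypothetical documents, assumed non-empty.
d1 = {"baby", "bed", "of", "the"}
d2 = {"baby", "bed"}

for measure in (overlap, dice, jaccard, cosine):
    print(measure.__name__, measure(d1, d2))
```

The raw overlap grows with document size, while the three normalized scores stay in [0, 1].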
What's wrong in doc-similarity? Overlap matching doesn't consider: term frequency in a document (does the doc talk more of t? Then t should be weighted more); term scarcity in the collection ("of" is commoner than "baby bed"); length of documents (the score should be normalized).
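A famous "weight": tf × idf.
  $w_{t,d} = tf_{t,d} \times idf_t$
  $tf_{t,d} = \#occ_t \,/\, |d|$: frequency of term t in doc d
  $idf_t = \log(n / n_t)$, where $n_t$ = #docs containing term t and $n$ = #docs in the indexed collection
Sometimes we smooth the absolute term frequency, e.g. by using $1 + \log(\#occ_t)$ when t occurs in d and 0 otherwise.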
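As a minimal sketch of this weighting, assuming documents are given as lists of tokens, tf is the relative frequency #occ / |d|, and idf is the natural logarithm of n / n_t (the function name and sample collection are illustrative, not from the slides):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    n_t = Counter()                      # n_t[t] = number of docs containing t
    for d in docs:
        n_t.update(set(d))
    weights = []
    for d in docs:
        counts = Counter(d)              # raw occurrences of each term in d
        weights.append({
            t: (c / len(d)) * math.log(n / n_t[t])
            for t, c in counts.items()
        })
    return weights

# Hypothetical three-document collection.
docs = [
    ["baby", "bed", "of", "wood"],
    ["bed", "of", "roses"],
    ["of", "mice", "of", "men"],
]
w = tf_idf(docs)
print(w[0])
```

A term that occurs in every document, such as "of" here, gets idf = log(1) = 0 and therefore weight 0, however frequent it is inside a single document.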
Term-document matrices (real valued). Note: entries can be >1! The bag-of-words view implies that the doc “Paolo loves Marylin” is indistinguishable from the doc “Marylin loves Paolo”. A doc is a vector of tf × idf values, one component per term. Even with stemming, we may have 20,000+ dims.
A graphical example. Postulate: documents that are “close together” in the vector space talk about the same things. Euclidean distance is sensitive to vector length!! Euclidean distance vs cosine similarity (no triangular inequality).
[Figure: documents d1, ..., d5 drawn as vectors in the space of terms t1, t2, t3, with an angle φ marked.]
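Doc-Doc similarity. If normalized, it is the cosine of the angle between doc-vectors: $sim(d_i, d_j) = \cos(\varphi) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|}$. For normalized vectors, cosine, Euclidean distance, and dot product are tied together: recall that $\|u - v\|^2 = \|u\|^2 + \|v\|^2 - 2\, u \cdot v$, so for unit vectors $\|u - v\|^2 = 2 - 2\cos(\varphi)$.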
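The link between cosine and Euclidean distance can be checked numerically. This is a small sketch with made-up dense vectors; the helper names are illustrative.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def normalize(u):
    n = norm(u)
    return [a / n for a in u]

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d1 = [1.0, 2.0, 0.0]
d2 = [2.0, 4.0, 0.0]   # same direction as d1, twice as long
d3 = [0.0, 1.0, 3.0]

# Same direction: cosine is 1 even though the Euclidean distance is not 0.
print(cosine(d1, d2), euclidean(d1, d2))

# After normalization, squared distance equals 2 - 2 * cosine.
u, v = normalize(d1), normalize(d3)
print(euclidean(u, v) ** 2, 2 - 2 * cosine(d1, d3))
```

The first line illustrates why Euclidean distance is sensitive to vector length; the second shows that on unit vectors ranking by distance and ranking by cosine agree.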
Query-Doc similarity. We choose the dot product as the measure of proximity between document and query (seen as a short doc). Note: it is 0 if they are orthogonal (no words in common). The MG book proposes a different weighting scheme.
MG book: |d| may be precomputed and |q| may be considered a constant; hence, normalization does not depend on the query.
Vector spaces and other operators. The vector space is OK for bag-of-words queries and is a clean metaphor for similar-document queries. It is not a good combination with operators: Boolean, wild-card, positional, proximity. It was used by the first generation of search engines, invented before “spamming” web search.
Relevance Feedback: Rocchio. The user sends his/her query Q. The search engine returns its best results. The user marks some results as relevant and resubmits the query plus the marked results (repeat the query!!). The search engine exploits this refined knowledge of the user's need to return more relevant results.
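One way to picture the "query plus marked results" step is to move the query vector toward the centroid of the documents marked relevant. The sketch below assumes sparse vectors stored as {term: weight} dicts; the weights alpha and beta are illustrative assumptions, and the negative-feedback term that Rocchio's formula usually includes is left out because the slide only mentions relevant results.

```python
def rocchio(query, relevant_docs, alpha=1.0, beta=0.75):
    """Return alpha * query + beta * centroid(relevant_docs).

    query and each doc are {term: weight} dicts.
    alpha and beta are assumed values, not taken from the slides.
    """
    new_q = {t: alpha * w for t, w in query.items()}
    for d in relevant_docs:
        for t, w in d.items():
            # Each relevant doc contributes 1/len(relevant_docs) of its weight.
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant_docs)
    return new_q

# Hypothetical query and user-marked results.
q = {"jaguar": 1.0}
marked = [
    {"jaguar": 0.5, "car": 1.0},
    {"car": 1.0, "speed": 0.5},
]
print(rocchio(q, marked))
```

The refined query keeps the original term with its weight boosted and gains the terms that the marked documents have in common, which is what lets the second round return more relevant results.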
Information Retrieval Efficient cosine computation
Find k top documents for Q. IR solves the k-nearest neighbor problem for each query. For short queries, standard indexes are optimized. For long queries or doc-sim, we have high-dimensional spaces: locality-sensitive hashing…
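Encoding document frequencies. #docs(t) is useful to compute IDF; the term frequency is useful to compute $tf_{t,d}$. Unary code is very effective for $tf_{t,d}$. In the lists below, the number next to each term is #docs(t) (IDF), and each posting is a pair docID,tf (TF):
  abacus  8    1,2  7,3  83,1  87,2 …
  aargh   12   1,1  5,1  13,1  17,1 …
  acacia  35   7,1  8,2  40,1  97,3 …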
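To see why unary code suits term frequencies, note that most tf values are 1 or 2, and unary spends exactly x bits on the value x. The sketch below assumes the convention "x-1 zeros followed by a one" (the opposite bit convention is equally common) and encodes the tf values of the first posting list shown above.

```python
def unary_encode(values):
    """Encode positive integers: x becomes (x - 1) zeros followed by a one."""
    return "".join("0" * (v - 1) + "1" for v in values)

def unary_decode(bits):
    """Inverse of unary_encode."""
    values, run = [], 0
    for b in bits:
        if b == "0":
            run += 1
        else:
            values.append(run + 1)
            run = 0
    return values

tfs = [2, 3, 1, 2]           # tf values from the postings 1,2 7,3 83,1 87,2
bits = unary_encode(tfs)
print(bits, len(bits))       # total length is the sum of the tf values
print(unary_decode(bits))
```

Four term frequencies fit in 8 bits here, whereas a fixed-width integer per posting would cost far more.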
Computing a single cosine. Accumulate the component-wise sum. Three problems: #sums per doc = #vocabulary terms; #sim-computations = #documents; #accumulators = #documents.
Two tricks. Sum only for terms in Q [$w_{t,q} \neq 0$]. Compute Sim only for docs in IL [$w_{t,d} \neq 0$]. On the query "aargh abacus" we would only need accumulators 1, 5, 7, 13, 17, …, 83, 87, …
  abacus  8    1,2  7,3  83,1  87,2 …
  aargh   12   1,1  5,1  13,1  17,1 …
  acacia  35   7,1  8,2  40,1  97,3 …
We could restrict to docs in the intersection!!
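The two tricks amount to walking only the inverted lists of the query terms and keeping one accumulator per document met along the way. The sketch below assumes a toy index holding just the visible prefix of each list, a hypothetical collection size of 100 docs, and unnormalized tf × idf contributions (no division by document length).

```python
import math
from collections import defaultdict

# term -> list of (doc_id, tf); only the visible prefix of each slide list.
index = {
    "abacus": [(1, 2), (7, 3), (83, 1), (87, 2)],
    "aargh":  [(1, 1), (5, 1), (13, 1), (17, 1)],
    "acacia": [(7, 1), (8, 2), (40, 1), (97, 3)],
}

def score(query_terms, index, n_docs):
    """Term-at-a-time scoring: one accumulator per doc seen in some list."""
    acc = defaultdict(float)
    for t in query_terms:                 # trick 1: only terms in Q
        postings = index.get(t, [])
        if not postings:
            continue
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings:       # trick 2: only docs in IL(t)
            acc[doc_id] += tf * idf
    return dict(acc)

acc = score(["aargh", "abacus"], index, n_docs=100)  # n_docs is assumed
print(sorted(acc))
```

Only seven accumulators are created for this query; documents such as 40 or 97, which appear only in the list of "acacia", are never touched.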
Advanced #1: Approximate rank-search. Preprocess: assign to each term its m best documents (according to the TF-IDF). Lots of preprocessing. Result: a “preferred list” of answers for each term. Search: for a q-term query, take the union of their q preferred lists and call this set S, where $|S| \le mq$. Compute cosines from the query to only the docs in S, and choose the top k. Need to pick m > k to work well empirically.
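A minimal sketch of preferred lists, assuming the tf-idf weights are already available as a term -> {doc_id: weight} mapping. For brevity the final ranking uses the sum of the query terms' weights as a stand-in for the cosine; all names and numbers are illustrative.

```python
import heapq

def build_preferred(weights, m):
    """Preprocess: keep, for each term, its m docs with highest weight."""
    return {t: heapq.nlargest(m, docs, key=docs.get)
            for t, docs in weights.items()}

def candidates(query_terms, preferred):
    """S = union of the preferred lists of the query terms."""
    s = set()
    for t in query_terms:
        s.update(preferred.get(t, []))
    return s

def top_k(query_terms, weights, preferred, k):
    """Score only the docs in S (summed weights stand in for the cosine)."""
    s = candidates(query_terms, preferred)

    def doc_score(d):
        return sum(weights[t].get(d, 0.0) for t in query_terms if t in weights)

    return heapq.nlargest(k, s, key=doc_score)

# Hypothetical tf-idf weights.
weights = {
    "baby": {1: 0.9, 2: 0.5, 3: 0.1},
    "bed":  {2: 0.8, 4: 0.7, 1: 0.2},
}
preferred = build_preferred(weights, m=2)
print(candidates(["baby", "bed"], preferred))
print(top_k(["baby", "bed"], weights, preferred, k=2))
```

The search is approximate: a document that is mediocre for every single term but good overall may never enter S, which is why m has to be chosen larger than k.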
Advanced #2: Clustering. [Figure: a query point among leaders and their followers.]
Advanced #2: Clustering. Recall that docs ~ T-dim vectors. Pre-processing phase on n docs: pick L docs at random and call these leaders; for each other doc, pre-compute its nearest leader. The docs attached to a leader are its followers; likely, each leader has ~ n/L followers. Process a query as follows: given query Q, find its nearest leader, then seek the k nearest docs from among its followers.
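The leader/follower scheme can be sketched as follows, assuming dense vectors, cosine as the nearness measure, and the convention that a leader is also listed among its own followers. The leaders are passed in explicitly so the example is reproducible; `pick_leaders` shows the random choice described on the slide. All names are illustrative.

```python
import math
import random

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def pick_leaders(n, num_leaders, seed=0):
    """Pick leader doc ids at random, as in the pre-processing phase."""
    return random.Random(seed).sample(range(n), num_leaders)

def build_clusters(docs, leaders):
    """Attach every non-leader doc to its nearest leader."""
    followers = {l: [l] for l in leaders}
    for i, d in enumerate(docs):
        if i in followers:
            continue
        nearest = max(leaders, key=lambda l: cosine(d, docs[l]))
        followers[nearest].append(i)
    return followers

def search(query, docs, followers, k):
    """Find the nearest leader, then rank only that leader's followers."""
    leader = max(followers, key=lambda l: cosine(query, docs[l]))
    ranked = sorted(followers[leader],
                    key=lambda i: cosine(query, docs[i]), reverse=True)
    return ranked[:k]

# Hypothetical 2-dimensional docs forming two groups.
docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
        [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
followers = build_clusters(docs, leaders=[0, 3])
print(followers)
print(search([1.0, 0.05], docs, followers, k=2))
```

A query costs L comparisons against the leaders plus roughly n/L against the followers, instead of n comparisons against the whole collection.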
Advanced #3: pruning. Classic approach: scan docs and compute their sim(d,q). Accumulator approach: all sim(d,q) are computed in parallel. Build an accumulator array containing all sim(d,q), $\forall d$, for the given q. Exploit the ILs so that $\forall t \in Q$, $\forall d \in IL_t$, we compute TF-IDF$_{t,d}$ and add it to sim(d,q). We RE-structure the computation: terms of Q are considered by decreasing IDF (i.e. smaller ILs first), and documents in the ILs are ordered by decreasing TF. This way, when a term is picked and its IL is scanned, the TF-IDF values are computed by decreasing value, so that we can apply some pruning.
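The slide leaves the pruning rule open. One simple possibility, assumed here purely for illustration, is to stop scanning a list as soon as a posting's tf × idf contribution drops below a threshold: since the list is in decreasing TF order, every later posting would contribute even less. The threshold `min_contribution`, the toy index, and the query-time sorting are all assumptions of this sketch; a real index would store the lists already sorted.

```python
import math
from collections import defaultdict

def pruned_scores(query_terms, index, n_docs, min_contribution):
    """Accumulate tf*idf per doc, rarest term first, highest tf first."""
    idf = {t: math.log(n_docs / len(index[t]))
           for t in query_terms if t in index}
    acc = defaultdict(float)
    for t in sorted(idf, key=idf.get, reverse=True):        # decreasing IDF
        postings = sorted(index[t], key=lambda p: p[1], reverse=True)
        for doc_id, tf in postings:                         # decreasing TF
            contribution = tf * idf[t]
            if contribution < min_contribution:
                break          # all remaining postings contribute less
            acc[doc_id] += contribution
    return dict(acc)

# Hypothetical index: term -> list of (doc_id, tf).
index = {
    "rare":   [(1, 3), (2, 1)],
    "common": [(1, 1), (2, 5), (3, 1), (4, 1)],
}
print(pruned_scores(["common", "rare"], index, n_docs=8, min_contribution=1.0))
print(pruned_scores(["common", "rare"], index, n_docs=8, min_contribution=0.0))
```

With the threshold at 1.0 only docs 1 and 2 get accumulators; with no pruning, docs 3 and 4 also appear, but with contributions too small to matter for the top results.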
Advanced #4: Not only term weights. Current search engines use various measures for estimating the relevance of a page wrt a query: Relevance(Q,d) = $h(d, t_1, t_2, \ldots, t_q)$. PageRank [Google] is one of these methods and denotes the relevance taking into account the hyperlinks in the Web graph (more later). Google: tf-idf PLUS PageRank (PLUS other weights). The Google toolbar suggests that PageRank is crucial.
Advanced #4: Fancy-hits heuristic. Preprocess: Fancy_hits(t) = its m docs with highest tf-idf weight; sort IL(t) by decreasing PR weight. Idea: a document that scores high should be in FH or in the front of IL. Search for a t-term query. First FH: take the common docs of their FHs, compute the score of these docs, and keep the top-k docs. Then IL: scan the ILs and check the common docs of the ILs and FHs; compute the score of these docs and possibly insert them into the current top-k. Stop when m docs have been checked or the PR score goes below some threshold.
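Advanced #5: high-dim space. Binary vectors are easier to manage. Map a unit vector u to {0,1}: $h_r(u) = 1$ if $r \cdot u \ge 0$, and 0 otherwise, where r is drawn from the unit sphere. This h is locality sensitive: $P[h_r(u) = h_r(v)] = 1 - \theta(u,v)/\pi$, where $\theta(u,v)$ is the angle between u and v. Map u to $h_1(u), h_2(u), \ldots, h_k(u)$. Repeat g times, to control error. If A & B are sets …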
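A minimal sketch of this mapping, assuming the usual random-hyperplane construction: each bit is the sign of the dot product with a random direction r. Directions are drawn as Gaussian vectors (only their direction matters, so they are not rescaled to unit length); the function names and the fixed seed are illustrative.

```python
import random

def random_hyperplanes(dim, k, seed=0):
    """Draw k random directions; Gaussian coordinates give uniform directions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(k)]

def signature(u, planes):
    """Map u to k bits: bit i is 1 iff u lies on the positive side of plane i."""
    return tuple(
        1 if sum(r_i * u_i for r_i, u_i in zip(r, u)) >= 0 else 0
        for r in planes
    )

def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))

planes = random_hyperplanes(dim=3, k=16)
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]      # same direction as u
w = [-1.0, -2.0, -3.0]   # opposite direction

print(hamming(signature(u, planes), signature(v, planes)))
print(hamming(signature(u, planes), signature(w, planes)))
```

Vectors pointing the same way always share their signature, opposite vectors disagree on every bit, and in between the expected fraction of differing bits grows with the angle between the two vectors.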
Information Retrieval Recommendation Systems
Recommendations. We have a list of restaurants, with positive and negative ratings for some. Which restaurant(s) should I recommend to Dave?
Basic Algorithm. Recommend the most popular restaurants, say by # positive votes minus # negative votes. What if Dave does not like Spaghetti?
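The popularity rule fits in a few lines, assuming ratings are stored as person -> {restaurant: +1 or -1}. The people other than Dave, the votes, and the restaurant name "Spaghetti House" are made up for this sketch.

```python
from collections import Counter

def popularity(ratings):
    """Score of a restaurant = # positive votes minus # negative votes."""
    score = Counter()
    for votes in ratings.values():
        for restaurant, vote in votes.items():
            score[restaurant] += vote
    return score

def recommend_popular(ratings, user):
    """Most popular restaurant the user has not rated yet (None if no such)."""
    score = popularity(ratings)
    unseen = [r for r in score if r not in ratings.get(user, {})]
    return max(unseen, key=score.get) if unseen else None

# Hypothetical votes.
ratings = {
    "Alice": {"Spaghetti House": 1, "Straits Cafe": 1},
    "Bob":   {"Spaghetti House": 1, "Straits Cafe": -1},
    "Cindy": {"Spaghetti House": 1},
    "Dave":  {},
}
print(popularity(ratings))
print(recommend_popular(ratings, "Dave"))
```

Everyone receives the same recommendation, whatever their own tastes, which is exactly the weakness the slide's closing question points at.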
Smart Algorithm. Basic idea: find the person “most similar” to Dave according to cosine-similarity (i.e. Estie), and then recommend something this person likes. Perhaps recommend Straits Cafe to Dave. Do you want to rely on one person's opinions?
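A sketch of the nearest-person idea, assuming each person is a sparse rating vector with +1 for a liked restaurant, -1 for a disliked one, and 0 (absent) for an unrated one. The votes and the names "Spaghetti House" and "Zao" are invented so that Estie comes out as Dave's nearest neighbor, as on the slide.

```python
import math

def cosine(a, b):
    """Cosine between two sparse rating vectors ({restaurant: +1 or -1})."""
    num = sum(a[r] * b[r] for r in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def recommend(ratings, user):
    """Find the most similar person; return them and what they like
    that the user has not rated yet."""
    others = [u for u in ratings if u != user]
    nearest = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    liked = [r for r, vote in ratings[nearest].items()
             if vote > 0 and r not in ratings[user]]
    return nearest, liked

# Hypothetical votes.
ratings = {
    "Alice": {"Spaghetti House": 1, "Zao": -1, "Straits Cafe": -1},
    "Bob":   {"Spaghetti House": 1, "Zao": 1},
    "Estie": {"Spaghetti House": -1, "Zao": 1, "Straits Cafe": 1},
    "Dave":  {"Spaghetti House": -1, "Zao": 1},
}
print(recommend(ratings, "Dave"))
```

The recommendation rests entirely on a single neighbor; averaging over the few most similar people is the natural next step that the slide's closing question hints at.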