Information Retrieval Search Engine Technology (4) Prof. Dragomir R. Radev
SET/IR – W/S 2009 … 7. Approximate string matching …
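Levenshtein edit distance Examples: –Theatre -> theater –Ghaddafi -> Qadafi –Computer -> counter Edit distance: the minimum number of inserts, deletes, and substitutions needed to turn one string into the other –Edit transcript: the sequence of operations that achieves it Done through dynamic programming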
Recurrence relation Three dependencies –D(i,0)=i –D(0,j)=j –D(i,j)=min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)] Simple edit distance: –t(i,j) = 0 if S1(i)=S2(j), and 1 otherwise
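The recurrence above can be sketched in Python as follows. This is a minimal illustration, not code from the lecture: the function name `edit_distance` and the table layout are assumptions, and the unit costs correspond to the simple edit distance on the slide.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Simple (unit-cost) Levenshtein distance via dynamic programming."""
    m, n = len(s1), len(s2)
    # D[i][j] = distance between the first i chars of s1 and the first j chars of s2
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # base case D(i,0) = i: delete i characters
    for j in range(n + 1):
        D[0][j] = j  # base case D(0,j) = j: insert j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1  # t(i,j)
            D[i][j] = min(
                D[i - 1][j] + 1,      # delete s1[i-1]
                D[i][j - 1] + 1,      # insert s2[j-1]
                D[i - 1][j - 1] + t,  # match or substitute
            )
    return D[m][n]


print(edit_distance("theatre", "theater"))  # 2
print(edit_distance("computer", "counter"))  # 3
```

The table has (m+1) x (n+1) cells and each cell is filled in constant time, so the running time is proportional to the product of the two string lengths.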
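Example (Gusfield 1997) [Table: dynamic programming table with the letters of "vintner" labelling rows 1-7 and the letters of "writers" labelling the columns; the first column is initialized with D(i,0) = i]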
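Example (cont’d) (Gusfield 1997) [Table: dynamic programming table for "vintner" (rows) and "writers" (columns), with row 4 (T) filled in with the value 4 up to column 3; * marks the next cell]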
Tracebacks (Gusfield 1997) [Table: dynamic programming table for "vintner" (rows) and "writers" (columns), as in the preceding example]
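A traceback can be sketched as follows: fill the table, then walk back from D(m,n) to D(0,0), at each cell following a neighbour that could have produced its value. This is a hedged illustration; the function name `edit_transcript`, the operation letters (M = match, R = replace, D = delete from S1, I = insert into S1), and the tie-breaking order are assumptions made here. Several optimal transcripts may exist, and this sketch returns only one of them.

```python
def edit_transcript(s1: str, s2: str) -> tuple[int, str]:
    """Return (distance, one optimal edit transcript) for turning s1 into s2."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    # Traceback: prefer the diagonal, then "up" (delete), then "left" (insert).
    ops = []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            if D[i][j] == D[i - 1][j - 1] + t:
                ops.append("M" if t == 0 else "R")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")
            i -= 1
        else:
            ops.append("I")
            j -= 1
    return D[m][n], "".join(reversed(ops))


distance, transcript = edit_transcript("vintner", "writers")
print(distance, transcript)  # distance is 5; the transcript depends on tie-breaking
```

The number of non-M letters in the transcript equals the edit distance, which gives a quick sanity check on any traceback.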
Weighted edit distance Used to emphasize the relative cost of different edit operations Useful in bioinformatics –Homology information –BLAST –Blosum –http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
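One way to sketch weighted edit distance is to let the caller supply the cost of each operation. Everything named below is hypothetical: `weighted_edit_distance`, its parameters, and the toy substitution cost `toy_sub_cost` (which makes vowel-for-vowel swaps cheaper) are illustrations only, not values from BLOSUM or any other real scoring matrix.

```python
from typing import Callable


def weighted_edit_distance(
    s1: str,
    s2: str,
    ins_cost: float = 1.0,
    del_cost: float = 1.0,
    sub_cost: Callable[[str, str], float] = lambda a, b: 0.0 if a == b else 1.0,
) -> float:
    """Edit distance with per-operation costs (defaults give the simple distance)."""
    m, n = len(s1), len(s2)
    D = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + del_cost
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + del_cost,
                D[i][j - 1] + ins_cost,
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
            )
    return D[m][n]


def toy_sub_cost(a: str, b: str) -> float:
    """Hypothetical cost: swapping one vowel for another is cheap."""
    if a == b:
        return 0.0
    if a in "aeiou" and b in "aeiou":
        return 0.5
    return 1.0


print(weighted_edit_distance("cat", "cut"))                         # 1.0
print(weighted_edit_distance("cat", "cut", sub_cost=toy_sub_cost))  # 0.5
```

If a substitution costs more than a deletion plus an insertion, the minimum simply routes around it, so the relative sizes of the costs matter more than their absolute values.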
Links Web sites: – – Demo: –/home/cs6998/tools/editDistance/dp/l.pl theater theatre –http://nayana.ece.ucsb.edu/imsearch/imsearch.html
Other methods Cosine Generation probabilities (language modeling) (exp)KL-divergence
SET/IR – W/S 2009 … 8. Query expansion Relevance feedback …
Query expansion
Corpus-based: mine query logs NLP-based Vector-space relevance feedback
Relevance feedback Problem: initial query may not be the most appropriate to satisfy a given information need. Idea: modify the original query so that it gets closer to the right documents in the vector space
Relevance feedback –Automatic –Manual Method: identifying feedback terms Q' = a_1 Q + a_2 R - a_3 N (Q: original query, R: relevant documents, N: non-relevant documents) Often a_1 = 1, a_2 = 1/|R| and a_3 = 1/|N|
Example Q = “safety minivans” D1 = “car safety minivans tests injury statistics” - relevant D2 = “liability tests safety” - relevant D3 = “car passengers injury reviews” - non-relevant R = ? N = ? Q’ = ?
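One possible way to work this example is sketched below. It assumes raw term counts as vector weights (the slides do not specify a weighting scheme), takes R and N in the formula to be the summed vectors of the relevant and non-relevant documents, and uses a_1 = 1, a_2 = 1/|R|, a_3 = 1/|N| with |R| and |N| the number of documents in each set. The function name `rocchio` is a label chosen here.

```python
from collections import Counter


def rocchio(query: str, relevant: list[str], nonrelevant: list[str], a1: float = 1.0) -> dict[str, float]:
    """Q' = a1*Q + a2*R - a3*N over raw term counts (an assumed weighting)."""
    q = Counter(query.split())
    r = Counter()
    for doc in relevant:
        r.update(doc.split())  # R: summed term counts of the relevant documents
    n = Counter()
    for doc in nonrelevant:
        n.update(doc.split())  # N: summed term counts of the non-relevant documents
    a2 = 1.0 / len(relevant) if relevant else 0.0
    a3 = 1.0 / len(nonrelevant) if nonrelevant else 0.0
    terms = set(q) | set(r) | set(n)
    # Counter returns 0 for missing terms, so every term can be scored directly.
    return {t: a1 * q[t] + a2 * r[t] - a3 * n[t] for t in terms}


q_new = rocchio(
    "safety minivans",
    relevant=["car safety minivans tests injury statistics", "liability tests safety"],
    nonrelevant=["car passengers injury reviews"],
)
for term, weight in sorted(q_new.items(), key=lambda kv: -kv[1]):
    print(f"{term:12s} {weight:+.1f}")
```

Under these assumptions "safety" gets weight 2.0, "minivans" 1.5 and "tests" 1.0, while "passengers" and "reviews" get -1.0. In practice negative weights are often clipped to zero before the modified query is run.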
Pseudo relevance feedback Automatic query expansion –Thesaurus-based expansion (e.g., using latent semantic indexing – later…) –Distributional similarity –Query log mining
Examples
Lexical semantics (Hypernymy):
Book: publication, product, fact, dramatic composition, record
Computer: machine, expert, calculator, reckoner, figurer
Fruit: reproductive structure, consequence, product, bear
Politician: leader, schemer
Newspaper: press, publisher, product, paper, newsprint
Distributional clustering:
Book: autobiography, essay, biography, memoirs, novels
Computer: adobe, computing, computers, developed, hardware
Fruit: leafy, canned, fruits, flowers, grapes
Politician: activist, campaigner, politicians, intellectuals, journalist
Newspaper: daily, globe, newspapers, newsday, paper
Examples (query logs)
Book: booksellers, bookmark, blue
Computer: sales, notebook, stores, shop
Fruit: recipes, cake, salad, basket, company
Games: online, play, gameboy, free, video
Politician: careers, federal, office, history
Newspaper: online, website, college, information
Schools: elementary, high, ranked, yearbook
California: berkeley, san francisco, southern
French: embassy, dictionary, learn
[Otterbacher et al. HLT EMNLP 2005]
Readings 4: MRS15, MRS16 5: MRS17 6: MRS18, MRS19