EFFICIENT ALGORITHMS FOR APPROXIMATE MEMBER EXTRACTION By Swapnil Kharche and Pavan Basheerabad
INTRODUCTION
AME (approximate member extraction): how to efficiently extract substrings of a text document that approximately match some string in a given dictionary.
Applications: named entity recognition, data cleaning.
Two steps:
Filtration: filter out dictionary strings that are very different from the substring.
Verification: verify each remaining candidate string to decide whether the substring should be extracted.
INTRODUCTION: AN EXAMPLE
We have a dictionary of strings we are interested in, e.g. conference names, author names, etc.
We want to locate their "approximate appearances" in a series of documents.
PROBLEM DEFINITION
We are given a dictionary R of strings and a similarity threshold δ ∈ [0,1]; then a query M is submitted. Here, M is a relatively long string (e.g. a text file).
The task of AME is to extract all substrings m of M such that there exists some r ∈ R satisfying Sim(m,r) ≥ δ.
Such an r is called a piece of evidence for m.
Sim() is a function measuring the similarity of two strings.
An example of a similarity measure is Jaccard similarity over token sets: J(m,r) = |m ∩ r| / |m ∪ r|.
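The problem definition above can be sketched as a brute-force baseline. This is only an illustration under stated assumptions: strings are split into tokens on whitespace, similarity is plain (optionally weighted) Jaccard over token sets, and the names `jaccard` and `naive_extract` are hypothetical helpers, not names from the slides.

```python
def jaccard(a_tokens, b_tokens, weights=None):
    """Jaccard similarity of two token collections, treated as sets.

    With weights=None every token counts 1; otherwise weights[token]
    is used, giving wt(a & b) / wt(a | b).
    """
    a, b = set(a_tokens), set(b_tokens)
    w = (lambda t: 1.0) if weights is None else (lambda t: weights[t])
    union = sum(w(t) for t in a | b)
    if union == 0:
        return 0.0  # two empty strings: define similarity as 0
    return sum(w(t) for t in a & b) / union


def naive_extract(doc_tokens, dictionary, delta):
    """Return every token window m of the document that has some
    dictionary string r with jaccard(m, r) >= delta.

    This checks every window against every dictionary string, which is
    exactly the cost that filtration is meant to avoid.
    """
    hits = []
    for i in range(len(doc_tokens)):
        for j in range(i + 1, len(doc_tokens) + 1):
            m = doc_tokens[i:j]
            if any(jaccard(m, r.split()) >= delta for r in dictionary):
                hits.append(" ".join(m))
    return hits


doc = "buy canon eos digital camera today".split()
print(naive_extract(doc, ["canon eos 5d digital camera"], 0.8))
```

The quadratic number of windows, each compared against the whole dictionary, is why a filtration step is worth adding before verification.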
APPROACH
Given the input, we need to decide whether each substring m should be extracted.
Verifying m directly against every dictionary string may be inefficient.
Pruning before verification is beneficial.
But should the filter be oriented toward running speed or toward filtering power? Less time, or fewer survivors?
FILTRATION-VERIFICATION
[Diagram: input query M and dictionary R → Filtration → potential matches → Verification → true matches (wrong matches are discarded)]
FILTRATION-VERIFICATION (CONT'D)
We need to balance the two stages.
More (less) filtration time → stronger (weaker) filtration power → fewer (more) candidates → less (more) verification time.
Overall performance = Tf + Tv, the filtration time plus the verification time. How do we keep the sum small?
TECHNIQUES
If Sim(m,r) ≥ δ, what do we have?
wt(Sig(m) ∩ Sig(r)) ≥ τ(m) (existing techniques)
wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)} (technique used)
where
Sig(m) is a prefix signature set of string m,
τ(m) = wt(Sig(m)) - (1 - δ)·wt(m).
So the threshold does not remain constant; it depends on the strings being compared.
Inverted lists are used to count the overlap of sig-tokens.
Token weights are IDF (inverse document frequency) weights.
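The definitions of Sig and τ can be made concrete with a small sketch. The slides do not spell out how the prefix is chosen, so the rule below is an assumption: tokens are put in a fixed global order (descending weight), and the longest tail whose total weight stays strictly below (1 - δ)·wt(r) is dropped. The weights are the ones quoted on the SIL slide; at δ = 0.8 this rule happens to reproduce the signature sets shown there. `prefix_signature` and `tau` are hypothetical names, and tokens are assumed to be lowercased and present in the weight table.

```python
# Token weights as listed on the SIL slide; dict order doubles as the
# global token order (descending weight, ties in the slide's order).
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
ORDER = list(WEIGHTS)


def wt(tokens):
    """Total weight of a collection of tokens, treated as a set."""
    return sum(WEIGHTS[t] for t in set(tokens))


def prefix_signature(tokens, delta):
    """Assumed rule: drop the longest low-weight tail whose weight is
    strictly less than (1 - delta) * wt(tokens); keep the rest."""
    ordered = sorted(set(tokens), key=ORDER.index)
    budget = (1 - delta) * wt(ordered)
    dropped = 0.0
    while ordered and dropped + WEIGHTS[ordered[-1]] < budget:
        dropped += WEIGHTS[ordered.pop()]
    return set(ordered)


def tau(tokens, delta):
    """tau(m) = wt(Sig(m)) - (1 - delta) * wt(m), as on the slide."""
    return wt(prefix_signature(tokens, delta)) - (1 - delta) * wt(tokens)


r1 = "canon eos 5d digital camera".split()
print(prefix_signature(r1, 0.8), tau(r1, 0.8))
```

Because the dropped tail weighs less than (1 - δ)·wt(m), τ(m) is always positive under this rule, so two strings with disjoint signatures can never pass the filter.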
SIGNATURE-BASED INVERTED LISTS (SIL)
Lists are indexed by sig-tokens.
Each sig-token of a string creates a node (containing the string's id) in the corresponding list.
E.g. R = { r1 = "canon eos 5d digital camera", r2 = "Nikon digital slr camera", r3 = "canon slr" }.
wt(5d, eos, slr, Nikon, canon, camera, digital) = (9, 7, 2, 2, 2, 1, 1)
SIL (CONT'D)
Signature sets of R's strings:
rid  String                         Signature set
1    "canon eos 5d digital camera"  {"canon", "eos", "5d"}
2    "Nikon digital slr camera"     {"nikon", "slr", "camera"}
3    "canon slr"                    {"canon", "slr"}
SIL:
Signature token  String rids
"5d"             (1)
"canon"          (1), (3)
"camera"         (2)
"eos"            (1)
"nikon"          (2)
"slr"            (2), (3)
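Building the inverted lists from the signature sets is a one-pass grouping. The sketch below takes the three signature sets from the table as given and uses a hypothetical helper name, `build_sil`; tokens are assumed to be lowercased.

```python
from collections import defaultdict

# Signature sets copied from the table (tokens lowercased).
SIGNATURES = {
    1: {"canon", "eos", "5d"},
    2: {"nikon", "slr", "camera"},
    3: {"canon", "slr"},
}


def build_sil(signatures):
    """Map each sig-token to the sorted list of rids whose signature
    set contains it: one node per (sig-token, string) pair."""
    sil = defaultdict(list)
    for rid in sorted(signatures):
        for token in signatures[rid]:
            sil[token].append(rid)
    return dict(sil)


sil = build_sil(SIGNATURES)
for token in sorted(sil):
    print(token, sil[token])
```

Tokens that appear in a string but not in its signature (for example "digital") get no list at all, which keeps the index smaller than a full inverted index.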
EvSCAN ALGORITHM BY SIL
Compute the overlapped signature weight wt(Sig(m) ∩ Sig(r)) by scanning the lists of m's sig-tokens.
A dictionary string r is kept as a candidate for m when it satisfies wt(Sig(m) ∩ Sig(r)) ≥ min{τ(m), τ(r)}.
E.g. m = "canon eos digital camera", δ =
[Table: rid, wt(Sig(m) ∩ Sig(r)), min{τ(m), τ(r)}]
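The scan can be sketched as follows. This is an illustration, not the authors' implementation: the value of δ in the slide's example did not survive, so δ = 0.8 is used here purely as an assumption, and the signature rule (drop the longest low-weight tail weighing less than (1 - δ)·wt) is likewise assumed. All function names are hypothetical, and every token is assumed to appear in the weight table.

```python
from collections import defaultdict

WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
ORDER = list(WEIGHTS)  # global token order, descending weight
DICTIONARY = {1: "canon eos 5d digital camera",
              2: "nikon digital slr camera",
              3: "canon slr"}


def wt(tokens):
    return sum(WEIGHTS[t] for t in set(tokens))


def signature(tokens, delta):
    # Assumed prefix rule; see the lead-in.
    ordered = sorted(set(tokens), key=ORDER.index)
    budget = (1 - delta) * wt(ordered)
    dropped = 0.0
    while ordered and dropped + WEIGHTS[ordered[-1]] < budget:
        dropped += WEIGHTS[ordered.pop()]
    return set(ordered)


def tau(tokens, delta):
    return wt(signature(tokens, delta)) - (1 - delta) * wt(tokens)


def build_sil(delta):
    sil = defaultdict(list)
    for rid, text in DICTIONARY.items():
        for tok in signature(text.split(), delta):
            sil[tok].append(rid)
    return sil


def evscan_candidates(m_text, delta):
    """Scan the list of each sig-token of m, accumulate the overlap
    weight per rid, then keep rids passing min{tau(m), tau(r)}."""
    m_tokens = m_text.lower().split()
    sil = build_sil(delta)
    overlap = defaultdict(float)
    for tok in signature(m_tokens, delta):
        for rid in sil.get(tok, []):
            overlap[rid] += WEIGHTS[tok]
    tau_m = tau(m_tokens, delta)
    return sorted(rid for rid, w in overlap.items()
                  if w >= min(tau_m, tau(DICTIONARY[rid].split(), delta)))


print(evscan_candidates("canon eos digital camera", 0.8))
```

Strings that share no sig-token with m are never touched by the scan, so the work is proportional to the lengths of the scanned lists rather than to the size of the dictionary. Surviving candidates still have to go through verification.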
EvITER ALGORITHM: PROGRESSIVE COMPUTATION
Recall that we are checking all substrings.
Many of them are quite similar, so checking each one from scratch repeats the same computation.
If m has potential evidence r, then m·t (m extended by the next token t) is very likely to match r as well.
Formally, let ES(m) be the set of "potential evidence" for m, and let list[t] = { s ∈ R | s contains token t }.
Then ES(m·t) ⊆ ES(m) ∪ list[t], where
ES(m) = { r ∈ R | wt(m ∩ Sig(r)) ≥ min{ δ·wt(m), τ(r) } }.
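The containment ES(m·t) ⊆ ES(m) ∪ list[t] suggests an incremental loop: when m grows by one token, only the previous evidence set and the list of the new token need rechecking. The sketch below is a minimal illustration under several assumptions: δ = 0.8, the signature sets and τ values are those of the three-string example at that threshold, list[t] is taken to be the SIL list of t, tokens outside the weight table get a hypothetical default weight of 3.0, and all names are illustrative.

```python
WEIGHTS = {"5d": 9, "eos": 7, "slr": 2, "nikon": 2,
           "canon": 2, "camera": 1, "digital": 1}
DEFAULT_WEIGHT = 3.0  # assumed weight for tokens not in the table
DELTA = 0.8           # assumed threshold
SIGS = {1: {"canon", "eos", "5d"},
        2: {"nikon", "slr", "camera"},
        3: {"canon", "slr"}}
# tau(r) = wt(Sig(r)) - (1 - DELTA) * wt(r) for the three strings
TAU = {1: 14.0, 2: 3.8, 3: 3.2}

# list[t]: rids whose signature contains token t
SIL = {}
for rid, sig in SIGS.items():
    for tok in sig:
        SIL.setdefault(tok, set()).add(rid)


def wt(tokens):
    return sum(WEIGHTS.get(t, DEFAULT_WEIGHT) for t in set(tokens))


def is_evidence(m_tokens, rid):
    """Membership test from the definition of ES(m)."""
    overlap = wt(set(m_tokens) & SIGS[rid])
    return overlap >= min(DELTA * wt(m_tokens), TAU[rid])


def es_direct(m_tokens):
    """ES(m) computed from scratch over the whole dictionary."""
    return {rid for rid in SIGS if is_evidence(m_tokens, rid)}


def es_extend(m_tokens, es_m, t):
    """ES(m.t) computed from ES(m): only ES(m) | list[t] is checked."""
    candidates = es_m | SIL.get(t, set())
    return {rid for rid in candidates if is_evidence(m_tokens + [t], rid)}


def es_progressive(tokens):
    """Evidence sets of every prefix of `tokens`, built incrementally.
    Starting from the empty set is safe here because all weights and
    tau values are positive."""
    es, seen, history = set(), [], []
    for t in tokens:
        es = es_extend(seen, es, t)
        seen.append(t)
        history.append(set(es))
    return history


print(es_progressive("canon eos digital camera lens".split()))
```

In this toy dictionary the saving is invisible, but with a large R the candidate set checked at each step is typically far smaller than R itself.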
EXAMPLE
Document M: "... cannon eos digital camera lens ...", with m = "cannon eos digital camera" and t = "lens".
ES(m) = {r1}
list[t] for "lens" (weight 3.0) contains r22 and r53.
So only r1, r22 and r53 can possibly match "cannon eos digital camera lens".
FLOW OF EVIDENCE
EvITER stands for "Evidence ITERation".
THE STATIC THRESHOLD PROBLEM
How does this index work so far?
- "Get ready for δ = 0.8 please."
- "Please wait 30 min for index generation…"
- "Ready!"
- "Document M1, δ = 0.8. Go!"
- "…Extraction complete."
- "Document M2, and I want δ = 0.9…"
- "Sorry, please wait another 30 min for index regeneration…"
THE STATIC THRESHOLD PROBLEM (CONT'D)
This one seems better:
- "Get ready for δ ≥ 0.8 please."
- "Please wait 30 min for index generation…"
- "Ready!"
- "Document M1, δ = 0.8. Go!"
- "…Extraction complete."
- "Document M2, and I want δ = 0.9…"
- "…Extraction complete."
EXPERIMENTAL DATASETS
Paper titles from the DBLP website.
Author names from the DBLP website.
RESULTS
Fig. Performance under different k (δ = 0.85)
PERFORMANCE
Fig. Performance under different thresholds (k = 3)
CONCLUSION
The method causes no false negatives.
It achieves a good balance between the two phases, filtration and verification.
EvITER eliminates the duplicate computation shared by similar substrings.
The approach is both effective and efficient.
THANK YOU!