Published by Albert Lynch. Modified over 9 years ago.
1
Presented by: Aneeta Kolhe
2
Named entity recognition (NER) extracts named entities from text, here by finding approximate matches against a dictionary of known entities.
It is an important task for information extraction and integration, text mining, and web search.
3
Approximate dictionary matching
Previous solution: token-based similarity constraints
Proposed solution: neighborhood generation method
4
The previous token-based solution uses Jaccard coefficient similarity.
It may miss some matches.
It may return too many matches.
5
Example: given the entity “al-qaida”
“al-qaeda” or “al-qa’ida” won’t be matched unless a low Jaccard similarity threshold of 0.33 is used.
At that threshold, the entity will also match “al gore” as well as “al pacino”.
Hence we use edit distance.
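The 0.33 figure can be checked with a minimal sketch. The tokenizer below (lowercase, split on anything that is not a letter) is an assumption; the slides do not say how entities are tokenized.

```python
import re

def tokens(text):
    """Assumed tokenizer: lowercase the text and keep runs of letters."""
    return set(re.findall(r"[a-z]+", text.lower()))

def jaccard(a, b):
    """Jaccard coefficient of the two token sets."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)

# A spelling variant and an unrelated name get the same score,
# because each shares only the token "al" with the entity.
print(round(jaccard("al-qaida", "al-qaeda"), 2))   # 0.33
print(round(jaccard("al-qaida", "al gore"), 2))    # 0.33
print(round(jaccard("al-qaida", "al pacino"), 2))  # 0.33
```

Under this tokenization the threshold cannot separate the misspelling from the false matches, which is the motivation for a character-level measure.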
6
Problem Definition:
Given: a document D, a dictionary E of entities, and an edit distance threshold τ
To find: all substrings in D that are within edit distance τ of one of the entities in E
Solution:
Iterate through all the valid substrings of the document D, considering each substring as a query segment.
For each query segment, issue a similarity selection query to the dictionary to retrieve the set of entities that satisfy the constraint.
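As a baseline, the problem definition can be written down directly. This is a brute-force sketch, not the indexed method of the later slides; the function names and the choice of valid substring lengths (entity length plus or minus the threshold) are assumptions made for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance by dynamic programming, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]

def naive_extract(doc, entities, tau):
    """Return (start, substring, entity) for every substring within tau of an entity."""
    lmin = max(1, min(len(e) for e in entities) - tau)
    lmax = max(len(e) for e in entities) + tau
    results = set()
    for start in range(len(doc)):
        for length in range(lmin, lmax + 1):
            if start + length > len(doc):
                break
            seg = doc[start:start + length]
            for e in entities:
                if edit_distance(seg, e) <= tau:
                    results.add((start, seg, e))
    return results

hits = naive_extract("the al-qaeda cell", ["al-qaida"], 1)
print((4, "al-qaeda", "al-qaida") in hits)  # True
```

Every substring is compared with every entity, so the cost grows with the product of document length and dictionary size. The index described in the remaining slides exists to avoid that.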
7
Partition each string into kτ = ⌈(τ + 1)/2⌉ pieces. With at most τ edit errors in total, at least one partition then has at most one edit error.
Example: τ = 3, kτ = 2
s = [abcdefghijkl], s’ = [axxbcdefghxijkl]
s = [abcdef], [ghijkl]
s’ = [axxbcde], [fghxijkl]
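A minimal sketch of the partitioning step follows. The split rule (equal-sized pieces, with any remainder going to the last piece) is an assumption; it happens to reproduce both splits shown in the example.

```python
import math

def num_partitions(tau):
    """Smallest k such that tau errors cannot put two errors in every piece."""
    return math.ceil((tau + 1) / 2)

def partition(s, tau):
    """Split s into k pieces; the last piece absorbs the remainder (assumed rule)."""
    k = num_partitions(tau)
    size = len(s) // k
    parts = [s[i * size:(i + 1) * size] for i in range(k - 1)]
    parts.append(s[(k - 1) * size:])
    return parts

print(partition("abcdefghijkl", 3))     # ['abcdef', 'ghijkl']
print(partition("axxbcdefghxijkl", 3))  # ['axxbcde', 'fghxijkl']
```

In the example, the three errors fall two in the first piece and one in the second, so the second piece is the one with at most one error.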
8
Example: shifting the first partition of s by 2 => s = [cdef]; scaling it by −1 => s = [cdefg]
Transformation rules:
First partition: we only need to consider scaling within the range [−2, 2].
Last partition: we only need to consider the combination of the same amount of shifting and scaling within the range [−τ, τ] (so that the last character is always included in the resulting substring).
Remaining partitions: we need to consider shifting within the range [−τ, τ] and scaling within the range [−2, 2].
9
1st partition: 5 variations
Intermediate partitions: 5 * (2τ + 1) variations each
Last partition: (2τ + 1) variations
Total amount of the 1-variants generated = O(m + 2).
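The variation counts can be reproduced by enumerating (shift, scale) pairs under the three rules. The convention used here is an assumption, since the slides do not pin it down: a shift moves the start of the partition, a scale changes its length, and for the last partition the scale is the negative of the shift so that the end stays fixed. Variations that would fall outside the string are skipped.

```python
import math

def transformations(pid, k, tau):
    """(shift, scale) pairs for partition pid (0-based) out of k partitions."""
    if pid == 0:                      # first partition: scaling only
        return [(0, scale) for scale in range(-2, 3)]
    if pid == k - 1:                  # last partition: end stays fixed
        return [(shift, -shift) for shift in range(-tau, tau + 1)]
    return [(shift, scale)            # intermediate partitions
            for shift in range(-tau, tau + 1)
            for scale in range(-2, 3)]

def variations(s, tau):
    """All in-bounds partition variations of s as (pid, substring) pairs."""
    k = math.ceil((tau + 1) / 2)
    size = len(s) // k
    out = []
    for pid in range(k):
        start = pid * size
        length = size if pid < k - 1 else len(s) - start
        for shift, scale in transformations(pid, k, tau):
            lo, hi = start + shift, start + shift + length + scale
            if 0 <= lo < hi <= len(s):
                out.append((pid, s[lo:hi]))
    return out

print(len(transformations(0, 3, 5)))  # 5
print(len(transformations(1, 3, 5)))  # 55, i.e. 5 * (2*5 + 1)
print(len(transformations(2, 3, 5)))  # 11, i.e. 2*5 + 1
print(len(variations("abcdefghijkl", 3)))  # 12
```

For τ = 3 there are only two partitions, so the 12 variations are the 5 scalings of the first partition plus the 7 shifts of the last one.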
11
s = [abcdef], [ghijkl]
s’ = [axxbcde], [fghxijkl]
The second partition of s’, [fghxijkl], has a 1-variant match with [fghijkl], a partition variation generated from the second partition of s.
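The match can be checked with a small sketch. The slides do not define a k-variant family, so the code assumes it is the set of strings obtained by deleting at most k characters; under that assumption, two strings have a 1-variant match when their 1-variant families intersect.

```python
def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    family = {s}
    frontier = {s}
    for _ in range(k):
        # Delete one more character from every string found in the last round.
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        family = family | frontier
    return family

query_partition = "fghxijkl"   # second partition of s'
entity_variation = "fghijkl"   # variation of the second partition of s

common = gen_variants(query_partition, 1) & gen_variants(entity_variation, 1)
print("fghijkl" in common)  # True: deleting the x from the query side gives a match
```

This simplified family keeps only the resulting strings, not the deleted positions.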
12
If a partition (variation) is longer than a prefix length lp, we only use its lp-prefix to generate its 1-variants.
Example: assume lp is set to 3. Then 1-variants are generated only from the 3-character prefixes of the partition variations.
By setting lp ≤ m/kτ − 2, the total number of 1-variants generated is further reduced to O(lp τ²).
13
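Short and long entities in the dictionary are indexed separately and stored in two inverted indexes, Ishort and Ilong.
Short entities: for each entity whose length is smaller than kτ lp + τ, the τ-variant family of its lp-prefix is indexed in Ishort.
Long entities: for each entity whose length is at least kτ lp, the lp-prefix of each partition variation is used to generate its 1-variant family, which is indexed in Ilong.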
14
Algorithm: BuildIndex (E, τ, lp)
for each e ∈ E do
    if |e| < kτ lp + τ then
        V ← GenVariants(e[1 .. min(lp, |e|)], τ); /* The GenVariants(s, k) function generates the k-variant family of string s */
        for each v ∈ V do
            Ishort[v] ← Ishort[v] ∪ {e};
    if |e| ≥ kτ lp then
        P ← the set of kτ partitions of e;
        for each i-th partition p ∈ P do
            PT ← TransformPartition(p); /* according to the three transformation rules in Section 3.1 */
            for each partition variation pT ∈ PT do
                V ← GenVariants(pT[1 .. lp], 1);
                for each v ∈ V do
                    Ilong[v] ← Ilong[v] ∪ {(e, i)};
return (Ishort, Ilong)
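The index construction can be sketched in Python as follows. Several details are assumptions rather than the authors' definitions: the k-variant family is taken to be deletion-based, a shift moves a partition's start while a scale changes its length, partitions are equal-sized with the remainder in the last one, and out-of-bounds variations are skipped. The names `build_index`, `gen_variants`, and `transformations` are illustrative.

```python
import math
from collections import defaultdict

def gen_variants(s, k):
    """Assumed k-variant family: s with at most k characters deleted."""
    family = frontier = {s}
    for _ in range(k):
        frontier = {t[:i] + t[i + 1:] for t in frontier for i in range(len(t))}
        family = family | frontier
    return family

def transformations(pid, k, tau):
    """(shift, scale) pairs allowed for partition pid (0-based) of k."""
    if pid == 0:          # also covers the single-partition case k == 1
        return [(0, sc) for sc in range(-2, 3)]
    if pid == k - 1:      # keep the end of the last partition fixed
        return [(sh, -sh) for sh in range(-tau, tau + 1)]
    return [(sh, sc) for sh in range(-tau, tau + 1) for sc in range(-2, 3)]

def build_index(entities, tau, lp):
    k = math.ceil((tau + 1) / 2)
    i_short = defaultdict(set)   # variant -> entities
    i_long = defaultdict(set)    # variant -> (entity, partition id) pairs
    for e in entities:
        if len(e) < k * lp + tau:
            for v in gen_variants(e[:lp], tau):
                i_short[v].add(e)
        if len(e) >= k * lp:
            size = len(e) // k
            for pid in range(k):
                start = pid * size
                length = size if pid < k - 1 else len(e) - start
                for shift, scale in transformations(pid, k, tau):
                    lo, hi = start + shift, start + shift + length + scale
                    if not (0 <= lo < hi <= len(e)):
                        continue
                    prefix = e[lo:min(hi, lo + lp)]   # lp-prefix of the variation
                    for v in gen_variants(prefix, 1):
                        i_long[v].add((e, pid))
    return i_short, i_long

i_short, i_long = build_index(["abcdefghijkl", "abc"], tau=3, lp=3)
print(("abcdefghijkl", 1) in i_long["fgh"])  # True
print("abc" in i_short["abc"])               # True
```

With prefixes in use, all five scalings of the first partition yield the same prefix, so the sets in the index absorb the duplicates.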
15
Algorithm: MatchDocument (D, E, τ)
for each starting position p ∈ [1, |D| − Lmin + τ + 1] do
    SearchLong (D[p .. p + lp − 1], E, τ); /* matching entities no shorter than kτ lp */
    SearchShort (D[p .. p + lp − 1], E, τ); /* matching entities of length in [Lmin, kτ lp) */
16
Algorithm: SearchLong (s, E, τ)
R ← ∅; /* holds results */
C ← ∅; /* holds candidates */
V ← GenVariants(s, 1); /* generate the 1-variant family */
for each v ∈ V do
    for each (e, pid) ∈ Ilong[v] do
        C ← C ∪ {(e, pid)}; /* duplicates removed */
for each (e, pid) ∈ C do
    S ← QuerySegmentInstantiation(e, pid); /* returns the set of query segment candidates for e */
    for each seg ∈ S do
        if Verify(seg, e) = true then
            R ← R ∪ {(seg, e)};
return R
17
SearchShort (s)
We need to generate the τ-variant families for each possible length l between Lmin − τ and lp.
If the current query segment is shorter than lp, every candidate pair formed by probing the index needs to be verified.
Otherwise, we need to perform verification for 2τ + 1 possible query segments.
18
Example: enumerate the 1-variants of the string [abcdef] from left to right.
Suppose no variant in the index starts with abc.
The algorithm still enumerates the other three 1-variants containing abc.
To avoid this, a second parameter lpp is introduced and set to lp/2.
19
Consider 4 possible cases:
Prefix Match | Suffix Match | Action
True         | True         | enumerate all 1-variants of q[1 .. lp]
False        | False        | discard q as there is no match
False        | True         | enumerate all 1-variants of q[1 .. lpp]
True         | False        | enumerate all 1-variants of q[(lpp + 1) .. lp]
20
Successfully reduced the size of the neighborhood
Proposed an efficient query processing algorithm
Optimized the algorithm to share computation
Avoided unnecessary variant enumeration