Exploiting Dictionaries in Named Entity Extraction: Combining Semi-Markov Extraction Processes and Data Integration Methods
William W. Cohen, Sunita Sarawagi
Presented by: Quoc Le
CS591CXZ – General Web Mining
Motivation
Information Extraction
–Deriving structured data from unstructured data.
–Using structured data as guidance to improve extraction from unstructured sources.
Named Entity Recognition
–Extracting names, locations, times.
–Improving NER systems with external dictionaries.
Approaches
Look up entities in a (large) dictionary.
–Surface forms differ; prone to noise and errors.
Take an existing NER system and link it to an external dictionary.
–High-performance NER classifies individual words into classes, vs. measuring similarity of an entire entity to a dictionary entry.
Problem Formulation
Name finding as word tagging
–E.g.: (Fred) Person (please stop by) Other (my office) Loc (this afternoon) Time
–x: sequence of words maps to y: sequence of labels; training data is (x,y) pairs.
–Model the conditional distribution of y given x (as in a conditional Markov model / HMM): P(y|x) = ∏_i P(y_i | y_{i−1}, x).
Semi-Markovian NER
Segmentation: S = ⟨s_1, …, s_p⟩, where each segment s_j = (t_j, u_j, l_j) has a start position t_j, an end position u_j, and a label l_j.
–E.g.: S = {(1,1,Person), (2,4,Oth), (5,6,Loc), (7,8,Time)}
Conditional semi-Markov model (CSMM): labels attach to whole segments rather than individual tokens.
Inference and learning problems.
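As a minimal illustration (the helper below is a hypothetical sketch, not code from the paper), a segmentation can be represented as a list of (start, end, label) triples that must cover the token sequence contiguously:

```python
# Hypothetical sketch: represent a segmentation S as (start, end, label) triples.
# Positions are 1-based and inclusive, matching the slide's example.

def is_valid_segmentation(segments, n_tokens):
    """Check that the segments cover tokens 1..n_tokens contiguously."""
    expected_start = 1
    for start, end, label in segments:
        if start != expected_start or end < start:
            return False
        expected_start = end + 1
    return expected_start == n_tokens + 1

# The slide's example: "Fred please stop by my office this afternoon" (8 tokens)
S = [(1, 1, "Person"), (2, 4, "Oth"), (5, 6, "Loc"), (7, 8, "Time")]
print(is_valid_segmentation(S, 8))  # → True
```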
Comparison to other approaches
CMM: CSMM predicts tag + segment position, not just a per-token tag.
Order-L CMM: CSMM uses the segment's own tokens (not just the previous ones).
Treating the external dictionary as extra training examples: prone to misspellings, large dictionaries, mismatched dictionaries (good when training data is limited).
N-gram classification: candidate entities may overlap.
Using the dictionary to bootstrap the search for extraction patterns: rule-based vs. probabilistic.
Training SMM
Modified version of Collins' perceptron-based algorithm for training HMMs.
Assume a local feature function f which maps a pair (x,S) and an index j to a vector of features f(j,x,S).
Define the global feature vector: F(x,S) = Σ_j f(j,x,S).
Let W be a weight vector over the components of F.
–Inference: compute V(W,x), the Viterbi decoding of x with W.
–Training: learn the W that leads to the best performance.
The Viterbi search can be done with a recurrence over V_{x,W}(i,y).
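The segment-level Viterbi recurrence can be sketched as follows. This is an assumed form, not the paper's code: `score_segment` is a hypothetical stand-in for W · f over one segment, and segment lengths are capped at `max_len`:

```python
# Sketch of the semi-Markov Viterbi recurrence V(i, y): best score of a
# segmentation of tokens 0..i-1 whose last segment has label y.

def semi_markov_viterbi(x, labels, score_segment, max_len):
    """Return the best segmentation of token list x as (start, end, label)
    triples (0-based, inclusive).
    score_segment(start, end, label, prev_label) -> float  (hypothetical)."""
    n = len(x)
    V = [{y: float("-inf") for y in labels} for _ in range(n + 1)]
    back = [{y: None for y in labels} for _ in range(n + 1)]
    for y in labels:
        V[0][y] = 0.0  # empty prefix
    for i in range(1, n + 1):
        for y in labels:
            # Try every segment length d ending at position i-1
            for d in range(1, min(max_len, i) + 1):
                for y_prev in labels:
                    s = V[i - d][y_prev] + score_segment(i - d, i - 1, y, y_prev)
                    if s > V[i][y]:
                        V[i][y] = s
                        back[i][y] = (i - d, y_prev)
    # Trace back the best segmentation
    y = max(labels, key=lambda lab: V[n][lab])
    segs, i = [], n
    while i > 0:
        start, y_prev = back[i][y]
        segs.append((start, i - 1, y))
        i, y = start, y_prev
    return list(reversed(segs))
```

The only change from token-level Viterbi is the inner loop over the segment length d, which is what lets features see a whole candidate segment at once.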
Perceptron-based SMM Learning
Let SCORE(x,W,S) = W · F(x,S).
For each example (x_t, S_t):
–Find the K segmentations that give the highest score.
–Let W_{t+1} = W_t.
–For each S_i such that SCORE(x_t,W_t,S_i) > (1−β) · SCORE(x_t,W_t,S_t), update: W_{t+1} = W_{t+1} + F(x_t,S_t) − F(x_t,S_i).
Return the average of all W_t.
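The update rule above can be sketched as one training pass. This is a hedged sketch of the slide's pseudocode, not the authors' implementation: `topK_segmentations` and `global_features` are hypothetical callables supplied by the caller, and weights are sparse dicts:

```python
# Sketch of the perceptron-style update: penalize each of the top-K candidate
# segmentations whose score comes within a (1 - beta) margin of the gold score.

def dot(W, f):
    """Sparse dot product of weight dict W and feature dict f."""
    return sum(W.get(k, 0.0) * v for k, v in f.items())

def add_scaled(W, f, c):
    """In-place W += c * f over sparse feature dicts."""
    for k, v in f.items():
        W[k] = W.get(k, 0.0) + c * v

def average(weight_history):
    """Average the per-example weight vectors (the returned W)."""
    keys = set().union(*(h.keys() for h in weight_history))
    n = len(weight_history)
    return {k: sum(h.get(k, 0.0) for h in weight_history) / n for k in keys}

def perceptron_train(examples, W0, topK_segmentations, global_features,
                     K=2, beta=0.05):
    W, history = dict(W0), []
    for x_t, S_t in examples:
        candidates = topK_segmentations(x_t, W, K)  # K best under current W
        gold_score = dot(W, global_features(x_t, S_t))
        W_next = dict(W)
        for S_i in candidates:
            if dot(W, global_features(x_t, S_i)) > (1 - beta) * gold_score:
                add_scaled(W_next, global_features(x_t, S_t), +1.0)
                add_scaled(W_next, global_features(x_t, S_i), -1.0)
        W = W_next
        history.append(dict(W))
    return average(history)
```

Returning the average of the per-example weight vectors, rather than the final one, is the usual averaged-perceptron stabilization from Collins' algorithm.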
Features
Examples: value of segment, length of segment, left window, right window, etc.
Most can also be applied to an HMM NER system.
Segment-level features are more powerful and meaningful, e.g. "X+ X+" is more indicative of a name than "X+".
Distance features: similarity to entries in an external dictionary.
Distance Features
D: dictionary; d: distance metric; e: entity name in D; e': candidate segment.
Define:
–g_{D,d}(e') = min_{e∈D} d(e, e').
Distance metrics: Jaccard (word-level), Jaro-Winkler (character-level), TFIDF (word-level), SoftTFIDF (hybrid measure).
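A minimal sketch of such a feature, using word-level Jaccard distance as the metric (the paper also uses Jaro-Winkler, TFIDF, and SoftTFIDF, which are omitted here; the function names are hypothetical):

```python
# Sketch of the dictionary distance feature g_{D,d}(e') = min over e in D of d(e, e').

def jaccard_distance(a, b):
    """Word-level Jaccard distance: 1 - |A ∩ B| / |A ∪ B| over token sets."""
    A, B = set(a.lower().split()), set(b.lower().split())
    return 1.0 - len(A & B) / len(A | B)

def min_dictionary_distance(segment, dictionary, d=jaccard_distance):
    """Distance from a candidate segment to its closest dictionary entry."""
    return min(d(e, segment) for e in dictionary)

D = ["William Cohen", "Sunita Sarawagi"]
print(min_dictionary_distance("William W. Cohen", D))  # → 0.333... (1 - 2/3)
```

Because the feature takes a minimum over the whole dictionary, a segment scores well if it is close to any one entry, which is what makes soft matching robust to surface-form variation.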
Experiments
HMM-VP(1): predicts two labels y: one for tokens inside an entity and one for tokens outside.
HMM-VP(4): encoding scheme using labels with tags unique, begin, end, continue, and other.
–E.g.: (Fred) Person_unique, please stop by the (fourth) Loc_begin (floor meeting) Loc_continue (room) Loc_end
SMM (K = 2, E = 20, β = 0.05): any, first, last, exact.
Data sets: addresses in India, students, jobs.
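The HMM-VP(4) encoding can be sketched as a conversion from segments to per-token tags (the helper name is hypothetical; positions follow the slide's 1-based convention):

```python
# Sketch of the unique/begin/continue/end encoding from the slide.

def segments_to_tags(segments, n_tokens):
    """Convert (start, end, label) segments (1-based, inclusive) to per-token tags."""
    tags = ["Other"] * n_tokens
    for start, end, label in segments:
        if label == "Other":
            continue
        if start == end:
            tags[start - 1] = label + "_unique"
        else:
            tags[start - 1] = label + "_begin"
            tags[end - 1] = label + "_end"
            for i in range(start + 1, end):
                tags[i - 1] = label + "_continue"
    return tags

# "Fred please stop by the fourth floor meeting room" (9 tokens)
segs = [(1, 1, "Person"), (2, 5, "Other"), (6, 9, "Loc")]
print(segments_to_tags(segs, 9))
```

This four-tag scheme lets a token-level HMM recover entity boundaries, which is why HMM-VP(4) outperforms the two-label HMM-VP(1) baseline.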
Considerations
Evaluating exact matching against a dictionary: low recall and errors, but a good indication of the dictionary's quality.
Normalizing dictionary entries: both helps and hurts, e.g. "Will" vs. "will".
For HMMs, we could use partial (token-level) distances between tokens and dictionary entries.
Segment size is bounded by a constant.
Evaluation
Combinations of NER methods: without an external dictionary, with binary dictionary features, with distance features.
Train on only 10% of the data, test on the rest; repeat 7 times and report the average.
Partial extraction gets no credit.
Metrics: precision, recall, and F1.
Results
SMM-VP is best: it outperforms HMM-VP(4) in 13 out of 15 cases.
HMM-VP(1) is worst: HMM-VP(4) outperforms HMM-VP(1) in 13 out of 15 cases.
Binary dictionary features are helpful, but distance features are more helpful.
See Table 1 (details) and Table 4 (summary).
Effects
Improvements over Collins' method: Table 5.
The gap between SMM and HMM-VP(4) decreases as training size increases, but they converge at different speeds (Table 2).
Higher-order HMMs do not improve performance much (Table 6).
Alternative (less-related) dictionaries: both methods seem fairly robust.
Conclusion & Questions
Conclusion
–Incorporates external knowledge nicely.
–Applicable to sequential models.
–Improvements are significant, but the method uses more resources and runs 3–5 times slower.
Questions:
–What if the dictionary is not a superset of the entities? What about an unrelated dictionary?
–Harder types of data, where entities are not easy to name.