1
Cross-Domain Bootstrapping for Named Entity Recognition
Ang Sun, Ralph Grishman
New York University
EOS Workshop at SIGIR 2011, Beijing, July 28, 2011
2
Outline
1. Named Entity Recognition (NER)
2. Domain Adaptation Problem for NER
3. Cross-Domain Bootstrapping
   3.1 Feature Generalization with Word Clusters
   3.2 Instance Selection Based on Multiple Criteria
4. Conclusion
3
1. Named Entity Recognition (NER)
Two missions: identification (finding each NAME) and classification (assigning it a type such as GPE, ORG, or PERSON).
Example: U.S. Defense Secretary Donald H. Rumsfeld discussed the resolution …
(U.S. - GPE; Defense - ORG; Donald H. Rumsfeld - PERSON)
4
2. Domain Adaptation Problem for NER
Our NER system performs well on in-domain data (F-measure 83.08) but poorly on out-of-domain data (F-measure 65.09).
Source domain (news articles): George Bush; Donald H. Rumsfeld; … Department of Defense …
Target domain (reports on terrorism): Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud; … Al-Qaeda in Iraq …
5
2. Domain Adaptation Problem for NER
1. No annotated data from the target domain
2. Many words are out-of-vocabulary
3. Naming conventions differ:
   - Length: short vs. long
     source: George Bush; Donald H. Rumsfeld
     target: Abdul Sattar al-Rishawi; Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud
   - Capitalization: weaker in the target
4. Name variation occurs often in the target: Shaikh, Shaykh, Sheikh, Sheik, …
Goal: automatically adapt the source-domain tagger to the target domain without annotating any target-domain data.
6
3. Cross-Domain Bootstrapping
1. Train a tagger from the labeled source data
2. Tag all unlabeled target data with the current tagger
3. Select good tagged words (e.g., "President Assad") and add them to the labeled data
4. Re-train the tagger and repeat
[Diagram: labeled source data trains a tagger (with feature generalization); the tagger tags the unlabeled target data; instance selection based on multiple criteria feeds the selected instances back into the labeled data.]
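The loop can be written down compactly. Below is a minimal Python sketch, assuming the MEMM trainer, the tagger, and the instance selector (developed in Section 3.2) are supplied as functions; these names are stand-ins, not the paper's code.

```python
# Minimal sketch of cross-domain bootstrapping. `train`, `tag`, and `select`
# are caller-supplied stand-ins for the MEMM trainer, the tagger, and the
# multi-criteria instance selector of Section 3.2.
def cross_domain_bootstrap(labeled_source, unlabeled_target,
                           train, tag, select, n_rounds=10):
    labeled = list(labeled_source)       # start from source annotations only
    tagger = train(labeled)              # 1. train a tagger from labeled source data
    for _ in range(n_rounds):
        tagged = tag(tagger, unlabeled_target)  # 2. tag all unlabeled target data
        labeled.extend(select(tagged))          # 3. add good tagged instances
        tagger = train(labeled)                 # 4. re-train the tagger
    return tagger
```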
7
3.1 Feature Generalization with Word Clusters
The source model:
- A sequential model that assigns name classes to a sequence of tokens
- Each name type is split into two classes, e.g., B_PER (beginning of PERSON) and I_PER (continuation of PERSON)
- A Maximum Entropy Markov Model (McCallum et al., 2000)
- Customary features

Token:  U.S.   Defense  Secretary  Donald  H.     Rumsfeld
Class:  B_GPE  B_ORG    O          B_PER   I_PER  I_PER
8
3.1 Feature Generalization with Word Clusters
The source/seed model extracts customary features from a context window (t_{i-2}, t_{i-1}, t_i, t_{i+1}, t_{i+2}). For the token "Donald" in the example above:

currentToken            Donald
wordType_currentToken   initial_capitalized
previousToken_-1        Secretary
previousToken_-1_class  O
previousToken_-2        Defense
nextToken_+1            H.
…                       …
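As an illustration only, here is a small Python sketch of such window features; the feature names mirror the table above, while word_type is a simplified stand-in for the real word-shape feature.

```python
# Hypothetical sketch of the window features; not the paper's actual extractor.
def word_type(tok):
    # crude word-shape feature, e.g. "initial_capitalized" for "Donald"
    if tok[0].isupper() and tok[1:].islower():
        return "initial_capitalized"
    if tok.isupper():
        return "all_capitalized"
    return "other"

def window_features(tokens, i, prev_classes):
    # prev_classes holds the name classes already assigned to earlier tokens
    feats = {"currentToken": tokens[i],
             "wordType_currentToken": word_type(tokens[i])}
    if i >= 1:
        feats["previousToken_-1"] = tokens[i - 1]
        feats["previousToken_-1_class"] = prev_classes[i - 1]
    if i >= 2:
        feats["previousToken_-2"] = tokens[i - 2]
    if i + 1 < len(tokens):
        feats["nextToken_+1"] = tokens[i + 1]
    return feats

tokens = ["U.S.", "Defense", "Secretary", "Donald", "H.", "Rumsfeld"]
print(window_features(tokens, 3, ["B_GPE", "B_ORG", "O"]))
```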
9
3.1 Feature Generalization with Word Clusters
Build a word hierarchy from a 10M-word corpus (source + target) using the Brown word clustering algorithm, and represent each word as a bit string:

Bit string     Examples
110100011      John, James, Mike, Steven
11010011101    Abdul, Mustafa, Abi, Abdel
11010011111    Shaikh, Shaykh, Sheikh, Sheik
111111110      Qaeda, Qaida, qaeda, QAEDA
00011110000    FBI, FDA, NYPD
000111100100   Taliban
10
3.1 Feature Generalization with Word Clusters
Add an additional layer of features that include word clusters. A lexical feature such as currentToken = John fires only for "John", but a cluster-prefix feature such as currentPrefix3 = 110 also fires for target words like "Abdul" (both bit strings above begin with 110). To avoid commitment to a single cluster, cut the word hierarchy at different levels.
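A minimal sketch of these prefix features, using the bit strings from the table above; the particular cut depths (3, 5, 7) are illustrative assumptions, not the settings used in the paper.

```python
# Cluster-prefix features: cut each word's Brown bit string at several depths
# so the model is not committed to a single cluster granularity.
BROWN = {"John": "110100011", "Abdul": "11010011101", "Shaikh": "11010011111"}

def cluster_features(token, depths=(3, 5, 7)):   # depths are illustrative
    bits = BROWN.get(token)
    if bits is None:
        return {}
    return {f"currentPrefix{d}": bits[:d] for d in depths if d <= len(bits)}

# "John" and "Abdul" share no lexical feature, but they do share coarse
# cluster prefixes, so features learned from source names also fire on
# unseen target names:
assert cluster_features("John")["currentPrefix5"] == cluster_features("Abdul")["currentPrefix5"]
```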
11
3.1 Feature Generalization with Word Clusters
Performance on the target domain:
- The test set contains 23K tokens
- 771 PERSON, 585 ORGANIZATION, and 559 GPE instances; all other tokens belong to the not-a-name class
- The cluster features yield a 4-point improvement in F-measure
12
3.2 Instance Selection Based on Multiple Criteria
Single-domain bootstrapping uses a confidence measure as the single selection criterion. In a cross-domain setting, however, the most confidently labeled instances are highly correlated with the source domain and contain little information about the target domain. We therefore propose multiple criteria.
Criterion 1: Novelty - prefer target-specific instances (promote "Abdul" instead of "John"); see the sketch below.
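A minimal sketch of the novelty filter, assuming token counts over the source-domain data; the zero-frequency threshold is an illustrative choice.

```python
# Keep only instances whose word is target-specific, i.e. unseen (or rare)
# in the source-domain vocabulary. Counts and threshold are toy assumptions.
def is_novel(token, source_counts, max_source_freq=0):
    return source_counts.get(token, 0) <= max_source_freq

source_counts = {"John": 412, "Secretary": 230}      # toy source counts
print(is_novel("Abdul", source_counts))   # True  -> promote
print(is_novel("John", source_counts))    # False -> filter out
```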
13
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence - prefer confidently labeled instances.
Local confidence: based on the instance's local features.
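One plausible reading of local confidence, offered only as an assumption (the slide does not give the formula): the margin between the tagger's two most probable classes for the instance.

```python
# Assumed margin-style local confidence; not necessarily the paper's measure.
def local_confidence(class_probs):
    # class_probs: posterior probability per name class from the tagger,
    # e.g. {"B_PER": 0.75, "B_ORG": 0.15, "O": 0.10}
    best, second = sorted(class_probs.values(), reverse=True)[:2]
    return best - second

print(local_confidence({"B_PER": 0.75, "B_ORG": 0.15, "O": 0.10}))  # 0.6
```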
14
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence
Global confidence: based on corpus statistics. For example, "Abdul" is tagged PER in 9 of its 10 occurrences across the corpus, so PER receives a high global confidence for this word:

 1  Prime Minister    Abdul  Karim Kabariti   PER
 2  warlord General   Abdul  Rashid Dostum    PER
 3  President A.P.J.  Abdul  Kalam will       PER
 4  President A.P.J.  Abdul  Kalam has        PER
 5  Abdullah bin      Abdul  Aziz ,           PER
 6  at King           Abdul  Aziz University  ORG
 7  Nawab Mohammed    Abdul  Ali ,            PER
 8  Dr Ali            Abdul  Aziz Al          PER
 9  Nayef bin         Abdul  Aziz said        PER
10  leader General    Abdul  Rashid Dostum    PER
15
3.2 Instance Selection Based on Multiple Criteria
Criterion 2: Confidence
Combined confidence: the product of the local and global confidence.
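A Python sketch of the confidence scores: global confidence follows the corpus-statistics idea above (the fraction of a word's occurrences that received the label, e.g. 9 of 10 for Abdul as PER), and the combined score is the stated product. The exact definitions in the paper may differ.

```python
from collections import Counter

def global_confidence(word, label, corpus_predictions):
    # corpus_predictions: (word, predicted_label) pairs over the whole corpus
    labels = Counter(l for w, l in corpus_predictions if w == word)
    total = sum(labels.values())
    return labels[label] / total if total else 0.0

def combined_confidence(local_conf, word, label, corpus_predictions):
    # combined confidence = local confidence * global confidence
    return local_conf * global_confidence(word, label, corpus_predictions)

# With the "Abdul" table above: 9 of 10 occurrences are tagged PER.
preds = [("Abdul", "PER")] * 9 + [("Abdul", "ORG")]
print(global_confidence("Abdul", "PER", preds))          # 0.9
print(combined_confidence(0.6, "Abdul", "PER", preds))   # 0.6 * 0.9
```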
16
3.2 Instance Selection Based on Multiple Criteria
Criterion 3: Density - prefer representative instances, which can be seen as centroid instances of the candidate set.
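A sketch of the density score, assuming some similarity function over instances (the slide does not specify one): an instance's density is taken here as its average similarity to all other candidates, so centroid-like instances score highest.

```python
# Density of candidate i: average similarity to every other candidate.
# `similarity` is an assumed callable over two instances.
def density(i, candidates, similarity):
    others = [c for j, c in enumerate(candidates) if j != i]
    if not others:
        return 0.0
    return sum(similarity(candidates[i], c) for c in others) / len(others)
```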
17
3.2 Instance Selection Based on Multiple Criteria
Criterion 4: Diversity - prefer a set of diverse instances over a set of similar ones.
Example: the context ", said * in his" yields highly confident, high-density (representative) instances, BUT continuing to promote such instances would gain no additional benefit.
18
3.2 Instance Selection Based on Multiple Criteria
Putting all the criteria together (see the sketch below):
1. Novelty: filter out source-dependent instances
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set
3. Density: rank the instances in the candidate set in descending order of density
4. Diversity: accept the first instance (with the highest density) in the candidate set, then select further candidates based on the diff measure
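An end-to-end sketch chaining the four criteria, reusing is_novel and density from the sketches above; the candidate-set size, the diff threshold, and the .word/.confidence attributes on instances are illustrative assumptions.

```python
# Hypothetical selection pipeline; relies on is_novel() and density() above.
def select_instances(instances, source_counts, similarity, diff,
                     candidate_size=100, min_diff=0.5):
    # 1. Novelty: filter out source-dependent instances.
    novel = [x for x in instances if is_novel(x.word, source_counts)]
    # 2. Confidence: the top-ranked instances form the candidate set.
    candidates = sorted(novel, key=lambda x: x.confidence, reverse=True)
    candidates = candidates[:candidate_size]
    # 3. Density: rank candidates in descending order of density.
    ranked = sorted(((density(i, candidates, similarity), c)
                     for i, c in enumerate(candidates)),
                    key=lambda pair: pair[0], reverse=True)
    # 4. Diversity: accept the densest candidate, then only candidates that
    #    differ enough (per the diff measure) from everything chosen so far.
    selected = []
    for _, cand in ranked:
        if not selected or all(diff(cand, s) >= min_diff for s in selected):
            selected.append(cand)
    return selected
```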
19
3.2 Instance Selection Based on Multiple Criteria
Results
20
4. Conclusion
- Proposed a general cross-domain bootstrapping algorithm for adapting a model trained only on a source domain to a target domain
- Improved the source model's F score by around 7 points
- This is achieved:
  1. without using any annotated data from the target domain
  2. without explicitly encoding any target-domain-specific knowledge into our system
- The improvement is largely due to:
  1. the feature generalization of the source model with word clusters
  2. the multi-criteria-based instance selection method