Rapid Training of Information Extraction with Local and Global Data Views
Dissertation Defense
Ang Sun, Computer Science Department, New York University


Rapid Training of Information Extraction with Local and Global Data Views
Dissertation Defense
Ang Sun, Computer Science Department, New York University
April 30, 2012

Committee: Prof. Ralph Grishman, Prof. Satoshi Sekine, Prof. Heng Ji, Prof. Ernest Davis, Prof. Lakshminarayanan Subramanian

Outline
I. Introduction
II. Relation Type Extension: Active Learning with Local and Global Data Views
III. Relation Type Extension: Bootstrapping with Local and Global Data Views
IV. Cross-Domain Bootstrapping for Named Entity Recognition
V. Conclusion

Part I Introduction

Tasks
1. Named Entity Recognition (NER)
2. Relation Extraction (RE)
   i. Relation Extraction between Names
   ii. Relation Mention Extraction

NER
Bill Gates, born October 28, 1955 in Seattle, is the former chief executive officer (CEO) and current chairman of Microsoft.

NER output:
Name         Type
Bill Gates   PERSON
Seattle      LOCATION
Microsoft    ORGANIZATION

RE
i. Relation Extraction between Names
After NER: Adam, a data analyst for ABC Inc.
RE output: Employment(Adam, ABC Inc.)

ii. Relation Mention Extraction
Entity Extraction on: Adam, a data analyst for ABC Inc.

Entity Mention    Entity
Adam              {Adam, a data analyst}
a data analyst    {Adam, a data analyst}
ABC Inc.          {ABC Inc.}

RE
ii. Relation Mention Extraction
Input: Adam, a data analyst for ABC Inc.
RE output: Employment(a data analyst, ABC Inc.)

Prior Work – Supervised Learning
Learn with labeled data
– e.g., <a data analyst, ABC Inc., Employment>

Prior Work – Supervised Learning
Annotate each token as P (part of a PERSON name) or O (other):
O.J./P Simpson/P was/O arrested/O and/O charged/O with/O murdering/O his/O ex-wife/O ,/O Nicole/P Brown/P Simpson/P ,/O and/O her/O friend/O Ronald/P Goldman/P in/O 1994/O ./O
Expensive!

Prior Work – Supervised Learning
Expensive: a trained model is typically domain-dependent
– Porting it to a new domain usually involves annotating data from scratch

Prior Work – Supervised Learning
Annotating sentence after sentence in this way... 15 minutes... 1 hour... 2 hours...
Annotation is tedious!

Prior Work – Semi-supervised Learning
Learn with both
– labeled data (small)
– unlabeled data (large)
The learning is an iterative process (a minimal sketch follows below):
1. Train an initial model with the labeled data
2. Apply the model to tag unlabeled data
3. Select good tagged examples as additional training examples
4. Re-train the model
5. Repeat from Step 2
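To make the loop concrete, here is a minimal self-training sketch in Python. The `train`, `tag`, and `select_confident` callables are hypothetical stand-ins for task-specific components, not the dissertation's actual implementation, and the stopping rule is a simplification.

```python
def self_train(labeled, unlabeled, train, tag, select_confident, max_iter=50):
    """Generic semi-supervised self-training loop.

    labeled:   list of (example, label) pairs (small)
    unlabeled: list of hashable examples (large)
    train/tag/select_confident: task-specific callables (hypothetical here)
    """
    model = train(labeled)                                  # Step 1
    for _ in range(max_iter):
        if not unlabeled:
            break
        # Step 2: apply the current model to the unlabeled pool
        predictions = [(x, tag(model, x)) for x in unlabeled]
        # Step 3: keep only confidently tagged examples
        good = select_confident(predictions)
        if not good:
            break  # nothing confident left -- a crude stopping criterion
        labeled.extend(good)
        chosen = {g[0] for g in good}
        unlabeled = [x for x in unlabeled if x not in chosen]
        model = train(labeled)                              # Step 4
    return model
```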

Prior Work – Semi-supervised Learning
Problem 1: Semantic Drift
Example 1: a learner for PERSON names ends up learning flower names, because women's first names intersect with names of flowers (Rose, ...)
Example 2: a learner for LocatedIn relation patterns ends up learning patterns for other relations (birthPlace, governorOf, ...)

Prior Work – Semi-supervised Learning
Problem 2: Lacks a good stopping criterion
Most systems
– either use a fixed number of iterations
– or use a labeled development set to detect the right stopping point

Prior Work – Unsupervised Learning
Learn with only unlabeled data
Unsupervised Relation Discovery
– Context-based clustering
– Group pairs of named entities with similar contexts into the same relation cluster

Prior Work – Unsupervised Learning
Unsupervised Relation Discovery (Hasegawa et al., 2004)

Prior Work – Unsupervised Learning
Unsupervised Relation Discovery
– The semantics of clusters are usually unknown
– Some clusters are coherent → we can consistently label them
– Some are mixed, containing different topics → difficult to label

Part II Relation Type Extension: Active Learning with Local and Global Data Views

Relation Type Extension
Extend a relation extraction system to new types of relations

ACE 2004 Relations
Type       Example
EMP-ORG    the CEO of Microsoft
PHYS       a military base in Germany
GPE-AFF    U.S. businessman
PER-SOC    his ailing father
ART        US helicopters
OTHER-AFF  Cuban-American people

Multi-class Setting:
– Target relation: one of the ACE relation types
– Labeled data: 1) a few labeled examples of the target relation (possibly by random selection); 2) all labeled auxiliary relation examples
– Unlabeled data: all other examples in the ACE corpus

Relation Type Extension
Extend a relation extraction system to new types of relations

ACE 2004 Relations
Type       Example
EMP-ORG    the CEO of Microsoft
PHYS       a military base in Germany
GPE-AFF    U.S. businessman
PER-SOC    his ailing father
ART        US helicopters
OTHER-AFF  Cuban-American people

Binary Setting:
– Target relation: one of the ACE relation types
– Labeled data: a few labeled examples of the target relation (possibly by random selection)
– Unlabeled data: all other examples in the ACE corpus

LGCo-Testing
LGCo-Testing := co-testing with local and global views
The general idea:
1. Train one classifier based on the local view (the sentence that contains the pair of entities)
2. Train another classifier based on the global view (distributional similarities between relation instances)
3. Reduce annotation cost by only requesting labels of contention data points

The local view
President Clinton traveled to the Irish border for an evening ceremony.

Token Sequence features:
– Words before entity 1: {NIL}
– Words between: {travel, to}
– Words after entity 2: {for, an}
– # words between: 2
– Token pattern coupled with entity types: PERSON_traveled_to_LOCATION

Syntactic Parsing Tree feature:
– Path of phrase labels connecting E1 and E2, augmented with the head word of the top phrase: NP--S--traveled--VP--PP

The local view
President Clinton traveled to the Irish border for an evening ceremony.

Dependency Parsing Tree feature:
– Shortest path connecting the two entities, coupled with entity types: PER_nsubj'_traveled_prep_to_LOC

The local view
The local view classifier:
– Binary Setting: MaxEnt binary classifier
– Multi-class Setting: MaxEnt multi-class classifier

The global view
The General Idea
Corpus of 2,000,000,000 tokens → database of 7-grams (* * * * * * *)
1. Compile the corpus into a database of 7-grams
2. Represent each relation instance as a relational phrase
3. Compute distributional similarities between phrases in the 7-gram database
4. Build a relation classifier based on the K-nearest-neighbor idea

Relation Instance                              Relational Phrase
Clinton traveled to the Irish border for ...   traveled to
... his brother said that ...                  his brother

Compute distributional similarities
President Clinton traveled to the Irish border for an evening ceremony.
Query the 7-gram database with "traveled to" in each possible position: * * * * * traveled to, traveled to * * * * *, * * * * traveled to *, * traveled to * * * *, ...

Matches for the query * * * traveled to * * (with counts):
3  's headquarters here traveled to the U.S.
4  laundering after he traveled to the country
3  , before Paracha traveled to the United
3  have never before traveled to the United
3  had in fact traveled to the United
4  two Cuban grandmothers traveled to the United
3  officials who recently traveled to the United
6  President Lee Teng-hui traveled to the United
   , Clinton traveled to the United
4  commission members have traveled to the United
4  De Tocqueville originally traveled to the United
4  Fernando Henrique Cardoso traveled to the United
3  Elian 's grandmothers traveled to the United

Compute distributional similarities
Ang arrived in Seattle on Wednesday.

Matches for the query * * * arrived in * * (with counts):
4   Arafat , who arrived in the other
5   of sorts has arrived in the new
5   inflation has not arrived in the U.S.
3   Juan Miguel Gonzalez arrived in the U.S.
3   it almost certainly arrived in the New
44  to Brazil , arrived in the country
4   said Annan had arrived in the country
21  he had just arrived in the country
5   had not yet arrived in the country
3   when they first arrived in the country
3   day after he arrived in the country
5   children who recently arrived in the country
4   Iraq Paul Bremer arrived in the country
3   head of counterterrorism arrived in the country
3   election monitors have arrived in the country

Compute distributional similarities
President Clinton traveled to the Irish border
– Represent each phrase as a feature vector of contextual tokens
– Compute the cosine similarity between two feature vectors
– Feature weight?

Features for "traveled to" (sorted by frequency):
R1_the, L1_have, R4_to, R3_in, L2_,, R2_and, R1_Washington, L2_and, R2_to, R2_in, R1_New, L1_He, L1_who, L3_., R4_,, R1_a, R2_,, L1_and, R2_on, L1_also, L1_,, R3_to, R4_the, R3_a, L1_had, L2_who, R1_China, R2_with, L1_he, R2_for, L4_,, L3_the, L3_,, L4_., L2_the, L2_when, L1_has, R3_the, R3_,, L1_then

Features for "arrived in" (sorted by frequency):
R1_the, R1_Beijing, R2_in, R3_a, R2_on, L1_had, R3_,, R4_a, L1_who, R2_to, R2_for, R4_the, L2_,, R3_on, R3_for, L3_,, L1_,, L4_., R2_from, R1_a, L3_., R3_the, R4_for, L1_they, R2_,, R1_New, R3_to, R4_to, L1_he, R2_., L3_the, R1_Moscow, L1_has, L2_when, R3_capital, L5_., L1_have, R4_,, L2_the, L3_The

Feature Weight: use frequency?

Feature Weight
Use tf-idf (a code sketch follows below):
– tf: the number of corpus instances of phrase P having feature f, divided by the number of instances of P
– idf: the total number of phrases in the corpus, divided by the number of phrases with at least one instance having feature f
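A minimal sketch of the weighting and similarity computation, assuming phrase-to-context-feature counts have already been collected from the 7-gram database. The data-structure names are illustrative, not the dissertation's code, and the log-scaling of idf is an assumption (the slide only defines the two ratios).

```python
import math
from collections import Counter

def tfidf_vectors(phrase_feature_counts, phrase_instance_counts):
    """phrase_feature_counts:  {phrase: Counter(feature -> #instances of the
                                phrase having that feature)}
    phrase_instance_counts: {phrase: total number of corpus instances}
    Returns {phrase: {feature: tf-idf weight}} per the slide's definitions."""
    num_phrases = len(phrase_feature_counts)
    # number of phrases with at least one instance having feature f
    df = Counter()
    for counts in phrase_feature_counts.values():
        df.update(counts.keys())
    vectors = {}
    for phrase, counts in phrase_feature_counts.items():
        n = phrase_instance_counts[phrase]
        vectors[phrase] = {
            f: (c / n) * math.log(num_phrases / df[f])  # log idf is an assumption
            for f, c in counts.items()
        }
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse feature vectors (dicts)."""
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

`cosine` here corresponds to the similarity scores shown on the following slides.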

Feature Weight
Use tf-idf

Features for "traveled to" (sorted by tf-idf):
L1_had, L1_He, L1_then, R1_Beijing, L1_who, R1_New, L1_she, R1_London, L1_he, L2_who, L1_also, R2_for, L1_has, R1_China, R2_York, R2_in, L1_have, R2_,, R1_Afghanistan, L2_when, R2_to, L1_recently, L1_Zerhouni, R1_Baghdad, L1_,, R1_Thenia, L1_Clinton, R1_Mexico, R1_the, L1_and, L1_they, L2_He, R1_Washington, R1_Europe, L3_Nouredine, R4_to, L2_,, R1_Cuba, R2_and, R2_United

Features for "arrived in" (sorted by tf-idf):
L1_who, R1_Baghdad, L1_they, R1_Seoul, R2_on, R1_Moscow, R2_Sunday, L5_., R1_Beijing, L1_delegation, R2_Tuesday, R1_Damascus, L1_has, R3_capital, R1_Washington, R2_,, L1_he, R1_New, R3_Monday, R3_Wednesday, L1_have, L3_., R2_Wednesday, R3_Thursday, L1_had, L2_,, R3_Sunday, R2_from, L1_,, L1_He, R2_Thursday, R1_Amman, R1_Cairo, L2_when, R3_Tuesday, L3_Minister, R1_the, R2_Monday, R2_York, R1_Belgrade

Compute distributional similarities
Sample of similar phrases:

Similar to "traveled to":        Similar to "his family":
visited          0.779           his staff        0.792
arrived in       0.763           his brother      0.789
worked in        0.751           his friends      0.780
lived in         0.719           his children     0.769
served in        0.686           their families   0.753
consulted with   0.672           his teammates    0.746
played for       0.670           his wife         0.725

The global view classifier
k-nearest-neighbor classifier: classify an unlabeled example based on the closest labeled examples (using the similar-phrase table on the previous slide):

Labeled:   President Clinton traveled to the Irish border → PHYS-LocatedIn
Labeled:   ... his brother said that ... → PER-SOC
Unlabeled: Ang Sun arrived in Seattle on Wednesday. → ?

sim(arrived in, traveled to) = 0.763; sim(arrived in, his brother) = 0.012
⇒ classify as PHYS-LocatedIn (a k-NN sketch follows)
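A sketch of such a phrase-similarity k-NN classifier. The `sim` argument would be the cosine similarity over tf-idf vectors from the earlier sketch; the similarity-weighted voting scheme is an illustrative choice, not necessarily the exact rule used in the dissertation.

```python
from collections import defaultdict

def knn_classify(phrase, labeled_phrases, sim, k=5):
    """labeled_phrases: {relational phrase: relation label}
    sim: similarity function between two phrases (e.g., cosine over tf-idf)
    Returns (label, score) by similarity-weighted vote over the k nearest
    labeled phrases."""
    neighbors = sorted(labeled_phrases, key=lambda p: sim(phrase, p), reverse=True)[:k]
    votes = defaultdict(float)
    for p in neighbors:
        votes[labeled_phrases[p]] += sim(phrase, p)
    return max(votes.items(), key=lambda kv: kv[1])
```

With the slide's example, `knn_classify("arrived in", {"traveled to": "PHYS-LocatedIn", "his brother": "PER-SOC"}, sim)` would pick PHYS-LocatedIn.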

LGCo-Testing Procedure in Detail
Use KL-divergence to quantify the disagreement between the two classifiers:
– 0 for identical distributions
– maximal when distributions are peaked and prefer different class labels
Then:
– Rank instances in descending order of KL-divergence
– Pick the top 5 instances to request human labels during a single iteration
(a sketch of this selection step follows)
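For concreteness, the disagreement scoring might look like the sketch below. The symmetrized form of the divergence is an assumption (the slide does not specify which variant is used); the batch size of 5 follows the slide.

```python
import math

def kl(p, q, eps=1e-12):
    """KL divergence D(p || q) between two class distributions (lists of probs)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def select_queries(instances, local_probs, global_probs, batch=5):
    """Rank unlabeled instances by symmetrized KL between the local and global
    views' class distributions; return the top `batch` for human labeling."""
    scored = []
    for x in instances:
        d = kl(local_probs[x], global_probs[x]) + kl(global_probs[x], local_probs[x])
        scored.append((d, x))
    scored.sort(reverse=True, key=lambda t: t[0])
    return [x for _, x in scored[:batch]]
```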

Active Learning Baselines
RandomAL
UncertaintyAL
– Local view classifier
– Sample selection: uncertainty of the classifier's prediction
UncertaintyAL+
– Local view classifier (with phrase cluster features)
– Sample selection: uncertainty of the classifier's prediction
SPCo-Testing
– Co-Testing (sequence view classifier and parsing view classifier)
– Sample selection: KL-divergence

Results for PER-SOC (Multi-class Setting)
Annotation speed: 4 instances per minute, i.e., 200 instances per hour (the annotator takes a 10-minute break each hour)
– Supervised: 36K instances ≈ 180 hours
– LGCo-Testing: 300 instances ≈ 1.5 hours
Results for other types of relations show similar trends (in both binary and multi-class settings)

Precision-recall Curve of LGCo-Testing (Multi-class setting)

Comparing LGCo-Testing with the Two Settings
F1 difference (in percentage) = F1 of active learning minus F1 of supervised learning
The reduction of annotation cost by incorporating auxiliary types is more pronounced in early learning stages (#labels < 200) than in later ones.

Part III Relation Type Extension: Bootstrapping with Local and Global Data Views

Basic Idea Consider a bootstrapping procedure to discover semantic patterns for extracting relations between named entities

Basic Idea It starts from some seed patterns which are used to extract named entity (NE) pairs, which in turn result in more semantic patterns learned from the corpus.

Basic Idea
Semantic drift occurs because
1) a pair of names may be connected by patterns belonging to multiple relations
2) the bootstrapping procedure looks at the patterns in isolation

Named Entity 1   Pattern        Named Entity 2
Bill Clinton     visit          Arkansas
                 born in
                 fly to
                 governor of
                 arrive in
                 campaign in
                 ...

Unguided vs. Guided Bootstrapping
Unguided NE Pair Ranker:
– Uses local evidence
– Looks at the patterns in isolation
Guided NE Pair Ranker:
– Uses global evidence
– Takes into account the clusters (C_i) of patterns

Unguided Bootstrapping
Initial settings:
– The seed patterns for the target relation R have precision 1 and all other patterns 0
– All NE pairs have confidence 0

Unguided Bootstrapping
Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
– If many of the k patterns connecting the two names are high-precision patterns, then the name pair should have a high confidence
– The confidence of NE pairs is estimated as shown below
– Problem: over-rates NE pairs which are connected by patterns belonging to multiple relations
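The slide's confidence formula was an image. A Snowball-style estimate consistent with the description above (a reconstruction, not necessarily the exact formula used) is:

```latex
\mathrm{Conf}(N_i) \;=\; 1 \;-\; \prod_{j=1}^{k}\bigl(1 - \mathrm{Prec}(p_j)\bigr)
```

where p_1, ..., p_k are the patterns connecting the pair N_i; the pair's confidence is high as soon as any one of its patterns has high precision.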

Unguided Bootstrapping
Step 2: Use NE pairs to search for new patterns and rank patterns
– Similarly, for a pattern p: if many of the NE pairs it matches are very confident, then p has many supporters and should have a high ranking
– The confidence of a pattern is estimated from |H|, the number of unique NE pairs matched by p, and Sup(p), the sum of the support from those |H| pairs

Unguided Bootstrapping
Step 2: Use NE pairs to search for new patterns and rank patterns
– Sup(p) is the sum of the support p gets from the |H| pairs
– The precision of p is given by the average confidence of the NE pairs matched by p (see below)
– This normalizes the precision to range from 0 to 1; as a result, the confidence of each NE pair is also normalized to between 0 and 1
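In symbols (reconstructed from the wording above, since the slide's formula was an image):

```latex
\mathrm{Sup}(p) \;=\; \sum_{N \in H} \mathrm{Conf}(N),
\qquad
\mathrm{Prec}(p) \;=\; \frac{\mathrm{Sup}(p)}{|H|}
```

Because Prec(p) is an average of values in [0, 1], it stays in [0, 1], which in turn keeps the pair confidences of the next iteration normalized.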

Unguided Bootstrapping
Step 3: Accept patterns
– Accept the K top-ranked patterns from Step 2
Step 4: Loop or stop
– The procedure now decides whether to repeat from Step 1 or to terminate
– Most systems simply do NOT know when to stop

Guided Bootstrapping
Pattern Clusters -- clustering steps:
I. Extract features for patterns
II. Compute the tf-idf value of extracted features
III. Compute the cosine similarity between patterns
IV. Build a pattern hierarchy by complete linkage
(Sample features for "X visited Y", as in "Jordan visited China")

Guided Bootstrapping
Pattern Clusters
– We use a similarity threshold to cut the pattern hierarchy into clusters
– This "cutoff" is decided by trying a series of thresholds, searching for the maximal one that is capable of placing the seed patterns for each relation into a single cluster (a sketch of this search follows)
– We define the target cluster C_t as the one containing the seeds
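A sketch of the cutoff search using SciPy's hierarchical clustering, shown for a single relation's seed set for simplicity (the threshold grid and variable names are illustrative; the dissertation's exact procedure may differ):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_patterns(vectors, seed_idx, sims=np.arange(0.95, 0.0, -0.05)):
    """vectors: (n_patterns, n_features) tf-idf matrix
    seed_idx: row indices of the seed patterns for one relation
    Find the maximal similarity threshold that still places all seeds in a
    single cluster, and return that clustering."""
    Z = linkage(pdist(vectors, metric='cosine'), method='complete')
    for s in sorted(sims, reverse=True):                 # tightest cutoff first
        labels = fcluster(Z, t=1.0 - s, criterion='distance')  # cosine distance
        if len({labels[i] for i in seed_idx}) == 1:
            return labels, s    # seeds co-clustered at the largest such s
    return None, None
```

The target cluster C_t is then simply the cluster whose label the seeds share.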

Guided Bootstrapping Pattern cluster example – Top 15 patterns in the Located-in Cluster

Guided Bootstrapping
Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
The global confidence is the degree of association between N_i and the target cluster C_t, computed from the number of times each pattern p matches N_i and the total number of pattern instances matching N_i (see the reconstruction below).
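The formula itself was an image; a reconstruction consistent with the labeled quantities (count(p, N_i) = the number of times p matches N_i; the denominator = the total number of pattern instances matching N_i) and with the "discounting" behavior described on the next slide is:

```latex
\mathrm{Global\_Conf}(N_i) \;=\; \frac{\sum_{p \in C_t} \mathrm{count}(p, N_i)}{\sum_{p} \mathrm{count}(p, N_i)},
\qquad
\mathrm{Conf}(N_i) \;=\; \mathrm{Local\_Conf}(N_i) \times \mathrm{Global\_Conf}(N_i)
```

The multiplicative combination is an assumption, but it matches the slide's claim that a high local confidence is discounted by a low global confidence.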

Guided Bootstrapping
Step 1: Use seed patterns to match new NE pairs and evaluate NE pairs
Why does this give a better confidence estimate? Consider a pair like <Bill Clinton, Arkansas> for the Located-in relation:
– Local_Conf(N_i) is very high
– Global_Conf(N_i) is very low (less than 0.1)
– so Conf(N_i) is low: the high Local_Conf(N_i) is discounted by the low Global_Conf(N_i)

Guided Bootstrapping
Step 2: Use NE pairs to search for new patterns and rank patterns
– All the measurement functions are the same as those used in unguided bootstrapping
– However, with better ranking of NE pairs in Step 1, the patterns are also ranked better
Step 3: Accept patterns
– We also accept the K top-ranked patterns

Guided Bootstrapping
Step 4: Loop or stop
Since each pattern in our corpus has a cluster membership, we can
– monitor semantic drift easily
– and stop naturally
The procedure drifts when it tries to accept patterns which do not belong to the target cluster; we can stop when it tends to accept more patterns outside of the target cluster (a minimal check follows).
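A minimal sketch of that stopping check; the majority rule is an illustrative reading of "tends to accept more patterns outside of the target cluster", not necessarily the exact test used.

```python
def should_stop(accepted_patterns, cluster_of, target_cluster):
    """Stop when most newly accepted patterns fall outside the target cluster C_t.

    accepted_patterns: patterns accepted in the current iteration
    cluster_of:        {pattern: cluster id} from the pattern hierarchy cut
    """
    outside = sum(1 for p in accepted_patterns if cluster_of[p] != target_cluster)
    return outside > len(accepted_patterns) / 2
```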

Experiments
Pattern clusters:
– Computed from a corpus of 1.3 billion tokens
Evaluation data:
– ACE 2004 training data (no relation annotation between each pair of names)
– We take advantage of entity co-reference information to automatically re-annotate the relations
– Annotation was reviewed by hand
Evaluation method:
– Direct evaluation
– Strict pattern match

Experiments
Red: guided bootstrapping; Blue: unguided bootstrapping
drift: the percentage of false positives belonging to ACE relations other than the target relation
(Same figure legend for each target relation's plot.)


Experiments
Guided bootstrapping terminates while precision is still high and recall is still reasonable.
It also effectively prevents semantic drift.

Part IV Cross-Domain Bootstrapping for Named Entity Recognition
(Adapting an NER tagger from a source domain to a target domain via semi-supervised learning)

NER Model
Maximum Entropy Markov Model (MEMM) (McCallum et al., 2000)
Split a name type into two classes:
– B_PER (beginning of PERSON)
– I_PER (continuation of PERSON)

Tokens S1..S6:  U.S.   Defense  Secretary  Donald  H.     Rumsfeld
Tags T1..T6:    B_GPE  B_ORG    O          B_PER   I_PER  I_PER

Goal: predict the tag sequence T1...T6 for the token sequence S1...S6, using a Maximum Entropy classifier plus the Viterbi algorithm.

NER Model
Estimate the name class of each individual token t_i:
– Extract a feature vector from the local context window (t_i-2, t_i-1, t_i, t_i+1, t_i+2)
– Learn feature weights using a Maximum Entropy model

U.S./B_GPE Defense/B_ORG Secretary/O Donald/B_PER H./I_PER Rumsfeld/I_PER

Feature                  Value
currentToken             Donald
wordType_currentToken    initial_capitalized
previousToken_-1         Secretary
previousToken_-1_class   O
previousToken_-2         Defense
nextToken_+1             H.
...                      ...

NER Model
Estimate the name classes of the whole token sequence:
– Search for the most likely path: the argmax over tag sequences of P(T1...TL | S1...SL)
– Use dynamic programming (there are N^L possible paths)
– N := number of name classes; L := length of the token sequence
– States: B-PER, I-PER, B-ORG, I-ORG, B-GPE, I-GPE, O
(a Viterbi sketch follows)
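A compact Viterbi sketch for this decoding step. The `score(prev, tag, tokens, i)` callable is a hypothetical stand-in for the MaxEnt local log-probability log P(tag | prev, context), which the slide does not spell out.

```python
def viterbi(tokens, tags, score):
    """Find the most likely tag path by dynamic programming in O(L * N^2)
    instead of enumerating all N^L paths.
    score(prev, tag, tokens, i) -> log P(tag | prev, context at i);
    prev is None at position 0 (hypothetical interface)."""
    V = [{t: (score(None, t, tokens, 0), [t]) for t in tags}]  # position 0
    for i in range(1, len(tokens)):
        col = {}
        for t in tags:
            # pick the best previous tag for the current tag t
            prev, (s, path) = max(
                V[-1].items(),
                key=lambda kv: kv[1][0] + score(kv[0], t, tokens, i))
            col[t] = (s + score(prev, t, tokens, i), path + [t])
        V.append(col)
    return max(V[-1].values(), key=lambda sp: sp[0])[1]  # best final path
```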

Domain Adaptation Problems
Source domain (news articles): George Bush, Donald H. Rumsfeld, ..., Department of Defense, ...
Target domain (reports on terrorism): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, ..., Al-Qaeda in Iraq, ...

Q (Target domain): What is the weight of the feature currentToken=Abdul?
A (Source domain): Sorry, I don't know. I've never seen this guy in my training data.

Domain Adaptation Problems
Source domain (news articles): George Bush, Donald H. Rumsfeld, ..., Department of Defense, ...
Target domain (reports): Abdul Sattar al-Rishawi, Fahad bin Abdul Aziz bin Abdul Rahman Al-Saud, ..., Al-Qaeda in Iraq, ...

1. Many words are out-of-vocabulary
2. Naming conventions are different:
   1. Length: short vs. long
   2. Capitalization: weaker in the target domain
3. Name variation occurs often in the target domain: Shaikh, Shaykh, Sheikh, Sheik, ...

We want to automatically adapt the source-domain tagger to the target domain without annotating target-domain data.

The Benefits of Incorporating Global Data View -- Feature Generalization
Q (Target domain): What is the weight of the feature currentToken=Abdul?
A (Source domain): Sorry, I don't know. I've never seen this guy in my training data.

The Global Data View Comes to the Rescue!
Build a word hierarchy from a 10M-word corpus (Source + Target) using the Brown word clustering algorithm.
Example clusters (each identified by a bit string in the original table):
– John, James, Mike, Steven
– Abdul, Mustafa, Abi, Abdel
– Shaikh, Shaykh, Sheikh, Sheik
– Qaeda, Qaida, qaeda, QAEDA
– FBI, FDA, NYPD
– Taliban

The Benefits of Incorporating Global Data View -- Feature Generalization
Add an additional layer of features that include word clusters:
– currentToken = John → currentPrefix3 = 100
– currentPrefix3 = 100 also fires for target words!
– To avoid commitment to a single cluster: cut the word hierarchy at different levels (sketch below)
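A sketch of this feature layer, assuming a mapping from words to Brown-cluster bit strings; the bit-string values below are made up for illustration, chosen so that a source word (John) and an unseen target word (Abdul) share a coarse prefix.

```python
# word -> bit-string path in the Brown hierarchy (illustrative values)
BROWN = {"John": "1000110", "Abdul": "1000111"}

def cluster_features(word, prefix_lengths=(3, 5, 7)):
    """Emit cluster-prefix features at several hierarchy depths, so that words
    unseen in source training data can still fire shared coarse features
    (here both example words share currentPrefix3=100)."""
    path = BROWN.get(word)
    if path is None:
        return []
    return [f"currentPrefix{n}={path[:n]}" for n in prefix_lengths if len(path) >= n]
```

Cutting at several prefix lengths trades off granularity: short prefixes generalize aggressively, long prefixes stay close to the full cluster identity.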

The Benefits of Incorporating Global Data View -- Feature Generalization
Performance on the target domain:

Model                          P   R   F1
Source_Model                   —   —   —
Source_Model + Word Clusters   —   —   —
(scores not recoverable from the transcript)

The Benefits of Incorporating Global Data View -- Instance Selection
Cross-domain Bootstrapping Algorithm:
1. Train a tagger from labeled source data (with feature generalization)
2. Tag all unlabeled target data with the current tagger
3. Select good tagged words (e.g., "President Assad") by multiple criteria and add these to the labeled data
4. Re-train the tagger

The Benefits of Incorporating Global Data View -- Instance Selection
Multiple criteria:
– Criterion 1: Novelty - prefer target-specific instances (promote Abdul instead of John)
– Criterion 2: Confidence - prefer confidently labeled instances

The Benefits of Incorporating Global Data View -- Instance Selection  Criterion 2: Confidence - prefer confidently labeled instances  Local confidence: based on local features

The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 2: Confidence
Global confidence: based on corpus statistics

1   Prime Minister Abdul Karim Kabariti   PER
2   warlord General Abdul Rashid Dostum   PER
3   President A.P.J. Abdul Kalam will     PER
4   President A.P.J. Abdul Kalam has      PER
5   Abdullah bin Abdul Aziz ,             PER
6   at King Abdul Aziz University         ORG
7   Nawab Mohammed Abdul Ali ,            PER
8   Dr Ali Abdul Aziz Al                  PER
9   Nayef bin Abdul Aziz said             PER
10  leader General Abdul Rashid Dostum    PER

The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 2: Confidence
– Global confidence
– Combined confidence: the product of local and global confidence

The Benefits of Incorporating Global Data View -- Instance Selection  Criterion 3: Density - prefer representative instances which can be seen as centroid instances

The Benefits of Incorporating Global Data View -- Instance Selection
Criterion 4: Diversity - prefer a set of diverse instances instead of similar ones
Example: ", said * in his"
– a highly confident instance
– a high-density, representative instance
– BUT continuing to promote such instances would not gain additional benefit

The Benefits of Incorporating Global Data View -- Instance Selection
Putting all criteria together (see the sketch below):
1. Novelty: filter out source-dependent instances
2. Confidence: rank instances by confidence; the top-ranked instances form a candidate set
3. Density: rank instances in the candidate set in descending order of density
4. Diversity:
   1. accept the first instance (with the highest density) in the candidate set
   2. select other candidates based on the diff measure
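A sketch putting the four criteria into one pipeline. The `novel`, `confidence`, `density`, and `diff` helpers are hypothetical stand-ins for the criteria above, and the pool size, batch size, and diversity threshold are illustrative.

```python
def select_instances(instances, novel, confidence, density, diff,
                     top_k=100, batch=20, min_diff=0.5):
    """Instance selection combining novelty, confidence, density, diversity."""
    # 1. Novelty: drop source-dependent instances
    pool = [x for x in instances if novel(x)]
    # 2. Confidence: the top-ranked instances form the candidate set
    pool.sort(key=confidence, reverse=True)
    candidates = pool[:top_k]
    # 3. Density: order candidates by representativeness
    candidates.sort(key=density, reverse=True)
    # 4. Diversity: accept the densest candidate first, then greedily add
    #    candidates sufficiently different (per diff) from those selected
    selected = []
    for x in candidates:
        if not selected or all(diff(x, y) >= min_diff for y in selected):
            selected.append(x)
        if len(selected) == batch:
            break
    return selected
```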

The Benefits of Incorporating Global Data View -- Instance Selection  Results

Part V Conclusion

Contribution
The main contribution is the use of both local and global evidence for fast system development.
– The co-testing procedure reduced annotation cost by 97%
– The use of pattern clusters as the global view in bootstrapping not only greatly improved the quality of learned patterns, but also contributed to a natural stopping criterion
– Feature generalization and instance selection in cross-domain bootstrapping improved the source model's performance on the target domain by 7% F1 without annotating any target-domain data

Future Work
Active Learning for Relation Type Extension
– Conduct real-world active learning
– Combine semi-supervised learning with active learning to further reduce annotation cost
Semi-supervised Learning for Relation Type Extension
– Better seed selection strategies
Cross-domain Bootstrapping for Named Entity Recognition
– Extract dictionary-based features to further generalize lexical features
– Combine with distantly annotated data to further improve performance

Thanks!

?

Backup slides

Experimental Setup for Active Learning
ACE 2004 data
– 4.4K relation instances
– 45K non-relation instances
5-fold cross-validation
– Roughly 36K unlabeled instances (45K ÷ 5 × 4)
– Random initialization (repeated 10 times)
– 50 runs in total
– Each iteration selects 5 instances for annotation
– 200 iterations are performed