Information Extraction with Unlabeled Data Rayid Ghani Joint work with: Rosie Jones (CMU) Tom Mitchell (CMU & WhizBang! Labs) Ellen Riloff (University of Utah)
The Vision [Diagram; labels: Data Base, Time Line, Geo Display, Link Analysis, Tables, Extractor, Entities, Models, Training Program, training sentences, answers, Relations, Information Extraction, Events]
What is IE? Analyze unrestricted text in order to extract information about pre-specified types of events, entities or relationships
Practical / Commercial Applications: Database of Job Postings extracted from corporate web pages; Extracting specific fields from resumes to populate HR databases; Information Integration; Shopping Portals
Where is the world now? MUC helped drive information extraction research, but most systems were fine-tuned for terrorist activities. Commercial systems can detect names of people, locations, companies (only for proper nouns). Very costly to train and port to new domains: 3-6 months to port to a new domain (Cardie 98); 20,000 words to learn named entity extraction (Seymore et al 99); 7000 labeled examples to learn MUC extraction rules (Soderland 99)
IE Approaches Hand-Constructed Rules Supervised Learning Semi-Supervised Learning
Goal Can you start with 5-10 seeds and learn to extract other instances? Example tasks Locations Products Organizations People
Aren’t you missing the obvious? Not really! Acquire lists of proper nouns Locations: countries, states, cities Organizations: online database People: Names But not all instances are proper nouns: *by the river*, *customer*, *client*
Use context to disambiguate A lot of NPs are unambiguous “The corporation” A lot of contexts are also unambiguous Subsidiary of But as always, there are exceptions….and a LOT of them in this case “customer”, John Hancock, Washington
Bootstrapping Approaches Utilize Redundancy in Text Noun-Phrases New York, China, place we met last time Contexts Located in, Traveled to Learn two models Use NPs to label Contexts Use Contexts to label NPs
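The mutual labeling of noun phrases and contexts can be illustrated with a minimal sketch. It is a simplified score-propagation loop written for this example, not any of the published algorithms; the function name `bootstrap`, the averaging rule, and the toy data are all assumptions.

```python
from collections import defaultdict


def bootstrap(pairs, seeds, iterations=5):
    """Propagate class membership between noun phrases (NPs) and contexts.

    pairs: list of (noun_phrase, context) co-occurrences from parsed text.
    seeds: set of NPs assumed to belong to the target class.
    Returns (np_scores, ctx_scores), dicts mapping each item to a score in [0, 1].
    """
    contexts_of = defaultdict(list)  # NP -> contexts it appears in
    nps_of = defaultdict(list)       # context -> NPs it appears with
    for np_, ctx in pairs:
        contexts_of[np_].append(ctx)
        nps_of[ctx].append(np_)

    # Seeds start at 1.0; every other NP starts at 0.0.
    np_scores = {n: 1.0 if n in seeds else 0.0 for n in contexts_of}
    ctx_scores = {}
    for _ in range(iterations):
        # Use NPs to label contexts: a context gets the mean score of its NPs.
        ctx_scores = {
            c: sum(np_scores[n] for n in nps) / len(nps)
            for c, nps in nps_of.items()
        }
        # Use contexts to label NPs: seeds stay fixed at 1.0.
        np_scores = {
            n: 1.0 if n in seeds else sum(ctx_scores[c] for c in cs) / len(cs)
            for n, cs in contexts_of.items()
        }
    return np_scores, ctx_scores
```

With seeds "New York" and "China", an NP such as "Paris" that shares the context "traveled to" with a seed gains a high score, while NPs that never share a context with a seed stay at 0.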
Algorithms for Bootstrapping Meta-Bootstrapping (Riloff & Jones, 1999) Co-Training (Blum & Mitchell, 1998) Co-EM (Nigam & Ghani, 2000)
Data Set ~5000 corporate web pages (4000 for training) Test data marked up manually by labeling every NP as one or more of the following semantic categories: location, organization, person, product, none Preprocessed (parsed) to generate extraction patterns using AutoSlog (Riloff, 1996)
Evaluation Criteria Every test NP is labeled with a confidence score by the learned model Calculate Precision and Recall at different thresholds Precision = Correct / Found Recall = Correct / Max that can be found
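A small sketch of threshold-based evaluation follows. It assumes the standard definitions: precision is the number of correct extractions divided by the number found, and recall is the number of correct extractions divided by the number of gold instances. The function name and the data layout (a dict of confidence scores plus a gold set) are assumptions made for illustration.

```python
def precision_recall(scored, gold, threshold):
    """Compute precision and recall at one confidence threshold.

    scored: dict mapping each test NP to the model's confidence score.
    gold: set of NPs that truly belong to the class.
    """
    # "Found" = every NP the model accepts at this threshold.
    found = {np_ for np_, score in scored.items() if score >= threshold}
    correct = found & gold
    # Convention (an assumption): report 1.0 when a denominator is empty.
    precision = len(correct) / len(found) if found else 1.0
    recall = len(correct) / len(gold) if gold else 1.0
    return precision, recall
```

Sweeping `threshold` from high to low traces out a precision-recall curve: fewer NPs are found at high thresholds, so precision tends to rise while recall falls.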
Active Learning Can we do better by keeping the user in the loop? If we can ask the user to label any examples, which examples should they be? Selected randomly Selected according to their density/frequency Selected according to disagreement between NP and context (KL divergence to the mean weighted by density)
NP – Context Disagreement KL Divergence
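One way to read "KL divergence to the mean weighted by density" is sketched below. The slides do not give the exact formula, so this assumes the average of the two KL terms against the mean distribution (the Jensen-Shannon divergence) multiplied by the example's frequency; all names are illustrative.

```python
import math


def kl(p, q):
    """KL divergence KL(p || q) in nats; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def disagreement(p_np, p_ctx, density):
    """Density-weighted disagreement between NP and context predictions.

    p_np, p_ctx: class distributions predicted from the NP and from its context.
    density: how often the example occurs in the corpus (assumed weighting).
    """
    mean = [(a + b) / 2 for a, b in zip(p_np, p_ctx)]
    # The mean is positive wherever either input is, so kl() never divides by 0.
    return density * 0.5 * (kl(p_np, mean) + kl(p_ctx, mean))


def rank_for_labeling(candidates):
    """Order examples so the most informative ones are shown to the user first.

    candidates: dict mapping an example to (p_np, p_ctx, density).
    """
    return sorted(
        candidates,
        key=lambda ex: disagreement(*candidates[ex]),
        reverse=True,
    )
```

Examples on which the two views agree score 0 regardless of frequency, so the user is asked only about frequent examples where the NP and its context point to different classes.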
What if you’re really lazy? Previous experiments assumed a training set was available What if you don’t have a set of documents that can be used to train? Can we start from only the seeds?
Collecting Training Data from the Web Use the seed words to generate web queries Simple Approaches For each seed word, fetch all documents returned Fetch only documents in which N or more seed words appear
Collecting Training Data from the Web [Diagram; labels: Query Generator, WWW, Seed Documents, Text Filter]
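The second simple approach, keeping only documents that mention N or more seed words, can be sketched as a text filter over documents that have already been fetched. The search step itself is left out because the slides do not name a search API; the function names and the whole-word, case-insensitive matching are assumptions.

```python
import re


def count_seeds(text, seeds):
    """Count how many distinct seed words occur in text (case-insensitive)."""
    lowered = text.lower()
    return sum(
        1
        for seed in seeds
        if re.search(r"\b" + re.escape(seed.lower()) + r"\b", lowered)
    )


def filter_documents(documents, seeds, min_seeds):
    """Keep only documents that contain at least min_seeds distinct seed words."""
    return [doc for doc in documents if count_seeds(doc, seeds) >= min_seeds]
```

Setting `min_seeds` to 1 reproduces the first approach (keep everything returned for any seed), while larger values trade corpus size for a higher density of on-topic text.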
Interleaved Data Collection Select a seed word with uniform probability Get documents containing that seed word Run bootstrapping on the new documents Select new seed words that are learned with high confidence Repeat
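The interleaved loop can be sketched as follows. The document fetcher and the bootstrapping learner are passed in as callables because the slides do not specify them; `fetch_documents`, `run_bootstrapping`, `rounds`, and `min_confidence` are hypothetical names and parameters chosen for this illustration.

```python
import random


def interleaved_collection(seeds, fetch_documents, run_bootstrapping,
                           rounds=3, min_confidence=0.9, rng=None):
    """Alternate between collecting documents and learning new seed words.

    fetch_documents(seed) -> list of documents containing that seed (assumed).
    run_bootstrapping(docs, seeds) -> dict of candidate word -> confidence (assumed).
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so runs are repeatable
    seeds = set(seeds)
    corpus = []
    for _ in range(rounds):
        # Select a seed word with uniform probability.
        seed = rng.choice(sorted(seeds))
        # Get documents containing that seed word.
        new_docs = fetch_documents(seed)
        corpus.extend(new_docs)
        # Run bootstrapping on the new documents.
        scores = run_bootstrapping(new_docs, seeds)
        # Promote words learned with high confidence to seeds, then repeat.
        seeds |= {w for w, s in scores.items() if s >= min_confidence}
    return seeds, corpus
```

Because newly learned seeds join the pool before the next draw, later queries can reach documents that none of the original seeds would have retrieved.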
Seed-Word Density
Summary Starting with 10 seed words, extract NPs matching specific semantic classes Probabilistic Bootstrapping is an effective technique Asking the user helps only if done intelligently The Web is an excellent resource for training data that can be collected automatically => Personal Information Extraction Systems