Bootstrapping for Text Learning Tasks Ramya Nagarajan AIML Seminar March 6, 2001
Preamble
• Bootstrapping for Text Learning Tasks. (1999) Jones, R., McCallum, A., Nigam, K., and Riloff, E.
• From the IJCAI-99 Workshop on Text Mining: Foundations, Techniques and Applications
• March 27: Ellen Riloff –
Introduction
• Learning algorithms require lots of labeled training data
  – labeling is time-consuming & tedious!
• Bootstrapping = small quantity of labeled data (seed) + large quantity of unlabeled data
  – can be used for text learning tasks that otherwise require large training sets
• Unlabeled data can be obtained automatically
Case Studies - 1
• Learning extraction patterns and dictionaries for information extraction
  – supplied knowledge = keywords & parser
• Noun phrase classifier & NP context classifier (based on extraction patterns)
  – given noun phrases as seed
• Generate dictionaries for locations from corporate web pages
  – 76% accuracy after 50 iterations
Case Studies - 2
• Document classification using a naïve Bayes classifier
  – provide keywords for each class & class hierarchy
• Classification of computer science papers
  – 66% accuracy (compare to human agreement levels of 72%)
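The keyword-seeded classifier can be sketched roughly as follows. This is a minimal illustration, not the paper's method: the class names, keywords, and documents are made up, the class hierarchy is omitted, and a single hard relabeling pass stands in for whatever iterative procedure the authors used.

```python
import math
from collections import Counter, defaultdict

def keyword_label(doc, keywords):
    """Return the first class whose keyword set overlaps the document, else None."""
    words = set(doc.split())
    for label, kws in keywords.items():
        if words & kws:
            return label
    return None

def train_nb(docs, labels, vocab):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_like = {}
    for c in class_docs:
        total = sum(word_counts[c].values()) + len(vocab)
        log_like[c] = {w: math.log((word_counts[c][w] + 1) / total) for w in vocab}
    return log_prior, log_like

def classify(doc, model):
    """Pick the class with the highest log posterior (unknown words are skipped)."""
    log_prior, log_like = model
    def score(c):
        return log_prior[c] + sum(log_like[c][w] for w in doc.split() if w in log_like[c])
    return max(sorted(log_prior), key=score)

# Hypothetical seed keywords and unlabeled documents.
keywords = {"nlp": {"parsing"}, "robotics": {"robot"}}
unlabeled = [
    "parsing sentences with grammar",
    "robot arm motion planning",
    "grammar induction for sentences",
    "motion planning under uncertainty",
]
vocab = {w for d in unlabeled for w in d.split()}

# Step 1: keywords give preliminary labels to the documents they match.
seeded = [(d, keyword_label(d, keywords)) for d in unlabeled]
labeled = [(d, l) for d, l in seeded if l is not None]

# Step 2: train on the keyword-labeled subset, then label everything.
model = train_nb([d for d, _ in labeled], [l for _, l in labeled], vocab)
predictions = [classify(d, model) for d in unlabeled]
```

Only two of the four documents contain a keyword, yet the trained model can label the other two through words they share with the seeded documents ("grammar", "sentences", "motion", "planning").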
Information Extraction
• IE = identifying predefined types of information from text
• Extraction patterns + semantic lexicon (words/phrases with semantic category labels)
• Example extraction pattern:
  Name: %Murdered%
  Event Type: MURDER
  Trigger Word: murdered
  Slots: VICTIM (human), PERPETRATOR (human)
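One way to picture the two resources is as plain data structures. The sketch below is an assumption about representation, not the format used by any actual IE system; the class name `ExtractionPattern`, the helper `can_fill`, and the lexicon entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ExtractionPattern:
    """One extraction pattern, mirroring the fields on the slide."""
    name: str
    event_type: str
    trigger_word: str
    # slot name -> semantic category required of the slot filler
    slots: dict = field(default_factory=dict)

# Semantic lexicon: word or phrase -> semantic category label (made-up entries).
semantic_lexicon = {"mayor": "human", "guerrillas": "human", "bogota": "location"}

murdered = ExtractionPattern(
    name="%Murdered%",
    event_type="MURDER",
    trigger_word="murdered",
    slots={"VICTIM": "human", "PERPETRATOR": "human"},
)

def can_fill(pattern, slot, phrase, lexicon):
    """True if the phrase is in the lexicon and its category matches the slot."""
    category = lexicon.get(phrase.lower())
    return category is not None and category == pattern.slots.get(slot)
```

For example, `can_fill(murdered, "VICTIM", "mayor", semantic_lexicon)` holds because "mayor" is labeled human, while "bogota" is rejected for the same slot.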
Information Extraction
• Previous extraction systems require
  – training corpus with annotations for desired extractions
  – manually defined keywords, frames or object recognizers
• Bootstrapping technique uses texts from the domain & a small set of seed words
Information Extraction
• Based on two observations:
  – if "schnauzer", "terrier", "dalmatian" refer to dogs → discover pattern "<X> barked"
  – if we know "<X> barked" is a good pattern for extracting dogs → every NP it extracts refers to a dog
• Mutual bootstrapping = seed words of semantic category → learned extraction patterns → new category members
Mutual Bootstrapping
• Generate all candidate extraction patterns from the training corpus using AutoSlog (a tool that builds dictionaries of extraction patterns)
• Apply candidate extraction patterns to the training corpus & save the patterns with their extractions
• Next stage: label semantic categories of extraction patterns & NPs
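The "apply and save" step can be sketched as building a map from each candidate pattern to the set of phrases it extracts. The sketch does not use AutoSlog or a parser: a regular expression that captures a single word in the `<X>` slot stands in for real noun-phrase extraction, and the corpus and candidate patterns are invented.

```python
import re
from collections import defaultdict

# Toy corpus and hypothetical candidate patterns; "<X>" marks the slot.
corpus = [
    "the terrier barked at the mailman",
    "a schnauzer barked all night",
    "the dalmatian slept on the porch",
    "the neighbor slept late",
]
candidate_patterns = ["<X> barked", "<X> slept"]

def compile_pattern(pattern):
    """Turn a pattern with exactly one '<X>' into a regex capturing one word."""
    left, right = pattern.split("<X>")
    return re.compile(re.escape(left) + r"(\w+)" + re.escape(right))

def apply_patterns(patterns, sentences):
    """Return {pattern: set of extracted words} over all sentences."""
    extractions = defaultdict(set)
    for pattern in patterns:
        regex = compile_pattern(pattern)
        for sentence in sentences:
            for match in regex.finditer(sentence):
                extractions[pattern].add(match.group(1))
    return dict(extractions)

pattern_extractions = apply_patterns(candidate_patterns, corpus)
```

The resulting dictionary is the only input the later scoring stage needs: each pattern is judged by how many of its saved extractions are already known category members.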
Mutual Bootstrapping Overview
[Diagram: mutual bootstrapping loop. Labels: Temp Semantic lexicon; Extraction Phrase list; Select best EP; Add best EP's extractions.]
Mutual Bootstrapping (cont.)
• Score extraction patterns
  – more general patterns are scored higher & use head phrase matching
• Scoring also uses the RlogF metric: score(pattern_i) = R_i * log2(F_i)
  – F_i = number of unique lexicon members among the pattern's extractions, N_i = total number of unique NPs it extracts, R_i = F_i / N_i
• Identifies the most reliable extraction patterns & patterns that frequently extract relevant info (irrelevant info may also be extracted)
• e.g. "kidnapped in <X>" vs. "kidnapped in January"
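A small worked example of the RlogF score, assuming the definitions from Riloff and Jones (1999): F is the number of unique lexicon members a pattern extracts, N is the total number of unique NPs it extracts, and R = F / N. Returning `None` when F = 0 is a choice made here for illustration, since log2(0) is undefined.

```python
import math

def rlogf(extractions, lexicon):
    """score = R * log2(F), with R = F / N; None if no extraction is in the lexicon."""
    n = len(extractions)
    f = len(extractions & lexicon)
    if f == 0:
        return None  # assumption: such patterns are simply not ranked
    return (f / n) * math.log2(f)

# Half of 4 extractions are known members: R = 0.5, log2(2) = 1.
low = rlogf({"a", "b", "c", "d"}, {"a", "b"})

# All 8 extractions are known members: R = 1, log2(8) = 3.
high = rlogf(set("abcdefgh"), set("abcdefgh"))
```

The metric rewards both precision (R) and volume (F): a pattern with a single correct extraction scores 0 because log2(1) = 0, no matter how precise it is.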
Problems…
• "shot in <X>": location or body part?
  – body parts get labeled as locations → patterns that extract many body parts are learned for the location category → low accuracy
• Save the 5 most reliable NPs from the bootstrapping process, then restart the inner bootstrapping process
• Reliable NP = one extracted by many extraction patterns
Meta-Bootstrapping
[Diagram: seed words initialize the Permanent Semantic lexicon; Mutual Bootstrapping runs over the candidate extraction patterns & extractions using the Temp Semantic lexicon and Extraction Phrase list (select best EP, add best EP's extractions); the 5 best NPs are added to the Permanent Semantic lexicon.]
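The two nested loops can be sketched as below. This is a simplified reading of the slides, not the published algorithm: reliability of an NP is taken to be the number of selected patterns that extracted it, ties are broken alphabetically, and the pattern-to-extraction table is invented.

```python
import math

def rlogf(extracted, lexicon):
    """R * log2(F); negative infinity when no extraction is in the lexicon."""
    f = len(extracted & lexicon)
    return (f / len(extracted)) * math.log2(f) if f else float("-inf")

def mutual_bootstrap(lexicon, pattern_extractions, iterations):
    """Inner loop: repeatedly pick the best pattern and absorb its extractions."""
    temp = set(lexicon)  # temporary semantic lexicon
    chosen = []          # selected extraction patterns, in order
    for _ in range(iterations):
        scored = [(rlogf(nps, temp), p)
                  for p, nps in sorted(pattern_extractions.items())
                  if p not in chosen]
        scored = [(s, p) for s, p in scored if s > float("-inf")]
        if not scored:
            break
        _, best = max(scored, key=lambda sp: sp[0])
        chosen.append(best)
        temp |= pattern_extractions[best]
    return chosen, temp

def meta_bootstrap(seeds, pattern_extractions, outer_iterations, inner_iterations, keep=5):
    """Outer loop: keep only the most reliable new NPs, then restart the inner loop."""
    permanent = set(seeds)  # permanent semantic lexicon
    for _ in range(outer_iterations):
        chosen, temp = mutual_bootstrap(permanent, pattern_extractions, inner_iterations)
        new_nps = temp - permanent
        if not new_nps:
            break
        def support(noun_phrase):
            # assumption: reliability = number of selected patterns extracting the NP
            return sum(noun_phrase in pattern_extractions[p] for p in chosen)
        ranked = sorted(new_nps, key=lambda noun_phrase: (-support(noun_phrase), noun_phrase))
        permanent |= set(ranked[:keep])
    return permanent

# Hypothetical extraction table for the location category.
pattern_extractions = {
    "kidnapped in <X>": {"bolivia", "guatemala", "january"},
    "operates in <X>": {"bolivia", "guatemala", "peru"},
    "shot in <X>": {"city", "leg", "arm", "head"},
}
result = meta_bootstrap({"bolivia", "city"}, pattern_extractions,
                        outer_iterations=1, inner_iterations=3, keep=1)
```

In this toy run the inner loop pulls in "leg", "arm" and "head" through "shot in <X>", but only "guatemala", which two patterns extract, survives into the permanent lexicon. With ties between singly supported NPs, later outer iterations would fall back on alphabetical order, which is an artifact of this sketch.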
Results
• Seed words (terrorist locations): bolivia, city, colombia ….
• Location patterns extracted by meta-bootstrapping after 50 iterations
  – kidnapped in <X>
  – taken in <X>
  – operates in <X>
  – billion in <X>
• 76% of hypothesized location phrases were true locations
Related Work
• DIPRE algorithm of Brin (1998) uses bootstrapping to extract (title, author) pairs for books on the WWW
• Yarowsky (1995) used a bootstrapping algorithm for the word sense disambiguation task
• Nigam (1999) used a few labeled documents instead of keywords
References
• Bootstrapping for Text Learning Tasks. (1999) Jones, R., McCallum, A., Nigam, K., and Riloff, E.
• Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping. (1999) Riloff, E. and Jones, R.
• Foundations of Statistical Natural Language Processing. Manning, C. and Schütze, H.