Extracting Geographical Gazetteers from the Internet Olga Uryupina
Overview Named Entity Recognition & Gazetteers, Data, Initial Algorithm, Bootstrapping approach, Evaluation, ToDo
NE Recognition National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchasing. Playfair designed ( ) the imposing classical building to house the works.
State-of-the-art systems Standard approaches usually combine: rules, statistics, gazetteers. Classes distinguished: Person, Organisation, Location.
NE Recognition – with and without gazetteers
Mikheev, Moens, and Grover (1999) ran their system in different modes:
               Full gazetteer       No gazetteer
               Recall  Precision    Recall  Precision
organisation   90%     93%          86%     85%
person         96%     98%          90%     95%
location       95%     94%          46%     59%
Fine-grained NER Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.
Manually created gazetteers Available resources: word lists from the Web, atlases & maps, digital gazetteers (e.g. Alexandria Digital Library)
Manually created gazetteers – drawbacks Only positive data (no way to find out whether Mainau island does not exist or is simply not listed). Difficult to adjust when new classes are required. Not available for most languages: Aquisgrana
Task We can get rid of manually compiled gazetteers by using the Internet. Task: subclassify locations using Internet hit counts (obtained from the AltaVista search engine). Offline vs. online processing.
Data
Manually created gazetteer (1260 items). Classes, each with an example:
COUNTRY   Pitcairn
REGION    Bavaria/Bayern
RIVER     Oder
ISLAND    Savai‘i
MOUNTAIN  Ohmberge
CITY      Nancy
Washington: 11 x CITY, 1 x MOUNTAIN, 2 x ISLAND, (31+1+1) x REGION
Data
Gazetteer example:
Toronto      CITY
Totonicapan  CITY, REGION
Trinidad     CITY, RIVER, ISLAND
Data
For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.
CITY: ...  REGION: ...  COUNTRY: ...
RIVER: ..., Victoria, ...  ISLAND: ..., Victoria, ...  MOUNTAIN: ..., Victoria, ...
TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
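The sampling described on this slide can be sketched as follows. This is only an illustration: the function name, the `seed` parameter, and the toy gazetteer are assumptions, not part of the original system.

```python
import random

def split_gazetteer(gazetteer, per_class=100, seed=0):
    """Sample up to `per_class` names per class; the union is TRAINING, the rest TESTING.

    gazetteer: {class name: list of location names} (hypothetical layout).
    """
    rng = random.Random(seed)
    training = set()
    for names in gazetteer.values():
        unique = sorted(set(names))
        training.update(rng.sample(unique, min(per_class, len(unique))))
    all_names = set().union(*gazetteer.values())
    testing = all_names - training
    return training, testing

# Toy data: "Victoria" is listed under two classes, so the per-class samples
# can overlap and the training set can be smaller than classes * per_class.
toy = {
    "RIVER": ["Oder", "Victoria", "Rhine"],
    "ISLAND": ["Victoria", "Mainau", "Achill"],
}
training, testing = split_gazetteer(toy, per_class=2)
```

Because the class lists overlap, the training set holds fewer distinct names than the number of draws, which is how 6 x 100 samples end up as 520 different items.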
Initial system For each class a set of keywords was created. ISLAND: island, islands, archipelago
Initial system For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.
Newfoundland
Newfoundland island
island of Newfoundland
Newfoundland islands
islands of Newfoundland
Newfoundland archipelago
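A minimal sketch of how such queries could be generated. The function name is hypothetical, and the sketch only builds the query strings; how they were actually submitted to the search engine is not described here.

```python
def build_queries(item, keywords):
    """Return the bare item plus "X KEYWORD" and "KEYWORD of X" for every keyword."""
    queries = [item]  # the bare name, whose count can serve as a normaliser (#X)
    for kw in keywords:
        queries.append(f"{item} {kw}")
        queries.append(f"{kw} of {item}")
    return queries

ISLAND_KEYWORDS = ["island", "islands", "archipelago"]
queries = build_queries("Newfoundland", ISLAND_KEYWORDS)
```

Each query yields one hit count, so an item is represented by one number per query, which is the input the machine learners work on.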
Initial system Machine learners use the counts to induce classifications. Learners tested for this task: C4.5, TiMBL, Ripper
Initial system – drawbacks Still needs manually created resources: set of patterns, initial gazetteer (TRAINING). Only online (slow) processing: the system can only classify items provided by the user, but cannot extract new names itself.
Bootstrapping Riloff & Jones, 1999 – bootstrapping for an IE task: ITEMS <-> PATTERNS
Bootstrapping Main problem – noise: the pattern set can get infected. Remedies: vaccine (external algorithm for evaluating patterns), stop lists, human experts
Bootstrapping loop (diagram): Initial gazetteer -> Extraction items -> Collecting patterns -> Discarding most general patterns -> Learning classifiers -> Extraction patterns -> Collecting items -> Discarding common names -> Classifying items (with the learned high-precision classifier) -> Extraction items
Collecting patterns (step 1) Go to AltaVista, ask for an item, download the first n pages, match with a simple regexp -> patterns
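The matching step might look roughly like the sketch below. It is an assumption-laden illustration: the actual regexp is not given, pages are passed in as plain strings rather than downloaded, only single-word items are handled, and the context window of up to two words on each side is a guess.

```python
import re
from collections import Counter

TOKEN = re.compile(r"[A-Za-z]+")  # assumed tokeniser: runs of ASCII letters

def collect_patterns(item, pages, max_context=2):
    """Count lower-cased word contexts around `item`, with the item replaced by X."""
    counts = Counter()
    target = item.lower()
    for text in pages:
        tokens = [t.lower() for t in TOKEN.findall(text)]
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for n in range(1, max_context + 1):
                if i - n >= 0:  # n words to the left of the item
                    counts[" ".join(tokens[i - n:i]) + " X"] += 1
                if i + n < len(tokens):  # n words to the right of the item
                    counts["X " + " ".join(tokens[i + 1:i + n + 1])] += 1
            if 0 < i < len(tokens) - 1:  # one word on each side
                counts[f"{tokens[i - 1]} X {tokens[i + 1]}"] += 1
    return counts

pages = ["The island of Achill is in Ireland.", "Visit Achill island and Achill today"]
patterns = collect_patterns("Achill", pages)
```

Summing these counters over all items of a class gives a ranked pattern list per class.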
Example – step 1
10 best patterns for ISLAND:
of X    70
the X   60
X and   58
X the   55
to X    53
in X    52
and X   47
X is    45
X in    45
on X    45
Rescoring (step 2) Goal: discard overly general patterns. The score of pattern p for class c includes a penalty for appearing in more than one class.
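The scoring formula itself is not preserved in this text, so the sketch below is not the formula from the talk. It only illustrates the idea with one assumed penalty: a pattern's count in a class is reduced by its counts in all other classes. All names and all counts in the toy data are invented.

```python
def rescore(pattern_counts):
    """pattern_counts: {class: {pattern: count}} -> {class: {pattern: score}}.

    ASSUMED scoring: own-class count minus the summed counts in other classes.
    """
    scores = {}
    for cls, counts in pattern_counts.items():
        scores[cls] = {}
        for pattern, n in counts.items():
            penalty = sum(
                other.get(pattern, 0)
                for other_cls, other in pattern_counts.items()
                if other_cls != cls
            )
            scores[cls][pattern] = n - penalty
    return scores

def best_patterns(scores, cls, k=20):
    """Return the k highest-scoring patterns for a class."""
    return sorted(scores[cls], key=scores[cls].get, reverse=True)[:k]

# Invented toy counts: "of X" is frequent everywhere, "X island" is class-specific.
toy_counts = {
    "ISLAND": {"of X": 70, "X island": 17},
    "RIVER": {"of X": 65, "X river": 20},
}
toy_scores = rescore(toy_counts)
top_island = best_patterns(toy_scores, "ISLAND", k=1)
```

Under this assumed penalty, the very general "of X" drops below the class-specific "X island".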
Example – step 2
10 best patterns for ISLAND:
X island       17
island of X     9
X islands       8
island X        7
islands X       7
insel X         7
the island X    6
X elects        5
of X islands    5
zealand X       4
Learning classifiers (step 3) The 20 best patterns are used to train Ripper (as in the initial system). Produced classifiers: high-recall, high-accuracy, high-precision
Example – step 3
High-recall classifier for ISLAND:
if #("X island")/#X >= … classify X as +ISLAND
if #("and X islands")/#X >= … classify X as +ISLAND
if #("insel X")/#X >= … classify X as +ISLAND
otherwise classify X as –ISLAND
Extraction patterns: "X island", "and X islands", "insel X"
One more example – step 3
High-accuracy classifier for ISLAND:
if #("X island")/#X >= … classify X as +ISLAND
if #("and X islands")/#X >= … and #("X sea")/#X >= … and #("X geography") < 13 classify X as +ISLAND
if #("X islands")/#X >= … and #("pacific islands X")/#X >= … classify X as +ISLAND
otherwise classify X as –ISLAND
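A rule list of this kind can be applied as in the sketch below, shown for the simpler high-recall rule set. The threshold values are not given in the text above, so the numbers here are placeholders chosen only for illustration, as are the function names and the toy hit counts.

```python
def ratio(counts, pattern, item):
    """Return #(pattern with X replaced by item) / #item, or 0.0 if the item has no hits."""
    total = counts.get(item, 0)
    if total == 0:
        return 0.0
    return counts.get(pattern.replace("X", item), 0) / total

# (pattern, threshold) pairs; the thresholds are PLACEHOLDERS, not learned values.
HIGH_RECALL_ISLAND = [
    ("X island", 0.01),
    ("and X islands", 0.001),
    ("insel X", 0.001),
]

def classify_island(item, counts, rules=HIGH_RECALL_ISLAND):
    """Return True (+ISLAND) if any rule fires, otherwise False (-ISLAND)."""
    return any(ratio(counts, pattern, item) >= threshold for pattern, threshold in rules)

# Invented hit counts keyed by query string.
toy_hits = {
    "Achill": 1000, "Achill island": 50,
    "About": 100000, "About island": 3,
}
is_achill_island = classify_island("Achill", toy_hits)
is_about_island = classify_island("About", toy_hits)
```

Dividing by #X keeps very frequent words from passing a rule merely because they occur often in every context.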
Collecting and discarding items (steps 4&5) Same procedure as in step 1: go to AltaVista, ask for the extraction patterns (cf. step 3), ... Discarding: common names (beginning with lower-case letters), stop words (not necessary, but saves time)
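The discarding step can be sketched as a simple filter. The stop list below is a made-up example, since the one actually used is not shown, and the function name is hypothetical.

```python
STOP_WORDS = {"the", "this", "that", "these", "a", "an"}  # hypothetical stop list

def filter_candidates(candidates, stop_words=STOP_WORDS):
    """Keep capitalised candidates that are not stop words; return them sorted, without duplicates."""
    kept = set()
    for cand in candidates:
        if not cand or not cand[0].isupper():
            continue  # common name: begins with a lower-case letter
        if cand.lower() in stop_words:
            continue  # optional shortcut that saves classifier queries
        kept.add(cand)
    return sorted(kept)

kept = filter_candidates(["island", "Achill", "The", "Akutan", "Achill", "About"])
```

The filter is deliberately crude: capitalised non-names such as "About" pass through and are left for the classifier in step 6 to reject.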
Example – steps 4 and 5 Extracted islands (alphabetically): About Abyss Achill Active Adatara Akutan Alaska Alaskan Albarella All Amelia American
Classifying (step 6) The high-precision classifier (cf. step 3) is run on the collected items; rejected items are discarded; accepted items are used for extraction at the next loop.
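Steps 1 to 6 fit together as one loop, sketched below with the individual steps passed in as functions. Every name is hypothetical, and the sketch assumes that accepted items are added to the existing extraction items rather than replacing them.

```python
def bootstrap(seed_items, collect_patterns, learn_classifier, collect_items, loops=2):
    """Run the bootstrapping loop and return all items known after `loops` rounds."""
    known = set(seed_items)
    for _ in range(loops):
        patterns = collect_patterns(known)          # steps 1-2: collect and rescore patterns
        accept = learn_classifier(known, patterns)  # step 3: returns a predicate on items
        candidates = collect_items(patterns)        # steps 4-5: collect and filter items
        known |= {c for c in candidates if accept(c)}  # step 6: keep accepted items
    return known

# Toy stand-ins for the real steps, only to show the data flow.
found = bootstrap(
    {"Newfoundland"},
    collect_patterns=lambda items: ["X island"],
    learn_classifier=lambda items, patterns: (lambda c: c != "About"),
    collect_items=lambda patterns: ["Achill", "About", "Akutan"],
)
```

Keeping the steps as parameters makes it easy to swap in a stricter classifier, which matters because any wrongly accepted item feeds back into the next round of pattern collection.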
Example – step 6 Extracted islands (alphabetically): Achill Akutan Albarella Amelia Andaman Ascension Bainbridge Baltrum Beaver Big Block Bouvet
Evaluation Classifiers: initial system, bootstrapping from the seed gazetteer, bootstrapping from positive examples only. Item lists: bootstrapping from the seed gazetteer
Initial system – evaluation
Class      Accuracy
CITY       74.3%
ISLAND     95.8%
RIVER      88.8%
MOUNTAIN   88.7%
COUNTRY    98.8%
REGION     82.3%
average    88.1%
Bootstrapping – evaluation
Class      Initial system   After the 1st loop   After the 2nd loop
CITY       74.3%            51.2%                62.0%
ISLAND     95.8%            91.4%                96.4%
RIVER      88.8%            91.5%                89.6%
MOUNTAIN   88.7%            89.1%                88.8%
COUNTRY    98.8%            99.2%                99.6%
REGION     82.3%            80.4%                82.6%
average    88.1%            83.8%                86.5%
Comparing the performance RIVER, MOUNTAIN, COUNTRY – the new system is better! ISLAND – the new system improved and became better than the initial system after the 2nd loop. REGION – infected category ("departments of X"); however, the system is improving. CITY – very heterogeneous class (homonymy); 1st loop – "streets of X", 2nd loop – "km from X", "ort X".
Comparing the systems
Bootstrapping (vs. the initial system):
+ patterns learned automatically
+ word lists produced
- cheap seed gazetteer
Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify them properly.
Learning from positives
CITY: ...  REGION: ...  COUNTRY: ...
RIVER: ..., Victoria, ...  ISLAND: ..., Victoria, ...  MOUNTAIN: ..., Victoria, ...
Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Now: => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
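The difference between the two labelling schemes can be sketched as below. This is one reading of the slide, stated as an assumption: "before", labels come from the full gazetteer; "now", an item counts as positive only for the classes whose sampled list contains it. The function name and the toy lists are illustrative.

```python
CLASSES = ["CITY", "REGION", "COUNTRY", "RIVER", "ISLAND", "MOUNTAIN"]

def label(item, class_lists):
    """Return {class: True/False} depending on whether `item` is in that class's list."""
    return {cls: item in class_lists.get(cls, set()) for cls in CLASSES}

# Full gazetteer: Victoria is listed under five classes.
full_gazetteer = {
    "CITY": {"Victoria"}, "REGION": {"Victoria"}, "RIVER": {"Victoria"},
    "ISLAND": {"Victoria"}, "MOUNTAIN": {"Victoria"}, "COUNTRY": set(),
}
# Sampled positive lists: Victoria happened to be drawn only for three classes.
sampled_lists = {"RIVER": {"Victoria"}, "ISLAND": {"Victoria"}, "MOUNTAIN": {"Victoria"}}

before = label("Victoria", full_gazetteer)
now = label("Victoria", sampled_lists)
```

Under the second scheme some negatives are really unlabelled positives (here CITY and REGION), which is a source of noise for the learner.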
Initial system – evaluation
Class      Precompiled gazetteer   Positives only
CITY       74.3%                   50.3%
ISLAND     95.8%                   94.1%
RIVER      88.8%                   91.0%
MOUNTAIN   88.7%                   89.3%
COUNTRY    98.8%                   99.6%
REGION     82.3%                   86.9%
average    88.1%                   85.2%
Bootstrapping with positives only – evaluation
Class      1st loop   2nd loop
CITY       39.3%      44.1%
ISLAND     94.5%      95.8%
RIVER      91.2%      91.1%
MOUNTAIN   90.1%      91.2%
COUNTRY    98.7%      99.6%
REGION     86.5%      81.6%
average    83.4%      83.9%
New items
New ISLANDs:
true islands             121  (90.3%)
  found in the atlases    93
  not found               28
descriptions               5  (3.7%)
parts of names             3  (2.2%)
mistakes                   5  (3.7%)
all                      134
Conclusion Advantages of our approach: very little manually collected data required (seed gazetteer); no sophisticated engineering – patterns are produced automatically; on-line classifiers provide negative information and are applicable to any entity; new items (off-line gazetteer) are collected automatically
ToDo new classes -> hierarchy; multi-word expressions; more elaborate learning from positive examples; determine locations (where is X?)