Extracting Geographical Gazetteers from the Internet
Olga Uryupina, 30.05.03

Overview
- Named Entity Recognition & Gazetteers
- Data
- Initial Algorithm
- Bootstrapping Approach
- Evaluation
- ToDo

NE Recognition
National Gallery of Scotland – The nucleus of the Gallery was formed by the Royal Institution's collection, later expanded by bequests and purchases. Playfair designed ( ) the imposing classical building to house the works.

State-of-the-art systems
Standard approaches usually combine:
- Rules
- Statistics
- Gazetteers
Classes distinguished:
- Person
- Organisation
- Location

NE Recognition – with and without gazetteers
(Mikheev, Moens, and Grover, 1999) ran their system in different modes:

                Full gazetteer         No gazetteer
                Recall   Precision     Recall   Precision
  organisation  90%      93%           86%      85%
  person        96%      98%           90%      95%
  location      95%      94%           46%      59%

Fine-grained NER
Washington wants protection for its peacekeepers. Until it gets its way the Administration is holding up renewal of the U.N. peacekeeping mandate in Bosnia.

Manually created gazetteers
Available resources:
- Word lists from the Web
- Atlases & maps
- Digital gazetteers (e.g. the Alexandria Digital Library)

Manually created gazetteers – drawbacks
- Only positive data (no way to find out whether Mainau island does not exist or is simply not listed)
- Difficult to adjust when new classes are required
- Not available for most languages: Aquisgrana

Task
We can get rid of manually compiled gazetteers by using the Internet.
Task: subclassify locations using Internet counts (obtained from the AltaVista search engine).
Offline vs. online processing

Data
Manually created gazetteer (1260 items)
Classes:
  COUNTRY    Pitcairn
  REGION     Bavaria/Bayern
  RIVER      Oder
  ISLAND     Savai‘i
  MOUNTAIN   Ohmberge
  CITY       Nancy
Washington: 11x CITY, 1x MOUNTAIN, 2x ISLAND, (31+1+1)x REGION

Data – gazetteer example
  Toronto       CITY
  Totonicapan   CITY, REGION
  Trinidad      CITY, RIVER, ISLAND

Data
For each class we sample 100 items from the gazetteer. As the lists overlap, this results in 520 different items (TRAINING data). The rest was used for TESTING.

CITY: ...      REGION: ...      COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...
=> TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]

Initial system
For each class a set of keywords was created.
ISLAND: island, islands, archipelago

Initial system
For each item X to be classified, queries of the form "X KEYWORD" and "KEYWORD of X" are sent to the AltaVista search engine.
Newfoundland:
  Newfoundland island
  island of Newfoundland
  Newfoundland islands
  islands of Newfoundland
  Newfoundland archipelago
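The query scheme above is simple to sketch. The helper below is illustrative only (the function name and query quoting are my assumptions, not the original implementation); it builds the bare item query (for the #X count) plus the "X KEYWORD" and "KEYWORD of X" variants for each class keyword.

```python
def build_queries(item, keywords):
    """Build the search-engine queries for a candidate name X:
    the bare name (for the #X count) plus "X KEYWORD" and
    "KEYWORD of X" for each class keyword."""
    queries = [f'"{item}"']                  # bare count #X
    for kw in keywords:
        queries.append(f'"{item} {kw}"')     # "X KEYWORD"
        queries.append(f'"{kw} of {item}"')  # "KEYWORD of X"
    return queries

queries = build_queries("Newfoundland", ["island", "islands", "archipelago"])
```

Each query string would then be submitted to the search engine and only the reported hit count kept.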

Initial system
Machine learners use the counts to induce classifications.
Learners tested for this task:
- C4.5
- TiMBL
- Ripper
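The learners above induce threshold rules over normalized counts. As a toy stand-in (not C4.5, TiMBL, or Ripper themselves), the sketch below learns a single feature/threshold split with the highest training accuracy, which is the flavour of rule these systems produce from count features:

```python
def learn_threshold_rule(samples, labels):
    """Toy stand-in for the rule learners: find the single
    feature/threshold split that best separates positives from
    negatives on the training counts."""
    n_feats = len(samples[0])
    best = (0.0, 0, 0.0)  # (accuracy, feature index, threshold)
    for f in range(n_feats):
        for t in sorted({s[f] for s in samples}):
            acc = sum((s[f] >= t) == y
                      for s, y in zip(samples, labels)) / len(samples)
            if acc > best[0]:
                best = (acc, f, t)
    return best[1], best[2]

# toy data: one feature = #("X island") / #X
feat, thr = learn_threshold_rule([[0.9], [0.7], [0.1], [0.0]],
                                 [True, True, False, False])
```

A real run would use one feature per query pattern and feed the normalized AltaVista counts.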

Initial system – drawbacks
- Still needs manually created resources: a set of patterns and an initial gazetteer (TRAINING)
- Only online (slow) processing – the system can only classify items provided by the user, but cannot extract new names itself

Bootstrapping
Riloff & Jones, 1999 – bootstrapping for an IE task: ITEMS <-> PATTERNS

Bootstrapping
Main problem – noise: the pattern set can get infected.
Remedies:
- Vaccine (an external algorithm for evaluating patterns)
- Stop lists
- Human experts

Bootstrapping loop
Initial gazetteer -> collecting patterns -> discarding most general patterns -> learning classifiers -> extraction patterns -> collecting items -> discarding common names -> classifying items (learned high-precision classifier) -> extraction items -> ...

Collecting patterns (step 1)
Go to AltaVista, ask for an item, download the first n pages, match with a simple regexp -> patterns
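Step 1 can be sketched as below. This is an assumed reading of "a simple regexp" (one-word left/right contexts of the item), not the exact expression used; page download is elided.

```python
import re
from collections import Counter

def collect_patterns(item, pages):
    """Step-1 sketch: match a simple regexp against downloaded page
    texts to collect one-word left/right contexts of the item,
    yielding patterns like "of X" and "X island"."""
    counts = Counter()
    for text in pages:
        for left, right in re.findall(
                rf'(\w+)\s+{re.escape(item)}\s+(\w+)', text, re.I):
            counts[f'{left.lower()} X'] += 1
            counts[f'X {right.lower()}'] += 1
    return counts

counts = collect_patterns(
    "Savaii",
    ["the island of Savaii lies west of Upolu",
     "on Savaii island the lava fields stretch for miles"])
```

The pattern frequencies over all downloaded pages feed the rescoring in step 2.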

Example – step 1
10 best patterns for ISLAND:
  of X     70
  the X    60
  X and    58
  X the    55
  to X     53
  in X     52
  and X    47
  X is     45
  X in     45
  on X     45

Rescoring (step 2)
Goal: discard too general patterns.
score(p, c) – score of pattern p for class c, with a penalty for appearing in more than one class
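The scoring formula itself did not survive in this transcript, so the sketch below is only an illustration of the stated idea (my assumption, not the original scoring): a pattern's raw count for a class is divided down by its counts in the other classes.

```python
def rescore(pattern_counts):
    """Illustrative rescoring: score(p, c) = count of p for class c,
    penalized by the pattern's total count in all other classes,
    so class-unspecific patterns like "of X" drop in rank."""
    scores = {}
    for cls, counts in pattern_counts.items():
        for p, n in counts.items():
            other = sum(pattern_counts[c].get(p, 0)
                        for c in pattern_counts if c != cls)
            scores[(cls, p)] = n / (1 + other)  # multi-class penalty
    return scores

scores = rescore({
    "ISLAND": {"X island": 17, "of X": 70},
    "RIVER":  {"of X": 65, "X river": 20},
})
```

With this penalty, the class-specific "X island" outranks the frequent but generic "of X", matching the step-1 vs. step-2 example lists.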

Example – step 2
10 best patterns for ISLAND:
  X island        17
  island of X      9
  X islands        8
  island X         7
  islands X        7
  insel X          7
  the island X     6
  X elects         5
  of X islands     5
  zealand X        4

Learning classifiers (step 3)
The 20 best patterns are used to train Ripper (as in the initial system).
Produced classifiers:
- high-recall
- high-accuracy
- high-precision

Example – step 3
High-recall classifier for ISLAND:
  if #("X island")/#X >=        classify X as +ISLAND
  if #("and X islands")/#X >=   classify X as +ISLAND
  if #("insel X")/#X >=         classify X as +ISLAND
  otherwise                     classify X as -ISLAND
Extraction patterns: "X island", "and X islands", "insel X"

One more example – step 3
High-accuracy classifier for ISLAND:
  if #("X island")/#X >=
      classify X as +ISLAND
  if #("and X islands")/#X >= and #("X sea")/#X >= and #("X geography") < 13
      classify X as +ISLAND
  if #("X islands")/#X >= and #("pacific islands X")/#X >=
      classify X as +ISLAND
  otherwise
      classify X as -ISLAND

Collecting and discarding items (steps 4 & 5)
The same procedure as in step 1: go to AltaVista, ask for the extraction patterns (cf. step 3), ...
Discarding:
- common names (beginning with lower-case letters)
- stop words (not necessary, but saves time)
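The discarding step above amounts to two cheap filters. A minimal sketch (function name and stop-list contents are illustrative assumptions):

```python
def filter_candidates(items, stop_words):
    """Step-5 sketch: drop common nouns (lower-case initial letter)
    and stop-listed words from the candidate names of step 4."""
    kept = []
    for it in items:
        if not it[0].isupper():       # common names start lower-case
            continue
        if it.lower() in stop_words:  # optional stop list saves queries
            continue
        kept.append(it)
    return kept

kept = filter_candidates(["Achill", "about", "All", "Akutan"],
                         {"all", "about"})
```

A stop list catches capitalized non-names like "All" or "About" that the case filter alone would keep (cf. the step-4/5 example list).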

Example – steps 4 and 5
Extracted islands (alphabetically): About, Abyss, Achill, Active, Adatara, Akutan, Alaska, Alaskan, Albarella, All, Amelia, American

Classifying (step 6)
The high-precision classifier (cf. step 3) is run on the collected items:
- rejected items are discarded
- accepted items are used for extraction in the next loop
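The step-6 filtering closes the bootstrapping loop and can be sketched as follows; the predicate stands in for the learned high-precision classifier (the length test is a toy assumption, not the real decision rule):

```python
def bootstrap_filter(candidates, high_precision_ok, seeds):
    """Step-6 sketch: keep only candidates accepted by the learned
    high-precision classifier and add them to the seed set for the
    next bootstrapping loop; rejected items are discarded."""
    accepted = [c for c in candidates if high_precision_ok(c)]
    return seeds | set(accepted)

# toy stand-in classifier: accept names longer than 4 characters
seeds = bootstrap_filter(["Achill", "Big", "Bouvet"],
                         lambda c: len(c) > 4,
                         {"Savaii"})
```

Using only high-precision decisions here is what keeps noise from infecting the growing gazetteer across loops.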

Example – step 6
Extracted islands (alphabetically): Achill, Akutan, Albarella, Amelia, Andaman, Ascension, Bainbridge, Baltrum, Beaver, Big, Block, Bouvet

Evaluation
Classifiers:
- initial system
- bootstrapping from the seed gazetteer
- bootstrapping from positive examples only
Item lists:
- bootstrapping from the seed gazetteer

Initial system – evaluation
  Class      Accuracy
  CITY       74.3%
  ISLAND     95.8%
  RIVER      88.8%
  MOUNTAIN   88.7%
  COUNTRY    98.8%
  REGION     82.3%
  average    88.1%

Bootstrapping – evaluation
  Class      Initial system   After the 1st loop   After the 2nd loop
  CITY       74.3%            51.2%                62.0%
  ISLAND     95.8%            91.4%                96.4%
  RIVER      88.8%            91.5%                89.6%
  MOUNTAIN   88.7%            89.1%                88.8%
  COUNTRY    98.8%            99.2%                99.6%
  REGION     82.3%            80.4%                82.6%
  average    88.1%            83.8%                86.5%

Comparing the performance
RIVER, MOUNTAIN, COUNTRY – the new system is better!
ISLAND – the new system improved and became better after the 2nd loop.
REGION – infected category ("departments of X"); however, the system is improving.
CITY – very heterogeneous class (homonymy); 1st loop – "streets of X", 2nd loop – "km from X", "ort X".

Comparing the systems
Bootstrapping (vs. the initial system):
+ patterns learned automatically
+ word lists produced
- cheap seed gazetteer
Problem: it's easy to download huge lists of islands etc., but very difficult to check them and classify them properly.

Learning from positives
CITY: ...      REGION: ...      COUNTRY: ...
RIVER: ..., Victoria, ...
ISLAND: ..., Victoria, ...
MOUNTAIN: ..., Victoria, ...
Before: => TRAINING: Victoria [+CITY, +REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]
Now:    => TRAINING: Victoria [-CITY, -REGION, +RIVER, +ISLAND, +MOUNTAIN, -COUNTRY]

Initial system – evaluation
  Class      Precompiled gazetteer   Positives only
  CITY       74.3%                   50.3%
  ISLAND     95.8%                   94.1%
  RIVER      88.8%                   91.0%
  MOUNTAIN   88.7%                   89.3%
  COUNTRY    98.8%                   99.6%
  REGION     82.3%                   86.9%
  average    88.1%                   85.2%

Bootstrapping with positives only – evaluation
  Class      1st loop   2nd loop
  CITY       39.3%      44.1%
  ISLAND     94.5%      95.8%
  RIVER      91.2%      91.1%
  MOUNTAIN   90.1%      91.2%
  COUNTRY    98.7%      99.6%
  REGION     86.5%      81.6%
  average    83.4%      83.9%

New items
New ISLANDs:
  true islands            121   (90.3%)
    found in the atlases   93
    not found              28
  descriptions              5   (3.7%)
  parts of names            3   (2.2%)
  mistakes                  5   (3.7%)
  all                     134

Conclusion
Advantages of our approach:
- very little manually collected data required (seed gazetteer)
- no sophisticated engineering – patterns produced automatically
- on-line classifiers provide negative information and are applicable to any entity
- new items (off-line gazetteer) collected automatically

ToDo
- new classes -> hierarchy
- multi-word expressions
- more elaborate learning from positive examples
- determine locations (where is X?)