Extracting Metadata for Spatially-Aware Information Retrieval on the Internet
Clough, Paul, University of Sheffield, UK
Presented by Mayank Singh
Overview
– The importance of the experiment
– Introduction to SPIRIT and GATE
– Techniques employed: geo-parsing and geo-coding
– Pros
– Cons
– What it leads to
The importance of the experiment
– A novel system.
– Extracts geospatial information from web documents.
– Annotates the retrieved documents with spatial data.
– Uses the annotated documents to power a working GIR (geographic information retrieval) system.
How does it work (summary)
Extracting geospatial references from a document involves:
– Identifying geographic references
– Assigning them spatial co-ordinates
– Factors influencing both steps: speed, reliability, flexibility and multilingualism
Introduction to SPIRIT
– SPIRIT: Spatially-Aware Information Retrieval on the Internet.
– The main aim of the project is to create tools and techniques that help people find information relating to specified geographical locations.
– A 1 TB crawl of about 9 million web documents, focused on the UK, Germany, France and Switzerland.
– Supported by an ontology of places.
– Relevance ranking of web documents, catering to the need for:
  – documents referring to some place of interest
  – digital geospatial resources
GATE
– A Java suite for natural language processing tasks, widely used for information extraction.
– ANNIE (A Nearly-New Information Extraction system) is the GATE component that SPIRIT employs and the highlight of this experiment.
ANNIE
– Tokenizer
– Gazetteer
– Sentence splitter
– Part-of-speech tagger
– Named-entity transducer
Spatial Markup
Sources of spatial markup:
– OS: Ordnance Survey (UK, point)
– TGN: Getty Thesaurus of Geographic Names (global, point)
– SABE: Seamless Administrative Boundaries of Europe (Europe, polygon)
Geo-Parsing
– Named-entity recognition: lists + rules.
– List lookup alone is inefficient.
– Approach: gazetteer lookup first, then contextual evidence to confirm the candidates.
– JAPE (Java Annotation Patterns Engine): rules defined in terms of the entities identified within GATE.
– Rules are language independent (using the Systran system).
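The gazetteer-then-context idea can be illustrated with a minimal Python sketch. It is not the JAPE rule set used in SPIRIT: the gazetteer entries, the cue words and the function names (`find_candidates`, `geo_parse`) are all hypothetical, and the single rule "a place name follows a locative preposition" stands in for much richer contextual evidence.

```python
import re

# Hypothetical toy gazetteer; a real one would be built from OS, SABE or TGN.
GAZETTEER = {"sheffield", "reading", "bath", "london"}

# Hypothetical context cues: words that often precede a place name.
LOCATION_CUES = {"in", "near", "at", "from", "to"}


def find_candidates(text):
    """Step 1: gazetteer lookup over capitalised tokens.

    Returns (token, previous_token_lowercased) pairs.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    candidates = []
    for i, tok in enumerate(tokens):
        if tok[0].isupper() and tok.lower() in GAZETTEER:
            prev = tokens[i - 1].lower() if i > 0 else ""
            candidates.append((tok, prev))
    return candidates


def geo_parse(text):
    """Step 2: keep only candidates supported by contextual evidence."""
    return [tok for tok, prev in find_candidates(text) if prev in LOCATION_CUES]


print(geo_parse("She enjoys Reading books in Sheffield near London."))
# prints ['Sheffield', 'London']
```

"Reading" matches the gazetteer but is dropped because nothing in its context suggests a place, which is the kind of case plain list lookup gets wrong.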
Hurdles faced
– Filtering out commonly used words, especially those used in a non-geographical sense.
– Using a person-name list to resolve ambiguity between place names and personal names.
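Both filters can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the filtering used in SPIRIT: the word lists and the function name `filter_candidates` are hypothetical, and the two rules (drop lowercase common words; drop a candidate that directly follows a known first name) are deliberately simplistic.

```python
# Hypothetical lists; real systems would use much larger resources.
COMMON_WORDS = {"bath", "reading", "well", "over"}
PERSON_FIRST_NAMES = {"jack", "paul", "susan"}


def filter_candidates(tokens, candidates):
    """Drop gazetteer matches that are unlikely to be places.

    tokens: list of word strings.
    candidates: indexes into tokens that matched the gazetteer.
    Returns the indexes that survive both filters.
    """
    kept = []
    for i in candidates:
        tok = tokens[i]
        # Common-word filter: a lowercase "bath" is probably not the city.
        if tok.lower() in COMMON_WORDS and not tok[0].isupper():
            continue
        # Person-name filter: "Jack London" is probably a person.
        if i > 0 and tokens[i - 1].lower() in PERSON_FIRST_NAMES:
            continue
        kept.append(i)
    return kept


tokens = "Jack London took a bath in Bath".split()
print(filter_candidates(tokens, [1, 4, 6]))
# prints [6]
```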
Geo-Coding
– Gazetteer lookup to assign co-ordinates.
– Ambiguity in place names is removed using the feature hierarchy and feature type provided by OS.
– Actual grounding is done with SABE and OS.
– TGN is used to resolve global ambiguity.
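One plausible reading of "feature hierarchy and feature type" is sketched below in Python. The slides do not spell out the actual disambiguation rules, so everything here is an assumption: the `Entry` record, the `TYPE_RANK` ordering, the `geo_code` function, the UIDs and the approximate coordinates are all illustrative. The sketch prefers a sense whose parent region is shared with an unambiguous place from the same document, then falls back to the more prominent feature type.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    uid: str
    name: str
    feature_type: str  # e.g. "city", "town", "village"
    parent: str        # enclosing region in the feature hierarchy
    lat: float
    lon: float


# Hypothetical gazetteer; coordinates are approximate and for illustration only.
GAZETTEER = [
    Entry("uk-1", "Newport", "city", "Wales", 51.58, -3.00),
    Entry("uk-2", "Newport", "town", "Isle of Wight", 50.70, -1.29),
    Entry("uk-3", "Cardiff", "city", "Wales", 51.48, -3.18),
    Entry("uk-4", "Cowes", "town", "Isle of Wight", 50.76, -1.30),
]

# Assumed preference order: more prominent feature types win ties.
TYPE_RANK = {"city": 0, "town": 1, "village": 2}


def geo_code(name, context_names=()):
    """Return the preferred gazetteer entry for name, or None if unknown."""
    senses = [e for e in GAZETTEER if e.name == name]
    if not senses:
        return None
    # Collect parent regions of unambiguous places mentioned in the same document.
    context_parents = set()
    for other in context_names:
        matches = [e for e in GAZETTEER if e.name == other]
        if len(matches) == 1:
            context_parents.add(matches[0].parent)

    def preference(entry):
        # False sorts before True, so senses sharing a parent come first.
        shares_parent = entry.parent in context_parents
        return (not shares_parent, TYPE_RANK.get(entry.feature_type, len(TYPE_RANK)))

    return min(senses, key=preference)


print(geo_code("Newport").uid)             # prints uk-1
print(geo_code("Newport", ["Cowes"]).uid)  # prints uk-2
```

With no context the city wins on feature type; with "Cowes" in the same document the hierarchy pulls the reference to the Isle of Wight.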
Experimental Setup
– Total annotated collection of about 8.8 million pages.
– 22 of the top 50 domains are from Europe.
– About 1.6 million documents containing 5-10 unique footprints were selected.
– A further 10% was chosen from these, and then only those from the UK (130).
– All geographic names (1864) were manually identified and stored as the benchmark.
Geo-parsing Results
SPIRIT + SABE + OS:
– Correct: 1340
– Missing: 479
– False hits: 596
– Precision:
– Recall:
– F1:
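For reference, the standard definitions of precision, recall and F1 can be applied to the three counts on this slide, as in the Python sketch below. The function name `precision_recall_f1` is hypothetical, and the figures it prints are derived here from those counts alone; they are not taken from the paper. Note also that correct + missing gives 1819, not the 1864 benchmark names, so the paper may have computed its measures on a different basis.

```python
def precision_recall_f1(correct, missing, false_hits):
    """Standard IR measures from true positives, false negatives, false positives."""
    precision = correct / (correct + false_hits)
    recall = correct / (correct + missing)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts from the slide (SPIRIT + SABE + OS).
p, r, f = precision_recall_f1(correct=1340, missing=479, false_hits=596)
print(f"precision={p:.3f} recall={r:.3f} f1={f:.3f}")
# prints precision=0.692 recall=0.737 f1=0.714
```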
Geo-Coding Results
– TGN was ineffective because of its global scope: 1021 found, 68% ambiguous.
– UK SABE performed well: 942 found, 11% ambiguous.
– Places assigned a UID correctly: that is, not only the correct geographic sense but the correct resource order too.
Conclusions
– Promising: a success rate of 89%.
– Geo-parsing can be improved by enhancing gazetteer matching methods and by filtering non-geographic entries.
– Geo-coding can be improved by finding better methods for combining geographic resources.
Pros
– Novel system with a high success rate.
– A step towards a geospatial search engine.
– Spatial markup resources are available in abundance.
Cons
– Geographical ambiguity.
– Matching the correct geographical sense.
– Large overhead required to build such systems.
– Inherent NLP problems.
What it all leads to
– Creating a geographical ontology to assist GIR (Challenges and Resources for Evaluating Geographical IR; Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves; Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal).
– More focused local and topical search (Urban Web Crawling; Dirk Ahlers, OFFIS Institute for Information Technology, Oldenburg, Germany; Susanne Boll, University of Oldenburg, Germany).
References
– Clough, Paul. Extracting Metadata for Spatially-Aware Information Retrieval on the Internet.
– GATE
– SPIRIT
– Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves (Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal). Challenges and Resources for Evaluating Geographical IR.
– Dirk Ahlers (OFFIS Institute for Information Technology, Oldenburg, Germany) and Susanne Boll (University of Oldenburg, Germany). Urban Web Crawling.