Information and Communication Technologies 1 The place of place in geographical IR Diana Santos Marcirio Silveira Chaves Linguateca - www.linguateca.pt.

Slides:



Advertisements
Similar presentations
Six Steps to Effective Library Research
Advertisements

Chapter 5: Introduction to Information Retrieval
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
A Geographic Knowledge Base for Semantic Web Applications Marcirio Silveira Chaves Mário J. Silva Bruno Martins 20º Brazilian Symposium on Databases -
So What Does it All Mean? Geospatial Semantics and Ontologies Dr Kristin Stock.
Reasoning Methodologies in Information Technology R. Weber College of Information Science & Technology Drexel University.
Automatic Acquisition of Fuzzy Footprints Steven Schockaert, Martine De Cock, Etienne E. Kerre.
Research topics Semantic Web - Spring 2007 Computer Engineering Department Sharif University of Technology.
A Framework for Ontology-Based Knowledge Management System
The XLDB Group at GeoCLEF 2005 Nuno Cardoso, Bruno Martins, Marcírio Chaves, Leonardo Andrade, Mário J. Silva
The Unreasonable Effectiveness of Data Alon Halevy, Peter Norvig, and Fernando Pereira Kristine Monteith May 1, 2009 CS 652.
A Probabilistic Framework for Information Integration and Retrieval on the Semantic Web by Livia Predoiu, Heiner Stuckenschmidt Institute of Computer Science,
Retrieving Documents with Geographic References Using a Spatial Index Structure Based on Ontologies Database Laboratory University of A Coruña A Coruña,
Information Retrieval Concerned with the: Representation of Storage of Organization of, and Access to Information items.
Cláudio Baptista, UFCG A Model for Geographic Knowledge Extraction on Web Documents Cláudio E. C. Campelo and Cláudio de Souza.
Coolheads Consulting Copyright © 2003 Coolheads Consulting The Internal Revenue Service Tax Map Michel Biezunski Coolheads Consulting New York City, USA.
Shared Ontology for Knowledge Management Atanas Kiryakov, Borislav Popov, Ilian Kitchukov, and Krasimir Angelov Meher Shaikh.
Computer comunication B Information retrieval. Information retrieval: introduction 1 This topic addresses the question on how it is possible to find relevant.
A Framework for Named Entity Recognition in the Open Domain Richard Evans Research Group in Computational Linguistics University of Wolverhampton UK
Toward Semantic Web Information Extraction B. Popov, A. Kiryakov, D. Manov, A. Kirilov, D. Ognyanoff, M. Goranov Presenter: Yihong Ding.
 Official Site: facility.org/research/evaluation/clef-ip-10http:// facility.org/research/evaluation/clef-ip-10.
Modeling (Chap. 2) Modern Information Retrieval Spring 2000.
Query Relevance Feedback and Ontologies How to Make Queries Better.
Research paper: Web Mining Research: A survey SIGKDD Explorations, June Volume 2, Issue 1 Author: R. Kosala and H. Blockeel.
Automatic Lexical Annotation Applied to the SCARLET Ontology Matcher Laura Po and Sonia Bergamaschi DII, University of Modena and Reggio Emilia, Italy.
Author: William Tunstall-Pedoe Presenter: Bahareh Sarrafzadeh CS 886 Spring 2015.
Artificial intelligence project
DIRECTORY OF EXISTING PROFESSIONAL AND TECHNICAL QUALIFICATIONS IN THE EU (Guy Van Gyes, Tom Vandenbrande, Ellen Schryvers) Budapest, June 12 & 13, 2003.
1 CS 430 / INFO 430 Information Retrieval Lecture 2 Text Based Information Retrieval.
Context-based Search in Topic Centered Digital Repositories Christo Dichev, Darina Dicheva Winston-Salem State University Winston-Salem, N.C. USA {dichevc,
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
A Markov Random Field Model for Term Dependencies Donald Metzler W. Bruce Croft Present by Chia-Hao Lee.
WebMining Web Mining By- Pawan Singh Piyush Arora Pooja Mansharamani Pramod Singh Praveen Kumar 1.
Copyrighted material John Tullis 10/17/2015 page 1 04/15/00 XML Part 3 John Tullis DePaul Instructor
Xiaoying Gao Computer Science Victoria University of Wellington Intelligent Agents COMP 423.
EU Project proposal. Andrei S. Lopatenko 1 EU Project Proposal CERIF-SW Andrei S. Lopatenko Vienna University of Technology
Extracting Metadata for Spatially- Aware Information Retrieval on the Internet Clough, Paul University of Sheffield, UK Presented By Mayank Singh.
Information Retrieval Model Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.
BAA - Big Mechanism using SIRA Technology Chuck Rehberg CTO at Trigent Software and Chief Scientist at Semantic Insights™
1 Everyday Requirements for an Open Ontology Repository Denise Bedford Ontolog Community Panel Presentation April 3, 2008.
Comparing and Ranking Documents Once our search engine has retrieved a set of documents, we may want to Rank them by relevance –Which are the best fit.
Evaluating Semantic Metadata without the Presence of a Gold Standard Yuangui Lei, Andriy Nikolov, Victoria Uren, Enrico Motta Knowledge Media Institute,
Semantic Visualization What do we mean when we talk about visualization? - Understanding data - Showing the relationships between elements of data Overviews.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
BioRAT: Extracting Biological Information from Full-length Papers David P.A. Corney, Bernard F. Buxton, William B. Langdon and David T. Jones Bioinformatics.
Information and Communication Technologies Linguateca University of São Paulo ICMC / NILC 1 Yes, user! compiling a corpus according to what the user wants.
SKOS. Ontologies Metadata –Resources marked-up with descriptions of their content. No good unless everyone speaks the same language; Terminologies –Provide.
Next Generation Search Engines Ehsun Daroodi 1 Feb, 2003.
Ontology Mapping in Pervasive Computing Environment C.Y. Kong, C.L. Wang, F.C.M. Lau The University of Hong Kong.
Metadata Common Vocabulary a journey from a glossary to an ontology of statistical metadata, and back Sérgio Bacelar
A Repetition Based Measure for Verification of Text Collections and for Text Categorization Dmitry V.Khmelev Department of Mathematics, University of Toronto.
Of 33 lecture 1: introduction. of 33 the semantic web vision today’s web (1) web content – for human consumption (no structural information) people search.
LINGUATECA FLUP/CLUP The Corpógrafo – a Web-based environment for corpora research extract Term Candidates.
Using a Named Entity Tagger to Generalise Surface Matching Text Patterns for Question Answering Mark A. Greenwood and Robert Gaizauskas Natural Language.
Thomas Mandl: GeoCLEF Track Overview Cross-Language Evaluation Forum (CLEF) Thomas Mandl, (U. Hildesheim) 8 th Workshop.
Information Retrieval
Semantic web Bootstrapping & Annotation Hassan Sayyadi Semantic web research laboratory Computer department Sharif university of.
Web Information Retrieval Prof. Alessandro Agostini 1 Context in Web Search Steve Lawrence Speaker: Antonella Delmestri IEEE Data Engineering Bulletin.
Xiaoying Gao Computer Science Victoria University of Wellington COMP307 NLP 4 Information Retrieval.
Semantic Web Technologies Readings discussion Research presentations Projects & Papers discussions.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Text Based Information Retrieval
6 ~ GIR.
Presented by: Hassan Sayyadi
Multimedia Information Retrieval
Presentation 王睿.
CS246: Information Retrieval
Information Retrieval
Presentation transcript:

Information and Communication Technologies 1 The place of place in geographical IR Diana Santos Marcirio Silveira Chaves Linguateca - Workshop on Geographic Information Retrieval, SIGIR 2006 Seattle, August 10th

Information and Communication Technologies 2 Structure of the presentation Purpose of this paper Discussion of language matters concerning place Present work on the measurement of the above issues Preliminary work done Further goals

Information and Communication Technologies 3 Purpose of the paper argue for natural language knowledge to inform GIR applications challenging assumptions routinely made about geographical information in texts discussing the notion of geographical ontology presenting several problems with the current modelling of geographical information provide preliminary data that measures or assesses the several points described, implicitly providing support for those claims

Information and Communication Technologies 4 What is natural language (processing)? Main features vagueness context-dependent implicit knowledge evolves/dynamic/creative Different natural languages different world view different glue/implicit Natural language is the oldest and most successful knowledge representation language Used for comunication, negotiation, and reason (->logic)

Information and Communication Technologies 5 What is NL processing? Using computers to do things with natural language To be useful for humans Most intelligent human tasks involve language as center (communicating, teaching, converting) as periphery (mathematics papers, medical diagnosis) Daily tasks writing (and creating or conveying information or affection) reading (and finding information) translating (and mediating) teaching and learning and documenting Enormous political impact

Information and Communication Technologies 6 GIR is NLP where place is important GIR is mainly retrieving of texts giving special attention to location The semantics of place is therefore important for GIR often provided as a map What are places? how are they called? how are they related? what reasoning they allow? is what people are after when formalizing the geographical part of IR formalization in NLP is often done resorting to ontologies -> geo-ontologies actualized by information extraction mechanisms (such as NER, ontology population) -> extracting geo info from texts made by assigning “semantic/extralinguistic” values to linguistic expressions -> assigning coordinates to names

Information and Communication Technologies 7 Start: What is known about (place in) NL? Why reinvent the wheel when all these subjects have already been looked at? OR If one starts where the others stopped progress in the field should be quicker! GIR is just another instantiation of the eternal problems of NL understanding, why proceeed as if it were a new discipline? OR why proceed ignoring how NL works?

Information and Communication Technologies 8 Place in NL is... dependent on language and culture and time and politics Malvinas or Falkland Spain or Iberian Peninsula Gaza strip: Israel or Palestina UK or England vague in Lisbon, in Portugal, in Europe near depends on its argument(s) used metonymically to refer to a people, a group of representatives of a state, its government, etc.

Information and Communication Technologies 9 Place in NL is... (2) context-dependent of the style/genre of the text of the purpose of the text of the intended readers of the text of the purpose of the reader/information seeker of whether the full location has already been mentioned Reguengos de Monsaraz é... Quando estive em Reguengos,... ambiguous the same place name can denote different places the same name can denote places and non-places (Porto, Faro)

Information and Communication Technologies 10 A geo-ontology in Web IR Formalizes knowledge about geography that is valid in a large collection of texts... OK, but texts have different purposes different readers different languages and cultures Are we talking about a shared consensus ??? Or are we attempting an information structure that encompasses, in its generality, all the inconsistencies and differences recognized above? Or, we are in fact only interested in a extremely tiny bit of place information, namely the one related with physical navigation for consumers? (restaurants in Switzerland, shoes in Genebra)

Information and Communication Technologies 11 Let us try to measure these phenomena 1.Given an administrative geo-ontology how often it is ambiguous (geo/geo and geo/non-geo) how often it is present in NL texts 2.Given a set of place names found in a sizeable text collection how often they are present in an administrative geo-ontology how often they refer to places what kinds of texts include locations 3.Given an NE annotated collection for evaluation purposes how often place names are used metonymically overlap with the geo-ontology 4.Given a set of geographical topics how often place depends on other issues

Information and Communication Technologies 12 Let us try to measure these phenomena (2) 5.Given a set of user logs of a Web search engine which are place names and why they were used its overlap with a geo-ontology 6.Given a set of annotated texts study the distribution of place names study the correlation with kind of text study the co-occurrence with other place names and with other names Work in progress on Portuguese... in several directions, some of these will be presented here.

Information and Communication Technologies 13 First results: projection of Geo-Net on the Web flatten Geo-Net (extract all name-class pairs: Lisboa-província) compute their overlap (total and partial) (Table 1) Total: Lisboa (city) and Lisboa (province); Castelo Branco (district and city) Partial: Monsaraz and Reguengos de Monsaraz; Castelo and Castelo Branco study their occurrence in WPT03 – BACO (a snapshot of the Portuguese Web in 2003 as indexed by tumba!) (Chaves & Santos 2005) using frequency lists (with capitalization distinctions) for one-word names using n-gram searches for MW-names checking co-occurrence with non-capitalized words

Information and Communication Technologies 14 First results: study of locations in Web text Using SIEMÊS to NE-annotate a subset of WPT03 (Chaves & Santos05) presence of the locations in Geo-NET-PT-01: only 10% of types were there presence of the locations in GKB-ML (also covering the Portuguese names of cities outside Portugal): only 3.18% of the types were there what kinds of locations appear on the Web how often are location names included in PEOPLE or ORGANIZATION names already automatically disambiguated in context : 31% of the types of PEOPLE names contain a word in Geo-Net-PT01; 23% of the types of ORGANIZATION names contain a word in Geo-Net-PT01

Information and Communication Technologies 15 First results: study of locations in text study their occurrence in HAREM golden colection (Table 3) presence of locations in Geo-NET-PT01: 30% of types were there GKB-ML: 25% of types were there what kind of locations appear in text 76% ADMINISTRATIVE, 11% LATU SENSU 6.6% PHYSICAL, 4.3% VIRTUAL, 0.01% ADDRESS how often are location names included in PEOPLE or ORGANIZATION names already manually disambiguated in context: 1.85% of the types of PEOPLE names contain a word in Geo-Net-PT01 3.5% of the types of ORGANIZATION names contain a word in Geo-Net-PT01

Information and Communication Technologies 16 First results: theoretical analysis of GeoCLEF topics and their extensions 1.non geographic subject restricted to a place (music festivals in Germany) 2.geographic subject with non-geographic restriction (rivers with vineyards) 3.geographic subject restricted to a place (cities in Germany) 4.non-geographic subject associated to a place (independence, concern, economic handlings to favour/harm that region, etc.) (independence of Quebec, love of Peru) GeoCLEF 2005 and GeoCLEF 2006 GeoCLEF 2006

Information and Communication Technologies 17 First results: theoretical analysis of GeoCLEF topics and their extensions (cont.) 5.non-geographic subject that is a complex function of place (place is a function of topic) (European football cup matches, winners of Eurovision Song Contest) 6.geographical relations among places (how is the Himalayan related to Nepal?) 7.geographical relations among events (Did Waterloo occur more north than the battle of X? Were the findings of Lucy more to the south than those of the Cromagnon in Spain?) 8.relations between events which require their precise localization (was it the same river that flooded last year and in which killings occurred in the XVth century?)

Information and Communication Technologies 18 First results: use of place names in (con)text, in the HAREM NER annotated golden collection Portugal... is a big consumer of football shows (its population)... sent peace-keeping troops to East Timor (its government)... has emmigrants all over the world (socio-ideological-legal being)... lost the match against Germany (its football team) VERSUS really PLACE I was born in Portugal Portugal has the best beaches of the world These prehistoric relics were found in Portugal

Information and Communication Technologies 19 Concluding remarks Many of the properties of NL have been recognized or rediscovered in GIR before, we do not claim to be the first to point them! Still, it seems relevant to show that they derive naturally from the way NL is and are not at all specific to GIR The measurement of many of these issues still remains to be done. This paper simply argues for its need, presents some preliminary results for Portuguese, and hopes that studies for other languages will soon appear.

Information and Communication Technologies 20 Background on the resources used if there are related questions...

Information and Communication Technologies The golden collection of HAREM (PT) 257 (127) documents, with 155,291 (68,386) words and 9,128 (4,101) manually annotated NEs Places: 2157 (980) tokens, 979 (462) types

Information and Communication Technologies 22 GeoNet e GKB-ML StatisticGeo-Net-PT01GKB-ML # of features418,06512,293 # of relationships419,86712,258 # of part-of relationships418,340 (99.83%)12,245 (99,89%) # of equivalence relationships395 (0.09%)2,501(20,40%) # of adjacency relationships1,132 (0.27%)13 (0.10%) Avg. broader features per feature Avg. narrower features per feature Avg. equivalent features per feature with equivalent Avg. adjacent features per feature with adjacent # of features without ancestors3 (0.00%)1(0.00%) # of features without descendants374,349 (89.54%)12,045 (97,98%) # of features without equivalent417,867 (99.95%)11,819 (96,14%) # of features without adjacent417,739 (99.92%)12,291 (99,99%)

Information and Communication Technologies 23 WPT 03 and BACO WPT03 a snapshot of the Portuguese Web in May-June 2003 as indexed by tumba! 3,6 million documents (3,4 in HTML) 854,936,550 tokens, 4,243,875 types 1,150,000 log entries (searches for six months in 2003) BACO: conversion to mySQL format of the text after duplicate removal

Information and Communication Technologies 24 SIEMÊS NER for Portuguese (Sarmento 2006, Sarmento et al. 2006) Based on a large gazetteer, REPENTINO (450,000 instances, ~150,000 locations) and on similarity rules 64.09% precision and 69.83% recall in HAREM for LOCAL 61.3% precision and 56.7% recall in miniHAREM