Extracting Metadata for Spatially-Aware Information Retrieval on the Internet. Paul Clough. Presented by Ali Khodaei, CS 572.



Outline
– Why do we care?
– Setting
– Geo-parsing (NER, improving geo-parsing)
– Geo-coding (ambiguity, improving geo-coding)
– Experiments

Why do we care? Many documents on the web contain geospatial information, including addresses, postal codes, hyperlinks and geographic references. This information can be exploited to provide spatial awareness to information systems – transport timetables, routing systems for motorists, map-based web sites, location-based services, etc.

Why do we care? A key part of providing such services is the extraction and use of geospatial information. Extracting geospatial references from documents involves two main tasks:
– Identifying geographic references: geo-parsing
– Assigning them spatial coordinates: geo-coding

Efficiency Criteria
– Speed: fast execution of geo-parsing and geo-coding programs to allow the processing of large collections within feasible timescales.
– Reliability: robust processing on typical web data with error-recovery strategies to ensure minimal manual intervention.
– Flexibility: control over the geo-parsing process, including the addition of custom gazetteer lists and creation of grammars for context matching.
– Multilingualism: the ability to process texts written in a variety of languages other than English.

Data Sources
Several sources of geographic information were available in the SPIRIT project. Each resource contains both place names and spatial information. Resources differ in:
– granularity of place names (e.g. city, state)
– quality of spatial reference (e.g. accuracy)
– type of spatial representation (e.g. point, polygon)

Data Sources
– SABE: the Seamless Administrative Boundaries of Europe dataset
– OS: the Ordnance Survey gazetteer for the UK
– TGN: the Getty Thesaurus of Geographic Names

SPIRIT Web Collection
– 94,552,870 web pages, approximately 1 TB in size.
– Approximately 9.6 million pages (10.24%) were from European domains; the rest mainly from US sites.
– In total, 1,759,681 UK, 1,556,585 German, 505,023 French and 270,715 Swiss web pages were extracted for geo-parsing.

Geo-parsing

Geo-parsing is closely related to the more general problem of Named-Entity Recognition (NER). NER is a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories (e.g. names of persons, organizations, …).

NER
Most research on NER systems has been structured as taking an unannotated block of text, such as:
"Jim bought 300 shares of Acme Corp. in 2006."
and producing an annotated block of text, such as:
"[Jim]Person bought 300 shares of [Acme Corp.]Organization in [2006]Time."

NER Most NER algorithms combine gazetteers (lists of known locations, organizations and people) with rules, which capture elements of the surrounding context.

Geo-parsing: simple lookup using gazetteers
– Pros: simple, fast, language independent, robust.
– Cons: fails when entries are not used in a geographical sense (e.g. names of people or businesses) or when variants of names are used (e.g. historical or vernacular forms); ambiguity.
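A toy sketch of the simple-lookup approach (the gazetteer entries and text below are invented, not taken from the SPIRIT resources). It shows both why lookup is fast and language independent, and its failure mode on person names:

```python
# Hypothetical gazetteer for illustration only.
GAZETTEER = {"sheffield", "bath", "york", "chapeltown"}

def gazetteer_lookup(text):
    """Return tokens that match a gazetteer entry (case-insensitive)."""
    tokens = text.replace(".", " ").replace(",", " ").split()
    return [t for t in tokens if t.lower() in GAZETTEER]

# Fast, but blind to context: "Mr. Sheffield" is wrongly matched as a place.
hits = gazetteer_lookup("Mr. Sheffield visited Bath and York.")
```

This false hit on "Mr. Sheffield" is exactly what the context rules in the next slides address.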

Improving Geo-Parsing
Improving geo-parsing (on top of the gazetteer):
a) adding context rules
b) filtering out commonly used words
c) using person name lists

Improving Geo-Parsing
a) Adding context rules
– Filters candidate locations using context rules to remove stopwords, references to people and organizations, and e-mail addresses/URLs.
– For example, "Mr. Sheffield" is filtered out as a non-geographic reference using the context rule "<title> <placename> -> null", where <title> and <placename> are placeholders for entries from the gazetteer lists.
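The context-rule filter can be sketched as follows (the title and place lists are illustrative stand-ins for the actual gazetteer lists, not the SPIRIT rule grammar):

```python
# Illustrative stand-ins for gazetteer lists.
TITLES = {"mr", "mrs", "dr", "prof"}
PLACES = {"sheffield", "bath"}

def filter_by_context(tokens):
    """Drop a candidate place name when preceded by a personal title."""
    kept = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in PLACES:
            continue
        prev = tokens[i - 1].rstrip(".").lower() if i > 0 else ""
        if prev in TITLES:   # rule: "<title> <placename> -> null"
            continue         # filtered out as non-geographic
        kept.append(tok)
    return kept
```

With this rule, "Mr. Sheffield" is dropped while a bare "Bath" survives.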

Improving Geo-Parsing
b) Filtering out commonly used words
– All geographical resources used in SPIRIT have entries for places which are more commonly used in a non-geographical sense: "New", "More", "Read", "Guide" and "Old" all appear as entries in the OS gazetteer.
– A stopword list for each geographic resource was computed by calculating the document frequency of each entry in the gazetteer.
– The top 1000 place names were manually assessed for well-known locations (e.g. "Bath" and "Derby" in the UK), which were removed from the stopword list.
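The document-frequency heuristic might be sketched like this (the corpus, gazetteer entries and threshold are invented for illustration; the actual system ranked entries by document frequency and manually vetted the top 1000):

```python
def stopwords_by_df(gazetteer, documents, df_threshold=0.5):
    """Gazetteer entries occurring in a large fraction of documents are
    probably ordinary words, not place references."""
    n = len(documents)
    stop = set()
    for entry in gazetteer:
        df = sum(1 for doc in documents if entry in doc.lower().split())
        if df / n >= df_threshold:
            stop.add(entry)
    return stop

# Tiny invented corpus: "new" and "guide" look like stopwords,
# while "york" and "bath" survive as plausible place names.
docs = ["read the new guide", "a new guide to travel", "york and bath"]
```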

Improving Geo-Parsing
c) Using person name lists
– In addition to lists of common words, geo-parsing was extended with lists of person names (both forenames and surnames) to deal with ambiguity resulting from locations being part of a personal name, e.g. "John Sheffield".

Geo-coding

After extracting candidate locations, the second stage involves assigning them spatial co-ordinates (through gazetteer matching).

Ambiguity
– Disambiguating place names with multiple spatial references: "Chapeltown" refers to a location in South Yorkshire (UK), Lancashire (UK), Kent County (USA) and Panola County (USA).
– The degree of ambiguity differs for each spatial resource: in OS, approximately 8% of place names are ambiguous; TGN exhibits the most ambiguity (59%), due to ambiguity both between countries and within a single country.

Improving Geo-Coding
The simplest (and often most effective) method to resolve referent ambiguity is to assign ambiguous places a default sense, based on:
– the most commonly occurring place
– the population of the place
– semi-automatic extraction from the Web
– the length of the hierarchy containing the location
– the feature type provided by the dataset
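A minimal sketch of default-sense assignment using the population heuristic (the senses, populations and coordinates below are invented, not real gazetteer data):

```python
# Hypothetical sense inventory for an ambiguous name.
SENSES = {
    "chapeltown": [
        {"parent": "South Yorkshire, UK", "pop": 23000, "coord": (53.46, -1.47)},
        {"parent": "Lancashire, UK",      "pop": 1100,  "coord": (53.65, -2.42)},
        {"parent": "Kent County, USA",    "pop": 500,   "coord": (39.20, -75.50)},
    ],
}

def default_sense(name):
    """Pick the sense with the largest population as the default."""
    candidates = SENSES.get(name.lower(), [])
    return max(candidates, key=lambda s: s["pop"]) if candidates else None
```

Any of the other heuristics (most common place, hierarchy length, feature type) would slot in by swapping the key function.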

Improving Geo-Coding
Other improvements:
– Overlap score: matching words between place names in the hierarchy of an ambiguous location and those found within n words either side of the name.
– Preference: senses are preferred based on their resource, e.g. SABE -> OS -> TGN.
– Country option: senses matching a command-line option specifying the country currently being processed are preferred, e.g. the preferred country is given a value of 1, the rest a value of 0.
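The overlap score can be sketched as a simple word-set intersection (the sense hierarchies and context words below are illustrative, not drawn from TGN):

```python
def overlap_score(hierarchy, context_words):
    """Count words shared between a sense's hierarchy and the local context."""
    h = {w.lower() for w in hierarchy}
    c = {w.lower() for w in context_words}
    return len(h & c)

def best_sense(senses, context_words):
    """senses: list of (sense_id, hierarchy words); pick the highest overlap."""
    return max(senses, key=lambda s: overlap_score(s[1], context_words))[0]

# Invented example: context mentioning "Yorkshire" favours the UK sense.
senses = [("uk",  ["Chapeltown", "South", "Yorkshire", "England"]),
          ("usa", ["Chapeltown", "Kent", "Delaware", "USA"])]
```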


Experiments: Geo-Parsing Results
– Reported measures: correct (C), missing (M), false hits (S), precision (P), recall (R) and the F1-measure (F1).
– Precision measures the proportion of the automatically generated annotations in the response set that match the manually assigned annotations.
– Recall measures the proportion of the manually defined annotations that are also correctly identified by the geo-parser.
– The F1 score is a single-valued summary of both precision and recall.
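These measures can be written out directly from the counts (the example counts at the end are invented, not results from the paper):

```python
def precision(c, s):
    """Correct hits over all returned hits (correct + false)."""
    return c / (c + s) if c + s else 0.0

def recall(c, m):
    """Correct hits over all true annotations (correct + missing)."""
    return c / (c + m) if c + m else 0.0

def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if p + r else 0.0

# e.g. 80 correct, 20 missing, 10 false hits (illustrative numbers)
p, r = precision(80, 10), recall(80, 20)
```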

Experiments: contribution of each heuristic to geo-parsing
– The main observation is the effect of not removing stopwords: the F1 score drops by 24%, indicating the importance of identifying and removing stopwords.

Pros and Cons
– Pros: simple and concise; based on a real system; real applications.
– Cons: considers only simple/basic cases; needs more evaluation results for geo-coding; not very research-oriented.

Questions?