Extracting Metadata for Spatially-Aware Information Retrieval on the Internet. Paul Clough, University of Sheffield, UK. Presented by Mayank Singh.

Presentation transcript:

Overview: The importance of the experiment. Introduction to SPIRIT and GATE. Techniques employed: geo-parsing and geo-coding. Pros. Cons. What it leads to.

The importance of the experiment: A novel system. Extraction of geospatial information from Web documents. Annotation of the retrieved documents with spatial data. Use of the annotated documents to power a working geographic information retrieval (GIR) system.

How it works (summary): Extracting geospatial references from a document involves two steps: identifying geographic references (geo-parsing) and assigning them spatial co-ordinates (geo-coding). Factors influencing both steps: speed, reliability, flexibility and multilingualism.

Introduction to SPIRIT (Spatially-Aware Information Retrieval on the Internet): The main aim of the project is to create tools and techniques that help people find information relating to specified geographical locations.

A 1 TB crawl of about 9 million web documents, focused on the UK, Germany, France and Switzerland. Supported by an ontology of places. Relevance ranking of web documents caters to two needs: documents that refer to a place of interest, and digital geospatial resources.

GATE (General Architecture for Text Engineering): A Java suite for natural language processing tasks, widely used in information extraction. ANNIE (A Nearly-New Information Extraction system), the information extraction component of GATE, is central to this experiment and is employed by SPIRIT.

ANNIE components: Tokenizer. Gazetteer. Sentence splitter. Part-of-speech tagger. Named-entity transducer.
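To make the chain of components concrete, here is a minimal sketch in plain Python of three of the stages listed above (tokenizer, sentence splitter, gazetteer lookup) passing offset-based annotations along. It does not use the GATE or ANNIE API; the `Annotation` class, the function names and the regular expressions are all assumptions made for illustration, and the part-of-speech tagger and named-entity transducer are left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    kind: str   # e.g. "Token", "Sentence", "Lookup"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str


def tokenize(text):
    """Split text into word and punctuation tokens, keeping offsets."""
    return [Annotation("Token", m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def split_sentences(text):
    """Naive splitter (assumption): a sentence ends at '.', '!' or '?'."""
    return [Annotation("Sentence", m.start(), m.end(), m.group())
            for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]?", text)]


def gazetteer_lookup(tokens, gazetteer):
    """Mark single tokens whose text appears in the gazetteer set."""
    return [Annotation("Lookup", t.start, t.end, t.text)
            for t in tokens if t.text in gazetteer]


def run_pipeline(text, gazetteer):
    """Run the stages in order; later stages reuse earlier annotations."""
    tokens = tokenize(text)
    return {
        "tokens": tokens,
        "sentences": split_sentences(text),
        "lookups": gazetteer_lookup(tokens, gazetteer),
    }


result = run_pipeline("Sheffield is in England. London is bigger.",
                      {"Sheffield", "England", "London"})
```

Keeping character offsets on every annotation is what lets a later stage, such as a rule-based transducer, refer back to the exact span in the original document.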

Spatial Markup. Sources of spatial markup: OS, Ordnance Survey (UK, point). TGN, Getty Thesaurus of Geographic Names (global, point). SABE, Seamless Administrative Boundaries of Europe (Europe, polygon).

Geo-Parsing: Named-entity recognition combines lists and rules. List lookup on its own is inefficient, so a gazetteer lookup is done first and contextual evidence is then used to confirm the match. JAPE (Java Annotation Patterns Engine): rules are defined in terms of the entities identified within GATE. Rules are language independent (using the Systran system).
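The two-stage idea (gazetteer lookup first, contextual evidence second) can be sketched in a few lines of Python. This is not JAPE and not the rule set used in SPIRIT: the gazetteer entries, the list of locative cue words and the rule "a match counts only when a cue word precedes it" are hypothetical, chosen only to show the shape of the technique.

```python
import re

# Hypothetical gazetteer entries, for illustration only.
GAZETTEER = {"Sheffield", "Reading", "Bath"}
# Hypothetical context cues: words that often precede a place name.
LOCATIVE_CUES = {"in", "near", "at", "from", "to"}


def geo_parse(text):
    """Return (name, start, end) for gazetteer matches backed by context."""
    tokens = [(m.group(), m.start(), m.end())
              for m in re.finditer(r"\w+", text)]
    hits = []
    for i, (word, start, end) in enumerate(tokens):
        if word not in GAZETTEER:
            continue  # stage 1: gazetteer lookup
        prev = tokens[i - 1][0].lower() if i > 0 else ""
        if prev in LOCATIVE_CUES:  # stage 2: contextual evidence
            hits.append((word, start, end))
    return hits


text = "Reading maps is fun, but she lives in Reading near Bath."
hits = geo_parse(text)
```

In this sketch the sentence-initial "Reading" is rejected because nothing locative precedes it, while "in Reading" and "near Bath" are kept. A rule this strict would also drop genuine place names that appear without a cue word, which is the usual trade-off between precision and recall in rule-based recognition.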

Hurdles faced: Filtering out commonly used words, especially those used in a non-geographical sense. Using a person-name list to resolve ambiguity between place names and personal names.
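A small Python sketch of the two filters named above, applied to a token that has already matched the gazetteer. The word lists, the title list and the three rules are assumptions for illustration; the slides do not say which lists or heuristics SPIRIT actually used.

```python
# Hypothetical lists, for illustration only.
COMMON_WORDS = {"bath", "reading", "march", "nice"}
PERSON_FIRST_NAMES = {"chelsea", "florence", "jack"}
PERSON_TITLES = {"mr", "mrs", "ms", "dr"}


def keep_candidate(tokens, i):
    """Decide whether tokens[i], already matched in a gazetteer, is kept."""
    word = tokens[i]
    prev = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
    nxt = tokens[i + 1] if i + 1 < len(tokens) else ""
    if word.islower() and word in COMMON_WORDS:
        return False  # lowercase common word, e.g. "a hot bath"
    if prev in PERSON_TITLES:
        return False  # "Dr. Sheffield" is more likely a person
    if word.lower() in PERSON_FIRST_NAMES and nxt[:1].isupper():
        return False  # first name followed by a capitalised surname
    return True


tokens = ("Dr. Sheffield met Florence Nightingale in Sheffield "
          "after a hot bath").split()
kept = [i for i, tok in enumerate(tokens)
        if tok.lower() in {"sheffield", "florence", "bath"}
        and keep_candidate(tokens, i)]
```

Only the second "Sheffield" (after "in") survives here. The first-name rule is deliberately crude: it would also reject a place name that happens to be followed by another capitalised word, so a real system would need stronger evidence than capitalisation alone.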

Geo-Coding: Gazetteer lookup assigns co-ordinates. Ambiguity in place names is removed using the feature hierarchy and feature type provided by OS. Actual grounding is done with SABE and OS. TGN is used to resolve global ambiguity.
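One way to picture disambiguation by feature hierarchy and feature type is the Python sketch below. It assumes each gazetteer candidate carries a feature type and a parent region, prefers a candidate whose parent is mentioned elsewhere in the document, and breaks ties by a type ranking. The `Candidate` fields, the `TYPE_RANK` order, the sample entries and their approximate co-ordinates are all hypothetical and do not reflect the actual OS or SABE schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    feature_type: str  # e.g. "city", "town", "village", "farm"
    parent: str        # containing region in the feature hierarchy
    lat: float
    lon: float


# Hypothetical preference order: a lower rank means a more prominent feature.
TYPE_RANK = {"city": 0, "town": 1, "village": 2, "farm": 3}


def disambiguate(candidates, context_parents):
    """Pick one candidate: prefer a parent region seen elsewhere in the
    document, then the more prominent feature type."""
    def key(c):
        shares_parent = c.parent in context_parents
        return (not shares_parent,
                TYPE_RANK.get(c.feature_type, len(TYPE_RANK)))
    return min(candidates, key=key)


# Illustrative entries with approximate co-ordinates.
newports = [
    Candidate("Newport", "city", "Wales", 51.58, -3.00),
    Candidate("Newport", "town", "Isle of Wight", 50.70, -1.29),
]
with_context = disambiguate(newports, {"Isle of Wight"})
without_context = disambiguate(newports, set())
```

With "Isle of Wight" mentioned in the document the town is chosen; with no contextual region the sketch falls back to the more prominent feature type and picks the city.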

Experimental Setup: Total annotated collection of about 8.8 million pages. 22 of the top 50 domains are from Europe. About 1.6 million documents containing 5-10 unique footprints were selected. A further 10% was chosen from these, and then only those from the UK (130). All geographic names (1864) were manually identified and stored as the benchmark.

Geo-parsing Results SPIRIT + SABE + OS: Correct – 1340 Missing – 479 False Hits – 596 Precision – Recall – F1 –
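For readers unfamiliar with the three measures named on this slide, the sketch below shows the standard definitions applied to counts of correct, missing and false hits. It assumes precision is correct / (correct + false hits) and recall is correct / (correct + missing); the slide does not state which denominators the paper used, so figures computed this way need not match the values reported there.

```python
def precision_recall_f1(correct, missing, false_hits):
    """Standard definitions (assumed here, not confirmed by the slide):
    precision = correct / (correct + false_hits)
    recall    = correct / (correct + missing)
    F1        = harmonic mean of precision and recall
    """
    found = correct + false_hits
    expected = correct + missing
    precision = correct / found if found else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts taken from the slide above.
p, r, f1 = precision_recall_f1(correct=1340, missing=479, false_hits=596)
```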

Geo-Coding Results: TGN was ineffective due to its global scope: 1021 found, 68% ambiguous. UK SABE performed well: 942 found, 11% ambiguous places assigned a UID correctly, meaning not only the correct geographic sense but the correct resource order too.

Conclusions: Promising, with a success rate of 89%. Geo-parsing can be improved by enhancing gazetteer matching methods and filtering out non-geographic entries. Geo-coding can be improved by finding better methods for combining geographic resources.

Pros: Novel system with a high success rate. A step towards a geospatial search engine. Spatial markup resources are abundant.

Cons: Geographical ambiguity. Matching the correct geographical sense. Large overhead required to build such systems. Inherent NLP problems.

What it all leads to: Creating a geographical ontology to assist in GIR (Challenges and Resources for Evaluating Geographical IR; Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal). More focused local and topical search (Urban Web Crawling; Dirk Ahlers, OFFIS Institute for Information Technology, Oldenburg, Germany; Susanne Boll, University of Oldenburg, Germany).

References:
Extracting Metadata for Spatially-Aware Information Retrieval on the Internet - Paul Clough
GATE -
SPIRIT -
Challenges and Resources for Evaluating Geographical IR - Bruno Martins, Mário J. Silva and Marcirio Silveira Chaves, Faculdade de Ciências da Universidade de Lisboa, Lisboa, Portugal
Urban Web Crawling - Dirk Ahlers, OFFIS Institute for Information Technology, Oldenburg, Germany; Susanne Boll, University of Oldenburg, Germany