Cláudio Baptista, UFCG A Model for Geographic Knowledge Extraction on Web Documents Cláudio E. C. Campelo and Cláudio de Souza.

Presentation transcript:

A Model for Geographic Knowledge Extraction on Web Documents
Cláudio E. C. Campelo and Cláudio de Souza Baptista
University of Campina Grande, Computer Science Department, Information Systems Laboratory
SECOGIS – ER 2009, Gramado, RS, Brazil, 13th November 2009

Agenda
- Introduction
- Main Challenges
- Detection of Geographic References
- The Geographic Scope
- GeoSEn Prototype
  - Architecture
  - GUI
- Experiments
- Conclusion and Future Work

Introduction
- The Web needs search that takes the geographic context into account.
- Traditional search engines match on keywords only.
- Example:
  - A Web document: "...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created..."
  - User query: "Java programmer jobs Brazil"
- The document will not be retrieved by this query: it never mentions "Brazil", even though Gramado is a Brazilian city.

Introduction
- What is the geographic context of a Web document?
  - The place where the information was created?
  - The places mentioned in the document content?
  - The location of the people most interested in the information?
  - etc.
- Many documents have this context. A study in Portugal, which considered only occurrences of the names of Portuguese cities (308 in total), reported:
  - About 4 million pages analyzed.
  - 2.2 references per document.
  - 4% of the submitted queries referred to one of those cities.

Main Challenges
- Detection of geographic references in documents.
- Modeling of the geographic scope of documents.
- Relevance ranking according to geographic context.
- Efficient indexing techniques that cope with both the textual and the spatial dimensions.
- User interfaces that remain usable while dealing with both dimensions.

Detection of Geographic References
- Aim: to identify document features that may be mapped to a geographic place name.
- Challenge: elimination of ambiguities, e.g.:
  - Places named after a thing (e.g. Gramado, Canela).
  - Places named after a person (e.g. Garibaldi).
  - Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS).
  - Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro).
  - Places and gentilics with the same name (e.g. the city of Paulista-PE and "paulista", a person born in São Paulo).

Detection of Geographic References
Another example of ambiguity:
- São Paulo as a state
- São Paulo as a city
- São Paulo as a football team
- São Paulo as the name of a hospital
- São Paulo as the saint!

Detection of Geographic References
- Parts of the page examined for references: page content, page title, URL.
- Types of detected places: all levels of the spatial hierarchy (from city to region).
- Types of detected references: place names, postal codes, telephone area codes, gentilics.

Definitions
- Confidence Rate (CR): the probability that a given reference is a valid place name.
- Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.
- [Diagram: the confidence factors CF_1 ... CF_N feed into CR.]
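The slides do not state how the confidence factors are combined into CR. As an illustration only, the sketch below assumes a weighted mean of factors that each lie in [0, 1]; the function name `confidence_rate` and the weights are hypothetical and are not taken from GeoSEn.

```python
from typing import Mapping


def confidence_rate(factors: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Combine confidence factors (each in [0, 1]) into a single CR in [0, 1].

    ASSUMPTION: a weighted mean. The slides only show that CF_1 ... CF_N
    contribute to CR; they do not give the actual formula.
    """
    total_weight = sum(weights[name] for name in factors)
    if total_weight == 0:
        return 0.0  # no evidence available for this reference
    weighted_sum = sum(factors[name] * weights[name] for name in factors)
    return weighted_sum / total_weight
```

With equal weights, a reference scoring 1.0 on one factor and 0.5 on another would get a CR of 0.75 under this assumed scheme.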

Confidence Factor
- CF_ST: analyzes the occurrence of special terms (STs) associated with geographic references.
  - Examples of STs: "in" (e.g. "in Gramado"); "city" (e.g. "city of São Paulo"); "ZIP" (e.g. "ZIP: ").
  - Data stored for each special term:
    - Term
    - Type of geographic reference (zip code, telephone area code, place name, etc.)
    - Type of place (city, state, region)
    - Minimum distance (D_MIN)
    - Maximum distance (D_MAX)
    - Maximum confidence grade (C_MAX)
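The special-term record described above could be sketched as follows. The field names mirror the slide, but everything else is an assumption made for illustration: distance is counted in tokens between the special term and the candidate, and the score decays linearly from C_MAX as the distance grows. The slides do not say how D_MIN, D_MAX and C_MAX are actually used.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SpecialTerm:
    term: str            # e.g. "in", "city", "zip" (lowercase)
    reference_type: str  # zip code, telephone area code, place name, ...
    place_type: str      # city, state, region
    d_min: int           # minimum distance to the candidate, in tokens
    d_max: int           # maximum distance to the candidate, in tokens
    c_max: float         # maximum confidence grade


def cf_st(tokens: Sequence[str], candidate_index: int,
          special_terms: Sequence[SpecialTerm]) -> float:
    """Return the best special-term score for the token at candidate_index.

    ASSUMPTION: only terms preceding the candidate are considered, and the
    score falls linearly from c_max at d_min towards zero beyond d_max.
    """
    best = 0.0
    for st in special_terms:
        for i, token in enumerate(tokens[:candidate_index]):
            if token.lower() != st.term:
                continue
            distance = candidate_index - i
            if st.d_min <= distance <= st.d_max:
                span = st.d_max - st.d_min + 1
                score = st.c_max * (1 - (distance - st.d_min) / span)
                best = max(best, score)
    return best
```

For "the city of Gramado", a hypothetical entry `SpecialTerm("city", "place name", "city", 1, 3, 0.9)` sits two tokens before the candidate, so the candidate gets a reduced score rather than the full C_MAX.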

Confidence Factor
- CF_TS: considers the probability that a term is a geographic reference, estimated using a traditional search engine.

Confidence Factor
- CF_CROSS: analyzes the occurrence of cross references based on topological relationships (inside, contains, etc.).
- CF_FMT: evaluates the syntax used to describe the geographic references:
  - Abbreviation of place names (R. de Janeiro, RJ)
  - Use of uppercase in place names
  - Telephone format, e.g. (083)
  - Postal code format
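A format check of the kind CF_FMT describes might look like the sketch below. The regular expressions, the list of Portuguese connectives and the 0-to-1 scoring are assumptions for illustration; the slides list the features examined but not how they are scored. The postal code pattern follows the Brazilian CEP layout of five digits, an optional hyphen and three digits.

```python
import re

# Brazilian postal code (CEP): five digits, optional hyphen, three digits.
CEP_PATTERN = re.compile(r"^\d{5}-?\d{3}$")
# Area code in parentheses, with an optional leading zero, e.g. "(083)".
AREA_CODE_PATTERN = re.compile(r"^\(\s*0?\d{2}\s*\)$")
# Lowercase connectives that are normal inside Portuguese place names.
CONNECTIVES = {"de", "do", "da", "dos", "das"}


def cf_fmt(text: str, reference_type: str) -> float:
    """Score how well `text` matches the expected format, from 0.0 to 1.0.

    ASSUMPTION: binary scores for codes, and the share of capitalised
    significant words for place names.
    """
    if reference_type == "postal code":
        return 1.0 if CEP_PATTERN.match(text) else 0.0
    if reference_type == "telephone area code":
        return 1.0 if AREA_CODE_PATTERN.match(text) else 0.0
    if reference_type == "place name":
        significant = [w for w in text.split() if w.lower() not in CONNECTIVES]
        if not significant:
            return 0.0
        capitalised = sum(1 for w in significant if w[0].isupper())
        return capitalised / len(significant)
    return 0.0
```

Under these assumptions "Rio de Janeiro" and the abbreviated "R. de Janeiro" both score 1.0 as place names, while an all-lowercase "rio de janeiro" scores 0.0.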

Modeling of the Geographic Scope
- A document may be associated with one or more places.
- A geographic scope may include places that are not mentioned directly in the document (geographic expansion).
- Each place in the scope has an associated relevance value.
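Geographic expansion can be pictured as propagating relevance from a mentioned place to the places that contain it. The sketch below is one possible reading, not the GeoSEn algorithm: the `PARENT` table is a hand-written fragment of the hierarchy, and the halving of relevance at each level (`decay`) is an assumed rule.

```python
from typing import Dict, Optional

# Hand-written fragment of the place hierarchy (child -> containing place).
PARENT: Dict[str, Optional[str]] = {
    "Gramado": "Gramado-Canela",
    "Gramado-Canela": "Metropolitana de Porto Alegre",
    "Metropolitana de Porto Alegre": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
    "South": None,
}


def expand_scope(mentions: Dict[str, float], decay: float = 0.5) -> Dict[str, float]:
    """Add every ancestor of each mentioned place to the scope.

    ASSUMPTION: relevance is multiplied by `decay` per hierarchy level, and
    a place keeps the highest relevance it receives from any source.
    """
    scope = dict(mentions)
    for place, relevance in mentions.items():
        ancestor = PARENT.get(place)
        value = relevance * decay
        while ancestor is not None:
            scope[ancestor] = max(scope.get(ancestor, 0.0), value)
            ancestor = PARENT.get(ancestor)
            value *= decay
    return scope
```

A document that mentions only Gramado then also carries a smaller relevance for Rio Grande do Sul and the South region, so a query restricted to either of those can still retrieve it.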

Geographic Dispersion Rate
- Another factor used in composing the geographic relevance value.
- Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).

GeoSEn: an overview
Geographic Search Engine:
- Indexes a subset of the Brazilian Web.
- Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from city to region:
  - Region: e.g. South
  - State: e.g. Rio Grande do Sul
  - MesoRegion: e.g. Metropolitana de Porto Alegre
  - MicroRegion: e.g. Gramado-Canela
  - Municipality: e.g. Gramado

GeoSEn - Architecture


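Query Example
Example of a query using a user-defined area of interest. It selects the places inside the specified geometry that are not themselves contained in another place inside that geometry:

    SELECT plc1.id
    FROM places plc1
    WHERE within(plc1.geometry, specified_geometry)
      AND NOT EXISTS (
        SELECT plc2.id
        FROM places plc2
        WHERE plc2.id <> plc1.id  -- a geometry is within itself, so skip the same row
          AND within(plc2.geometry, specified_geometry)
          AND within(plc1.geometry, plc2.geometry))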
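The logic of this query can be traced with a small in-memory sketch. Axis-aligned bounding boxes stand in for real geometries, and `within` here is a simplified stand-in for the spatial predicate of the database; all names and the data layout are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def within(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely inside `outer` (boundaries included)."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and inner[2] <= outer[2] and inner[3] <= outer[3])


def top_level_places(places: Dict[str, Box], area: Box) -> List[str]:
    """Ids of places inside `area` that no other place inside `area` contains."""
    inside = {pid: box for pid, box in places.items() if within(box, area)}
    return sorted(
        pid for pid, box in inside.items()
        if not any(other != pid and within(box, other_box)
                   for other, other_box in inside.items())
    )
```

If the area of interest covers a whole state plus one city of a neighbouring state, the result is the state and that extra city; the cities inside the state are dropped because the state already represents them.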

Experiments
- Experiments used 66,531 indexed documents.
- 5 classes: .edu, .gov, blogs, tourism, arts.
- Detection of terms:
  - Documents from the Web, manually analyzed.
  - Documents with strong ambiguities, created for the test bed.

Conclusion
- We have presented a heuristic-based approach to implementing a GIR system.
- The techniques presented may be combined with other known techniques.
- Precomputed relevance values may be used to simplify the search process.

Future Work
- Retrieval of georeferenced images and videos.
- Recognition of other kinds of places.
- Integration of other data sources.
- Evaluation using large data set collections.

Thank you very much! Questions?