Cláudio Baptista, UFCG A Model for Geographic Knowledge Extraction on Web Documents Cláudio E. C. Campelo and Cláudio de Souza.

1 Cláudio Baptista, UFCG http://lsi.dsc.ufcg.edu.br
A Model for Geographic Knowledge Extraction on Web Documents
Cláudio E. C. Campelo and Cláudio de Souza Baptista
University of Campina Grande, Computer Science Department, Information Systems Laboratory
http://www.lsi.dsc.ufcg.edu.br
SECOGIS – ER 2009, Gramado – RS, Brazil, 13th November 2009

2 Agenda
- Introduction
- Main Challenges
- Detection of Geographic References
- The Geographic Scope
- GeoSEn Prototype
  - Architecture
  - GUI
- Experiments
- Conclusion and Future Work

3 Introduction
- Web: users need to search by geographic context;
- Traditional search engines: search based on keywords only;
- Example:
  - A Web document: “...With the arrival of the industry in Gramado, one thousand new jobs for Java programmers will be created...”;
  - User query: “Java programmer jobs Brazil”;
- A keyword-only engine will not retrieve this document for that query, because the text never mentions “Brazil”, even though Gramado is a Brazilian city.

4 Introduction
What is the geographic context of Web documents?
- The place where the information was created?
- The places mentioned in the document content?
- The location of the people most interested in a particular piece of information?
- etc.
Many documents have this context. A study in Portugal that considered only occurrences of the names of Portuguese cities (308 in total) found:
- About 4 million pages analyzed;
- 2.2 references per document;
- 4% of the submitted queries referred to one of those cities.

5 Main Challenges
- Detection of geographic references in the documents;
- Modeling of the geographic scope of documents;
- Relevance ranking according to geographic context;
- Need for efficient indexing techniques that cope with both the textual and the spatial dimension;
- Development of usable interfaces that deal with both dimensions.

6 Detection of Geographic References
Aim: to identify document features that may be mapped to a geographic place name.
Challenge: elimination of ambiguities, for example:
- Places named after a thing (e.g. Gramado, Canela);
- Places named after a person (e.g. Garibaldi);
- Places with the same name and the same type (e.g. Cachoeirinha-PE and Cachoeirinha-RS);
- Places with the same name and different types (e.g. the city of Rio de Janeiro and the state of Rio de Janeiro);
- Places and gentilics (demonyms) with the same name (e.g. the city of Paulista-PE and "paulista", a person born in São Paulo).
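The ambiguity cases above can be illustrated with a toy gazetteer lookup. This is only a sketch: the `Place` record, the `GAZETTEER` list, and the `candidates` function are hypothetical names introduced here, not part of GeoSEn, and the entries are limited to the examples on the slide.

```python
from collections import defaultdict
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    kind: str   # "city" or "state"
    state: str  # two-letter Brazilian state code


# Hypothetical toy gazetteer built from the slide's examples.
GAZETTEER = [
    Place("Cachoeirinha", "city", "PE"),
    Place("Cachoeirinha", "city", "RS"),
    Place("Rio de Janeiro", "city", "RJ"),
    Place("Rio de Janeiro", "state", "RJ"),
    Place("Gramado", "city", "RS"),
]

# Index entries by lowercased name so one term can map to many places.
INDEX = defaultdict(list)
for place in GAZETTEER:
    INDEX[place.name.lower()].append(place)


def candidates(term: str) -> list[Place]:
    """Return every gazetteer entry whose name matches the term, ignoring case."""
    return INDEX.get(term.lower(), [])


if __name__ == "__main__":
    for term in ("Cachoeirinha", "Rio de Janeiro", "Gramado"):
        print(term, "->", candidates(term))
```

A term with more than one candidate is ambiguous, and a term with exactly one candidate may still be a false positive (a thing or a person rather than a place), which is why a lookup alone cannot settle the question.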

7 Detection of Geographic References
Another example of ambiguity:
- São Paulo as a state
- São Paulo as a city
- São Paulo as a football team
- São Paulo as the name of a hospital
- São Paulo as the saint!

8 Detection of Geographic References
- Parts of the document examined: page content, page title, URL;
- Types of detected places: every level of the spatial hierarchy (from city to region);
- Types of detected references: place names, postal codes, telephone area codes, gentilics.

9 Definitions
- Confidence Rate (CR): the probability that a given reference is a valid place name.
- Confidence Factor (CF): a measure associated with each feature analyzed during the detection of a geographic reference.
[Figure: CR and its confidence factors CF 1 ... CF N]
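The slide does not state how the individual confidence factors are combined into the confidence rate. One simple possibility, shown here purely as an assumption, is a weighted average of factors that each lie in [0, 1]. The function name `confidence_rate`, the factor labels, and the weights are hypothetical.

```python
def confidence_rate(factors: dict[str, float], weights: dict[str, float]) -> float:
    """Combine confidence factors (each in [0, 1]) into a single rate in [0, 1].

    ASSUMPTION: a weighted average. The presentation does not give the
    actual combination rule used by the authors.
    """
    total_weight = sum(weights[name] for name in factors)
    if total_weight == 0:
        return 0.0
    weighted_sum = sum(value * weights[name] for name, value in factors.items())
    return weighted_sum / total_weight


if __name__ == "__main__":
    # Hypothetical factor values for one candidate reference.
    factors = {"ST": 0.9, "TS": 0.5, "CROSS": 1.0, "FMT": 0.6}
    equal_weights = {name: 1.0 for name in factors}
    print(confidence_rate(factors, equal_weights))
```

A weighted average keeps the result in the same range as its inputs, which makes it easy to compare against a fixed acceptance threshold.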

10 Confidence Factor
CF_ST analyzes the occurrence of special terms (STs) associated with geographic references.
- Examples of STs: “in” (e.g. “in Gramado”); “city” (e.g. “city of São Paulo”); “ZIP” (e.g. “ZIP: 58109-000”).
- Each special term is stored with:
  - Term;
  - Type of geographic reference (ZIP code, telephone area code, place name, etc.);
  - Type of place (city, state, region);
  - Minimum distance (D_MIN);
  - Maximum distance (D_MAX);
  - Maximum confidence grade (C_MAX).
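The stored fields suggest how a special-term factor could be scored, although the slide gives no formula. The sketch below assumes that distance is counted in tokens between the special term and the candidate, in either direction, and that confidence decays linearly from the maximum grade at the minimum distance. The `SpecialTerm` class, the `cf_st` function, and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpecialTerm:
    term: str        # e.g. "in", "city"
    ref_type: str    # type of geographic reference, e.g. "place name"
    place_type: str  # e.g. "city", "state", "region"
    d_min: int       # minimum distance, in tokens
    d_max: int       # maximum distance, in tokens
    c_max: float     # maximum confidence grade


def cf_st(tokens: list[str], candidate_index: int, special_terms: list[SpecialTerm]) -> float:
    """Score a candidate token by the special terms found near it.

    ASSUMPTION: linear decay from c_max at d_min down to c_max / (span + 1)
    at d_max, and zero outside the [d_min, d_max] window.
    """
    best = 0.0
    for st in special_terms:
        for i, token in enumerate(tokens):
            if token.lower() != st.term:
                continue
            distance = abs(candidate_index - i)
            if distance < st.d_min or distance > st.d_max:
                continue
            span = st.d_max - st.d_min
            score = st.c_max * (1 - (distance - st.d_min) / (span + 1))
            best = max(best, score)
    return best


if __name__ == "__main__":
    terms = [
        SpecialTerm("in", "place name", "city", d_min=1, d_max=1, c_max=0.5),
        SpecialTerm("city", "place name", "city", d_min=1, d_max=2, c_max=0.9),
    ]
    print(cf_st("new jobs in Gramado".split(), 3, terms))
    print(cf_st("the city of São Paulo".split(), 3, terms))
```

Taking the best score over all matching special terms is one design choice among several; summing or averaging the scores would be equally plausible.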

11 Confidence Factor
CF_TS uses a traditional search engine to estimate the probability that a term is a geographic reference.

12 Confidence Factor
CF_CROSS analyzes the occurrence of cross references based on topological relationships (inside, contains, etc.).
CF_FMT evaluates the syntax used to describe the geographic references:
- Abbreviation of place names (R. de Janeiro, RJ);
- The use of uppercase in place names;
- Telephone format: ( 083)-999-3456;
- Postal code format: 58.104-867.
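Format checks of this kind are commonly written as regular expressions. The patterns below are a sketch fitted only to the examples that appear in the slides (58.104-867, 58109-000, and ( 083)-999-3456); they are assumptions, not the patterns used by the authors, and real Brazilian telephone formats vary more widely.

```python
import re

# Postal code: two digits, optional dot, three digits, hyphen, three digits.
CEP_PATTERN = re.compile(r"\b\d{2}\.?\d{3}-\d{3}\b")

# Telephone: area code in parentheses (optional leading zero and spaces),
# optional hyphen, then a 3 or 4 digit prefix and a 4 digit line number.
PHONE_PATTERN = re.compile(r"\(\s*0?\d{2}\s*\)-?\s*\d{3,4}-\d{4}")


def find_postal_codes(text: str) -> list[str]:
    """Return all substrings that look like postal codes."""
    return CEP_PATTERN.findall(text)


def find_phone_numbers(text: str) -> list[str]:
    """Return all substrings that look like telephone numbers."""
    return PHONE_PATTERN.findall(text)


def starts_uppercase(token: str) -> bool:
    """Check the capitalization cue: place names usually start with an uppercase letter."""
    return token[:1].isupper()


if __name__ == "__main__":
    print(find_postal_codes("ZIP: 58109-000 or 58.104-867"))
    print(find_phone_numbers("Call ( 083)-999-3456 today"))
    print(starts_uppercase("Gramado"), starts_uppercase("gramado"))
```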

13 Modeling of the Geographic Scope
- A document may be associated with one or more places;
- A geographic scope may include places that are not mentioned directly in the document (geographic expansion);
- Each place in the scope has an associated relevance value.

14 Geographic Dispersion Rate
[Figure: panels (a) and (b)]
- Another factor used in composing the geographic relevance value;
- Hypothesis: dispersed references may characterize regions that share common features (e.g. cultural, economic, social).

15 GeoSEn – an overview
Geographic Search Engine:
- Indexes a subset of the Brazilian Web;
- Deals with 6,291 places in Brazil, organized in a five-level hierarchy, from city to region:
  - Region: e.g. South
  - State: e.g. Rio Grande do Sul
  - MesoRegion: e.g. Metropolitana de Porto Alegre
  - MicroRegion: e.g. Gramado-Canela
  - Municipality: e.g. Gramado
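The five-level hierarchy lends itself to the geographic expansion mentioned earlier: a place found in a document can contribute its enclosing places to the document's scope. The sketch below uses only the example chain from this slide. The `PARENT` table, the function names, and the halving of relevance per level are assumptions made for illustration, since the presentation does not give the actual relevance formula.

```python
from typing import Optional

# Child -> parent links for the slide's example chain (municipality up to region).
PARENT: dict[str, Optional[str]] = {
    "Gramado": "Gramado-Canela",
    "Gramado-Canela": "Metropolitana de Porto Alegre",
    "Metropolitana de Porto Alegre": "Rio Grande do Sul",
    "Rio Grande do Sul": "South",
    "South": None,
}


def ancestors(place: str) -> list[str]:
    """Return the enclosing places of `place`, nearest first."""
    chain = []
    current = PARENT.get(place)
    while current is not None:
        chain.append(current)
        current = PARENT.get(current)
    return chain


def expand_scope(place: str, relevance: float = 1.0, decay: float = 0.5) -> dict[str, float]:
    """Build a scope holding the place and its ancestors.

    ASSUMPTION: relevance is multiplied by `decay` at each level up.
    """
    scope = {place: relevance}
    for level, ancestor in enumerate(ancestors(place), start=1):
        scope[ancestor] = relevance * decay ** level
    return scope


if __name__ == "__main__":
    for name, value in expand_scope("Gramado").items():
        print(f"{name}: {value}")
```

With a scope like this, a document that mentions only Gramado can still match a query about Rio Grande do Sul or the South region, at a lower relevance.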

16 GeoSEn - Architecture

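18 Query Example
Example of a query using a user-defined area of interest. It returns the places that lie within the specified geometry and are not contained in any other place that also lies within it.

    SELECT id
    FROM places plc1
    WHERE within(plc1.geometry, specified_geometry)
      AND NOT EXISTS (
        SELECT id
        FROM places plc2
        WHERE within(plc2.geometry, specified_geometry)
          -- skip the same row: assuming within(g, g) is true, every place would otherwise exclude itself
          AND plc2.id <> plc1.id
          AND within(plc1.geometry, plc2.geometry))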
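The selection logic of the query can be traced in a few lines of code. The sketch below replaces real geometries with axis-aligned bounding boxes and assumes a place is never compared against itself; the `Box` type, the `within` helper, and the sample places are hypothetical and stand in for the spatial database.

```python
from typing import NamedTuple


class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def within(inner: Box, outer: Box) -> bool:
    """True if `inner` lies entirely inside `outer` (boundaries included)."""
    return (
        outer.min_x <= inner.min_x
        and outer.min_y <= inner.min_y
        and inner.max_x <= outer.max_x
        and inner.max_y <= outer.max_y
    )


def outermost_places(places: dict[str, Box], area: Box) -> list[str]:
    """Return ids of places inside `area` that no other place inside `area` contains."""
    inside = {pid: box for pid, box in places.items() if within(box, area)}
    return sorted(
        pid
        for pid, box in inside.items()
        if not any(
            other_id != pid and within(box, other_box)
            for other_id, other_box in inside.items()
        )
    )


if __name__ == "__main__":
    places = {
        "region": Box(-5, -5, 20, 20),  # larger than the area of interest
        "state": Box(0, 0, 10, 10),
        "city_a": Box(1, 1, 2, 2),      # inside "state"
        "city_b": Box(12, 1, 13, 2),    # outside "state"
    }
    print(outermost_places(places, Box(-1, -1, 15, 15)))
```

In this example the state is returned instead of the city it contains, while the second city is returned on its own because its enclosing state is not part of the area of interest.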

19 Experiments
- Experiments using 66,531 indexed documents;
- 5 classes: .edu, .gov, blogs, tourism, arts;
- Detection of terms:
  - Documents from the Web, manually analyzed;
  - Documents with strong ambiguities, created for the test bed.

20 Conclusion
- We have presented a heuristic-based approach to implementing a GIR system.
- The techniques presented may be combined with other known techniques.
- Precomputed relevance values may be used to simplify the search process.

21 Future Work
- Retrieval of georeferenced images and videos;
- Recognition of other kinds of places;
- Integration of other data sources;
- Evaluation using large data set collections.

22 Thank you very much! Questions?

