AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES
Andrei Tabarcea, Ville Hautamäki, Pasi Fränti

1 AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES
Andrei Tabarcea, Ville Hautamäki, Pasi Fränti
University of Eastern Finland
http://cs.joensuu.fi/mopsi/

2 INTRODUCTION
Our goal is to find services and points of interest close to the user’s location.
We call this “location-based search”.
We try to find location information in web-pages.

3 AD-HOC GEOREFERENCING
The problem is how to extract and validate location data from free-form text.
Most web pages don’t contain explicit georeferencing (e.g. geo-tags).
A postal address is the most common location data found.
Our goal is to give geographical coordinates to services mentioned in web-pages.
We call this method ad-hoc georeferencing.
[Screenshot: pages of Pasi Fränti]

4 MOPSI LOCATION-BASED SEARCH
MOPSI = Mobiilit paikkatietosovellukset ja Internet (Mobile location-based applications and Internet).
Available at http://cs.joensuu.fi/mopsi/
Main focus areas:
– Mobile search engine
– How to collect and present location-based data
– Other location-related topics

5 MOBILE SEARCH ENGINE
How can you find services:
– Asking directions
– Advertisements
– Wandering around
– Yellow pages
– Internet
A query consists of:
– Keyword
– Location

6 MOBILE SEARCH ENGINE STRUCTURE
The search engine consists of:
– User interface (mobile application and web user interface)
– Core server software
– Geocoded street-name database
[Diagram: the mobile application and the web user interface are connected to the core server software (labels: keyword, coordinates, search results); the core server software is connected to the geocoded street-name database (labels: address, coordinates).]

7 CORE SERVER SOFTWARE
[Diagram of the core server software. Components: page parser, address and description detector, address validator, relevant municipalities detector, georeferencing module, geocoded database. Data labels: keyword, municipalities query, municipalities list, result links, word list, addresses, coordinates, results list, sorted results list.]

8 CORE SERVER SOFTWARE
[Diagram repeated from slide 7.]

9 OUR SOLUTION
A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer.
We compare this approach against:
– a heuristic method that doesn’t require a gazetteer and assumes that the street name has a certain structure
– a method that also uses data structures created from the gazetteer, in the form of street-name arrays

StreetNameDetection(words) {
    i = 0;
    WHILE i < count(words) DO {
        IF words[i] is a street name THEN {
            Search for street number, postal code and other address elements near words[i].
            IF address elements found THEN {
                Create address block
                Get coordinates using Geocoded Database
                IF coordinates found THEN
                    Add address block to address list
            }
        }
        i = i + 1;
    }
}

10 STREET-ADDRESS DETECTION
We use a rule-based pattern matching algorithm.
The detection of street names is the starting point of the algorithm.
An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names).
Address-block candidates are validated using the gazetteer.
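The detection steps above can be sketched in Python. This is only a minimal illustration, not the MOPSI implementation: the names `STREET_NAMES`, `MUNICIPALITIES`, `GEOCODED`, `detect_addresses` and the `window` parameter are hypothetical, the gazetteer is a toy dictionary with an approximate, illustrative coordinate pair, and telephone numbers are left out for brevity.

```python
import re

# Hypothetical toy data standing in for the gazetteer (not real MOPSI data).
STREET_NAMES = {"länsikatu", "kauppakatu"}
MUNICIPALITIES = {"joensuu"}
# (street, number, municipality) -> illustrative, approximate coordinates
GEOCODED = {("länsikatu", "15", "joensuu"): (62.60, 29.74)}

NUMBER_RE = re.compile(r"^\d{1,4}[a-z]?$")  # street number such as 15 or 15b
POSTAL_RE = re.compile(r"^\d{5}$")          # five-digit postal code


def detect_addresses(words, window=4):
    """Return the address blocks that could be validated against GEOCODED."""
    tokens = [w.strip(".,;:()").lower() for w in words]
    found = []
    for i, tok in enumerate(tokens):
        if tok not in STREET_NAMES:
            continue  # street-name detection is the starting point
        block = {"street": tok, "number": None,
                 "postal_code": None, "municipality": None}
        # Look for the other address elements in the words that follow.
        for near in tokens[i + 1:i + 1 + window]:
            if block["postal_code"] is None and POSTAL_RE.match(near):
                block["postal_code"] = near
            elif block["number"] is None and NUMBER_RE.match(near):
                block["number"] = near
            elif block["municipality"] is None and near in MUNICIPALITIES:
                block["municipality"] = near
        # Validate the candidate: keep it only if coordinates are found.
        key = (block["street"], block["number"], block["municipality"])
        coords = GEOCODED.get(key)
        if coords is not None:
            block["coordinates"] = coords
            found.append(block)
    return found


blocks = detect_addresses("Visit us at Länsikatu 15, 80110 Joensuu.".split())
print(blocks)
```

A street name with no nearby number or municipality produces a candidate that fails validation, so it is silently dropped, which mirrors the final IF in the pseudocode of slide 9.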

11 STREET-NAME DETECTION
Street-name detection is the starting point of the address detection.
Heuristic and brute-force methods are compared against our prefix tree solution.
Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap.

Gazetteer statistics                  Finland   Singapore
Number of municipalities                  410           1
Total number of street names           92 572         573
Number of streets per municipality        474         573
Average street name length               11.6         6.1
Total size (MB)                         2 982        0.18

12 PREFIX TREES
Invented by Fredkin (1960).
The prefix tree (or trie) is a fast ordered tree data structure used for retrieval.
The root is associated with the empty string.
All the descendants of a node share the string associated with that node as a common prefix.
Some nodes can have associated values (usually they mark the end of a word).
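The properties listed above can be seen in a small, generic trie. The sketch below is a textbook-style illustration in Python, not code from the MOPSI system; the class and method names (`TrieNode`, `Trie`, `insert`, `contains`, `starts_with`) are chosen here for illustration only.

```python
class TrieNode:
    """One node of the tree; the path from the root spells its prefix."""

    def __init__(self):
        self.children = {}    # next character -> child node
        self.is_word = False  # True if a stored word ends at this node


class Trie:
    def __init__(self):
        self.root = TrieNode()  # the root corresponds to the empty string

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def _walk(self, text):
        """Follow text from the root; return the node reached or None."""
        node = self.root
        for ch in text:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        node = self._walk(word)
        return node is not None and node.is_word

    def starts_with(self, prefix):
        return self._walk(prefix) is not None


trie = Trie()
for name in ["kauppakatu", "kauppakuja", "koulukatu"]:
    trie.insert(name)
print(trie.contains("kauppakatu"), trie.contains("kauppa"), trie.starts_with("kauppa"))
```

A lookup costs time proportional to the length of the word, independent of how many words are stored, which is why the structure suits checking every word of a page.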

13 STREET-NAME PREFIX TREES
Our solution is to detect street names using prefix trees constructed from the gazetteer.
A street-name prefix tree is built for each municipality used in the search.
The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities.

Prefix tree statistics                Finland   Singapore
Maximum tree depth                         34          14
Average tree depth                       12.7         7.4
Average tree width                        105         167
Average number of nodes per tree         2338        2335
Total size (MB)                          74.4        0.18
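One way to picture per-municipality prefix trees is the following Python sketch. It is an assumption-laden illustration rather than the authors' code: the functions `build_tree`, `build_trees`, `longest_match` and `find_street_names` are hypothetical, the trees are plain nested dictionaries, the input is assumed to be already split into clean words, and the longest-match rule for multi-word street names is a design choice made here.

```python
END = ""  # marker key: a street name ends at this node (never a real character)


def build_tree(street_names):
    """Build one prefix tree (nested dicts) from a list of street names."""
    root = {}
    for name in street_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def build_trees(gazetteer):
    """Build one tree per municipality from {municipality: [street names]}."""
    return {m: build_tree(names) for m, names in gazetteer.items()}


def longest_match(tree, words, start):
    """Number of words in the longest street name starting at words[start]."""
    node, best = tree, 0
    for offset, word in enumerate(words[start:]):
        text = word.lower() if offset == 0 else " " + word.lower()
        for ch in text:
            node = node.get(ch)
            if node is None:
                return best
        if END in node:
            best = offset + 1
    return best


def find_street_names(trees, municipalities, words):
    """Scan words using only the trees of the relevant municipalities."""
    hits = []
    for m in municipalities:
        tree = trees.get(m)
        if tree is None:
            continue
        i = 0
        while i < len(words):
            n = longest_match(tree, words, i)
            if n:
                hits.append((m, " ".join(words[i:i + n]), i))
                i += n
            else:
                i += 1
    return hits


# Toy gazetteer for illustration only.
trees = build_trees({
    "joensuu": ["Länsikatu", "Kauppakatu"],
    "singapore": ["Orchard Road", "Orchard Boulevard"],
})
print(find_street_names(trees, ["singapore"], "Shop at 2 Orchard Road near Orchard".split()))
```

Restricting the scan to the municipalities near the user keeps the number of trees small, which is the point made on the slide above.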

14 OTHER SOLUTIONS
Heuristic solution
– Relies on regular expression matching
– Street names usually have similar endings or similar prefixes
– A gazetteer is not needed (except for validation)
– Can be fast but not precise
Brute-force solution
– Every word is checked to see whether it exists in the gazetteer
– An optimized solution is used (the gazetteer is locally limited and preloaded into arrays)
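The two baseline methods can be contrasted with a short Python sketch. The slides do not give the actual patterns or array layout, so everything here is an assumption: `SUFFIXES` lists a few common Finnish street-name endings, and `heuristic_candidates` and `brute_force_candidates` are hypothetical helper names.

```python
import re

# Assumed common Finnish street-name endings (street, road, alley, path).
SUFFIXES = ("katu", "tie", "kuja", "polku")
HEURISTIC_RE = re.compile(r"^\w+(?:%s)$" % "|".join(SUFFIXES), re.IGNORECASE)


def heuristic_candidates(words):
    """Heuristic method: keep words that merely look like street names."""
    return [w for w in words if HEURISTIC_RE.match(w)]


def brute_force_candidates(words, street_array):
    """Brute-force method: compare every word with every array entry."""
    lowered = [s.lower() for s in street_array]
    return [w for w in words if any(w.lower() == s for s in lowered)]


words = ["Ravintola", "Kauppakatu", "tie", "Torikatu"]
local_streets = ["Kauppakatu", "Länsikatu"]  # toy, locally limited array
print(heuristic_candidates(words))
print(brute_force_candidates(words, local_streets))
```

The heuristic accepts "Torikatu" without knowing whether such a street exists locally, while the brute-force scan accepts only gazetteer entries at the cost of one pass over the array per word.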

15 EXPERIMENTS
10 urban locations (blue) and 10 rural locations (orange) were used for testing.
Testing was done using the MOPSI prototype for Finland and Singapore.
Both commercial and non-commercial keywords were used:
– Commercial: hotel, restaurant, pizzeria, cinema, car repair
– Non-commercial: hospital, museum, police station, swimming hall, church

16 RESULTS
Average processing times were calculated for every solution.
The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution, and 10 times faster than the brute-force solution.
The resulting solution improves the speed and quality of web-page georeferencing.

Method         Time (s)   Standard deviation   Validated addresses
Rural municipalities
Brute-force        3.01                 2.43                   3.7
Heuristic          1.54                 1.15                   2.5
Prefix tree        0.51                 0.35                   3.7
Urban municipalities
Brute-force       10.18                 7.11                  19.8
Heuristic          1.70                 1.24                  18.6
Prefix tree        0.87                 0.85                  19.8
Total
Brute-force        6.59                 6.40                  11.8
Heuristic          1.62                 1.20                  10.5
Prefix tree        0.69                 0.68                  11.8
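The headline figures can be checked against the "Total" rows of the table with a few lines of Python. How the authors defined "faster" and "more accurate" is not stated on the slide, so the formulas below are assumptions: speed is compared through mean processing time, and accuracy through the average number of validated addresses.

```python
# "Total" rows of the results table: mean time in seconds, validated addresses.
total_time = {"brute_force": 6.59, "heuristic": 1.62, "prefix_tree": 0.69}
total_validated = {"brute_force": 11.8, "heuristic": 10.5, "prefix_tree": 11.8}

# Reduction in mean time relative to the heuristic solution, in percent.
time_saving = 100 * (total_time["heuristic"] - total_time["prefix_tree"]) / total_time["heuristic"]

# Ratio of mean times relative to the brute-force solution.
speedup = total_time["brute_force"] / total_time["prefix_tree"]

# Relative gain in validated addresses over the heuristic solution, in percent.
more_validated = (100 * (total_validated["prefix_tree"] - total_validated["heuristic"])
                  / total_validated["heuristic"])

print(f"time saved vs heuristic: {time_saving:.1f}%")
print(f"speed-up vs brute force: {speedup:.2f}x")
print(f"more validated addresses vs heuristic: {more_validated:.1f}%")
```

Under these assumptions the time saving comes out at about 57%, the speed-up over brute force at roughly 9.6 (rounded to 10 on the slide), and the gain in validated addresses at roughly 12%, in line with the "10% more accurate" statement.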

17 OPEN PROBLEMS
Support approximate matching to avoid problems caused by misspellings.
Improve the flexibility of the address detection algorithm.
Implement a way to learn rules automatically from a hand-tagged example corpus.

18 Thank you!
http://cs.joensuu.fi/mopsi

