Published by Todd York. Modified over 9 years ago.
1
Retrieving Location-based Data on the Web
Andrei Tabarcea, 14.02.2011
2
Introduction
- The goal is to find services and points of interest close to the user’s location
- We call this “location-based search”
- We try to find location information in web-pages
3
MOPSI Search
4
MOPSI Search Results
- Locally Managed Database
- Users’ Collection
- Open Web Searches
- Combination of search results
5
Location Information in Webpages
- Site hosting information (owner address, server address etc.)
- HTML tags (geo-tags, address-tags, vcards for Google Maps etc.)
- Addresses, postal codes, phone numbers
- Well-known places
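As an illustration of the "HTML tags" source, the sketch below reads coordinates from `geo.position` and `ICBM` meta tags, two common geo-tagging conventions. It is only a minimal example: the slides do not say which tags the system actually reads, and the names `GeoMetaParser` and `extract_geo_tags` are hypothetical.

```python
from html.parser import HTMLParser


class GeoMetaParser(HTMLParser):
    """Collects coordinates from <meta name="geo.position"> and <meta name="ICBM">."""

    def __init__(self):
        super().__init__()
        self.coordinates = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name in ("geo.position", "icbm"):
            # geo.position is written "lat;lon", ICBM is written "lat, lon".
            parts = content.replace(";", ",").split(",")
            if len(parts) == 2:
                try:
                    self.coordinates.append((float(parts[0]), float(parts[1])))
                except ValueError:
                    pass  # ignore malformed coordinate values


def extract_geo_tags(html):
    """Return a list of (lat, lon) tuples found in geo meta tags."""
    parser = GeoMetaParser()
    parser.feed(html)
    return parser.coordinates
```

Pages without such tags return an empty list, which is why the remaining sources (addresses, postal codes, phone numbers) matter.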
6
Main Challenges
- Find location information in webpages
- Find relevant information related to the found location information
7
Ad-Hoc Georeferencing
- The problem is how to extract and validate location data from semi-structured text
- Postal address is the most common location data found
- Our goal is to give geographical coordinates to services mentioned in web-pages
- We call this method ad-hoc georeferencing
[Figure: Pages of Pasi Fränti VS.]
8
Extracting the Information
For each link:
- Extract plain text from html-file
- Detect street names by using gazetteer
- Extract additional service information
- Gather results as list
For result list:
- Evaluate relevance
- Arrange by distance
- Purge overlapping results
- Show results
- (Optionally) Save results
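The steps above can be sketched roughly as follows. This is a simplified illustration, not the MOPSI implementation: the gazetteer is a hypothetical in-memory dictionary with rough, illustrative coordinates, relevance evaluation is left out, and "overlapping" is assumed to mean the same street found on several pages.

```python
import math
import re
from html.parser import HTMLParser

# Hypothetical gazetteer: lower-cased street name -> (lat, lon).
# The coordinates are rough, illustrative values.
GAZETTEER = {
    "koskikatu": (62.60, 29.76),
    "freesenkatu": (60.17, 24.92),
}


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)


def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def search(pages, user_location):
    """pages maps url -> html; returns results ordered by distance."""
    results = []
    for url, html in pages.items():
        text = html_to_text(html)
        # Detect street names by gazetteer lookup on each word.
        for word in re.findall(r"\w+", text.lower()):
            if word in GAZETTEER:
                results.append({"url": url, "street": word,
                                "coords": GAZETTEER[word]})
    # Arrange by distance from the user.
    results.sort(key=lambda r: haversine_km(user_location, r["coords"]))
    # Purge overlapping results: keep the first hit per street.
    seen, unique = set(), []
    for r in results:
        if r["street"] not in seen:
            seen.add(r["street"])
            unique.append(r)
    return unique
```

A call such as `search({"a": "<p>Koskikatu 17</p>"}, (62.60, 29.76))` returns one result for page `a`; a real system would also need the relevance step listed on the slide.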
9
Problems
- How to evaluate relevance?
- Mixed keyword meanings
- No relation between keywords and addresses
10
Mobile Search Engine
Search Engine consists of:
- User interface
- Core server software
- Geocoded street-name database
[Diagram. Components: Geocoded street-name database, Core server software, Mobile application, Web user interface. Flow labels: Coordinates, Address (database); Keyword, Coordinates, Search results (for each of the two user interfaces)]
11
Core Server software
[Diagram. Modules: Georeferencing module, Geocoded database, Address and description detector, Address validator, Relevant municipalities detector, Page parser. Data labels: Word list; Results list; Sorted results list; Keyword; Municipalities query; Result links; Coordinates; Municipalities list; Addresses; Coordinates; Keyword, Address, Coordinates]
12
Street-address Detection
- We use a rule-based pattern matching algorithm
- The detection of street-names is the starting point of the algorithm
- An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names)
- Address block candidates are validated using the gazetteer
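A minimal sketch of this idea, assuming Finnish-style addresses such as "Koskikatu 17 80100 JOENSUU": a regular expression builds address-block candidates starting from a street name, and a candidate is kept only if its street and municipality appear in the gazetteer. The pattern, the street suffixes and the `STREETS` set are illustrative assumptions, not the rules used in the actual system.

```python
import re

# Hypothetical gazetteer of (street, municipality) pairs, lower-cased.
STREETS = {("koskikatu", "joensuu"), ("freesenkatu", "helsinki")}

# Candidate pattern: street name, house number, optional postal code,
# optional municipality. The suffix list is an assumption for illustration.
ADDRESS_RE = re.compile(
    r"\b(?P<street>[A-ZÅÄÖ][a-zåäö]+(?:katu|tie|kuja|polku))\s+"
    r"(?P<number>\d+)"
    r"(?:,?\s+(?P<postal>\d{5}))?"
    r"(?:,?\s+(?P<municipality>[A-ZÅÄÖ][A-Za-zÅÄÖåäö]+))?"
)


def detect_addresses(text):
    """Return address-block candidates that pass gazetteer validation."""
    validated = []
    for m in ADDRESS_RE.finditer(text):
        street = m.group("street")
        municipality = m.group("municipality")
        # Validate the candidate against the gazetteer.
        if municipality and (street.lower(), municipality.lower()) in STREETS:
            validated.append({
                "street": street,
                "number": m.group("number"),
                "postal": m.group("postal"),
                "municipality": municipality,
            })
    return validated
```

Candidates that look like addresses but are unknown to the gazetteer are dropped, which is the role the validation step plays on the slide.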
13
Title Detection
- Title detection (or company detection) is a Named Entity Recognition problem
- We designed a 2-step system to detect titles associated with addresses:
  - Step 1: Fast dictionary match
  - Step 2: Use a classifier to detect the title
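The two-step idea can be sketched as below. The dictionary contents and the scoring function are assumptions: in particular, `score_candidate` is a hand-written stand-in for the trained classifier mentioned on the slide, whose features and model are not described here.

```python
# Hypothetical dictionary of known titles, lower-cased.
KNOWN_TITLES = {"joen pizza special"}


def dictionary_match(candidates):
    """Step 1: return the first candidate found in the dictionary, if any."""
    for candidate in candidates:
        if candidate.strip().lower() in KNOWN_TITLES:
            return candidate.strip()
    return None


def score_candidate(line):
    """Stand-in for a trained classifier: a hand-written heuristic score."""
    words = line.split()
    if not words:
        return 0.0
    score = 0.0
    if len(words) <= 5:
        score += 0.5  # titles tend to be short
    if all(w[0].isupper() for w in words):
        score += 0.4  # titles tend to be capitalised
    if ":" in line or any(ch.isdigit() for ch in line):
        score -= 0.5  # labelled fields and numbers are rarely titles
    return score


def detect_title(candidates, threshold=0.5):
    """Step 1 first; fall back to step 2 only when the dictionary misses."""
    hit = dictionary_match(candidates)
    if hit is not None:
        return hit
    best = max(candidates, key=score_candidate, default=None)
    if best is not None and score_candidate(best) >= threshold:
        return best.strip()
    return None
```

Running the cheap dictionary lookup first means the slower classification step is needed only for candidates the dictionary does not know.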
14
Title Extractor
Usually, the text before the address holds relevant information
[Figure labels: "address", "words before the address"]
Joen Pizza Special
Y-tunnus: 2129577-6
Käyntiosoite: Koskikatu 17 80100 JOENSUU
Postiosoite Koskikatu 17 80100 JOENSUU
Puhelin: 013-220246
Virallinen toimiala: Kahvila-ravintolat
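A small sketch of collecting the words before the address as title candidates. The function name `words_before` and the window size `max_words` are hypothetical; the slides do not state how much preceding text the extractor inspects.

```python
def words_before(text, address, max_words=10):
    """Return up to max_words words that precede the first occurrence of address."""
    idx = text.find(address)
    if idx == -1:
        return []  # address not present in the text
    return text[:idx].split()[-max_words:]
```

For the sample text above, `words_before(text, "Koskikatu 17")` yields the words from "Joen" through "Käyntiosoite:", so a later step still has to separate the title from labelled fields such as the business ID.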
15
The Problem
- Results for keyword “kahvila”, address: “Freesenkatu 1, Helsinki”
- No title
16
System Architecture
[Diagram. Components: HTML parser, Dictionary matching, Title extractor, Classifier, Evaluator, Statistics. Data: HTML pages, Parsed HTML, Title candidate, TITLE, Tagged and hand-checked data, Training data, Evaluation data, Dataset Collection. Branch labels: Match, No match]
17
Parsing HTML pages
- Current solution extracts text from HTML pages
- We do not yet take advantage of the fact that the data comes from web pages
- Proposed future solution:
  - Visual segmentation of web pages
  - Detection of the address block
  - Nearest-neighbor search considering text and visual characteristics
[Figure: example page content]
Joen Pizza Special
Y-tunnus 2129577-6
Käyntiosoite Koskikatu 17 80100 JOENSUU
Postiosoite Koskikatu 17 80100 JOENSUU
Puhelin: 013-220246
Virallinen toimiala Kahvila-ravintolat
18
Questions
Thank you