AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES Andrei Tabarcea, Ville Hautamäki, Pasi Fränti

Presentation transcript:

AD-HOC GEOREFERENCING OF WEB-PAGES USING STREET-NAME PREFIX TREES
Andrei Tabarcea, Ville Hautamäki, Pasi Fränti
University of Eastern Finland

INTRODUCTION
- Our goal is to find services and points of interest close to the user’s location.
- We call this “location-based search”.
- We try to find location information in web pages.

AD-HOC GEOREFERENCING
- The problem is how to extract and validate location data from free-form text.
- Most web pages don’t contain explicit georeferencing (e.g. geo-tags).
- A postal address is the most common kind of location data found.
- Our goal is to give geographical coordinates to services mentioned in web pages.
- We call this method ad-hoc georeferencing.
[Figure: pages of Pasi Fränti]

MOPSI LOCATION-BASED SEARCH
- MOPSI = Mobiilit paikkatietosovellukset ja Internet (Mobile location-based applications and Internet)
- Available on [URL missing]
- Main focus areas:
  - Mobile search engine
  - How to collect and present location-based data
  - Other location-related topics

MOBILE SEARCH ENGINE
- How can you find services:
  - Asking directions
  - Advertisements
  - Wandering around
  - Yellow pages
  - Internet
- Query consists of:
  - Keyword
  - Location

MOBILE SEARCH ENGINE STRUCTURE
The search engine consists of:
- User interface
- Core server software
- Geocoded street-name database
[Diagram. Components: mobile application, web user interface, core server software, geocoded street-name database. Data exchanged: keyword, coordinates, address, search results.]

CORE SERVER SOFTWARE
[Diagram. Components: page parser, relevant municipalities detector, georeferencing module, address and description detector, address validator, geocoded database. Data exchanged: keyword; coordinates; municipalities query; municipalities list; result links; word list; addresses; results list (keyword, address, coordinates); sorted results list.]

OUR SOLUTION
- A rule-based solution that detects address-based locations using a gazetteer and street-name prefix trees created from the gazetteer.
- We compare this approach against:
  - a method that doesn’t require a gazetteer (a heuristic method that assumes that the street name has a certain structure)
  - a method that also uses data structures created from the gazetteer, in the form of street-name arrays

StreetNameDetection(words) {
    i = 0
    WHILE i < count(words) DO {
        IF words[i] is a street name THEN {
            Search for street number, postal code and other address elements near words[i]
            IF address elements found THEN {
                Create address block
                Get coordinates using Geocoded Database
                IF coordinates found THEN
                    Add address block to address list
            }
        }
        i = i + 1
    }
}
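
The detection loop above could be sketched in Python roughly as follows. This is a minimal illustration, not the MOPSI implementation: the toy gazetteer `GEOCODED_DB`, its made-up coordinates, the five-digit postal-code rule, the `window` parameter and the restriction to single-word street names are all assumptions introduced for the example.

```python
import re

# Hypothetical toy gazetteer: (street name, house number) -> (lat, lon).
# The coordinates are made-up illustration values, not real gazetteer data.
GEOCODED_DB = {
    ("kauppakatu", "29"): (62.6010, 29.7636),
    ("torikatu", "4"): (62.6000, 29.7600),
}
STREET_NAMES = {street for street, _ in GEOCODED_DB}

NUMBER = re.compile(r"\d+")
POSTAL_CODE = re.compile(r"\d{5}")  # assumption: five-digit postal codes


def tokenize(text):
    """Split free-form text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def detect_addresses(words, window=4):
    """Scan the words, build address blocks around street names, keep validated ones."""
    addresses = []
    i = 0
    while i < len(words):
        street = words[i].lower()
        if street in STREET_NAMES:
            # Look for other address elements in the words right after the street name.
            nearby = words[i + 1 : i + 1 + window]
            postal = next((w for w in nearby if POSTAL_CODE.fullmatch(w)), None)
            number = next(
                (w for w in nearby
                 if NUMBER.fullmatch(w) and not POSTAL_CODE.fullmatch(w)),
                None,
            )
            if number is not None:
                # Validate the candidate block against the geocoded database.
                coords = GEOCODED_DB.get((street, number))
                if coords is not None:
                    addresses.append({
                        "street": words[i],
                        "number": number,
                        "postal_code": postal,
                        "coordinates": coords,
                    })
        i += 1
    return addresses


found = detect_addresses(tokenize("Pizzeria Roma, Kauppakatu 29, 80100 Joensuu"))
```

Only candidates that the database can geocode end up in the result, which mirrors the "IF coordinates found" step of the pseudocode.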

STREET-ADDRESS DETECTION
- We use a rule-based pattern-matching algorithm.
- The detection of street names is the starting point of the algorithm.
- An address-block candidate is constructed by detecting typical address elements (street names, numbers, postal codes, telephone numbers and municipal names).
- Address-block candidates are validated using the gazetteer.

STREET-NAME DETECTION
- Street-name detection is the starting point of the address detection.
- Heuristic and brute-force methods are compared against our prefix-tree solution.
- Our application uses a commercial gazetteer for Finland and, for Singapore, street data from the free map project OpenStreetMap.

Gazetteer statistics                  Finland   Singapore
Number of municipalities              410       1
Total number of street names
Number of streets per municipality
Average street name length
Total size (MB)

PREFIX TREES
- Invented by Fredkin (1960).
- The prefix tree (or trie) is a fast ordered tree data structure used for retrieval.
- The root is associated with the empty string.
- All descendants of a node share the string associated with that node as a common prefix.
- Some nodes can have associated values (usually they mark the end of a word).
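
The properties listed above can be illustrated with a minimal trie sketch. The class and method names (`TrieNode`, `PrefixTree`, `insert`, `contains`, `has_prefix`) are chosen for this example and do not come from the slides.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # maps one character to the child node
        self.value = None   # set only on nodes that end a stored word


class PrefixTree:
    def __init__(self):
        self.root = TrieNode()  # the root corresponds to the empty string

    def insert(self, word, value=True):
        """Store a word; nodes for shared prefixes are reused."""
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.value = value

    def _find(self, prefix):
        """Return the node reached by following prefix, or None."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def contains(self, word):
        """True only if the exact word was inserted."""
        node = self._find(word)
        return node is not None and node.value is not None

    def has_prefix(self, prefix):
        """True if at least one stored word starts with prefix."""
        return self._find(prefix) is not None


tree = PrefixTree()
tree.insert("torikatu")
tree.insert("torikuja")
```

A lookup costs one step per character of the query, regardless of how many words are stored, which is why the structure suits repeated checks of every word on a page.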

STREET-NAME PREFIX TREES
- Our solution is to detect street names using prefix trees constructed from the gazetteer.
- A street-name prefix tree is built for each municipality used in the search.
- The user’s location and area of interest are known, therefore the prefix trees can be limited to the relevant municipalities.

Prefix tree statistics                Finland   Singapore
Maximum tree depth                    34        14
Average tree depth
Average tree width
Average number of nodes per tree
Total size (MB)
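
One way to picture per-municipality prefix trees is the sketch below, which builds one nested-dictionary tree per municipality and searches only the trees of the municipalities relevant to the query. The sample gazetteer, the function names and the longest-match handling of multi-word street names are assumptions made for illustration; the slides do not specify these details.

```python
END = "<end>"  # marker key; longer than one character, so it cannot clash with a letter

# Hypothetical sample gazetteer: municipality -> street names.
GAZETTEER = {
    "Joensuu": ["Kauppakatu", "Torikatu", "Länsikatu"],
    "Singapore": ["Orchard Road", "Orchard Boulevard"],
}


def build_tree(street_names):
    """Build a character-level prefix tree from a list of street names."""
    root = {}
    for name in street_names:
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = name  # value stored at the end of a complete street name
    return root


def build_trees(gazetteer):
    """One prefix tree per municipality."""
    return {m: build_tree(names) for m, names in gazetteer.items()}


def longest_street_at(tree, words, start):
    """Return (street name, words consumed) for the longest match at words[start], or None."""
    node = tree
    best = None
    for offset, word in enumerate(words[start:]):
        text = word.lower() if offset == 0 else " " + word.lower()
        for ch in text:
            node = node.get(ch)
            if node is None:
                return best
        if END in node:
            best = (node[END], offset + 1)
    return best


def detect_street_names(trees, municipalities, words):
    """Search only the trees of the municipalities relevant to the query."""
    found = []
    for municipality in municipalities:
        tree = trees[municipality]
        i = 0
        while i < len(words):
            match = longest_street_at(tree, words, i)
            if match:
                name, consumed = match
                found.append((municipality, name))
                i += consumed
            else:
                i += 1
    return found


trees = build_trees(GAZETTEER)
hits = detect_street_names(trees, ["Singapore"], "Visit Orchard Road today".split())
```

Restricting the search to the municipalities near the user keeps each tree small, which is the point made in the last bullet above.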

OTHER SOLUTIONS
- Heuristic solution
  - Relies on regular-expression matching
  - Street names usually have similar endings or similar prefixes
  - A gazetteer is not needed (except for validation)
  - Can be fast but not precise
- Brute-force solution
  - Every word is checked against the gazetteer
  - An optimized version is used (the gazetteer is limited to the local area and preloaded into arrays)
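
The two baselines can be contrasted with a short sketch. The suffix list (a few common Finnish street-name endings), the sample street array and the function names are assumptions for illustration only; the actual rules used in the experiments are not given in the slides.

```python
import re

# Assumption: a few common Finnish street-name endings, for illustration only.
STREET_SUFFIX = re.compile(r"\w+(?:katu|tie|kuja|polku)", re.IGNORECASE)

# Hypothetical locally limited gazetteer, preloaded into an array (lower-cased).
LOCAL_STREETS = ["kauppakatu", "torikatu", "rantapuisto"]


def heuristic_is_street(word):
    """Heuristic: the word looks like a street name if it has a typical ending."""
    return STREET_SUFFIX.fullmatch(word) is not None


def brute_force_is_street(word, street_array):
    """Brute force: compare the word with every street name in the array."""
    target = word.lower()
    for name in street_array:
        if name == target:
            return True
    return False


for word in ["Kauppakatu", "Esimerkkitie", "Rantapuisto"]:
    print(word, heuristic_is_street(word), brute_force_is_street(word, LOCAL_STREETS))
```

In this toy data the heuristic accepts "Esimerkkitie", which is not in the array, and rejects "Rantapuisto", which is, while the brute-force check is exact but scans the whole array for every word.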

EXPERIMENTS
- 10 urban locations (blue) and 10 rural locations (orange) were used for testing.
- Testing was done using the MOPSI prototype for Finland and Singapore.
- Both commercial and non-commercial keywords were used:
  - Commercial: hotel, restaurant, pizzeria, cinema, car repair
  - Non-commercial: hospital, museum, police station, swimming hall, church

RESULTS
- Average processing times for every solution were calculated.
- The prefix tree solution proved to be on average 57% faster and 10% more accurate than the heuristic solution, and 10 times faster than the brute-force solution.
- The resulting solution improves the speed and quality of web-page georeferencing.

Method         Time (s)   Standard deviation   Validated addresses
Rural municipalities
Brute-Force    3.01       2.43                 3.7
Heuristic      1.54       1.15                 2.5
Prefix Tree    0.51       0.35                 3.7
Urban municipalities
Brute-Force    10.18      7.11                 19.8
Heuristic      1.70       1.24                 18.6
Prefix Tree    0.87       0.85                 19.8
Total
Brute-Force    6.59       6.40                 11.8
Heuristic      1.62       1.20                 10.5
Prefix Tree    0.69       0.68                 11.8
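
The headline figures can be cross-checked against the "Total" rows of the table with a few lines of arithmetic. The sketch assumes that "faster" means the relative reduction in average processing time and that accuracy is compared through the number of validated addresses; the slides do not state either definition explicitly.

```python
# (average time in seconds, validated addresses) from the "Total" rows.
TOTAL = {
    "Brute-Force": (6.59, 11.8),
    "Heuristic": (1.62, 10.5),
    "Prefix Tree": (0.69, 11.8),
}

prefix_time, prefix_valid = TOTAL["Prefix Tree"]
heur_time, heur_valid = TOTAL["Heuristic"]
brute_time, _ = TOTAL["Brute-Force"]

# Relative reduction in processing time compared with the heuristic method.
time_reduction_vs_heuristic = 1 - prefix_time / heur_time

# Speed-up factor compared with the brute-force method.
speedup_vs_brute_force = brute_time / prefix_time

# Relative increase in validated addresses compared with the heuristic method.
extra_validated_vs_heuristic = prefix_valid / heur_valid - 1

print(f"{time_reduction_vs_heuristic:.1%}")
print(f"{speedup_vs_brute_force:.2f}x")
print(f"{extra_validated_vs_heuristic:.1%}")
```

Under these assumptions the time reduction comes out at about 57% and the speed-up over brute force at a little under 10 times, matching the stated figures; the gain in validated addresses is about 12%, in the same range as the quoted 10%.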

OPEN PROBLEMS
- Support approximate matching to avoid problems caused by misspellings.
- Improve the flexibility of the address detection algorithm.
- Implement a way to learn rules automatically from a hand-tagged example corpus.

Thank you!