© author(s) of these slides including research results from the KOM research network and TU Darmstadt; otherwise it is specified at the respective slide.

Extraction of Address Data from Unstructured Text using Free Knowledge Resources
Sebastian Schmidt, M.Sc.
06-September-2013
Prof. Dr.-Ing. Ralf Steinmetz
KOM - Multimedia Communications Lab
i-know_Address Extraction_SebS___2013.08.20.pptx
Image sources: http://upload.wikimedia.org/wikipedia/en/7/7f/World_Map_flat_Mercator.png, http://www.frdc.at/hp_frdc_pictures/frdc_dart_pfeil.jpg.gif

Outline
1. Motivation
   - Why Business Address Data?
   - Application Scenarios
2. Structure of German Addresses
3. Solution
4. Evaluation
   - Methodology
   - Results
   - Challenges
5. Related Work
6. Adaptation
7. Conclusion and Future Work
Image source: http://www.yourshiningredthread.com/wp-content/uploads/2012/08/WavyThreadImage.jpg

1. Motivation: General
- Text documents are everywhere around us (e.g. 189 million Web sites [1])
- All of them contain lots of valuable information
- The Semantic Web is a vision of annotating information with its meaning
- Only 12% of Web sites make use of any semantic annotation such as RDFa, microformats or Microdata [Mühleisen12]
- Most content remains incomprehensible to machines
- Tools are required that automatically identify certain information in text
[1] http://news.netcraft.com/archives/2013/08/09/august-2013-web-server-survey.html
Image source: http://www.netresearch.de/blog/wp-content/uploads/2009/04/semantic_web_day.jpg

1. Motivation: Business Address Data
- Addresses consist of different attributes
  - Extracted data is only valuable if all attributes have been identified correctly
  - Sequentiality can be exploited
- Business addresses have a high volatility
  - Need to track them automatically
- Business address data is of interest in various domains

1. Motivation: Application Scenario
- Semantic Web!
- Web sites aggregating existing content
  - Often relying on addresses given on Web sites
  - E.g. restaurant recommendations, job search engines, product search engines
- Address repositories
  - Can be created automatically
- Location-based services
  - Can gain from population of geographical repositories with business information
Image source: http://www.thedigitalbus.com/wp-content/uploads/2011/09/Location-Based-Services.jpg

2. Structure of German Addresses
- Company name
  - No common pattern
  - Variable length
  - Type of business entity can be part of the name
- Street name
  - A number of common suffixes
  - But many exceptions
  - Spelling varies a lot (abbreviations)
  - Variable length
- Street number
  - Single digit or longer number
  - Can be suffixed by a character
- Postal code
  - Five digits
  - Might be prefixed by D-
- City
  - No common structure
  - Some suffixed indicators
  - Not for all cities
  - Different naming schemes for a single city
  - E.g. Frankfurt, Frankfurt/Main, Ffm, …
- Overall address
  - A general structure exists, but with many exceptions
  - Can be fragmented by other attributes
  - E.g. the name of a company is not mentioned next to the address but somewhere else on a Web site
  - All attributes within one line
  - …

3. Solution: Overview
Approach:
1. Pre-processing
2. Identification of single attributes, with some dependencies defined by patterns
3. Afterwards, aggregation of the results into complete addresses
[Pipeline diagram: Pre-Processing -> Identification of (Postal Codes, Cities, Street Numbers, Street Names, Company Names) -> Aggregation]

3. Solution: Steps
Preprocessing
- Stripping of HTML markup
- Data cleaning
- Line splitting
- Tokenization
- Part-of-Speech (POS) tagging
Identification of Single Attributes
- Independent of previous identifications
- Only some dependencies, to improve precision
- Leads to a large number of candidates for each attribute
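The first preprocessing steps (markup stripping, cleaning, line splitting, tokenization) could be sketched roughly as follows. This is only an illustration using the Python standard library, not the implementation used in the talk. The choice of block-level tags, the token pattern and the sample address are assumptions, and POS tagging is left out because it needs an external tagger.

```python
import re
from html.parser import HTMLParser

# Assumed token pattern: word characters, optionally joined by '.', '-' or '/',
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:[.\-/]\w+)*|[^\w\s]")


class TextExtractor(HTMLParser):
    """Collects visible text and inserts line breaks at block-level tags."""

    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "address"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def preprocess(html):
    """Return a list of token lists, one per non-empty text line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = []
    for raw in "".join(parser.parts).splitlines():
        cleaned = " ".join(raw.split())  # data cleaning: collapse whitespace
        if cleaned:
            lines.append(TOKEN_RE.findall(cleaned))
    return lines


# Made-up sample page for illustration only.
sample = "<p>Beispiel GmbH<br>Musterstraße 12a</p><p>64283 Darmstadt</p>"
token_lines = preprocess(sample)
```

Keeping the line structure matters here because the later identification steps work on token positions within and across lines.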

3. Solution: Steps
Identification of Postal Codes
- Regular expression
Identification of Cities
1. Terms within a certain distance (3 tokens) of a postal code candidate that exist in the gazetteer
   - Gazetteer assembled from OpenStreetMap
   - 28,087 entries
2. Terms that are directly preceded by a postal code candidate
   - Capitalized
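A minimal sketch of these two identification steps is given below. The regular expression is an assumption based on the address structure described earlier (five digits, optionally prefixed by D-), and the three-entry gazetteer is only a stand-in for the 28,087 entries assembled from OpenStreetMap.

```python
import re

# Assumed pattern: five digits, optionally prefixed by "D-".
POSTAL_CODE_RE = re.compile(r"^(?:D-)?\d{5}$")

# Tiny stand-in for the OpenStreetMap-based city gazetteer.
CITY_GAZETTEER = {"Darmstadt", "Frankfurt", "Berlin"}

MAX_DISTANCE = 3  # tokens, as stated on the slide


def find_postal_codes(tokens):
    """Return the positions of all postal code candidates."""
    return [i for i, tok in enumerate(tokens) if POSTAL_CODE_RE.match(tok)]


def find_cities(tokens, postal_positions):
    """Return the positions of all city candidates."""
    candidates = set()
    for p in postal_positions:
        # Rule 1: gazetteer terms within MAX_DISTANCE tokens of the postal code.
        lo = max(0, p - MAX_DISTANCE)
        hi = min(len(tokens), p + MAX_DISTANCE + 1)
        for i in range(lo, hi):
            if i != p and tokens[i] in CITY_GAZETTEER:
                candidates.add(i)
        # Rule 2: a capitalized term directly after the postal code.
        if p + 1 < len(tokens) and tokens[p + 1][:1].isupper():
            candidates.add(p + 1)
    return sorted(candidates)


tokens = ["Musterstraße", "12a", ",", "D-64283", "Darmstadt"]
postal_positions = find_postal_codes(tokens)
city_positions = find_cities(tokens, postal_positions)
```

Rule 2 also yields candidates for places missing from the gazetteer, which fits the idea of generating many candidates first and filtering during aggregation.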

3. Solution: Steps
Identification of Street Numbers
- Regular expression
- Also for ranges of street numbers
Identification of Street Names
1. Token chains ending with an indicator term
   - Gazetteer of indicators assembled from OpenStreetMap
   - Containing the 30 most common endings of German street names
   - Covering 70% of German street names
2. Token chains that follow a certain POS pattern
   - Out of 6 manually defined patterns
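The street number expression and the indicator-based street name rule might look roughly like the sketch below. The concrete regular expression, the short indicator list (a stand-in for the 30 endings taken from OpenStreetMap), and the rule for extending a chain to the left over capitalized tokens are all assumptions. The POS-pattern rule is not covered.

```python
import re

# Assumed pattern: up to four digits, an optional letter suffix,
# and an optional range such as "10-12".
STREET_NUMBER_RE = re.compile(r"^\d{1,4}[A-Za-z]?(?:-\d{1,4}[A-Za-z]?)?$")

# Small stand-in for the gazetteer of common street name endings.
STREET_INDICATORS = ("straße", "strasse", "weg", "platz", "allee", "gasse")

MAX_CHAIN = 3  # hypothetical limit on the length of a token chain


def find_street_numbers(tokens):
    """Return the positions of all street number candidates."""
    return [i for i, tok in enumerate(tokens) if STREET_NUMBER_RE.match(tok)]


def find_street_names(tokens):
    """Return (start, end) spans of token chains ending with an indicator."""
    spans = []
    for end, tok in enumerate(tokens):
        if not tok.lower().endswith(STREET_INDICATORS):
            continue
        start = end
        # Extend the chain to the left over capitalized tokens.
        while (start > 0 and end - start + 1 < MAX_CHAIN
               and tokens[start - 1][:1].isupper()):
            start -= 1
        spans.append((start, end))
    return spans


tokens = ["Besuchen", "Sie", "uns", ":", "Alte", "Frankfurter", "Straße",
          "10-12", ",", "64283", "Darmstadt"]
number_positions = find_street_numbers(tokens)
street_spans = find_street_names(tokens)
```

Limiting street numbers to four digits keeps five-digit postal codes out of the street number candidates.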

3. Solution: Steps
Identification of Company Names
1. Token chains ending with an indicator term
   - List of terms from a Wikipedia page on types of business entities
   - 29 indicator terms
2. Token chains preceding a street name
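Both company name rules can be illustrated with the following sketch. The indicator set is a small stand-in for the 29 terms from Wikipedia, and the way a chain is extended to the left (until punctuation or a hypothetical maximum length) is an assumption, since the slides do not state how chain boundaries are chosen.

```python
# Small stand-in for the list of business entity types.
COMPANY_INDICATORS = {"GmbH", "AG", "KG", "OHG", "UG"}

MAX_CHAIN = 4  # hypothetical limit on the length of a token chain


def extend_left(tokens, end, max_len):
    """Extend a chain ending at `end` to the left until punctuation is hit."""
    start = end
    while (start > 0 and end - start + 1 < max_len
           and tokens[start - 1][:1].isalnum()):
        start -= 1
    return start


def find_company_names(tokens, street_starts=()):
    """Return (start, end) spans of company name candidates."""
    spans = set()
    # Rule 1: token chains ending with a business entity indicator.
    for end, tok in enumerate(tokens):
        if tok in COMPANY_INDICATORS:
            start = extend_left(tokens, end, MAX_CHAIN)
            if start < end:
                spans.add((start, end))
    # Rule 2: token chains directly preceding a street name candidate.
    for s in street_starts:
        end = s - 1
        if end >= 0 and tokens[end][:1].isalnum():
            spans.add((extend_left(tokens, end, MAX_CHAIN), end))
    return sorted(spans)


tokens = ["Kontakt", ":", "Beispiel", "Software", "GmbH", ",",
          "Musterstraße", "12"]
company_spans = find_company_names(tokens, street_starts=[6])
```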

3. Solution: Steps
Aggregation
1. Company candidates as seeds
2. Search for the closest combination of street name and street number candidates
3. Search for the closest combination of postal code and city candidates
4. If all elements are found for a company candidate: complete address
Image source: http://d3sdoylwcs36el.cloudfront.net/online_content_distribution_strategies_aggregation_getty_images.jpg/
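The four aggregation steps can be sketched as follows, with every candidate reduced to a single token position. How "closest" is measured is not specified on the slide. The cost used here (distance from the seed to the first element plus distance between the two elements) is an assumption made only for this illustration.

```python
def closest_pair(anchor, firsts, seconds):
    """Pick the (first, second) combination with the lowest assumed cost."""
    best = None
    for a in firsts:
        for b in seconds:
            cost = abs(a - anchor) + abs(b - a)
            if best is None or cost < best[0]:
                best = (cost, a, b)
    return None if best is None else (best[1], best[2])


def aggregate(companies, streets, numbers, postal_codes, cities):
    """Build complete addresses, using company candidates as seeds."""
    addresses = []
    for company in companies:
        street = closest_pair(company, streets, numbers)
        place = closest_pair(company, postal_codes, cities)
        # Only a seed for which all elements were found yields an address.
        if street is not None and place is not None:
            addresses.append({
                "company": company,
                "street": street[0],
                "number": street[1],
                "postal_code": place[0],
                "city": place[1],
            })
    return addresses


# Token positions of candidates; the second street/number pair is a distractor.
result = aggregate(companies=[2], streets=[6, 40], numbers=[7, 41],
                   postal_codes=[9], cities=[10])
```

Because incomplete seeds are dropped, the large candidate sets produced by the identification steps are pruned at this stage.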

4. Evaluation: Methodology
Image source: http://wisesyracuse.wordpress.com/2012/05/23/how-to-measure-the-effectiveness-of-your-social-media-efforts/

4. Evaluation: Results

4. Evaluation: Challenges
- Structure of company names is often very unusual
  - Leads to partly correct detection
  - E.g. "oberüber Agentur für digitale Wertschöpfung" has been detected as "Agentur für digitale Wertschöpfung"
- Several company names on the Web site
  - Wrong company is assigned to an address
- Transformation from HTML code to text introduces errors

5. Related Work
- [Loos08]
  - Usage of Conditional Random Fields
  - Small annotated dataset for bootstrapping
  - Result of unsupervised tagger as an additional feature
- [Asadi08]
  - Manually defined patterns for address extraction with confidence scores
  - Usage of some geographic information from an unknown source
- [Cai05]
  - Exploiting graph-based similarity to a template graph
  - Usage of a commercial GIS database
- [Ahlers08]
  - Relying on a complete database of street names, postal codes and cities
  - Matching of text to valid combinations of those attributes
- Relying on manual effort and/or extensive proprietary data sources
- No identification of business addresses

5. Related Work: Results
Comparison to Related Work
- Restricted to addresses without company name

Approach      Precision  Recall  F1-Measure  Language
[Loos08]      0.89       0.64    0.74        de
[Asadi08]     0.97       0.73    0.83        en
[Cai05]       0.75       0.73    0.74        en
[Ahlers08]    Not given  ~0.95   Not given   de
Our approach  0.93       0.95    0.94        de
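The F1-Measure column is the harmonic mean of precision and recall. The short check below recomputes it from the precision and recall values in the comparison. The function name `f1_measure` is chosen here for illustration.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the comparison.
reported = {
    "[Loos08]": (0.89, 0.64),
    "[Asadi08]": (0.97, 0.73),
    "[Cai05]": (0.75, 0.73),
    "Our approach": (0.93, 0.95),
}

f1_scores = {name: round(f1_measure(p, r), 2) for name, (p, r) in reported.items()}
```

[Ahlers08] is left out because no precision is given for it.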

6. Adaptation to other Country/Language
- Define overall pattern (order of attributes)
- Adapt identification of single attributes
- Re-create gazetteers
  - Cities
  - Street name indicators
  - Business entity types
- OpenStreetMap and Wikipedia exist in most countries/languages

7. Conclusion & Future Work
- A new approach for identification of address data
  - Outperforming existing approaches
  - No usage of commercial databases
  - Adaptable to other languages / countries
  - Tailored for identification of business addresses
- Next steps:
  - Adapt patterns to other languages / countries
  - Evaluate in other languages / countries

Questions & Contact
Source: http://www.dreifragezeichen.de/

References
[Ahlers08] D. Ahlers and S. Boll. Retrieving Address-based Locations from the Web. In Proceedings of the 2nd International Workshop on Geographic Information Retrieval, GIR 08, pages 27–34, New York, NY, USA, 2008. ACM.
[Asadi08] S. Asadi, G. Yang, X. Zhou, Y. Shi, B. Zhai, and W.-R. Jiang. Pattern-Based Extraction of Addresses from Web Page Content. In Y. Zhang, G. Yu, E. Bertino, and G. Xu, editors, Progress in WWW Research and Development, volume 4976 of Lecture Notes in Computer Science, pages 407–418. Springer Berlin Heidelberg, 2008.
[Cai05] W. Cai, S. Wang, and Q. Jiang. Address extraction: Extraction of location-based information from the web. In Y. Zhang, K. Tanaka, J. Yu, S. Wang, and M. Li, editors, Web Technologies Research and Development - APWeb 2005, volume 3399 of Lecture Notes in Computer Science, pages 925–937. Springer Berlin Heidelberg, 2005.
[Loos08] B. Loos and C. Biemann. Supporting Web-based Address Extraction with Unsupervised Tagging. In C. Preisach, H. Burkhardt, L. Schmidt-Thieme, and R. Decker, editors, Data Analysis, Machine Learning and Applications, Studies in Classification, Data Analysis, and Knowledge Organization, pages 577–584. Springer Berlin Heidelberg, 2008.
[Mühleisen12] H. Mühleisen and C. Bizer. Web Data Commons - Extracting Structured Data from Two Large Web Corpora. In Proceedings of the 5th Workshop on Linked Data on the Web, 2012.
