Fine-Grained Geographical Relation Extraction from Wikipedia
André Blessing and Hinrich Schütze, IMS, Universität Stuttgart

Presentation transcript:

Slide 1: Fine-Grained Geographical Relation Extraction from Wikipedia
André Blessing, Hinrich Schütze
University of Stuttgart, Institute for Natural Language Processing (IMS)

Slide 2: Overview
- motivation
  - why are fine-grained relations important?
- self-annotation
  - automatic annotation using structured data
  - use this annotation for training a classifier
- extraction framework
- evaluation and conclusion

Slide 3: Geographical data provider
- GeoNames gazetteer
  - names, type, coordinates
  - 8 million entries
  - 2.6 million populated places
- community-based
- Creative Commons Attribution 3.0 License
  - free to share

Slide 4: GeoNames

Slide 5: GeoNames – hierarchical types

Name                    German name + sample                         Description
ADM1                    Bundesland (Rheinland-Pfalz)                 State in the United States, a primary administrative division of a country
ADM2                    Regierungsbezirk                             a subdivision of a first-order administrative division
ADM3                    Landkreis (Bad Kreuznach)                    County, a subdivision of a second-order administrative division
ADM4                    Gemeinde (Gebroth)                           Municipality, a subdivision of a third-order administrative division
PPL (populated place)   Stadt-, Ortsteil (Stuttgart Bad Cannstatt)   Suburb, a subdivision of a fourth-order administrative division

Slide 6: GeoNames – missing hierarchical relations

Slide 7: Task definition
- relation definition
  - R 1-2: ADM3-ADM4, Landkreis (county) and Gemeinde (municipality)
  - R 0-1: ADM4-PPL, Gemeinde (municipality) and Ortsteil (suburb)
- task
  - classify all possible binary relations of named entities in one sentence

Slide 8: Example – binary relations between all NEs
- German: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
- English: Gebroth is a municipality in the county Bad Kreuznach in Rheinland-Pfalz (Germany).
- binary relations between NEs
  - (Gebroth, Bad Kreuznach): element of R 1-2
  - (Gebroth, Rheinland-Pfalz)
  - (Gebroth, Deutschland)
  - (Bad Kreuznach, Rheinland-Pfalz)
  - (Bad Kreuznach, Deutschland)
  - (Rheinland-Pfalz, Deutschland)
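The candidate generation step in this example can be sketched in a few lines of Python. This is only an illustration of "all possible binary relations of named entities in one sentence"; the function name `candidate_pairs` is hypothetical, and the named entities are given by hand rather than produced by a tagger.

```python
from itertools import combinations

# Named entities in the order they appear in the example sentence
# (listed by hand here; the slides do not show the NE tagger output).
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]

def candidate_pairs(nes):
    """Return every unordered pair of named entities from one sentence."""
    return list(combinations(nes, 2))

pairs = candidate_pairs(entities)
for left, right in pairs:
    print(left, "|", right)
```

With four named entities this yields the six pairs listed on the slide; a classifier then decides for each pair whether it belongs to a relation such as R 1-2.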

Slide 9: Requirements for extraction system
- fast to develop
  - requested relation types can change
  - avoid expensive manual annotation
- fine-grained relation types
  - e.g. a simple part-of relation is not sufficient
- trained system needs no structured data
  - several input sources (Wikipedia, blogs, Twitter, news)
- German data

Slide 10: Wikipedia as resource
- structured data
  - templates (e.g. infoboxes), links, categories, tables, lists
- unstructured data
  - written text
- high quality
  - many users
  - WikiBots
- structured data can be used to annotate unstructured data → self-annotation

Slide 11: Self-annotation – example
- structured data (article Gebroth): Landkreis (county) Bad Kreuznach
- unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
- resulting annotation: R 1-2 (Gebroth, Bad Kreuznach)

Slide 12: Self-annotation – challenges
- infoboxes are not always filled in completely, correctly, or coherently
- matching with unstructured data
  - pattern matching is not sufficient
    - orthographic variation
    - morphology
    - multi-word expressions
  - matching needs some manual adjustment
- only one relation per article
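A minimal sketch of the self-annotation idea, assuming the infobox value and the article title are already available as plain strings. The helper names `normalize` and `self_annotate` are hypothetical, and the matching shown here only absorbs simple orthographic variation (case, hyphens, whitespace); it does not handle morphology, which the slide names as one reason plain pattern matching falls short.

```python
import re
import unicodedata

def normalize(text):
    """Lowercase, unify Unicode forms, and treat hyphens like spaces."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = text.replace("-", " ")
    return re.sub(r"\s+", " ", text).strip()

def self_annotate(article_title, infobox_county, sentence):
    """Return a positive (municipality, county) pair if both names occur
    in the sentence after normalization, otherwise None."""
    norm_sentence = normalize(sentence)
    if (normalize(article_title) in norm_sentence
            and normalize(infobox_county) in norm_sentence):
        return (article_title, infobox_county)
    return None

sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")
label = self_annotate("Gebroth", "Bad Kreuznach", sentence)
print(label)
```

Because each article contributes one infobox value, a sketch like this yields at most one positive pair per article, which matches the "only one relation per article" limitation above.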

Slide 13: Extraction framework
- UIMA (Unstructured Information Management Architecture)
  - pipeline architecture
  - easy exchange of components
  - fast development
- extended components
  - CollectionReader for Wikipedia
  - linguistic annotation
  - supervised classifier

Slide 14: Extraction pipeline
[Diagram]
- inputs: German Wikipedia (read via JWPL; unstructured text and structured data), GeoNames (text)
- UIMA pipeline components: Collection Reader, FSPar-Annotator (FSPar-Engine), Self-Annotation, ClearTK MaxEnt-Classifier, Consumer
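The pipeline idea (a reader followed by exchangeable annotators that enrich a shared document record) can be illustrated without UIMA. The sketch below is plain Python, not the UIMA or ClearTK API; the dictionary standing in for a UIMA CAS and the toy annotators `tokenize` and `count_tokens` are assumptions made for illustration.

```python
def collection_reader(articles):
    """Yield one shared document record per article (a stand-in for a CAS)."""
    for title, text in articles:
        yield {"title": title, "text": text, "annotations": {}}

def tokenize(doc):
    """Toy annotator: whitespace tokenization."""
    doc["annotations"]["tokens"] = doc["text"].split()
    return doc

def count_tokens(doc):
    """Toy annotator that depends on the output of an earlier one."""
    doc["annotations"]["n_tokens"] = len(doc["annotations"]["tokens"])
    return doc

def run_pipeline(articles, annotators):
    """Pass every document through the annotators in the given order."""
    results = []
    for doc in collection_reader(articles):
        for annotate in annotators:
            doc = annotate(doc)
        results.append(doc)
    return results

docs = run_pipeline(
    [("Gebroth", "Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach")],
    [tokenize, count_tokens],
)
print(docs[0]["annotations"]["n_tokens"])
```

Swapping a component means changing one entry in the annotator list, which is the "easy exchange of components" property the framework slide attributes to the pipeline architecture.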

Slide 15: Linguistic processing
- FSPar engine (Schiehlen 2003)
  - tokenizer
  - PoS tagger (based on TreeTagger)
  - chunker
  - partial dependency parser

Token     PoS       Lemma
Gebroth   NE        Gebroth
ist       VAFIN     sein
eine      ART       ein
im        APPRART   in

Slide 16: Supervised classification
- extended ClearTK-Annotator
- feature sets
  - F0: NE distance (baseline)
  - F1: window-based (PoS, lemma, size=2)
  - F2: chunks (parent chunks of NEs)
  - F3: dependency parse (paths between NEs)
- MaxEntClassifier
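To make the first two feature sets concrete, here is a rough sketch of F0 (token distance between two NEs) and F1 (PoS and lemma in a window of size 2 around each NE). The feature names, the span convention, and the PoS tags for "Ortsgemeinde" and "Landkreis" are assumptions; the slides do not show how the ClearTK feature extractors were actually configured.

```python
# (token, PoS, lemma) triples; NN tags for the two nouns are assumed.
tokens = [
    ("Gebroth", "NE", "Gebroth"),
    ("ist", "VAFIN", "sein"),
    ("eine", "ART", "ein"),
    ("Ortsgemeinde", "NN", "Ortsgemeinde"),
    ("im", "APPRART", "in"),
    ("Landkreis", "NN", "Landkreis"),
    ("Bad", "NE", "Bad"),
    ("Kreuznach", "NE", "Kreuznach"),
]

def extract_features(tokens, span_a, span_b, window=2):
    """Spans are (start, end) token indexes, end exclusive; span_a precedes span_b."""
    # F0: number of tokens between the two named entities.
    features = {"F0_distance": span_b[0] - span_a[1]}
    # F1: PoS and lemma of up to `window` tokens on each side of each NE.
    for name, (start, end) in (("a", span_a), ("b", span_b)):
        for offset in range(1, window + 1):
            left, right = start - offset, end - 1 + offset
            if left >= 0:
                features[f"F1_{name}_pos_-{offset}"] = tokens[left][1]
                features[f"F1_{name}_lemma_-{offset}"] = tokens[left][2]
            if right < len(tokens):
                features[f"F1_{name}_pos_+{offset}"] = tokens[right][1]
                features[f"F1_{name}_lemma_+{offset}"] = tokens[right][2]
    return features

# Gebroth is token 0; Bad Kreuznach spans tokens 6 and 7.
features = extract_features(tokens, (0, 1), (6, 8))
print(features["F0_distance"])
```

A feature dictionary of this kind is what a maximum entropy classifier would consume for each candidate NE pair; F2 and F3 would add chunk and dependency-path features from the parser output.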

Slide 17: Evaluation
- 9000 articles about German municipalities and suburbs
  - 5300 articles for training
  - 1800 articles for development
  - 1800 articles for final evaluation
- the R 1-2 relation is also available from the Federal Statistical Office of Germany
  - used to evaluate self-annotation
  - 99.9 % (1 error in 1304 sentences)

Slide 18: Results

Classifier  Features   Precision  Recall  FP  FN
1           F0         79.0%      55.7%
2           F0+F1      92.4%      89.3%
3           F0+F2      90.2%      89.5%
4           F0+F3      97.7%      97.4%   43  48
5           F0...F3    98.8%      97.8%   23  41

Feature set  Linguistic effort  Description
F0           None               Distance + NE position
F1           PoS tagging        Window-based (size=2, PoS, lemma)
F2           Chunk parse        Parent chunk
F3           Dependency parse   Dependency paths between NEs
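The slide reports precision and recall only. For a single-number comparison, the harmonic mean (F1) can be derived from the reported values; the sketch below does that arithmetic. The F1 scores it prints are computed here and are not figures from the presentation.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall) as reported on the results slide.
reported = {
    "F0": (0.790, 0.557),
    "F0+F1": (0.924, 0.893),
    "F0+F2": (0.902, 0.895),
    "F0+F3": (0.977, 0.974),
    "F0...F3": (0.988, 0.978),
}

for name, (precision, recall) in reported.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.3f}")
```

The ordering of the feature sets is the same as under precision: the baseline is weakest, the dependency-path features account for most of the gain, and combining all feature sets gives the best score.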

Slide 19: Conclusion
- text is an important resource for context-aware systems
- self-annotation
  - automatic annotation using structured data
- Wikipedia is a valuable resource
  - structured and unstructured data containing fine-grained relations
- UIMA-based implementation
- fine-grained geographical relation extraction is possible

Slide 20: Questions?