Published by Charles Billey. Modified over 9 years ago.
1
Fine-Grained Geographical Relation Extraction from Wikipedia
André Blessing and Hinrich Schütze
University of Stuttgart (Universität Stuttgart), Institute for Natural Language Processing (IMS)
2
Overview
  motivation: why are fine-grained relations important?
  self-annotation
    automatic annotation using structured data
    use this annotation to train a classifier
  extraction framework
  evaluation and conclusion
3
Geographical data provider: GeoNames
  gazetteer: names, type, coordinates
  8 million entries
  2.6 million populated places
  community-based
  Creative Commons Attribution 3.0 License (free to share)
4
GeoNames
5
GeoNames – hierarchical types

  Name                   | German name + sample                         | Description
  ADM1                   | Bundesland (Rheinland-Pfalz)                 | primary administrative division of a country (e.g. a state in the United States)
  ADM2                   | Regierungsbezirk                             | subdivision of a first-order administrative division
  ADM3                   | Landkreis (Bad Kreuznach)                    | county, a subdivision of a second-order administrative division
  ADM4                   | Gemeinde (Gebroth)                           | municipality, a subdivision of a third-order administrative division
  PPL (populated place)  | Stadt-, Ortsteil (Stuttgart Bad Cannstatt)   | suburb, a subdivision of a fourth-order administrative division
6
GeoNames – missing hierarchical relations
7
Task definition
  relation definition
    R 1_2: ADM3-ADM4, Landkreis (county) and Gemeinde (municipality)
    R 0_1: ADM4-PPL, Gemeinde (municipality) and Ortsteil (suburb)
  task
    classify all possible binary relations between the named entities (NEs) in one sentence
8
Example: binary relations between all NEs
  Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
  (Gebroth is a municipality in the county of Bad Kreuznach in Rheinland-Pfalz (Germany).)
  binary relations between NEs
    (Gebroth, Bad Kreuznach): element of R 1_2
    (Gebroth, Rheinland-Pfalz)
    (Gebroth, Deutschland)
    (Bad Kreuznach, Rheinland-Pfalz)
    (Bad Kreuznach, Deutschland)
    (Rheinland-Pfalz, Deutschland)
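The candidate pairs on this slide can be enumerated mechanically. The following is a minimal sketch, assuming the named entities of a sentence are already available as a list; the function name `candidate_pairs` is hypothetical and not part of the presented system.

```python
from itertools import combinations

# Named entities in the order they appear in the example sentence.
entities = ["Gebroth", "Bad Kreuznach", "Rheinland-Pfalz", "Deutschland"]


def candidate_pairs(nes):
    """Return every unordered pair of named entities from one sentence."""
    return list(combinations(nes, 2))


pairs = candidate_pairs(entities)
for pair in pairs:
    print(pair)
```

Four entities yield six candidate pairs. A classifier then has to decide for each pair whether it belongs to a relation such as R 1_2.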
9
Requirements for the extraction system
  fast to develop
    requested relation types can change
    avoid expensive manual annotation
  fine-grained relation types
    e.g. a simple part-of relation is not sufficient
  the trained system needs no structured data
    several input sources (Wikipedia, blogs, Twitter, news)
  German data
10
Wikipedia as a resource
  structured data
    templates (e.g. infoboxes), links, categories, tables, lists
  unstructured data
    written text
  high quality
    many users
    WikiBots
  structured data can be used to annotate unstructured data → self-annotation
11
Self-annotation: example
  structured data (article Gebroth): Landkreis Bad Kreuznach (county)
  unstructured data: Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach in Rheinland-Pfalz (Deutschland).
  annotation: R 1_2 (Gebroth, Bad Kreuznach)
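A minimal sketch of the self-annotation idea, assuming the infobox is available as a plain dictionary and using exact substring matching. The function `self_annotate` and the field name `"Landkreis"` are illustrative assumptions, not the authors' implementation, and exact matching is a deliberate simplification.

```python
def self_annotate(article_title, infobox, sentence,
                  field="Landkreis", relation="R 1_2"):
    """Label a sentence with a relation when both the article entity and
    the infobox value for `field` occur in it (exact substring match)."""
    value = infobox.get(field)
    if value and article_title in sentence and value in sentence:
        return (relation, article_title, value)
    return None  # nothing to annotate in this sentence


# Hypothetical, hand-written infobox content for the article "Gebroth".
infobox = {"Landkreis": "Bad Kreuznach"}
sentence = ("Gebroth ist eine Ortsgemeinde im Landkreis Bad Kreuznach "
            "in Rheinland-Pfalz (Deutschland).")

annotation = self_annotate("Gebroth", infobox, sentence)
print(annotation)
```

Sentences labeled this way can serve as training data without manual annotation. Inflected or differently spelled names would be missed by the substring test used here.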
12
Self-annotation: challenges
  infoboxes are not always filled in completely, correctly, or coherently
  matching with unstructured data
    pattern matching is not sufficient
      orthographic variants
      morphology
      multi-word expressions
    matching needs some manual adjustment
  only one relation per article
13
Extraction framework
  UIMA (Unstructured Information Management Architecture)
    pipeline architecture
    easy exchange of components
    fast development
  extended components
    CollectionReader for Wikipedia
    linguistic annotation
    supervised classifier
14
Extraction pipeline
  [Diagram: UIMA pipeline. Labels: German Wikipedia, JWPL, Collection Reader (unstructured text, structured data), Collection Reader (text), GeoNames, Self-Annotation, FSPar-Annotator, FSPar-Engine, ClearTK, MaxEnt-Classifier, Consumer]
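The pipeline idea can be illustrated without UIMA itself. The sketch below is plain Python, not the UIMA API: each component is a function that takes a document and returns it with more annotations, so components can be swapped freely. All names (`run_pipeline`, `tokenize`, `count_tokens`) are hypothetical.

```python
from typing import Callable, Dict, List

Document = Dict[str, object]
Component = Callable[[Document], Document]


def run_pipeline(doc: Document, components: List[Component]) -> Document:
    """Pass the document through each component in order."""
    for component in components:
        doc = component(doc)
    return doc


def tokenize(doc: Document) -> Document:
    # Whitespace tokenization stands in for a real linguistic annotator.
    doc["tokens"] = str(doc["text"]).split()
    return doc


def count_tokens(doc: Document) -> Document:
    # A trivial downstream consumer that reads an earlier annotation.
    doc["n_tokens"] = len(doc["tokens"])
    return doc


result = run_pipeline({"text": "Gebroth ist eine Ortsgemeinde"},
                      [tokenize, count_tokens])
print(result["n_tokens"])
```

Replacing `tokenize` with a different annotator leaves the rest of the pipeline unchanged, which is the property the slides attribute to the pipeline architecture.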
15
Linguistic processing
  FSPar engine (Schiehlen 2003)
    tokenizer
    PoS tagger (based on TreeTagger)
    chunker
    partial dependency parser

  Token         | PoS     | Lemma
  Gebroth       | NE      | Gebroth
  ist           | VAFIN   | seinA
  eine          | ART     | ein
  Ortsgemeinde  | NN      | Orts#@gemeinde
  im            | APPRART | in
16
Supervised classification
  extended ClearTK-Annotator
  feature sets
    F0: NE distance (baseline)
    F1: window-based (PoS, lemma, size=2)
    F2: chunks (parent chunks of NEs)
    F3: dependency parse (paths between NEs)
  MaxEnt classifier
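To make the two simplest feature sets concrete, here is a minimal sketch of F0 and F1 for one NE pair. It assumes that distance is measured in tokens and that each multi-word NE is a single token; the feature names, the function names, and the hand-written token list are illustrative assumptions, not actual FSPar or ClearTK output.

```python
# (token, PoS, lemma) triples, written by hand for illustration.
tokens = [
    ("Gebroth", "NE", "Gebroth"),
    ("ist", "VAFIN", "sein"),
    ("eine", "ART", "ein"),
    ("Ortsgemeinde", "NN", "Ortsgemeinde"),
    ("im", "APPRART", "in"),
    ("Landkreis", "NN", "Landkreis"),
    ("Bad Kreuznach", "NE", "Bad Kreuznach"),
]


def f0_distance(i, j):
    """F0: distance in tokens between the two named entities."""
    return {"ne_distance": abs(j - i)}


def f1_window(tokens, i, j, size=2):
    """F1: PoS and lemma of up to `size` tokens on each side of both NEs."""
    feats = {}
    for name, idx in (("ne1", i), ("ne2", j)):
        for offset in range(-size, size + 1):
            k = idx + offset
            if offset == 0 or not 0 <= k < len(tokens):
                continue  # skip the NE itself and positions outside the sentence
            _, pos, lemma = tokens[k]
            feats[f"{name}_pos_{offset}"] = pos
            feats[f"{name}_lemma_{offset}"] = lemma
    return feats


features = {**f0_distance(0, 6), **f1_window(tokens, 0, 6)}
print(features)
```

For the pair (Gebroth, Bad Kreuznach), the window around the second NE captures the lemma "Landkreis", a cue that a distance-only baseline cannot see.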
17
Evaluation
  9000 articles about German municipalities and suburbs
    5300 articles for training
    1800 articles for development
    1800 articles for final evaluation
  the R 1_2 relation is also available from the Federal Statistical Office of Germany
    used to evaluate self-annotation: 99.9% (1 error in 1304 sentences)
18
Results

  Classifier | Features  | Precision | Recall | FP  | FN
  1          | F0        | 79.0%     | 55.7%  | 279 | 833
  2          | F0+F1     | 92.4%     | 89.3%  | 138 | 202
  3          | F0+F2     | 90.2%     | 89.5%  | 182 | 198
  4          | F0+F3     | 97.7%     | 97.4%  | 43  | 48
  5          | F0....F3  | 98.8%     | 97.8%  | 23  | 41

  Feature set | Linguistic effort | Description
  F0          | None              | Distance + NE position
  F1          | PoS-Tagging       | Window-based (size=2, PoS, lemma)
  F2          | Chunk-parse       | Parent chunk
  F3          | Dependency-parse  | Dependency paths between NEs
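The table gives FP and FN counts but no true-positive counts. As a sanity check, the sketch below assumes the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN), infers TP for the F0 baseline from its recall and FN, and recomputes both scores. The inferred TP value is derived here, not reported in the slides.

```python
def implied_true_positives(recall_value, false_negatives):
    """Solve recall = TP / (TP + FN) for TP."""
    return recall_value * false_negatives / (1.0 - recall_value)


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


# Baseline classifier (F0): recall 55.7%, FP = 279, FN = 833.
tp_baseline = round(implied_true_positives(0.557, 833))
p = precision(tp_baseline, 279)
r = recall(tp_baseline, 833)
print(tp_baseline, round(p, 3), round(r, 3))
```

The recomputed precision agrees with the 79.0% in the table, so the reported counts and percentages are mutually consistent under these definitions.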
19
Conclusion
  text is an important resource for context-aware systems
  self-annotation
    automatic annotation using structured data
  Wikipedia is a valuable resource
    structured and unstructured data
    containing fine-grained relations
  UIMA-based implementation
  fine-grained geographical relation extraction is possible
20
Questions?
www.nexus.uni-stuttgart.de