1
Semantic Information Extraction from Wikipedia: Data Extraction for Data Integration
Patrick Arnold, DBS-Oberseminar, December 2013
2
Motivation – Semantic Enrichment of Mappings
Given two matching concepts c1 and c2, two questions arise:
- Do they really match? (verification)
- What is their semantic relationship? (enrichment)
One possibility: generic strategies, e.g., morphological analysis (lexicographic similarity between the matching concepts).
3
Motivation – Arbitrariness of Language
- No correlation between the meaning and the representation of a real-world object
- Most matching tools are based on lexicographic analysis; why do they work?
Schema and Ontology Matching
- Concatenative word formation (e.g., compounding)
- Large overlap between ontologies
4
Motivation – Lexicographic strategies fail if…
- Domains are different
- Granularity is different
- Languages are different
Example: Wikipedia–Ebay Furniture Benchmark
- COMA only obtains 31 % recall
- Only 8 % in default mode (without enrichment)
5
Motivation – Remedy: Background Knowledge
Background knowledge is:
- Very precise
- Very effective
Problem: limited content
- Example WordNet: only 10 % recall in the Ebay–Wikipedia benchmark
Problem: limited number of sources
- Comprehensive and generic?
- Free of charge?
- English (or German) language?
- Semantic relations?
6
Ambitions
Solution: build a thesaurus (background knowledge source) by extracting knowledge from the web.
Wikipedia: 4.4 million entries
- Practically every common noun of the English language
- Good reliability
- The first sentence is typically a definition
- Expresses "is-a" relations in about % of all cases
- Also expresses synonyms and part-of relations
7
Ambitions
- Provide an interface similar to the WordNet approach
- Combine it with the WordNet approach
8
Wikipedia Article Distribution
Wikipedia contains instance pages and concept pages
- Instances: persons, places, buildings, companies, bands, movies, diseases, software, etc.
- Only 5.6 % of Wikipedia's articles are concept pages
- Still some 246,000 articles
- Some instance articles are quite valuable: diseases, species, vehicles (e.g., BMW is a car)
9
Wikipedia Article Distribution
10
Approach
Article Extraction
- Download the Wikipedia dump
- Extract each article
- Extract and clean the abstract of each article
- Store article objects in a DB
Information Extraction
- Take the first sentence of the abstract
- Preprocess the sentence
- Find the Hearst pattern
- Split the sentence into HP fragments
- Extract the relevant information from each fragment: source terms, hypernyms, meronyms, fields
Information Integration
- Store the extracted information in a DB
- Use an interface to access the information
- Combine the extracted information with other sources (example: WordNet)
11
Step 1: Article Extraction
Large amount of data to process
- Wikipedia dump: 9.5 GB (zipped), 44 GB unpacked
- Takes about 6 hours to download
- Using a web crawler instead would take about 17 days (about 3 pages per second)
Using SAX to parse the content
- Extract the abstract or the first X characters of each page
- Save the text to a database (MongoDB for fast access)
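The following is a minimal sketch of this step, assuming the standard MediaWiki dump element names and a local MongoDB instance; the collection name, the first-paragraph heuristic and the 1,500-character cap are illustrative assumptions, not the talk's actual code:

```python
# Sketch: stream a MediaWiki XML dump with SAX and store each article's
# abstract in MongoDB.
import xml.sax
from pymongo import MongoClient

class AbstractHandler(xml.sax.ContentHandler):
    """Collects (title, abstract) pairs while streaming through the dump."""

    def __init__(self, collection, max_chars=1500):
        super().__init__()
        self.collection = collection
        self.max_chars = max_chars  # "first X characters" fallback
        self.path = []              # current element nesting
        self.title = ""
        self.text = ""

    def startElement(self, name, attrs):
        self.path.append(name)
        if name == "page":
            self.title, self.text = "", ""

    def characters(self, content):
        if self.path and self.path[-1] == "title":
            self.title += content
        elif self.path and self.path[-1] == "text":
            self.text += content

    def endElement(self, name):
        self.path.pop()
        if name == "page" and self.title:
            # Abstract: wikitext up to the first blank line, capped at max_chars.
            abstract = self.text.strip().split("\n\n", 1)[0][: self.max_chars]
            self.collection.insert_one({"title": self.title, "abstract": abstract})

articles = MongoClient()["wikipedia"]["articles"]
xml.sax.parse("enwiki-latest-pages-articles.xml", AbstractHandler(articles))
```

SAX keeps memory usage constant, which matters for a 44 GB file; a DOM parser would try to load the entire dump at once.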
12
Step 2: Information Extraction
Approach: for each article, extract the semantic relations by exploiting the systematic structure of (Wikipedia) definitions.
Classic definition: hypernym + properties
- The definition must relate the term to another concept
- Example: "A washing machine is a machine to wash laundry."
Equivalence relations (using synonyms)
- Rare, as there are hardly any full synonyms
- Example: "U.K." stands for "United Kingdom"
13
Step 2: Information Extraction – The Structure of Wikipedia Definitions
Structure of typical Wikipedia definitions: given a term T, T is commonly defined by a hypernym H(T); someone who does not know T may know H(T).
- "A cassone is a chest."
- Often in combination with synonyms: "An automobile, autocar, motor car or car is a wheeled motor v…"
- Sometimes expressing meronyms: "Wipers are part of vehicles to remove rain from windshields."
- Sometimes expressing holonyms: "A computer consists of a CPU, main memory, BUS and some peripherals."
14
Step 2: Information Extraction – The Structure of Wikipedia Definitions
Additional information
Field reference:
- "In computing, a mouse is an input device that functions by…"
- "Column or pillar in architecture and structural engineering is a structural element that…"
Grammatical, phonological or etymological information:
- "A bus (/ˈbʌs/; plural "buses", /ˈbʌsɨz/, archaically also omnibus, multibus, or autobus) is a road vehicle…"
- "A wardrobe, also known as an armoire from the French, is a standing closet."
15
Step 2: Information Extraction – The Structure of Wikipedia Definitions
Hearst patterns
- Indicate the relationship between two terms
- Similar to a relational operator in algebra
- Examples: "is a", "consists of", "describes a"
16
Step 2: Information Extraction – The Structure of Wikipedia Definitions
From this information we can form a standard Wikipedia definition pattern, e.g.: "In computing, a mouse is an input device that…"
17
Step 2: Information Extraction – Procedure
1. Find the Hearst pattern
2. Split the sentence at the HP; result: 2 or 3 fragments
3. From each fragment, extract the source terms, hypernyms and meronyms
4. Also extract the field references
18
Step 2: Information Extraction – Procedure
Step 1: Find the Hearst pattern
Hypernym patterns (not restricted to "is a"):
- is a
- is typically a
- is defined as a
- is commonly a
- class of
- is any form of
- is one of the many
- is a general term for
- is used as a term for
- describes/denotes a
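A minimal sketch of how such a pattern inventory could be matched and used to split a definition sentence (the regex list covers only a few of the patterns above, and the function name is illustrative):

```python
# Sketch: match one of the hypernym patterns and split the definition
# sentence into (source fragment, pattern, target fragment).
import re

# Subset of the patterns listed above, ordered from specific to generic so
# that "is typically a" wins over the plain "is a".
HYPERNYM_PATTERNS = [
    r"is typically an?", r"is defined as an?", r"is commonly an?",
    r"is a general term for", r"is used as a term for",
    r"(?:describes|denotes) an?", r"is an?",
]

def split_at_hearst_pattern(sentence):
    """Return (source, matched pattern, target), or None if no pattern fits."""
    for pattern in HYPERNYM_PATTERNS:
        match = re.search(rf"\b{pattern}\b", sentence)
        if match:
            source = sentence[: match.start()].strip()
            target = sentence[match.end():].strip()
            return source, match.group(), target
    return None  # no known pattern: the sentence is not parsable

print(split_at_hearst_pattern("A cassone is a chest."))
# -> ('A cassone', 'is a', 'chest.')
```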
19
Step 2: Information Extraction – Procedure
Step 1: Find the Hearst pattern (continued)
Typical has-a patterns:
- consisting of a
- with a
- having a
Typical part-of patterns:
- within
- (is used) in
- (as part) of
20
Step 2: Information Extraction – Procedure
Some fuzzy patterns:
- refers to
- applies to
- is similar to
- is related to
21
Step 2: Information Extraction – Procedure
FSM for the is-a pattern
22
Step 2: Information Extraction – Procedure
Examples
23
Step 2: Information Extraction – Procedure
Approach: use an FSM to parse the fragments (see the sketch below)
- Word-by-word processing
- Taking word classes (POS tags) into account
- Very restrictive: if an unexpected condition is entered, the fragment is revoked
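As an illustration, a heavily simplified version of such a restrictive automaton for the target fragment of an "is-a" pattern might look as follows (states, transitions and the tag set are simplifying assumptions, not the actual FSM from the talk):

```python
# Sketch: a restrictive FSM that consumes (word, POS) pairs from an "is-a"
# target fragment and collects the hypernym compound; any unexpected input
# revokes the whole fragment.
def parse_target_fragment(tagged_words):
    """tagged_words: list of (word, POS) pairs from a POS tagger.
    Returns the extracted hypernym compound, or None (fragment revoked)."""
    state = "START"
    nouns = []
    for word, pos in tagged_words:
        if state == "START" and pos == "DT":              # optional determiner
            state = "MODIFIERS"
        elif state in ("START", "MODIFIERS") and pos == "JJ":
            state = "MODIFIERS"                           # skip adjectives
        elif pos.startswith("NN"):
            nouns.append(word)                            # build noun compound
            state = "NOUNS"
        elif state == "NOUNS":
            break                                         # compound is complete
        else:
            return None                                   # unexpected: revoke
    return " ".join(nouns) if nouns else None

# "an input device that functions by ..." -> "input device"
print(parse_target_fragment(
    [("an", "DT"), ("input", "NN"), ("device", "NN"), ("that", "WDT")]))
```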
24
Step 2: Information Extraction – Procedure
Preprocessing
- Remove braces if necessary; braces may contain valuable information, so try different configurations
- Replace expressions for simplification:
  - "is applied to" → "applies to"
  - "is any of a variety of" → "is any"
  - "and or" → "and"
  - "means of {}" (e.g., "Auto rickshaws are means of public transportation")
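A minimal sketch of this normalization, where the replacement table mirrors the pairs listed above (the pairing itself is my reading of the slide):

```python
# Sketch: strip parenthesized insertions and rewrite rare formulations into
# canonical patterns the extractor understands.
import re

REPLACEMENTS = {
    "is applied to": "applies to",
    "is any of a variety of": "is any",
    "and or": "and",
}

def preprocess(sentence, keep_braces=False):
    # Braces may hold valuable synonyms, so the caller can try both
    # configurations (with and without brace removal).
    if not keep_braces:
        sentence = re.sub(r"\([^)]*\)", "", sentence)
    for rare, common in REPLACEMENTS.items():
        sentence = sentence.replace(rare, common)
    return re.sub(r"\s+", " ", sentence).strip()

print(preprocess("A pedalo (British English) is a watercraft."))
# -> "A pedalo is a watercraft."
```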
25
Step 2: Information Extraction – Procedure
Extract information from source/target segment
26
Step 2: Information Extraction – Procedure
27
Step 2: Information Extraction – Procedure
Post-processing
Extracted information must be post-processed:
- Remove braces
- Remove quotes, etc.
- Stemming (tbd)
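A small sketch of this post-processing, using NLTK's Porter stemmer as a stand-in for the still-open ("tbd") stemming component:

```python
# Sketch: clean an extracted term and optionally stem it.
import re
from nltk.stem import PorterStemmer  # stand-in; stemming is still open

stemmer = PorterStemmer()

def postprocess(term, stem=False):
    term = re.sub(r"[(){}\[\]\"']", "", term)  # remove braces and quotes
    term = re.sub(r"\s+", " ", term).strip()
    if stem:
        term = " ".join(stemmer.stem(w) for w in term.split())
    return term

print(postprocess('"washing machines"'))             # -> washing machines
print(postprocess('"washing machines"', stem=True))  # -> wash machin
```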
28
Step 3: Integration
Store extracted information:
- Can be reduced to simple triples (term, relation, term)
- RDBMS or main memory (hash map)
Work in progress:
- Develop an interface to handle queries
- Combine it with WordNet (recursive approach)
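A minimal sketch of the main-memory variant with a WordNet fallback via NLTK; the interface is an assumption about the planned design, and the fallback shown is a single lookup rather than the full recursive resolution:

```python
# Sketch: in-memory triple store (hash map) with a WordNet fallback.
from collections import defaultdict
from nltk.corpus import wordnet as wn

triples = defaultdict(set)  # (term, relation) -> set of related terms

def add(term, relation, target):
    triples[(term, relation)].add(target)

def hypernyms(term):
    """Prefer extracted hypernyms; fall back to WordNet if none are stored."""
    found = triples[(term, "is-a")]
    if found:
        return found
    return {lemma.name() for synset in wn.synsets(term, pos=wn.NOUN)
            for hyper in synset.hypernyms()
            for lemma in hyper.lemmas()}

add("cassone", "is-a", "chest")
print(hypernyms("cassone"))   # from the extracted triples: {'chest'}
print(hypernyms("wardrobe"))  # falls back to WordNet
```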
29
Step 3: Integration – Resolution of Queries (Example)
30
Evaluation
4 scenarios from Wikipedia:
- Furniture (186 concepts)
- Infectious Diseases (107 concepts)
- Optimization Algorithms (122 concepts)
- Vehicles (94 concepts)
Questions:
- How many articles could be parsed?
- How many relevant relations could be found? (recall)
- How many extracted relations were correct? (precision)
31
Evaluation – Article Segmentation
We can process % of all Wikipedia articles; this highly depends on the domain, and not all articles are "parsable".

Scenario                  #Articles  #Processed Articles  Effectiveness
Furniture                 186        137                  73.7 %
Infectious Diseases       107        67                   63.2 %
Optimization Algorithms   122        91                   74.5 %
Vehicles                  94         87                   92.5 %
32
Evaluation – Article Segmentation
Some WP articles simply have no "classic" definition (hypernym/meronym missing). Examples:
- "Anaerobic infections are caused by anaerobic bacteria."
- "A blood-borne disease is one that can be spread through contamination…"
- "Cholera Hospital was established on June 24, 1854, at…" (instance)
- "Hutchinson's triad is named after Sir Jonathan Hutchinson (1828–1913)."
- "A pathogen in the oldest and broadest sense is anything that can produce disease."
33
Evaluation – Article Segmentation
Results for parsable articles:

Scenario                  #Articles  #Parsable Articles  #Processed Articles  Effectiveness
Furniture                 186        169                 137                  81.1 %
Infectious Diseases       107        91                  67                   73.6 %
Optimization Algorithms   122        113                 91                   80.5 %
Vehicles                  94         91                  87                   95.6 %
34
Evaluation – Concept Extraction
Recall (strict):

            Source Terms               Hypernyms                  Meronyms/Holonyms
            # in B  # correct  Recall  # in B  # correct  Recall  # in B  # correct  Recall
Vehicles    194     149        76.8 %  94      80         85.1 %  20      17.5       87.5 %
Diseases    178     125        70.2 %  86      57.5       66.9 %  36      19.5       54.1 %

Recall (effective):

            Source Terms               Hypernyms                  Meronyms/Holonyms
            # in B  # correct  Recall  # in B  # correct  Recall  # in B  # correct  Recall
Vehicles    168     138        82.1 %  90      78         86.6 %  20      17.5       87.5 %
Diseases    125     104        83.2 %  67      55.5       82.8 %  32      19.5       61.0 %
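For reference, assuming B denotes the relations of the manually built benchmark and R the extracted relations, the reported values follow the standard definitions:

    recall = |B ∩ R| / |B|        precision = |B ∩ R| / |R|

For example, 149 correctly found source terms out of 194 benchmark terms for vehicles yield a recall of 149/194 ≈ 76.8 %. The strict/effective distinction presumably refers to counting all articles versus only the parsable ones.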
35
Evaluation – Concept Extraction
Precision (strict):

            Source Terms                  Hypernyms                     Meronyms/Holonyms
            # correct  # in R  Precision  # correct  # in R  Precision  # correct  # in R  Precision
Vehicles    149        161     92.5 %     80         91      87.9 %     17.5       27.5    63.6 %
Diseases    125        131     95.4 %     57.5       68      83.3 %     19.5       29.5    66.1 %

Precision (effective):

            Source Terms                  Hypernyms                     Meronyms/Holonyms
            # correct  # in R  Precision  # correct  # in R  Precision  # correct  # in R  Precision
Vehicles    138        148     93.2 %     78         89      87.6 %     17.5       27.5    63.6 %
Diseases    104        106     98.1 %     55.5       65      85.4 %     19.5       28.5    68.4 %
36
Evaluation – Common Precision Problems
Complex words
- Example: "A minibus is a passenger carrying motor vehicle" (NN + VBG + NN + NN)
- Erroneous extraction: *"Minibus is a passenger"
37
Evaluation – Common Precision Problems
Braces – curse and blessing
- "A pedalo (British English) or paddle boat (US, Canadian, and Australian English) is a watercraft…"
  - *"British English is a watercraft"
  - *"US is a watercraft"
- "A powered parachute (motorized parachute, PPC, paraplane) is a parachute…"
  - Entirely correct extraction (4 source terms)
38
Evaluation – Common Precision Problems
Compound determination
- Example: "A draisine is a light auxiliary rail vehicle" (ADJ + ADJ + NN + NN); expected hypernym: rail vehicle
- Imprecise, but can be handled: gradual modifier removal / compound transitivity
- Simply ignoring adjectives and participles is not possible: "high school" (bathing suit), "pulled rickshaw" (rare)
39
Evaluation – Common Precision Problems
Misleading nouns
- "Jet pack, rocket belt, rocket pack and similar names are used for various types of devices"
- Erroneous conclusion: *"Similar names are devices"
40
Evaluation – Common Recall/Precision Problems
Misleading nouns in target phrases
Some phrases:
- "is the act of…"
- "is a method for…"
- "is a noun describing…"
- "is the heart of…"
Examples:
- "Land sailing is the act of moving across land."
- "Leipzig is the heart of the Central German Metropolitan Region." (*"Leipzig is a heart")
41
Evaluation – Common Recall Problems
Too much auxiliary information in the definition
- "A passenger car (known as a coach or carriage in the UK, and also known as a bogie in India[1]) is a…"
42
Evaluation – Common Recall Problems
Erroneous POS tagging
- Example: "A dog sled is a sled used for…" is tagged DET + NN + VBD + "is a" + VBD + …
- Solution: check whether the dubious word occurs in the page name
  - The page name was "Dog sled"; "sled" (VBD) is part of the page name
  - Ignore the POS tag (handle the word as a noun)
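A minimal sketch of this correction heuristic (the function name and the set of dubious tags are illustrative):

```python
# Sketch: if a dubiously tagged word occurs in the article's page name,
# override the POS tag and treat the word as a noun.
def correct_pos(word, tag, page_name):
    dubious = {"VBD", "VBN", "VBG"}  # verb tags that often hide nouns
    if tag in dubious and word.lower() in page_name.lower().split():
        return "NN"
    return tag

# Page "Dog sled": the tagger labels "sled" as VBD, but "sled" is part of
# the page name, so it is re-tagged as a noun.
print(correct_pos("sled", "VBD", "Dog sled"))  # -> NN
```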
43
Evaluation – Improvement
Pareto principle (80:20): improvement by handling special cases
Two ways for improvement:
- Extend the FSM: consider more specific formulations, but this makes the FSM complex and hard to manage
- More preprocessing: replace rare expressions with common expressions, but this may lead to large lists of specific expressions
44
Conclusions
Relation extraction is quite successful, but:
- Some intricate articles are not processable
- Some irrelevant or nonsensical relations are extracted
To do:
- Provide an interface for the Semantic Enrichment Module
- Combine it with WordNet
- Possibly extend it with further sources (like Wiktionary)
45
Thank You!