Published by Jordan Bennett. Modified over 9 years ago.
1
Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra, jorge.vivaldi@upf.edu
Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politècnica de Catalunya, horacio@lsi.upc.es
2
Outline
- Introduction
- Related approaches
- Methodology
- Evaluation
- Conclusions and future work
3
Introduction
Problem: automatically extracting terminological units from specialized texts.
Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.
4
Related approaches
- Magnini et al., 2000
- Montoyo et al., 2001
- Missikoff et al., 2002
- Vivaldi, Rodríguez, 2002
- Vivaldi, Rodríguez, 2004
- Bernardini et al., 2006
- Cui et al., 2008
5
Graph structure of Wikipedia
[Figure: WP categories (A to G) linked to WP pages (P1 to P3), together with the redirection table, disambiguation pages, interwiki links, external links and InfoBox.]
6
Methodology: overview
Main steps:
1) Find the domain name in WP as a category.
2) Look for all the subcategories and pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
[Figure: pipeline from WP through top categories, domain categories and domain pages to the final domain term set, with filtering and bootstrapping stages.]
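Step 3 (collecting descendants while avoiding loops) can be sketched as a breadth-first walk that remembers the categories it has already visited. This is only an illustration, not the authors' implementation: the `subcats` mapping, the function name `collect_descendants` and the toy graph are all assumptions made for the example.

```python
from collections import deque


def collect_descendants(subcats, top_category):
    """Return top_category plus every category reachable from it.

    subcats is a hypothetical dict mapping a category name to the list
    of its direct subcategories (not a real Wikipedia API).
    """
    seen = {top_category}
    queue = deque([top_category])
    while queue:
        cat = queue.popleft()
        for child in subcats.get(cat, ()):
            # Skipping categories already seen is what breaks cycles.
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


# Toy graph (assumed data) with a loop: Software engineering -> Software.
toy = {
    "Computing": ["Software", "Theoretical computer science"],
    "Software": ["Software engineering"],
    "Software engineering": ["Formal methods", "Software"],
}
domain_categories = collect_descendants(toy, "Computing")
```

The visited set guarantees termination even though the WP category graph is not a strict tree.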
7
Methodology: filtering
- Category level
- Page level
8
Methodology: filtering, category level
[Figure: a category C below the top category of the domain (CatSet1). Labels: direct super-categories of C with respect to CatSet1 (two groups), direct neutral super-categories, and the resulting category score.]
9
Methodology: filtering, page level
[Figure: a page P under a category C below the top category of the domain (CatSet2). Labels: categories of P with respect to CatSet2 (two groups), neutral categories, and the resulting page score.]
10
Methodology: category filtering
11
Methodology: page filtering
Additional category filtering using page scores:
catTerm: the set of pages associated with a category.
- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number with a negative score.
- MicroLoose: the same, with a greater-or-equal test.
- Macro: instead of counting the pages with positive or negative scores, use the components of those scores.
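The three tests can be sketched as follows. The slide does not define the score components, so this sketch assumes each page contributes a pair `(ok, ko)` of non-negative counts, that a page scores positively when `ok > ko` and negatively when `ok < ko`, and that Macro uses a strict comparison of the summed components. The function name `category_votes` and the sample data are hypothetical.

```python
def category_votes(pages):
    """Apply the three category tests to catTerm.

    pages is a list of (ok, ko) pairs, one per page (assumed encoding).
    """
    # Micro tests count pages by the sign of their score.
    positive = sum(1 for ok, ko in pages if ok > ko)
    negative = sum(1 for ok, ko in pages if ok < ko)
    # Macro test sums the score components over all pages.
    total_ok = sum(ok for ok, _ in pages)
    total_ko = sum(ko for _, ko in pages)
    return {
        "MicroStrict": positive > negative,
        "MicroLoose": positive >= negative,
        "Macro": total_ok > total_ko,  # strictness is an assumption
    }


# One positive page, one negative page, one tie (assumed data).
votes = category_votes([(3, 1), (0, 2), (2, 2)])
```

With this sample, only MicroLoose accepts the category, which shows how the greater-or-equal test differs from the strict one on ties.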
12
Page filtering example: “semantics” (in the Computing domain)
[Figure: category graph with the nodes Computing, software, software engineering, formal methods, theoretical computer science and semantics.]
Categories of the page “semantics”: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}
WPCD(semantics) = 0.25
13
Category filtering example using page scores: “chemistry”

#  DTC                                  Micro Strict   Micro Loose   Macro      Vote  Result
                                        ok   ko        ok   ko       ok   ko
1  electroquímica (electrochemistry)    13   5         16   2        36   12    +3    Accept
2  quesos (cheeses)                     0    8         6    2        8    12          Reject
3  óxidos de carbono (carbon oxides)    1    1         2    0        4    3     +2    Accept
14
Evaluation
Partial evaluation: “chemistry” and “astronomy”
- Test against Magnini et al., 2000 (WordNet 1.6)
- Low coverage: 25% for Chemistry and 15% for Astronomy
Full evaluation: “Medicine”
- Test against SNOMED-CT Spanish Edition (2009)
- Wide coverage of the clinical domain: 800K terms
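Testing a term list against a reference resource can be sketched as a set comparison. The slides do not state which measure was used, so precision (the share of extracted terms found in the reference) is an assumption here, as are the function name, the lowercase normalisation and the sample terms.

```python
def precision_against_reference(extracted, reference):
    """Share of extracted terms that also appear in the reference list.

    Matching is exact after trimming and lowercasing (an assumption).
    """
    ref = {term.strip().lower() for term in reference}
    ext = {term.strip().lower() for term in extracted}
    if not ext:
        return 0.0
    return len(ext & ref) / len(ext)


# Hypothetical data: three of the four extracted terms are in the reference.
score = precision_against_reference(
    ["Renal colic", "whisky", "oral cancer", "Barcelona"],
    ["renal colic", "oral cancer", "whisky"],
)
```

A low-coverage reference penalises correct terms it does not list, which is why the size of the reference resource matters for this kind of test.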
15
Partial evaluation
16
Full evaluation
Validation issues (Accepts / Reject): whisky, cigar, udder, fire, oral cancer, renal colic, phoniatrics, surgical instruments
17
Conclusions
- Good results when evaluated against a specialised resource
- Term list filtering must be improved (e.g. eliminate proper names)
18
Future work
- Apply this method to other languages and domains
- Improve filtering using the in/out links of selected pages
- Improve filtering using the page content as well
- Use this WP knowledge to improve a term extractor