Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra Horacio Rodríguez.


1 Finding Domain Terms using Wikipedia. Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra, jorge.vivaldi@upf.edu. Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politècnica de Catalunya, horacio@lsi.upc.es

2 Outline: Introduction; Related approaches; Methodology; Evaluation; Conclusions and future work

3 Introduction
Problem: automatically extract terminological units from specialized texts.
Result: a list of all the WP categories and page titles that our system considers to belong to the domain of interest.

4 Related approaches: Magnini et al., 2000; Montoyo et al., 2001; Missikoff et al., 2002; Vivaldi, Rodríguez, 2002; Vivaldi, Rodríguez, 2004; Bernardini et al., 2006; Cui et al., 2008

5 Graph structure of Wikipedia
[Figure: WP categories (A to G) linked to WP pages (P1 to P3). Other elements shown: redirection table, disambiguation pages, interwiki links, external links, infobox.]

6 Methodology: overview
Main steps:
1) Find the domain name as a category in WP.
2) Look for all the subcategories/pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
[Diagram labels: WP; domain; top categories; domain categories; domain pages; filtering (categories, pages); bootstrapping; final domain term set.]
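Step 3 (extracting descendants while avoiding loops) can be illustrated with a breadth-first walk that remembers which categories it has already visited. This is only a minimal sketch: the function name `collect_descendants` and the dictionary-based toy graph are assumptions made for illustration, not the authors' implementation or a real Wikipedia API.

```python
from collections import deque


def collect_descendants(top, subcats, pages):
    """Walk the category graph from `top`, visiting each category once.

    `subcats` maps a category to its direct subcategories and `pages`
    maps a category to its pages (both hypothetical in-memory tables).
    """
    seen = {top}            # categories already queued; this is what breaks loops
    queue = deque([top])
    domain_pages = set()
    while queue:
        cat = queue.popleft()
        domain_pages.update(pages.get(cat, ()))
        for child in subcats.get(cat, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen, domain_pages


# Toy graph with a cycle: Oxides points back to Chemistry.
subcats = {"Chemistry": ["Electrochemistry", "Oxides"], "Oxides": ["Chemistry"]}
pages = {"Electrochemistry": ["Electrolysis"], "Oxides": ["Carbon dioxide"]}
cats, pgs = collect_descendants("Chemistry", subcats, pages)
```

Without the `seen` set the cycle between Chemistry and Oxides would make the walk run forever; with it, every category is expanded at most once.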

7 Methodology: filtering. Two levels: category level and page level.

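8 Methodology: filtering. Category level
[Figure: a category C below the Top Category of the Domain and the set CatSet1. The direct super-categories of C are split into two groups by their relation to CatSet1 (the relation symbols are missing from the transcript) plus the direct neutral super-categories; together they yield the Category Score.]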

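9 Methodology: filtering. Page level
[Figure: the pages of a category C below the Top Category of the Domain and the set CatSet2. The categories of a page P are split into two groups by their relation to CatSet2 (the relation symbols are missing from the transcript) plus the neutral categories; together they yield the Page Score.]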

10 Methodology: category filtering

11 Methodology: page filtering
Additional category filtering using page scores. catTerm: the set of pages associated with a category.
- MicroStrict: accept the category if the number of catTerm elements with positive scoring is greater than the number with negative scoring.
- MicroLoose: the same, with a greater-or-equal test.
- Macro: instead of counting the pages with positive/negative scoring, use the components of those scores.
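The three tests can be sketched as follows. The slide does not define the page score or its components, so this sketch assumes each page score is a pair `(pos, neg)`, that a page counts as positive when `pos > neg` and negative when `pos < neg`, and that Macro compares the summed components with a strict test. All function names are hypothetical, and this is one possible reading rather than the authors' code.

```python
def micro_counts(cat_term):
    """Count pages with positive and with negative scoring."""
    ok = sum(1 for pos, neg in cat_term if pos > neg)
    ko = sum(1 for pos, neg in cat_term if pos < neg)
    return ok, ko


def micro_strict(cat_term):
    ok, ko = micro_counts(cat_term)
    return ok > ko            # strictly more positive than negative pages


def micro_loose(cat_term):
    ok, ko = micro_counts(cat_term)
    return ok >= ko           # ties are accepted


def macro(cat_term):
    # Assumed reading: sum the score components instead of counting pages.
    pos_total = sum(pos for pos, _ in cat_term)
    neg_total = sum(neg for _, neg in cat_term)
    return pos_total > neg_total


# Made-up scores for the pages of one category.
cat_term = [(2, 1), (0, 3), (1, 1)]
results = (micro_strict(cat_term), micro_loose(cat_term), macro(cat_term))
```

With these made-up scores there is one positive and one negative page, so only the loose test accepts the category; the summed components (3 against 5) make Macro reject it.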

12 Page filtering example: “semantics” (in Computing domain)
Category chains shown on the slide (the connector symbols are missing from the transcript):
- theoretical computer science, Computing, semantics
- software, software engineering, formal methods, semantics
Category set: {linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}
WPCD(semantics) = 0.25

13 Category filtering example using page scores: “chemistry”
#  DTC                                 Micro Strict  Micro Loose  Macro     Vote  Result
                                       ok   ko       ok   ko      ok   ko
1  electroquímica (electrochemistry)   13   5        16   2       36   12   +3    Accept
2  quesos (cheeses)                    0    8        6    2       8    12         Reject
3  óxidos de carbono (carbon oxides)   1    1        2    0       4    3    +2    Accept

14 Evaluation
Partial evaluation: “chemistry” and “astronomy”
– Test against Magnini et al., 2000 (WordNet 1.6)
– Low coverage: 25% for Chemistry and 15% for Astronomy
Full evaluation: “Medicine”
– Test against SNOMED-CT Spanish Edition (2009)
– Wide coverage of the clinical domain: 800K terms

15 Partial evaluation

16 Full evaluation: validation issues
Column headers: Accepts, Reject.
Entries: whisky, cigar, udder, fire, oral cancer, renal colic, phoniatrics, surgical instruments.

17 Conclusions
- Good results when evaluated against a specialised resource.
- Term list filtering must be improved (e.g., by eliminating proper names).

18 Future work
- Apply this method to other languages/domains.
- Improve filtering using the in/out links of selected pages.
- Improve filtering using the page content as well.
- Use this WP knowledge to improve a term extractor.


