Finding Domain Terms using Wikipedia Jorge Vivaldi Palatresi Applied Linguistics Institute Universitat Pompeu Fabra Horacio Rodríguez.

Presentation transcript:

Finding Domain Terms using Wikipedia
Jorge Vivaldi Palatresi, Applied Linguistics Institute, Universitat Pompeu Fabra
Horacio Rodríguez Hontoria, TALP Research Center, Universitat Politècnica de Catalunya

Outline
- Introduction
- Related approaches
- Methodology
- Evaluation
- Conclusions and future work

Introduction
Problem: automatically extracting terminological units from specialized texts.
Result: a list of all the Wikipedia (WP) categories and page titles that our system considers to belong to the domain of interest.

Related approaches
- Magnini et al., 2000
- Montoyo et al., 2001
- Missikoff et al., 2002
- Vivaldi, Rodríguez, 2002
- Vivaldi, Rodríguez, 2004
- Bernardini et al., 2006
- Cui et al., 2008

Graph structure of Wikipedia
[Figure labels: WP categories (A to G); WP pages (P1 to P3); redirection table; disambiguation pages; interwiki links; external links; infobox]

Methodology: overview
Main steps:
1) Find the domain name as a category in WP.
2) Look for all the subcategories/pages related to the domain.
3) Extract all descendants of the domain name, avoiding loops.
4) Remove proper names and service classes.
5) Filter categories and pages.
[Figure labels: WP; top categories; domain categories; domain pages; filtering; bootstrapping; final domain term set]
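Steps 2 and 3 of the overview (collecting every descendant of the domain category without getting trapped in category loops) can be illustrated with a breadth-first traversal. This is only a minimal sketch, not the authors' implementation: the function name `collect_domain` and the two dictionaries `subcats` and `pages` are hypothetical stand-ins for however the WP category graph is actually stored.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def collect_domain(
    top: str,
    subcats: Dict[str, List[str]],
    pages: Dict[str, List[str]],
) -> Tuple[Set[str], Set[str]]:
    """Return (categories, pages) reachable from the top domain category.

    subcats maps a category to its direct subcategories and pages maps a
    category to its pages; both are hypothetical inputs for illustration.
    """
    seen: Set[str] = {top}          # categories already queued (loop guard)
    domain_pages: Set[str] = set()
    queue = deque([top])
    while queue:
        cat = queue.popleft()
        domain_pages.update(pages.get(cat, []))
        for child in subcats.get(cat, []):
            if child not in seen:   # skip categories reached before
                seen.add(child)
                queue.append(child)
    return seen, domain_pages


# Toy graph with a loop: B lists A as a subcategory again.
toy_subcats = {"A": ["B", "C"], "B": ["A", "D"]}
toy_pages = {"A": ["P1"], "D": ["P2"]}
cats, found_pages = collect_domain("A", toy_subcats, toy_pages)
```

The `seen` set is what keeps the traversal finite when the category graph contains cycles; removing proper names and filtering (steps 4 and 5) would then operate on the returned sets.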

Methodology: filtering
- Category level
- Page level

Methodology: filtering (category level)
[Figure labels: top category of the domain; CatSet1; category C; direct super-categories ∈ CatSet1; direct super-categories ∉ CatSet1; direct neutral super-categories; category score]

Methodology: filtering (page level)
[Figure labels: top category of the domain; CatSet2; category C; pages ∈ C; page P; categories ∈ CatSet2; categories ∉ CatSet2; neutral categories; page score]

Methodology: category filtering

Methodology: page filtering
Additional category filtering using page scores.
catTerm: the set of pages associated with a category.
- MicroStrict: accept the category if the number of elements of catTerm with a positive score is greater than the number of elements with a negative score.
- MicroLoose: the same, with a greater-or-equal test.
- Macro: instead of counting the pages with positive/negative scores, use the components of those scores.
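The three tests can be sketched as follows. This is a hedged illustration under stated assumptions: the slides do not define the score components, so each page score is assumed here to split into a positive (in-domain) and a negative (out-of-domain) part, and the Macro test is assumed to compare the summed parts with a strict test. The names `PageScore` and `filter_category` are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption for illustration: a page score has a positive (in-domain)
# component and a negative (out-of-domain) component.
PageScore = Tuple[float, float]


def filter_category(cat_term: List[PageScore]) -> Dict[str, bool]:
    """Apply the three page-based tests to the pages of one category."""
    # A page counts as positive/negative by comparing its two components.
    positives = sum(1 for pos, neg in cat_term if pos > neg)
    negatives = sum(1 for pos, neg in cat_term if pos < neg)
    total_pos = sum(pos for pos, _ in cat_term)
    total_neg = sum(neg for _, neg in cat_term)
    return {
        "MicroStrict": positives > negatives,
        "MicroLoose": positives >= negatives,
        # Assumed: Macro compares the summed components strictly.
        "Macro": total_pos > total_neg,
    }


tie = filter_category([(2.0, 1.0), (0.0, 3.0)])
skewed = filter_category([(5.0, 0.0), (0.0, 1.0), (0.0, 1.0)])
```

In the first toy category the page counts tie, so only MicroLoose accepts; in the second, most pages are negative but one strongly positive page dominates the summed components, so only Macro accepts. That difference is why the three tests can disagree on the same category.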

Page filtering example: “semantics” (in the Computing domain)
theoretical computer science → Computing ⇒ semantics
software → software engineering → formal methods ⇒ semantics
{linguistics, philosophy of language, semiotics, theoretical computer science, philosophical logic}
WPCD(semantics) = 0.25

Category filtering example using page scores: “chemistry”
Columns: # | DTC | Micro Strict (ok / ko) | Micro Loose (ok / ko) | Macro (ok / ko) | Vote | Result
1 | electroquímica (electrochemistry) | Result: Accept
2 | quesos (cheeses) | Result: Reject
3 | óxidos de carbono (carbon oxides) | Result: Accept

Evaluation
Partial evaluation: “chemistry” and “astronomy”
- Test against Magnini et al., 2000 (WordNet 1.6)
- Low coverage: 25% for Chemistry and 15% for Astronomy
Full evaluation: “Medicine”
- Test against SNOMED-CT Spanish Edition (2009)
- Wide coverage of the clinical domain: 800K terms

Partial evaluation

Full evaluation: validation issues
[Table with columns Accepts and Reject; entries: whisky, cigar, udder, fire, oral cancer, renal colic, phoniatrics, surgical instruments]

Conclusions
- Good results when evaluated against a specialised resource
- Term list filtering must be improved (e.g., by eliminating proper names)

Future work
- Apply this method to other languages/domains
- Improve filtering using the in/out links of selected pages
- Improve filtering by also using the page content
- Use this WP knowledge to improve a term extractor
