Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas

Collocations
Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
Technical and non-technical

Examples of collocations (* marks an unacceptable combination)
The Dow Jones average of industrials
The Dow average
The Dow industrials
*The Jones industrials
The Dow Jones industrial
*The industrial Dow
*The Dow industrial

Collocation properties
Arbitrary (dialect dependent)
  – ride a bike, set the table
Domain dependent
  – dry suit, wet suit
Recurrent
Cohesive
  – Part of a collocation primes for the rest

Applications
Lexicography
Grammatical restrictions (compare with/to but associate with)
Generation
Translation

Types of collocations
Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)
Rigid word groups
  – over the counter market
Phrases with open slots
  – fluency in a domain

Issues in finding collocations
Possibly more than two words
  – Need measure that extends beyond the binary case
Possibly intervening words
Possibly morphological and syntactic variation
Semantic constraints (cf. doctors-dentists and doctors-hospitals)

Xtract stage one
For a given word, find all collocates at positions -5 to +5
Three criteria:
  – strength (normalized frequency); 95% rejection vs. expected 68% under normal distribution
  – position histogram must not be flat
  – select peak from histogram
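
The three stage-one criteria can be illustrated with a rough sketch. It assumes pre-tokenized sentences, treats strength as a z-score over collocate frequencies, and uses the variance of the position histogram as the flatness test. The function names and the thresholds k0, u0, and k1 are illustrative assumptions, not values given on the slides.

```python
from collections import defaultdict
from statistics import mean, pstdev

def position_histograms(sentences, target, window=5):
    """Count each collocate of `target` at relative offsets -window..+window."""
    hist = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    hist[tokens[j]][j - i] += 1
    return hist

def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return {collocate: [peak offsets]} for collocates passing all three tests.

    k0, u0, k1 are assumed thresholds for strength, spread, and peak height.
    """
    hist = position_histograms(sentences, target, window)
    offsets = [d for d in range(-window, window + 1) if d != 0]
    freqs = {w: sum(h.values()) for w, h in hist.items()}
    if not freqs:
        return {}
    f_mean = mean(freqs.values())
    f_sd = pstdev(freqs.values())
    if f_sd == 0:
        return {}  # all collocates equally frequent: nothing stands out
    result = {}
    for w, h in hist.items():
        # Criterion 1: strength (frequency normalized as a z-score).
        strength = (freqs[w] - f_mean) / f_sd
        # Criterion 2: the position histogram must not be flat.
        counts = [h.get(d, 0) for d in offsets]
        c_mean = mean(counts)
        spread = sum((c - c_mean) ** 2 for c in counts) / len(counts)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep only the offsets where the histogram peaks.
        cut = c_mean + k1 * spread ** 0.5
        peaks = [d for d, c in zip(offsets, counts) if c >= cut]
        if peaks:
            result[w] = peaks
    return result
```

On a toy corpus the spread threshold has to be lowered, for example `stage_one(sentences, "dow", u0=1.0)`, because the counts are far smaller than in a real corpus.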

Xtract stage two
Start from word pairs
Look at each position in between, to the left, and to the right
Keep words that appear very often
If that fails, keep parts of speech that satisfy this criterion
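
One way to picture stage two is as a column-by-column vote over aligned contexts of a word pair. The sketch below assumes the contexts are already aligned on the pair, have equal length, and carry part-of-speech tags. The function name, the 0.75 threshold, and the output notation are assumptions made for illustration.

```python
from collections import Counter

def stage_two(contexts, threshold=0.75):
    """Build a phrase template from aligned contexts of a word pair.

    contexts: equal-length lists of (word, pos_tag) tuples.
    A position keeps its word if one word dominates, otherwise its
    part of speech if one tag dominates, otherwise a wildcard "*".
    """
    n = len(contexts)
    template = []
    for column in zip(*contexts):
        word, word_count = Counter(w for w, _ in column).most_common(1)[0]
        tag, tag_count = Counter(t for _, t in column).most_common(1)[0]
        if word_count / n >= threshold:
            template.append(word)
        elif tag_count / n >= threshold:
            template.append(f"<{tag}>")
        else:
            template.append("*")
    return template
```

With four contexts of "the dow jones average" followed by four different past-tense verbs, this sketch yields the four fixed words plus a verb slot.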

Xtract stage three
Applied to pairs of words
Requires (partial) parsing
Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)

Evaluation
Ask lexicographer to evaluate output
40% precision after stages one and two
80% precision after stage three
94% conditional recall

Terminology
Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
Terms refer to concepts
Terms are key to populating a domain ontology
Terms are typically nominal compounds of certain structure, e.g., NN, N of N

Defining terms
Unique reference
Unique translation
Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination

Algorithm
Apply syntactic constraints to match pairs of words in a candidate term
Filter by application of an association measure
Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
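
The three measures named above can all be computed from a 2×2 contingency table for a word pair. The sketch below uses the standard textbook definitions. The cell names a, b, c, d and the function name are assumptions, and it expects every row and column total to be nonzero.

```python
import math

def association_scores(a, b, c, d):
    """Score a word pair (w1, w2) from its 2x2 contingency table.

    a: w1 with w2        b: w1 without w2
    c: w2 without w1     d: neither word
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Pointwise mutual information, in bits.
    pmi = math.log2(a * n / (row1 * col1)) if a > 0 else float("-inf")
    # Phi squared; the chi-square statistic for the table is n * phi2.
    phi2 = (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Log-likelihood ratio: 2 * sum of observed * ln(observed / expected).
    cells = (
        (a, row1 * col1 / n),
        (b, row1 * col2 / n),
        (c, row2 * col1 / n),
        (d, row2 * col2 / n),
    )
    llr = 2 * sum(obs * math.log(obs / exp) for obs, exp in cells if obs > 0)
    return {"pmi": pmi, "phi2": phi2, "llr": llr}
```

For a table with no association, such as `association_scores(1, 1, 1, 1)`, all three scores are zero.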

Observations
Compare with reference list
Frequency a strong predictor
Log-likelihood ratio works best
Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)

Justeson and Katz
Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995

Analysis
Examined association measures
Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?

Observations
Frequency works well
But a stronger predictor is P(k>1) compared to P(k≥1) in the same document
Use syntactic patterns to propose terms, then check if they reappear in the same document
Require this across multiple documents
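
The propose-then-check-repetition idea can be sketched as follows. The sketch assumes each document is a list of (word, tag) pairs with coarse tags A (adjective), N (noun), and P (preposition), and proposes candidates with a regular expression over the tag string. The exact pattern, the function names, and both thresholds are assumptions for illustration. Only the longest non-overlapping matches are counted, which is a simplification.

```python
import re
from collections import Counter, defaultdict

# Assumed noun-phrase pattern over one-letter tags: a run of adjectives
# and nouns ending in a noun, optionally containing one "N P" link.
TERM_PATTERN = re.compile(r"(?:[AN]*NP[AN]*|[AN]+)N")

def candidates(tagged_doc):
    """Count candidate multi-word terms in one tagged document."""
    tags = "".join(t if t in ("A", "N", "P") else "x" for _, t in tagged_doc)
    counts = Counter()
    for m in TERM_PATTERN.finditer(tags):
        words = tagged_doc[m.start():m.end()]
        counts[" ".join(w.lower() for w, _ in words)] += 1
    return counts

def repeated_terms(tagged_docs, min_repeats=2, min_docs=2):
    """Keep candidates repeated within a document, in enough documents."""
    doc_support = defaultdict(int)
    for doc in tagged_docs:
        for phrase, k in candidates(doc).items():
            if k >= min_repeats:
                doc_support[phrase] += 1
    return sorted(p for p, n in doc_support.items() if n >= min_docs)
```

A phrase such as "central processing unit" is kept only if it occurs at least twice in a document, and does so in at least two documents.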

Term Expansion
Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL
Need to expand a given list of terms, especially for scientific domains

Term variation
Syntactic (same words, different structure)
Morphosyntactic (derivational forms of words)
Semantic (synonyms are used)
In IR, normalization through stemming and removal of stop words

Approach
Process corpus matching new candidate terms to old ones via unification
Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words

Results
Manual inspection of several thousand proposed terms
Precision of 89%
Effectiveness in indexing increases by a factor of three when using the variants (P/R from 99.7/72 to 97/93)