Published by Alicia Hopkins. Modified over 9 years ago.
1
Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas
2
Collocations
Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
Technical and non-technical
3
Examples of collocations
The Dow Jones average of industrials
The Dow average
The Dow industrials
*The Jones industrials
The Dow Jones industrial
*The industrial Dow
*The Dow industrial
4
Collocation properties
Arbitrary (dialect dependent)
  – ride a bike, set the table
Domain dependent
  – dry suit, wet suit
Recurrent
Cohesive
  – Part of a collocation primes for the rest
5
Applications
Lexicography
Grammatical restrictions (compare with/to but associate with)
Generation
Translation
6
Types of collocations
Predicative relations
  – make a decision, hostile takeover
  – flexible (syntactic variability, intervening words)
Rigid word groups
  – over the counter market
Phrases with open slots
  – fluency in a domain
7
Issues in finding collocations
Possibly more than two words
  – Need measure that extends beyond the binary case
Possibly intervening words
Possibly morphological and syntactic variation
Semantic constraints (cf. doctors-dentists and doctors-hospitals)
8
Xtract stage one
For a given word, find all collocates at positions -5 to +5
Three criteria:
  – strength (normalized frequency); the threshold rejects 95% of candidates, versus the 68% expected under a normal distribution
  – position histogram must not be flat
  – select peak from histogram
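The three stage-one criteria can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Smadja's implementation: the function name `stage_one`, the z-score form of strength, the variance form of the flatness test, and the threshold values `k0`, `u0`, and `k1` are all choices made here for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return {collocate: [peak offsets]} for one target word.

    sentences: list of token lists. k0, u0, k1 are illustrative thresholds
    (assumptions for this sketch, not values stated on the slides).
    """
    # hist[collocate][offset] = how often the collocate occurs at that offset
    hist = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][off] += 1
    if not hist:
        return {}

    offsets = [o for o in range(-window, window + 1) if o != 0]
    freqs = {w: sum(h.values()) for w, h in hist.items()}
    values = list(freqs.values())
    f_mean, f_sd = mean(values), pstdev(values)

    result = {}
    for w, h in hist.items():
        # Criterion 1: strength, the collocate frequency as a z-score
        strength = (freqs[w] - f_mean) / f_sd if f_sd > 0 else 0.0
        # Criterion 2: spread, the variance of the position histogram
        counts = [h.get(o, 0) for o in offsets]
        p_mean = mean(counts)
        spread = sum((c - p_mean) ** 2 for c in counts) / len(counts)
        if strength < k0 or spread < u0:
            continue
        # Criterion 3: keep only offsets that stand out as peaks
        cutoff = p_mean + k1 * spread ** 0.5
        peaks = [o for o, c in zip(offsets, counts) if c > cutoff]
        if peaks:
            result[w] = peaks
    return result
```

A flat histogram has a spread near zero, so a collocate that appears equally often at every offset is dropped even when it is frequent.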
9
Xtract stage two
Start from word pairs
Look at each position in between, to the left, and to the right
Keep words that appear very often
If that fails, keep parts of speech that satisfy this criterion
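The word-then-part-of-speech fallback can be sketched as below. The sketch assumes the contexts around a word pair have already been collected and aligned position by position; the function name `stage_two`, the `(word, tag)` input format, the `"*"` wildcard, and the 0.75 threshold are illustrative assumptions.

```python
from collections import Counter


def stage_two(contexts, threshold=0.75):
    """Generalize aligned contexts into one pattern.

    contexts: list of equal-length lists of (word, pos_tag) tuples, one list
    per occurrence of the word pair, aligned by relative position.
    threshold: hypothetical cutoff for "appears very often".
    """
    n = len(contexts)
    pattern = []
    for slot in zip(*contexts):  # one slot per aligned position
        word, word_count = Counter(w for w, _ in slot).most_common(1)[0]
        tag, tag_count = Counter(t for _, t in slot).most_common(1)[0]
        if word_count / n >= threshold:
            pattern.append(word)   # a single word dominates this position
        elif tag_count / n >= threshold:
            pattern.append(tag)    # fall back to the dominant part of speech
        else:
            pattern.append("*")    # nothing consistent at this position
    return pattern
```

With four contexts of the form "the Dow Jones average rose / fell / gained / of", the first four positions keep their words and the last one generalizes to a verb tag.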
10
Xtract stage three
Applied to pairs of words
Requires (partial) parsing
Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
11
Evaluation
Ask lexicographer to evaluate output
40% precision after stages one and two
80% precision after stage three
94% conditional recall
12
Terminology
Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
Terms refer to concepts
Terms key for populating a domain ontology
Terms are typically nominal compounds of certain structure, e.g., NN, N of N
13
Defining terms
Unique reference
Unique translation
Term extension by
  – modification (e.g., addition of an adjective)
  – substitution
  – extension of structure
  – coordination
14
Algorithm
Apply syntactic constraints to match pairs of words in a candidate term
Filter by application of an association measure
Measures examined: pointwise mutual information, Φ² (chi-square), log-likelihood ratio
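The three measures named above can all be computed from a 2×2 contingency table of pair counts. The sketch below uses the standard textbook formulas; the function name `association_scores` and the cell names `a`, `b`, `c`, `d` are assumptions made for this example, and all four marginal totals are assumed to be nonzero.

```python
import math


def association_scores(a, b, c, d):
    """Score a word pair (w1, w2) from its 2x2 contingency table.

    a: w1 with w2        b: w1 without w2
    c: w2 without w1     d: neither word
    Assumes every row and column total is nonzero.
    """
    n = a + b + c + d
    r1, r2 = a + b, c + d   # row totals
    c1, c2 = a + c, b + d   # column totals

    # Pointwise mutual information: log of observed over expected joint count
    pmi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")

    # Phi-squared: the chi-square statistic divided by n
    phi2 = (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)

    # Log-likelihood ratio: 2 * sum over cells of O * ln(O / E)
    llr = 0.0
    for obs, row, col in ((a, r1, c1), (b, r1, c2), (c, r2, c1), (d, r2, c2)):
        if obs > 0:  # an empty cell contributes nothing to the sum
            llr += obs * math.log(obs * n / (row * col))

    return {"pmi": pmi, "phi2": phi2, "llr": 2 * llr}
```

Under independence (for example, all four cells equal) the three scores are zero. They diverge on rare pairs, where pointwise mutual information tends to overrate low-frequency candidates.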
15
Observations
Compare with reference list
Frequency a strong predictor
Log-likelihood ratio works best
Additional criteria:
  – diversity of the distribution of each word
  – distance between the two words (determines flexibility but not term status)
16
Justeson and Katz
Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995.
17
Analysis
Examined association measures
Well-known problems:
  – eliminating general-language constructs (e.g., collocations)
  – what to do with single word terms?
18
Observations
Frequency works well
But a stronger predictor is P(k>1) compared to P(k≥1) in the same document, where k is the number of times a candidate occurs in that document
Use syntactic patterns to propose terms, then check if they reappear in the same document
Require this across multiple documents
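The propose-then-check-repetition idea can be sketched as below. The sketch is deliberately simplified and rests on assumptions: candidates are only two-word sequences tagged adjective-noun or noun-noun, the tag names `"A"` and `"N"` are hypothetical, and the function name `repeated_terms` and both thresholds are illustrative rather than taken from Justeson and Katz.

```python
from collections import Counter


def repeated_terms(tagged_docs, min_repeats=2, min_docs=2):
    """Return candidate terms that repeat inside enough documents.

    tagged_docs: list of documents, each a list of (word, tag) tuples.
    A candidate is a two-word sequence tagged A-N or N-N (a simplified
    stand-in for a fuller syntactic pattern).
    """
    docs_with_repeat = Counter()
    for doc in tagged_docs:
        counts = Counter()
        for (w1, t1), (w2, t2) in zip(doc, doc[1:]):
            if t1 in {"A", "N"} and t2 == "N":
                counts[(w1, w2)] += 1
        # Count a document only if the candidate reappears in it (k > 1)
        for candidate, k in counts.items():
            if k >= min_repeats:
                docs_with_repeat[candidate] += 1
    # Require within-document repetition across several documents
    return sorted(c for c, n in docs_with_repeat.items() if n >= min_docs)
```

A candidate that occurs once in many documents is rejected here, while one that recurs within each of a few documents is kept, which is the contrast between P(k≥1) and P(k>1).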
19
Term Expansion
Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997.
Need to expand a given list of terms, especially for scientific domains
20
Term variation
Syntactic (same words, different structure)
Morphosyntactic (derivational forms of words)
Semantic (synonyms are used)
In IR, normalization through stemming and removal of stop words
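The IR-style normalization mentioned in the last point can be sketched as below. The stop list and the suffix stripper are toy stand-ins chosen for this example (a real system would use a full stemmer and stop list), and the names `STOP`, `crude_stem`, and `normalize` are hypothetical.

```python
# Toy stop list (an assumption for this sketch, not a standard list)
STOP = {"of", "the", "a", "an", "in", "for", "and"}


def crude_stem(word):
    """Strip one common inflectional suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(term):
    """Map a term to an order-insensitive key of stemmed content words."""
    words = term.lower().split()
    return tuple(sorted(crude_stem(w) for w in words if w not in STOP))
```

Under this key, "document retrieval" and "retrieval of documents" collapse to the same entry, which covers some syntactic variants. Derivational and semantic variants are out of reach of this kind of normalization.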
21
Approach
Process corpus matching new candidate terms to old ones via unification
Matching based on
  – inflectional morphology (transducer)
  – derivational morphology (rule-based)
  – syntactic transformations
  – additions of words
22
Results
Manual inspection of several thousand proposed terms
Precision of 89%
Effectiveness in indexing increases by a factor of three when using the variants (precision/recall from 99.7%/72% to 97%/93%)