Published by Gilbert Warren. Modified over 9 years ago.
1
Yuliya Morozova, Institute for Informatics Problems of the Russian Academy of Sciences, Moscow
2
Distributional semantics: a new area of linguistic research that infers the semantic properties of linguistic units from corpora. Theoretical foundations: the distributional methodology of Z. Harris, F. de Saussure, L. Wittgenstein. Distributional hypothesis: semantically similar words occur in similar contexts. J. R. Firth: “You shall know a word by the company it keeps.”
3
Vector space: drink coffee – occurred 1 time; drink tea – occurred 2 times.
4
Cosine measure of vector similarity
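The formula on this slide did not survive extraction. The standard definition is cos(u, v) = (u · v) / (|u| |v|), and a minimal sketch of it follows. The "eat" context, the word "bread", and the convention of returning 0.0 for an all-zero vector are assumptions made for illustration; only the "drink coffee" and "drink tea" counts come from the previous slide.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumed convention for all-zero vectors
    return dot / (norm_u * norm_v)

# Components are co-occurrence counts with the contexts (drink, eat).
coffee = [1, 0]  # "drink coffee" occurred 1 time (from the slide)
tea = [2, 0]     # "drink tea" occurred 2 times (from the slide)
bread = [0, 3]   # hypothetical word that occurs only with "eat"

sim_coffee_tea = cosine(coffee, tea)      # same direction, different length
sim_coffee_bread = cosine(coffee, bread)  # no shared context
```

Because the measure depends only on the direction of the vectors, "coffee" and "tea" come out as maximally similar even though their raw counts differ.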
5
Main application areas: lexical ambiguity resolution; information retrieval; dictionaries of semantic relations; multilingual dictionaries; semantic maps of different domains; modelling of synonymy; document topic detection; sentiment analysis.
6
The present research. Goal: to apply distributional semantics models to the extraction of translation correspondences from a parallel corpus. Vector space model + test corpus.
7
Test corpus: patent texts in French translated into Russian; texts split into sentences; alignment at the sentence level, manually verified (in the visual editor MakeBilingua); uploaded to the Sketch Engine corpus manager.
8
Preprocessing: lemmatization; frequent words removed (prepositions, conjunctions, etc.); punctuation marks removed.
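The three preprocessing steps can be sketched as follows. This is not the pipeline used in the study: the lemma table and the stop list are tiny hypothetical stand-ins for a real morphological analyser and frequency list, and the input sentence is an assumed surface form chosen so that the output matches the lemmatized French region shown in the slides.

```python
import string

# Hypothetical toy resources, for illustration only.
LEMMAS = {"présente": "présent", "concerne": "concerner"}
STOP_WORDS = {"la", "le", "les", "un", "une", "des", "de", "et", "en"}

def preprocess(sentence):
    """Return the content lemmas of a sentence."""
    # Remove (ASCII) punctuation marks before splitting into tokens.
    no_punct = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.split()
    # Lemmatize via table lookup; unknown tokens are kept unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # Drop frequent function words.
    return [lemma for lemma in lemmas if lemma not in STOP_WORDS]

region = preprocess(
    "La présente invention concerne un liant minéral, notamment hydraulique."
)
```

Punctuation is stripped first here simply because it makes whitespace tokenization work; the slide does not state the order in which the steps were applied.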
9
Vector space model. Type of linguistic units: single words; type of context: aligned regions; frequency measure: Boolean frequency (equal to either 1 or 0); method used to compute the similarity between vectors: cosine measure.
10
Example (aligned region as a context). Aligned region #1. French: présent invention concerner liant minéral notamment hydraulique. Russian: настоящий изобретение касаться неорганический связующий частность гидравлический связующий.
11
Example (vector space)
Aligned region   #1   #2   #3
présent          1    …    …
invention        1    …    …
concerner        1    …    …
настоящий        1    …    …
изобретение      1    …    …
касаться         1    …    …
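A minimal sketch of how such a table can be built and queried, assuming one vector component per aligned region and a value of 1 when the lemma occurs in that region. Region 1 is the example from the slides; regions 2 and 3 are hypothetical (the slide elides them) and exist only so that the vectors differ. The function names are illustrative, not the author's.

```python
import math

# (French lemmas, Russian lemmas) per aligned region.
REGIONS = [
    ("présent invention concerner liant minéral notamment hydraulique",
     "настоящий изобретение касаться неорганический связующий частность "
     "гидравлический связующий"),
    ("invention concerner procédé", "изобретение касаться способ"),  # hypothetical
    ("liant hydraulique", "гидравлический связующий"),               # hypothetical
]

def boolean_vectors(regions, side):
    """Map each lemma to a 0/1 vector with one component per aligned region."""
    vectors = {}
    for i, pair in enumerate(regions):
        for lemma in set(pair[side].split()):
            vectors.setdefault(lemma, [0] * len(regions))[i] = 1
    return vectors

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_match(lemma, src, tgt):
    """Target lemma whose vector is most similar to the source lemma's vector."""
    return max(tgt, key=lambda t: cosine(src[lemma], tgt[t]))

fr = boolean_vectors(REGIONS, 0)
ru = boolean_vectors(REGIONS, 1)
match = best_match("procédé", fr, ru)  # "способ": both occur only in region 2
```

With only three regions many lemmas share identical vectors (for example "invention" matches both "изобретение" and "касаться" equally well), which is why a large aligned corpus is needed before the ranking becomes informative.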
12
Results: a list of translation correspondences. Linguistic filter: the same part of speech. Precision: 78%.
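One possible reading of the linguistic filter is that a candidate pair is kept only when both lemmas carry the same part-of-speech tag. The sketch below assumes that reading; the tag set, the tag dictionary, and the candidate list are hypothetical, and in practice the tags would come from a tagger for each language.

```python
# Hypothetical POS tags for a few lemmas.
POS = {
    "invention": "NOUN", "изобретение": "NOUN",
    "concerner": "VERB", "касаться": "VERB",
    "traiter": "VERB", "обработка": "NOUN",
}

def same_pos(pairs, pos):
    """Keep only the pairs whose two lemmas have the same known POS tag."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if pos.get(src) is not None and pos.get(src) == pos.get(tgt)
    ]

candidates = [
    ("invention", "изобретение"),
    ("concerner", "касаться"),
    ("traiter", "обработка"),  # verb vs. noun: rejected by the filter
]
kept = same_pos(candidates, POS)
```

A filter of this kind discards pairs such as traiter → обработка, which are valid translations realized with a different part of speech.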
13
Correspondences with different POS: syntactic transformations. Verbal infinitive (French) → noun (Russian): traiter (“to process”) → обработка (“processing”). Noun (French) → adjective (Russian): crochet (“hook”) → крюкообразный (“hook-shaped”). Verbal infinitive (French) → adjective (Russian): connaître (“to know”) → известный (“well-known”).
14
Correspondences with different POS: parts of multi-word expressions. au moins (“at least”) → по меньшей мере (“at least”). The output of the program: moins → мера.
15
Evaluation. Eduardo Cendejas, Grettel Barceló, Alexander Gelbukh, Grigori Sidorov. Incorporating Linguistic Information to Statistical Word-Level Alignment. In: Proceedings of the 14th Iberoamerican Conference on Pattern Recognition, CIARP 2009, Guadalajara, Jalisco, Mexico, November 15-18, 2009. Vector space model + similarity measures: PMI, t-score, log-likelihood ratio, and Dice coefficient. Precision: 53%.
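For orientation, two of the association measures named above can be computed from region counts as sketched below, using their textbook definitions. The slide does not say how the cited work computed them, so the formulation and all counts here are assumptions for illustration.

```python
import math

def dice(n_xy, n_x, n_y):
    """Dice coefficient: 2 * joint count / (sum of the individual counts)."""
    return 2 * n_xy / (n_x + n_y)

def pmi(n_xy, n_x, n_y, n):
    """Pointwise mutual information in bits; requires n_xy > 0."""
    return math.log2((n_xy * n) / (n_x * n_y))

# Hypothetical counts over n aligned regions:
# n_x regions contain the source word, n_y the target word, n_xy both.
n, n_x, n_y, n_xy = 100, 10, 10, 8

dice_score = dice(n_xy, n_x, n_y)   # 2 * 8 / 20 = 0.8
pmi_score = pmi(n_xy, n_x, n_y, n)  # log2(800 / 100) = 3.0
```

Unlike the cosine over Boolean region vectors, both measures work directly on counts, and PMI is undefined when the two words never co-occur.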
16
Conclusion. The distributional semantics methodology can be used to extract translation correspondences from a parallel corpus with a high level of precision (78% in this experiment). It can also be used to study productive syntactic transformations that occur in translation. The present vector space model needs to be enhanced to take multi-word expressions into account.
17
Thank you!