A CORPUS-BASED STUDY OF REFERENTIAL CHOICE: Multiplicity of factors and machine learning techniques
Andrej A. Kibrik, Grigorij B. Dobrov, Mariya V. Khudyakova, Natalia V. Loukachevitch, and Aleksandr Pechenyj
aakibrik@gmail.com
Referential choice in discourse
When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including:
- Full noun phrase (NP)
  - Proper name (e.g. Pushkin)
  - Common noun (with or without modifiers) = definite description (e.g. the poet)
- Reduced NP, particularly a third person pronoun (e.g. he)
Example
Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because of a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done…
(In the original slide, full NPs and pronouns in the passage are labeled as antecedents and anaphors.)
How is this choice made? Why does a speaker/writer use a certain referential option in the given context?
Why is this important?
- Reference is among the most basic cognitive operations performed by language users
- Reference constitutes the lion's share of all information in natural communication
- Consider text manipulation according to the method of Biber et al. 1999: 230-232
Referential expressions marked in green
Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because of a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done…
(The green marking did not survive text extraction.)
Referential expressions removed
(This slide shows the same passage with all referential expressions deleted; the manipulation did not survive text extraction, so the body merely duplicates the passage above.)
Referential expressions kept
(This slide shows only the referential expressions of the same passage; the manipulation did not survive text extraction.)
Plan of talk
I. Referential choice as a multi-factorial process
II. The RefRhet corpus
III. Machine learning-based approach
IV. The probabilistic character of referential choice
I. MULTI-FACTORIAL CHARACTER OF REFERENTIAL CHOICE
Multiple factors of referential choice:
- Properties of the discourse context:
  - Distance to antecedent, along the linear discourse structure (Givón)
  - Distance to antecedent, along the hierarchical discourse structure (Fox)
  - Antecedent role (Centering theory)
- Properties of the referent:
  - Referent animacy (Dahl)
  - Protagonisthood (Grimes)
  - …
What shall we do with that?
- Many authors have emphasized one of these factors in particular studies
- But no single factor can explain everything: sometimes factor A is more relevant, sometimes factor B, etc.
- One must recognize the inherently multi-factorial character of referential choice: the factors must somehow be integrated
- Previous attempts at such integration:
  - The calculative model (Kibrik 1996, 1999)
  - The neural networks study (Gruening and Kibrik 2005)
The calculative approach
- Each value of each factor is assigned a numerical value
- For each referent occurrence, all factor values are easily identifiable, so all the corresponding numerical values are readily available
- At every point in discourse, the contributions of all factors are summed, giving rise to an integral characterization: the referent's activation score
- The activation score can be understood:
  - In a more cognitive way, as the referent's status with respect to the speaker's working memory
  - In a more superficial way, as a conventional integral characterization of the referent vis-à-vis referential choice
- The activation score predetermines referential choice:
  - Low → full NP
  - Medium → full or reduced NP
  - High → reduced NP
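The calculative approach can be sketched in a few lines of code. The factor names, weights, and thresholds below are hypothetical illustrations, not the values published in Kibrik 1996, 1999; only the mechanism (weighted summation, then thresholding) follows the slide.

```python
# Illustrative factor weights; the real model's values differ.
FACTOR_WEIGHTS = {
    "antecedent_is_subject": 0.4,
    "animate": 0.2,
    "protagonist": 0.3,
    "linear_distance": -0.2,  # penalty per clause separating anaphor and antecedent
}

def activation_score(referent):
    """Sum the weighted contributions of all factors for one referent occurrence."""
    score = 0.0
    if referent.get("antecedent_is_subject"):
        score += FACTOR_WEIGHTS["antecedent_is_subject"]
    if referent.get("animate"):
        score += FACTOR_WEIGHTS["animate"]
    if referent.get("protagonist"):
        score += FACTOR_WEIGHTS["protagonist"]
    score += FACTOR_WEIGHTS["linear_distance"] * referent.get("linear_distance", 0)
    return score

def referential_choice(score, low=0.2, high=0.6):
    """Map activation score to a referential option:
    low -> full NP, medium -> full or reduced NP, high -> reduced NP."""
    if score < low:
        return "full NP"
    if score < high:
        return "full or reduced NP"
    return "reduced NP"
```

For example, an animate protagonist mentioned as the subject of the immediately preceding clause scores 0.4 + 0.2 + 0.3 - 0.2 = 0.7 and is predicted to be a reduced NP (pronoun) under these toy weights.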
Multi-factorial model of referential choice (Kibrik 1999)
(Diagram: relevant factors map various properties of the referent or discourse context onto the referent's activation score, which in turn determines referential choice.)
The neural networks approach
- Neural networks:
  - Machine learning algorithm
  - Automatic selection of factors' weights
  - Automatic reduction of the number of factors ("pruning")
- However:
  - Small data set
  - Single method of machine learning
  - Interaction between factors remains covert
- Hence a new study:
  - Large corpus
  - Implementation of several machine learning methods
  - Statistical model of referential choice
II. THE RefRhet CORPUS
- English business prose
- Initial material: the RST Discourse Treebank
  - Annotated for hierarchical discourse structure
  - 385 articles from the Wall Street Journal
- The added component: referential annotation
- The RefRhet corpus:
  - About 30 000 referential expressions
  - 157 texts are annotated twice
  - 193 texts are annotated once
Why this particular corpus?
Example of a hierarchical graph, with rhetorical distances
(Figure; labels: RhD = 1, LinD = 4, LinD = RhD = 2.)
Scheme of referential annotation
- The MMAX2 program (Krasavina and Chiarcos 2007)
- All markables are annotated, including:
  - Referential expressions
  - Their antecedents
- Coreference relations are annotated
- Features of referents and context that can potentially be factors of referential choice are annotated
Work on referential annotation
O. Krasavina, A. Antonova, D. Zalmanov, A. Linnik, M. Khudyakova, and students of the Department of Theoretical and Applied Linguistics, MSU
Current state of the RefRhet referential annotation
- 2/3 completed
- Further results are based on the following data:
  - 247 texts
  - 110 thousand words
  - 26 024 markables
  - 4291 reliable anaphor–antecedent pairs:
    - Proper names: 43%
    - Definite descriptions: 26%
    - Pronouns: 31%
    (full NPs, i.e. proper names plus definite descriptions: 69%)
Factors of referential choice (2010)
- Properties of the referent:
  - Animacy
  - Protagonisthood
- Properties of the antecedent:
  - Type of syntactic phrase (phrase_type)
  - Grammatical role (gramm_role)
  - Form of referential expression (np_form, def_np_form)
  - Whether it belongs to direct speech or not (dir_speech)
Factors of referential choice (2010), continued
- Properties of the anaphor:
  - First vs. nonfirst mention in discourse (referentiality)
  - Type of syntactic phrase (phrase_type)
  - Grammatical role (gramm_role)
  - Whether it belongs to direct speech or not (dir_speech)
- Distance between the anaphor and the antecedent:
  - Distance in words
  - Distance in markables
  - Linear distance in clauses
  - Hierarchical distance in elementary discourse units
Factors added in 2011
- Gender and number (agreement): masculine, feminine, neuter, plural
- Antecedent length, in words
- Number of markables from the anaphor back to the nearest full NP antecedent
- Number of the referent mention in the referential chain
- Distance in sentences
- Distance in paragraphs
III. MACHINE LEARNING: TECHNIQUES AND RESULTS
- Independent variables: all potential activation factors implemented in the corpus annotation
- Dependent variable: form of referential expression (np_form)
  - Binary prediction: full NP vs. pronoun
  - Three-way prediction: definite description vs. proper name vs. pronoun
- Accuracy maximization: ratio of correct predictions to the overall number of instances
Machine learning methods (Weka, a data mining system)
- Easily interpretable methods:
  - Logical algorithms:
    - Decision trees (C4.5)
    - Decision rules (JRip)
  - Logistic regression
- Quality control: the cross-validation method
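Cross-validation, the quality-control method named above, can be sketched as follows. This is a minimal illustration with contiguous folds; Weka's stratified folds and the study's actual feature vectors are not reproduced, and `train_fn` stands in for any of the learners listed (C4.5, JRip, logistic regression).

```python
def k_fold_accuracy(instances, labels, train_fn, k=10):
    """Split the data into k folds; train on k-1 folds, test on the held-out
    fold, and average the per-fold accuracies."""
    n = len(instances)
    fold_size = n // k
    accuracies = []
    for i in range(k):
        lo = i * fold_size
        hi = (i + 1) * fold_size if i < k - 1 else n
        test_x, test_y = instances[lo:hi], labels[lo:hi]
        train_x = instances[:lo] + instances[hi:]
        train_y = labels[:lo] + labels[hi:]
        classify = train_fn(train_x, train_y)  # returns a classifier function
        correct = sum(1 for x, y in zip(test_x, test_y) if classify(x) == y)
        accuracies.append(correct / len(test_x))
    return sum(accuracies) / k
```

Because every instance is tested exactly once, on a classifier that never saw it during training, the averaged accuracy is a less optimistic estimate than training-set accuracy.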
Examples of decision rules generated by the JRip algorithm
- (Antecedent's grammatical role = subject) & (Hierarchical distance ≤ 1.5) & (Distance in words ≤ 7) => pronoun
- (Animate) & (Distance in markables ≥ 2) & (Distance in words ≤ 11) => pronoun
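Read top to bottom with a default class, such a rule list is directly executable. Below is a transcription of the two rules above into a Python predicate; the feature key names are illustrative (the corpus uses names like gramm_role), and the default class "full NP" is assumed, since JRip rule lists end in a default for uncovered instances.

```python
def predict(anaphor):
    """Apply the two JRip rules in order; fall through to the default class."""
    if (anaphor["antecedent_role"] == "subject"
            and anaphor["hierarchical_distance"] <= 1.5
            and anaphor["distance_in_words"] <= 7):
        return "pronoun"
    if (anaphor["animate"]
            and anaphor["distance_in_markables"] >= 2
            and anaphor["distance_in_words"] <= 11):
        return "pronoun"
    return "full NP"  # assumed default class for uncovered instances
```

This readability is what the slide means by "easily interpretable methods": each rule is a conjunction of factor tests that a linguist can inspect.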
2010 results with single machine-learning algorithms (accuracy)
- Binary prediction:
  - logistic regression: 85.6%
  - logical algorithms: up to 84.5%
- Three-way prediction:
  - logistic regression: 76%
  - logical algorithms: up to 74.3%
Composition of classifiers: boosting
- Base algorithm: C4.5 decision trees
- Iterative process:
  - Each additional classifier concentrates on the objects that were not properly classified by the already constructed composition
  - At each iteration the weight of each wrongly classified object increases, so that the new classifier focuses on such objects
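The reweighting loop can be sketched as a compact AdaBoost, the classic boosting scheme. This is a toy: the weak learners here are simple threshold tests supplied by the caller, not the C4.5 trees the study actually boosts, and labels are coded as -1/+1 (e.g. full NP vs. pronoun).

```python
import math

def adaboost(xs, ys, weak_learners, rounds=10):
    """xs: instances, ys: labels in {-1, +1}, weak_learners: candidate
    classifiers f(x) -> -1/+1. Returns the weighted-vote ensemble."""
    n = len(xs)
    w = [1.0 / n] * n          # uniform initial object weights
    ensemble = []              # (alpha, learner) pairs
    for _ in range(rounds):
        # pick the learner with the lowest weighted error under current weights
        best, best_err = None, float("inf")
        for h in weak_learners:
            err = sum(wi for wi, x, y in zip(w, xs, ys) if h(x) != y)
            if err < best_err:
                best, best_err = h, err
        if best_err == 0 or best_err >= 0.5:
            if best_err == 0:
                ensemble.append((1.0, best))
            break
        alpha = 0.5 * math.log((1 - best_err) / best_err)
        ensemble.append((alpha, best))
        # increase weights of misclassified objects, decrease the rest
        w = [wi * math.exp(-alpha * y * best(x)) for wi, x, y in zip(w, xs, ys)]
        total = sum(w)
        w = [wi / total for wi in w]

    def classify(x):
        s = sum(alpha * h(x) for alpha, h in ensemble)
        return 1 if s >= 0 else -1
    return classify
```

The key line is the weight update: objects the current composition gets wrong have their weights multiplied up, so the next selected classifier is judged mainly on exactly those objects.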
Composition of classifiers: bagging
- Base algorithm: C4.5 decision trees
- Bagging randomly selects a subset of the training samples to train the base algorithm
- The result is a set of algorithms built on different, potentially intersecting, training subsamples
- A classification decision is made through a voting procedure in which all the constructed classifiers take part
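Bagging is even simpler to sketch: bootstrap subsamples, one base classifier each, majority vote at prediction time. As in the boosting sketch, `train_fn` is a stand-in for the study's C4.5 learner, not a reimplementation of it.

```python
import random
from collections import Counter

def bagging(train_x, train_y, train_fn, n_classifiers=25, seed=0):
    """Train n_classifiers base classifiers on bootstrap subsamples and
    return a majority-vote classifier over all of them."""
    rng = random.Random(seed)
    n = len(train_x)
    classifiers = []
    for _ in range(n_classifiers):
        # sample n indices with replacement -> potentially intersecting subsamples
        idx = [rng.randrange(n) for _ in range(n)]
        classifiers.append(train_fn([train_x[i] for i in idx],
                                    [train_y[i] for i in idx]))

    def classify(x):
        votes = Counter(clf(x) for clf in classifiers)
        return votes.most_common(1)[0][0]
    return classify
```

The contrast with boosting is that the subsamples are drawn independently and the vote is unweighted; variance is reduced by averaging rather than by targeting hard objects.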
Binary prediction: full noun phrase vs. pronoun

Algorithm              Accuracy 2010   Accuracy 2011
Logistic regression    85.6%           87.0%
Decision trees (C4.5)  84.3%           86.3%
Decision rules (JRip)  84.5%           86.2%
Boosting               n/a             89.9%
Bagging                n/a             87.6%
Three-way prediction: description vs. proper name vs. pronoun

Algorithm              Accuracy 2010   Accuracy 2011
Logistic regression    76.0%           77.4%
Decision trees (C4.5)  74.3%           76.7%
Decision rules (JRip)  72.5%           75.4%
Boosting               n/a             80.7%
Bagging                n/a             79.5%
Comparison of single- and multi-factor accuracy

Feature                      Binary prediction   Three-way prediction
The largest class            69%                 43%
Distance in words            76%                 55%
Hierarchical distance        74.8%               53.5%
Anaphor's grammatical role   70%                 45.2%
Anaphor in direct speech     70%                 43.8%
Animate                      71.5%               47.3%
Combination of factors       89.9%               80.7%
Significance of factors in the three-way prediction

Factors                                       Accuracy
All factors, including the newly added ones   80.7%
Without the anaphor's grammatical role        79.3%
Without the antecedent's grammatical role     80.2%
Without grammatical role                      79.2%
Without the antecedent's referential form     77.0%
Without protagonism                           80.0%
Without animacy                               80.68%
Significance of factors in the three-way task of referential choice (continued): the six distance factors

Factors                                                                            Accuracy
All factors, including the newly added ones                                        80.7%
Without all distances, except for rhetorical distance only                         73.5%
Without all distances, except for the distance in words only                       74.9%
Without all distances, except for the distances in words and paragraphs            79.0%
Without all distances, except for the distances in words and sentences             79.5%
Without all distances, except for rhetorical distance and the distances
  in words and sentences                                                           79.7%
Without all distances, except for the distances in words, markables,
  and paragraphs                                                                   80.47%
IV. REFERENTIAL CHOICE IS A PROBABILISTIC PROCESS
According to Kibrik 1999:

Potential referential expressions:
- Full NP only: 19%
- Full NP, ?pronoun: 21%
- Pronoun or full NP: 28%
- Pronoun, ?full NP: 23%
- Pronoun only: 9%

Actual referential expressions:
- Full NP: 49%
- Pronoun: 51%
Probabilistic character of referential choice in the RefRhet study
- Prediction of referential choice cannot be fully deterministic: there is a class of instances in which referential choice is random
- It is important to tune the model so that it can process such instances in a special manner; we are beginning to explore this problem
- Logistic regression generates an estimate of probability for each referential option
- This probability estimate can be interpreted as the activation score of the earlier model
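The probabilistic reading can be made concrete: a logistic-regression score for an instance is passed through the sigmoid to give P(pronoun), which is then read as the activation score. The feature names and weights below are illustrative, not the fitted coefficients of the RefRhet model.

```python
import math

def p_pronoun(features, weights, bias=0.0):
    """Sigmoid of the weighted feature sum = estimated probability of a
    pronoun (interpretable as the referent's activation score)."""
    z = bias + sum(weights[name] * value for name, value in features.items())
    return 1.0 / (1.0 + math.exp(-z))
```

Instances whose estimate lands near 0.5 are exactly the zone where referential choice looks random and where a deterministic full-NP/pronoun prediction is least trustworthy.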
Probabilistic multi-factorial model of referential choice
(Diagram: relevant factors map various properties of the referent or discourse context onto the activation score, now understood as the probability of using a certain referential expression, which determines referential choice.)
Conclusions about the RefRhet study
- Quantity: a large corpus of referential expressions
- Quality: a high level of prediction accuracy is already attained, and we keep working!
- Theoretical significance: the following fundamental properties of referential choice are addressed:
  - The multi-factorial character of referential choice
  - The contribution of individual factors, assessed automatically and statistically in a variety of ways
  - The probabilistic character of referential choice
- This approach can be applied to a wide range of linguistic and other behavioral choices
Thank you in the CML languages
cпасибо (Russian), благодаря (Bulgarian), хвала (Serbian), mulţumesc (Romanian), ευχαριστώ (Greek)
5th International Conference on Cognitive Science
See www.conf.cogsci.ru
Abstract submission: between October 1 and November 15