Presentation is loading. Please wait.

Presentation is loading. Please wait.

REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS Andrej A. Kibrik, Grigorij B. Dobrov, Natalia V. Loukachevitch, Dmitrij A. Zalmanov

Similar presentations


Presentation on theme: "REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS Andrej A. Kibrik, Grigorij B. Dobrov, Natalia V. Loukachevitch, Dmitrij A. Zalmanov"— Presentation transcript:

1 REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS Andrej A. Kibrik, Grigorij B. Dobrov, Natalia V. Loukachevitch, Dmitrij A. Zalmanov aakibrik@gmail.com

2 2 22 Referential choice in discourse  When a speaker needs to mention (or refer to) a specific, definite referent, s/he chooses between several options, including:  Full noun phrase (NP) Proper name (e.g. Pushkin) Common noun (with modifiers) = definite description (e.g. the poet)  Reduced NP, particularly a third person pronoun (e.g. he)

3 3 Example  Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done  How is this choice made? Full NP Pronoun antecedentcoreference anaphors

4 4 Why is this important?  Reference is among the most basic cognitive operations performed by language users  It is the linguistic representation of what is known as attention and working memory in psychology  Reference constitutes a lion’s share of all information in natural communication  Consider text manipulation according to the method of Biber et al. 1999: 230-232

5 5 Referential expressions marked in green  Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done

6 6 Referential expressions removed  Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done

7 7 Referential expressions kept  Tandy said consumer electronics sales at its Radio Shack stores have been slow, partly because a lack of hot, new products. Radio Shack continues to be lackluster, said Dennis Telzrow, analyst with Eppler, Guerin Turner in Dallas. He said Tandy has done

8 8 88 Plan of talk  I. Referential choice as a multi-factorial process  II. The RefRhet corpus and the machine learning-based approach  III. The probabilistic character of referential choice

9 9 99 Multi-factorial character of referential choice  Many factors of referential choice  Distance to antecedent Along the linear discourse structure Along the hierarchical discourse structure  Antecedent role  Referent animacy  Protagonisthood.........................................  None of these factors alone can explain referential choice

10 10 Factors integration  At every poing in discourse factors are somehow summed and give rise to an integral characterization – the referent’s activation score  Activation score is the referent’s status with respect to the speaker’s working memory  Activation score predetermines referential choice  Low  full NP  Medium  full or reduced NP  High  reduced NP

11 11 Multi-factorial model of referential choice (Kibrik 1999) Various properties of the referent or discourse context Referent’s activation score Referential choice Activation factors

12 12 Modeling multi-factorial processes: machine learning-based methods  Neural networks approach (Gruening and Kibrik 2005)  Machine learning algorithm Automatic selection of factors’ weights Automatic reduction of the number of factors («pruning»)  However: Small data set Single method of machine learning Low interpretability of results  Hence a new study  Large corpus  Implementation of several machine learning methods  Statistical model of referential choice

13 13 The RefRhet corpus  English  Business prose  Initial material – the RST Discourse Treebank  Annotated for hierarchical discourse structure  385 articles from Wall Street Journal  The added component – referential annotation  The RefRhet corpus  Over 30 000 referential expressions

14 14 Example of a hierarchical graph

15 15 Scheme of referential annotation  The ММАХ2 program  Krasavina and Chiarcos 2007  All markables are annotated, including:  Referential expressions  Their antecedents  Coreference relations are annotated  Features of referents and context are annotated that can potentially be factors of referential choice

16 16

17 17 Work on referential annotation  O. Krasavina  A. Antonova  D. Zalmanov  A. Linnik  M. Khudyakova  Students of the Department of Theoretical and Applied Linguistics, MSU

18 18 Current state of the RefRhet referential annotation  2/3 completed  Further results are based on the following data:  247 texts  110 thousand words  26 024 markables 7097 proper names 8560 definite descriptions 1797 third person pronouns  3756 reliable pairs «anaphor – antecedent» Proper names — 1623 (43%) Definite descriptions — 971 (26%) Pronouns — 1162 (31%)

19 19 Factors of referential choice  Properties of the referent:  Animacy  Protagonisthood  Properties of the antecedent:  Type of syntactic phrase (phrase_type)  Grammatical role (gramm_role)  Form of referential expression (np_form, def_np_form)  Whether it belongs to direct speech or not (dir_speech)

20 20 Factors of referential choice  Properties of the anaphor:  First vs. nonfirst mention in discourse (referentiality)  Type of syntactic phrase (phrase_type)  Grammatical role (gramm_role)  Whether it belongs to direct speech or not (dir_speech)  Distance between the anaphor and the antecedent:  Distance in words  Distance in markables  Linear distance in clauses  Hierarchical distance in elementary discourse units

21 21 Goals for the machine learning-base study  Dependent variable:  Form of referential expression (np_form)  Binary prediction:  Full NP vs. pronouns  Three-way prediction:  Definite description vs. proper name vs. pronoun  Accuracy maximization:  Ratio of correct predictions to the overall number of instances 21

22 22 Machine learning methods (Weka, a data mining system)  Easily interpretable methods:  Logical algorithms Decision trees (C4.5) Decision rules (JRip)  Higher quality:  Logistic regression  Quality control – the cross-validation method

23 23 Examples of decision rules generated by the JRip algorithm  (Antecedent’s grammatical role = subject) & (Hierarchical distance ≤ 1.5) & (Distance in words ≤ 7) => pronoun  (Animate) & (Distance in markables ≥ 2) & (Distance in words ≤ 11) => pronoun 23

24 24 Main results  Accuracy  Binary prediction:  logistic regression – 86.1%  logical algorithms – 85%  Three-way prediction:  logistic regression – 74%  logical algorithms – 72% 24

25 25 Comparison of single- and multi-factor accuracy FeatureThree-way prediction Binary prediction The largest class43%69% Distance in words55%76% Hierarchical distance53.5%74.8% Anaphor’s grammatical role 45.2%70% Anaphor in direct speech 43.8%70% Animate47.3%71.5% Combination of factors 74%86.1% 25

26 26 Referential choice is a probabilistic process  According to Kibrik 1999 Potential referential expressions Actual referential expressions Full NP only (19%) Full NP (49%) Full NP, ?pronoun (21 %) Pronoun or full NP (28%) Pronoun (51%) Pronoun, ?full NP (23%) Pronoun only (9%)

27 27 Probabilistic character of referential choice in the RefRhet study  Prediction of referential choice cannot be fully deterministic  There is a class of instances in which referential choice is random  It is important to tune up the model so that it could process such instances in a special manner  We are working on this problem  Logistic regression generates estimates of probability for each referential option  This estimate of probability can be interpreted as the activation score from the cognitive model

28 28 Probabilistic multi-factorial model of referential choice Activation score = probability of using a certain referential expression Referential choice Activation factors Various properties of the referent or discourse context

29 29 Conclusions about the RefRhet study  Quantity: Large corpus of referential expressions  Quality: A high level of accurate prediction is already attained  And this is not the limit  Theoretical significance: the following fundamental properties of referential choice are addressed:  Multi-factorial character of referential choice  Probabilistic character of referential choice  This approach can be applied to a wide range of linguistic and other behavioral choices


Download ppt "REFERENTIAL CHOICE AS A PROBABILISTIC MULTI-FACTORIAL PROCESS Andrej A. Kibrik, Grigorij B. Dobrov, Natalia V. Loukachevitch, Dmitrij A. Zalmanov"

Similar presentations


Ads by Google