TSD, Brno, Institute of Formal and Applied Linguistics, 1 Czech Verbs of Communication and the Extraction of their Frames Václava Benešová and Ondřej Bojar
Introduction
1. VALLEX, Valency Lexicon of Czech Verbs
2. Automatic Identification of Verbs of Communication
3. Frame Suggestion
4. Conclusion
1. Valency Lexicon of Czech Verbs, VALLEX 1.x, and its Verb Classes
● Verb Classes in VALLEX
● Verbs of Communication
VALLEX
● Theoretical background: Functional Generative Description (FGD)
● Valency: “ability of lexical units to bind other lexical units”
● Versions: 1.0, internal 1.5, 2.0 (autumn 2006) (almost 4300 entries)
● Corpus coverage (Czech National Corpus):
  ● about 10% of verb occurrences, those with low corpus frequency, not covered (cca lemmas)
Verb Entry in VALLEX
● Verb entry: a set of one or more valency frames
● Valency frame: a sequence of slots (functor, morphemic realization, type of complement)
● Attributes of valency frames: gloss, example, …, class
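The entry structure described on this slide could be modelled roughly as follows. This is only an illustrative sketch: the class names, the field names, the "obl" label for the type of complement, and the example entry for říci are assumptions made here, not the actual VALLEX data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Slot:
    functor: str            # e.g. "ACT", "ADDR", "PAT"
    forms: List[str]        # morphemic realizations, e.g. ["nom"] or ["že"]
    complement_type: str    # type of complement; the label "obl" is an assumed placeholder


@dataclass
class ValencyFrame:
    slots: List[Slot]
    gloss: str = ""
    example: str = ""
    verb_class: Optional[str] = None   # e.g. "communication"; not every frame has a class


@dataclass
class VerbEntry:
    lemma: str
    frames: List[ValencyFrame] = field(default_factory=list)


# Hypothetical entry, for illustration only (not copied from VALLEX).
rici = VerbEntry(
    lemma="říci",
    frames=[
        ValencyFrame(
            slots=[
                Slot("ACT", ["nom"], "obl"),
                Slot("ADDR", ["dat"], "obl"),
                Slot("PAT", ["že"], "obl"),
            ],
            gloss="say",
            verb_class="communication",
        )
    ],
)
```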
Verb Classes in VALLEX
● Classification: in progress, built from below, emphasis on syntactic criteria
● Classes: communication, mental action, perception, psych verb, exchange, change, phase verbs, phase of action, modal verbs, motion, transport, location, …
● Statistics compared for VALLEX 1.0 and VALLEX 1.5: Total Verb Entries; Total Verb Lemmas; Total Valency Frames; Valency Frames with Class ([37.5%] in VALLEX 1.0, [44.6%] in VALLEX 1.5); Total Classes; Frame Types in Class on Average
Communication Verbs in VALLEX
‘a speaker conveys information to a recipient’
ACT {nom}   ADDR {gen/dat/acc}   PAT/EFF {dc,...}
● simple information: {říci: say, informovat: inform, …} + THAT: že → verbs of announcement
● question: {ptát se: ask, …} + WHETHER, IF: zda, jestli → interrogative verbs
● commands, bans, warnings, …: {nakázat: order, zakázat: prohibit, …} + IN ORDER TO, LET: aby, ať → imperative verbs
Counts compared for VALLEX 1.0 and VALLEX 1.5: verbs of announcement: že; interrogative verbs: zda; imperative verbs: aby 74105
2. Automatic Identification of Verbs of Communication
● Evaluation: VALLEX vs. FrameNet
Automatic Identification of Verbs of Communication
● Search the corpus for the pattern V+N234+subord{aby,zda,že}: a verb, a noun in the genitive, dative or accusative (cases 2, 3, 4), and a subordinate clause introduced by aby, zda or že.
● A verb is marked as a communication verb if enough occurrences of the pattern are found.
● Weak points:
1. eliminates nominal structures: ‘He said the truth about the killer.’ ‘He gave her many presents.’ (verb of exchange)
2. ignores examples where a complement was not expressed on the surface layer: ‘He said that …’
3. homonymy of conjunctions: že (that) and aby (in order to): ‘He has done it in order to make money…’
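The pattern search above can be sketched in a few lines. This is a minimal illustration under several assumptions, not the authors' implementation: the token representation (dicts with `lemma`, `pos`, `case`), the tolerance for a comma between the noun and the conjunction, the `min_hits` threshold, and the toy sentences are all hypothetical.

```python
from collections import Counter

CONJUNCTIONS = {"aby", "zda", "že"}
OBLIQUE_CASES = {2, 3, 4}  # genitive, dative, accusative


def tok(lemma, pos, case=None):
    """Hypothetical token record: lemma, coarse part of speech, morphological case."""
    return {"lemma": lemma, "pos": pos, "case": case}


def count_pattern_hits(sentences):
    """Count, per verb lemma, the occurrences of V + N(case 2/3/4) + aby/zda/že."""
    hits = Counter()
    for sent in sentences:
        for i, token in enumerate(sent):
            if token["pos"] != "V":
                continue
            # Look at the next three tokens and skip punctuation (e.g. a comma).
            rest = [t for t in sent[i + 1:i + 4] if t["pos"] != "PUNCT"]
            if (len(rest) >= 2
                    and rest[0]["pos"] == "N"
                    and rest[0]["case"] in OBLIQUE_CASES
                    and rest[1]["lemma"] in CONJUNCTIONS):
                hits[token["lemma"]] += 1
    return hits


def communication_verbs(sentences, min_hits=2):
    """Verbs with at least min_hits pattern occurrences (the threshold is an assumption)."""
    return {v for v, n in count_pattern_hits(sentences).items() if n >= min_hits}


# Toy, hand-tagged sentences for illustration only.
sentences = [
    [tok("říci", "V"), tok("Petr", "N", 3), tok(",", "PUNCT"), tok("že", "CONJ"), tok("přijít", "V")],
    [tok("říci", "V"), tok("Marie", "N", 3), tok(",", "PUNCT"), tok("že", "CONJ"), tok("odejít", "V")],
    [tok("dát", "V"), tok("Marie", "N", 3), tok("dárek", "N", 4)],
]
```

On the toy data, `communication_verbs(sentences)` keeps only říci: dát is followed by a noun but not by one of the three conjunctions, which corresponds to the ‘He gave her many presents.’ case above.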
Evaluation against VALLEX and FrameNet
● Gold standards: VALLEX 1.0, VALLEX 1.5, FrameNet 1.2
● ROC curves
  TP … true positives (communication verbs according to a gold standard and above the given threshold)
  FP … false positives (non-communication verbs above the given threshold)
  TPR = TP / P … true positive rate (P is the total number of communication verbs)
  FPR = FP / N … false positive rate (N is the total number of verbs with no sense of communication)
● 40–50% of communication verbs identified correctly (for both VALLEX and FrameNet), 20% falsely marked
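The ROC points follow directly from the definitions on this slide. The sketch below assumes that each verb has a single score (for example its number of pattern occurrences) and that "above the threshold" means at or above it; the function name and the toy scores are hypothetical and do not reproduce the reported results.

```python
def roc_points(scores, gold):
    """Return (threshold, TPR, FPR) triples for every distinct score.

    scores: dict mapping verb lemma -> score
    gold:   set of lemmas that are communication verbs in the gold standard
    """
    positives = sum(1 for verb in scores if verb in gold)   # P
    negatives = len(scores) - positives                     # N
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative verb")
    points = []
    for threshold in sorted(set(scores.values())):
        selected = [verb for verb, s in scores.items() if s >= threshold]
        tp = sum(1 for verb in selected if verb in gold)
        fp = len(selected) - tp
        points.append((threshold, tp / positives, fp / negatives))
    return points


# Toy data for illustration only.
toy_scores = {"říci": 12, "ptát se": 7, "udělat": 5, "dát": 3, "nakázat": 2}
toy_gold = {"říci", "ptát se", "nakázat"}
toy_points = roc_points(toy_scores, toy_gold)
```

Raising the threshold moves along the curve from (TPR, FPR) = (1, 1) towards (0, 0); the operating point is a trade-off between the two rates.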
3. Frame Suggestion
● Frame Edit Distance and Verb Entry Similarity
● Experimental Results
Frame Edit Distance and Verb Entry Similarity
● FED (frame edit distance): the number of edit operations (insert, delete, replace) necessary to convert a hypothesized frame to a correct frame
● ES (entry similarity or expected saving):
  $$ES = 1 - \frac{\min FED(G,H)}{FED(G,\emptyset) + FED(H,\emptyset)}$$
  G … gold verb entries of this base lemma
  H … hypothesized entries
  Ø … blank verb entry
● ES 0% (suggesting nothing), ES 100% (gold frames)
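A small sketch of how FED and ES could be computed. It rests on simplifying assumptions that are not taken from the slides: a frame is a tuple of slot strings such as "ACT(1)", each insert, delete or replace of a whole slot costs 1, and the distance between entries is the cheapest one-to-one pairing of frames, with missing frames treated as empty. The slot notation "PAT(že)" is hypothetical.

```python
from itertools import permutations


def frame_edit_distance(hyp, gold):
    """Levenshtein distance between two frames, counted in whole slots."""
    prev = list(range(len(gold) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, g in enumerate(gold, start=1):
            cost = 0 if h == g else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesized slot
                            curr[j - 1] + 1,      # insert a gold slot
                            prev[j - 1] + cost))  # keep or replace a slot
        prev = curr
    return prev[-1]


def entry_edit_distance(hyp_entry, gold_entry):
    """Minimum total FED over all pairings of frames (brute force, small entries only)."""
    n = max(len(hyp_entry), len(gold_entry))
    hyp = list(hyp_entry) + [()] * (n - len(hyp_entry))
    gold = list(gold_entry) + [()] * (n - len(gold_entry))
    return min(
        sum(frame_edit_distance(h, g) for h, g in zip(hyp, perm))
        for perm in permutations(gold)
    )


def entry_similarity(gold_entry, hyp_entry):
    """ES = 1 - min FED(G,H) / (FED(G,Ø) + FED(H,Ø))."""
    denom = entry_edit_distance(gold_entry, []) + entry_edit_distance(hyp_entry, [])
    if denom == 0:
        return 1.0  # both entries are blank
    return 1.0 - entry_edit_distance(hyp_entry, gold_entry) / denom


# Hypothetical entries for illustration.
gold_entry = [("ACT(1)", "ADDR(3)", "PAT(že)")]
hyp_entry = [("ACT(1)", "PAT(4)")]
es = entry_similarity(gold_entry, hyp_entry)
```

Here two operations (one replace, one insert) turn the hypothesized frame into the gold one, and the blank-entry distances are 3 and 2, so ES = 1 - 2/5 = 0.6. Suggesting nothing gives ES = 0 and suggesting the gold entry gives ES = 1, matching the two extremes named on the slide.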
Experimental Results with ES
Suggested frames                                                 ES [%]
Specific frame for verbs of communication, default for others
Baseline 1: ACT(1)                                               26.69
Baseline 2: ACT(1) PAT(4)                                        37.55
Baseline 3: ACT(1) ADDR(3,4) PAT(4)
Baseline 4: Two typical frames: ACT(1) PAT(4)                    39.11
Conclusion
● Automatic identification of communication verbs according to the proposed pattern V+N234+subord{aby,zda,že} performs satisfactorily (40–50% true positives against VALLEX and FrameNet, 20% false positives)
● FED reveals that more lexicographic labour could be saved by suggesting more than one frame per verb → need to focus on other classes, too