An extension to the keyword approach in corpus analysis National Research Centre for Foreign Language Education (中国外语教育研究中心) Liang Maocheng (梁茂成)
Outline ☻ Keywords ☻ Applications of corpus comparison ☻ Limitations to the keyword approach ☻ Keywords+ ☻ Demo
Keywords ☻ Keywords: ☺ Keywords are words whose frequency is unusually high (or low) in comparison with some norm. (Scott, 2003)
Keywords ☻ Positive keywords: ☺ Words which occur more often than would be expected by chance in comparison with the reference corpus.
Keywords ☻ Negative keywords: ☺ Words which occur less often than would be expected by chance in comparison with the reference corpus.
Keywords ☻ Positive and negative keywords ☺ In a corpus of business English, words such as business, profit and companies are likely to be positive keywords if the corpus is to be compared with a general corpus.
Keywords ☻ Positive and negative keywords ☺ In a corpus of academic English, words such as morning, afternoon and evening are likely to be negative keywords if the corpus is to be compared with a general corpus.
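The phrase "expected by chance" in the definitions above can be made concrete. The sketch below assumes the usual null hypothesis that a word is equally likely in both corpora, so its expected frequency in the study corpus is proportional to that corpus's share of the combined tokens. The function names and the frequencies in the usage lines are invented for illustration. Direction alone says nothing about significance, which needs a keyness statistic.

```python
def expected_frequency(freq_study, freq_ref, size_study, size_ref):
    """Expected count in the study corpus if the word were equally
    likely in the study corpus and the reference corpus."""
    return size_study * (freq_study + freq_ref) / (size_study + size_ref)


def keyword_direction(freq_study, freq_ref, size_study, size_ref):
    """Label a word as a candidate positive or negative keyword.

    This only compares observed with expected frequency; it does not
    test whether the difference is statistically significant.
    """
    expected = expected_frequency(freq_study, freq_ref, size_study, size_ref)
    if freq_study > expected:
        return "positive"
    if freq_study < expected:
        return "negative"
    return "neutral"


# Invented counts: a 100,000-token specialised corpus against a
# 1,000,000-token general reference corpus.
print(keyword_direction(120, 150, 100_000, 1_000_000))  # over-represented
print(keyword_direction(5, 400, 100_000, 1_000_000))    # under-represented
```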
Keywords ☻ Calculating keyness (Rayson et al. 2004, Oakes 1998) ☺ Chi-square
Keywords Chi-square
Keywords Chi-square with Yates's correction
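The formulas on the two chi-square slides are not preserved in this text, so the following is only a sketch of the standard Pearson statistic, the sum of (O − E)² / E over the cells of a 2×2 table (word vs. all other words, study corpus vs. reference corpus). With Yates's correction, 0.5 is subtracted from each |O − E| before squaring. The function name and parameter names are assumptions made for this example.

```python
def chi_square(freq_study, freq_ref, size_study, size_ref, yates=False):
    """Pearson chi-square for one word in a 2x2 contingency table.

    Rows: the word / all other words.
    Columns: study corpus / reference corpus.
    """
    observed = [
        [freq_study, freq_ref],
        [size_study - freq_study, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [size_study, size_ref]
    grand_total = size_study + size_ref

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            if expected == 0:
                continue  # word absent from both corpora: nothing to add
            diff = abs(observed[i][j] - expected)
            if yates:
                # Yates's continuity correction, never pushed below zero
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    return statistic


# Invented counts: 30 hits in 100 tokens vs. 10 hits in 100 tokens.
print(chi_square(30, 10, 100, 100))
print(chi_square(30, 10, 100, 100, yates=True))
```

The corrected value is never larger than the uncorrected one, which makes the test more conservative for small counts.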
Keywords Log-likelihood References:
Keywords ☻ Previous research suggests that log-likelihood is a more reliable measure than chi-square when comparing word frequencies in corpora.
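The log-likelihood formula itself is not preserved on the slide. As a hedged illustration, the sketch below uses the two-term form common in corpus comparison: with expected values E1 and E2 derived from the corpus sizes, LL = 2 × (a·ln(a/E1) + b·ln(b/E2)), where a and b are the word's frequencies in the study and reference corpora. The function name and the example counts are assumptions, and the slide may have used a different formulation.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Two-term log-likelihood keyness for one word."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)

    ll = 0.0
    # A zero observed frequency contributes nothing (0 * ln 0 is taken as 0).
    if freq_study > 0:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Invented counts: 30 hits vs. 10 hits in two corpora of equal size.
print(round(log_likelihood(30, 10, 100_000, 100_000), 3))
```

Unlike chi-square, this statistic remains defined when the word is missing from one of the two corpora, which is common for keywords in specialised corpora.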
Keywords ☻ Ways to find keywords: ☺ Top-down: corpus-based ☺ Bottom-up: corpus-driven
Applications of… ☺ Comparison across users ☺ Comparison across genres ☺ Comparison across times ☺ Comparison across (varieties of) languages
Applications of… ☺ Compiling a specialized dictionary ☺ Detecting the topic ☺ Genre analysis ☺ Contrastive Interlanguage Analysis ☺ ……
Limitations to… ☻ Keywords: ☺ Do keywords have to be single words? Phraseology seems more interesting! ☺ Do keywords have to be lexical words? POS tag sequences may also be interesting. ☺ Can we bring together the bottom-up approach and the top-down approach?
Limitations to… ☻ Top-down: the problem is I do not yet know what may be interesting.
Limitations to… ☻ Bottom-up: the problem is that I am given a long list of keywords, only some of which are interesting, and these are buried among many others that do not seem interesting at all.
Keywords+ ☻ Support multiword sequences ☻ Support online search ☻ Support POS tag sequences ☻ Support regex search
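The features listed above could be combined roughly as follows. This is not the Keywords+ implementation, only a minimal sketch under stated assumptions: multiword sequences are treated as contiguous n-grams, keyness is the two-term log-likelihood, and a regular expression narrows the bottom-up list to a pattern chosen top-down. All function names are hypothetical. Passing lists of POS tags instead of words would give key POS tag sequences in the same way.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def log_likelihood(a, b, n1, n2):
    """Two-term log-likelihood for frequencies a, b in corpora of size n1, n2."""
    e1 = n1 * (a + b) / (n1 + n2)
    e2 = n2 * (a + b) / (n1 + n2)
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def key_sequences(study_tokens, ref_tokens, n=2, pattern=None):
    """Rank n-grams by signed keyness, optionally filtered by a regex.

    Positive scores mark sequences over-represented in the study corpus,
    negative scores mark under-represented ones.
    """
    study = Counter(ngrams(study_tokens, n))
    ref = Counter(ngrams(ref_tokens, n))
    n1, n2 = sum(study.values()), sum(ref.values())
    if n1 == 0 or n2 == 0:
        raise ValueError("both corpora must contain at least n tokens")

    regex = re.compile(pattern) if pattern else None
    rows = []
    for seq in set(study) | set(ref):
        if regex and not regex.search(seq):
            continue  # top-down filter: keep only sequences of interest
        a, b = study[seq], ref[seq]
        score = log_likelihood(a, b, n1, n2)
        if a / n1 < b / n2:
            score = -score  # negative key sequence
        rows.append((seq, a, b, score))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Toy token lists, invented for illustration.
study = "in terms of the data in terms of the model".split()
reference = "the cat sat on the mat in the house".split()
for row in key_sequences(study, reference, n=3, pattern=r"^in terms"):
    print(row)
```

The regex is applied to the space-joined sequence, so a pattern such as `^in \w+ of$` would match a three-word frame with an open middle slot.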
Demo ☻ demo
Thank you.