Published by Arthur Reeves. Modified over 9 years ago.
1
Reading news for information: How much vocabulary a CFL learner should know
笪骏 (Jun Da)
Middle Tennessee State University
jda@mtsu.edu
2
Outline
1. Reading and vocabulary size
2. Estimates on Chinese vocabulary
3. My research
4. Remaining issues
3
1. Reading and vocabulary size
Past research indicates that vocabulary knowledge is the single most important factor contributing to reading comprehension (see, for example, Laflamme 1997).
4
1.1 Reading and vocabulary size: Native speakers
Number of words that a native speaker knows
– Educated native English speakers know about 16,000 to 20,000 word families (Goulden, Nation and Read, 1990; Zechmeister, Chronis, Cull, D’Anna and Healy, 1995), where a word family is defined as a headword, its inflected forms and its closely related derived forms (from affixation, etc.) (Nation 2001).
Number of words necessary for adequate reading comprehension, expressed as the percentage of unknown words in the text (Carver 1994)
– Easy reading: 0% unknown words
– Difficult reading: 2% unknown words
– Appropriate reading: 1% unknown words
– These observations suggest that a 99% coverage rate is needed for pleasure reading of difficult materials.
5
1.2 Reading and vocabulary size: Non-native speakers
Number of words that a non-native speaker needs to know
– 3,000 word families for general English (Laufer 1992)
– Hirsh and Nation (1992) and Nation and Waring (1997) put the number between 3,000 and 5,000 word families.
Coverage rate for pleasant comprehension
– Liu and Nation (1985) and Laufer (1989) suggested a level of 95% for adequate reading comprehension.
– Hu and Nation (2000) found that unless at least 98% of the running words in a text are covered, the probability of successfully guessing unknown words is severely reduced.
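The coverage rates quoted above are ratios of known running words (tokens) to all running words in a text. A minimal sketch of that calculation follows; the function name `coverage_rate` and the toy text are illustrative assumptions, not taken from the studies cited.

```python
def coverage_rate(tokens, known_words):
    """Share of running words (tokens) in a text that the reader knows."""
    if not tokens:
        raise ValueError("text has no tokens")
    known = sum(1 for tok in tokens if tok in known_words)
    return known / len(tokens)


# Hypothetical 20-token text in which the reader does not know one word.
tokens = ["the"] * 10 + ["cat"] * 5 + ["sat"] * 4 + ["quixotic"]
known_words = {"the", "cat", "sat"}

rate = coverage_rate(tokens, known_words)
print(f"{rate:.0%}")  # 19 of 20 tokens are known
```

At a 95% coverage rate one running word in twenty is unknown; at 98% it is one in fifty.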
6
2. Estimates on Chinese vocabulary Our knowledge about similar issues in Chinese is less conclusive. For example, while we have some idea about the number of characters that an educated native speaker knows, we are much less clear about the vocabulary size of native speakers.
7
2.1 Estimates on Chinese vocabulary: Precompiled lists
Estimates on vocabulary size
– The Unabridged Chinese Dictionary (《汉语大词典》) contains more than 370,000 entries.
– Dictionaries such as the Modern Chinese Dictionary (《现代汉语词典》), which are intended for daily use by educated native speakers, contain between 50,000 and 60,000 entries of characters, words, phrases, idioms, etc.
– The HSK list (国家对外汉语教学领导小组办公室《汉语水平词汇与汉字等级大纲》) contains 8,882 characters, words and phrases.
8
2.2 Estimates on Chinese vocabulary: Empirical studies
Previous empirical studies on Chinese vocabulary
– A study of the Chinese textbooks used in elementary and middle schools in mainland China, conducted in the 1990s by the Modern Education Technology Research Institute, Beijing Normal University, found that out of the 704,841 words identified, only 39,601 are unique.
– Hong Kong Polytechnic University conducted a study between 1991 and 1997 on a 6-million-character corpus of news articles collected between 1990 and 1992 from newspapers published in mainland China, Taiwan and Hong Kong. Their Chinese Word Bank from Mainland China, Taiwan and Hong Kong (《中国大陆、台湾、香港汉语词库》) contains 60,811 entries.
– Further research on the word bank by Chen and Tang (1999) identified 12,700 frequently used words and found that the three regions share a common collection of high- and medium-frequency words that make up 90% of the total number of words identified and cover 95% of the text materials. The remaining 10% of words, which are not shared among the three regions, are concentrated in the low-frequency range.
9
3. This project
Objectives:
– How many words and phrases of 2, 3 and 4 characters are there in the news media?
– Estimate the vocabulary size necessary for news reading comprehension.
Results are available at
– http://lingua.mtsu.edu/chinese-computing/newscorpus/
10
3.1 The news corpus
The Chinese news texts used in this study were collected between the middle of 2003 and the end of 2004 from the current news collection of the World Forum website (世界论坛网, http://www.wforum.com/gbindex.html).
Statistics on the corpus:
11
Table 1 News corpus
Categories                Number of articles
Commentary                997
Culture and education     560
Economy and finance       1580
Entertainment             693
Headline news             4726
Hong Kong and Macao       695
International             3767
Mainland China            2652
Military and defense      2547
North America             2586
Overseas Chinese          439
Science and technology    877
Social                    1936
Sports                    1249
Taiwan                    2661
Total                     27,965
12
3.2 Research methodology
– Brute force is used to generate lists of bigrams, trigrams and quadrigrams.
– Words and phrases are identified with reference to a precompiled word/phrase list.
– The precompiled list is based on six manually edited word/phrase lists available on the Internet.
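The two steps above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names, the toy sentence and the four-entry word list are hypothetical, and splitting the text at every non-Chinese character (so that n-grams never span punctuation) is an assumption the slides do not confirm.

```python
import re
from collections import Counter


def count_ngrams(text, n):
    """Count every run of n consecutive Chinese characters in text."""
    counts = Counter()
    # Assumption: n-grams do not cross punctuation or other non-Chinese characters.
    for run in re.findall(r"[\u4e00-\u9fff]+", text):
        for i in range(len(run) - n + 1):
            counts[run[i:i + n]] += 1
    return counts


def in_word_list(counts, word_list):
    """Keep only the n-grams that appear in the precompiled list."""
    return {gram: freq for gram, freq in counts.items() if gram in word_list}


# Toy data standing in for the news corpus and the consolidated list.
text = "中国经济发展，中国经济增长。"
word_list = {"中国", "经济", "发展", "增长"}

bigrams = count_ngrams(text, 2)
words = in_word_list(bigrams, word_list)
print(len(bigrams), "raw bigrams,", len(words), "found in the list")
```

Strings such as 国经 are counted as raw bigrams although they straddle two words, which is why a brute-force pass yields far more raw n-grams than list entries.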
13
3.3 Consolidated word list
– HSK: http://www.chinese-forums.com/vocabulary/
– CEDICT: CEDICT was created by Paul Denisowski and is currently maintained by Erik Peterson. Data from CEDICT was retrieved from http://www.mandarintools.com/cedict.html on 2005-05-20.
– Adrian Robert: http://kamares.ucsd.edu/~arobert/chinese_f.html
– Word85: Chinese Word Frequency Statistics and Analysis (《汉语词频的统计和分析》) by Beijing Language and Culture University (formerly Beijing Institute of Languages) was retrieved from the Chinese Pinyin and Input Method Forum (〖汉语拼音与输入法论坛〗) at http://sh.netsh.com/bbs/1951/.
– ICTCLAS: Information about ICTCLAS (中科院计算所汉语词法分析系统) can be found at http://mtgroup.ict.ac.cn/~zhp/ICTCLAS/index.html. Vocabulary data incorporated in our consolidated list was retrieved from http://download.pchome.net/php/dl.php?sid=12405.
– Richwin: The word and phrase list from Richwin was retrieved from http://technology.chtsai.org/wordlist/duoyuanpinyin.zip on 2005-04-30. The list is intended for Chinese input in the Richwin system and hence may contain entries that are portions of words or phrases.
14
Table 2 Consolidated word and phrase list based on six online sources
Characters/words/phrases   HSK    CEDICT  Robert  Word85  ICTCLAS  Richwin  Consolidated
Single character           1866   6851    -       -       -        -        -
Two-character              6373   12944   23167   11014   43164    73396    82532
Three-character            306    2686    3692    636     17877    19411    31965
Four-character             188    1983    2651    682     9287     25868    30806
More than four characters  10     563     487     14      465      1654     2529
Subtotal                   8743   25027   29997   12346   70793    120329   147832
15
3.4 Results: Character frequency distribution
Coverage of corpus      10%  25%  50%  75%  90%   95%   98%   99%   99.5%  100%
Number of characters    7    38   155  419  838   1204  1742  2184  2651   6364
Table 3 Cumulative number of characters in terms of percentages
Total number of characters: 22,256,047
Unique characters: 6,364
16
3.4 Results: Character frequency distribution
Most frequent characters  100    500    1000   1500   2000   2500   3000   3500   5000
Coverage of corpus        40.6%  79.2%  92.3%  97.0%  98.7%  99.4%  99.7%  99.8%  100%
Table 4 Cumulative frequency distribution in terms of individual characters
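Tables 3 and 4 are two readings of the same cumulative distribution: one asks how many of the most frequent characters are needed to reach a given coverage, the other how much of the corpus the top N characters cover. A minimal sketch with made-up counts (the four characters and their frequencies are illustrative, not corpus data; the function names are hypothetical):

```python
from collections import Counter
from itertools import accumulate


def cumulative_coverage(counts):
    """Cumulative share of the corpus covered after each type, most frequent first."""
    freqs = sorted(counts.values(), reverse=True)
    total = sum(freqs)
    return [running / total for running in accumulate(freqs)]


def types_needed(coverage, target):
    """Smallest number of top-ranked types whose cumulative coverage reaches target."""
    for rank, value in enumerate(coverage, start=1):
        if value >= target:
            return rank
    raise ValueError("target cannot be reached")


# Toy character counts summing to 100 occurrences.
counts = Counter({"的": 50, "一": 30, "是": 15, "不": 5})
coverage = cumulative_coverage(counts)

print(types_needed(coverage, 0.90))  # Table 3 style: characters needed for 90%
print(coverage[1])                   # Table 4 style: coverage of the top 2 characters
```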
17
3.5 Bigram frequency distribution
18
Frequency range  Raw frequency  In Da’s consolidated list  In the HSK list
>100000          1              1                          0
50001-100000     2              2                          1
20001-50000      18             16                         14
10001-20000      76             71                         56
5001-10000       242            205                        147
1001-5000        2185           1505                       956
501-1000         2691           1511                       702
101-500          19960          7568                       2145
51-100           21767          5666                       854
21-50            52125          9705                       792
11-20            62995          7895                       352
6-10             88249          7557                       163
<6               640776         20363                      145
Total            891087         62065                      6327
Table 5 Bigram frequency distribution
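A table of this shape can be produced by sorting each n-gram into a frequency range and counting, per range, how many n-grams there are and how many of them occur in a reference list. The sketch below is an assumption about how such a tabulation might be done, not the project's code; the bins reuse the lower ranges of Table 5 and collapse everything above 50 into one bin, and the counts are toy values.

```python
# (label, lowest frequency, highest frequency or None for open-ended)
BINS = [("<6", 1, 5), ("6-10", 6, 10), ("11-20", 11, 20),
        ("21-50", 21, 50), (">50", 51, None)]


def tabulate(counts, word_list, bins):
    """Per frequency range: [number of n-grams, number also found in word_list]."""
    rows = {label: [0, 0] for label, _, _ in bins}
    for gram, freq in counts.items():
        for label, low, high in bins:
            if freq >= low and (high is None or freq <= high):
                rows[label][0] += 1
                if gram in word_list:
                    rows[label][1] += 1
                break
    return rows


# Toy bigram counts and word list (hypothetical values).
counts = {"中国": 120, "经济": 12, "国经": 12, "发展": 7, "济发": 3}
word_list = {"中国", "经济", "发展"}

for label, (raw, listed) in tabulate(counts, word_list, BINS).items():
    print(f"{label:6} {raw:3} {listed:3}")
```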
20
3.6 Trigram frequency distribution
21
Frequency range  Raw frequency  In Da’s consolidated list  In the HSK list
>10000           4              3                          0
5001-10000       13             7                          0
1001-5000        469            129                        22
501-1000         1147           189                        15
101-500          14136          1203                       67
51-100           21820          1102                       39
11-50            186538         4268                       72
6-10             215799         2319                       23
2-5              1212187        4917                       33
1                2638870        3164                       9
Total            4290983        17301                      280
Table 6 Trigram frequency distribution
23
3.7 Quadrigram frequency distribution
24
Frequency range  Raw frequency  In Da’s consolidated list  In the HSK list
>2000            14             8                          0
1501-2000        17             7                          0
1001-1500        56             15                         2
501-1000         229            60                         0
101-500          4774           603                        24
51-100           9254           675                        23
11-50            112568         3409                       70
6-10             173480         2070                       22
2-5              1630692        3956                       28
1                5854156        2290                       9
Total            7785240        13093                      178
Table 7 Quadrigram frequency distribution
26
3.8 Discussions: Character frequency
Corpus           10%  25%  50%  75%  90%   95%   98%   99%   99.5%  100%
News corpus      7    38   155  419  838   1204  1742  2184  2651   6364
Modern Chinese   6    33   152  481  1056  1566  2284  2838  3423   9933
Table 8 Cumulative frequency in terms of percentages
27
3.8 Discussions: Character frequency
Corpus           100    500    1000   1500   2000   2500   3000   3500   5000
News corpus      40.6%  79.2%  92.3%  97.0%  98.7%  99.4%  99.7%  99.8%  100%
Modern Chinese   41.8%  75.8%  89.1%  94.6%  97.1%  98.5%  99.2%  99.5%  99.9%
Table 9 Cumulative frequency in terms of individual characters
28
3.9 Discussions: the HSK list
29
Characters/words/phrases  Level 1  Level 2  Level 3  Level 4  Subtotal  HSK itself
Single character          361      464      355      434      1614      1866
Two-character             528      1343     1635     2821     6327      6373
Three-character           20       73       69       118      280       306
Four-character            2        5        30       141      178       188
Total                     911      1885     2089     3514     8399      8733
Table 10 HSK characters, words, phrases and idioms found in the news corpus
31
3.10 Discussion: Targets for CFL learners
32
Targets  Frequency range              2-character  3-character  4-character  Total
First    High to medium (X>50)        16545        2633         1368         20546
Second   Medium-low to low (50≥X>5)   25157        6587         5479         37223
Third    Very low (X≤5)               20363        8081         6246         34690
Total                                 62065        17301        13093        92459
Table 12 Three targets for CFL vocabulary acquisition
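The three targets follow from regrouping the "In Da's consolidated list" columns of Tables 5 to 7 by frequency band. The sketch below redoes that arithmetic; each row is keyed by the lowest frequency in its range (the "<6" bigram range is entered as 1), and the variable and function names are illustrative.

```python
# (lowest frequency in the range, entries found in the consolidated list)
BIGRAMS = [(100001, 1), (50001, 2), (20001, 16), (10001, 71), (5001, 205),
           (1001, 1505), (501, 1511), (101, 7568), (51, 5666), (21, 9705),
           (11, 7895), (6, 7557), (1, 20363)]
TRIGRAMS = [(10001, 3), (5001, 7), (1001, 129), (501, 189), (101, 1203),
            (51, 1102), (11, 4268), (6, 2319), (2, 4917), (1, 3164)]
QUADRIGRAMS = [(2001, 8), (1501, 7), (1001, 15), (501, 60), (101, 603),
               (51, 675), (11, 3409), (6, 2070), (2, 3956), (1, 2290)]


def targets(rows):
    """Sum list entries into the bands X>50, 50>=X>5 and X<=5."""
    first = sum(n for low, n in rows if low > 50)
    second = sum(n for low, n in rows if 5 < low <= 50)
    third = sum(n for low, n in rows if low <= 5)
    return first, second, third


per_length = [targets(rows) for rows in (BIGRAMS, TRIGRAMS, QUADRIGRAMS)]
totals = [sum(band) for band in zip(*per_length)]

for name, band in zip(("2-character", "3-character", "4-character"), per_length):
    print(name, band)
print("Total", totals, sum(totals))
```

Because every frequency range in Tables 5 to 7 lies wholly inside one band, testing the lower bound of each range is enough to assign it.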
35
4. Concluding remarks No manual editing