Tatsuhiko Matsushita LALS, Victoria University of Wellington


Main findings
- VDRJ is useful for designing curricula (materials, tests, etc.).
- The more domains a word is shared by as an academic word (AW) or limited-academic-domain word (LAD), the more abstract its meaning.
- Conversation and non-academic texts contain more general words and literary words (LW).
- Academic texts contain more AW and LAD but fewer LW, in any academic domain.
- Wikipedia contains more proper nouns and low-frequency words.
- Newspapers and academic items of Wikipedia can be a good resource for learning AW and LAD.
- Natural science texts contain more academic domain words at lower frequency levels than arts and social science texts.
- The origins of academic and literary words are quite clearly separated: 3/4 of LW originate in Japanese, while 3/4 of AW and LAD originate in Chinese.
- LAD contains more Western-origin words (Gairaigo).

Contents
1. Motive for this research
2. Goals of this presentation
3. Vocabulary Database for Reading Japanese
4. Tiers of Japanese vocabulary (basic words, academic words, limited-academic-domain words, literary words)
5. Text coverage by word tier
6. Proportions of word origin types by word tier
7. Number of characters required to cover the word tiers
7. Implications from the findings
8. Conclusion

1. Motive for this research
How efficiently can we learn vocabulary?
- The learning burden is big!
- More effective choice of target words
- More efficient order for learning the words
⇒ Effective choice and efficient order: to maximize the coverage of the texts which the learner will encounter in his/her domain = reading comprehension and lexical density (Hu & Nation, 2000; Komori et al., 2004)
⇒ Q. What words should learners learn first? And second, and next?

Studies on EAP vocabulary
- Basic: General Service List (West, 1953)
- Academic: AWL (Coxhead, 2000); UWL (Xue & Nation, 1984); EGAP-A/S, EGAP-HM/SS, etc. (Tajino, Dalsky, & Sasao, 2009); Science-specific Word List (Coxhead & Hirsh, 2007)
- Technical: e.g. Chung (2003)
- Literary vocabulary?

Studies on JAP vocabulary
- Basic: the former JLPT list, Tamamura (1987), etc.
- Academic: Butler (2010), Matsushita (2011) ?
- Technical: Komiya (1995), Oka (1992), etc.
- Others
- No list for words between academic and technical words
- Literary vocabulary?

2. Goals of this presentation
To introduce
- the Vocabulary Database for Reading Japanese
- extracted domain-specific words such as Academic Words (AW), Limited-Academic-Domain Words (LAD) and Literary Words (LW)
To discuss
- how the word tiers work in different types of text (register variation)
- how the learner’s language background possibly affects the understanding of texts in different genres

3. Vocabulary Database for Reading Japanese (VDRJ) (Matsushita, 2010; 2011)
- Created from the Balanced Corpus of Contemporary Written Japanese (BCCWJ), 2009 monitor version (NINJAL, 2009)
- 33 million tokens: 28 million from books and 5 million from Internet forum sites (Yahoo Chiebukuro)
- 19 million content words and 14 million function words
- Unit of counting: lexeme, which is considerably inclusive but less inclusive than the English word family (Level 6 in Bauer & Nation, 1993)
- “Short units of lexemes” are ranked by U (usage coefficient) (Juilland & Chang-Rodrigues, 1964)
- Short unit of lexeme: more inclusive than “lemma”, less inclusive than “word family”
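
The slides name the usage coefficient U but do not show how it is computed. As a rough illustration only, the sketch below uses the textbook Juilland formulation, U = F × D, where D = 1 − CV / √(n − 1) over n equally sized sub-corpora. The function names are hypothetical, and VDRJ's actual handling of unequal sub-corpus sizes is not given in the slides.

```python
import math

def juilland_d(freqs):
    """Juilland's dispersion D over frequencies in equally sized sub-corpora (assumed)."""
    n = len(freqs)
    mean = sum(freqs) / n if n else 0.0
    if n < 2 or mean == 0:
        return 0.0
    # Population variance, so that D reaches 0 when a word occurs in one sub-corpus only
    variance = sum((f - mean) ** 2 for f in freqs) / n
    cv = math.sqrt(variance) / mean  # coefficient of variation
    return 1.0 - cv / math.sqrt(n - 1)

def usage_coefficient(freqs):
    """U = total frequency F multiplied by dispersion D."""
    return sum(freqs) * juilland_d(freqs)

print(usage_coefficient([10, 10, 10, 10]))  # evenly distributed word
print(usage_coefficient([40, 0, 0, 0]))     # word confined to one sub-corpus
```

Both sample words have the same total frequency, but the unevenly distributed one is downgraded, which is the effect the ranking relies on.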

Some problems of existing Japanese word frequency lists
- Lack of representativeness
- Too old
- The corpus size is not large enough: low reliability for low-frequency words
- No good sub-frequency data which would enable us to calculate dispersion and downgrade unevenly distributed words

Advantages of word lists
* Various types of word lists can be created from the vocabulary database (VDRJ)
A) Reference for developing vocabulary tests = checking learners’ vocabulary levels
B) Reference for checking the vocabulary level of materials
C) ⇒ Specifying vocabulary for learners to learn and for teachers to teach
For better choice of material and modification of text
Cf. Nation (2011), word profilers

How to make VDRJ
A) Method
I. Classify all the texts into sub-corpora to see the range and dispersion (cf. Nippon Decimal Classification, BCCWJ (NINJAL, 2009))
II. Parse (i.e., segment into words) all the texts with a morphological analyzer and a dictionary, if the text is not already segmented by spaces between words (cf. MeCab, UniDic)
III. Make word lists with AntConc and/or AntWordProfiler
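
The three steps above can be pictured with a small sketch. It assumes the texts have already been segmented into lexemes (in the slides this is done with a morphological analyzer such as MeCab with UniDic) and only shows how per-sub-corpus frequencies and range might be tabulated. The function names and the toy data are hypothetical, not part of VDRJ or AntConc.

```python
from collections import Counter, defaultdict

def build_frequency_table(subcorpora):
    """Map each lexeme to a Counter of its frequency per sub-corpus.

    subcorpora: dict of sub-corpus name -> list of already segmented lexemes.
    """
    table = defaultdict(Counter)
    for name, lexemes in subcorpora.items():
        for lexeme in lexemes:
            table[lexeme][name] += 1
    return table

def word_range(table, lexeme):
    """Number of sub-corpora in which the lexeme occurs at least once."""
    return sum(1 for count in table.get(lexeme, Counter()).values() if count > 0)

# Toy, hand-segmented data standing in for real sub-corpora
sample = {
    "literature": ["音", "光", "音"],
    "technology": ["光", "減少"],
}
table = build_frequency_table(sample)
for lexeme in table:
    print(lexeme, dict(table[lexeme]), word_range(table, lexeme))
```

A table of this shape is what makes dispersion measures possible: each word carries one frequency per sub-corpus rather than a single overall count.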

Content and construct of VDRJ (Vocabulary Database for Reading Japanese)
- The list is for reading, as it is made from a written corpus of books and Internet forum sites
- Written and spoken language differ in word frequency, domain and required language processing skills
⇒ A good corpus of spoken language is necessary to develop a good word list for it (but there is no very good corpus of spoken Japanese...)

Content of the sub corpora

Different word rankings
- The word ranking problem mainly exists in Basic Words
- This is mainly due to the lack of good spoken corpora
- Compromise: frequency weighted toward limited domains which seem to reflect basic daily needs
Rankings:
- For International Students
- For General Learners
- Non-weighted (ranking for overall written Japanese)

Multidimensional scaling (MDS) 10 domains + word familiarity

4. Tiers of Japanese vocabulary (1) The concept of “word tiers”
- Domain / Level
- Level = general importance = frequency × dispersion
- Some words are frequent only in a particular domain, e.g. 発送 (shipping), 振り込み (paying by bank transfer), 古墳 (tumulus / burial mound)

Assumed word tiers for students
Level
- Basic: Top 1288 = Former JLPT Level 4 & 3
- Intermediate: Ranked
- Advanced 1: 6K-10K
- Advanced 2: 11K-15K
- Super-Advanced: 15K-20K
- 21K+
- Assumed Known Words (AKW)
Domain
- *General / Academic / Literary

4. Tiers of Japanese vocabulary (2) Basic words (BW)
- Feature of the corpus: formal written language, similar to the BNC (Nation, 2004)
- No good spoken corpus for vocabulary studies
Compromise
- For the learners’ and teachers’ lists, the former JLPT Level 4 & 3 vocabulary is put at the top of the list as basic words
- To order the basic words:
  - Identify the domains closest to word familiarity (basic needs) by Multidimensional Scaling (MDS)
  - Frequency in literary works and on the Internet forum site (Yahoo Chiebukuro) is weighted

4. Tiers of Japanese vocabulary (3) Academic domain words
Extracting academic domain words
- Measure: log-likelihood ratio (LLR) (Dunning, 1993)
- Target texts: technical texts, classified into four large academic domains; total number of tokens: approx. 2.9 million
- Reference texts: general texts in BCCWJ 2009; total number of tokens: approx million
- Extract keywords shared by domains
- Cut-off point: higher for more narrowly distributed words
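
To make the keyword extraction step more concrete, here is a minimal sketch of a log-likelihood keyness score for one word. It uses the two-term form that is common in corpus-linguistic keyword tools, which may differ in detail from what was used for VDRJ. Because the slides use cut-offs such as "LLR > 0", the sketch also assumes a signed score (positive when the word is relatively more frequent in the target texts); that sign convention and the function names are assumptions.

```python
import math

def log_likelihood(a, b, c, d):
    """Keyness of a word: a, b = its frequency in target/reference; c, d = corpus sizes."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll

def signed_llr(a, b, c, d):
    """Assumed convention: negative when the word is relatively rarer in the target."""
    ll = log_likelihood(a, b, c, d)
    return ll if a / c >= b / d else -ll

# Hypothetical counts: 50 hits in 1,000 target tokens vs 10 hits in 1,000 reference tokens
print(signed_llr(50, 10, 1000, 1000))
print(signed_llr(10, 50, 1000, 1000))
```

A word scoring high against the general reference texts in several academic domains at once is a candidate academic word; a word scoring high in only one or two domains is a candidate for the narrower tiers.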

4. (3) Academic domain words
Academic words (AW): high specificity in 3+ academic domains
- 4-domain words (cut-off point: LLR > 0)
- 3-domain words (cut-off point: LLR > 0)
Limited-academic-domain words (LAD)
- 2-domain words (cut-off point: LLR > 1)
- 1-domain words (cut-off point: LLR > average value)
- Eliminate the former JLPT Level 4 vocabulary (top 700 words)
- Eliminate the words ranked at or lower
- Classify all the AW and LAD by word ranking level for International Students (U = usage coefficient), 5 levels: Basic / Inter. / Adv. 1 / Adv. 2 / Super-adv.
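
One possible reading of the cut-off rules above is sketched below. The slides do not spell out the exact procedure, so the order of the checks, the function name and the per-domain average values are all assumptions made for illustration. The domain labels Ah, Ss, Tn and Bn are the abbreviations used on later slides.

```python
def classify_word(llr_by_domain, domain_average):
    """Assign a tier from per-domain LLR scores (illustrative interpretation only).

    llr_by_domain: dict of domain -> signed LLR of the word in that domain.
    domain_average: dict of domain -> average LLR used as the 1-domain cut-off.
    """
    above_zero = [d for d, v in llr_by_domain.items() if v > 0]
    if len(above_zero) >= 3:
        return f"AW ({len(above_zero)}-domain)"
    above_one = [d for d, v in llr_by_domain.items() if v > 1]
    if len(above_one) == 2:
        return "LAD (2-domain)"
    above_avg = [d for d, v in llr_by_domain.items() if v > domain_average[d]]
    if len(above_avg) == 1:
        return "LAD (1-domain)"
    return "not an academic domain word"

averages = {"Ah": 5.0, "Ss": 5.0, "Tn": 5.0, "Bn": 5.0}  # hypothetical values
print(classify_word({"Ah": 2.0, "Ss": 3.0, "Tn": 1.0, "Bn": 4.0}, averages))
print(classify_word({"Ah": 2.0, "Ss": 3.0, "Tn": -1.0, "Bn": -4.0}, averages))
print(classify_word({"Ah": 9.0, "Ss": -3.0, "Tn": -1.0, "Bn": -4.0}, averages))
```

The stricter cut-offs for narrower words reflect the principle stated on the previous slide: the fewer domains a word is shared by, the higher the bar it has to clear.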

4. Tiers of Japanese vocabulary (3) -1 Academic words (AW)
JAWL = Japanese Academic Word List
- High specificity in 3 or 4 academic domains
  - 4-domain words (cut-off point: LLR > 0)
  - 3-domain words (cut-off point: LLR > 0)
- 9 levels (Level 0 - VIII), 2590 words in total
- JAWL I (Intermediate): most essential for learning
- Basic words contain far fewer academic words
- JAWL I: 559 words, close to AWL in number and text coverage
- Coverage in the academic corpus used for extracting AW: AWL 10.0 %, JAWL I 11.1 %

Distribution and examples of JAWL

4. (3) -1 Academic words (AW) Semantic features of AW (1)
Highly abstract, essential for operating logic, e.g.
- Range: 占める (occupy, account for), 特殊 (special, particular)
- Relation: 属する (belong to), 依存 (rely/reliance)
- Comparison/Evaluation: 後者 (the latter), 優れる (superior)
- Quantitative change: 減少 (decrease), 強化 (reinforce)
- Stage: 当初 (beginning), 現状 (present condition)
- Development of enunciation: 取り上げる (take up [an issue]), まとめる (summarize)
- Cause-effect, degree, agent, action, object, direction, goal, instrument, time, etc.

4. Tiers of Japanese vocabulary (3) -1 Academic words (AW) Semantic features of AW (2)
- The most frequent Kanji used for AW: 合 (combine, together), 定 (fix, certain), 分 (divide, minute), 一 (one), 同 (same), 数 (number), 上 (up), 体 (body), 出 (out), 大 (large)
- 3-domain words: some words have concrete meanings, e.g. 署名 (signature), 保健 (health, hygiene)
- 4-domain words: few words have concrete meanings
- The nature of the words is the same at all levels

POS of Japanese AW (1)
- Common noun: 1072 words (41.4 %), e.g. 背景 (background)
- Verbal noun: 882 words (34.0 %), e.g. 連続 (establish/-ment)
  ⇒ Adding other types of nouns together, 2104 words (81.2 %) can be a noun
- Verb (excluding verbal nouns): 225 words (8.7 %), e.g. 認める (recognize/approve), 述べる (describe/mention)
  ⇒ Adding other types of verbs together, 1107 words (42.7 %) can be a verb
- Adjectival noun: 95 words (3.7 %), e.g. 詳細 (detail/-ed), 平等 (equal/-ity)
- Adjective: only 9 words (0.3 %), e.g. 著しい (remarkable)

POS of Japanese AW (2)
- Affix: 106 words (4.1 %), e.g. -期 (period), -種 (type); substantial in Japanese academic words
- Adverb: 34 words (1.3 %), e.g. しばしば (frequently)
- Other (particle, auxiliary verb, etc.): 22 words (0.8 %)
- Remarkably many archaic words, e.g. のみ (only), つつ (while doing), べし (ought to), あらゆる (every), いかなる (any), 我が (my), 漠然 (vague)
- れる / られる (passive/potential/spontaneous): specific to academic texts

4. (3) -2 Limited-academic-domain words (LAD)
Limited-academic-domain words (LAD): high specificity in 2 or 1 domain(s)
- 2-domain words (cut-off point: LLR > 1)
- 1-domain words (cut-off point: LLR > average value)
- Something between “academic” and “technical”
- The “scams” from extracting AW?
Tiers of curriculum, cf. Tajino et al. (2007): words corresponding to the curriculum
- Basic: all learners
- Academic words: prep. to first year
- Limited-academic-domain words (?): prep. to major
- Technical words: major to postgrad.

4. (3) -2 Limited-academic-domain words (LAD) 2 domain words

Examples of 2 domain words: Words which are shared by only 2 main academic domains

4. (3) -2 Limited-academic-domain words (LAD) 2 domain words
Semantic features: more concrete and specific than academic words
- Ah & Ss: social; overlap in history and ethnology
- Ss & Tn: industrial
- Ss & Bn: social security, medical and nursing services
- Tn & Bn: scientific
- Ah & Tn, Ah & Bn: not clear

4. (3) -2 Limited-academic-domain words (LAD) 1 domain words
- This is merely a trial
- The corpus is not the best for academic purposes, especially for the natural sciences
- Extracting what is common across domains is much easier, while extracting words from only one target corpus requires a more complete target corpus
- Therefore, AW (4-domain and 3-domain words) will be more reliable than LAD (2-domain and 1-domain words)

4. (3) -2 Limited-academic-domain words (LAD) 1 domain words

4. (3) -2 Limited-academic-domain words (LAD) 1 domain words
Semantic features are much clearer than those of 2 domain words

POS of Japanese LAD (1)
- Common noun: 1605 words (63.1 %), more than AW (41.4 %)
- Verbal noun: 633 words (24.9 %), e.g. 融資 (finance); cf. AW (34.0 %)
  ⇒ Adding other types of nouns together, 2104 words (87.9 %) can be a noun, more than AW (81.2 %)
- Verb (excl. verbal nouns): 81 words (3.2 %), e.g. 訳す (translate), 向き合う (face (v.)); cf. AW (8.7 %)
  ⇒ Adding other types of verbs together, 714 words (28.1 %) can be a verb, less than AW (42.7 %)
- Adjectival noun: 88 words (3.5 %), e.g. フル (full), 偉大 (great); cf. AW (3.7 %)
- Adjective: only 3 words (0.1 %), e.g. 硬い (stiff); cf. AW (0.3 %)

POS of Japanese LAD (2)
- Affix: 109 words (4.3 %), e.g. -犯 (offense); cf. AW (4.1 %); substantial in Japanese academic domain words
- Adverb: 15 words (0.6 %), e.g. 現に (surely); cf. AW (1.3 %)
- Other (particle, auxiliary verb, etc.): 9 words (0.8 %); cf. AW (0.8 %)
- Remarkably many archaic words, similar to AW, e.g. なり [affirmative aux.], とも (even though), たり [affirmative aux.], ごとし (as/like), 単なる (mere), しめる (= しむ) [causative aux.], かかる (such)

4. Tiers of Japanese vocabulary (4) Literary words (LW)
Extracting literary words: words for reading literary works
- Measure: log-likelihood ratio (Keyness in AntConc)
- Target corpus: literary works (identified by NDC and C-code) in BCCWJ 2009 (NINJAL, 2009), over 8 million tokens
- 4 different reference corpora: technical texts; general texts in arts and humanities; general texts in the other 3 academic domains; Internet forum texts (Yahoo Chiebukuro)
- Extract keywords shared by the four results (cut-off point: average value)
- Eliminate the former JLPT Level 4 vocabulary (top 700 words)
- Eliminate the words ranked at or lower
- Classify all the LW by word ranking level for International Students (U = usage coefficient)
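
The "shared by the four results" step can be illustrated as a set intersection. In the sketch below, each comparison against a reference corpus yields a keyness score per word, words above that result's average are kept, and only words kept in every comparison survive. Taking the average within each result separately is an assumption, and the scores are invented for the example.

```python
from statistics import mean

def shared_keywords(results):
    """Words whose keyness exceeds the average in every reference comparison.

    results: list of dicts (one per reference corpus) mapping word -> keyness score.
    """
    shared = None
    for scores in results:
        cutoff = mean(scores.values())  # assumed: average of this result's scores
        keep = {word for word, score in scores.items() if score > cutoff}
        shared = keep if shared is None else shared & keep
    return shared or set()

# Invented scores for three words against four hypothetical reference corpora
results = [
    {"こぶし": 10, "減少": -5, "音": 4},
    {"こぶし": 8, "減少": -2, "音": 1},
    {"こぶし": 6, "減少": 0, "音": 9},
    {"こぶし": 3, "減少": 1, "音": 2},
]
print(shared_keywords(results))
```

Requiring agreement across several different reference corpora guards against words that look literary only because one particular reference corpus happens to lack them.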

4. (4) Literary words (LW) Distribution and examples

4. (4) Literary words (LW) POS of LW
- More verbs, adverbs and interjections than AW and LAD
- Fewer verbal nouns and adjectival nouns
- This inevitably means that LW have fewer loan words but more Japanese-origin words

4. (4) Literary words (LW)
Q. How many LW overlap with AW and LAD?
- Only 27 words (0.5 % of academic domain words, 1.7 % of LW) overlap
- Most of the overlapping words (24/27) overlap with 1 domain words (17 words overlap with words in biological natural science)
- Many physical words, such as words for body parts, e.g. 左手 (left hand), こぶし (fist), 血 (blood), 頭上 (overhead)
- No LW overlap with 4 domain words
- Overlapping words are mainly at the intermediate level
- No overlapping words at 11K or above
- Some examples of overlapping words: 音 (sound), 光 (light), 棚 (shelf), 組 (class), 岩 (rock), ひざ (knee), 興奮 (excite/-ment), 全身 (whole body), 帝 (emperor), ネズミ (mouse), 帆 (sail)

Word tiers: In what order should students learn them?
Level            Domain
Basic            General | AW/LAD | LW
Intermediate     General | AW/LAD | LW
Advanced         General | AW/LAD | LW
Highly Advanced  General | AW/LAD | LW
Super-Advanced   General | AW/LAD | LW
Assumed known words / Proper names / Fillers, Signs / (Transparent compounds *) / Others

5. Text coverage by word tier
The word tier analyser
- An Excel sheet in which the word profile of a text can be checked automatically by pasting in the output of AntWordProfiler run with the word tier base word lists
Text covering efficiency
- High efficiency in vocabulary learning = fewer unique lexemes cover more text (reciprocal of the type/token ratio = token/type ratio?)
* Comparison should be made between equally sized texts
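
As a rough picture of what such a profile contains, the sketch below computes the share of a text's tokens that falls into each word tier, plus the token/type ratio mentioned above. It is not the Excel analyser or AntWordProfiler itself. The tier labels attached to the sample words are invented for the example, and the text is assumed to be already segmented into lexemes.

```python
from collections import Counter

def coverage_by_tier(tokens, tier_of):
    """Proportion of tokens covered by each tier; unlisted words count as 'Others'."""
    if not tokens:
        return {}
    counts = Counter(tier_of.get(token, "Others") for token in tokens)
    return {tier: n / len(tokens) for tier, n in counts.items()}

def token_type_ratio(tokens):
    """Average number of tokens per unique lexeme (higher = each word covers more text)."""
    return len(tokens) / len(set(tokens)) if tokens else 0.0

tier_of = {"減少": "AW", "する": "Basic", "音": "LW"}  # hypothetical tier labels
tokens = ["減少", "する", "減少", "音", "XYZ"]
print(coverage_by_tier(tokens, tier_of))
print(token_type_ratio(tokens))
```

Because the token/type ratio grows with text length, it is only comparable between texts of equal size, as the note on the slide says.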

Coverage of Japanese texts by word tier

Findings from the text coverage
- Conversation and non-academic texts: more general words and LW
- Wikipedia: more proper nouns and low-frequency words
- Academic items of Wikipedia: 15-20 % of the text is estimated to be covered by JAWL I (559 types); encyclopaedic nature of AW? ⇒ can be a good resource for learning AW
- Newspapers: similar to academic texts, but contain more LAD and AW at the advanced level ⇒ can be a good resource for learning AW (esp. in social sciences)
- Academic texts: more AW and LAD but fewer LW in any academic domain
- Academic texts in natural sciences: more academic domain words at lower frequency levels (technical vocabulary) than Ah. and Ss. texts, similar to Coxhead, Stevens, & Tinkle (2010)

6. Proportion of word origin types by word tier Proportion of Unique Lexemes by Word Origin and Word Tier in 01K-20K (*) (Matsushita, 2011)

Findings from the proportion of word origin types by word tier
- LW: Japanese-origin words are significantly dominant
- AW and LAD: Chinese-origin words are significantly dominant
- LAD: more Western-origin words (Gairaigo) ⇒ Western-origin words tend to appear more at lower frequency levels in academic domain words
- The origins of academic and literary words are quite clearly separated: academic = Chinese origin, literary = Japanese origin

7. Implications from the findings
Q. Word tiers: In what order should students learn them?
Level            Domain
Basic            Academic | LAD | General
Intermediate     Academic | LAD | General
Advanced         Academic | LAD | General
Highly Advanced  Academic | LAD | General
Super-Advanced   Academic | LAD | General
Assumed known words / Proper names / Fillers / Signs / (Transparent compounds *) / Others

Implications for teaching and research
- A vocabulary-conscious curriculum should be designed and incorporated into Japanese programs, depending on the learners’ needs and language backgrounds
- The gap between Chinese-background learners (CBLs) and non-CBLs will be smaller in basic conversation and in reading literary works than in reading academic texts
- A good curriculum for learning academic domain words is particularly desirable for non-CBLs of academic Japanese
- An autonomous mode of vocabulary learning will be necessary, particularly when the learners’ needs and language backgrounds vary

8. Conclusion
Limitations of the word lists
- Less valid for narrower domain words (2D/1D words) and less reliable at higher frequency levels ⇒ need refining with a more complete academic corpus
- Multi-word units not extracted
- Not sensitive to different usages in different domains (polysemy)
Remaining issues
- Many transparent compounds in Japanese
⇒ What is the Kanji tier? How is it related to the word tier?

Download sites for VDRJ/JAWL
- Matsushita Laboratory for Language Learning: atsu.html (interface: English). Google it with “matsushita” and “language”
- 松下言語学習ラボ (interface: Japanese). Google it with “松下” and “言語”

References (1)
Anthony, L. (2007). AntConc Version (text analysis tool) (Version 1.0 first published in 2002)
Anthony, L. (2009). AntWordProfiler Version 1.2 w (word profiler) (Version 1.0 first published in 2008)
Beck, I. L., McKeown, M. G., & Kucan, L. (2002). Bringing Words to Life: Robust Vocabulary Instruction. Solving problems in the teaching of literacy. New York: Guilford Press.
Butler, Y. G. (バトラー後藤裕子). (2010). 小中学生のための日本語学習語リスト(試案) (A list of Japanese academic vocabulary for elementary and junior high school students in Japan). 母語・継承語・バイリンガル教育 (MHB) 研究 (Studies in Mother Tongue, Heritage Language, and Bilingual Education), 6,

References (2)
Chung, T. M. (2003). Identifying technical terms. Unpublished PhD dissertation, Victoria University of Wellington.
Corson, D. J. (1995). Using English Words. Dordrecht: Kluwer Academic Publishers.
Corson, D. J. (1997). The learning and use of academic English words. Language Learning, 47(4),
Coxhead, A. (2000). A new academic word list. TESOL Quarterly, 34(2),
Coxhead, A., & Hirsh, D. (2007). A pilot science-specific word list. Revue Francaise de Linguistique Appliquee, 12(2),
Coxhead, A., Stevens, L., & Tinkle, J. (2010). Why might secondary science textbooks be difficult to read? New Zealand Studies in Applied Linguistics, 16(2),

References (3)
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61-74.
Eldridge, J. (2008). No, there isn’t an academic vocabulary but... TESOL Quarterly,
Hyland, K., & Tse, P. (2007). Is there an “Academic Vocabulary”? TESOL Quarterly, 41(2),
Hu, M. H. & Nation, P. (2000). Vocabulary density and reading comprehension. Reading in a Foreign Language, 13(1),
Juilland, A., & Chang-Rodrigues, E. (1964). Frequency Dictionary of Spanish Words. London: Mouton & Co.

References (4)
Komiya, C. (小宮千鶴子). (1995). 専門日本語教育の専門語 -経済の基本的な専門語の特定を目指して- [Technical terms for teaching technical Japanese: Aiming at identifying basic technical terms for economics]. 日本語教育 [Teaching Japanese as a Foreign Language], 86,
Komori, K. (小森和子), Mikuni, J. (三國純子), & Kondo, A. (近藤安月子). (2004). 文章理解を促進する語彙知識の量的側面 -既知語率の閾値探索の試み- (What percentage of known words in a text facilitates reading comprehension: a case study for exploration of the threshold of known words coverage). 日本語教育 [Teaching Japanese as a Foreign Language], 125,

References (5)
Matsushita, T. (松下達彦). (2010). What words are essential to read Japanese? Making word lists from a large corpus of books and internet forum sites [日本語を読むために必要な語彙とは? -書籍とインターネットの大規模コーパスに基づく語彙リストの作成-]. Proceedings for the Conference of the Society for Teaching Japanese as a Foreign Language, Spring 2010 [2010 年度日本語教育学会春季大会予稿集],
Matsushita, T. (松下達彦). (2011). 日本語を読むための語彙データベース (The Database for Reading Japanese). Downloaded 22 May 2011 from http://
Nation, I. S. P. (2004). A study of the most frequent word families in the British National Corpus. In P. Bogaards & B. Laufer (Eds.), Vocabulary in a Second Language: Selection, Acquisition, and Testing (pp. 3-13). Amsterdam: John Benjamins.

References (6)
Nation, I. S. P. (2011). Making and using word lists. In I. S. P. Nation & Stuart Webb (Eds.), Researching and analysing vocabulary. Boston: Heinle Cengage Learning.
Oka, M. (岡 益巳). (1992). 非漢字圏の留学生のための日本経済基本用語表 [Basic terms of the Japanese economy for non-Kanji background students]. 岡山大学経済学会雑誌 (Okayama Economic Review), 23(4),

References (7)
Tajino, A., Terauchi, H., Sasao, Y., & Maswana, S. (田地野 彰・寺内 一・笹尾洋介・マスワナ紗矢子). (2007). 総合研究大学における英語学術語彙リスト開発の意義 - EAP カリキュラム開発の観点から- (The development of academic words lists at a multi-disciplinary university in Japan: A fundamental step in EAP curriculum design). 京都大学高等教育研究 (Kyoto University Researches in Higher Education), 13.
Tajino, A., Dalsky, D., & Sasao, Y. (2009). Academic vocabulary reconsidered: An EAP curriculum-design perspective. Journal of Teaching English as a Foreign Language and Literature, 1(4), 3-21.

References (8)
Tamamura, F. (玉村文郎). (1987). 日本語教育基本 2570 語 [Basic 2570 words for teaching Japanese as a second language]. 日本語の語彙・意味 (2) [Japanese Vocabulary and Meaning], NAFL Institute 日本語教師養成通信講座 [Training Course of Teachers of Japanese as a Second Language]. アルク (Alc).
Townsend, D., & Collins, P. (2008). Academic vocabulary and middle school English learners: an intervention study. Reading and Writing, 22(9), doi: /s y
Ward, J. (1999). How large a vocabulary do EAP Engineering students need? Reading in a Foreign Language, 12(2),
West, M. (1953). A General Service List of English Words. London: Longman, Green & Co.