Published by Adela Marshall. Modified over 9 years ago.
2
Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application
Kiyotaka Uchimoto*, Yasuharu Den†
* National Institute of Information and Communications Technology (NICT)
† Chiba University
3
Outline
- Background
- Dependency Structure in the CSJ
- Dependency-structure Annotation
- Word-level Dependency-structure Analysis
- Towards Construction of Middle Words
- Summary and future work
4
Background (1)
Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000]
- The largest spontaneous-speech corpus in the world
- Includes transcriptions of speeches as well as audio recordings
- One tenth of the CSJ has been manually annotated with morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.
5
Background (2)
Syntactic structure of a sentence
- Represented by dependency relationships between bunsetsus
- As represented in the Kyoto University text corpus
  - The syntactic structure within a bunsetsu is not considered
Example (two bunsetsus):
  nihon gata kokusai kouken ga (Japanese style international contribution)
  motome rare te iru (is required)
6
Dependency Structure in the CSJ (1)
Dependency relationships between bunsetsus
- Annotated within “sentences” in the CSJ
Dependency relationships between words
- Annotated within bunsetsus
- Word segments in the word-level dependency structure: short words
  - A short word approximates a term found in an ordinary dictionary
  - A long word represents various compounds
Example (two bunsetsus):
  nihon gata kokusai kouken ga (Japanese style international contribution)
  motome rare te iru (is required)
7
Dependency Structure in the CSJ (2)
Disfluencies characteristic of spontaneous speech
- Self-correction
  - Represented as a dependency between bunsetsus, and label D is assigned to it
Example (one bunsetsu per line; label D marks the self-correction):
  Yamada (Yamada)
  Yamada san wa (Mr. Yamada)
  kyoujin na (strong)
  nikutai no (body)
  mochinushi da to (possessor)
  it te mashi ta ne (said)
(Yamada, Mr. Yamada said that he had a strong body.)
8
Dependency Structure in the CSJ (3)
Disfluencies characteristic of spontaneous speech
- Self-correction
  - Represented as a dependency between words, and label D is assigned to it
Example (one word per line; label D marks the self-correction):
  kokuritsu (national)
  Nihon (Japanese)
  go (word)
  kokugo (Japanese language)
  kenkyuu (research)
  jo (institute)
  de (case marker)
(At National Japanese word, Japanese language research institute)
9
Dependency-structure Annotation
Manual annotation
- 199 speeches for dependency relationships between bunsetsus
- 50 speeches for dependency relationships between words
Human annotation by using a tool
- Initial: every bunsetsu depends on the next
- Step 1: two annotators examined each dependency and modified it if it was inappropriate
- Step 2: a checker examined all dependencies
  - Referred to audio recordings as well as transcriptions
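The initial state of the annotation ("every bunsetsu depends on the next") can be pictured as a simple head array. The following is a minimal sketch, assuming a 0-based list of head indices with `None` for the final bunsetsu; the function name `initial_heads` and this representation are illustrative assumptions, not the tool's actual data format.

```python
from typing import List, Optional


def initial_heads(n_bunsetsu: int) -> List[Optional[int]]:
    """Default annotation: bunsetsu i depends on bunsetsu i + 1.

    The last bunsetsu has no modifiee, so its head is None.
    """
    if n_bunsetsu <= 0:
        return []
    return [i + 1 for i in range(n_bunsetsu - 1)] + [None]


# Starting point for a six-bunsetsu sentence; annotators would then
# correct individual entries (hypothetical representation).
heads = initial_heads(6)
```

Starting from this default means annotators only touch the dependencies that are not adjacent, which is presumably why the tool was initialized this way.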
10
- Each line represents a bunsetsu
- Dependencies are modified by mouse drag-and-drop
- Self-corrections, coordination, and appositives can be annotated with labels D, P, and A by right-clicking the mouse
11
- Each line represents a word
- Dependencies are modified by mouse drag-and-drop
12
Word-level Dependency-structure Analysis (1)
Finding a modifiee for each word in a bunsetsu
- Each dependency goes from left to right
- The rightmost word is assumed to have no modifiee
Existing methods were applied
- Ex. shift-reduce method [Nivre and Scholz, 2004]
Example bunsetsu: nihon/noun gata/suffix kokusai/noun kouken/noun ga/ppp
Example parser state: stack: gata, nihon; input words: kokusai, kouken, ga
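To make the two constraints above concrete (every dependency goes left to right, and the rightmost word has no modifiee), here is a minimal sketch of a stack-based, left-to-right parser for a single bunsetsu. It is a simplified illustration in the spirit of shift-reduce parsing, not the transition system of Nivre and Scholz (2004). The `depends` decision function stands in for a trained classifier, and the gold heads used in the demonstration are an assumption made for illustration only; neither comes from the slides.

```python
from typing import Callable, List, Optional, Sequence


def parse_bunsetsu(
    words: Sequence[str],
    depends: Callable[[int, int], bool],
) -> List[Optional[int]]:
    """Assign a head index to every word in one bunsetsu.

    depends(i, j) is a hypothetical decision function answering
    "does word i modify word j?" for i < j.
    """
    heads: List[Optional[int]] = [None] * len(words)
    stack: List[int] = []
    last = len(words) - 1
    for j in range(len(words)):
        # Reduce: attach items on the stack to the incoming word j.
        # At the rightmost word, everything left on the stack attaches,
        # because every other word must have a head to its right.
        while stack and (j == last or depends(stack[-1], j)):
            heads[stack.pop()] = j
        stack.append(j)  # Shift
    return heads


# Demonstration with an oracle built from ASSUMED gold heads
# (illustrative only; the slides do not give this structure).
words = ["nihon", "gata", "kokusai", "kouken", "ga"]
assumed_gold = [1, 3, 3, 4, None]
predicted = parse_bunsetsu(words, lambda i, j: assumed_gold[i] == j)
```

With a perfect oracle the parser reproduces the assumed heads; in practice `depends` would be a classifier over the words and their POS categories.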
13
Word-level Dependency-structure Analysis (2)
Experiments
- 50 speeches in the CSJ
  - Word-level dependencies (total: 33,429)
    - The rightmost dependency in each bunsetsu was not counted
- 10-fold cross validation
- Features: words and their POS categories
Methods compared by dependency accuracy:
- Baseline
- Shift-reduce (Nivre & Scholz, 2004)
- MST parser (McDonald et al., 2005)
- CaboCha (Kudo and Matsumoto, 2000)
Dependency accuracy: 98.6%, 99.1%
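The counting rule used in the experiments (the rightmost word of each bunsetsu is excluded) can be sketched as follows. This is a minimal illustration, assuming each bunsetsu is given as a list of head indices with `None` for the rightmost word; the function name and the toy data are hypothetical and are not the experimental data.

```python
from typing import List, Optional, Sequence


def dependency_accuracy(
    gold: Sequence[Sequence[Optional[int]]],
    pred: Sequence[Sequence[Optional[int]]],
) -> float:
    """Fraction of words whose predicted head matches the gold head.

    The rightmost word of every bunsetsu is skipped, since it is
    assumed to have no modifiee inside the bunsetsu.
    """
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, pred):
        for g, p in zip(gold_heads[:-1], pred_heads[:-1]):
            total += 1
            if g == p:
                correct += 1
    return correct / total if total else 0.0


# Toy data (hypothetical): two bunsetsus, one wrong head in the first.
toy_gold: List[List[Optional[int]]] = [[1, 3, 3, 4, None], [1, None]]
toy_pred: List[List[Optional[int]]] = [[1, 2, 3, 4, None], [1, None]]
accuracy = dependency_accuracy(toy_gold, toy_pred)
```

Single-word bunsetsus contribute nothing to the total under this rule, so they neither help nor hurt the score.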
14
Application of Word-level Dependency-structure
In text-to-speech synthesis
- A basic unit is required to indicate appropriate pronunciation and accent
Long word: dandanba^take gairaigo kanahyouki manyogana
Middle word: dandanbatake gairaigo kanahyouki manyogana
Short word: da^ndan (layered) hatake (fields) gairai (foreign) go (word) kana (kana) hyouki (orthography) manyo (myriad) kana (kana)
“rendaku” (Weijer et al., 2005)
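The sound change visible in the example (hatake becoming batake, kana becoming gana) can be illustrated with a small sketch of sequential voicing on romanized forms. This is an assumption-laden simplification: the voicing table covers only common Hepburn-style initials, accent marks are ignored, and whether rendaku applies is passed in as a flag, because it depends on the lexical item and on the compound's structure rather than on spelling. The names `VOICING`, `voice_initial`, and `compound` are hypothetical.

```python
# Voiced counterparts of initial consonants in Hepburn-style romaji.
# Digraphs come first so that "sh" is matched before "s".
VOICING = [
    ("sh", "j"), ("ch", "j"), ("ts", "z"),
    ("k", "g"), ("s", "z"), ("t", "d"), ("h", "b"), ("f", "b"),
]


def voice_initial(word: str) -> str:
    """Voice the initial consonant of a romanized word, if it has one."""
    for plain, voiced in VOICING:
        if word.startswith(plain):
            return voiced + word[len(plain):]
    return word


def compound(first: str, second: str, rendaku: bool) -> str:
    """Join two short words, voicing the second one only when requested."""
    return first + (voice_initial(second) if rendaku else second)


layered_fields = compound("dandan", "hatake", rendaku=True)
loanword = compound("gairai", "go", rendaku=False)
```

Keeping the decision outside the string function reflects the point of the slide: the unit that carries the pronunciation has to be chosen first, and the sound change follows from it.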
15
Application of Word-level Dependency-structure
A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)
Long word: dandanba^take gairaigo kanahyouki manyogana
Middle word: dandanbatake gairaigo kanahyouki manyogana
Short word: da^ndan (layered) hatake (fields) gairai (foreign) go (word) kana (kana) hyouki (orthography) manyo (myriad) kana (kana)
16
Construction of Middle Words
Construction rule
- Combine adjacent short words that have dependency relationships, under the condition that a middle word is not longer than a long word
Morphological information
- If a middle word corresponds to a long word
  - Extracted from the long word
- Otherwise
  - Extracted from the rightmost short word in the middle word
Example (basic frequency pattern)
  Short words: kihon / shuuha / suu / pataan (Noun, Noun, Suffix, Noun)
  Middle words: kihon | shuuha suu pataan (Noun)
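One reading of the construction rule can be sketched as follows: within a single long word, two neighboring short words are merged when the left one depends on the right one, and the POS of a merged unit comes from its rightmost short word unless the unit covers the whole long word. This is a minimal sketch under stated assumptions: the head indices for the example are assumed for illustration (the slides show only the resulting segmentation), and `build_middle_words` is a hypothetical name. Because the input is one long word, the "not longer than a long word" condition holds by construction.

```python
from typing import List, Optional, Sequence, Tuple


def build_middle_words(
    short_words: Sequence[str],
    pos_tags: Sequence[str],
    heads: Sequence[Optional[int]],
    long_word_pos: str,
) -> List[Tuple[List[str], str]]:
    """Group the short words of ONE long word into middle words."""
    if not short_words:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(short_words)):
        if heads[i - 1] == i:
            groups[-1].append(i)  # left neighbor depends on word i: merge
        else:
            groups.append([i])    # no adjacent dependency: new middle word
    result = []
    for group in groups:
        covers_long_word = len(group) == len(short_words)
        # POS from the long word if identical to it, else from the
        # rightmost short word in the middle word.
        pos = long_word_pos if covers_long_word else pos_tags[group[-1]]
        result.append(([short_words[k] for k in group], pos))
    return result


# Example from the slide; the head indices are ASSUMED for illustration.
middle = build_middle_words(
    ["kihon", "shuuha", "suu", "pataan"],
    ["Noun", "Noun", "Suffix", "Noun"],
    [2, 2, 3, None],
    long_word_pos="Noun",
)
```

Here `kihon` stays separate because its assumed head is not its right neighbor, which matches the segmentation `kihon | shuuha suu pataan` shown on the slide.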
17
Middle Words and Accent Phrases
Relationships between middle words (MW) and accent phrases (BI=2, 2+p, 2+b, 2+bp, 3) in the CSJ
Long words (LW): 97,167
- No accent phrase boundary (APB) in LW: 94,038
  - LW = MW: 93,797
  - LW > MW: 241
- Accent phrase boundary (APB) in LW: 3,129
  - APB in MW: 2,942
  - No APB in MW: 187
  - MW boundary corresponds to LW boundary or APB: 3,075
  - MW boundary corresponds neither to LW boundary nor to APB: 54
Examples: nihonjin/gakushuusha rittai/chuushajou kaku|zokusei gen|jiten zen|shikiichi emuten|chuuouchi/heikatsuka yuudo/saidaika|kijun
(should be reduced)
18
Summary and Future Work
Dependency structure of a large spontaneous Japanese speech corpus, the Corpus of Spontaneous Japanese (CSJ)
Application of a word-level dependency structure
- Constructing new basic units, middle words
- Middle words: useful as constituents of accent phrases
Annotation to the Balanced Corpus of Contemporary Written Japanese (BCCWJ)
- Supported by the priority area program ‘Japanese Corpus’, a five-year (2006-2010) project