Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application Kiyotaka Uchimoto* Yasuharu Den † *National Institute.


2 Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application Kiyotaka Uchimoto* Yasuharu Den † *National Institute of Information and Communications Technology (NICT) † Chiba University

3 Outline
- Background
- Dependency Structure in the CSJ
- Dependency-structure Annotation
- Word-level Dependency-structure Analysis
- Towards Construction of Middle Words
- Summary and future work

4 Background (1)
- Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000]
  - The largest spontaneous-speech corpus in the world
  - Includes transcriptions of speeches as well as audio recordings
  - One tenth of the CSJ has been manually annotated with morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.

5 Background (2)
- Syntactic structure of a sentence
  - Represented by dependency relationships between bunsetsus
  - As represented in the Kyoto University text corpus
- Syntactic structure within a bunsetsu is not considered
Example: nihon gata kokusai kouken ga (Japanese style international contribution) motome rare te iru (is required)

6 Dependency Structure in the CSJ (1)
- Dependency relationships between bunsetsus
  - Annotated within “sentences” in the CSJ
- Dependency relationships between words
  - Annotated within bunsetsus
  - Word segments in the word-level dependency structure: short words
    - A short word approximates a term found in an ordinary dictionary
    - A long word represents various compounds
Example: nihon gata kokusai kouken ga (Japanese style international contribution) motome rare te iru (is required)

7 Dependency Structure in the CSJ (2)
- Disfluencies characteristic of spontaneous speech
  - Self-correction
    - Represented as a dependency between bunsetsus, to which the label D is assigned
Example (dependency labeled D): Yamada (Yamada) Yamada san wa (Mr. Yamada) kyoujin na (strong) nikutai no (body) mochinushi da to (possessor) it te mashi ta ne (said)
(Yamada, Mr. Yamada said that he had a strong body.)

8 Dependency Structure in the CSJ (3)
- Disfluencies characteristic of spontaneous speech
  - Self-correction
    - Represented as a dependency between words, to which the label D is assigned
Example (dependency labeled D): kokuritsu (national) Nihon (Japanese) go (word) kokugo (Japanese language) kenkyuu (research) jo (institute) de (case marker)
(At National Japanese word, Japanese language research institute)

9 Dependency-structure Annotation
- Manual annotation
  - 199 speeches for dependency relationships between bunsetsus
  - 50 speeches for dependency relationships between words
- Human annotation using a tool
  - Initial: every bunsetsu depends on the next
  - Step 1: two annotators examined each dependency and modified it if it was inappropriate
  - Step 2: a checker examined all dependencies
    - Referred to audio recordings as well as transcriptions

10 [Screenshot of the annotation tool]
- Each line represents a bunsetsu
- Modified by mouse drag-and-drop
- Self-corrections, coordination, and appositives can be annotated with labels D, P, and A by right-clicking the mouse

11 [Screenshot of the annotation tool]
- Each line represents a word
- Modified by mouse drag-and-drop

12 Word-level Dependency-structure Analysis (1)
- Finding a modifiee for each word in a bunsetsu
  - Each dependency goes from left to right
  - The rightmost word is assumed to have no modifiee
- Existing methods were applied
  - Ex. shift-reduce method [Nivre and Scholz, 2004]
Example input: nihon/noun gata/suffix kokusai/noun kouken/noun ga/ppp
[Figure: parser state showing the stack and the remaining input words for this example]
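
The two constraints on this slide (every dependency points rightward, and the rightmost word has no modifiee) can be illustrated with a simplified stack-based procedure. This is only a sketch under those constraints, not Nivre and Scholz's full algorithm and not the authors' implementation. The function name `parse_bunsetsu`, the `attaches` decision function, and the gold head indices are all hypothetical; a real parser would replace the oracle with a trained classifier over word and POS features.

```python
from typing import Callable, List, Optional

Heads = List[Optional[int]]


def parse_bunsetsu(n_words: int, attaches: Callable[[int, int], bool]) -> Heads:
    """Assign each word a head to its right using a stack.

    attaches(dep, head) decides whether word `dep` depends on word `head`.
    The last word is the root, so anything still on the stack when it
    arrives is attached to it.
    """
    heads: Heads = [None] * n_words
    stack: List[int] = []
    for i in range(n_words):
        is_last = i == n_words - 1
        # Reduce: pop every stacked word that should depend on word i.
        while stack and (is_last or attaches(stack[-1], i)):
            heads[stack.pop()] = i
        # Shift: word i now waits for its own head.
        stack.append(i)
    return heads


# Hypothetical gold heads for: nihon gata kokusai kouken ga
# (indices 0..4; None marks the rightmost word, which has no modifiee).
gold: Heads = [1, 3, 3, 4, None]


def oracle(dep: int, head: int) -> bool:
    """Stand-in for a trained classifier: consult the gold heads."""
    return gold[dep] == head


predicted = parse_bunsetsu(len(gold), oracle)
print(predicted)
```

With an oracle that always answers correctly, the procedure reproduces the gold heads, which is a quick check that the shift and reduce steps cover every left-to-right projective structure of this kind.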

13 Word-level Dependency-structure Analysis (2)
- Experiments
  - 50 speeches in the CSJ
    - Word-level dependencies (total: 33,429)
      - The rightmost dependency in each bunsetsu was not counted
  - 10-fold cross-validation
  - Features: words and their POS categories
Methods compared by dependency accuracy: Baseline; Shift-reduce (Nivre & Scholz, 2004); MST parser (McDonald et al., 2005); CaboCha (Kudo and Matsumoto, 2000)
Accuracy values preserved in the transcript: 98.6%, 99.1%
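
Dependency accuracy of the kind reported here can be computed as the fraction of words whose predicted head matches the gold head. The sketch below is illustrative only. The slide does not define its baseline, so `adjacent_baseline` (every word depends on the next word) is an assumption, and the gold heads are hypothetical. Words without a modifiee are skipped; the slide's further exclusion of the rightmost dependency is not modeled.

```python
from typing import List, Optional

Heads = List[Optional[int]]


def adjacent_baseline(n_words: int) -> Heads:
    """Assumed baseline: each word depends on the word immediately to its right."""
    return [i + 1 if i + 1 < n_words else None for i in range(n_words)]


def dependency_accuracy(gold: List[Heads], predicted: List[Heads]) -> float:
    """Fraction of words with a gold head whose predicted head is the same."""
    correct = 0
    total = 0
    for gold_heads, pred_heads in zip(gold, predicted):
        for g, p in zip(gold_heads, pred_heads):
            if g is None:
                # Rightmost word of the bunsetsu: no modifiee, not scored.
                continue
            total += 1
            correct += int(g == p)
    return correct / total if total else 0.0


# Hypothetical gold heads for: nihon gata kokusai kouken ga
gold_bunsetsu: Heads = [1, 3, 3, 4, None]
baseline = adjacent_baseline(len(gold_bunsetsu))
accuracy = dependency_accuracy([gold_bunsetsu], [baseline])
print(accuracy)
```

In a cross-validation setting, the same function would be applied to the held-out fold of each split and the results averaged.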

14 Application of Word-level Dependency-structure
- In text-to-speech synthesis
  - A basic unit is required to indicate appropriate pronunciation and accent
Long word: dandanba^take, gairaigo, kanahyouki, manyogana
Middle word: dandanbatake, gairaigo, kanahyouki, manyogana
Short word: da^ndan (layered) hatake (fields), gairai (foreign) go (word), kana (kana) hyouki (orthography), manyo (myriad) kana (kana)
“rendaku” (Weijer et al., 2005)

15 Application of Word-level Dependency-structure
Long word: dandanba^take, gairaigo, kanahyouki, manyogana
Middle word: dandanbatake, gairaigo, kanahyouki, manyogana
Short word: da^ndan (layered) hatake (fields), gairai (foreign) go (word), kana (kana) hyouki (orthography), manyo (myriad) kana (kana)
- A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)

16 Construction of Middle Words
- Construction rule
  - Combine adjacent short words that have dependency relationships, under the condition that a middle word is not longer than a long word
- Morphological information
  - If a middle word corresponds to a long word: extracted from the long word
  - Otherwise: extracted from the rightmost short word in the middle word
- Example
  - Short words: kihon / shuuha / suu / pataan (Noun, Noun, Suffix, Noun) (basic frequency pattern)
  - Middle words: kihon | shuuha suu pataan (Noun)
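
The construction rule can be sketched as a single left-to-right pass over the short words of one long word. This is a minimal illustration, not the authors' implementation: the function name `build_middle_words` is hypothetical, and the head indices are an assumption chosen to be consistent with the segmentation shown on the slide. Working inside one long word at a time keeps every middle word no longer than the long word. The sketch always takes the POS from the rightmost short word and does not model the case where a middle word coincides with the long word.

```python
from typing import List, Optional, Tuple


def build_middle_words(
    words: List[str],
    pos: List[str],
    heads: List[Optional[int]],
) -> List[Tuple[str, str]]:
    """Merge adjacent short words when the left one depends on the right one.

    All three lists describe the short words of a single long word.
    Returns (surface, POS) pairs, with the POS taken from the rightmost
    short word of each middle word.
    """
    if not words:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(words)):
        if heads[i - 1] == i:
            # Word i-1 modifies its right neighbour: extend the current group.
            groups[-1].append(i)
        else:
            # No dependency between the neighbours: start a new middle word.
            groups.append([i])
    return [("".join(words[j] for j in g), pos[g[-1]]) for g in groups]


short_words = ["kihon", "shuuha", "suu", "pataan"]
short_pos = ["Noun", "Noun", "Suffix", "Noun"]
# Assumed heads: kihon -> pataan, shuuha -> suu, suu -> pataan.
assumed_heads = [3, 2, 3, None]

middle_words = build_middle_words(short_words, short_pos, assumed_heads)
print(middle_words)
```

Under these assumed heads, kihon stays separate because its head is not the adjacent word, while shuuha, suu, and pataan are chained into one middle word, matching the segmentation kihon | shuuha suu pataan.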

17 Middle Words and Accent Phrases
- Relationships between middle words (MW) and accent phrases (BI=2, 2+p, 2+b, 2+bp, 3) in the CSJ
Long words (LW): 97,167
- No accent phrase boundary (APB) in LW: 94,038
  - LW = MW: 93,797
  - LW > MW: 241
- Accent phrase boundary (APB) in LW: 3,129
  - APB in MW: 2,942
  - No APB in MW: 187
  - MW boundary corresponds to LW boundary or APB: 3,075
  - MW boundary corresponds neither to LW boundary nor to APB: 54
Examples: nihonjin/gakushuusha, rittai/chuushajou, kaku|zokusei, gen|jiten, zen|shikiichi, emuten|chuuouchi/heikatsuka, yuudo/saidaika|kijun
Note on the slide: should be reduced

18 Summary and Future Work
- Dependency structure of a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ)
- Application of a word-level dependency structure
  - Constructing new basic units, middle words
  - Middle words: useful as constituents of accent phrases
- Annotation of the Balanced Corpus of Contemporary Written Japanese (BCCWJ)
  - Supported by the priority area program ‘Japanese Corpus’, a five-year (2006-2010) project

