Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application Kiyotaka Uchimoto* Yasuharu Den † *National Institute.

Presentation transcript:

Word-level Dependency-structure Annotation to Corpus of Spontaneous Japanese and Its Application Kiyotaka Uchimoto* Yasuharu Den † *National Institute of Information and Communications Technology (NICT) † Chiba University

Outline
- Background
- Dependency Structure in the CSJ
- Dependency-structure Annotation
- Word-level Dependency-structure Analysis
- Towards Construction of Middle Words
- Summary and future work

Background (1)
- Corpus of Spontaneous Japanese (CSJ) [Maekawa et al., 2000]
  - The largest spontaneous-speech corpus in the world
  - Includes transcriptions of speeches as well as audio recordings
  - One tenth of the CSJ has been manually annotated with morphemes, sentence boundaries, syntactic structures, discourse structures, prosodic information, etc.

Background (2)
- Syntactic structure of a sentence
  - Represented by dependency relationships between bunsetsus
  - As represented in the Kyoto University text corpus
- Syntactic structure of a bunsetsu is not considered
- Example (two bunsetsus):
  - nihon gata kokusai kouken ga (Japanese style international contribution)
  - motome rare te iru (is required)

Dependency Structure in the CSJ (1)
- Dependency relationships between bunsetsus
  - Annotated within “sentences” in the CSJ
- Dependency relationships between words
  - Annotated within bunsetsus
  - Word segments in the word-level dependency structure: short words
    - A short word approximates a term found in an ordinary dictionary
    - A long word represents various compounds
- Example (two bunsetsus):
  - nihon gata kokusai kouken ga (Japanese style international contribution)
  - motome rare te iru (is required)

Dependency Structure in the CSJ (2)
- Disfluencies characteristic of spontaneous speech
  - Self-correction
    - Represented as a dependency between bunsetsus, to which the label D is assigned
- Example (bunsetsus, dependency label D):
  - Yamada (Yamada)
  - Yamada san wa (Mr. Yamada)
  - kyoujin na (strong)
  - nikutai no (body)
  - mochinushi da to (possessor)
  - it te mashi ta ne (said)
  - Translation: (Yamada, Mr. Yamada said that he had a strong body.)

Dependency Structure in the CSJ (3)
- Disfluencies characteristic of spontaneous speech
  - Self-correction
    - Represented as a dependency between words, to which the label D is assigned
- Example (words, dependency label D):
  - kokuritsu (national)
  - Nihon (Japanese)
  - go (word)
  - kokugo (Japanese language)
  - kenkyuu (research)
  - jo (institute)
  - de (case marker)
  - Translation: (At National Japanese word, Japanese language research institute)

Dependency-structure Annotation
- Manual annotation
  - 199 speeches for dependency relationships between bunsetsus
  - 50 speeches for dependency relationships between words
- Human annotation by using a tool
  - Initial state: every bunsetsu depends on the next
  - Step 1: two annotators examined each dependency and modified it if it was inappropriate
  - Step 2: a checker examined all dependencies
    - Referred to audio recordings as well as transcriptions

- Each line represents a bunsetsu
- Modified by mouse drag-and-drop
- Self-corrections, coordination, and appositives can be annotated with the labels D, P, and A by right-clicking the mouse

- Each line represents a word
- Modified by mouse drag-and-drop

Word-level Dependency-structure Analysis (1)
- Finding a modifiee for each word in a bunsetsu
  - Each dependency goes from left to right
  - The rightmost word is assumed to have no modifiee
- Existing methods were applied
  - Ex. shift-reduce method [Nivre and Scholz, 2004]
- Example input: nihon/noun gata/Suffix kokusai/noun kouken/noun ga/ppp
- [Figure: shift-reduce processing of the example, showing the stack and the input words]
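The left-to-right constraint described on this slide can be illustrated with a minimal stack-based sketch in Python. This is a simplified stand-in for a shift-reduce parser, not the method of Nivre and Scholz (2004). The function name `parse_bunsetsu`, the `depends` callback, and the gold heads used in the example are assumptions made for illustration; in a real system `depends` would be a trained classifier over words and their POS categories.

```python
from typing import Callable, Dict, List


def parse_bunsetsu(words: List[str],
                   depends: Callable[[int, int], bool]) -> Dict[int, int]:
    """Return a map from each word index to the index of its modifiee.

    Every dependency goes from left to right, and the rightmost word
    receives no modifiee, as stated on the slide.
    """
    heads: Dict[int, int] = {}
    stack: List[int] = []
    for j in range(len(words)):
        # Reduce: attach stacked words to the incoming word while the
        # (hypothetical) decision function says they depend on it.
        while stack and depends(stack[-1], j):
            heads[stack.pop()] = j
        # Shift: the incoming word waits for its own modifiee.
        stack.append(j)
    # Words still unattached fall back to the rightmost word.
    last = len(words) - 1
    for i in stack:
        if i != last:
            heads[i] = last
    return heads


# Example bunsetsu from the slide. The gold heads below are an assumed
# analysis for demonstration only, not taken from the CSJ annotation.
words = ["nihon", "gata", "kokusai", "kouken", "ga"]
gold = {0: 1, 1: 3, 2: 3, 3: 4}


def oracle(i: int, j: int) -> bool:
    """Stand-in for a classifier: answers from the assumed gold heads."""
    return gold.get(i) == j


heads = parse_bunsetsu(words, oracle)
for i in sorted(heads):
    print(f"{words[i]} -> {words[heads[i]]}")
```

With an oracle as the decision function the sketch reproduces the assumed gold structure; replacing the oracle with a learned model is where the actual parsing difficulty lies.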

Word-level Dependency-structure Analysis (2)
- Experiments
  - 50 speeches in the CSJ
    - Word-level dependencies (total: 33,429)
      - The rightmost dependency in each bunsetsu was not counted
  - 10-fold cross validation
  - Features: words and their POS categories
- Methods compared (dependency accuracy)
  - Baseline
  - Shift-reduce (Nivre & Scholz, 2004)
  - MST parser (McDonald et al., 2005)
  - CaboCha (Kudo and Matsumoto, 2000)
- Dependency accuracy values shown: 98.6%, 99.1%
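The evaluation setup on this slide can be sketched as follows. The sketch makes two assumptions that the slide does not confirm: "the rightmost dependency" is read as the dependency of the second-to-last word (which can only attach to the last word, so it is trivially correct), and the baseline is taken to be "every word depends on the next word". The names `adjacent_baseline`, `dependency_accuracy`, and `ten_fold` are hypothetical.

```python
from typing import Dict, List, Sequence

Heads = Dict[int, int]  # word index -> index of its modifiee


def adjacent_baseline(n_words: int) -> Heads:
    """Assumed baseline: every word depends on the word to its right."""
    return {i: i + 1 for i in range(n_words - 1)}


def dependency_accuracy(gold: Sequence[Heads],
                        pred: Sequence[Heads],
                        lengths: Sequence[int]) -> float:
    """Fraction of correctly predicted modifiees over all bunsetsus.

    For a bunsetsu of n words, only words 0 .. n-3 are scored: the last
    word has no modifiee, and the second-to-last word is skipped under
    the assumption stated in the lead-in.
    """
    correct = total = 0
    for g, p, n in zip(gold, pred, lengths):
        for i in range(n - 2):
            total += 1
            correct += int(p.get(i) == g[i])
    return correct / total if total else 0.0


def ten_fold(items: List, k: int = 10) -> List[List]:
    """Split items into k folds by striding (a simple, assumed scheme)."""
    return [items[i::k] for i in range(k)]


# One bunsetsu with five words and an assumed gold analysis.
gold_heads = {0: 1, 1: 3, 2: 3, 3: 4}
acc = dependency_accuracy([gold_heads], [adjacent_baseline(5)], [5])
print(f"adjacent-word baseline accuracy: {acc:.3f}")
```

In a cross-validation run, each fold returned by `ten_fold` would serve once as test data while a parser is trained on the remaining nine.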

Application of Word-level Dependency-structure
- In text-to-speech synthesis
  - A basic unit is required to indicate appropriate pronunciation and accent
- Example
  - Long words: dandanba^take, gairaigo, kanahyouki, manyogana
  - Middle words: dandanbatake, gairaigo, kanahyouki, manyogana
  - Short words: da^ndan (layered) hatake (fields); gairai (foreign) go (word); kana (kana) hyouki (orthography); manyo (myriad) kana (kana)
- “rendaku” (Weijer et al., 2005)

Application of Word-level Dependency-structure
- A sound change or an accent change is blocked by right-branching tree structures (Kubozono, 1995)

Construction of Middle Words
- Construction rule
  - Combine adjacent short words that have dependency relationships, under the condition that a middle word is not longer than a long word
- Morphological information
  - If a middle word corresponds to a long word: extracted from the long word
  - Otherwise: extracted from the rightmost short word in the middle word
- Example
  - Short words: kihon / shuuha / suu / pataan (Noun, Noun, Suffix, Noun) (basic frequency pattern)
  - Middle words: kihon | shuuha suu pataan (Noun)
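The construction rule can be sketched in Python as below. This is one possible reading of the rule, not the authors' implementation: the function `build_middle_words`, its parameters, and the word-level heads used in the example (kihon -> suu, shuuha -> suu, suu -> pataan) are assumptions chosen so that the output matches the segmentation shown on the slide. Surface forms are simply concatenated, so sound changes such as rendaku are not modeled.

```python
from typing import Dict, List, Tuple


def build_middle_words(short_words: List[str],
                       pos: List[str],
                       heads: Dict[int, int],
                       long_word_id: List[int],
                       long_word_pos: Dict[int, str]) -> List[Tuple[str, str]]:
    """Merge adjacent short words into middle words.

    Two neighbours are merged when the left one depends on the right one
    and both belong to the same long word, so a middle word can never be
    longer than a long word.
    """
    if not short_words:
        return []
    groups: List[List[int]] = [[0]]
    for i in range(1, len(short_words)):
        same_long_word = long_word_id[i - 1] == long_word_id[i]
        if heads.get(i - 1) == i and same_long_word:
            groups[-1].append(i)
        else:
            groups.append([i])

    result: List[Tuple[str, str]] = []
    for group in groups:
        surface = "".join(short_words[i] for i in group)
        lw = long_word_id[group[0]]
        # The middle word corresponds to the long word when it covers
        # every short word of that long word.
        covers_long_word = len(group) == long_word_id.count(lw)
        # POS comes from the long word in that case, otherwise from the
        # rightmost short word of the middle word.
        tag = long_word_pos[lw] if covers_long_word else pos[group[-1]]
        result.append((surface, tag))
    return result


# Example from the slide; heads and long-word grouping are assumed.
short_words = ["kihon", "shuuha", "suu", "pataan"]
pos = ["Noun", "Noun", "Suffix", "Noun"]
heads = {0: 2, 1: 2, 2: 3}        # kihon skips over shuuha to suu
long_word_id = [0, 0, 0, 0]       # all four form one long word
long_word_pos = {0: "Noun"}

middle_words = build_middle_words(short_words, pos, heads,
                                  long_word_id, long_word_pos)
print(middle_words)
```

Under these assumed heads, kihon stays separate because its modifiee is not the adjacent word, while shuuha, suu, and pataan are chained into a single middle word.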

Middle Words and Accent Phrases
- Relationships between middle words (MW) and accent phrases (BI = 2, 2+p, 2+b, 2+bp, 3) in the CSJ
- Long words (LW): 97,167
  - No accent phrase boundary (APB) in LW: 94,038
  - Accent phrase boundary (APB) in LW: 3,129
- Subcategories: LW = MW; LW > MW; APB in MW; No APB in MW; MW boundary corresponds to LW boundary or APB; MW boundary corresponds neither to LW boundary nor to APB
- Counts for the subcategories: 93, , ,07554
- Examples: nihonjin/gakushuusha, rittai/chuushajou, kaku|zokusei, gen|jiten, zen|shikiichi, emuten|chuuouchi/heikatsuka, yuudo/saidaika|kijun
- should be reduced

Summary and Future Work
- Dependency structure of a large, spontaneous, Japanese-speech corpus, the Corpus of Spontaneous Japanese (CSJ)
- Application of a word-level dependency-structure
  - Constructing new basic units, middle words
  - Middle words: useful as constituents of accent phrases
- Annotation to the Balanced Corpus of Contemporary Written Japanese (BCCWJ)
  - Supported by the priority area program ‘Japanese Corpus’, a five-year ( ) project