With 6,500 languages in the world, we must explore new ways to learn, document, and share our linguistic knowledge. John J. Kovarik NSA/CSS Senior Language Technology Authority

Unlocking and Sharing LCTL Linguistic Knowledge
Keywords: CFG parsing, language generation, computational linguistics
CALICO ’05, University of Michigan, Ann Arbor, MI, May 17-20, 2005

The Challenges of Learning and Sharing Knowledge of an LCTL in the 21st Century
John J. Kovarik, National Security Agency

Presentation Overview
• General LCTL Challenges
• Challenges of Learning Mongolian
• Recipe for New Approach
• Khalkha Mongolian Parts of Speech
• Mongolian Morphological Affixes
• Method of Lexical Knowledge Representation
• Analyze, Parse, Build Grammar Model, Test
• Iterate Repeatedly

LCTL Learning Challenges
• Fewer Learned Resources to Learn from
• Less Recognition Nationally
• Fewer Opportunities to Document What’s Learned
• Very Few Students to Learn from You
• Almost All Learning Done Manually
• Few Reliable 21st Century Applications
  – Microsoft IME
  – Font

Mongolian Learning Challenges
• Input Method Editor (IME)
  – Microsoft IME keyboard is arranged for native Mongols
  – American Mongolists prefer a phonetic keyboard
  – “a” key on Mongolian keyboard mapped to ASCII “a”, etc.
• Fonts commonly used on the Internet
  – Russian Cyrillic fonts are commonly used
  – “|” and “0” are commonly substituted for “ү” and “ө”
  – “у” and “о” are often freely extended to “ү” and “ө”

Recipe for a New Approach
• Take a student with a computational linguistics background
• Infuse with curiosity and energy
• Stir in access to the Internet
• Add Mongolian syntax and morphology
• Create morphological analyzer, context-free parser, and grammatical generator for Mongolian
• Resulting lexicons, software, and grammar models can be used by other linguistically adept students

Khalkha Mongolian Parts of Speech
• Declinable Nouns
• Declinable Adjectives
• Inflected Verbs
• Unchanging Adverbs
• Declinable Converbs
• Unchanging Postpositions
• Unchanging Conjunctions
• Unchanging Particles

Mongol Morphological Affixes
• 27 verbal suffixes denoting tense and mood
• 2 verb infixes denoting verb manner
  – Consultative
  – Passive
• 6 verb paradigms or verb types
• 3 irregular common verbs
• 6 cases in singular and plural number
• Both nouns and adjectives are declined

Lexical Knowledge Representations
• Unchanging adverbs, conjunctions, particles, etc. and irregular verb forms (unchanging.txt file)
• Lemmas of declinable nouns and adjectives (declinables.txt file)
• Inflected verbs and nominalized verbs (regvb.txt file)
• Affix files (casendings.txt, reflex.txt, infixes.txt, vbforms.txt)

Some Examples
• declinables.txt file
  – N нэр
  – Q хэн
• regverb.txt file
  – V ир
  – V өс
• Affix files
  – casendings.txt: g ний, d д, a ыг, b оос
  – reflex.txt: аа, ээ, оо
  – infixes.txt: C лц, R лд, P гд
  – vbforms.txt: ipf нө, i1p в, i3p чээ, Ypf охгүй
• unchanging.txt file
  – Pg->талаар
  – Pc->холбогдуулан
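The tag-plus-form entries shown for declinables.txt and casendings.txt can be read into lookup tables in a few lines. The sketch below is in Python rather than the Perl the author used, and it assumes one whitespace-separated "TAG form" pair per line; the slide does not show the real on-disk layout, so the function name and the file layout are both hypothetical.

```python
from io import StringIO


def load_tagged_entries(lines):
    """Read 'TAG form' pairs, one per line, into a form -> tag dictionary."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        tag, form = parts
        entries[form] = tag
    return entries


# Sample contents copied from the slide; the one-pair-per-line layout is assumed.
DECLINABLES = "N нэр\nQ хэн\n"
CASE_ENDINGS = "g ний\nd д\na ыг\nb оос\n"

declinables = load_tagged_entries(StringIO(DECLINABLES))
case_endings = load_tagged_entries(StringIO(CASE_ENDINGS))
```

Keying the tables by surface form makes the later affix comparison a plain dictionary lookup.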

Merge Morphology Knowledge with the Power of the Computer
Wrote yalgah.pl to become a tireless lexical pedagogue
• Searches for identifiable affixes by comparison with the lexical knowledge affix files
• Matches the resulting lemma against the lexical knowledge of declinables, verbs, and unchanging words
• If the lemma can be matched, outputs the word/part of speech tag to the standard output file, plus an entry in the expository lexicon
• If the lemma cannot be matched, outputs the lemma to the Out Of Vocabulary (oov) file, noting the affixes found
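The matching logic described for yalgah.pl can be illustrated roughly as follows. This is a Python sketch, not the author's Perl, which the slides do not show. It handles only case endings on declinables, and it assumes that a tag is formed by appending the case code to the lemma tag, with "n" for a bare (nominative) lemma, as the tags Nn and Nd in the parser output suggest. The function name `analyze` and the return convention are hypothetical.

```python
# Small sample tables taken from the slide examples.
LEMMAS = {"нэр": "N", "хэн": "Q"}
CASE_ENDINGS = {"ний": "g", "д": "d", "ыг": "a", "оос": "b"}
UNCHANGING = {"талаар": "Pg", "холбогдуулан": "Pc"}


def analyze(word, lemmas=LEMMAS, case_endings=CASE_ENDINGS, unchanging=UNCHANGING):
    """Return (tag, None) for a known word, or (None, affixes_found) for an OOV word."""
    if word in unchanging:
        return unchanging[word], None
    if word in lemmas:
        # Assumption: a bare declinable lemma is tagged as nominative ("n").
        return lemmas[word] + "n", None
    found = []
    # Try the longest endings first so a short ending cannot shadow a long one.
    for ending in sorted(case_endings, key=len, reverse=True):
        if word.endswith(ending) and len(word) > len(ending):
            stem = word[: -len(ending)]
            if stem in lemmas:
                return lemmas[stem] + case_endings[ending], None
            found.append(ending)  # affix recognized, lemma unknown
    return None, found


tag, affixes = analyze("нэрд")
```

A caller would write tagged words to the standard output file and route any `(None, affixes)` result to the oov file together with the affixes that were recognized.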

Additional Outputs
• Expository Morphology File (named morphlex.txt)
    IR->verb command imperative 2nd person singular
    IREEREY->converb future perfect continuative
    IREG->verb command concessive 3rd person singular/plural
    BAGA->adjective
    HURAL->noun nominative
    IH->adjective
    AJILDAA->reflexive noun dative-locative
    ORLOO->verb indicative second past
• Out Of Vocabulary File (named oov)
    [C = : = > 5 = 0 E 0 0 A 0 0 ] (UNKNOWNAHAASAA) WORD 0 LINE 2 FALLS OUTSIDE OF VOCABULARY
    possible reflexive ending -
    possible declinable case ending - -
    possible verbal part of speech - -
    possible participial/converbal stem 5 = >--

Feed Analytic Output to Parser
• Developed context-free grammar (CFG) rules for both discourse and newspaper texts
    S->Sbj Prd
    S->Prd
    Sbj->Nn
    Sbj->NP
    NP->Tg Nn
    NP->Tg Ng Nn
    Prd->J
• Wrote parse.pl to validate CFG rules against input text tagged as to part of speech
• When each sentence can be fully parsed, outputs a parse tree and an English gloss.
    Working on "BAGA HURAL IH AJILDAA ORLOO."
    ENGLISH GLOSS: large hural great work began.
    The sentence does parse.
    Branch nodes on tree:
    S -> (Sbj Prd)
    Sbj -> (NP)
    NP -> (J Nn)
    Prd -> (NPd Vi2p)
    NPd -> (J Nd)
    POS: J Nn J Nd Vi2p
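Validating CFG rules against a part-of-speech tagged sentence can be sketched with a small top-down parser. The Python below is an illustration only, not parse.pl itself, whose algorithm the slides do not describe. The grammar holds just the five rules listed under "Branch nodes on tree" for the sample sentence, and the function names are hypothetical. The approach assumes the grammar has no left-recursive rules.

```python
# Rules taken from the sample parse tree; any symbol without a rule is a POS tag.
GRAMMAR = {
    "S": [["Sbj", "Prd"]],
    "Sbj": [["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}


def parse(symbol, tags, start):
    """Yield (tree, next_position) for each way symbol can match from tags[start]."""
    if symbol not in GRAMMAR:  # terminal: must equal the tag at this position
        if start < len(tags) and tags[start] == symbol:
            yield symbol, start + 1
        return
    for rhs in GRAMMAR[symbol]:
        yield from expand(symbol, rhs, tags, start)


def expand(symbol, rhs, tags, start):
    """Match the right-hand side child by child, keeping every partial match."""
    partials = [([], start)]
    for child in rhs:
        partials = [
            (kids + [tree], nxt)
            for kids, pos in partials
            for tree, nxt in parse(child, tags, pos)
        ]
    for kids, pos in partials:
        yield (symbol, kids), pos


def parse_sentence(tags):
    """Return a parse tree covering all tags, or None if the sentence does not parse."""
    for tree, end in parse("S", tags, 0):
        if end == len(tags):
            return tree
    return None


tree = parse_sentence(["J", "Nn", "J", "Nd", "Vi2p"])
```

A `None` result signals that the rule set needs a new rule before the sentence can be parsed, which is the trigger for the rule-writing step of the workflow.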

Feed Output to Generator
• Wrote gramgen.pl to generate sentences based on lexical knowledge, morphological knowledge, and syntactic knowledge gained
• Output routinely reviewed for accuracy and Chomskian explanatory adequacy of the grammar models created for the parser and generator engines
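One common way to generate sentences from a CFG is random top-down expansion, sketched below in Python. Whether gramgen.pl works this way is not stated in the slides, so treat this as an assumed technique. The grammar and the tag-to-word table are taken from the sample parse and the morphlex.txt excerpt, and the names `generate`, `GRAMMAR`, and `LEXICON` are hypothetical.

```python
import random

# Rules from the sample parse tree in the slides.
GRAMMAR = {
    "S": [["Sbj", "Prd"]],
    "Sbj": [["NP"]],
    "NP": [["J", "Nn"]],
    "Prd": [["NPd", "Vi2p"]],
    "NPd": [["J", "Nd"]],
}

# Words grouped by the POS tags they received in the sample output.
LEXICON = {
    "J": ["BAGA", "IH"],
    "Nn": ["HURAL"],
    "Nd": ["AJILDAA"],
    "Vi2p": ["ORLOO"],
}


def generate(symbol, rng):
    """Expand symbol into a list of words by picking rules and words at random."""
    if symbol in LEXICON:  # POS tag: pick one word carrying that tag
        return [rng.choice(LEXICON[symbol])]
    words = []
    for child in rng.choice(GRAMMAR[symbol]):
        words.extend(generate(child, rng))
    return words


rng = random.Random(0)  # seeded so that runs are repeatable
sentences = [" ".join(generate("S", rng)) for _ in range(5)]
```

Reading through a batch of generated sentences exposes rules that overgenerate, which is the kind of review the slide describes.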

Iterative Process
• First take a new newspaper article or dialogue and run the morphological analyzer on it until all words are listed within vocabulary (no output in the oov [Out Of Vocabulary] file)
• Run output through parser, creating new CFG rules until the new text parses
• Run generator for a hundred or more examples to ensure adequacy of new rules

Morpho-analyzer, Parser, Generator Software Led This Student to Deeper Understanding of Mongolian
• A linguistically adept learner can thus write software to help oneself learn more deeply and faster
• Language tool development is thus grounded in gaining and applying language knowledge in a systematic and linguistically principled manner for oneself and others

Contact Information: John Kovarik