Applying Word Sketches to Russian
Máša Khokhlova, St. Petersburg State University


Word Sketches for Russian
• Grammatical rules that capture syntactic constructions of Russian, based on a morphologically tagged corpus;
• Regular expressions and the query language of the IMS Corpus Workbench;
• The system searches for tags that correspond to word forms. For example, the tag Ncfpnn denotes a common noun (Nc), feminine gender (f), plural (p), nominative case (n).
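To make the positional tag format concrete, here is a minimal Python sketch that splits a tag such as Ncfpnn into named features. The lookup tables are an assumption for illustration: they contain only the values named on the slide, and the function name `decode_tag` is hypothetical. The final position of the tag is not explained on the slide, so it is left undecoded.

```python
# Illustrative lookup tables: only the values named on the slide are filled in.
# Positions 0-1: category and type, 2: gender, 3: number, 4: case.
CATEGORY = {"Nc": "common noun"}
GENDER = {"f": "feminine"}
NUMBER = {"p": "plural"}
CASE = {"n": "nominative"}


def decode_tag(tag: str) -> dict:
    """Split a positional tag into named features; unknown codes stay raw."""
    return {
        "category": CATEGORY.get(tag[0:2], tag[0:2]),
        "gender": GENDER.get(tag[2:3], tag[2:3]),
        "number": NUMBER.get(tag[3:4], tag[3:4]),
        "case": CASE.get(tag[4:5], tag[4:5]),
        "rest": tag[5:],  # not explained on the slide, left undecoded
    }


print(decode_tag("Ncfpnn"))
```

Because every feature sits at a fixed position, a pattern can constrain one feature and leave the others open by putting a dot in each unconstrained position.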

Word Sketch Rules
Below is an example of the grammatical rules for «adjective + noun» phrases:

*DUAL
=a_modifier/modifies
2:"A....n." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....n."]){0,7} 1:"N...n."
2:"A....g." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....g."]){0,3} 1:"N...g."
2:"A....d." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....d."]){0,3} 1:"N...d."
2:"A....a." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....a."]){0,3} 1:"N...a."
2:"A....i." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....i."]){0,3} 1:"N...i."
2:"A....l." (([word=","]|[word="и"]|[word="или"]){0,1} [tag="A....l."]){0,3} 1:"N...l."
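The effect of one of these rules can be approximated outside the corpus engine. The Python sketch below is only an emulation under stated assumptions, not the Sketch Engine's own matcher: the function name `adj_noun_pairs`, the `max_extra` parameter, and the example tags are hypothetical. It finds an adjective in a given case, skips a limited number of further adjectives (each optionally preceded by a comma, «и», or «или»), and requires a noun in the same case.

```python
import re

CONNECTORS = {",", "и", "или"}


def adj_noun_pairs(tokens, case, max_extra=3):
    """Return (noun, adjective) pairs found by a simplified adjective+noun rule.

    tokens: list of (word, tag) tuples; case: one-letter case code.
    """
    adj = re.compile(f"A....{case}.")
    noun = re.compile(f"N...{case}.")
    pairs = []
    for i, (word, tag) in enumerate(tokens):
        if not adj.fullmatch(tag):
            continue
        pos = i + 1
        # Up to max_extra repetitions of: optional connector, then adjective.
        for _ in range(max_extra):
            j = pos
            if j < len(tokens) and tokens[j][0] in CONNECTORS:
                j += 1
            if j < len(tokens) and adj.fullmatch(tokens[j][1]):
                pos = j + 1
            else:
                break
        # The token after the adjective chain must be a noun in the same case.
        if pos < len(tokens) and noun.fullmatch(tokens[pos][1]):
            pairs.append((tokens[pos][0], word))
    return pairs


# Tags below are made-up examples in the slide's positional style.
sentence = [("зелёный", "Afpmsnf"), ("и", "C"), ("чёрный", "Afpmsnf"), ("чай", "Ncmsnn")]
print(adj_noun_pairs(sentence, "n"))
```

Each adjective in a coordinated chain is paired with the noun, because the search restarts at every adjective position. Writing one rule per case, as the slide does, is what enforces case agreement between the adjective and the noun.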

Word Sketch Rules (2)

=Verb X/X Verb
2:[tag="V.*"] 1:[tag!="SENT"&tag!=","&tag!="-"]
1:[tag!="SENT"&tag!=","&tag!="-"] [lemma="не"]? 2:[tag="V.*"]

=Noun X
2:[tag="N.*"&lemma!=")."] 1:[tag!="SENT"&tag!=","&tag!="-"&lemma!=")."]
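The two «Verb X/X Verb» patterns can be read as: a verb followed directly by any non-punctuation token, or any non-punctuation token followed by an optional «не» and then a verb. The Python sketch below enumerates these configurations on a toy sentence. It is an approximation only: the function name `verb_x_pairs` and the one-letter example tags are hypothetical, and the real engine may report matches differently.

```python
PUNCT_TAGS = {"SENT", ",", "-"}


def verb_x_pairs(tokens):
    """Return (x_lemma, verb_lemma) pairs; tokens are (lemma, tag) tuples."""
    pairs = []
    for i, (lemma, tag) in enumerate(tokens):
        if not tag.startswith("V"):
            continue
        # Pattern 1: verb immediately followed by a non-punctuation token X.
        if i + 1 < len(tokens) and tokens[i + 1][1] not in PUNCT_TAGS:
            pairs.append((tokens[i + 1][0], lemma))
        # Pattern 2, "не" slot empty: X immediately before the verb.
        if i >= 1 and tokens[i - 1][1] not in PUNCT_TAGS:
            pairs.append((tokens[i - 1][0], lemma))
        # Pattern 2, "не" slot filled: X, then "не", then the verb.
        if i >= 2 and tokens[i - 1][0] == "не" and tokens[i - 2][1] not in PUNCT_TAGS:
            pairs.append((tokens[i - 2][0], lemma))
    return pairs


# Simplified one-letter tags, used here only for illustration.
sentence = [("он", "P"), ("не", "Q"), ("пить", "V"), ("чай", "N"), (".", "SENT")]
print(verb_x_pairs(sentence))
```

The negated tag conditions keep sentence-final punctuation, commas, and dashes out of the collocate slot, which is why the final full stop never appears in a pair.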

Text Corpora
• Russian Web Corpus: 190 million tokens
• Rbc (РосБизнесКонсалтинг, RosBusinessConsulting): 22.5 million tokens
• Romip (Российский семинар по Оценке Методов Информационного Поиска, the Russian Information Retrieval Evaluation Seminar): 2.7 million tokens
• Corpus Linguistics: 2.7 million tokens

Word sketches for the word “čaj” (Russian Web Corpus)

Word sketches for the word “čaj” (news)

Word sketches for the word “zelenyj” (Russian Web Corpus)

Word sketches for the word “imet’” (Russian Web Corpus)

Word sketches for the word “korpus” (texts on corpus linguistics)

Word sketches for the word “korpus” (news)

Word sketches for the word “korpus” (Web corpus)

Word sketches for the word “polučit’” (texts on corpus linguistics)

Word sketches for the word “polučit’” (news)

Word sketches for the word “polučit’” (Russian Web Corpus)

Word sketches for the word “dat’” (Russian Web Corpus)

Word sketches for the word “dat’” (texts on corpus linguistics)

Word sketches for the word “dat’” (news)

Thank you!