Proper Nouns in Czech Corpora Magda Ševčíková Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics.

Presentation transcript:

Proper Nouns in Czech Corpora Magda Ševčíková Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University in Prague Czech Republic

Corpus Linguistics 2007, July 30

Outline
- Introduction
- Proper nouns in corpora of Czech: current state
  - Corpus SYN2000
  - Prague Dependency Treebank 2.0
- Proposal of a complex proper noun annotation within the Prague Dependency Treebank 2.0
- Final remarks

Introduction
- proper nouns
  - lacking a generic meaning
  - denoting individuals, institutions etc.
  - identifying them as unique items
- proper nouns in NLP
  - question answering
  - information extraction
  - machine translation
    - pan Zelený should not be translated into Mr Green
    - Frankfurt am Main or Frankfurt nad Mohanem, but not a combination of both (e.g., Frankfurt nad Main)
- explicit annotation of proper nouns needed

Proper nouns in corpora of Czech: current state
- two large corpora of Czech as sources of proper nouns:
  - SYN2000
    - ... million tokens
    - morphological annotation: morphological lemmas and positional tags
    - no explicit annotation of proper nouns
  - Prague Dependency Treebank 2.0 (PDT 2.0)
    - morphologically and syntactically annotated
    - very basic annotation of proper nouns
      - at the morphological layer
      - at the deep-syntactic (tectogrammatical) layer

Proper nouns in SYN2000
- proper nouns were not marked
- other characteristics used for searching for proper nouns
  - capitalization
    - only proper nouns are capitalized in Czech (in comparison, e.g., to German)
    - however, it is not a sufficiently distinctive feature (sentence beginnings are capitalized too)
  - context patterns
    - for instance, Mr Xxx / President Xxx
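The capitalization heuristic can be sketched as follows. This is only an illustration of the idea on the slide, not the procedure actually used for SYN2000; the function name and the sample sentence are hypothetical, and the input is assumed to be one already tokenized sentence.

```python
def capitalized_candidates(sentence_tokens):
    """Return tokens that start with an uppercase letter, skipping the
    sentence-initial token, whose capitalization is not distinctive."""
    return [
        token
        for position, token in enumerate(sentence_tokens)
        if position > 0 and token[:1].isupper()
    ]


# Hypothetical example sentence: "Yesterday Mr Zelený arrived in Prague."
tokens = ["Včera", "přijel", "pan", "Zelený", "do", "Prahy", "."]
print(capitalized_candidates(tokens))
```

Skipping the first token trades recall for precision: a proper noun that opens a sentence is missed, which is one reason the slide pairs capitalization with context patterns.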

Searching SYN2000 for | Query | Number of occurrences in SYN2000 | Precision in 500 randomly selected occurrences (in %)
names/surnames (or their parts) | lemmas pan/paní/slečna (Mr/Mrs/Miss) followed by a capitalized token | 41, |
names/surnames (or their parts) | (un)capitalized short versions of Czech academic titles doc./dr./ing./JUDr./MUDr./prof./RNDr. followed by a capitalized token | 26, |
names/surnames (or their parts) | lemmas of academic titles doktor/profesor/docent/inženýr (doctor/professor/docent/engineer) followed by a capitalized token | 9, |
town names (or their parts) | digit combination corresponding to Czech zip code format followed by a capitalized token | 7, |
street/square names (or their parts) | lemmas ulice/náměstí (street/square) followed by a capitalized token | 6, |
street names (or their parts) | (un)capitalized abbreviation ul. (for street) followed by a capitalized token | |
company names (or their parts) | abbreviation s.r.o. (for Ltd.) preceded by a capitalized token (and optionally by a comma) | 2, |
company names (or their parts) | abbreviation a.s. (for PLC) preceded by a capitalized token (and optionally by a comma) | 4, |
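Two of the query shapes in the table (a trigger lemma followed by a capitalized token, and an abbreviation preceded by a capitalized token and an optional comma) might be approximated as below. This is a sketch over a list of (form, lemma) pairs, not the corpus query language actually used; the function names, the sample sentence and the company name "Agro" are invented for illustration.

```python
TITLE_LEMMAS = {"pan", "paní", "slečna"}  # Mr / Mrs / Miss, as in the table


def capitalized_after(tokens, trigger_lemmas):
    """tokens: list of (form, lemma). Return forms of capitalized tokens
    that immediately follow a token whose lemma is a trigger."""
    hits = []
    for (_, lemma), (next_form, _) in zip(tokens, tokens[1:]):
        if lemma in trigger_lemmas and next_form[:1].isupper():
            hits.append(next_form)
    return hits


def capitalized_before(tokens, abbreviation):
    """Return forms of capitalized tokens that precede the abbreviation,
    optionally separated from it by a comma."""
    hits = []
    for index, (form, _) in enumerate(tokens):
        if form != abbreviation:
            continue
        previous = index - 1
        if previous >= 0 and tokens[previous][0] == ",":
            previous -= 1  # skip the optional comma
        if previous >= 0 and tokens[previous][0][:1].isupper():
            hits.append(tokens[previous][0])
    return hits


# Invented sentence: "With Mr Zelený negotiated the company Agro, s.r.o."
sample = [
    ("S", "s"), ("panem", "pan"), ("Zeleným", "Zelený"),
    ("jednala", "jednat"), ("firma", "firma"),
    ("Agro", "Agro"), (",", ","), ("s.r.o.", "s.r.o."),
]
print(capitalized_after(sample, TITLE_LEMMAS))
print(capitalized_before(sample, "s.r.o."))
```

Matching on lemmas rather than word forms lets one pattern cover all case forms of pan/paní/slečna, which matters in an inflected language such as Czech.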

Proper nouns in PDT 2.0
- basic annotation of proper nouns
  - at the morphological layer
    - each token was assigned a morphological lemma and a positional tag
    - lemma flag for marking of proper nouns
  - at the tectogrammatical layer
    - each sentence represented by a labeled dependency tree structure (consisting of nodes and edges)
    - special means for annotation of selected phenomena concerning proper nouns

PDT 2.0: Morphological layer
- proper noun type indicated by a value of a special flag
  - attached to lemmas of proper nouns by a separator: Jan_;Y, Zelený_;S
- seven flag values
  - first names, surnames, inhabitant names, geographical names, institution names, product names, other names
- convenient for annotation of one-word proper nouns
- insufficient for more complex proper nouns
  - misinterpretations:
    - Frankfurt_;G nad Mohanem_;G
    - Vysoký_;K škola ekonomická (University of Economics)
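A minimal sketch of reading the lemma flag, assuming the flag is the single letter after the `_;` separator as in the examples above. Only the four letters that appear on the slide are mapped, and their pairing with type names is inferred from those examples rather than taken from the PDT documentation; the function names are hypothetical.

```python
# Letter-to-type pairing inferred from the slide's examples (an assumption).
FLAG_TYPES = {
    "Y": "first name",         # Jan_;Y
    "S": "surname",            # Zelený_;S
    "G": "geographical name",  # Frankfurt_;G
    "K": "institution name",   # Vysoký_;K
}


def split_lemma(annotated):
    """Split 'Jan_;Y' into ('Jan', 'Y'); lemmas without a flag get None."""
    lemma, _, flag = annotated.partition("_;")
    return lemma, (flag or None)


def proper_noun_type(annotated):
    """Return the type name for a flagged lemma, or None if unflagged/unknown."""
    _, flag = split_lemma(annotated)
    return FLAG_TYPES.get(flag)


print(split_lemma("Jan_;Y"))
print([proper_noun_type(t) for t in "Frankfurt_;G nad Mohanem_;G".split()])
```

The second call illustrates the limitation named on the slide: each token carries its own flag, so nothing marks Frankfurt nad Mohanem as one name.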

PDT 2.0: Tectogrammatical layer
- no complex annotation of proper nouns
- annotation means for selected phenomena only
  - person names: node attribute is_name_of_person
  - non-inflected street names, book titles etc. accompanied by a generic noun: functor ID
  - book titles etc. which have the form of a prepositional group and are not accompanied by a generic noun: an 'artificial' node with lemma #Idph
- besides these individual cases, proper nouns were treated as common parts of a sentence

(a) person name: Klára Nováková Malá
(b) V sobotu v poledne je hezký film (lit.: 'On Saturday at Noon' is a nice film)
(c) Šli jsme ulicí Spálená (We walked through the street.instr Spálená.nom)
(d) Šli jsme ulicí Spálenou (We walked through the street.instr Spálená.instr)
(e) Šli jsme Spálenou (We walked through Spálená.instr)
(instr for instrumental case, nom for nominative case)
[Figure: panels (a) to (e)]

Proposal of a complex proper noun annotation within PDT 2.0
- proper noun type defined at each proper noun
  - proper noun classification
- annotation of one-word proper nouns as well as more complicated proper noun structures
  - four structure types to be annotated
- the inner structure of more complex proper nouns described as a non-dependency relation
- tectogrammatical layer

Proper noun classification for Czech
- two-level classification
  - 1st level: five super-types of proper nouns
    - personal names, geographical names, institution names, artefact names, media names
    - (+ two more types: temporal expressions, numerical expressions occurring in postal addresses)
  - 2nd level: proper noun types
    - e.g., types of geographical names: street/square names, city/town names, state names etc.
- underspecification allowed
- each type encoded by a unique two-character tag
  - gs for street/square names, gu for city/town names
  - g_ for a geographical name of an unknown type
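The two-character tags can be decoded along these lines. The sketch assumes that the first character names the super-type and that an underscore in second position means the type is underspecified, which is how the three tags on the slide read; only those three tags are covered, and the function and table names are hypothetical.

```python
# Only the tags shown on the slide; tags for other super-types are not given there.
SUPER_TYPES = {"g": "geographical name"}
TYPES = {"gs": "street/square name", "gu": "city/town name"}


def decode_tag(tag):
    """Return (super_type, type) for a two-character tag.
    The type is None when underspecified ('_') or not in the table."""
    if len(tag) != 2:
        raise ValueError(f"expected a two-character tag, got {tag!r}")
    super_type = SUPER_TYPES.get(tag[0])
    if tag[1] == "_":
        return super_type, None
    return super_type, TYPES.get(tag)


for tag in ("gs", "gu", "g_"):
    print(tag, decode_tag(tag))
```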

Structure types to be annotated
(i) one-word proper nouns: John
(ii) multi-word proper noun expressions: Vysoká škola ekonomická (University of Economics)
(iii) complex proper noun expressions: Frankfurt nad Mohanem
(iv) containers: Jan Zelený

(i) Annotation of one-word proper nouns
- proper noun type indicated at each proper noun
- new node attribute: NE_roles
  - value set corresponds to all proper noun type tags (and container tags)
  - substitutes the current is_name_of_person attribute

(ii) Annotation of multi-word proper noun expressions
- every constituent of a multi-word proper noun expression has a node of its own
- at all nodes, the same value of the NE_roles attribute occurs
- edges in the sub-tree labeled with a new functor NEPART
- syntactic function of the whole expression indicated by the functor of the governing node
- example: Vyučuje na Vysoké škole ekonomické (He teaches at University of Economics)
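The constraints for a multi-word expression can be written down as a small check over a toy tree. This is a sketch, not the PDT data format: the `Node` class is hypothetical, the tag `i_` is a placeholder for an institution tag that the slides do not give, and `LOC` as the functor of the governing node is an assumption about the example sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lemma: str
    functor: str
    ne_roles: str = ""
    children: List["Node"] = field(default_factory=list)


def descendants(node):
    """Yield every node below the given one."""
    for child in node.children:
        yield child
        yield from descendants(child)


def is_multiword_proper_noun(root):
    """True if all nodes share the root's NE_roles value and every edge
    below the root carries the NEPART functor."""
    below = list(descendants(root))
    return bool(root.ne_roles) and bool(below) and all(
        n.functor == "NEPART" and n.ne_roles == root.ne_roles for n in below
    )


# Vysoká škola ekonomická; the governing node keeps the syntactic functor.
vse = Node("škola", "LOC", "i_", [
    Node("vysoký", "NEPART", "i_"),
    Node("ekonomický", "NEPART", "i_"),
])
print(is_multiword_proper_noun(vse))
```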

(iii) Annotation of complex proper noun expressions
- every constituent has a node of its own
- a main part (Frankfurt) and an embedded part (Mohan)
- type of the embedded part indicated by the value of the NE_roles attribute at the embedded part, type of the whole expression at the main part
- relation between the main and the embedded part labeled with the NEPART functor
- example: Navštívil Frankfurt nad Mohanem (He visited Frankfurt am Main)

(iv) Annotation of containers
- the #Idph node as the governing node of the whole container
- container type indicated by the value of the NE_roles attribute at the #Idph node
- proper noun types of the constituents defined by the values of their own NE_roles attributes
- relations between the #Idph node and the constituents labeled with the NEPART functor
- example: Novým ředitelem je Jan Zelený (Jan Zelený is the new director)
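A container can be pictured as a small tree headed by an artificial #Idph node. The sketch below is illustrative only: the `Node` class and helper names are hypothetical, the tags `P`, `pf` and `ps` are placeholders for a person container, a first name and a surname (the slides do not list these tags), and `ACT` as the functor of the whole name is an assumption about the example sentence.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    lemma: str
    functor: str
    ne_roles: str = ""
    children: List["Node"] = field(default_factory=list)


def make_container(container_tag, functor, parts):
    """Build a container: an #Idph node carrying the container type,
    with one NEPART child per constituent, each carrying its own type.
    parts is a list of (lemma, tag) pairs."""
    return Node(
        "#Idph",
        functor,
        container_tag,
        [Node(lemma, "NEPART", tag) for lemma, tag in parts],
    )


def constituent_types(container):
    """Map each constituent lemma to its own NE_roles value."""
    return {child.lemma: child.ne_roles for child in container.children}


jan_zeleny = make_container("P", "ACT", [("Jan", "pf"), ("Zelený", "ps")])
print(jan_zeleny.lemma, jan_zeleny.ne_roles)
print(constituent_types(jan_zeleny))
```

Unlike the multi-word case, the constituents here keep distinct types, so the type of the whole name has to live on the extra governing node.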

Final remarks
- annotation of proper nouns in corpora
  - linguistic research
  - NLP subtasks
- complex proper noun annotation within PDT 2.0
  - tectogrammatical layer more convenient than the morphological one
  - annotation means and rules proposed
- future work
  - further elaborate the proposed means and rules
  - manual annotation of sample data
  - development of automatic annotation tools