Dictionary graphs Duško Vitas University of Belgrade, Faculty of Mathematics.



Dictionaries of a text
• The words in the text that are not found in the dictionaries are usually called “unknown words” (it is better to call them “unrecognized words”).
• They are recorded in a file err in the text folder.

What are unrecognized words
• Proper names such as Gluck, Goethe, Gohr, Glindebourne...
• Acronyms such as GMBH, GmbH, GNP...
• Occasional elements such as Goallllll,... or, in Bulgarian, наздравеее!
• Typographic errors
• Derivational elements, as in Serbian aviotransport, osmostruki, devedesetodnevni... but also 28-godišnji, 1.5%-tni...
• Words from other languages, as in Serbian texts offshore, tabacum,...
• ...

Dictionary graphs
Dictionary graphs are transducers that, when applied to search for a pattern in a text (the Locate pattern option) in Merge mode, produce sequences that are valid DELAF entries.

Problem
Is it possible to approximate an unrecognized word on the basis of its structure (that is, of elements already in the e-dictionaries)? The text contains words that are listed in the err file.

The first approximation
Recognize any sequence of upper-case letters. The graph's name is Acr+.grf (lower priority).

If the compiled graph Acr+.fst2 is put in the directory DELA (which contains the dictionaries), then the forms recognized by the graph will be listed among the recognized words!
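The first approximation can be imitated outside the graph formalism. The following is only a sketch in Python, not the Acr+.grf graph itself: the function names are hypothetical, and the codes ABB+Acr in the output are an illustrative assumption, since the slide does not show the output of the graph.

```python
def is_all_upper(token: str) -> bool:
    """True if the token is a non-empty run of upper-case letters only.

    Assumption: "letters in upper case" covers any alphabet, so Latin and
    Cyrillic acronyms are treated alike.
    """
    return token.isalpha() and token.isupper()


def approximate_acronyms(err_words):
    """Build DELAF-style lines (form,lemma.codes) for words from err that
    look like acronyms. The codes ABB+Acr are illustrative only."""
    return [f"{w},{w}.ABB+Acr" for w in err_words if is_all_upper(w)]


if __name__ == "__main__":
    for line in approximate_acronyms(["GMBH", "GmbH", "Goethe", "GNP"]):
        print(line)
```

Only GMBH and GNP pass the test; GmbH contains lower-case letters and is left for other graphs.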

Proper Names
Any simple word with a capitalized first letter (NProp+.grf).
Dictionary graphs can use the results of previously applied dictionaries. In fact, a dictionary graph can be given a lower priority, and it is then applied only to simple word forms that the standard dictionaries did not cover.
This graph tags as nouns all simple word forms with an upper-case initial that are not in the dictionary of simple forms. These words receive the semantic tags +NProp (a proper name) and +Unknown (of unknown kind).
Green brackets define a context (discussed later).
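The effect of lower priority can be sketched as a filter that runs only on forms the standard dictionaries missed. This is a hedged illustration in Python, not the NProp+.grf graph: the function name and the set simple_forms are hypothetical, and using the form itself as the lemma and N as the noun code are assumptions.

```python
def tag_proper_names(tokens, simple_forms):
    """Return DELAF-style lines for capitalized simple words that the
    standard dictionaries (represented by the set simple_forms) do not cover."""
    entries = []
    for tok in tokens:
        if tok in simple_forms:
            continue  # lower priority: forms already covered are skipped
        if tok.isalpha() and tok[:1].isupper():
            # Noun with the semantic tags named on the slide
            entries.append(f"{tok},{tok}.N+NProp+Unknown")
    return entries


if __name__ == "__main__":
    known = {"Beograd"}  # toy stand-in for the dictionary of simple forms
    print(tag_proper_names(["Gluck", "goallll", "Beograd", "GmbH"], known))
```

Beograd is skipped because the toy dictionary already covers it, and goallll is skipped because it has no upper-case initial.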

Other advantages of dictionary graphs

Priority
The form GmbH corresponds to the pattern for proper names (NProp), and not to the pattern for acronyms (Acr), so it will be marked as a proper name.
For Serbian, there is a separate dictionary of acronyms, so GMBH is tagged twice:
• As an acronym, according to the graph Acr+.grf
• As a line from the DELAF-type dictionary

Forcing case
• One of the advantages of these transducers is that they can use quotation marks to force case.
• One example of this is the recognition of chemical elements. For instance, “Na” will recognize only Na, while the pattern Na recognizes both Na and NA. No such possibility exists in normal dictionaries.
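The difference between the quoted and the unquoted pattern can be written down as a small matching rule. The sketch below is an assumption that fits the Na example: a lower-case letter in an unquoted pattern accepts either case in the text, an upper-case letter accepts only upper case, and a pattern in straight double quotes must match exactly. The function names are hypothetical.

```python
def letter_matches(p: str, c: str) -> bool:
    """Lower-case pattern letters accept both cases; others must be identical."""
    if p.islower():
        return c.lower() == p
    return c == p


def matches(pattern: str, token: str) -> bool:
    """Match a token against a pattern, honouring quotation marks that force case."""
    if len(pattern) >= 2 and pattern[0] == '"' and pattern[-1] == '"':
        return token == pattern[1:-1]  # quoted: exact match only
    return len(pattern) == len(token) and all(
        letter_matches(p, c) for p, c in zip(pattern, token)
    )


if __name__ == "__main__":
    for tok in ("Na", "NA", "na"):
        print(tok, matches("Na", tok), matches('"Na"', tok))
```

Under this rule the unquoted pattern Na accepts Na and NA, while the quoted pattern accepts only Na.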

An example of a dictionary graph that recognizes some chemical elements
This graph recognizes the symbols of the chemical elements sodium, potassium, lithium, etc. and assigns them the PoS ABB (abbreviation), with the addition of the semantic marker +ChemElem.
It has the same effect (except for forcing the upper-case initial) as a line in a DELAF dictionary:
Na,.ABB+ChemElem

One dictionary graph – compound interjections
• Dictionary graphs can recognize as one unit something that consists of several components that can combine in a more or less free fashion.
• Why can't we use the usual dictionary lemmas for this?
• Because we don't know how many repetitions there can be.
• This graph covers only repetitions of components separated by a space or a hyphen, and not cases like Aaaaah.
This graph recognizes compound interjections

Application of dictionary graphs
• They can be given a lower priority if a plus sign + is added to their name. This means that they are applied only to unknown words (the content of err after the regular dictionaries have been applied).
• Compile them and include the resulting .fst2 files in the list of dictionaries that are applied to a text.
• Recognized sequences with the corresponding output will become the content of the DLC of the analyzed text.
• For instance, the line in the DLC for one of the recognized interjections is: Sx-sx-sx-sx,.INT+C
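An unbounded number of repetitions is exactly what a regular expression handles and a finite list of lemmas does not. The sketch below is a Python stand-in for the interjection graph, not the graph itself: the component list COMPONENTS is hypothetical, and the output format simply copies the DLC line shown above.

```python
import re

# Hypothetical inventory of interjection components; the real graph's
# inventory is not given in the slides.
COMPONENTS = ("sx", "ha", "he")

_component = "|".join(re.escape(c) for c in COMPONENTS)
# One component, then one or more further components, each preceded by a
# single space or hyphen. Forms without separators (Aaaaah) are not covered.
COMPOUND_INTERJECTION = re.compile(
    rf"\b(?:{_component})(?:[ -](?:{_component}))+\b", re.IGNORECASE
)


def interjection_lines(text: str):
    """Return DLC-style lines for compound interjections found in the text."""
    return [f"{m.group(0)},.INT+C" for m in COMPOUND_INTERJECTION.finditer(text)]


if __name__ == "__main__":
    print(interjection_lines("Sx-sx-sx-sx, rekao je. Ha ha ha!"))
```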

Dictionary graphs that use morphological filters
• Dictionary graphs can use morphological filters; in fact, they can use anything that syntactic graphs can use.
• This graph recognizes interjections in which some letters repeat several times.
• What is recognized in the text with a lexical mask ?
• What is recognized in the text with a lexical mask ?

• The file err contains: goal, goallll and nazdraveee...
• Suppose we produce a DELAF-type dictionary INT.dic that contains the lemmas goal and nazdrave as interjections.
• Applying this dictionary to the text recognizes these two interjections, but not nazdraveee and goallll.

Morphological filter
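The filter used in the graph is not reproduced in this transcript, so the following Python sketch only imitates the idea with an ordinary regular expression: every letter of a known interjection lemma may be repeated, and an elongated form is mapped back to its lemma. The function names and the output code INT are assumptions.

```python
import re

# Lemmas taken from the INT.dic example (goal, nazdrave)
INTERJECTION_LEMMAS = ("goal", "nazdrave")


def stretch_pattern(lemma: str):
    """Regex in which each letter of the lemma occurs one or more times."""
    body = "".join(re.escape(ch) + "+" for ch in lemma)
    return re.compile(f"^{body}$", re.IGNORECASE)


def approximate_interjection(token: str, lemmas=INTERJECTION_LEMMAS):
    """Return a DELAF-style line for an elongated interjection, or None."""
    for lemma in lemmas:
        # Only forms longer than the lemma count as elongations
        if len(token) > len(lemma) and stretch_pattern(lemma).match(token):
            return f"{token},{lemma}.INT"
    return None


if __name__ == "__main__":
    for word in ("goallll", "nazdraveee", "goal", "offshore"):
        print(word, approximate_interjection(word))
```

The plain forms goal and nazdrave are deliberately left to the ordinary dictionary, which already recognizes them.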

More on dictionary graphs
• Recognition of various compounds in which some components are numerals written with digits.
• Lemmas and grammatical categories are assigned to the recognized compounds.
• In this way, correct DELAF entries are obtained.


What does this graph do?
• It recognizes multi-word units that begin with a numeral written with digits (the sub-graph BrojCifre), followed by a hyphen (with no spaces around the hyphen), followed by some form of the adjective minutni.
• The recognized numeral becomes the value of the variable $1, and the separator becomes the value of the variable $2.
• These variables are used in the output of the transducer to form the canonical form (lemma): $1$$2$minutni
• The PoS assigned to the canonical form is A (an adjective), and the additional markers are +PosQ+C.
• Every form of the adjective minutni is followed by its set of codes of grammatical categories.
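The way the variables $1 and $2 build the lemma can be sketched with regular-expression groups. This is an illustration only: \d+ stands in for the sub-graph BrojCifre, the table MINUTNI_FORMS is a hypothetical fragment, and its grammatical codes are placeholders modelled on the codes shown elsewhere in this presentation, not verified codes for minutni.

```python
import re

# Hypothetical fragment of the paradigm of "minutni"; the codes are
# placeholders, not taken from a real dictionary.
MINUTNI_FORMS = {
    "minutni": ["aems4q"],
    "minutnog": ["adms2g", "adns2g"],
}

# group 1 = numeral ($1), group 2 = separator ($2), group 3 = adjective form
COMPOUND = re.compile(r"^(\d+)(-)(\w+)$")


def minutni_entry(token: str):
    """Return a DELAF-style line for forms such as 15-minutnog, or None."""
    m = COMPOUND.match(token)
    if not m or m.group(3) not in MINUTNI_FORMS:
        return None
    lemma = f"{m.group(1)}{m.group(2)}minutni"  # $1$$2$minutni
    codes = "".join(":" + c for c in MINUTNI_FORMS[m.group(3)])
    return f"{token},{lemma}.A+PosQ+C{codes}"


if __name__ == "__main__":
    print(minutni_entry("15-minutnog"))
    print(minutni_entry("15 - minutnog"))  # spaces around the hyphen: no match
```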

What do such dictionary graphs recognize in a text (used as syntactic graphs)?
• The dictionary graph Minutni recognizes in the collection 5izvora: Minutni
• The subordinate graph Razno recognizes various multi-word units formed in a similar way: nouns, adjectives and adverbs.
• The dictionary graph Razno recognizes in the collection 5izvora: various MWU with digits

Dictionary graphs – the second example
• Recognizes as nouns (masculine gender, inanimate) all acronyms followed by a case ending.
• Acronyms are recognized by a morphological filter.
• The recognized acronym becomes the value of the variable $1, which is used as the canonical form in the transducer's output.
• The recognized acronym gets the PoS tag ABB and the additional markers +Acr+Noun+D
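A rough Python counterpart of this second example is given below. Several things in it are assumptions, because the transcript does not show the filter or the graph: the acronym is taken to be two or more upper-case letters, the case ending is assumed to be attached with a hyphen, the list CASE_ENDINGS is hypothetical and incomplete, and the grammatical codes of the individual cases are omitted.

```python
import re

# Hypothetical, incomplete list of case endings
CASE_ENDINGS = ("om", "a", "u", "e")

# group 1 = acronym ($1), group 2 = case ending
ACRONYM_WITH_ENDING = re.compile(
    r"^([A-ZČĆŠŽĐ]{2,})-(" + "|".join(CASE_ENDINGS) + r")$"
)


def acronym_entry(token: str):
    """Return a DELAF-style line whose lemma is the bare acronym, or None."""
    m = ACRONYM_WITH_ENDING.match(token)
    if not m:
        return None
    return f"{token},{m.group(1)}.ABB+Acr+Noun+D"


if __name__ == "__main__":
    for word in ("NATO-a", "UN-om", "Nato-a"):
        print(word, acronym_entry(word))
```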

What do such dictionary graphs recognize in a text (used as syntactic graphs)?
• In the text 5izvora-izvod, the dictionary graph Acr+ retrieves: acronyms
• Attention: in order to obtain this output, the graph has to be applied for location to a text that has not already been processed with it (because of the mask).
• The subordinate graph NaKraju recognizes adjectives, nouns, Roman numerals and various interjections. In the same text it recognizes: at the end

Dictionary graphs – the third example
• Dictionary graphs recognize numerals written with digits, with words, and with combinations of the two.
• They take care of the agreement that various numerals impose.

A sub-graph of a dictionary graph for numerals – BrojSamoSifreJ.grf
• Recognizes all numerals written with digits that end with the digit 1 (but not with 11).
• Includes the recognition of decimal numbers with a decimal comma.
• Includes the recognition of large numbers with digits grouped three by three (separated by a point or a space).
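The three conditions on this slide can be combined in one check. The Python sketch below is not a transcription of the sub-graph; it assumes that for a decimal number the digits after the comma decide the ending, and the names NUMBER and ends_like_one are hypothetical.

```python
import re

# Integer part: either digits grouped three by three (point or space as the
# separator) or a plain run of digits; then an optional decimal comma part.
NUMBER = re.compile(r"^(\d{1,3}(?:[. ]\d{3})+|\d+)(?:,(\d+))?$")


def ends_like_one(token: str) -> bool:
    """True for numerals written with digits that end in 1 but not in 11."""
    m = NUMBER.match(token)
    if not m:
        return False
    integer, fraction = m.groups()
    # Assumption: if there is a decimal part, its last digits decide
    digits = fraction if fraction is not None else re.sub(r"[. ]", "", integer)
    return digits.endswith("1") and not digits.endswith("11")


if __name__ == "__main__":
    for tok in ("21", "11", "1.001", "1 001", "3,21", "3,11", "1.01"):
        print(tok, ends_like_one(tok))
```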

What else does a dictionary graph for the recognition of numerals contain?
• The subordinate graph NoviBrojSlovJ.grf recognizes all numerals that impose the same agreement as the numeral 1 and that can be written with digits, words or their combination.
• The subordinate graph NoviBrojSlovima recognizes all numerals, written in any possible way, with the various types of agreement.

What is recognized by the graph NoviBrojSlovima?
• In the short text 5izvora-izvod it recognizes and tags the following: numerals
• Sub-graphs cannot be used on their own, because they produce many false recognitions: strange recognitions. They are useful only when used together.
• There are other errors; what about them? Other graphs, e.g. for the recognition of dates, will remove them.

A local grammar (a syntactic graph) for recognition of dates
• It is not a dictionary graph.
• It is a transducer that produces XML tags.
• It could become a dictionary graph if we restricted its recognitions to, for instance, adverbial constructions.

One sub-graph of a syntactic graph Datum
• It recognizes a date, precisely or vaguely expressed.

What does this graph recognize?
• In the text 5izvora it recognizes and tags the following: dates
• The tagged text looks like this: XML text

Finally, compounds
• err also contains compounds that can be approximated by the content of DELAF dictionaries when using morphological dictionary graphs (dictionary graphs used in morphological mode).
• One such pattern, A( +-)(N+V+A), can be used for examples such as visokotehnički, visokotehnološki, prvospomenuti, devedesetodnevni,...
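The idea behind this pattern, an adjective followed by a noun, verb or adjective, can be tried on a toy lexicon. The sketch below is plain Python, not a morphological-mode graph: TOY_LEXICON is hypothetical, and because the examples are written as one word, the hyphen between the two parts is assumed to be optional.

```python
# Toy stand-in for DELAF: lower-cased form -> set of PoS codes (hypothetical)
TOY_LEXICON = {
    "visoko": {"A"},
    "tehnički": {"A"},
    "tehnološki": {"A"},
}


def split_compound(word: str, lexicon=TOY_LEXICON):
    """Return all (first, second) splits where the first part is an adjective
    and the second part is a noun, verb or adjective in the lexicon."""
    results = []
    for i in range(1, len(word)):
        first, rest = word[:i], word[i:]
        if rest.startswith("-"):
            rest = rest[1:]  # assumed optional hyphen between the parts
        first_pos = lexicon.get(first.lower(), set())
        rest_pos = lexicon.get(rest.lower(), set())
        if "A" in first_pos and rest_pos & {"N", "V", "A"}:
            results.append((first, rest))
    return results


if __name__ == "__main__":
    for w in ("visokotehnički", "visoko-tehnološki", "offshore"):
        print(w, split_compound(w))
```

A word that cannot be split this way, such as offshore, stays unrecognized.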

One morphological dictionary graph
The graph has / at the beginning as a marker that it is a morphological graph.
Words that are not in the applied dictionaries: ![ ]
Switch to morphological mode
$p$, $a$, $b$, $c$: variables that keep the recognized part of the input
The value of the variable p, followed by the lemma and grammatical code of the variable a, b or c, is produced as the output.

tags.ind
$p$ $a$, $a.LEMMA$ $a.CODE$
{високо,A}{технолошки,технолошки.A+PosQ:aems4q}
{високо,A}{технолошки,технолошки.A+PosQ:aems5g}
{високо,A}{техничког,технички.A+PosQ:adms2g}
{високо,A}{техничког,технички.A+PosQ:adms4v}
{високо,A}{техничког,технички.A+PosQ:adns2g}
{високо,A}{технолошког,технолошки.A+PosQ:adms2g}

Elimination of unrecognized words from err

Thanks!