The Sketch Engine for Dutch with the ANW corpus
Carole Tiberius


Outline
The Algemeen Nederlands Woordenboek
–Main features
–The ANW corpus
The Sketch Engine
–Background
–Word Sketches for Dutch

The ANW dictionary
Online scholarly dictionary
Contemporary standard Dutch in the Netherlands and Flanders
General (mainly written) language
Period:
Size: main entries and subentries
Users: from laypeople to professionals
Not a clone of an existing printed dictionary
Semasiological and onomasiological
Modular editing and publication
Corpus-based

ANW: main content features
Example word: sportveld
meaning
grammar
morphology and compounding
spelling
combinations; collocations
multimedia

ANW Corpus
Compiled from:
–Electronic texts already available at the INL
–Internet
–Scanning
Subcorpora:
Corpus of domains           32 million tokens
Corpus of literary texts    20 million tokens
Newspaper corpus            40 million tokens
Corpus of neologisms        5.5 million tokens
Pluscorpus                  5 million tokens
Total                       102.5 million tokens
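
As a quick check on the figures above, the following sketch sums the subcorpus sizes and derives each subcorpus's share of the total. The sizes come from the slide; the function names are illustrative only.

```python
# Subcorpus sizes in millions of tokens, as listed on the slide.
SUBCORPUS_SIZES = {
    "Corpus of domains": 32.0,
    "Corpus of literary texts": 20.0,
    "Newspaper corpus": 40.0,
    "Corpus of neologisms": 5.5,
    "Pluscorpus": 5.0,
}


def total_size(sizes):
    """Total corpus size in millions of tokens."""
    return sum(sizes.values())


def shares(sizes):
    """Percentage of the whole corpus taken up by each subcorpus."""
    total = total_size(sizes)
    return {name: round(100 * size / total, 1) for name, size in sizes.items()}


if __name__ == "__main__":
    print(total_size(SUBCORPUS_SIZES))
    for name, pct in shares(SUBCORPUS_SIZES).items():
        print(f"{name}: {pct}%")
```

The total agrees with the 102.5 million tokens reported on the slide; newspaper text is the largest single component.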

Corpus preparation
Conversion to vertical format: word-form, tag, lempos
–Inclusion of a tag for punctuation
–Removal of duplicate texts
Conversion to UTF-8
More uniform document headers
–subcorpus; ID; variant; dates etc.
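
The preparation steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the INL pipeline: the tag strings, the `PUNCT` tag, the header values and the function names are all hypothetical, and the lempos is assumed to be the lemma plus a hyphen and a lower-case part-of-speech letter.

```python
import hashlib
from xml.sax.saxutils import quoteattr


def to_vertical(doc_attrs, tokens):
    """Render one document: <doc ...> header, one token per line, </doc>.

    tokens is a list of (word_form, tag, lemma) triples.
    """
    attrs = " ".join(f"{key}={quoteattr(str(value))}" for key, value in doc_attrs.items())
    lines = [f"<doc {attrs}>"]
    for word, tag, lemma in tokens:
        # Assumed lempos convention: lemma + "-" + first letter of the tag.
        lempos = f"{lemma}-{tag[0].lower()}"
        lines.append("\t".join((word, tag, lempos)))
    lines.append("</doc>")
    return "\n".join(lines)


def drop_duplicate_texts(docs):
    """Keep only the first document for each distinct token sequence."""
    seen = set()
    unique = []
    for attrs, tokens in docs:
        text = " ".join(word for word, _, _ in tokens)
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((attrs, tokens))
    return unique


if __name__ == "__main__":
    sentence = [("De", "T.art", "de"), ("man", "N.com", "man"), (".", "PUNCT", ".")]
    docs = [
        ({"id": "d1", "subcorpus": "newspaper"}, sentence),
        ({"id": "d2", "subcorpus": "newspaper"}, sentence),  # same text twice
    ]
    for attrs, tokens in drop_duplicate_texts(docs):
        print(to_vertical(attrs, tokens))
```

Exact-match hashing only catches verbatim repeats; near-duplicates would need a fuzzier comparison, which the slide does not describe.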

Changes to the editor
The ANW editor was adapted so that lexicographers can automatically copy examples, together with their source information, from the Sketch Engine into the editor.

ANW Grammatical Relations for nouns
object-of
with ‘dat’ (that)-compl
subject-of
with wh-compl
with auxiliary
with ‘of’ (whether)-compl
premodifying adjective
with ‘alsof’ (as if)-compl
premodifying present participle
with demonstrative pronoun
premodifying past participle
with possessive pronoun
with infinitive plus ‘om te’
with PP
in PP
with indefinite pronoun
with personal pronoun
premodifying noun
premodifying genitive
postmodifying noun
postmodifying genitive
premodifying numeral
with proper noun
postmodifying numeral
with article
postmodifying adjective
with coordinated noun
with infinitive plus ‘te’
other

Dutch Sketch Grammar
Geared completely towards the ANW requirements
Covers ± 50 of the 70 relations
Types of relations:
–Symmetric (e.g. and/or)
–Trinary (e.g. headword + pp + noun)
–Dual (e.g. adj + headword)
–Unary (e.g. + relative clause – dat)

Specific problems for Dutch
Verb-subject and verb-object relations are hard to identify because word order is not a reliable cue, e.g.
BOONEN zou Voigt in de sprint geklopt hebben
Boonen would Voigt in the sprint beaten have
‘Boonen would have beaten Voigt in the sprint.’
VOIGT zou Boonen in de sprint geklopt hebben
Voigt would Boonen in the sprint beaten have
‘Voigt would have beaten Boonen in the sprint.’
(Bouma 2008:20)

Sketch Grammar rules
*DUAL
=object/object_of
# hij ziet de man / hij heeft de man gezien (he sees the man / he has seen the man)
"P.*pers.*nom.*" 1:"V.*mai.*" [tag="[TDMRA].*"]{0,3} 2:"N.*" [tag!="N.*" & tag!="S.pre.*"]
"P.*pers.*nom.*" "V.*aux.*" [tag="[TDMRA].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
# gisteren zag Piet Jan (yesterday Piet saw Jan)
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[TDMRA].*"]{0,3} "N.*" [tag="[TDMRA].*"]{0,3} 2:"N.*"
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[TDMRA].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
# omdat Piet Jan ziet (because Piet sees Jan)
"C.*sub.*" [tag="[TDMRA].*"]{0,3} "N.*" [tag="[TDMRA].*"]{0,3} 2:"N.*" 1:"V.*mai.*"
*DUAL
=subject/subject_of
# gisteren zag Piet Jan (yesterday Piet saw Jan)
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] 1:"V.*mai.*" [tag="[TDMRA].*"]{0,3} 2:"N.*" [tag="[TDMRA].*"]{0,3} "N.*"
[word="(gisteren|morgen|vanmorgen|vandaag|jaar)"] [tag="V.*aux.*"] [tag="[TDMRA].*"]{0,3} 2:"N.*" [tag="[TDMRA].*"]{0,3} "N.*" 1:"V.*mai.*"
# omdat Piet Jan ziet (because Piet sees Jan)
[(word="omdat" | word="dat") & tag="C.*sub.*"] [tag="[TDMRA].*"]{0,3} 2:"N.*" [tag="[TDMRA].*"]{0,3} "N.*" 1:"V.*mai.*"
# gepleegd door de moordenaar (committed by the murderer)
1:"V.*mai.*part.*past.*" [word="door"] [tag="[TDMRA].*"]{0,3} 2:"N.*"
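
To make the mechanics of such a rule concrete, here is a small Python sketch that applies a simplified version of the subordinate-clause object rule (omdat Piet Jan ziet) to a tagged sentence. It is an illustration only, not how the Sketch Engine evaluates grammars: the matcher, the tag strings in the example and the function names are assumptions, and it reports word forms where a real word sketch would use lemmas. The labels 1 and 2 mark the two words that enter the relation, and a dual relation is recorded in both directions.

```python
import re

# Each element: (label, tag regex, min repeats, max repeats).
# Loosely mirrors: subordinator, optional material, subject noun,
# optional material, 2: object noun, 1: main verb.
OBJECT_IN_SUBCLAUSE = [
    (None, r"C.*sub.*", 1, 1),
    (None, r"[TDMRA].*", 0, 3),   # tags starting with T, D, M, R or A
    (None, r"N.*", 1, 1),
    (None, r"[TDMRA].*", 0, 3),
    ("2", r"N.*", 1, 1),
    ("1", r"V.*mai.*", 1, 1),
]


def match_at(tokens, start, pattern):
    """Match pattern against tokens[start:]; return {label: word} or None."""
    def walk(pos, idx, labels):
        if idx == len(pattern):
            return labels
        label, regex, lo, hi = pattern[idx]
        # Try every allowed repetition count, shortest first.
        for n in range(lo, hi + 1):
            span = tokens[pos:pos + n]
            if len(span) < n or not all(re.fullmatch(regex, tag) for _, tag in span):
                break  # a longer span cannot match either
            found = dict(labels)
            if label and span:
                found[label] = span[-1][0]
            result = walk(pos + n, idx + 1, found)
            if result is not None:
                return result
        return None

    return walk(start, 0, {})


def extract_dual(tokens, pattern, name, inverse):
    """Collect (headword, relation, collocate) triples in both directions."""
    triples = []
    for start in range(len(tokens)):
        labels = match_at(tokens, start, pattern)
        if labels and "1" in labels and "2" in labels:
            triples.append((labels["1"], name, labels["2"]))
            triples.append((labels["2"], inverse, labels["1"]))
    return triples


if __name__ == "__main__":
    # Hypothetical tags, chosen only to fit the regexes above.
    sentence = [("omdat", "C.sub"), ("Piet", "N.prop"), ("Jan", "N.prop"), ("ziet", "V.mai.fin")]
    print(extract_dual(sentence, OBJECT_IN_SUBCLAUSE, "object", "object_of"))
```

Because the rule relies purely on position, swapping the two names in the input makes Piet the object, which is exactly the word-order assumption the Boonen/Voigt examples call into question.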

Specific problems for Dutch
Separable verbs, e.g.
Hij at een hele boterham op (from ‘opeten’)
He ate a whole sandwich up
‘He ate a whole sandwich’
omdat hij een hele boterham op heeft gegeten
because he a whole sandwich up has eaten
‘because he has eaten a whole sandwich’

Sketch Grammar rules =bijw+WW # separable verbs "N.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "N.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*" "A.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "N.*partpast.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*" "V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "V.*mai.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*" "N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "N.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*" "A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "A.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*" "V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" "V.*aux.*"{0,3} 1:"V.*mai.*inf.*" "V.*mai.*" "C.*sub.*" 2:"S.*nonpre.*" "U.*infmark.*" 1:"V.*mai.*inf.*"
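
The core of these rules (a non-prepositional particle, the infinitive marker, up to three auxiliaries, then a main-verb infinitive) can be approximated in a few lines of Python. This is a simplified sketch under assumptions: it ignores the left-context element that each rule above requires, the tag strings in the example are hypothetical, and gluing particle and verb lemma together to get the separable verb's lemma is an illustration rather than something the slides describe.

```python
import re


def find_separable_verbs(tokens):
    """Find 'particle + infinitive marker + [aux]{0,3} + main infinitive'.

    tokens is a list of (word_form, tag, lemma) triples. Returns a list of
    (particle lemma, verb lemma, combined lemma) tuples.
    """
    hits = []
    for i, (_, tag, particle) in enumerate(tokens):
        if not re.fullmatch(r"S.*nonpre.*", tag):
            continue
        j = i + 1
        if j >= len(tokens) or not re.fullmatch(r"U.*infmark.*", tokens[j][1]):
            continue
        j += 1
        # Skip at most three auxiliaries between the marker and the main verb.
        skipped = 0
        while j < len(tokens) and skipped < 3 and re.fullmatch(r"V.*aux.*", tokens[j][1]):
            j += 1
            skipped += 1
        if j < len(tokens) and re.fullmatch(r"V.*mai.*inf.*", tokens[j][1]):
            verb = tokens[j][2]
            hits.append((particle, verb, particle + verb))
    return hits


if __name__ == "__main__":
    # "een boterham op te eten" (to eat up a sandwich); hypothetical tags.
    phrase = [
        ("een", "T.art", "een"),
        ("boterham", "N.com", "boterham"),
        ("op", "S.nonpre", "op"),
        ("te", "U.infmark", "te"),
        ("eten", "V.mai.inf", "eten"),
    ]
    print(find_separable_verbs(phrase))
```

Pairing the particle (position 2) with the verb (position 1) is what lets a word sketch link op back to eten even though the two are written apart.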

Subcorpora
Within the ANW corpus, 7 subcorpora were defined:
–Belgian Dutch
–Dutch Dutch
–Corpus Literary Texts
–Domain-dependent Texts
–Newspaper Texts
–Neologisms
–Pluscorpus
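
One way to picture how the seven subcorpora relate to the document headers is the sketch below. It assumes that each document header carries a `variant` value (BelgianDutch or DutchDutch) and a `subcorpus` value naming its text collection; the attribute values, token counts and function names are hypothetical, and this is not the Sketch Engine's own subcorpus definition format.

```python
def assign_subcorpora(doc_header):
    """Return the names of the subcorpora a document belongs to."""
    names = []
    variant = doc_header.get("variant")
    if variant in ("BelgianDutch", "DutchDutch"):
        names.append(variant)
    collection = doc_header.get("subcorpus")
    if collection:
        names.append(collection)
    return names


def subcorpus_sizes(docs):
    """Sum token counts per subcorpus for (header, token_count) pairs."""
    sizes = {}
    for header, token_count in docs:
        for name in assign_subcorpora(header):
            sizes[name] = sizes.get(name, 0) + token_count
    return sizes


if __name__ == "__main__":
    docs = [
        ({"variant": "BelgianDutch", "subcorpus": "Newspaper Texts"}, 1200),
        ({"variant": "DutchDutch", "subcorpus": "Newspaper Texts"}, 800),
        ({"variant": "DutchDutch", "subcorpus": "Neologisms"}, 300),
    ]
    print(subcorpus_sizes(docs))
```

A document can fall into two subcorpora at once, one by language variety and one by text collection, so the seven subcorpora overlap rather than partition the corpus.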

Language variety: BelgianDutch

Language variety: DutchDutch

Wish list / Questions
Fixed order of display
Efficient handling of different tag sets
Correct display of unary relations
Possible formats of dates in document headers
Use of morphological information in Sketch Engine