DeepDict en ny type online-ordbog til leksikografi og undervisning Eckhard Bick Syddansk Universitet & GrammarSoft ApS.

Slides:



Advertisements
Similar presentations
What is a corpus?* A corpus is defined in terms of  form  purpose The word corpus is used to describe a collection of examples of language collected.
Advertisements

What is the Internet? Internet: The Internet, in simplest terms, is the large group of millions of computers around the world that are all connected to.
Semantic Web and Web Mining: Networking with Industry and Academia İsmail Hakkı Toroslu IST EVENT 2006.
Corpus Linguistics and Second Language Acquisition – The use of ACORN in the teaching of Spanish Grammar Guadalupe Ruiz Yepes.
Gender Issues in Systems Design and User Satisfaction for e- testing software Prepared by Sahel AL-Habashneh. Department of Business information systems.
Resources Primary resources – Lexicons, structured vocabularies – Grammars (in widest sense) – Corpora – Treebanks Secondary resources – Designed for a.
Corpora and Language Teaching
IMT530- Organization of Information Resources1 Feedback Like exercises –But want more instructions and feedback on them –Wondering about grading on these.
Research methods in corpus linguistics Xiaofei Lu.
Chapter 3: An Introduction to Corpus Linguistics Compiled by: Sajjad Ghadamyari Farhad Ghiasvand Presentation Date: Dec. 8, Monday.
Albert Gatt LIN 3098 Corpus Linguistics. In this lecture Some more on corpora and grammar Construction Grammar as a theoretical framework Collostructional.
Online Corpora in L2 Writing Class Zawan Al Bulushi Indiana University Bloomington November 15,
FishBase Summary Page about Salmo salar in the standard Language of FishBase (English) ENBI-WP-11: Multilingual Access to European Biodiversity Sites through.
ENG 626 CORPUS APPROACHES TO LANGUAGE STUDIES exploring frequencies in texts Bambang Kaswanti Purwo
GoogleDictionary Paul Nepywoda Alla Rozovskaya. Goal Develop a tool for English that, given a word, will illustrate its usage.
Web software. Two types of web software Browser software – used to search for and view websites. Web development software – used to create webpages/websites.
Tracking Language Development with Learner Corpora Xiaofei Lu CALPER 2010 Summer Workshop July 12, 2010.
Corpora and Concordancers in ESL/EFL Class: Truly Authentic Language for Language Learning. and opening.
인공지능 연구실 황명진 FSNLP Introduction. 2 The beginning Linguistic science 의 4 부분 –Cognitive side of how human acquire, produce, and understand.
Seminar in Applied Corpus Linguistics: Introduction APLNG 597A Xiaofei Lu August 26, 2009.
Computational linguistics A brief overview. Computational Linguistics might be considered as a synonym of automatic processing of natural language, since.
For Monday Read chapter 24, sections 1-3 Homework: –Chapter 23, exercise 8.
For Friday Finish chapter 24 No written homework.
For Monday Read chapter 26 Last Homework –Chapter 23, exercise 7.
Auckland 2012Kilgarriff: NLP and Corpus Processing1 The contribution of NLP: corpus processing.
A process of taking your best guesses. Companies have web sites where you can access your information.
Corpus search What are the most common words in English
For Monday Read chapter 26 Homework: –Chapter 23, exercises 8 and 9.
The University of Illinois System in the CoNLL-2013 Shared Task Alla RozovskayaKai-Wei ChangMark SammonsDan Roth Cognitive Computation Group University.
The PALAVRAS parser and its Linguateca applications - a mutually productive relationship Eckhard Bick University of Southern Denmark
DeepDict A Graphical Corpus-based Dictionary of Word Relations Eckhard Bick University of Southern Denmark & GrammarSoft ApS.
A Graphical Corpus-based Dictionary of Word Relations
WP4 Models and Contents Quality Assessment
Automatic Writing Evaluation
Writing Inspirations, 2017 Aalto University
Criterial features If you have examples of language use by learners (differentiated by L1 etc.) at different levels, you can use that to find the criterial.
The Simple Corpus Tool Martin Weisser Research Center for Linguistics & Applied Linguistics Guangdong University of Foreign Studies
Collecting Written Data
What do these mean? Your time is up Ready for anything (Red E)
Vocabulary Module 2 Activity 5.
Corpus Linguistics Anca Dinu February, 2017.
Measuring Monolinguality
Introduction to Corpus Linguistics
Statistical NLP: Lecture 7
Distributional analysis
CORPUS LINGUISTICS Corpus linguistics is the study of language as expressed in samples (corpora) or "real world" text. An approach to derive at a set of.
Statistical NLP: Lecture 3
Searching corpora.
Computational and Statistical Methods for Corpus Analysis: Overview
Corpus Linguistics I ENG 617
Corpus Linguistics I ENG 617
Constraint Grammar ESSLLI
Writing Inspirations, Spring 2016 Aalto University
Corpora and Concordancers in ESL/EFL Class:
A Statistical Model for Parsing Czech
Error Analysis  What is EA? EA is a technique which aims to describe and explain the systematic nature of deviations or errors generated in the learner’s.
European Network of e-Lexicography
Corpus Linguistics I ENG 617
Topics in Linguistics ENG 331
An ICALL writing support system tunable to varying levels
ICEweb 2 a new way of compiling high-quality web-based components for ICE corpora Martin Weisser Center for Linguistics & Applied Linguistics, Guangdong.
Stylistics and Stylometry
Statistical n-gram David ling.
Using GOLD to Tracking L2 Development
Introduction to Text Analysis
Applied Linguistics Chapter Four: Corpus Linguistics
A User study on Conversational Software
Prime Time Simply the best From online corpora to word clouds
User’s Perspective Laurie Gerber.
Presentation transcript:

DeepDict en ny type online-ordbog til leksikografi og undervisning Eckhard Bick Syddansk Universitet & GrammarSoft ApS

Motivation – the lexicographer's view ● a) lexicography: better data – corpus data have better coverage and better legitimacy than introspection or chance quotes from literature – Quantity: the more the better – Authenticity: real running data from mixed sources (Internet), or at least a source with mixed topics (Wikepedia, news text) – given a corpus, a grammatically annotated one is best for the extraction of lexical patterns and statistics – lexical patterns are best based on linguistic relations (subject, object etc.) rather than mere adjacency in text – even given a corpus that satisfies all of the above, it is cumbersome to search for and quantify lexical patterns – even a statistics-integrating interface like our CorpusEye will only provide data for one pattern at a time

Motivation – the user's view ● b) lexicography: better accessibility – electronic vs. paper: no size limitations, easy searching, “depth-on-demand” (QuickDict vs. DeepDict) – passive (“definitional”) vs. active (“productive-contextual”) – Advanced Learner's Dictionary: information on how to use a word in context – syntactic and semantic restrictions and combinatorial aspects, e.g. A gives x to B (A,B = +HUM, x,y = -HUM) – “live” examples – but how to show, for a given entry word, all constructions and examples, and how not to forget any?

Idea: graphical presentation of lexical complements based on dependency statistics ● annotate a corpus – Peter ate a couple of apples – Cats eat an mice ● generalize by collecting and counting dependency pairs of lemmas (simplified): – PROP_SUBJ -> eat, cat_SUBJ -> eat – apple_ACC -> eat, mouse_ACC -> eat ● present the result in list form – {PROP,cat} SUBJ -> eat <- {apple,mouse} ACC

how to distinguish between typical and non- informative complements? ● use frequency counts for dependency pairs ● normalize for lexical frequency – C * p(a->b) / ( p(a) * p(b) ) ● use thresholds for minimum co-occurence strength and minium absolute number of occurences – ask for strong positive correlation (mutual information) – log 2 frequency classes: 1 (1), 2 (2-4), 3 (5-8), 4 (9-16).... ● for a few special word classes, use generalizations: – PROP/hum (names) – NUM (numbers) ● separate treatment of pronouns (only relative to each other)

the applicational environment ● DeepDict is hosted at where it is part of an integrated suite of translation tools ● it was developed as a spin-off from many years of – a) parser development – b) corpus annotation projects (Corpus Eye at corp.hum.sdu.dk) – c) lexicography and MT research ● The primary languages are the Germanic and Romance languages, but the method is largely language-independent given a lemmatized and dependency annotated corpus in that language

cut-and-paste text URL Tools: - WAP - sms - Firefox plugin - Docs Languages: en, da, no, se, pt, eo

The first dictionary layer: QuickDict mouse-over frequency DeepDict link TL/SL-inflected forms textual choice

The second dictionary layer: DeepDict

DeepDict: nouns

DeepDict: Verb + Prep.

DeepDict: Verb + Complements

The background: 10 dependency parsers

Data production corpus: encoding cleaning sentence separation id-marking raw text: - Wikipedia - newspaper - Internet - Europarl example concordance * DanGra m EngGra m... Comp. lexica CG grammars Dep grammar annotated corpora dep- pair extract or Statistical Database friendly user interface CG I norm frequencies

DeepDict: sources and sizes

How to use DeepDict 1 ● as a lexicographer – find inspiration as to complementation patterns for dictionary entries – find the most typical (not just the most common!) example of a certain construction – find candidates for multi-word expressions – find canditates for metaphorical usage – find semantic distinctions and subsenses not otherwise obvious, and triggered by head or dependent words (e.g. mistænksom – mistænkelig, high – big - large)

How to use DeepDict 2 ● for teaching – create lexical fields ● e.g. edibles, drinkables etc., via 'eat', 'drink') ● a list of languages? ● all about language – find phrasal verbs / prepositional complements – describe usage differences between near synonyms – distinguish between literal and abstract uses (e.g. heavy – tung – schwer)... are some of these cross-language phenomena? – find metaphores (caress_V) – find gender differences through pronouns (caress_V) – find out all about an animal (e.g. horse) – what does it do [as SUBJ], what can you do with it [as OBJ], how is it characterized [by pre- or postnominals]

gramtrans.com

Spin-offs: WebPainter trekanter (trekant) ● live in-line markup of web pages ● mouse-over translations while reading mouse-over translation: optional grammar (here: SUBJ and prep

User feedback (Danish-English on ● One of the best bits of translation software I've seen. ● This is the best translation tool I've found on the web. ● This webpage translation tool is an absolute life saver. You have made my life and my family's life much easier. ● Wow, the future is here. Never seen a machine translator as accurate as this. The next best thing to human translation. ● On a daily basis, you website. ● This site is seriously gosaves my a$$od. ● This is a tool - unparallel. ● This site is awesome. Just started doing business with a Danish company and it comes in very useful. ● It is a fantastic effort... does anything get any better than this?