The PALAVRAS parser and its Linguateca applications - a mutually productive relationship
Eckhard Bick, University of Southern Denmark


Outline
● Flow chart: Linguateca and PALAVRAS
● History and milestones
  – pre- and co-Linguateca
● Parallel development strands
● A recent example of resource integration: QuickDict
  – Público + Folha + Dependency Parsing + Statistics = Lexicography

[Flow chart: interaction between Linguateca and the PALAVRAS parser. Labels in the diagram: Raw Corpus; Annotated Corpus; Linguateca; PALAVRAS parser; gold revision; HAREM; Compara; Floresta; AC/DC; QA; NER; VISL (teaching, research); data; challenges; evaluation; annotation (PoS, syntax, sem, ...); debugging; tool integration; form/function category standardisation; format filters (xml, tokenization).]

History of the parser (pre-Linguateca)
● Portuguese lexicography (master's thesis)
● Morphological analyzer for Portuguese
● Development of the first version of PALAVRAS' Constraint Grammar modules (morphology and syntax)
● 1996: presentation at the 2nd Propor in Curitiba, Brazil
● 1997: focus on tree structures and corpus work (Borba-Ramsey, VEJA, NILC)
● experiments with heuristics, transcribed speech (NURC), dialectal variation (Cordial-syn, Moçambique) and historical data (Tycho Brahe)
● 1997-?: PALAVRAS and derived systems for other languages used for syntax teaching in Denmark (VISL)
● 2000: Dr.phil. thesis centering on PALAVRAS

History of the parser (co-Linguateca) - corpus annotation
● 1999-?: AC/DC project (e.g. Santos & Bick 2000, LREC)
  – both Portuguese (esp. Público) and Brazilian (Folha) data
  – quantitatively mostly newspaper genre
  – mostly unrevised annotation, but feedback and multiple reannotations
● 2000-?: Floresta Sintá(c)tica Project (e.g. Afonso et al. 2002, LREC)
  – 3 phases: (a) Odense-Oslo, (b) multipolar international, (c) Portugal
  – automatic annotation with human linguistic revision
  – (a-b) PALAVRAS + PSG (Bick 2003, CL), two-pass annotation and revision
  – (c) PALAVRAS + DG (Bick 2005, TLT), one-pass revision
● 2004: PALAVRAS-Linguateca license at SINTEF, Oslo
  – raw and edited annotation (e.g. COMPARA)

History of the parser (co-Linguateca) - joint evaluation
● 2003: Morfolimpíadas/Avalon (Santos 2007)
  – PALMORF, an adapted morphology module from PALAVRAS, achieves top results
● HAREM (evaluation meeting 2006, Porto)
  – PALAVRAS-NER (Bick 2006, Propor Itatiaia), with separate modules for name recognition and classification, participates with good results (best F-scores)
● HAREM
  – the SeReLep system (by PUCRS) integrates a licensed PALAVRAS

Other development strands
● Question answering
  – A prototype using PALAVRAS syntactic analysis (Bick 2003, EPIA)
  – PALAVRAS used as a syntactic module in other systems
    ● University of Évora (Quaresma et al. 2004, CLEF)
    ● Esfinge (Costa 2006, CLEF)
● Historical Portuguese
  – Parser and lexicon adaptations for 18th and 19th century Brazilian Portuguese (Bick & Módolo 2005)
● Semantic annotation beyond NER
  – Semantic prototype annotation (used for MT and Floresta)
  – Semantic role annotation (Bick 2007, TIL)
● CG3, a new Constraint Grammar formalism and compiler
  – direct creation and referencing of dependency and anaphora relations, context windows larger than a sentence, regular expressions, integration of statistical data, ...

DeepDict
● A lexicographic tool to provide contextual dictionary information on the fly
  – useful for dictionary publishers, students, linguists, ...
● example of a product integrating different types of resources into a real-life tool: corpus data, tagger/parser, statistical tools
● example of interactive, corpus-driven lexicography
  – (1) results can be fed back into the parsing lexicon (valency tags, semantic lumping etc.)
  – (2) the improved parsing lexicon allows better corpus annotation and, in turn, more DeepDict data
● the Portuguese version is freely available on the Internet

DeepDict
● syntactically analyzed corpus (dependency links and functions)
  – Linguateca's Público and Folha corpora (CETEMPúblico, CETENFolha): ca M words
  – Portuguese Wikipedia (Nov. 2005): ca. 8 M words
  – Portuguese section of Europarl: ca. 27 M words
● lemmatization, “normalization” (passives, numbers, names)
● extraction of mother-daughter relations, depgrams, not ngrams (cf. Adam Kilgarriff's Sketch Engine)
  – N A + ADJ (gravemente doente)
  – V (ganhar ktp, V + PRP (pensar em)
● co-occurrence measure: p(AB) / (p(A) * p(B))
● graphical interface
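
The depgram idea on this slide can be illustrated with a small sketch. It is not the DeepDict implementation: it assumes the co-occurrence measure is read as p(AB) divided by the product of p(A) and p(B), estimates all probabilities as plain relative frequencies over the extracted pairs, and uses a hypothetical hand-made list of (head lemma, dependent lemma) pairs in place of real parser output. The function name `cooccurrence_scores` is likewise invented for illustration.

```python
from collections import Counter

# Hypothetical toy input: (head lemma, dependent lemma) pairs, standing in for
# mother-daughter relations extracted from a dependency-parsed corpus.
pairs = [
    ("pensar", "em"),
    ("pensar", "em"),
    ("pensar", "bem"),
    ("doente", "gravemente"),
    ("doente", "gravemente"),
    ("doente", "muito"),
    ("ganhar", "muito"),
    ("ganhar", "bem"),
]


def cooccurrence_scores(pairs):
    """Return {(head, dep): p(AB) / (p(A) * p(B))} for every observed pair.

    Assumption: probabilities are relative frequencies over the pair list,
    with p(A) taken from head counts and p(B) from dependent counts.
    """
    total = len(pairs)
    pair_counts = Counter(pairs)
    head_counts = Counter(head for head, _ in pairs)
    dep_counts = Counter(dep for _, dep in pairs)

    scores = {}
    for (head, dep), n in pair_counts.items():
        p_ab = n / total
        p_a = head_counts[head] / total
        p_b = dep_counts[dep] / total
        scores[(head, dep)] = p_ab / (p_a * p_b)
    return scores


scores = cooccurrence_scores(pairs)

# List the pairs from strongest to weakest association.
for (head, dep), score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{head} -> {dep}: {score:.2f}")
```

A score above 1 means the pair occurs more often than its parts would predict if they were independent. Counting over dependency pairs rather than adjacent words is what separates depgrams from ngrams: the head and its dependent need not be neighbours in the sentence.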

PALAVRAS links:
● visl/pt/parsing/automatic/ (live parsing, file upload etc.)
● constraint_grammar.html (formalism)
● visl/pt/info/ (categories and annotation docs)