

Resources
  Primary resources
    – Lexicons, structured vocabularies
    – Grammars (in widest sense)
    – Corpora
    – Treebanks
  Secondary resources
    – Designed for a different purpose (happenstantial)
    – Require software to exploit them
  Tools as resources
    – Generic software for use with primary resources
    – Generic software for exploiting secondary resources
  Reusability

Primary resources (Lexical)
  Lexicons
  Dictionaries for NLP, more than just lists of words
  Not like dictionaries for human use
    – Need information that for humans is “obvious”
    – Don’t need some information typically found in dictionaries for human use
    – But, see later
  Reusability: “theory-neutral”

Primary resources (Lexical)
  Structured vocabulary
  Conceptual structure
    – WordNet
    – (Machine readable) Roget’s Thesaurus
  Ontologies
    – Structure reflects specialised domain
    – Defines vocabulary and conceptual relations
    – Vocabulary reflects reality
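
The conceptual structure of a resource such as WordNet can be pictured as a hypernym ("is-a") hierarchy. The following is a minimal sketch, assuming a tiny hand-built hierarchy: the table `HYPERNYMS` and both functions are hypothetical names introduced for illustration, and neither the entries nor the layout reflect WordNet's actual data format or API.

```python
# Toy "is-a" hierarchy: each concept maps to its immediate hypernym.
# The entries are invented for illustration, not taken from WordNet.
HYPERNYMS = {
    "poodle": "dog",
    "dog": "canine",
    "canine": "mammal",
    "cat": "feline",
    "feline": "mammal",
    "mammal": "animal",
}


def hypernym_chain(concept):
    """Return the concept followed by all of its ancestors, most specific first."""
    chain = [concept]
    while chain[-1] in HYPERNYMS:
        chain.append(HYPERNYMS[chain[-1]])
    return chain


def lowest_common_hypernym(a, b):
    """Return the most specific concept that subsumes both a and b, or None."""
    ancestors_of_b = set(hypernym_chain(b))
    for concept in hypernym_chain(a):
        if concept in ancestors_of_b:
            return concept
    return None


print(hypernym_chain("poodle"))
print(lowest_common_hypernym("poodle", "cat"))
```

Because the structure is stored as plain data, the same table could serve quite different applications, for example query expansion or a rough measure of semantic distance.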

Primary resources (Grammatical)
  “Grammar” includes morphology and syntax
  Rules etc. have to be written, usually by a linguist
  Generic formalisms devised, somewhat independent of application, with associated implementations
  Usually depend on (and “implement”) some (linguistic) theory
  Reusability
    – Application independent
    – Direction-neutral (analysis vs synthesis)
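
Direction-neutrality can be illustrated with a rule table that is consulted both for synthesis and for analysis. This is only a sketch under strong assumptions: `PLURAL_RULES`, `synthesize` and `analyze` are hypothetical names, and the three toy rules cover a small fragment of English plural formation (they would wrongly produce "boies" from "boy", for instance).

```python
# Declarative rule table: (lemma ending, surface ending) pairs for the plural.
# Toy rules for illustration; real morphology needs many more rules
# plus a lexicon of exceptions.
PLURAL_RULES = [
    ("y", "ies"),  # city -> cities
    ("s", "ses"),  # bus -> buses
    ("", "s"),     # cat -> cats (default rule, must come last)
]


def synthesize(lemma):
    """Generate a plural by applying the first rule whose lemma ending matches."""
    for lemma_end, surface_end in PLURAL_RULES:
        if lemma.endswith(lemma_end):
            return lemma[: len(lemma) - len(lemma_end)] + surface_end
    return lemma


def analyze(form):
    """Propose candidate lemmas by running every matching rule in reverse."""
    candidates = []
    for lemma_end, surface_end in PLURAL_RULES:
        if form.endswith(surface_end):
            candidates.append(form[: len(form) - len(surface_end)] + lemma_end)
    return candidates


print(synthesize("city"))
print(analyze("cities"))
```

Analysis overgenerates here (it also proposes "citie" for "cities"), which is one reason a grammar of this kind is normally paired with a lexicon that filters the candidates.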

Corpora
  Corpus (pl. corpora) is a collection of texts
  Use of term usually implies some “value added”
    – Specific to a domain
    – Explicitly collected (“planned”)
    – With some information added as a result of analysis, e.g. POS tags
  Illustrates usage
    – Word collocations
    – Grammatical constructions
  Used to build applications by machine learning
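
As an illustration of how added POS tags help expose word collocations, the sketch below counts adjacent adjective + noun pairs. The three-sentence corpus `TAGGED`, the tag names and the function `adj_noun_pairs` are all invented for this example; a real study would use a large corpus and a statistical association measure rather than raw counts.

```python
from collections import Counter

# Toy POS-tagged corpus: each sentence is a list of (word, tag) pairs.
# The sentences and the tag names are invented for illustration.
TAGGED = [
    [("strong", "ADJ"), ("tea", "NOUN"), ("is", "VERB"), ("nice", "ADJ")],
    [("she", "PRON"), ("drinks", "VERB"), ("strong", "ADJ"), ("tea", "NOUN")],
    [("a", "DET"), ("powerful", "ADJ"), ("computer", "NOUN")],
]


def adj_noun_pairs(sentences):
    """Count adjacent adjective + noun pairs across all sentences."""
    counts = Counter()
    for sentence in sentences:
        # Walk over each pair of neighbouring tokens in the sentence.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if t1 == "ADJ" and t2 == "NOUN":
                counts[(w1, w2)] += 1
    return counts


pairs = adj_noun_pairs(TAGGED)
print(pairs.most_common(2))
```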

British National Corpus
  One of the most widely used corpora (esp. in Britain, but also elsewhere)
  A balanced synchronic text corpus containing 100 million words (POS tagged)
  Collected in late 1980s
  90% text, 10% transcribed speech
  Encoded according to TEI standards
  Associated tools (mainly for searching), but many users write their own (e.g. in Perl)

Examples of other corpora
  Wall Street Journal corpus
    – 25m words from WSJ 1987
    – Parsed, indexed
  North American Newstext corpus
    – 350m words of newswire text
    – Indexed but otherwise not annotated
  ATIS (Air Travel Information System) corpus
    – Transcriptions of real dialogues
  Various corpora collected for competitions
    – MUC, TREC, …

Parallel corpora
  Bilingual and multilingual corpora
    – Texts and their associated “translations”
    – Need to be aligned to be useful
    – Useful for translation studies, and to build MT systems, as reference corpora, or as input to SMT
  Major examples:
    – Canadian and Hong Kong Hansards
    – European parliament and legislation (Europarl)
    – Material from other bilingual countries
    – User documentation from big companies
    – Online newspapers with English (etc.) versions
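
To make the alignment requirement concrete, here is a minimal sketch of length-based sentence alignment by dynamic programming. It is a deliberate simplification and not any published aligner: the cost function (difference in character length) and the function name `align` are assumptions made for this example, and only 1-1 pairings and unpaired sentences are allowed. Practical aligners, such as length-based methods in the style of Gale and Church, use probabilistic costs and also merge sentences (2-1, 1-2).

```python
def align(src, tgt):
    """Align two sentence lists by character length.

    Allowed steps: 1-1 (pair two sentences), 1-0 and 0-1 (leave one unpaired).
    Returns a list of (src_index or None, tgt_index or None) pairs.
    """
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i and j:  # pair src[i-1] with tgt[j-1]
                c = cost[i - 1][j - 1] + abs(len(src[i - 1]) - len(tgt[j - 1]))
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j - 1)
            if i:  # leave src[i-1] unpaired
                c = cost[i - 1][j] + len(src[i - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i - 1, j)
            if j:  # leave tgt[j-1] unpaired
                c = cost[i][j - 1] + len(tgt[j - 1])
                if c < cost[i][j]:
                    cost[i][j], back[i][j] = c, (i, j - 1)
    # Follow the back-pointers from the end to recover the cheapest path.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((i - 1 if pi < i else None, j - 1 if pj < j else None))
        i, j = pi, pj
    return pairs[::-1]


# Invented example: the interjection has no counterpart in the translation.
ENGLISH = ["The session is open.", "Hear, hear!", "We now move to the vote."]
FRENCH = ["La séance est ouverte.", "Nous passons maintenant au vote."]
print(align(ENGLISH, FRENCH))
```

On this toy input the second English sentence is left unpaired, because skipping a short sentence is cheaper than pairing it with a much longer one.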

Treebanks
  In some cases corpora have been fully parsed (and verified)
  Treebanks are a very rich resource, but generally highly theory-specific
  Major example is Penn Treebank
    – includes (selections from) WSJ, Brown, ATIS corpora
    – ongoing

Secondary resources: Lexical
  Word lists aimed at human users can be useful
  Notably dictionaries if available in machine-readable form, e.g. typesetters’ tapes
  Since content is aimed at humans, needs sophisticated software to extract/convert information

Secondary resources: corpora
  Any collection of text can be turned into a corpus, in principle
  Raw text useful for many purposes
  Machine learning approaches
    – Language model can be learned statistically
  Bilingual corpora much used for building statistical MT systems
    – Similarly, translation rules learned from the examples in the corpus
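
What it means to learn a language model statistically from raw text can be sketched with a bigram model estimated by relative frequency. The three-sentence `RAW_TEXT` and the function name `train_bigram_model` are invented for this example, and no smoothing is applied, so unseen word pairs simply receive probability zero.

```python
from collections import Counter


def train_bigram_model(sentences):
    """Return a function giving the maximum-likelihood estimate P(word | previous)."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])             # every token except </s> is a context
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent token pairs

    def prob(word, previous):
        if contexts[previous] == 0:
            return 0.0  # context never seen in training
        return bigrams[(previous, word)] / contexts[previous]

    return prob


RAW_TEXT = ["the cat sat", "the cat slept", "the dog sat"]
p = train_bigram_model(RAW_TEXT)
print(p("cat", "the"))  # "the" is followed by "cat" in 2 of its 3 occurrences
print(p("sat", "cat"))  # "cat" is followed by "sat" in 1 of its 2 occurrences
```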

Generic tools as resources
  Important idea from computer science of separating algorithms from data
  Distinguish:
    – Grammar rules and lexicon, which a program uses as data
    – Programs and user interfaces that use the data to process a given input
    – The algorithms underlying those programs
  Danger of confusion: e.g. Brill’s tagger is software that you can use to tag text, but you have to “program” it (actually, train it) for a given language (actually, sublanguage)
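
The separation of algorithm from data can be sketched with a trainable tagger whose code stays the same while the training data changes. Note that this is a much simpler most-frequent-tag baseline, not Brill's transformation-based tagger; the function names `train` and `tag`, the tag names and the one-sentence training sets are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train(tagged_sentences):
    """Learn, for each word, the tag it most often carries in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_name in sentence:
            counts[word.lower()][tag_name] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word with its learned tag, falling back to a default if unknown."""
    return [(w, model.get(w.lower(), default)) for w in words]


# The same two functions serve any language; only the training data changes.
ENGLISH = [[("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]]
FRENCH = [[("le", "DET"), ("chien", "NOUN"), ("aboie", "VERB")]]

print(tag(["the", "dog", "barks"], train(ENGLISH)))
print(tag(["le", "chien", "aboie"], train(FRENCH)))
```

The learned dictionary is the language-specific part; swapping in training data from another language or sublanguage requires no change to the code.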

Generic tools: reusability
  Well-known principles of software engineering here:
    – Write software for a specific purpose, but try to make it as general as possible
    – Reusable for a different task
    – Reusable with different data
  Same principles apply to data
    – Distinguish between static (declarative) information and what you do with it (procedural)
    – Since data is voluminous (especially lexical data), important to try to be as neutral as possible regarding different purposes, so it can be reused