Corpora and databases Introduction to Computational Linguistics 17 February 2016.

Outline
Basic concepts
Annotation
Corpus building
English corpora
Corpora developed by the Szeged NLP group

Basic concepts
Corpus (pl. corpora): a database (collection of texts) created for specific purposes
Annotation: manual marking of relevant linguistic information on texts

Supervised machine learning
Human: manually annotated data (corpora)
Machine: machine learning systems derive statistics and generalizations from the corpora
Text: new data is processed automatically
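The pipeline on this slide can be illustrated with a very small sketch. It is not the method used in the course: it assumes a toy part-of-speech task, invented training pairs, and a most-frequent-tag baseline as a stand-in for a real machine learning system. The names `annotated`, `train` and `tag` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical manually annotated data: (token, POS tag) pairs.
annotated = [
    ("the", "DET"), ("dog", "NOUN"), ("barks", "VERB"),
    ("the", "DET"), ("bark", "NOUN"), ("is", "VERB"), ("loud", "ADJ"),
    ("dogs", "NOUN"), ("bark", "VERB"), ("bark", "VERB"),
]

def train(pairs):
    """Collect tag statistics per token and keep the most frequent tag."""
    counts = defaultdict(Counter)
    for token, tag_label in pairs:
        counts[token][tag_label] += 1
    return {token: tags.most_common(1)[0][0] for token, tags in counts.items()}

def tag(model, tokens, default="NOUN"):
    """Process new text; unseen tokens fall back to an assumed default tag."""
    return [(t, model.get(t, default)) for t in tokens]

model = train(annotated)
predicted = tag(model, ["the", "dogs", "bark", "cat"])
```

The model is nothing more than statistics generalized from the annotated corpus, which is why the quality and consistency of the annotation matter for everything built on it.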

Types of corpora
Monolingual
Multilingual
–parallel corpus: same set of data in several languages
Speech corpus: recorded material
Written corpus: texts

Annotation
Text/document level
–Is an e-mail spam or ham?
Sentence level
–Factual or uncertain information?
Token/phrase level
–Morphological analysis
–Named entities
Without annotation
–Word frequencies
–Co-occurrences (n-grams)
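Word frequencies and n-gram co-occurrences need no annotation at all, only tokenized text. A minimal sketch, assuming whitespace tokenization and an invented example sentence (the helper `ngrams` is a hypothetical name):

```python
from collections import Counter

text = "the cat sat on the mat and the cat slept"
tokens = text.split()  # naive whitespace tokenization, an assumption

# Word frequencies: how often each token occurs.
word_freq = Counter(tokens)

def ngrams(seq, n):
    """Return all contiguous n-token windows of seq as tuples."""
    return list(zip(*(seq[i:] for i in range(n))))

# Co-occurrences of adjacent tokens (bigrams).
bigram_freq = Counter(ngrams(tokens, 2))
```

On real corpora the tokenization step is the hard part; the counting itself stays this simple.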

Types of annotation
Manual
Semi-automatic: automatic annotation corrected manually
Automatic
Single: one text – one annotator
–cheap
–fast
Multiple: one text – multiple annotators (independently of each other)
–time-consuming
–expensive
–inter-annotator agreement rate

Agreement rates
A metric to check the consistency of annotation (how similarly two annotators annotate the same text)
–accuracy
–F-score (precision, recall)
–Kappa
The agreement rate is usually regarded as a theoretical upper limit for the performance of machines
May indicate the difficulty of the task
Heavily depends on the task!
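As an illustration of two of these metrics, the sketch below computes plain accuracy (observed agreement) and Cohen's kappa for two annotators who labelled the same tokens. The label sequences are invented, and Cohen's kappa is only one of several kappa variants; the slide does not say which one is meant.

```python
from collections import Counter

def observed_agreement(a, b):
    """Share of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohen_kappa(a, b):
    """Cohen's kappa: agreement corrected for chance agreement.

    Undefined (division by zero) when chance agreement is 1, i.e. when
    both annotators use one and the same label for every item.
    """
    n = len(a)
    p_o = observed_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    p_e = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical named-entity labels from two annotators for eight tokens.
annotator_1 = ["PER", "LOC", "O", "O", "ORG", "O", "PER", "O"]
annotator_2 = ["PER", "ORG", "O", "O", "ORG", "O", "O", "O"]

accuracy = observed_agreement(annotator_1, annotator_2)
kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the annotators agree on 6 of 8 tokens, but kappa is lower than that raw accuracy because many of the agreements fall on the frequent label `O`, where agreement by chance is likely.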

Forms of annotation
Text and annotation in the same file (mostly XML)
Text and annotation in separate files (standoff/standalone)
Advantages/disadvantages:
–Restoring the original text
–Adding new texts
–Deleting texts
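The difference between the two forms can be sketched as follows. The sentence, the `(label, start, end)` triple layout, the `<ne>` element and both helper functions are assumptions made for illustration, not the format of any corpus mentioned in these slides.

```python
import re

text = "The Shadow Riders attack Duel Academy."

# Standoff form: the text stays untouched; annotations point into it
# by character offsets (end exclusive). Labels are hypothetical.
standoff = [("NE_ORG", 4, 17), ("NE_LOC", 25, 37)]

def to_inline(raw, annotations):
    """Merge non-overlapping standoff annotations into XML-like inline markup."""
    out, pos = [], 0
    for label, start, end in sorted(annotations, key=lambda a: a[1]):
        out.append(raw[pos:start])
        out.append(f'<ne type="{label}">{raw[start:end]}</ne>')
        pos = end
    out.append(raw[pos:])
    return "".join(out)

def strip_inline(marked):
    """Restore the original text from the inline form by deleting the tags."""
    return re.sub(r"</?ne[^>]*>", "", marked)

inline = to_inline(text, standoff)
restored = strip_inline(inline)
```

With standoff annotation the original text is always available as-is, but every offset breaks if the text file is edited; with inline markup the annotation travels with the text, at the cost of having to strip the tags to get the original back. The sketch skips XML escaping of characters such as `<` and `&`.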

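Example (text and annotation in the same file):
Rövidtávú – féléves – kilátásaikat illetően a cégek egész évben októberben voltak a legoptimistábbak.
(English: "Regarding their short-term, half-year prospects, companies were at their most optimistic of the whole year in October.")
Rövidtávú: Rövidtávú [X], Rövidtávú [X]
rövid: rövid [Afp-sn], rövid [Afp-sn], rövid [Nc-sn]
távú: távú [Afp-sn], távú [Afp-sn]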

Example (dependency annotation, one token per line):
1  _  _  _  ELL  ELL  _  _  0  0  ROOT  ROOT
2  Japánban  Japán  Japán  N  N  SubPOS=p|Num=s|Cas=2|NumP=none|PerP=none|NumPd=none  SubPOS=p|Num=s|Cas=2|NumP=none|PerP=none|NumPd=none  1  1  OBL  OBL
3  ,  ,  ,  ,  ,  _  _  1  1  PUNCT  PUNCT
4  ahol  ahol  ahol  R  R  SubPOS=r|Deg=none|Num=none|Per=none  SubPOS=r|Deg=none|Num=none|Per=none  9  9  TLOCY  TLOCY
ban  M  M  SubPOS=c|Num=s|Cas=2|Form=d|NumP=none|PerP=none|NumPd=none  SubPOS=c|Num=s|Cas=2|Form=d|NumP=none|PerP=none|NumPd=none  9  9  OBL  OBL
6  közel  közel  közel  R  R  SubPOS=x|Deg=none|Num=none|Per=none  SubPOS=x|Deg=none|Num=none|Per=none  7  7  MODE  MODE
7  félmillió  félmillió  félmillió  M  M  SubPOS=c|Num=s|Cas=n|Form=l|NumP=none|PerP=none|NumPd=none  SubPOS=c|Num=s|Cas=n|Form=l|NumP=none|PerP=none|NumPd=none  8  8  ATT  ATT
8  válást  válás  válás  N  N  SubPOS=c|Num=s|Cas=a|NumP=none|PerP=none|NumPd=none  SubPOS=c|Num=s|Cas=a|NumP=none|PerP=none|NumPd=none  9  9  OBJ  OBJ
9  mondtak  mond  mond  V  V  SubPOS=m|Mood=i|Tense=s|Per=3|Num=p|Def=n  SubPOS=m|Mood=i|Tense=s|Per=3|Num=p|Def=n  1  1  ATT  ATT
10  ki  ki  ki  R  R  SubPOS=p|Deg=none|Num=none|Per=none  SubPOS=p|Deg=none|Num=none|Per=none  9  9  PREVERB  PREVERB
11  ,  ,  ,  ,  ,  _  _  9  9  PUNCT  PUNCT
ben  M  M  SubPOS=c|Num=s|Cas=2|Form=d|NumP=none|PerP=none|NumPd=none  SubPOS=c|Num=s|Cas=2|Form=d|NumP=none|PerP=none|NumPd=none  1  1  OBL  OBL
13  már  már  már  R  R  SubPOS=x|Deg=none|Num=none|Per=none  SubPOS=x|Deg=none|Num=none|Per=none  15  15  MODE  MODE
14  2,6  2,6  2,6  M  M  SubPOS=f|Num=s|Cas=n|Form=d|NumP=none|PerP=none|NumPd=none  SubPOS=f|Num=s|Cas=n|Form=d|NumP=none|PerP=none|NumPd=none  15  15  NUM  NUM
15  milliót  millió  millió  M  M  SubPOS=c|Num=s|Cas=a|Form=l|NumP=none|PerP=none|NumPd=none  SubPOS=c|Num=s|Cas=a|Form=l|NumP=none|PerP=none|NumPd=none  1  1  OBJ  OBJ
_  _  0  0  PUNCT  PUNCT
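The morphological feature strings in the example above (for instance `SubPOS=c|Num=s|Cas=a`) are lists of `name=value` pairs separated by `|`, with `_` standing for an empty field. A minimal sketch of reading one such field into a dictionary; the function name `parse_feats` is hypothetical and the treatment of `_` as "no features" is an assumption based on the sample rows.

```python
def parse_feats(feats):
    """Turn 'A=x|B=y' into {'A': 'x', 'B': 'y'}; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# Feature string of the token "válást" in the example.
feats = parse_feats("SubPOS=c|Num=s|Cas=a|NumP=none|PerP=none|NumPd=none")
case = feats["Cas"]
```

Once the features are in a dictionary, questions such as "how many accusative nouns does the corpus contain" reduce to simple filtering over tokens.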

Shadow_Riders.txt
The Shadow Riders, known as the in the original Japanese language version, are a fictional group of villains in the Yu-Gi-Oh! GX anime series, appearing between episodes Composed of seven duelists and their leader of varying origins and backgrounds who each have their own agendas, the Shadow Riders serve as the main antagonists of the series' first season, intent on resurrecting the Sacred Beasts. However, one of them returns in the fourth and final season as the true mastermind behind the mysterious attacks that take place in Duel Academy and Domino City.
Shadow_Riders.txt.annotation
NE_ORG 4 17
NE_MISC 48 56
NE_MISC MWE_COMPOUND_NOUN SENT_BOUND NE_ORG NE_MISC NE_MISC_SB MWE_LVC MWE_LVC_VERB MWE_LVC_NOUN NE_LOC NE_LOC NE_LOC_SB NE_ORG NE_PER NE_PER_SB NE_PER SENT_BOUND MWE_COMPOUND_NOUN MWE_COMPOUND_NOUN NE_MISC SENT_BOUND

Annotation tools
Graphical user interface (GUI)
Understandable for humans
Easy-to-use
Error rate can be reduced

How to build corpora
1. Collecting and preparing texts
2. Manual annotation
–Multiple annotation, to check the agreement rate
–Single annotation
3. Resolving differences, checking
–Resolving the differences between the annotations
4. Final steps
–Creating the final format of the corpus, correcting technical issues, publishing the corpus

Making use of corpora
Reference
Training machine learning algorithms
Testing machine learning algorithms
Collecting language data

English corpora
British National Corpus (BNC)
–British English
–~100M tokens
–Written and spoken language
–Automatic annotation
Wall Street Journal (WSJ)
–Business English
–Some parts are manually annotated (morphology, syntax)
Reuters
–~100M tokens
–Documents and paragraphs
Gigaword corpus
–2 billion tokens
Penn TreeBank
–5 million tokens
–POS tags
–Syntactic analysis (constituency)
Task-specific corpora: CoNLL-2003 (named entities), SemEval (semantics)…
– K tokens

Magyar Nemzeti Szövegtár (MNSZ, Hungarian National Corpus)
187.6 million tokens
Domains: news, fiction, science, official, personal
Hungarian from outside Hungary
Automatic lemmatization and POS-tagging
Gigaword version (1 billion tokens)

Webkorpusz
More than 1.48 billion tokens unfiltered, or 589 million tokens after filtering
The biggest Hungarian corpus so far
18 million websites (.hu)
http://mokk.bme.hu/resources/webcorpus

SzegedParalell
English-Hungarian parallel corpus
Manually aligned paragraphs and sentences:
–Language books
–EU texts
–Bilingual magazines
–Fiction
99,000 sentence alignment units

Szeged (Dependency) Treebank
sentences 1.5 million tokens punctuation marks
6 domains
–Student essays
–Computer texts
–Fiction
–Legal texts
–Newspaper texts
–Short business news
Manual annotation: morphology, syntax (constituency and dependency), named entities, light verb constructions, coreference

NE-corpora
CoNLL challenge
ORG / LOC / PER / MISC categories
~ tokens (SZK business news)
~ tokens (articles from HVG)
–tag-for-tag: I travelled to Barcelona.
–tag-for-meaning: Barcelona won the game.

Corpora annotated for uncertainty
BioScope (20K sentences)
–Clinical texts
–Biological abstracts
–Biological papers
CoNLL-2010 Shared Task corpora (biological papers (18K sentences) + Wikipedia articles (20K sentences))
Szeged Uncertainty Corpus
–reannotated CoNLL FactBank
–Unified annotation principles
WikiWeasel 2.0: discourse level uncertainty
hUnCertainty: Hungarian corpus (17K sentences)

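A O O
lap O O
szerint B-doxastic B-doxastic
P. O O
. O O
Márió O O
kitart B-doxastic O
amellett O O
, O O
hogy O O
egyáltalán O O
nem O O
emlékszik O O
arra O O
, O O
hogy O O
őt O O
bárki O O
is O O
üldözte O O
volna O O
. O O
Állítólag B-epistemic B-epistemic
azon O O
a O O
területen O O
, O O
ahol O O
a O O
vérengzés O O
történt O O
, O O
csak O O
a O O
gyilkos O O
kocsijának O O
a O O
keréknyomát O O
találták O O
meg O O
(English: "According to the paper, P. Márió maintains that he does not remember at all that anyone chased him. Allegedly, in the area where the massacre took place, only the wheel tracks of the killer's car were found.")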
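In the sample above every token is followed by two tags, and the two columns differ for "kitart". The slide does not say what the columns stand for; the sketch below simply assumes they are two independent annotations of the same tokens (two annotators, or a gold standard and a system output) and lists the tokens where they disagree. Only four rows are copied from the sample, and the names `rows`, `parse` and `disagreements` are hypothetical.

```python
# A few rows copied from the sample: token, first tag column, second tag column.
rows = [
    "szerint B-doxastic B-doxastic",
    "kitart B-doxastic O",
    "amellett O O",
    "Állítólag B-epistemic B-epistemic",
]

def parse(row):
    """Split one whitespace-separated row into (token, tag_a, tag_b)."""
    token, tag_a, tag_b = row.split()
    return token, tag_a, tag_b

parsed = [parse(r) for r in rows]

# Tokens on which the two tag columns differ.
disagreements = [(tok, a, b) for tok, a, b in parsed if a != b]

# Token-level agreement over these rows.
agreement = 1 - len(disagreements) / len(parsed)
```

A list like `disagreements` is the natural input to the "resolving differences" step of corpus building: each entry is a case that has to be looked at again.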

MWE-corpora
Multiword expressions (MWEs)
Wiki50 corpus:
–50 English Wikipedia articles (4700 sentences)
–MWEs and NEs manually annotated
LVCs in Szeged Treebank and SzegedParallel (in part)
English, German, Spanish and Hungarian LVCs in the JRC-Acquis legal parallel corpus (~100K tokens for each language)

Wiki50

HunLearner
Students of Hungarian at intermediate and advanced level
Essays written on computer without any dictionary
1400 sentences
Morphological errors on nouns
Errors of definite/indefinite conjugation

1  A  a  Tf  2  DET  T  SubPOS=f
2  gyerek  gyerek  Nc-sn  9  SUBJ  N  SubPOS=c|Num=s|Cas=n|NumP=none|PerP=none|NumPd=none
3  nagyon  nagyon  Rx  4  MODE  R  SubPOS=x|Deg=none
4  okos  okos  Afp-sn  9  ATT  A  SubPOS=f|Deg=p|Num=s|Cas=n|NumP=none|PerP=none|NumPd=none
5  és  és  Ccsw  4  CONJ  C  SubPOS=c|Form=s|Coord=w
6  kedves  kedves  Afp-sn  5  COORD  A  SubPOS=f|Deg=p|Num=s|Cas=n|NumP=none|PerP=none|NumPd=none
7  és  és  Ccsw  6  CONJ  C  SubPOS=c|Form=s|Coord=w
8  jól  jól  Rxp  7  COORD  R  SubPOS=x|Deg=p
9  müködik  müködik  X  0  ROOT  X  _
10  a  a  Tf  11  DET  T  SubPOS=f
11  kapcsolatünk  kapcsolatünk  X  9  OBL  X  _  kapcsolatunk  Stem: A  Assimilation: 1  Matching: B  Suffix number:
PUNCT  .  _

Personality markers and opinions
500 blogs on traveling to 5 destinations
English blogs
Positive and negative opinions on certain aspects
Text spans related to personality traits also marked
Example: The portions were on the small side.

Coreference corpus
Expressions referring to the same entity are linked
Szeged Treebank (in part)

Project work idea
Create a small corpus on a topic of your choice
Annotation tools are available
Assistance in programming (if needed)
More ideas:
–error analysis of the output of an NLP tool (parsers, machine translators, sentiment analysis)
–(statistical) analysis of data from a corpus