Corpus-based computational linguistics or computational corpus linguistics?
Joakim Nivre
Uppsala University, Department of Linguistics and Philology

Outline
Different worlds?
– Corpus-based computational linguistics
– Computational corpus linguistics
– Similarities and differences
– Opportunities for collaboration
Computational linguistics – an example
– Dependency-based syntactic analysis
– Machine learning

Different worlds?

Corpora and computers
The empirical revolution in (computational) linguistics:
– Increased use of empirical data
– Development of large corpora
– Annotation of corpus data (syntactic, semantic)
Underlying causes:
– Technical development:
  – Availability of machine-readable text (and digitized speech)
  – Computational capacity: storage and processing
– Scientific shift:
  – Criticism of armchair linguistics
  – Development of statistical language models

Computational corpus linguistics
Goal:
– Knowledge of language
  – Descriptive studies
  – Theoretical hypothesis testing
Means:
– Corpus data as a source of knowledge of language
  – Descriptive statistics
  – Statistical inference for hypothesis testing
– Computer programs for processing corpus data
  – Corpus development and annotation
  – Search and visualization (for humans)
  – Statistical analysis (descriptive and inferential)

Corpus-based computational linguistics
Goal:
– Computer programs that process natural language
  – Practical applications (translation, summarization, …)
  – Models of language learning and use
Means:
– Corpus data as a source of knowledge of language
  – Statistical inference for model parameters (estimation)
– Computer programs for processing corpus data
  – Corpus development and annotation
  – Search and information extraction (for computers)
  – Statistical analysis (estimation/machine learning)

Corpus processing 1
Corpus development:
– Tokenization (minimal units, words, etc.; see the sketch after this list)
– Segmentation (on several levels)
– Normalization (e.g., abbreviations, orthography, multi-word units, graphical elements, metadata, etc.)
Annotation:
– Part-of-speech tagging (word → word class)
– Lemmatization (word → base form/lemma)
– Syntactic analysis (sentence → syntactic representation)
– Semantic analysis (word → sense, sentence → proposition)
Standard methodology:
– Automatic analysis (often based on other corpus data)
– Manual validation (and correction)
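As a concrete illustration of the tokenization step, here is a minimal Python sketch; the regular expression and function name are illustrative, and real corpus tokenizers handle abbreviations, multi-word units, and similar phenomena with far more care:

```python
import re

# Minimal tokenizer sketch: words and numbers (\w+) or single
# punctuation marks. Abbreviations, multi-word units, etc. would
# need dedicated rules, as the slide notes.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return the list of minimal units (tokens) in `text`."""
    return TOKEN_RE.findall(text)

print(tokenize("Economic news had little effect on financial markets."))
# ['Economic', 'news', 'had', 'little', 'effect', 'on',
#  'financial', 'markets', '.']
```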

Corpus processing 2
Searching and sorting:
– Search methods:
  – String matching
  – Regular expressions
  – Dedicated query languages
  – Special-purpose programs
– Results:
  – Concordances (see the sketch after this list)
  – Frequency lists
Visualization:
– Textual: concordances, etc.
– Graphical: diagrams, syntax trees, etc.
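To make the concordance idea concrete, here is a small keyword-in-context (KWIC) sketch in Python; the function and its output formatting are illustrative rather than any particular tool's interface:

```python
def concordance(tokens, query, window=3):
    """Print a simple KWIC concordance: every occurrence of
    `query` with `window` tokens of context on each side."""
    for i, tok in enumerate(tokens):
        if tok.lower() == query.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            print(f"{left:>25}  [{tok}]  {right}")

tokens = "the cat sat on the mat and the dog sat too".split()
concordance(tokens, "sat")
# prints each hit as:  <left context>  [sat]  <right context>
```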

Corpus processing 3
Statistical analysis:
– Descriptive statistics
  – Frequency tables and diagrams
– Statistical inference
  – Hypothesis testing (t-test, χ², Mann-Whitney, etc.; see the sketch after this list)
Machine learning:
– Probabilistic: estimate probability distributions
– Discriminative: approximate the mapping from input to output
– Induction of lexical and grammatical resources (e.g., collocations, valency frames)
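As a sketch of the hypothesis-testing step, the following compares a word's frequency in two corpora with a χ² test via scipy; the counts are invented purely for illustration:

```python
from scipy.stats import chi2_contingency

# Toy contingency table: rows are two corpora, columns are
# occurrences of some word vs. all other tokens. The counts
# are invented for illustration only.
observed = [[150, 99850],   # corpus A: 150 hits in 100,000 tokens
            [90,  99910]]   # corpus B:  90 hits in 100,000 tokens

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # small p: frequencies differ
```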

User Requirements
Corpus linguists:
– Software: accessible, easy to use, general
– Output: suitable for humans, perspicuous (graphical visualization)
– Functions: specific search, descriptive statistics
Computational linguists:
– Software: efficient, modifiable, specific
– Output: suitable for computers, well-defined format (annotated text)
– Functions: exhaustive search, statistical learning

Summary
Different goals:
– Study language
– Create computer programs
… give (partly) different requirements:
– Accessible and usable (for humans)
– Efficient and standardized (for computers)
… but (partly) the same needs:
– Corpus development and annotation
– Searching, sorting, and statistical analysis

Symbiosis?
What can computational linguists do for corpus linguists?
– Technical and general linguistic competence
– Software for automatic analysis (annotation)
What can corpus linguists do for computational linguists?
– Linguistic and language-specific competence
– Manual validation of automatic analysis
What can they achieve together?
– Automatic annotation improves precision in corpus linguistics
– Manual validation improves precision in computational linguistics
– A virtuous circle?

Computational linguistics – an example

Dependency analysis Economicnewshadlittleeffectonfinancialmarkets. JJNNVBDJJNNINJJNNS. ROOT NMODSBJNMOD OBJ PMOD NMOD P

Inductive dependency parsing
Deterministic syntactic analysis (parsing):
– Algorithm for deriving dependency structures
– Requires a decision function in choice situations
– All decisions are final (deterministic)
Inductive machine learning:
– Decision function based on previous experience
– Generalize from examples (successive refinement)
– Examples = annotated sentences (treebank)
– No grammar – just analogy

Algorithm
Data structures:
– Queue of unanalyzed words (next = first in queue)
– Stack of partially analyzed words (top = on top of stack)
Start state:
– Empty stack
– All words in queue
Algorithm steps (see the sketch after this list):
– Shift: Put next on top of stack (push)
– Reduce: Remove top from stack (pop)
– Right: Put next on top of stack (push); link top → next
– Left: Remove top from stack (pop); link next → top
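A minimal Python sketch of these four transitions, assuming words are identified by their positions and arcs are collected as (head, dependent, label) triples; the decision function that chooses between them is left out:

```python
def shift(stack, queue, arcs):
    stack.append(queue.pop(0))              # push next onto the stack

def reduce_(stack, queue, arcs):
    stack.pop()                             # pop top from the stack

def right(stack, queue, arcs, label):
    arcs.add((stack[-1], queue[0], label))  # link top -> next ...
    stack.append(queue.pop(0))              # ... then push next

def left(stack, queue, arcs, label):
    arcs.add((queue[0], stack[-1], label))  # link next -> top ...
    stack.pop()                             # ... then pop top

# Start state: empty stack, all words (here positions 1-5) in the queue.
stack, queue, arcs = [], [1, 2, 3, 4, 5], set()
shift(stack, queue, arcs)
left(stack, queue, arcs, "NMOD")   # word 2 becomes the head of word 1
print(stack, queue, sorted(arcs))  # [] [2, 3, 4, 5] [(2, 1, 'NMOD')]
```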

Algorithm example
[Figure: step-by-step parse of "Economic news had little effect on financial markets." (JJ NN VBD JJ NN IN JJ NNS .), building the tree from the earlier slide with the transitions Shift, Reduce, LA(NMOD), LA(SBJ), RA(OBJ), RA(NMOD), RA(PMOD), and RA(P), starting from an artificial ROOT node 0]

Decision function
Non-determinism requires a decision function:
(Queue, Stack, Graph) → Step
Possible approaches:
– Grammar?
– Inductive generalization!
[Figure: "… eats pizza with …" – after RA(OBJ) has attached "pizza", should the parser choose RA(ATT) or Reduce when "with" arrives?]

Machine learning
Decision function:
– (Queue, Stack, Graph) → Step
Model:
– (Queue, Stack, Graph) → (f1, …, fn)
Classifier:
– (f1, …, fn) → Step
Learning:
– { ((f1, …, fn), Step) } → Classifier

Model
Parts of speech: t1, top, next, n1, n2, n3
Dependency types: t.hd, t.ld, t.rd, n.ld
Word forms: top, next, top.hd, n1
[Figure: stack (…, t1, top) and queue (next, n1, n2, n3, …) positions, with hd = head, ld = leftmost dependent, rd = rightmost dependent]
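A sketch of how such a model maps a parser state to a feature vector (f1, …, fn); the helper names and the particular feature subset are illustrative, not the exact model on the slide:

```python
def extract_features(stack, queue, pos, head_label):
    """Map a state (Queue, Stack, Graph) to a feature tuple.
    `pos`: word position -> part of speech;
    `head_label`: word position -> dependency type of its head arc."""
    top = stack[-1] if stack else None
    nxt = queue[0] if queue else None
    n1 = queue[1] if len(queue) > 1 else None
    return (pos.get(top, "NONE"),          # part of speech of top
            pos.get(nxt, "NONE"),          # part of speech of next
            pos.get(n1, "NONE"),           # lookahead: POS of n1
            head_label.get(top, "NONE"))   # dependency type of top

pos = {1: "JJ", 2: "NN", 3: "VBD"}
print(extract_features([2], [3], pos, {1: "NMOD"}))
# ('NN', 'VBD', 'NONE', 'NONE')
```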

Memory-based learning
Memory-based learning and classification:
– Learning is storing experiences in memory
– Problem solving is achieved by reusing solutions of similar problems experienced in the past
TiMBL (Tilburg Memory-Based Learner):
– Basic method: k-nearest neighbor
– Parameters:
  – Number of neighbors (k)
  – Distance metrics
  – Weighting of attributes, values and instances

Learning example (see the sketch after this list)
Instance base:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A
New instance:
5. (a, b, b, a)
Distances:
D(1, 5) = 2, D(2, 5) = 1, D(3, 5) = 4, D(4, 5) = 3
k-NN classification:
1-NN(5) = B, 2-NN(5) = A/B, 3-NN(5) = A
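The slide's numbers can be reproduced with a few lines of Python implementing the overlap distance and majority voting; this is a sketch of the k-NN idea, not TiMBL's actual implementation:

```python
from collections import Counter

def distance(x, y):
    """Overlap distance: number of positions where x and y differ."""
    return sum(a != b for a, b in zip(x, y))

instances = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
             (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
new = ("a", "b", "b", "a")

ranked = sorted(instances, key=lambda inst: distance(inst[0], new))
for k in (1, 2, 3):
    votes = Counter(label for _, label in ranked[:k])
    print(k, votes.most_common())
# k=1: B wins; k=2: A and B tie; k=3: A wins (as on the slide)
```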

Experimental evaluation
Inductive dependency analysis:
– Deterministic algorithm
– Memory-based decision function
Data:
– English: Penn Treebank, WSJ (1M words), converted to dependency structures
– Swedish: Talbanken, professional prose (100k words), dependency structure based on the MAMBA annotation

Results
English:
– 87.3% of all words got the correct head
– 85.6% of all words got the correct head and label
Swedish:
– 85.9% of all words got the correct head
– 81.6% of all words got the correct head and label

Dependency types: English
High precision (F ≥ 86%):
– VC (auxiliary verb → main verb): 95.0%
– NMOD (noun modifier): 91.0%
– SBJ (verb → subject): 89.3%
– PMOD (complement of preposition): 88.6%
– SBAR (complementizer → verb): 86.1%
Medium precision (73% ≤ F ≤ 83%):
– ROOT: 82.4%
– OBJ (verb → object): 81.1%
– VMOD (adverbial): 76.8%
– AMOD (adjective/adverb modifier): 76.7%
– PRD (predicative complement): 73.8%
Low precision (F ≤ 70%):
– DEP (other)

Dependency types: Swedish
High precision (F ≥ 84%):
– IM (infinitive marker → infinitive): 98.5%
– PR (preposition → noun): 90.6%
– DET (noun → determiner): 89.5%
– ROOT: 87.8%
– UK (complementizer → verb): 86.4%
– VC (auxiliary verb → main verb): 86.1%
– SUB (verb → subject): 84.5%
Medium precision (76% ≤ F ≤ 80%):
– ATT (noun modifier): 79.2%
– CC (coordination): 78.9%
– OBJ (verb → object): 77.7%
– PRD (verb → predicative): 76.8%
– ADV (adverbial): 76.3%
Low precision (F ≤ 70%):
– INF, APP, XX, ID

Corpus annotation
How good is 85%?
– Good enough to save time for manual annotators
– Good enough to improve search precision
– Recent release: SUC with syntactic annotation
How can accuracy be improved further?
– By annotating more data, which facilitates machine learning
– By refined linguistic analysis of the structures to be annotated and of the errors made

MaltParser
Software for inductive dependency parsing:
– Freely available (open source): http://maltparser.org
– Evaluated on close to 30 different languages
– Used for annotating corpora at Uppsala University