Corpus-based computational linguistics or computational corpus linguistics?
Joakim Nivre
Uppsala University, Department of Linguistics and Philology

Outline
Different worlds?
– Corpus-based computational linguistics
– Computational corpus linguistics
– Similarities and differences
– Opportunities for collaboration
Computational linguistics – an example
– Dependency-based syntactic analysis
– Machine learning

Different worlds?

Corpora and computers
The empirical revolution in (computational) linguistics:
– Increased use of empirical data
– Development of large corpora
– Annotation of corpus data (syntactic, semantic)
Underlying causes:
– Technical development:
    Availability of machine-readable text (and digitized speech)
    Computational capacity: storage, processing
– Scientific shift:
    Criticism of armchair linguistics
    Development of statistical language models

Computational corpus linguistics
Goal:
– Knowledge of language
    Descriptive studies
    Theoretical hypothesis testing
Means:
– Corpus data as a source of knowledge of language
    Descriptive statistics
    Statistical inference for hypothesis testing
– Computer programs for processing corpus data
    Corpus development and annotation
    Search and visualization (for humans)
    Statistical analysis (descriptive and inferential)

Corpus-based computational linguistics
Goal:
– Computer programs that process natural language
    Practical applications (translation, summarization, …)
    Models of language learning and use
Means:
– Corpus data as a source of knowledge of language
    Statistical inference for model parameters (estimation)
– Computer programs for processing corpus data
    Corpus development and annotation
    Search and information extraction (for computers)
    Statistical analysis (estimation/machine learning)

Corpus processing 1
Corpus development:
– Tokenization (minimal units, words, etc.)
– Segmentation (on several levels)
– Normalization (e.g., abbreviations, orthography, multi-word units; graphical elements, metadata, etc.)
Annotation:
– Part-of-speech tagging (word → word class)
– Lemmatization (word → base form/lemma)
– Syntactic analysis (sentence → syntactic representation)
– Semantic analysis (word → sense, sentence → proposition)
Standard methodology:
– Automatic analysis (often based on other corpus data)
– Manual validation (and correction)

Corpus processing 2
Searching and sorting:
– Search methods:
    String matching
    Regular expressions
    Dedicated query languages
    Special-purpose programs
– Results:
    Concordances
    Frequency lists
Visualization:
– Textual: Concordances, etc.
– Graphical: Diagrams, syntax trees, etc.
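
As a rough illustration of the search methods and result types listed on this slide, the following Python sketch builds a frequency list and a keyword-in-context (KWIC) concordance from a regular-expression search. The sample text, the function names `tokenize`, `frequency_list` and `concordance`, and the very crude tokenization are all assumptions made for illustration; they are not part of the presentation.

```python
import re
from collections import Counter

# Toy corpus, invented for illustration only.
TEXT = ("Economic news had little effect on financial markets. "
        "Financial news had some effect on the markets.")

def tokenize(text):
    """Very crude tokenization: lowercased runs of letters."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_list(tokens):
    """Return (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def concordance(tokens, pattern, width=2):
    """Return one KWIC line per token that fully matches the regex `pattern`."""
    regex = re.compile(pattern)
    lines = []
    for i, tok in enumerate(tokens):
        if regex.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

tokens = tokenize(TEXT)
print(frequency_list(tokens)[:3])
for line in concordance(tokens, r"markets?"):
    print(line)
```

The same two functions cover both result types named on the slide; a dedicated query language would replace the single regular expression with patterns over annotated tokens.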

Corpus processing 3
Statistical analysis:
– Descriptive statistics
    Frequency tables and diagrams
– Statistical inference
    Hypothesis testing (t-test, χ², Mann-Whitney, etc.)
Machine learning:
– Probabilistic: Estimate probability distributions
– Discriminative: Approximate the input-output mapping
– Induction of lexical and grammatical resources (e.g., collocations, valency frames)
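
To make the hypothesis-testing item concrete, here is a minimal sketch of a Pearson χ² test on a 2×2 frequency table, written in plain Python. The counts are hypothetical (a word's frequency against all other tokens in two invented corpora), the function name `chi_square_2x2` is an assumption, and no continuity correction is applied.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]] (no correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Hypothetical counts: (word, other tokens) in corpus 1 and in corpus 2.
stat = chi_square_2x2(30, 970, 10, 990)

# Critical value of the chi-square distribution, 1 degree of freedom, 5% level.
CRITICAL_5_PERCENT_DF1 = 3.841
print(round(stat, 2), stat > CRITICAL_5_PERCENT_DF1)
```

With these invented counts the statistic exceeds the critical value, so the difference in relative frequency would count as significant at the 5% level.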

User Requirements
Corpus linguists
– Software: accessible, easy to use, general
– Output: suitable for humans, perspicuous (graphical visualization)
– Functions: specific search, descriptive statistics
Computational linguists
– Software: efficient, modifiable, specific
– Output: suitable for computers, well-defined format (annotated text)
– Functions: exhaustive search, statistical learning

Summary
Different goals:
– Study language
– Create computer programs
… give (partly) different requirements:
– Accessible and usable (for humans)
– Efficient and standardized (for computers)
… but (partly) the same needs:
– Corpus development and annotation
– Searching, sorting, and statistical analysis

Symbiosis?
What can computational linguists do for corpus linguists?
– Technical and general linguistic competence
– Software for automatic analysis (annotation)
What can corpus linguists do for computational linguists?
– Linguistic and language-specific competence
– Manual validation of automatic analysis
What can they achieve together?
– Automatic annotation improves precision in corpus linguistics
– Manual validation improves precision in computational linguistics
– A virtuous circle?

Computational linguistics – an example

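Dependency analysis
0  ROOT
1  Economic   JJ
2  news       NN
3  had        VBD
4  little     JJ
5  effect     NN
6  on         IN
7  financial  JJ
8  markets    NNS
9  .          .
[Figure: dependency graph over these tokens; labels visible in the figure: ROOT, NMOD, SBJ, NMOD, OBJ, PMOD, NMOD, P]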

Inductive dependency parsing
Deterministic syntactic analysis (parsing):
– Algorithm for deriving dependency structures
– Requires decision function in choice situations
– All decisions are final (deterministic)
Inductive machine learning:
– Decision function based on previous experience
– Generalize from examples (successive refinement)
– Examples = Annotated sentences (treebank)
– No grammar – just analogy

Algorithm
Data structures:
– Queue of unanalyzed words (next = first in queue)
– Stack of partially analyzed words (top = on top of stack)
Start state:
– Empty stack
– All words in queue
Algorithm steps:
– Shift: Put next on top of stack (push)
– Reduce: Remove top from stack (pop)
– Right: Put next on top of stack (push); link top → next
– Left: Remove top from stack (pop); link next → top
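
The four steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the MaltParser implementation: the function name `parse`, the arc format `(head, label, dependent)`, and above all the step sequence `STEPS` are hand-written for this example. The sequence is one that happens to produce a plausible analysis of the example sentence; in the real parser each step would be chosen by the decision function.

```python
def parse(words, steps):
    """Apply (step, label) pairs to a sentence; return arcs as (head, label, dependent)."""
    queue = list(range(1, len(words) + 1))  # token ids 1..n; next = queue[0]
    stack = []                              # start state: empty stack; top = stack[-1]
    arcs = []
    for step, label in steps:
        if step == "SHIFT":                 # push next
            stack.append(queue.pop(0))
        elif step == "REDUCE":              # pop top
            stack.pop()
        elif step == "RIGHT":               # link top -> next, then push next
            arcs.append((stack[-1], label, queue[0]))
            stack.append(queue.pop(0))
        elif step == "LEFT":                # link next -> top, then pop top
            arcs.append((queue[0], label, stack.pop()))
        else:
            raise ValueError(f"unknown step: {step}")
    return arcs

WORDS = "Economic news had little effect on financial markets .".split()

# Assumed step sequence, written by hand for this illustration.
STEPS = [
    ("SHIFT", None), ("LEFT", "NMOD"), ("SHIFT", None), ("LEFT", "SBJ"),
    ("SHIFT", None), ("SHIFT", None), ("LEFT", "NMOD"), ("RIGHT", "OBJ"),
    ("RIGHT", "NMOD"), ("SHIFT", None), ("LEFT", "NMOD"), ("RIGHT", "PMOD"),
    ("REDUCE", None), ("REDUCE", None), ("REDUCE", None), ("RIGHT", "P"),
]

for head, label, dep in parse(WORDS, STEPS):
    print(f"{WORDS[head - 1]} -{label}-> {WORDS[dep - 1]}")
```

Every word except "had" receives exactly one head in this run, so "had" is left as the root of the sentence.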

Algorithm example
[Figure: step-by-step parse of "Economic news had little effect on financial markets ." (ROOT = 0, words 1–9, POS tags JJ NN VBD JJ NN IN JJ NNS .), showing SHIFT, REDUCE, LA(label) and RA(label) steps and the resulting arcs labelled NMOD, SBJ, OBJ, PMOD and P]

Decision function
Non-determinism:
[Figure: "eats pizza with …", with an OBJ arc already built; candidate next steps: RA(ATT)? RE?]
Decision function:
– (Queue, Stack, Graph) → Step
Possible approaches:
– Grammar?
– Inductive generalization!

Machine learning
Decision function:
– (Queue, Stack, Graph) → Step
Model:
– (Queue, Stack, Graph) → (f1, …, fn)
Classifier:
– (f1, …, fn) → Step
Learning:
– { ((f1, …, fn), Step) } → Classifier

Model
Parts of speech: t1, top, next, n1, n2, n3
Dependency types: t.hd, t.ld, t.rd, n.ld
Word forms: top, next, top.hd, n1
[Figure: the stack (…, t1, top) and the queue (next, n1, n2, n3, …), with hd, ld and rd arcs marked on top and an ld arc on next]
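
A feature model of this kind maps a parser state to a tuple (f1, …, fn). The sketch below shows one way to do that in Python. The reading of the abbreviations is an assumption: t1 is taken to be the token below top on the stack, hd the arc from a token to its head, and ld/rd its leftmost/rightmost dependent. The function name `extract_features` and the dictionary-based representation of the graph are likewise illustrative.

```python
def extract_features(stack, queue, words, tags, head, deprel):
    """Map a parser state to a feature tuple.

    words, tags: dicts from token id to word form / part of speech
    head, deprel: dicts from dependent id to head id / dependency type
    """
    def at(seq, i):
        # Element i of seq, or None if that position does not exist.
        return seq[i] if -len(seq) <= i < len(seq) else None

    def outer_dependent(tok, side):
        # Leftmost or rightmost dependent of tok in the partial graph.
        if tok is None:
            return None
        deps = sorted(d for d, h in head.items() if h == tok)
        if side == "left":
            deps = [d for d in deps if d < tok]
            return deps[0] if deps else None
        deps = [d for d in deps if d > tok]
        return deps[-1] if deps else None

    t1, top = at(stack, -2), at(stack, -1)
    nxt, n1, n2, n3 = (at(queue, i) for i in range(4))

    pos = [tags.get(t) for t in (t1, top, nxt, n1, n2, n3)]
    dep = [deprel.get(top),
           deprel.get(outer_dependent(top, "left")),
           deprel.get(outer_dependent(top, "right")),
           deprel.get(outer_dependent(nxt, "left"))]
    form = [words.get(top), words.get(nxt), words.get(head.get(top)), words.get(n1)]
    return tuple(pos + dep + form)

SENT = "Economic news had little effect on financial markets .".split()
POS = "JJ NN VBD JJ NN IN JJ NNS .".split()
words = dict(enumerate(SENT, start=1))
tags = dict(enumerate(POS, start=1))

# Hypothetical state: "had" and "effect" on the stack, "on" next in the queue.
features = extract_features(
    stack=[3, 5], queue=[6, 7, 8, 9], words=words, tags=tags,
    head={1: 2, 2: 3, 4: 5, 5: 3},
    deprel={1: "NMOD", 2: "SBJ", 4: "NMOD", 5: "OBJ"},
)
print(features)
```

Features that are undefined in a given state (for example t.rd when top has no right dependent yet) come out as `None`, which a classifier can treat as just another value.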

Memory-based learning
Memory-based learning and classification:
– Learning is storing experiences in memory.
– Problem solving is achieved by reusing solutions of similar problems experienced in the past.
TiMBL (Tilburg Memory-Based Learner):
– Basic method: k-nearest neighbor
– Parameters:
    Number of neighbors (k)
    Distance metrics
    Weighting of attributes, values and instances

Learning example
Instance base:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A
New instance:
5. (a, b, b, a)
Distances:
1. D(1, 5) = 2
2. D(2, 5) = 1
3. D(3, 5) = 4
4. D(4, 5) = 3
k-NN:
1. 1-NN(5) = B
2. 2-NN(5) = A/B
3. 3-NN(5) = A
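
The numbers in this example can be reproduced with a few lines of Python. This is a plain k-nearest-neighbor sketch with an unweighted overlap distance (the number of mismatching attributes); it does not use TiMBL, and the names `overlap_distance` and `knn` are assumptions made for the illustration.

```python
from collections import Counter

# Instance base from the slide: (attribute values, class).
INSTANCES = [
    (("a", "b", "a", "c"), "A"),
    (("a", "b", "c", "a"), "B"),
    (("b", "a", "c", "c"), "C"),
    (("c", "a", "b", "c"), "A"),
]
NEW = ("a", "b", "b", "a")

def overlap_distance(x, y):
    """Number of attribute positions where x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def knn(instance, k):
    """Return the majority class(es) among the k nearest stored instances."""
    ranked = sorted(INSTANCES, key=lambda ex: overlap_distance(ex[0], instance))
    votes = Counter(label for _, label in ranked[:k])
    best = max(votes.values())
    return sorted(label for label, n in votes.items() if n == best)

print([overlap_distance(x, NEW) for x, _ in INSTANCES])
for k in (1, 2, 3):
    print(k, knn(NEW, k))
```

A tie, as with k = 2 here, is returned as a list of all tied classes; a real learner needs some policy for breaking such ties.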

Experimental evaluation
Inductive dependency analysis:
– Deterministic algorithm
– Memory-based decision function
Data:
– English: Penn Treebank, WSJ (1M words)
    Converted to dependency structure
– Swedish: Talbanken, professional prose (100k words)
    Dependency structure based on MAMBA annotation

Results
English:
– 87.3% of all words got the correct head
– 85.6% of all words got the correct head and label
Swedish:
– 85.9% of all words got the correct head
– 81.6% of all words got the correct head and label
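
The two figures reported per language are the share of words with the correct head and the share with the correct head and label. A minimal Python sketch of how such scores can be computed follows. The function name `attachment_scores` is an assumption, the gold analysis is one plausible analysis of the example sentence, and the predicted analysis is invented, with one wrong head and one wrong label.

```python
def attachment_scores(gold, predicted):
    """gold, predicted: one (head, label) pair per word.

    Returns (share with correct head, share with correct head and label).
    """
    if len(gold) != len(predicted):
        raise ValueError("analyses must cover the same words")
    total = len(gold)
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return head_ok / total, both_ok / total

# "Economic news had little effect on financial markets ." (head 0 = ROOT)
GOLD = [(2, "NMOD"), (3, "SBJ"), (0, "ROOT"), (5, "NMOD"), (3, "OBJ"),
        (5, "NMOD"), (8, "NMOD"), (6, "PMOD"), (3, "P")]

# Invented parser output: word 5 has a wrong label, word 6 a wrong head.
PREDICTED = [(2, "NMOD"), (3, "SBJ"), (0, "ROOT"), (5, "NMOD"), (3, "PRD"),
             (3, "NMOD"), (8, "NMOD"), (6, "PMOD"), (3, "P")]

unlabeled, labeled = attachment_scores(GOLD, PREDICTED)
print(f"correct head: {unlabeled:.1%}, correct head and label: {labeled:.1%}")
```

A wrong head always counts against both scores, while a wrong label on a correctly attached word lowers only the second one.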

Dependency types: English
High precision (86% ≤ F):
  VC (auxiliary verb → main verb)    95.0%
  NMOD (noun modifier)               91.0%
  SBJ (verb → subject)               89.3%
  PMOD (complement of preposition)   88.6%
  SBAR (complementizer → verb)       86.1%
Medium precision (73% ≤ F ≤ 83%):
  ROOT                               82.4%
  OBJ (verb → object)                81.1%
  VMOD (adverbial)                   76.8%
  AMOD (adj/adv modifier)            76.7%
  PRD (predicative complement)       73.8%
Low precision (F ≤ 70%):
  DEP (other)

Dependency types: Swedish
High precision (84% ≤ F):
  IM (infinitive marker → infinitive)  98.5%
  PR (preposition → noun)              90.6%
  UK (complementizer → verb)           86.4%
  VC (auxiliary verb → main verb)      86.1%
  DET (noun → determiner)              89.5%
  ROOT                                 87.8%
  SUB (verb → subject)                 84.5%
Medium precision (76% ≤ F ≤ 80%):
  ATT (noun modifier)                  79.2%
  CC (coordination)                    78.9%
  OBJ (verb → object)                  77.7%
  PRD (verb → predicative)             76.8%
  ADV (adverbial)                      76.3%
Low precision (F ≤ 70%):
  INF, APP, XX, ID

Corpus annotation
How good is 85%?
– Good enough to save time for manual annotators
– Good enough to improve search precision
– Recent release: SUC with syntactic annotation
How can accuracy be improved further?
– By annotating more data, which facilitates machine learning
– By refined linguistic analysis of the structures to be annotated and of the errors made

MaltParser
Software for inductive dependency parsing:
– Freely available (open source): http://maltparser.org
– Evaluated on close to 30 different languages
– Used for annotating corpora at Uppsala University
