Language Identification and Part-of-Speech Tagging


1 Language Identification and Part-of-Speech Tagging
Keren Solodkin. Based on a paper by Sarah Schulz and Mareike Keller. Digital Humanities seminar, 2016

2 Plan
Introduction and Related Work
Training Data
Processing of Mixed Text
Results
Tools for Digital Humanities
Conclusion and Future Work

3 Introduction – Code Switching
Two or more linguistic varieties in a single conversation
Highly frequent in spoken language and in social media
Can also be observed in medieval writing
Historical mixed text is an underused source of information

4 Example

5 Introduction – The Project
Automatic language identification (LID) and POS tagging
Mixed Latin-Middle English text
Make tools available to Humanities scholars
Analysis of code-switching rules within nominal phrases
Historical multilingualism research
Computational linguistics

6 Related Work – LID
Lyu and Lyu (2008) – Mandarin-Taiwanese
Solorio and Liu (2008) – Spanish-English
Yeong and Tan (2011) – Malay-English

7 Related Work – POS tagging
Solorio and Liu (2008)
Rodrigues and Kübler (2013)
Jamatia et al. (2015)

8 Training Data
Macaronic sermons (Horner, 2006)
Mixed Latin-Middle English text
Annotated with language and part-of-speech information
The annotated corpus comprises about 3000 tokens
159 sentences, average length of 19.4 tokens

9 Training Data
Table 1: Labels annotated for LID along with an explanation for each label and the occurrence in percent

10 Training Data
Table 2: Labels annotated for POS tagging along with the explanation for each label and the occurrence in percent

11 Processing of Mixed Text
Two models:
POS tagging builds upon the results of the LID
POS tagging and LID do not inform each other
LID is a prerequisite for any further processing of mixed text
LID needs to be solved with high accuracy

12 Processing of Mixed Text – LID
Following Solorio and Liu (2008)
No lemmatizer is available for Middle English
Include POS-informed word lists for both languages
Middle English – Penn Parsed Corpora of Historical English
Latin – the Universal Dependencies treebank
If a word is found in one of the lists, its POS is added as a feature
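
The word-list lookup described above can be sketched as follows; the sample entries and tag names here are invented for illustration (the real lists are derived from the Penn Parsed Corpora of Historical English and the Latin treebank):

```python
# Sketch of the POS-informed word-list lookup. Entries are invented
# examples; the real lists come from the annotated treebanks.
LATIN_WORDLIST = {"dominus": "NOUN", "dicit": "VERB", "et": "CCONJ"}
MIDDLE_ENGLISH_WORDLIST = {"kyng": "N", "seyde": "VBD", "and": "CONJ"}

def wordlist_features(token):
    """Return the POS tag a token receives from each language's list.

    A token found in one list contributes that list's POS tag as a
    feature; tokens found in neither list yield None for both.
    """
    return {
        "pos_latin": LATIN_WORDLIST.get(token.lower()),
        "pos_middle_english": MIDDLE_ENGLISH_WORDLIST.get(token.lower()),
    }
```

A token found in only one list is weak evidence for that language; a token found in both (like function words shared across lists) leaves the decision to the other features.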

13 Processing of Mixed Text – CRF Classifiers
Conditional Random Fields
Take context into account
A set of feature functions with weights
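
The CRF idea of weighted feature functions can be illustrated with a toy scorer (this omits normalization and Viterbi decoding, and the features and weights are invented, not the system's):

```python
# Toy illustration of the CRF scoring idea: a label sequence is scored
# by a weighted sum of feature functions over (previous label, label,
# token). Features and weights here are invented for illustration.

def f_latin_suffix(prev, label, token):
    # Fires when a token with a typical Latin ending gets the LATIN label.
    return 1.0 if label == "LATIN" and token.endswith(("us", "um", "it")) else 0.0

def f_label_continuity(prev, label, token):
    # Fires when the label does not switch between adjacent tokens.
    return 1.0 if prev == label else 0.0

FEATURES = [(f_latin_suffix, 2.0), (f_label_continuity, 0.5)]

def sequence_score(tokens, labels):
    """Unnormalised score of a label sequence under the toy model."""
    score = 0.0
    prev = "START"
    for token, label in zip(tokens, labels):
        for fn, weight in FEATURES:
            score += weight * fn(prev, label, token)
        prev = label
    return score
```

A real CRF learns the weights from the annotated data and picks the highest-scoring label sequence over the whole sentence, which is what lets context influence each decision.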

14 Processing of Mixed Text – LID
CRF classifiers are known to be successful for sequence labeling tasks
Latin is characterized by a relatively restricted suffix assignment
A context window of 5 tokens was used for all features

15 Processing of Mixed Text – LID
Feature functions:
Surface form
POS tag Latin
POS tag Middle English
POS from Middle English word list
POS from Latin word list
Character-unigram prefix
Character-bigram prefix
Character-trigram prefix
Character-unigram suffix
Character-bigram suffix
Character-trigram suffix
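
The character n-gram and context-window features can be sketched like this (the POS-tagger and word-list features are omitted for brevity; the function names are illustrative, not from the paper):

```python
def char_ngram_features(token, n_max=3):
    """Character n-gram prefix and suffix features for n = 1..3,
    mirroring the six character features listed on the slide."""
    feats = {}
    for n in range(1, n_max + 1):
        feats[f"prefix_{n}"] = token[:n]
        feats[f"suffix_{n}"] = token[-n:]
    return feats

def token_features(tokens, i, window=2):
    """Features for token i over a 5-token context window
    (the token plus two neighbours on each side)."""
    feats = {"surface": tokens[i]}
    feats.update(char_ngram_features(tokens[i]))
    for offset in range(-window, window + 1):
        j = i + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats[f"surface[{offset:+d}]"] = tokens[j]
    return feats
```

Suffix features are what exploit the restricted Latin suffix inventory mentioned on the previous slide: endings like "-us" or "-it" are strong language cues even for out-of-vocabulary tokens.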

16 Processing of Mixed Text – POS Tagging
For POS tagging, the same features are used
Plus the information generated by the LID system (feature 12a)
Performance is also evaluated with gold LID labels (feature 12b)
Differences in LID quality influence the POS tagging quality

17 Processing of Mixed Text – POS Tagging
Features (continued):
12a. LID label predicted by the LID system
12b. Gold LID label manually annotated for our corpus

18 Results
The evaluation was a 10-fold cross-validation
90% for training, 10% for testing
The reported results are averages over all folds
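
The evaluation setup can be sketched as plain index splitting (a minimal version without shuffling or stratification; library helpers such as scikit-learn's KFold do the same job):

```python
def kfold_indices(n_items, k=10):
    """Split item indices into k folds; each fold serves once as the
    10% test set while the remaining folds form the 90% training set."""
    folds = [[] for _ in range(k)]
    for i in range(n_items):
        folds[i % k].append(i)
    splits = []
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        splits.append((train, test))
    return splits

def cross_validated_score(per_fold_scores):
    """Average a per-fold metric, as in the reported results."""
    return sum(per_fold_scores) / len(per_fold_scores)
```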

19 Results – LID
Majority baseline: the text is Latin featuring Middle English insertions
A combination of all-Latin labeling and perfect punctuation labeling
Per-class precision, recall and F-score
Macro-averages for the overall system
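
The reported metrics follow the standard definitions, sketched below: per-class precision, recall and F-score, then an unweighted (macro) average over the classes.

```python
def per_class_prf(gold, predicted, label):
    """Precision, recall and F-score for one class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g == p == label)
    pred_pos = sum(1 for p in predicted if p == label)
    gold_pos = sum(1 for g in gold if g == label)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

def macro_average(gold, predicted):
    """Unweighted mean of the per-class scores over all classes,
    so rare classes count as much as frequent ones."""
    labels = sorted(set(gold) | set(predicted))
    scores = [per_class_prf(gold, predicted, lab) for lab in labels]
    n = len(scores)
    return tuple(sum(s[i] for s in scores) / n for i in range(3))
```

Macro-averaging matters here because the label distribution is heavily skewed towards Latin; a micro-average would largely reflect the majority class.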

20 Results – LID

21 Results – LID Table 3: Performance of the CRF system for language
identification compared to the baseline. Precision, recall and F-score per class and macro-average of all classes.

22 Results – LID Table 4: Percentage of incorrectly labeled tokens
per class along with the distribution of incorrect labels among the other labels.

23 Results – POS Tagging
Majority baseline
The output of the monolingual Latin tagger (the majority language)
Confidence baseline
Choose the POS label of the monolingual tagger with the higher confidence
If the label indicates a foreign word, the label from the Middle English tagger is chosen
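
The confidence baseline can be sketched as a small decision rule (tag names and the "FW" foreign-word label are illustrative assumptions, not taken from the paper's tagsets):

```python
def confidence_baseline(latin_tag, latin_conf, me_tag, me_conf,
                        foreign_label="FW"):
    """Pick the monolingual tagger output with the higher confidence.

    If the winning Latin label marks the token as a foreign word, the
    token is probably English, so the Middle English tagger's label is
    used instead.
    """
    if latin_conf >= me_conf:
        return me_tag if latin_tag == foreign_label else latin_tag
    return me_tag
```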

24 Results – POS Tagging
Table 5: Performance of the CRF system for POS tagging compared to the majority baseline (BL1) and the confidence baseline (BL2). CRFbase: system with 11 basic features, CRFpredLID: system with predicted LID as an additional feature, CRFgoldLID: system with gold-standard LID as an additional feature. Precision (P), recall (R) and F-score (F) per class and macro-average of all classes.

27 Results – POS Tagging
The high average recall of almost 80% is important for the task
Precision has lower priority: the extracted phrases are manually inspected afterwards
The CRFpredLID system shows an increase in performance
The CRFgoldLID system yields the best performance
The differences are not statistically significant

28 Results – POS Tagging Table 6: Percentage of incorrectly labeled tokens per class along with the distribution of incorrect labels among the other labels for CRFpredLID system.

29 Results – POS Tagging

30 Results – POS Tagging

31 Results – POS Tagging
Incorrectly tagged words appear in POS sequences that rarely appear in the training data
Adding more training data will decrease errors of this kind

32 Results – Training Data Size
Data sparsity is a general issue when dealing with historical text
Investigate how different sizes of the training set influence the results:
800 tokens
1600 tokens
2400 tokens (the complete training set)

33 Results – Training Data Size
Table 7: Different portions of the training set along with precision, recall and F-score for LID and POS tagging.

34 Tools for Digital Humanities
The aim is not only to build a system
Enable Humanities scholars to process their data easily
A simple web service in Java
The data is returned in the ICARUS format
Inspect the data
Pose complex search requests combining language information and POS tags

35 Figure 1: Search interface of ICARUS returning results on a query for an English adjective followed by a Latin noun within the next 3 tokens.

36 Tools for Digital Humanities
The method can easily be adapted to other languages
Fitting monolingual taggers (e.g., TreeTagger)
POS-informed word lists (if available)
The code is publicly available on GitHub

37 Conclusion
We saw the implementation and application of two systems developed for a specific purpose
The results are reasonable given the very small amount of training data
The training data can be extended and some errors corrected, for example by adding monolingual Middle English data

38 Future Work
Jointly modeling LID and POS tagging
A dependency parser for mixed text
Insights into the constraints on intra-sentential code-switching

39 Conclusion and Future Work
Collaboration between the Humanities and Computer Science
Task-oriented tool development
Immediate feedback on performance
Systems are applied to real-world data
A way to give Computer Science the chance to support other fields and to find new and interesting challenges

40 Questions?

