Published by Adrianna Christenberry. Modified over 10 years ago.
1
A Text Processing Tool for the Romanian Language
Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa, {ofrunza,diana}@site.uottawa.ca
David Nadeau, Institute for Information Technology, National Research Council of Canada, David.Nadeau@nrc-cnrc.gc.ca
2
Outline
- BALIE System
- RO-BALIE
- Capabilities
- Improvements
- Evaluation & Results
- Future Work
3
BALIE: BaseLine Information Extraction
- Multilingual information extraction system
  - Language identification
  - Tokenization
  - Sentence boundary detection
  - Part-of-speech tagging for English, French, German, Spanish [1]
- Trainable, open-source Java system
- Uses WEKA [2], a machine learning tool
- Uses QTag [3], a language-independent probabilistic part-of-speech tagger
4
BALIE: BaseLine Information Extraction (cont.)
Input example:
1. Introduction
Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.
5
BALIE: BaseLine Information Extraction (cont.)
Output:
1. Introduction Information …
6
RO-BALIE Improvements
- Easier manipulation of the input and output texts
- A new tag set that maps the numerical tag set used internally by BALIE
- More information in the output provided by the system
- Available at: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html
7
RO-BALIE Language Identification
- Features: 2-grams (sequences of 2 characters)
- Classifier: Naïve Bayes
- Overall accuracy: 99.25%

Language   Train files   Test files   Correctly classified   Accuracy
English    50            27                                  100%
French     50            26           25                     96%
Spanish    50            25                                  100%
German     50            27                                  100%
Romanian   50            32                                  100%
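The character 2-gram plus Naïve Bayes idea on this slide can be illustrated with a minimal sketch. It is not the BALIE/WEKA implementation: the class name `BigramNaiveBayes`, the add-one smoothing, and the toy training sentences are all assumptions made for illustration, and the training data is far smaller than the 50 files per language reported above.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the overlapping 2-character sequences of the lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character 2-grams (add-one smoothing assumed)."""

    def __init__(self):
        self.bigram_counts = {}       # language -> Counter of 2-grams
        self.doc_counts = Counter()   # language -> number of training texts
        self.vocabulary = set()       # all 2-grams seen in training

    def train(self, labelled_texts):
        for language, text in labelled_texts:
            grams = char_bigrams(text)
            self.bigram_counts.setdefault(language, Counter()).update(grams)
            self.doc_counts[language] += 1
            self.vocabulary.update(grams)

    def classify(self, text):
        grams = char_bigrams(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_language, best_score = None, -math.inf
        for language, counts in self.bigram_counts.items():
            total = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods of each 2-gram.
            score = math.log(self.doc_counts[language] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data (hypothetical, for illustration only).
model = BigramNaiveBayes()
model.train([
    ("English", "the quick brown fox jumps over the lazy dog and the cat"),
    ("Romanian", "acesta este un text scris in limba romana pentru un exemplu"),
    ("French", "ceci est un texte ecrit en francais pour donner un exemple"),
])
print(model.classify("this is the thing that they thought"))
```

Working in log space avoids numeric underflow when a document contains many 2-grams; the smoothing keeps unseen 2-grams from zeroing out a language's score.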
8
RO-BALIE (cont.)
Tokenization
- Split each compound word based on “-” and “/”
- Examples: iat-o, socio-economic
- Tokenization results:

Tokens   Precision   Recall
904      99.5%       98.7%
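A minimal sketch of the splitting rule on this slide, together with one way token-level precision and recall could be computed. The slide does not say whether “-” and “/” are kept as tokens or how the scores were calculated, so keeping the separators as separate tokens and scoring by exact character spans are both assumptions; the names `tokenize`, `token_spans`, and `precision_recall` are hypothetical.

```python
import re

# Assumption: runs of word characters form tokens, and every other
# non-space character (including "-" and "/") becomes its own token.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into tokens, breaking compound words on '-' and '/'."""
    return TOKEN_PATTERN.findall(text)


def token_spans(text):
    """Return the set of (start, end) character spans of the tokens."""
    return {match.span() for match in TOKEN_PATTERN.finditer(text)}


def precision_recall(predicted_spans, gold_spans):
    """Score predicted token spans against hand-annotated gold spans."""
    correct = len(predicted_spans & gold_spans)
    precision = correct / len(predicted_spans) if predicted_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    return precision, recall


print(tokenize("iat-o"))           # ['iat', '-', 'o']
print(tokenize("socio-economic"))  # ['socio', '-', 'economic']
```

Under this scoring, a token counts as correct only if both its start and end offsets match the gold annotation.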
9
RO-BALIE (cont.)
Sentence Boundary Detection
- Training: 106 hand-tagged English sentences
- Decision tree classifier
- Features:
  - Beginning of the sentence (first token)
  - Previous token
  - Current token
  - Next token
10
RO-BALIE (cont.)
Sentence Boundary Detection (cont.)
- Feature values: Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc.
- A list of Romanian abbreviations (510 entries)
- Evaluation on Orwell’s novel 1984:

Text       Accuracy   Precision   Recall
Romanian   97%        92%         71%
English    97.5%      96.5%       82%
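The four features (first token of the sentence, previous, current, and next token) and their categorical values can be sketched as below. This only shows how a feature vector might be built for a candidate boundary; the decision tree itself is not included. The abbreviation set is a tiny hypothetical stand-in for the 510-entry list, and the function names, quote characters, and the "Other"/"None" values are assumptions.

```python
# Hypothetical stand-in for the 510-entry Romanian abbreviation list.
ABBREVIATIONS = {"dl.", "dna.", "dr.", "prof.", "nr.", "etc."}
OPEN_QUOTES = {"„", "«", "“"}
CLOSE_QUOTES = {"”", "»"}


def feature_value(token):
    """Map a token to one of the categorical feature values."""
    if token == "\n":
        return "NewLine"
    if token == ".":
        return "Period"
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"


def candidate_features(tokens, index, sentence_start):
    """Build the feature vector for the token at `index`.

    `sentence_start` is the index of the first token of the current sentence.
    Positions outside the token list get the value "None".
    """
    def value_at(i):
        return feature_value(tokens[i]) if 0 <= i < len(tokens) else "None"

    return {
        "first": value_at(sentence_start),
        "previous": value_at(index - 1),
        "current": value_at(index),
        "next": value_at(index + 1),
    }


tokens = ["Dl.", "Popescu", "a", "venit", ".", "Apoi", "a", "plecat", "."]
print(candidate_features(tokens, 4, 0))
```

Vectors like this, paired with hand-tagged boundary labels, are the kind of input a decision tree classifier would be trained on.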
11
RO-BALIE (cont.)
Part-of-speech tagging with the QTag tagger
- Used a corpus of 40 million words of newspaper articles
  - Romanian newspapers
  - 3-year period
- The training corpus is 98% accurate
- Our system has a tagset of 14 tags for parts of speech and 30 tags for punctuation

Train corpus        Test corpus    Accuracy
2.5 million words   13,425 words   95.3%
12
RO-BALIE (cont.)
Output for: Apel tirziu si inutil NISTORESCU.
Apel tirziu si inutil NISTORESCU.
13
RO-BALIE (cont.)
Future Work
- Use machine learning for the tokenization task
- Add new services: morphological analysis, named entity recognition, etc.
- Add more specific information for each supported language
14
RO-BALIE (cont.)
References
1. http://balie.sourceforge.net/index.html
2. http://www.cs.waikato.ac.nz/~ml/weka/
3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
RO-BALIE: http://www.site.uottawa.ca/~ofrunza/RO-Balie/RO-Balie.html
15
THANK YOU! Questions?