A Text Processing Tool for the Romanian Language. Oana Frunza, Diana Inkpen, and David Nadeau.

A Text Processing Tool for the Romanian Language. Oana Frunza and Diana Inkpen, School of Information Technology and Engineering, University of Ottawa. David Nadeau, Institute for Information Technology, National Research Council of Canada.

Outline: BALIE System; RO-BALIE: Capabilities, Improvements, Evaluation & Results; Future Work.

BALIE - BaseLine Information Extraction. Multilingual information extraction system providing language identification, tokenization, sentence boundary detection, and part-of-speech tagging for English, French, German, and Spanish [1]. Trainable, open-source Java system. Uses WEKA [2], a machine learning toolkit, and QTag [3], a language-independent probabilistic part-of-speech tagger.

BALIE - BaseLine Information Extraction (cont.) Input example: 1. Introduction Information Extraction (IE) is the name given to any process which selectively structures and combines data which is found, explicitly stated or implied, in one or more texts.

BALIE- BaseLine Information Extraction (cont.) Output 1. Introduction Information …

RO-BALIE Improvements: Easier manipulation of the input and output texts. A new tag set that maps the numerical tag set used internally by BALIE. More information in the output provided by the system. Available at: Balie/RO-Balie.html

RO-BALIE Language Identification. Features: character 2-grams (sequences of 2 characters). Classifier: Naïve Bayes. Overall accuracy: 99.25%. Per-language results (values not preserved): columns Language, Files Train, Files Test, Correctly classified, Accuracy; rows English, French, Spanish, German, Romanian.
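
The combination of character 2-grams and a Naïve Bayes classifier can be sketched in a few lines. This is a minimal illustration only, not the BALIE/WEKA implementation: the class name `BigramNaiveBayes`, the add-one smoothing, and the two toy training sentences are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def char_bigrams(text):
    """Return the overlapping 2-character sequences of a lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Toy multinomial Naive Bayes over character 2-grams (hypothetical helper)."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # language -> 2-gram counts
        self.docs = Counter()               # language -> number of training texts
        self.vocab = set()                  # all 2-grams seen in training

    def train(self, text, language):
        grams = char_bigrams(text)
        self.counts[language].update(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def classify(self, text):
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        best_language, best_score = None, -math.inf
        for language, grams in self.counts.items():
            total_grams = sum(grams.values())
            # log prior + sum of log likelihoods with add-one smoothing (assumption)
            score = math.log(self.docs[language] / total_docs)
            for gram in char_bigrams(text):
                score += math.log((grams[gram] + 1) / (total_grams + vocab_size))
            if score > best_score:
                best_language, best_score = language, score
        return best_language


# Toy training data, invented for illustration only.
clf = BigramNaiveBayes()
clf.train("the cat sat on the mat and the dog ran", "english")
clf.train("pisica a stat pe covor si ciinele a fugit", "romanian")
print(clf.classify("the dog and the cat"))
print(clf.classify("pisica a fugit"))
```

A real system would train on many files per language, as the per-language table on the slide indicates.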

RO-BALIE (cont.) Tokenization. Each compound word is split on “-” and “/”. Examples: iat-o, socio-economic. Tokenization results (Tokens, Precision, Recall): …, …%, 98.7%.
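
The splitting rule can be illustrated as follows. This is a sketch under assumptions: the slide does not say whether the separators are kept as tokens, so keeping them is a choice made here, and the function name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Split on whitespace, then split each compound word on '-' and '/'."""
    tokens = []
    for chunk in text.split():
        # The capturing group keeps '-' and '/' as separate tokens
        # (an assumption; the slide does not specify this).
        tokens.extend(part for part in re.split(r"([-/])", chunk) if part)
    return tokens


print(tokenize("iat-o socio-economic"))
```

With this rule, the slide's examples iat-o and socio-economic each yield three tokens: the two word parts and the hyphen between them.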

RO-BALIE (cont.) Sentence Boundary Detection. Training: 106 hand-tagged English sentences. Classifier: decision tree. Features: beginning of the sentence (first token), previous token, current token, next token.

RO-BALIE (cont.) Sentence Boundary Detection (cont.) Feature values: Period, Open Quote, Close Quote, New Line, Capital Word, Digit, Abbreviation, etc. A list of 510 Romanian abbreviations is used. Evaluation on Orwell’s novel 1984:
Text      Accuracy  Precision  Recall
Romanian  97%       92%        71%
English   97.5%     96.5%      82%
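
The four features and their values could be extracted roughly as below before being passed to a decision tree learner. This is a hedged sketch: the function names, the `Other` and `None` fallback values, the interpretation of "beginning of the sentence" as the class of the sentence's first token, and the two-entry abbreviation set (a stand-in for the 510-entry Romanian list) are all assumptions.

```python
# Stand-in for the Romanian abbreviation list (hypothetical entries).
ABBREVIATIONS = {"dr.", "etc."}

OPEN_QUOTES = {"“", "„", "«"}
CLOSE_QUOTES = {"”", "»"}


def token_class(token):
    """Map a token to one of the feature values named on the slide."""
    if token is None:
        return "None"            # no token at this position (assumption)
    if token == "\n":
        return "NewLine"
    if token == ".":
        return "Period"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"    # checked before CapitalWord so "Dr." is not a capital word
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"               # catch-all value (assumption)


def boundary_features(tokens, i, sentence_start):
    """Features for deciding whether tokens[i] ends the current sentence."""
    previous = tokens[i - 1] if i > 0 else None
    following = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "first": token_class(tokens[sentence_start]),
        "previous": token_class(previous),
        "current": token_class(tokens[i]),
        "next": token_class(following),
    }


tokens = ["Dr.", "Popescu", "a", "plecat", ".", "Apoi", "a", "revenit", "."]
print(boundary_features(tokens, 4, 0))
```

Each candidate position then becomes one training instance whose label is "boundary" or "not boundary".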

RO-BALIE (cont.) Part-of-speech tagging with the QTag tagger. Corpus: 40 million words of Romanian newspaper articles covering a 3-year period. The training corpus is 98% accurate. Tagset: 14 POS tags and 30 punctuation tags. Results (Train corpus, Test corpus, Accuracy): 2.5 million words, … words, 95.3%.
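
To make the accuracy figure concrete, the sketch below trains a most-frequent-tag baseline and scores it by token-level accuracy (correctly tagged tokens divided by all tokens). It is not QTag's algorithm, which is a probabilistic tagger; the tag names and the toy sentences are invented for illustration.

```python
from collections import Counter, defaultdict


def train_unigram_tagger(tagged_sentences):
    """Learn the most frequent tag for each word (a simple baseline, not QTag)."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, default="NOUN"):
    """Tag each word; unknown words get the default tag (assumption)."""
    return [(word, model.get(word.lower(), default)) for word in words]


def accuracy(gold_sentences, model):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sentence in gold_sentences:
        predicted = tag([word for word, _ in sentence], model)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# Toy data with hypothetical tags, for illustration only.
train = [[("apel", "NOUN"), ("tirziu", "ADJ"), ("si", "CONJ"), ("inutil", "ADJ")]]
test = [[("apel", "NOUN"), ("inutil", "ADJ"), ("si", "CONJ"), ("rapid", "ADJ")]]
model = train_unigram_tagger(train)
print(accuracy(test, model))
```

In the toy test sentence, the unseen word "rapid" receives the default tag, so three of four tokens are tagged correctly.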

RO-BALIE (cont.) Output for Apel tirziu si inutil NISTORESCU. Apel tirziu si inutil NISTORESCU.

RO-BALIE (cont.) Future Work: Use machine learning for the tokenization task. Add new services: morphological analysis, named entity recognition, etc. Add more specific information for each supported language.

RO-BALIE (cont.) References ag.htmlhttp:// ag.html Balie.html

THANK YOU! Questions?