Identification of Composite Named Entities in a Spanish Textual Database
Sofía N. Galicia-Haro, Facultad de Ciencias - UNAM
Alexander F. Gelbukh and Igor A. Bolshakov, Lab. Lenguaje Natural, CIC – IPN
México, D. F.

Contents
- Introduction
- Named Entities in Textual Databases
- NE Analysis
- Recognition Method
- Conclusions

Textual Databases
Texts have been entered into computers and put on the Web
- to save tons of paper
- to allow people to have remote access
- to provide much better access to texts in electronic format, etc.
Searching through this huge body of material for information is a time-consuming task

Named Entities
- The NEs mentioned in textual databases constitute an important part of their semantic content
- In a collection of electronic political texts, almost 50% of all sentences contain at least one NE
- This indicates the relevance of NE identification and its role in document indexing and retrieval

Composite Named Entities
- NEs with coordinated constituents: Luz y Fuerza del Centro
- NEs with prepositional phrases: Ejército Zapatista de Liberación Nacional

NEs in Mexican Textual DB
- NEs appear in roughly 50% of the sentences
- A selection of Collection 1 was taken for training
Collections of Mexican political texts
                               Coll. 1    Coll. 2
# Sentences                    442,719    208,298
# Sentences w/named entities   243,165    100,602
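The share of sentences containing a named entity follows directly from the counts on this slide. A minimal sketch of that arithmetic (the dictionary layout and the name ne_share are illustrative, not from the presentation):

```python
# Sentence counts copied from the slide's table.
collections = {
    "Coll. 1": {"sentences": 442_719, "with_ne": 243_165},
    "Coll. 2": {"sentences": 208_298, "with_ne": 100_602},
}

def ne_share(counts):
    """Fraction of sentences that contain at least one named entity."""
    return counts["with_ne"] / counts["sentences"]

for name, counts in collections.items():
    # Prints one line per collection, e.g. "Coll. 1: 54.9%".
    print(f"{name}: {ne_share(counts):.1%}")
```

With these counts the shares come out at about 54.9% for Coll. 1 and 48.3% for Coll. 2.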

Initial NE Recognition Step
- Identification of linguistic characteristics
  Example: prepositions that link two different NEs are included in the NE
- Identification of style characteristics
  Example: specific words introduce convention names
  coordinadora del programa Mundo Maya ‘Mundo Maya program’s coordinator’

Training File
- A Perl program extracts “compounds”:
  Los miembros del Ejercito Federal (1) lejos de aplicar la Ley sobre Armas de Fuego y Explosivos parecen (2) proteger a los participantes en el tiroteo.
- Compounds contain no more than three non-capitalized words between capitalized words
- Compounds are delimited on the left and right by a punctuation mark or a word
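The extraction rule can be sketched in a few lines. The presentation's program was written in Perl and is not shown, so the Python version below is only an illustration under stated assumptions: the words allowed between capitalized words are restricted to a small, hypothetical list of functional words (FUNCTIONAL), and a sentence-initial functional word such as Los is skipped even though it is capitalized.

```python
import re

# Hypothetical, deliberately tiny list of Spanish functional words (assumption).
FUNCTIONAL = {"de", "del", "la", "las", "el", "los", "y", "e",
              "sobre", "por", "para", "en", "a"}
MAX_GAP = 3  # at most three non-capitalized words between capitalized words

def extract_compounds(sentence):
    """Return groups of capitalized words joined by short runs of functional words."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    compounds, current, gap = [], [], []

    def flush():
        # Close the current compound; trailing functional words are discarded.
        nonlocal current, gap
        if current:
            compounds.append(" ".join(current))
        current, gap = [], []

    for i, tok in enumerate(tokens):
        if tok[0].isupper():
            if i == 0 and tok.lower() in FUNCTIONAL:
                continue  # sentence-initial article or preposition (assumption)
            current.extend(gap)
            current.append(tok)
            gap = []
        elif current and tok.lower() in FUNCTIONAL and len(gap) < MAX_GAP:
            gap.append(tok)
        else:
            flush()  # punctuation or any other word delimits the compound
    flush()
    return compounds

sentence = ("Los miembros del Ejercito Federal lejos de aplicar la Ley sobre "
            "Armas de Fuego y Explosivos parecen proteger a los participantes "
            "en el tiroteo.")
print(extract_compounds(sentence))
```

On the slide's example sentence (with the (1) and (2) markers removed) this yields the two compounds Ejercito Federal and Ley sobre Armas de Fuego y Explosivos.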

Sentences of Coll. 1
- From 243,165 sentences, 472,087 compounds were extracted
- 500 randomly selected sentences were manually analyzed
- Main result of the analysis: syntactic ambiguity is frequent

Syntactic Ambiguity
- Coordination of coordinated names
  Comisión Federal de Electricidad y Luz y Fuerza del Centro
  Margarita Diéguez y Armas y Carlos Virgilio
- Prepositional phrase attachment: different names linked by prepositions
  Comandancia General del Ejército Zapatista de Liberación Nacional
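To see why coordination is ambiguous, it helps to enumerate every way a compound could be cut at its conjunctions. The sketch below is only an illustration of the problem, not the authors' procedure; the function name candidate_readings is hypothetical.

```python
from itertools import product

def candidate_readings(compound, conjunction="y"):
    """List every way of splitting a compound at its conjunctions."""
    tokens = compound.split()
    positions = [i for i, tok in enumerate(tokens) if tok == conjunction]
    readings = []
    # Each conjunction is either kept inside a name (False) or treated as a cut (True).
    for choice in product([False, True], repeat=len(positions)):
        parts, start = [], 0
        for pos, cut in zip(positions, choice):
            if cut:
                parts.append(" ".join(tokens[start:pos]))
                start = pos + 1
        parts.append(" ".join(tokens[start:]))
        readings.append(parts)
    return readings

readings = candidate_readings(
    "Comisión Federal de Electricidad y Luz y Fuerza del Centro")
for parts in readings:
    print(parts)
```

Two conjunctions give four candidate readings; only the one that yields Comisión Federal de Electricidad and Luz y Fuerza del Centro corresponds to the two real organizations, which is why extra knowledge is needed to choose.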

Knowledge Contributions
- External lists
- Linguistic knowledge
- Heuristics
- Statistics

External Lists
- Hand-made list of similes (625 items)
  paz y justicia ‘peace and justice’
  Latinoamérica y el Caribe
- Hand-made list of words
- Lists from the Web
  personal names (697 items)
  main Mexican cities (910 items)

Linguistic Knowledge
Examples of linguistic restrictions
- Lists of groups of capitalized words
  Corea del Sur (1), Taiwan (2), Checoslovaquia (3) y Sudáfrica (4)
- The preposition por followed by an indefinite article cannot be the link within a personal name
  Cuauhtémoc Cárdenas (1) por la Alianza por la Ciudad de México (2)
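Both restrictions can be sketched as small rules. This is a rough illustration under several assumptions: a comma is taken as the signal of an enumeration, FIRST_NAMES stands in for the list of personal names, and ARTICLES includes the definite forms because the slide's own example uses la. None of these names or lists come from the presentation.

```python
import re

# Hypothetical resources (assumptions, not the authors' lists).
ARTICLES = {"el", "la", "los", "las", "un", "una", "unos", "unas"}
FIRST_NAMES = {"Cuauhtémoc"}

def split_enumeration(text):
    """Split a comma-separated list of names; the final 'y' or 'e' separates items."""
    parts = [p.strip() for p in text.split(",")]
    if len(parts) < 2:
        return [text.strip()]  # no comma, so not treated as an enumeration
    *head, last = parts
    return head + re.split(r"\s+[ye]\s+", last, maxsplit=1)

def split_personal_name_on_por(compound):
    """Cut a compound that starts with a first name at 'por' + article."""
    tokens = compound.split()
    if tokens and tokens[0] in FIRST_NAMES:
        for i in range(1, len(tokens) - 1):
            if tokens[i] == "por" and tokens[i + 1].lower() in ARTICLES:
                return [" ".join(tokens[:i]), " ".join(tokens[i + 2:])]
    return [compound]

print(split_enumeration("Corea del Sur, Taiwan, Checoslovaquia y Sudáfrica"))
print(split_personal_name_on_por(
    "Cuauhtémoc Cárdenas por la Alianza por la Ciudad de México"))
```

The first call returns the four country names separately; the second returns the personal name and the organization name as two entities, leaving the second por inside Alianza por la Ciudad de México.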

Heuristics and Statistics
- Heuristic example: a first name can be part of only one name sequence among those coordinated
  Ex.: Margarita Diéguez y Armas y Carlos Virgilio
  Carlos belongs to the list of first names, so there are two name sequences here: Margarita Diéguez y Armas vs. Carlos Virgilio
- Statistics from the training file
  With a high score, Estados Unidos is a 2-word group
  Thus Estados Unidos sobre México could be separated
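The two ideas on this slide can be sketched as follows. The slide does not define the score, so the raw pair count used here is only a stand-in, and FIRST_NAMES, the toy training list, and the min_count threshold are all hypothetical.

```python
from collections import Counter

FIRST_NAMES = {"Margarita", "Carlos"}  # hypothetical excerpt of the name list

def split_coordinated_names(compound):
    """Start a new name sequence when a listed first name follows 'y'."""
    sequences, current = [], []
    for tok in compound.split():
        if tok in FIRST_NAMES and current and current[-1] == "y":
            current.pop()  # the conjunction links two names, it is not part of either
            sequences.append(" ".join(current))
            current = []
        current.append(tok)
    sequences.append(" ".join(current))
    return sequences

def pair_counts(training_compounds):
    """Count adjacent pairs of capitalized words in the training compounds."""
    counts = Counter()
    for compound in training_compounds:
        toks = compound.split()
        for a, b in zip(toks, toks[1:]):
            if a[0].isupper() and b[0].isupper():
                counts[(a, b)] += 1
    return counts

def split_after_frequent_pair(compound, counts, min_count=2):
    """Detach a frequent leading 2-word group from whatever a linking word attaches."""
    toks = compound.split()
    if len(toks) > 3 and counts[(toks[0], toks[1])] >= min_count and toks[2].islower():
        return [" ".join(toks[:2]), " ".join(toks[3:])]
    return [compound]

counts = pair_counts(["Estados Unidos", "Estados Unidos",
                      "Estados Unidos sobre México"])
print(split_coordinated_names("Margarita Diéguez y Armas y Carlos Virgilio"))
print(split_after_frequent_pair("Estados Unidos sobre México", counts))
```

The first call keeps y Armas inside the first person's name because Armas is not a listed first name; the second separates Estados Unidos from México because the pair is frequent in the toy training data.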

Application of the Method
- Obtain compounds with functional words
- Using the previous resources, the program decides whether to split, delimit, or leave each compound as it is
- Extract
  coordinated groups
  prepositional phrases
  the remaining groups of capitalized words
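The final extraction step sorts the resulting groups into three kinds. A minimal sketch of such a sorting, assuming small hypothetical word lists and a purely surface-level test (the presentation does not say how the program makes this distinction):

```python
# Hypothetical word lists (assumptions, not the authors' resources).
CONJUNCTIONS = {"y", "e", "o", "u"}
PREPOSITIONS = {"de", "del", "por", "para", "sobre", "en", "a", "con"}

def classify_group(group):
    """Label a group of capitalized words by the functional words it contains."""
    lowercase_words = {w for w in group.split() if w.islower()}
    if lowercase_words & CONJUNCTIONS:
        return "coordinated group"
    if lowercase_words & PREPOSITIONS:
        return "prepositional phrase group"
    return "other group of capitalized words"

for group in ["Luz y Fuerza del Centro",
              "Ejército Zapatista de Liberación Nacional",
              "Estados Unidos"]:
    print(group, "->", classify_group(group))
```

A group containing both a conjunction and a preposition, such as Luz y Fuerza del Centro, is counted as coordinated here; that precedence is a design choice of this sketch.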

Results - 1
Obtained from 500 sentences of Coll. 2
[Table: Precision and Recall for Coordinated Groups, Prepositional Phrase Groups, and total; values garbled in the transcript: 486787]

Results - 2
- Total: 1496 NEs
  63 names with coordination
  167 prepositional groups
- To compare with: Carreras, X., L. Màrquez and L. Padró. Named Entity Extraction using AdaBoost, CoNLL-2002: […]% for precision and 91% for recall
- However, the test file includes only one coordinated name
- If an NE is embedded in another one, only the top-level entity was marked

Conclusions
- We present a method to identify and disambiguate groups of capitalized words
- Our work is focused on composite named entities
- Our method uses extremely short lists and a small POS-marked dictionary
- The method uses heterogeneous knowledge to decide on splitting or joining groups of capitalized words

Thanks!