A Survey of Approaches on Mining the Structure from Unstructured Data. Dutch-Belgian Database Day 2009 (DBDBD 2009), Nov. 30, 2009. Frederik Hogenboom


A Survey of Approaches on Mining the Structure from Unstructured Data
Dutch-Belgian Database Day 2009 (DBDBD 2009)
Nov. 30, 2009
Frederik Hogenboom, Flavius Frasincar, Uzay Kaymak
Econometric Institute, Erasmus University Rotterdam
PO Box 1738, NL-3000 DR Rotterdam, the Netherlands

Introduction
A lot of data is generated every day
It is difficult to find information that meets one's needs
There is a need to mine the structure of data as a first step towards understanding it
This is part of the effort to make the Web machine-understandable
Solution: employ NLP techniques to extract knowledge from unstructured text written in natural language

Which Technique to Choose?

Statistics-Based NLP (1)
Utilize statistics and mathematical models based on probability theory
Refers to all non-symbolic and non-logical work on NLP, i.e., it encompasses all quantitative approaches to automated language processing, including:
– Probabilistic modeling
– Information theory
– Linear algebra
Phrases extracted from text written in an arbitrary natural language are analyzed in order to find (statistical) relations

Statistics-Based NLP (2)
Word-based:
– Statistics collection on words
– Frequency counting and ranking generation (e.g., TF-IDF)
– Collocations (cliff-hanger, eye candy, take care, profit announcement, etc.)
– Word Sense Disambiguation (WSD)
– Inference models: n-grams
– Clustering
Grammar-based:
– Part-Of-Speech (POS) tagging
– Stochastic Context-Free Grammars (SCFG)
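As an illustration of frequency counting and ranking generation, the following is a minimal TF-IDF sketch. The toy documents, the function name `tf_idf`, and the particular weighting variant (relative term frequency times the natural logarithm of the inverse document frequency) are assumptions made for this example; they are not taken from the slides.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            # Relative term frequency times log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Toy corpus (hypothetical), already split into tokens.
docs = [
    "profit announcement lifts shares".split(),
    "profit warning hits shares".split(),
    "company announces collaboration".split(),
]
scores = tf_idf(docs)

# Rank the terms of the first document, highest weight first.
ranked = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
```

Terms that occur in only one document ("announcement") end up ranked above terms shared across documents ("profit", "shares"), which is the intended effect of the IDF factor.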

Statistics-Based NLP (3)
Advantages:
– Not knowledge-based, so they require neither linguistic resources nor expert knowledge
– Issues regarding leaking grammars, inconsistencies among humans, dialects, etc. are alleviated
Disadvantages:
– Often need a large amount of data
– Approaches do not deal with meaning explicitly, i.e., statistical methods discover relations in corpora without considering semantics

Statistics-Based NLP (4)
Examples:
– (Bannard et al., 2003) discuss several techniques for using statistical models acquired from corpus data to infer the meaning of verb-particle constructions:
  – Collocation-like approach, frequency counting
  – Focus on mining relations between words
– (Taira and Soderland, 1999) implement a statistical natural language processor:
  – Based on resonance probabilities between word pairs
  – Uses word affinity knowledge from training sentences
  – Focus on acquiring knowledge from radiology reports
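To make the idea of a collocation-like, frequency-counting approach concrete, here is a small sketch that scores adjacent word pairs by pointwise mutual information (PMI). This is a generic textbook measure chosen for illustration, not the method of either cited paper; the function name `bigram_pmi`, the `min_count` threshold, and the toy text are assumptions.

```python
import math
from collections import Counter

def bigram_pmi(tokens, min_count=2):
    """Score adjacent word pairs by pointwise mutual information."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable estimates
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        # PMI > 0 means the pair co-occurs more often than chance predicts.
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Toy text (hypothetical); a real study would use a large corpus.
text = ("take care of the report and take care of the data "
        "then take the report and care for the data")
pmi = bigram_pmi(text.split())
```

Pairs that occur only once, such as ("and", "take"), are filtered out, while recurring pairs such as ("take", "care") receive a positive score. On a corpus this small the scores are only indicative, which reflects the data-hungry nature of statistical approaches.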

Pattern-Based NLP (1)
Use linguistic patterns to extract data from texts
Patterns can be:
– Predefined
– Discovered (learned)
Knowledge used:
– Lexical knowledge
– Syntactic knowledge
– Semantic knowledge

Pattern-Based NLP (2)
Lexico-syntactic patterns:
– Combine lexical and syntactic elements with regular expressions
– E.g., “{NNP, }* NNP{,}? and NNP {(announce | discuss)} collaboration {with NNP}?” mines a corpus for information on mergers and collaborations of companies and/or persons
Lexico-semantic patterns:
– Enrich lexico-syntactic patterns through the addition of semantics
– Gazetteers (simple typing):
  – Use linguistic meaning of text
  – E.g., “[sub:company] announces collaboration with [obj:company]”
– Ontologies (complex typing):
  – Include also relationships
  – E.g., “[kb:Company] kb:collaborates [kb:Company]”
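The lexico-syntactic pattern above can be approximated with an ordinary regular expression over POS-tagged text. The sketch below assumes a `word/TAG` token encoding and a hand-tagged example sentence; both are illustrative choices, and a real system would obtain the tags from a POS tagger.

```python
import re

# Hand-tagged toy sentence (hypothetical) in word/TAG form.
tagged = ("Philips/NNP ,/, Shell/NNP and/CC Unilever/NNP "
          "announce/VBP collaboration/NN with/IN Google/NNP")

PATTERN = re.compile(
    # {NNP, }* NNP{,}? and NNP
    r"((?:\S+/NNP ,/, )*\S+/NNP(?: ,/,)? and/CC \S+/NNP) "
    # (announce | discuss), with any verb tag
    r"(announce|discuss)/VB\w* "
    r"collaboration/NN"
    # optional: with NNP
    r"(?: with/IN (\S+)/NNP)?"
)

match = PATTERN.search(tagged)
parties = re.findall(r"(\S+)/NNP", match.group(1))  # the collaborating parties
verb = match.group(2)                               # "announce" or "discuss"
partner = match.group(3)                            # None if "with NNP" is absent
```

The pattern is purely syntactic: it accepts any proper nouns, whether or not they denote companies. Lexico-semantic patterns add exactly that kind of typing.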

Pattern-Based NLP (3)
Advantages:
– Need less training data
– Complex expressions can be defined
– Results are easily interpretable
Disadvantages:
– Lexical knowledge is required
– Prior expert/domain knowledge might be required (for lexico-semantic patterns)
– Defining and maintaining patterns is a cumbersome and non-trivial task

Pattern-Based NLP (4)
Examples:
– CAFETIERE (Black et al., 2005):
  – Employs extraction rules defined at lexico-semantic level
  – Makes use of gazetteering
  – Knowledge is stored using Narrative Knowledge Representation Language (NKRL)
  – Knowledge base lacks reasoning support
  – Focus on extracting relations from corpora
– Hermes (Frasincar et al., 2009):
  – Patterns defined at lexico-semantic level
  – Makes use of ontologies and reasoning engines
  – Knowledge is based on an OWL domain ontology
  – Focus on the use of pattern-based NLP in building personalized news services
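A gazetteer-typed pattern such as “[sub:company] announces collaboration with [obj:company]” can be sketched as follows. This is a toy illustration of simple typing only; it does not reproduce the rule language of CAFETIERE or Hermes. The gazetteer contents and the helper names `typed` and `extract` are hypothetical.

```python
import re

# Hypothetical gazetteer: surface forms mapped to a simple semantic type.
GAZETTEER = {
    "Philips": "company",
    "Shell": "company",
    "Rotterdam": "city",
}

def typed(name, sem_type):
    """Build a named group that matches any gazetteer entry of one type."""
    entries = "|".join(
        re.escape(entry) for entry, t in GAZETTEER.items() if t == sem_type
    )
    return rf"(?P<{name}>\b(?:{entries})\b)"

# [sub:company] announces collaboration with [obj:company]
PATTERN = re.compile(
    typed("sub", "company")
    + r" announces collaboration with "
    + typed("obj", "company")
)

def extract(sentence):
    """Return a (subject, relation, object) triple, or None if no match."""
    m = PATTERN.search(sentence)
    if m is None:
        return None
    return (m.group("sub"), "collaborates", m.group("obj"))

hit = extract("Philips announces collaboration with Shell in Rotterdam")
miss = extract("Rotterdam announces collaboration with Shell")
```

The second sentence is rejected because "Rotterdam" is typed as a city rather than a company. An ontology-based pattern would go one step further and also type the relation itself.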

Hybrid NLP (1)
Combine linguistic knowledge with statistical methods
In practice, it is often difficult to stay within the boundaries of a single approach
Thus, it is convenient to combine the best of both worlds:
– Bootstrapping lexical methods
– Solving lack of expert knowledge by applying statistical methods
– Statistical methods that use some present (lexical) knowledge

Hybrid NLP (2)
Advantages:
– Solve problems related to scaling and required expert knowledge of pattern-based approaches
– Do not require as much data as statistical approaches
– Inherit some of the advantages of both statistical and pattern-based approaches
Disadvantages:
– By combining different techniques, maintaining completeness and accuracy of the systems becomes more difficult
– Multidisciplinary aspects
– Inherit some of the disadvantages of both statistical and pattern-based approaches

Hybrid NLP (3)
Examples:
– Corpus-Based Statistics-Oriented techniques (Su et al., 1996):
  – Mainly statistical learning techniques, guided by high-level linguistic constructs
  – Applications in POS tagging, semantic analysis of corpora, machine translation, annotation, etc.
  – Focus is on extracting inductive knowledge from corpora to support building large scale NLP systems
– PANKOW (Cimiano et al., 2004):
  – Generates instances of lexico-syntactic patterns indicating a certain semantic or ontological relation
  – Counts number of occurrences of patterns
  – Statistical distribution of instances of these patterns constitutes the collective knowledge
  – Focus is on supporting annotation
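The combination of patterns and counting described for PANKOW can be sketched in a few lines. This is a simplified illustration in the spirit of that approach, not the actual system: it counts over a small in-memory list of sentences, and the templates, the naive "-s" pluralization, and the function name `count_evidence` are all assumptions.

```python
from collections import Counter

# Hypothetical pattern templates; plurals are formed naively by adding "s".
TEMPLATES = [
    "{instance} is a {concept}",
    "{concept}s such as {instance}",
    "{instance} and other {concept}s",
]

def count_evidence(instance, concepts, corpus):
    """Count how often each instantiated pattern occurs in the corpus."""
    counts = Counter({concept: 0 for concept in concepts})
    for concept in concepts:
        for template in TEMPLATES:
            # Step 1 (pattern-based): generate a concrete pattern instance.
            phrase = template.format(instance=instance, concept=concept).lower()
            # Step 2 (statistical): count its occurrences.
            counts[concept] += sum(s.lower().count(phrase) for s in corpus)
    return counts

# Toy corpus (hypothetical).
corpus = [
    "Rotterdam is a port on the North Sea.",
    "Large ports such as Rotterdam handle containers.",
    "Rotterdam and other ports expanded.",
    "Rotterdam is a hotel name in some towns.",
]
evidence = count_evidence("Rotterdam", ["port", "hotel", "river"], corpus)
best_concept = evidence.most_common(1)[0][0]
```

The concept with the most supporting pattern instances is proposed as the annotation for the instance. The distribution of counts, rather than any single match, carries the decision.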

Conclusions
Three main approaches to NLP:
– Statistics-based
– Pattern-based
– Hybrid
Which techniques to use for your NLP tasks?
There is no single best approach, but consider these rough guidelines:
– Evaluate your problem, preferences, and available resources
– If you are less concerned with semantics and you assume that knowledge lies within statistical facts on a specific corpus, use a statistics-based approach
– If you are concerned with the semantics of discovered information, or you want to be able to easily explain and control the results, use a pattern-based approach
– If you need to bootstrap a pattern-based approach using statistics (e.g., because insufficient knowledge is available) or the other way around (e.g., because a priori knowledge is needed), use a hybrid approach

References
C. Bannard, T. Baldwin, and A. Lascarides. A statistical approach to the semantics of verb-particles. In ACL 2003 Workshop on Multiword Expressions: Analysis, Acquisition and Treatment. Association for Computational Linguistics, 2003.
W. J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B. Theodoulidis, and F. Rinaldi. CAFETIERE: Conceptual Annotations for Facts, Events, Terms, Individual Entities, and Relations. Technical Report TR-U4.3.1, Department of Computation, UMIST, Manchester, 2005.
P. Cimiano, S. Handschuh, and S. Staab. Towards the Self-Annotating Web. In 13th International Conference on World Wide Web (WWW 2004). ACM, 2004.
F. Frasincar, J. Borsje, and L. Levering. A Semantic Web-Based Approach for Building Personalized News Services. International Journal of E-Business Research, 5(3):35-53, 2009.
K.-Y. Su, T.-H. Chiang, and J.-S. Chang. An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing. Computational Linguistics and Chinese Language Processing, 1(1), 1996.
R. K. Taira and S. G. Soderland. A statistical natural language processor for medical reports. In AMIA Symposium 1999. American Medical Informatics Association, 1999.