Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Knowledge from Text Using Information Extraction.

Presentation transcript:

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology 1 Mining Knowledge from Text Using Information Extraction Advisor : Dr. Hsu Presenter : Chih-Ling Wang

Outline
- Motivation
- Objective
- Introduction
- Information Extraction
- Extracting Knowledge
- Mining Extracted Data
- Future Research
- Conclusions
- My opinion

Motivation
Most data-mining research assumes that the information to be “mined” is already in the form of a relational database. Unfortunately, for many applications, available electronic information is in the form of unstructured natural-language documents rather than structured databases.

Objective
We discuss two approaches to using natural-language information extraction for text mining.

Introduction
Information extraction (IE) distills structured data or knowledge from unstructured text by identifying references to named entities as well as stated relationships between such entities. IE systems can be used to directly extract abstract knowledge from a text corpus, or to extract concrete data from a set of documents, which can then be further analyzed with traditional data-mining techniques to discover more general patterns.

Introduction (cont.)
Most of the work in text mining does not exploit any form of natural-language processing (NLP), treating documents as an unordered “bag of words,” as is typical in information retrieval. Although full natural-language understanding is still far from the capabilities of current technology, existing methods in information extraction are able to recognize several types of entities in text and identify some relationships that are asserted between them.

Introduction (cont.)
We review two approaches to text mining with information extraction, using one of our own research projects to illustrate each approach. First, we introduce the basics of information extraction. Next, we discuss using IE to directly extract knowledge from text. Finally, we discuss discovering knowledge by mining data that is first extracted from unstructured or semi-structured text.

Information Extraction: IE Problem
Information extraction concerns locating specific pieces of data in natural-language documents, thereby extracting structured information from unstructured text. Entity recognition involves identifying references to particular kinds of objects, such as names of people, companies, and locations.

Information Extraction: IE Problem (cont.)
In addition to recognizing entities, an important problem is extracting specific types of relations between entities. For example, in newspaper text, one can identify that an organization is located in a particular city or that a person is affiliated with a specific organization. The text in Figure 1 would require extracting the relation: interacts(NOSIP, eNOS).

Information Extraction: IE Problem (cont.)
IE can also be used to extract fillers for a predetermined set of slots (roles) in a particular template (frame) relevant to the domain. In this paper, we consider the task of extracting a database from postings to the USENET newsgroup austin.jobs.

Information Extraction: IE Problem (cont.)
Another application of IE is extracting structured data from unstructured or semi-structured web pages. When applied to semi-structured HTML, typically generated from an underlying database by a program on a web server, an IE system is typically called a wrapper, and the process is sometimes referred to as screen scraping. A typical application is extracting data on commercial items from web stores for a comparison shopping agent. For example, a wrapper may extract the title, author, ISBN, publisher, and price of a book from an Amazon web page.
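
The wrapper idea can be illustrated with a minimal hand-written sketch. It assumes a made-up page whose fields are marked with `class` attributes named `title`, `author`, and `price`; these names and the page itself are hypothetical, and a learned wrapper would induce such landmarks from examples rather than have them hard-coded.

```python
from html.parser import HTMLParser


class BookWrapper(HTMLParser):
    """Collects the text of tags whose class attribute names a target field."""

    FIELDS = {"title", "author", "price"}  # hypothetical class names

    def __init__(self):
        super().__init__()
        self.record = {}
        self._current = None  # field currently being read, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if cls in self.FIELDS:
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.record[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


def extract_book(html):
    parser = BookWrapper()
    parser.feed(html)
    return parser.record


# Made-up page fragment, for illustration only.
PAGE = """
<div class="item">
  <span class="title">An Example Book</span>
  <span class="author">A. Author</span>
  <span class="price">$10.00</span>
</div>
"""
book = extract_book(PAGE)
```

Here `book` is a flat record with one entry per field, which is the kind of row a comparison shopping agent could store in its database.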

Information Extraction: IE Problem (cont.)
IE systems can also be used to extract data or knowledge from less-structured web sites by using both the HTML text in their pages and the structure of the hyperlinks between their pages.

Information Extraction: IE Methods
One approach is to manually develop information-extraction rules by encoding patterns that reliably identify the desired entities or relations. Due to the variety of forms and contexts in which the desired information can appear, manually developing patterns is very difficult and tedious and rarely results in robust systems. Consequently, supervised machine-learning methods trained on human-annotated corpora have become the most successful approach to developing robust IE systems.

Information Extraction: IE Methods (cont.)
One approach is to automatically learn pattern-based extraction rules for identifying each type of entity or relation. For example, RAPIER learns extraction rules consisting of three parts:
1. a pre-filler pattern that matches the text immediately preceding the phrase to be extracted,
2. a filler pattern that matches the phrase to be extracted, and
3. a post-filler pattern that matches the text immediately following the filler.
Pre-filler Pattern | Filler Pattern | Post-filler Pattern
$ or NT | | 元 or dollars
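
A three-part rule of this shape can be sketched as three lists of token tests applied to a sliding window. This is a simplified illustration, not RAPIER itself: real RAPIER patterns also allow length bounds and POS or semantic constraints, and the numeric filler test below is an assumption, since the transcript does not show the filler pattern.

```python
def one_of(*words):
    """Token test: the token is one of the given words."""
    return lambda tok: tok in words


def is_number(tok):
    """Token test: the token is a number, optionally with thousands separators."""
    return tok.replace(",", "").isdigit()


def match_rule(tokens, pre, filler, post):
    """Return every filler whose pre-filler, filler, and post-filler tests all match.

    Each pattern is a list of token tests, one test per consecutive token.
    """
    tests = pre + filler + post
    found = []
    for start in range(len(tokens) - len(tests) + 1):
        window = tokens[start:start + len(tests)]
        if all(test(tok) for test, tok in zip(tests, window)):
            found.append(" ".join(window[len(pre):len(pre) + len(filler)]))
    return found


tokens = "the laptop costs NT 30,000 元 today".split()
prices = match_rule(
    tokens,
    pre=[one_of("$", "NT")],
    filler=[is_number],          # assumed filler pattern
    post=[one_of("元", "dollars")],
)
```

Only the middle part of a matching window is returned, so the surrounding context constrains the match without becoming part of the extracted value.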

Information Extraction: IE Methods (cont.)
An alternative general approach to IE is to treat it as a sequence labeling task in which each word (token) in the document is assigned a label from a fixed set of alternatives. For example, for each slot, X, to be extracted, we include:
- a token label BeginX to mark the beginning of a filler for X,
- a token label InsideX to mark other tokens in a filler for X, and
- the label Other for tokens that are not included in the filler of any slot.
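
As a small illustration of this encoding, the sketch below turns annotated filler spans into one label per token. The span format `(slot, start, end)` and the example sentence are assumptions made for this sketch, and non-initial filler tokens are labeled with `Inside` plus the slot name.

```python
def label_tokens(tokens, fillers):
    """Assign one label per token.

    fillers is a list of (slot, start, end) token spans, with end exclusive.
    """
    labels = ["Other"] * len(tokens)
    for slot, start, end in fillers:
        labels[start] = "Begin" + slot
        for i in range(start + 1, end):
            labels[i] = "Inside" + slot
    return labels


tokens = "Seeking a senior Java developer in Austin".split()
labels = label_tokens(tokens, [("Title", 2, 5), ("City", 6, 7)])
# labels == ["Other", "Other", "BeginTitle", "InsideTitle", "InsideTitle",
#            "Other", "BeginCity"]
```

Separating the first token of a filler from the rest lets a labeler distinguish two adjacent fillers of the same slot from one long filler.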

Information Extraction: IE Methods (cont.)
One approach to the resulting sequence labeling problem is to use a statistical sequence model such as a Hidden Markov Model (HMM) or a Conditional Random Field (CRF). Several earlier IE systems used generative HMMs; however, discriminatively trained CRFs have recently been shown to have an advantage over HMMs. In both cases, the model parameters are learned from a supervised training corpus, and then an efficient dynamic programming method based on the Viterbi algorithm is used to determine the most probable tagging of a complete test document.
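
The decoding step can be sketched for the HMM case as follows. This is a generic textbook Viterbi decoder, not code from any of the systems mentioned; the two-label model and all of its probabilities are invented for illustration, and the observation sequence is assumed to be non-empty.

```python
import math


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for obs under an HMM (log space)."""

    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: best log-probability of any path that ends in state s at time t
    best = [{s: lg(start_p[s]) + lg(emit_p[s].get(obs[0], 0.0)) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + lg(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + lg(emit_p[s].get(obs[t], 0.0))
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return path[::-1]


# Invented toy model with two labels.
states = ["Other", "City"]
start_p = {"Other": 0.9, "City": 0.1}
trans_p = {
    "Other": {"Other": 0.8, "City": 0.2},
    "City": {"Other": 0.9, "City": 0.1},
}
emit_p = {
    "Other": {"jobs": 0.5, "in": 0.4, "Austin": 0.1},
    "City": {"jobs": 0.05, "in": 0.05, "Austin": 0.9},
}
path = viterbi(["jobs", "in", "Austin"], states, start_p, trans_p, emit_p)
# path == ["Other", "Other", "City"]
```

The table has one cell per (position, label) pair and each cell looks at every previous label, so decoding costs time proportional to the document length times the square of the number of labels.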

Information Extraction: IE Methods (cont.)
Another approach to the sequence labeling problem for IE is to use a standard feature-based inductive classifier to predict the label of each token based on both the token itself and its surrounding context. The context is represented by a set of features that include the one or two tokens on either side of the target token.
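
A minimal sketch of such a context representation is shown below. The feature names, the `<PAD>` marker for positions outside the document, and the capitalization feature are assumptions of this sketch; the resulting dictionary could be fed to any standard classifier.

```python
def token_features(tokens, i, window=2):
    """Describe token i by itself and by up to `window` tokens on either side."""
    feats = {
        "word": tokens[i].lower(),
        "is_capitalized": tokens[i][:1].isupper(),
    }
    for offset in range(-window, window + 1):
        if offset == 0:
            continue
        j = i + offset
        inside = 0 <= j < len(tokens)
        feats[f"word[{offset:+d}]"] = tokens[j].lower() if inside else "<PAD>"
    return feats


tokens = "Seeking a developer in Austin".split()
feats = token_features(tokens, 4)
# feats == {"word": "austin", "is_capitalized": True,
#           "word[-2]": "developer", "word[-1]": "in",
#           "word[+1]": "<PAD>", "word[+2]": "<PAD>"}
```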

Information Extraction: IE Methods (cont.)
Many IE systems simply treat text as a sequence of uninterpreted tokens; however, many others use a variety of other NLP tools or knowledge bases. For example:
- A number of systems preprocess the text with a part-of-speech (POS) tagger and use words’ POS as an extra feature.
- Several IE systems use phrase chunkers to identify potential phrases to extract.
- Others use complete syntactic parsers, particularly those which try to extract relations between entities by examining the syntactic relationship between the phrases describing the relevant entities.
- Some use lexical semantic databases, such as WordNet, which provide word classes that can be used to define more general extraction patterns.

Information Extraction: IE Methods (cont.)
This rule extracts the value “undisclosed” from phrases such as “sold to the bank for an undisclosed amount” or “paid Honeywell an undisclosed price.”
- The pre-filler pattern matches a noun or proper noun followed by at most two other unconstrained words.
- The filler pattern matches the word “undisclosed” only when its POS tag is “adjective.”
- The post-filler pattern matches any word in WordNet’s semantic class named “price”.
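
The three constraints described on this slide can be hand-coded as a rough sketch. The input is assumed to be already POS-tagged with Penn Treebank style tags (`NN`, `NNP`, `JJ`), and `PRICE_CLASS` is a small hypothetical word set standing in for a lookup in WordNet's "price" class; neither detail comes from the original system.

```python
PRICE_CLASS = {"price", "amount", "sum", "fee"}  # hypothetical stand-in for WordNet
NOUN_TAGS = {"NN", "NNP"}


def extract_undisclosed(tagged):
    """tagged is a list of (word, pos) pairs; returns the matched filler words."""
    hits = []
    for i, (word, pos) in enumerate(tagged):
        # Filler: the word "undisclosed" tagged as an adjective.
        if word.lower() != "undisclosed" or pos != "JJ":
            continue
        # Post-filler: the next word belongs to the "price" class.
        if i + 1 >= len(tagged) or tagged[i + 1][0].lower() not in PRICE_CLASS:
            continue
        # Pre-filler: a noun followed by at most two unconstrained words.
        for gap in range(3):
            j = i - 1 - gap
            if j >= 0 and tagged[j][1] in NOUN_TAGS:
                hits.append(word)
                break
    return hits


sold = [("sold", "VBD"), ("to", "TO"), ("the", "DT"), ("bank", "NN"),
        ("for", "IN"), ("an", "DT"), ("undisclosed", "JJ"), ("amount", "NN")]
paid = [("paid", "VBD"), ("Honeywell", "NNP"), ("an", "DT"),
        ("undisclosed", "JJ"), ("price", "NN")]
moved = [("moved", "VBD"), ("to", "TO"), ("an", "DT"),
         ("undisclosed", "JJ"), ("location", "NN")]
results = [extract_undisclosed(s) for s in (sold, paid, moved)]
```

The first two sentences yield a match and the third does not, because "location" is outside the assumed price class.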

Extracting Knowledge
If the information extracted from a corpus of documents represents abstract knowledge rather than concrete data, IE itself can be considered a form of “discovering” knowledge from text. We found that CRFs gave us the best results on extracting human protein names. However, although CRFs capture the dependence between the labels of adjacent words, they do not adequately capture long-distance dependencies between potential extractions in different parts of a document.

Extracting Knowledge (cont.)
We recently developed a new IE method based on Relational Markov Networks (RMNs) that captures dependencies between distinct candidate extractions in a document. For identifying protein interactions, we developed a new IE learning method, ELCS, that automatically induces extraction patterns using a bottom-up rule learning algorithm that computes generalizations based on longest common subsequences. Another approach we have taken to identifying protein interactions is based on co-citation. It exploits the idea that if many different abstracts reference both protein A and protein B, then A and B are likely to interact.
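
The core operation behind this kind of generalization, the longest common subsequence of two token sequences, can be sketched with standard dynamic programming. This shows only that building block, not the ELCS learner; the two sentences and the placeholders `PROT_A` and `PROT_B` are invented for illustration.

```python
def longest_common_subsequence(a, b):
    """Return one longest common subsequence of the token lists a and b."""
    m, n = len(a), len(b)
    # table[i][j]: LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the subsequence itself.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


s1 = "PROT_A interacts directly with PROT_B".split()
s2 = "PROT_A strongly interacts with the PROT_B".split()
pattern = longest_common_subsequence(s1, s2)
# pattern == ["PROT_A", "interacts", "with", "PROT_B"]
```

The shared words form a candidate pattern that covers both training sentences, with gaps wherever the sentences differ.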

Extracting Knowledge (cont.)
Based on comparisons to existing protein databases, the co-citation plus text-classification approach was found to be more effective at identifying interactions than our IE approach based on ELCS.
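
The co-citation signal itself, the number of abstracts that mention both members of a protein pair, can be sketched as a simple count. The input format (one list of recognized protein names per abstract) and the names `A`, `B`, `C` are assumptions; the text-classification step that the approach combines with this count is not shown.

```python
from collections import Counter
from itertools import combinations


def cocitation_counts(abstract_mentions):
    """Count, for every protein pair, the abstracts that mention both proteins."""
    counts = Counter()
    for proteins in abstract_mentions:
        # Each abstract contributes at most once to each unordered pair.
        for pair in combinations(sorted(set(proteins)), 2):
            counts[pair] += 1
    return counts


abstracts = [["A", "B"], ["B", "A", "C"], ["C"]]
counts = cocitation_counts(abstracts)
# counts[("A", "B")] == 2
```

Pairs with high counts are the candidates for interaction.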

Mining Extracted Data
If extracted information is specific data rather than abstract knowledge, an alternative approach to text mining is to first use IE to obtain structured data from unstructured text and then use traditional KDD tools to discover knowledge from this extracted data. Using this approach, we developed a text-mining system called DISCOTEX (Discovery from Text Extraction).

Mining Extracted Data (cont.)
In DISCOTEX, IE plays the important role of preprocessing a corpus of text documents into a structured database suitable for mining. DISCOTEX uses two learning systems to build extractors, RAPIER and BWI. By training on a corpus of documents annotated with their filled templates, these systems acquire pattern-matching rules that can be used to extract data from novel documents.

Mining Extracted Data (cont.)
DISCOTEX induces rules for predicting each piece of information in each database field given all other information in a record. In order to discover prediction rules, we treat each slot-value pair in the extracted database as a distinct binary feature and learn rules for predicting each feature from all other features. We have applied C4.5RULES to discover interesting rules from the resulting binary data.
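
The conversion from extracted records to binary features can be sketched as follows. The record format (a dictionary mapping each slot to a list of extracted values) and the two example job records are assumptions made for this sketch; the rule learner that would consume the rows is not shown.

```python
def binarize(records):
    """Turn slot-value records into rows of 0/1 features.

    Each distinct (slot, value) pair becomes one binary column.
    """
    features = sorted(
        {(slot, value) for rec in records for slot, values in rec.items() for value in values}
    )
    rows = [
        [int(value in rec.get(slot, ())) for slot, value in features]
        for rec in records
    ]
    return features, rows


records = [
    {"language": ["Java"], "platform": ["Windows"]},
    {"language": ["Java", "C++"], "area": ["Database"]},
]
features, rows = binarize(records)
# features == [("area", "Database"), ("language", "C++"),
#              ("language", "Java"), ("platform", "Windows")]
# rows == [[0, 0, 1, 1], [1, 1, 1, 0]]
```

Any column can then serve as the target of a prediction rule, with the remaining columns as its inputs.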

Mining Extracted Data (cont.)
The last rule illustrates the discovery of an interesting concept which could be called “the IBM shop”; i.e., companies that require knowledge of an IBM operating system and DBMS also require knowledge of Lotus Notes, another IBM product.

Mining Extracted Data (cont.)
Unfortunately, the accuracy of current IE systems is limited, and therefore an automatically extracted database will inevitably contain a fair number of errors. We have conducted experiments on job postings showing that rules discovered from an automatically extracted database are very close in accuracy to those discovered from a corresponding manually constructed database. These results demonstrate that mining extracted data is a reliable approach to discovering accurate knowledge from unstructured text.

Mining Extracted Data (cont.)
Another potential problem with mining extracted data is that the heterogeneity of extracted text frequently prevents traditional data-mining algorithms from discovering useful knowledge. We developed two approaches to addressing this problem. One approach is to first “clean” the data by identifying all of the extracted strings that refer to the same entity and then replacing sets of equivalent strings with canonical entity names. Another approach to handling heterogeneity is to mine “soft-matching” rules directly from the “dirty” data extracted from text.

Mining Extracted Data (cont.)
We developed two novel methods for mining soft-matching rules. The first is an algorithm called TEXTRISE that learns rules whose conditions are partially matched to data using a similarity metric. We also developed SOFTAPRIORI, a generalization of the standard APRIORI algorithm for discovering association rules that allows soft matching using a specified similarity metric for each field. Experimental results in several domains have demonstrated that both TEXTRISE and SOFTAPRIORI allow the discovery of interesting “soft-matching” rules from automatically extracted data.
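
The notion of soft matching can be illustrated by computing the support of an itemset when values only need to be similar rather than identical. This is a sketch of the idea, not the SOFTAPRIORI algorithm: the similarity metric (the ratio from Python's `difflib.SequenceMatcher`), the 0.8 threshold, and the example records are all assumptions.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold=0.8):
    """True when two strings are at least `threshold` similar, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def soft_support(records, itemset, threshold=0.8):
    """Fraction of records in which every (field, value) item is softly matched."""

    def covers(record):
        return all(
            similar(record.get(field, ""), value, threshold)
            for field, value in itemset
        )

    return sum(covers(rec) for rec in records) / len(records)


records = [
    {"platform": "Windows NT", "language": "Java"},
    {"platform": "windows-nt", "language": "java"},
    {"platform": "Linux", "language": "Java"},
]
itemset = [("platform", "Windows NT"), ("language", "Java")]
support = soft_support(records, itemset)
```

With exact string matching only the first record would support the itemset; with soft matching the second record counts as well, so the support rises from one third to two thirds.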

Future Research
Information extraction remains a challenging problem with many potential avenues for progress. Most IE systems are developed by training on human-annotated corpora; however, constructing corpora sufficient for training accurate IE systems is a burdensome chore. One remedy is to use active learning methods to decrease the amount of training data that must be annotated.
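
One common form of active learning, uncertainty sampling, can be sketched in a few lines. It is offered only as an example of the general idea and is not attributed to any system discussed here; `predict_proba` is a hypothetical callable that returns the current model's label probabilities for a document.

```python
def select_for_annotation(unlabeled, predict_proba, k=2):
    """Pick the k documents whose most likely label has the lowest probability."""
    return sorted(unlabeled, key=lambda doc: max(predict_proba(doc)))[:k]


# Invented model outputs for three unlabeled documents.
scores = {
    "doc1": [0.95, 0.05],
    "doc2": [0.55, 0.45],
    "doc3": [0.60, 0.40],
}
chosen = select_for_annotation(list(scores), scores.get, k=2)
# chosen == ["doc2", "doc3"]
```

Only the selected documents are sent to a human annotator, after which the model is retrained and the selection is repeated.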

Future Research (cont.)
Another approach to reducing demanding corpus-building requirements is to develop unsupervised learning methods for building IE systems. Developing semi-supervised learning methods for IE is a related research direction.

Conclusions
In this paper we have discussed two approaches to using natural-language information extraction for text mining. First, one can extract general knowledge directly from text. Second, one can first extract structured data from text documents or web pages and then apply traditional KDD methods to discover patterns in the extracted data.