Presentation transcript:

Intelligent Database Systems Lab 國立雲林科技大學 National Yunlin University of Science and Technology
Instance Filtering for Entity Recognition
Advisor: Dr. Hsu
Reporter: Chun Kai Chen
Authors: Alfio Massimiliano Gliozzo, Claudio Giuliano and Raffaella Rinaldi
SIGKDD Explorations, Volume 7, Issue 1

Outline
• Motivation
• Objective
• Background and Related Work
• Instance Filtering
• Experimental Results
• Conclusions
• Personal Opinion

Motivation / Introduction (1/3)
• The objective of Information Extraction (IE)
─ to identify a set of relevant domain-specific classes of entities and their relations in textual documents
─ this paper focuses on the problem of Entity Recognition (ER)
• Recent evaluation campaigns on ER
─ most participating systems approach the task as a supervised classification problem, assigning an appropriate classification label to each token in the input documents
─ two problems are usually associated with this approach: the skewed class distribution and the data set size

Objective / Introduction (2/3)
• To address these problems, we propose a technique called Instance Filtering (IF)
• The goal of IF
─ reduce both the skewness and the data set size
─ the main peculiarity of this technique: it is performed on both the training and test sets, reduces the computation time and memory requirements for learning and classification, and improves the classification performance

Introduction (3/3)
• Present a comparative study on Stop Word Filters
─ e.g. "He got a job from this company." (considering a, from and this to be stop words)
• To evaluate the filtering techniques, the SIE system is used
─ a supervised system for ER developed at ITC-irst
─ designed to be easily and quickly portable across tasks and languages
─ based on Support Vector Machines, using a standard general-purpose feature set
• Experiments were performed
─ on three different ER tasks (Named Entity, Bio-Entity and Temporal Expression Recognition)
─ in two languages (English and Dutch)

Background and Related Work
• Learning with skewed class distributions is a well-known problem in machine learning
─ the most common technique for dealing with skewed data sets is sampling
• An additional problem is the huge size of the data sets
─ Instance Pruning techniques have mainly been applied to instance-based learning algorithms (e.g. kNN) to speed up classification while minimizing memory requirements; the main drawback of many Instance Pruning techniques is their time complexity

Instance Filtering
• IF is a preprocessing step performed to reduce the number of instances given as input to a supervised classifier for ER
• This section
─ describes a formal framework for IF and introduces two metrics to evaluate an Instance Filter
─ defines the class of Stop Word Filters and proposes an algorithm for their optimization

A General Framework (1/2)
• An Instance Filter is a function Δ(t_i, T)
─ it returns 0 if the token t_i is not expected to be part of a relevant entity, 1 otherwise
• An Instance Filter can be evaluated using the two following functions:
─ ψ(Δ,T), the Filtering Rate: the total percentage of filtered tokens in the data set T
─ ψ+(Δ,T), the Positive Filtering Rate: the percentage of positive tokens (wrongly) removed

A General Framework (2/2)
• A good filter
─ minimizes ψ+(Δ,T) while maximizing ψ(Δ,T)
─ i.e. it reduces the data set size as much as possible while preserving most of the positive instances
• To avoid over-fitting
─ the Filtering Rates on the training and test sets (T_L and T_T, respectively) have to be preserved
• Skewness ratio
─ used to evaluate the ability of an Instance Filter to reduce the data skewness
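The two evaluation functions above, plus the skewness ratio, can be sketched as follows. This is a minimal illustration of the definitions on this slide, not code from the paper; the 0/1 label and keep arrays are hypothetical.

```python
def filtering_rates(labels, keep):
    """labels[i] = 1 if token i is part of an entity (positive), 0 otherwise.
    keep[i] = Delta(t_i, T): 1 if the filter keeps token i, 0 if it removes it."""
    n = len(labels)
    removed = [i for i in range(n) if keep[i] == 0]
    positives = [i for i in range(n) if labels[i] == 1]
    psi = len(removed) / n                                      # Filtering Rate
    psi_pos = sum(labels[i] for i in removed) / len(positives)  # Positive Filtering Rate
    return psi, psi_pos

def skewness_ratio(labels):
    """Negative-to-positive instance ratio; a good filter lowers it."""
    pos = sum(labels)
    return (len(labels) - pos) / pos
```

For example, filtering three of six tokens without touching the two positives gives ψ = 0.5 and ψ+ = 0.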

Stop Word Filters (1/2)
• They are implemented in two steps:
─ first, Stop Words are identified from the training corpus T and collected in a set of types U ⊆ V
─ then all their tokens are removed from both the training and the test set

Stop Word Filters (2/2)
• Information Content (IC)
─ removes tokens whose type has a very low information content
• Correlation Coefficient (CC)
─ the χ² statistic is used to measure the lack of independence, to find types less likely to express relevant information
• Odds Ratio (OR)
─ measures the ratio between the probability of a type occurring in the positive class and in the negative class
─ the distribution of features on relevant documents differs from the distribution on non-relevant documents

Information Content (IC)
• The most commonly used feature selection metric in text classification is based on document frequency
• Our approach consists in removing all tokens whose type has a very low information content
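One plausible reading of this criterion is self-information, -log p(w): very frequent types carry little information and become stop words. The exact estimator and threshold used in the paper may differ; this sketch only illustrates the idea.

```python
import math
from collections import Counter

def ic_stop_words(corpus_tokens, threshold):
    """Collect as stop words every type whose self-information -log p(w)
    falls below the threshold, i.e. the very frequent, low-content types."""
    counts = Counter(corpus_tokens)
    total = sum(counts.values())
    return {w for w, c in counts.items() if -math.log(c / total) < threshold}
```

On a toy corpus where "the" makes up 80% of the tokens, only "the" scores below a threshold of 1.0 and is filtered.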

Correlation Coefficient (CC)
• In text classification, the χ² statistic is used to measure the lack of independence between a type w and a category [20]
• In our approach
─ we use the correlation coefficient CC, where CC² = χ², of a term w with the negative class
─ to find those types that are less likely to express relevant information in texts
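The χ² statistic behind CC can be computed from a 2x2 contingency table of a type against the negative class. This is the standard textbook formula, shown here as an assumption about what the slide refers to, not the paper's exact implementation.

```python
def chi_square(a, b, c, d):
    """2x2 contingency chi-square for a type w:
    a = tokens of w in the negative class, b = tokens of w in the positive class,
    c = other negative tokens,             d = other positive tokens."""
    n = a + b + c + d
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / den if den else 0.0
```

A type distributed identically across classes scores 0 (independent), while a type seen only in the negative class scores high and is a stop-word candidate.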

Odds Ratio (OR)
• Odds Ratio
─ measures the ratio between the probability of a type occurring in the positive class and its probability of occurring in the negative class
─ the idea is that the distribution of features on relevant documents differs from the distribution on non-relevant documents [21]
• Following this assumption, in our approach
─ a type is non-informative when its probability of being a negative example is markedly higher than its probability of being a positive example [8]
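A common log-odds-ratio form of this criterion is sketched below; the add-one smoothing is my assumption to keep the ratio defined for unseen counts, and may not match the paper's estimator. Strongly negative values mark candidate stop words.

```python
import math

def log_odds_ratio(pos_w, pos_total, neg_w, neg_total):
    """log [ P(w|pos)(1-P(w|neg)) / (P(w|neg)(1-P(w|pos))) ] with add-one
    smoothing; pos_w/neg_w count occurrences of type w in each class."""
    p = (pos_w + 1) / (pos_total + 2)
    q = (neg_w + 1) / (neg_total + 2)
    return math.log(p * (1 - q) / (q * (1 - p)))
```

A type split evenly across classes scores 0; a type seen almost only in the negative class scores well below 0 and would be filtered.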

Optimization Issues
• How to find the optimal threshold for a Stop Word Filter?
• To solve this problem, we observe the behaviors of ψ and ψ+
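One way to operationalize "observing ψ and ψ+" (and the tolerance ε mentioned in the results slides) is a greedy sweep: grow the stop list in decreasing score order for as long as the Positive Filtering Rate on the training set stays within ε. This is a hypothetical reconstruction of the optimization step, not the paper's exact algorithm.

```python
def build_stop_list(score, pos_count, total_positives, epsilon):
    """score[w]: stop-word score of type w (higher = more stop-like).
    pos_count[w]: positive training tokens whose type is w.
    Adds types in decreasing score order while psi+ stays <= epsilon."""
    stop, removed_pos = [], 0
    for w in sorted(score, key=score.get, reverse=True):
        if (removed_pos + pos_count.get(w, 0)) / total_positives <= epsilon:
            stop.append(w)
            removed_pos += pos_count.get(w, 0)
        else:
            break
    return stop
```

With ε = 0.1 the sweep stops as soon as the next type would remove too many positive tokens, maximizing ψ under the ψ+ constraint.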

A Simple Information Extraction System (1/4)
• In the training phase, SIE learns off-line a set of data models from a corpus prepared in IOBE format (see 4.1)
• In the classification phase, these models are applied to tag new documents

A Simple Information Extraction System (2/4)
• Input Format
─ the corpus must be prepared in IOBE notation
• Instance Filtering Module
─ implements the three different Stop Word Filters
─ separate Stop Word Lists are provided for the beginning and the end boundaries of each entity, as SIE learns two distinct classifiers for them
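A hypothetical sentence in IOBE-style notation is shown below: B opens an entity, I continues it, E closes it, and O marks tokens outside any entity. The exact tag inventory SIE uses may differ; this only illustrates the one-label-per-token layout the corpus must be prepared in.

```python
# Toy IOBE-tagged sentence (the entity span and type are illustrative).
tokens = ["The", "epidermal", "growth", "factor", "receptor", "is", "expressed"]
tags   = ["O", "B-protein", "I-protein", "I-protein", "E-protein", "O", "O"]
```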

A Simple Information Extraction System (3/4)
• Feature Extraction
─ extracts a predefined set of features for each unfiltered token in both the training and the test sets
• Classification
─ SIE approaches the IE task as a classification problem, assigning an appropriate classification label to each unfiltered token
─ SVMlight is used for training the classifiers

A Simple Information Extraction System (4/4)
• Tag Matcher
─ all positive predictions produced by the begin and end classifiers are paired by the Tag Matcher module, which provides the final output of the system
─ it assigns a score to each candidate entity; if nested or overlapping entities occur, it selects the entity with the maximal score
─ the score of each entity is proportional to the entity length probability (i.e. the probability that an entity has a certain length) and to the confidence provided by the classifiers for the boundary predictions
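The pairing and scoring described above can be sketched as follows. The greedy non-overlap selection and the exact score (length probability times the two boundary confidences) are my simplified reading of the Tag Matcher, not its actual implementation.

```python
def match_tags(begins, ends, length_prob):
    """begins/ends: lists of (token position, classifier confidence).
    Scores every begin-end pair, then greedily keeps non-overlapping
    entities by decreasing score."""
    candidates = []
    for b, cb in begins:
        for e, ce in ends:
            if e >= b:
                # score = P(length) * begin confidence * end confidence
                candidates.append((length_prob.get(e - b + 1, 0.0) * cb * ce, b, e))
    chosen = []
    for s, b, e in sorted(candidates, reverse=True):
        if all(e < b2 or b > e2 for _, b2, e2 in chosen):
            chosen.append((s, b, e))
    return sorted((b, e) for _, b, e in chosen)
```

With two begin and two end predictions, the overlapping low-score pairing is discarded and two disjoint entities survive.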

Evaluation
• To assess the portability and language independence of the filtering techniques, a set of comparative experiments was performed on three different tasks in two different languages (see Subsection 5.1)

Task Descriptions
• JNLPBA
─ International Joint Workshop on Natural Language Processing in Biomedicine and its Applications
─ five entity types: DNA, RNA, protein, cell-line and cell-type
• CoNLL-2002
─ recognize named entities in Dutch texts
─ four types of named entities: persons, locations, organizations and miscellaneous names
• TERN
─ the TERN (Time Expression Recognition and Normalization) task

Filtering Rates (1/2)
• The results indicate that
─ both CC and OR exhibit good performance and are far better than IC in all the tasks
─ the optimization strategy is robust against over-fitting

Filtering Rates (2/2)
• A significant reduction of the data skewness is also reported
─ Table 3 shows that all the IF techniques considerably reduce the data skewness on the JNLPBA data set
─ as expected, both CC and OR consistently outperform IC

Time Reduction
• Figure 4 displays the impact of IF on the computation time required to perform the overall IE process
─ the cost of the IF optimization process is negligible
─ the curves indicate that both CC and OR are far superior to IC, allowing a drastic reduction of the time

Prediction Accuracy
• Figure 5 plots the values of the micro-averaged F-measure
─ both OR and CC drastically reduce the computation time while maintaining the prediction accuracy for small values of ε

Comparison with the State of the Art
• Tables 4, 5 and 6 summarize the performance of SIE compared to the baselines and to the best systems in all the tasks

Conclusion
• Instance Filtering is a preprocessing technique that alleviates two relevant problems of classification-based learning: skewed class distribution and the high complexity induced by large data sets
• An important advantage of Instance Filtering is the reduction of the computation time required by the entity recognition system to perform both training and classification
• A class of instance filters based on feature selection metrics, the Stop Word Filters, was presented
• The experiments show results close to the state of the art