ONTOLOGY LEARNING AND POPULATION FROM TEXT Ch8 Population.

Population
- Population of an ontology: finding instances of relations as well as of concepts
- Requires full understanding of natural language
- More modest target: the extraction of a set of predefined relations
- In this chapter:
  - No acquisition of instances of relations
  - The detection of instances of concepts

Population
- Common Approaches
- Corpus-based Population
  - A standard similarity-based approach
- Learning by Googling
  - Semi-supervised approach
  - PANKOW
  - C-PANKOW

Common Approaches
- Lexico-syntactic patterns
  - Hearst patterns
- Similarity-based classification
  - Algorithm 12
  - Data sparseness problem
- Supervised approaches
  - Predict the category of a given instance with a trained model
  - Require thousands of training examples to train the model
  - Not feasible when hundreds of concepts are possible tags

Similarity-based Classification of Named Entities
- Using different similarity measures
  - Cosine, Jaccard, L1 norm, Jensen-Shannon, Skew
- Using different feature weighting measures
  - Conditional, PMI, Resnik
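
Three of the measures named on this slide can be sketched over sparse context vectors. This is a minimal illustration, not the presenters' implementation: the function names are hypothetical, vectors are plain feature-to-count dictionaries, and the L1 norm is computed over relative frequencies, so it is a distance (0 means identical) rather than a similarity.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse feature vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def jaccard(a, b):
    """Jaccard coefficient over the sets of features with non-zero weight."""
    fa = {k for k, v in a.items() if v}
    fb = {k for k, v in b.items() if v}
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0

def l1_distance(a, b):
    """L1 norm of the difference between the two relative-frequency distributions."""
    total_a, total_b = sum(a.values()), sum(b.values())
    if not total_a or not total_b:
        return 0.0
    keys = a.keys() | b.keys()
    return sum(abs(a.get(k, 0) / total_a - b.get(k, 0) / total_b) for k in keys)

# Feature counts taken from the word-window example in this transcript.
gao = {"San": 1, "offer": 1, "town": 1, "junction": 1}
san = {"offer": 1, "view": 1, "Gao": 1, "nice": 1}

print(cosine(gao, san))       # 0.25 (one shared feature, both norms equal 2)
print(jaccard(gao, san))      # 1 shared feature out of 7 distinct ones
print(l1_distance(gao, san))  # 1.5
```

Jensen-Shannon and skew divergence operate on the same relative-frequency distributions and are omitted here for brevity.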

Evaluation
- Goal: learn a function f_s
- f_a and f_b: the functions specified by two annotators
- Functions are treated as sets
- Measurement: precision, recall, F-measure, learning accuracy
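
The formulas on this slide did not survive in the transcript. As a hedged sketch, the code uses the standard set-based definitions: each function is viewed as a set of (instance, concept) pairs, and the system's set is compared with each annotator's set. The helper names and the example labels are hypothetical, and learning accuracy is not covered because it needs the concept taxonomy.

```python
def precision_recall_f(system, reference):
    """Compare two instance -> concept mappings viewed as sets of pairs."""
    s = set(system.items())
    r = set(reference.items())
    correct = len(s & r)
    precision = correct / len(s) if s else 0.0
    recall = correct / len(r) if r else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

def averaged(system, references):
    """Average precision, recall and F-measure over several reference annotators."""
    scores = [precision_recall_f(system, ref) for ref in references]
    return tuple(sum(column) / len(scores) for column in zip(*scores))

# Made-up labels, for illustration only.
system = {"Mopti": "city", "Niger": "river", "Gao": "country"}
annotator_a = {"Mopti": "city", "Niger": "river", "Gao": "city", "San": "city"}
annotator_b = {"Mopti": "city", "Niger": "river", "Gao": "city"}

print(precision_recall_f(system, annotator_a))
print(averaged(system, [annotator_a, annotator_b]))
```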

Experiments
- Using word windows
  - n words to the left and right of a word of interest
  - Excluding stopwords, without trespassing sentence boundaries
- Example text: "Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta."
- Extracted features:
  - Mopti: traditional(1), biggest(1)
  - Niger: city(1), delta(1), view(1)
  - Gao: San(1), offer(1), town(1), junction(1)
  - San: offer(1), view(1), Gao(1), nice(1)
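
A word-window extractor along these lines might look as follows. It is a sketch under stated assumptions: the stopword list is a tiny made-up one, stopwords are dropped before the window is taken, sentences are split on end punctuation, and no lemmatization is applied, so it will not reproduce every feature listed on the slide.

```python
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption; the slides do not give one).
STOPWORDS = {"a", "an", "the", "is", "has", "of", "and", "with",
             "that", "to", "it", "from", "over", "along"}

def window_features(text, targets, n=1):
    """Count the n non-stopword neighbours on each side of every target word."""
    features = defaultdict(Counter)
    # Processing sentence by sentence keeps windows inside sentence boundaries.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        sentence = sentence.replace("'s", "")
        tokens = [t for t in re.findall(r"[A-Za-z-]+", sentence)
                  if t.lower() not in STOPWORDS]
        for i, token in enumerate(tokens):
            if token in targets:
                context = tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n]
                for word in context:
                    features[token][word.lower()] += 1
    return features

text = "Mopti is the biggest city. Mopti has a traditional ambience."
features = window_features(text, {"Mopti"}, n=1)
print(dict(features["Mopti"]))  # {'biggest': 1, 'traditional': 1}
```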

Experiments Result:

Experiments
- Using pseudo-syntactic dependencies
  - Object-attribute pairs
- Example text: "Mopti is the biggest city along the Niger with one of the most vibrant ports and a large bustling market. Mopti has a traditional ambience that other towns seem to have lost. It is also the center of the local tourist industry and suffers from hard-sell overload. The nearby junction towns of Gao and San offer nice views over the Niger's delta."
- Extracted features:
  - Mopti: is_city(1), has_ambience(1)
  - Niger: has_delta(1)
  - Gao: junction_of(1)
  - San: offer_subj(1)
- Result:

Experiments
- Dealing with data sparseness: using conjunctions
  - Applies when two named entities are linked by a conjunction (as in "Gao and San")
- Result:

Experiments
- Dealing with data sparseness: exploiting the taxonomy
  - Compute the context vector of a term by considering the context vectors of its subconcepts
  - Take into account only the context vectors of direct subconcepts
  - Normalizing the aggregated vectors:
    - Standard normalization of the vector
    - Calculating its centroid
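
One possible reading of this aggregation step, sketched with hypothetical names and made-up counts: a concept's vector is the sum of its own vector and those of its direct subconcepts, after which the result is either length-normalized or replaced by the centroid (the component-wise mean) of the contributing vectors. How the slides define "standard normalization" is not stated, so L2 normalization is an assumption.

```python
import math
from collections import Counter

def aggregate(concept, vectors, children):
    """Sum the concept's own context vector with those of its direct subconcepts."""
    total = Counter(vectors.get(concept, {}))
    for child in children.get(concept, []):
        total.update(vectors.get(child, {}))  # Counter.update adds counts
    return total

def l2_normalize(vector):
    """Scale a vector to unit length (assumed meaning of 'standard normalization')."""
    norm = math.sqrt(sum(v * v for v in vector.values()))
    return {k: v / norm for k, v in vector.items()} if norm else {}

def centroid(vector_list):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for vector in vector_list:
        total.update(vector)
    return {k: v / len(vector_list) for k, v in total.items()} if vector_list else {}

# Made-up feature counts and a two-level taxonomy, for illustration only.
vectors = {
    "city": {"has_port": 1},
    "capital": {"has_port": 1, "has_government": 3},
    "town": {"has_market": 1},
}
children = {"city": ["capital", "town"]}

summed = aggregate("city", vectors, children)
print(dict(summed))
print(l2_normalize(summed))
print(centroid([vectors["city"], vectors["capital"], vectors["town"]]))
```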

Experiments Dealing with Data Sparseness Exploiting the Taxonomy Result:

Experiments
- Dealing with data sparseness: anaphora resolution
  - Replace each anaphoric reference with the corresponding antecedent
  - Before: "The port capital of Vathy is dominated by its fortified Venetian harbor."
  - After: "The port capital of Vathy is dominated by Vathy's fortified Venetian harbor."
- Result:

Experiments
- Dealing with data sparseness: downloading documents from the Web
  - Download 20 additional documents D_i for each named entity i
  - Keep only those documents d in D_i whose similarity is above a threshold of 0.2
- Result:
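
The filtering step can be sketched as follows. The slide does not say what each downloaded document is compared against or which similarity is used, so this sketch assumes bag-of-words cosine similarity against a reference text for the entity. The function names and the example texts are hypothetical, and no real download takes place.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def filter_documents(reference_text, candidates, threshold=0.2, limit=20):
    """Keep at most `limit` candidates whose similarity to the reference exceeds `threshold`."""
    reference = bag_of_words(reference_text)
    return [doc for doc in candidates[:limit]
            if cosine(bag_of_words(doc), reference) > threshold]

# Made-up texts standing in for the corpus context and the downloaded documents.
reference = "Mopti is a city on the Niger river with a busy port"
downloaded = [
    "The port of Mopti lies on the Niger river",
    "Stock prices fell sharply on Monday",
]
kept = filter_documents(reference, downloaded)
print(kept)  # only the first document passes the 0.2 threshold
```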

Experiments
- Dealing with data sparseness: post-processing
  - The k best answers of the system are checked for their statistical plausibility on the Web
- Result:

PANKOW
- Pattern-based Annotation through Knowledge on the Web
- Certain lexico-syntactic patterns, as defined by Hearst, can be matched not only in a corpus but also on the World Wide Web

PANKOW
- The process of PANKOW
  - Step 1: Iterate over the set of entities to be classified and generate instances of patterns, one for each concept in the ontology.
    - Example: for the instance "South Africa" and the concepts "country" and "hotel", the resulting pattern instances are "South Africa is a country" and "South Africa is a hotel", or "countries such as South Africa" and "hotels such as South Africa".
    - Result 1: a set of pattern instances
  - Step 2: Google is queried for the pattern instances through its Web service API.
    - Result 2: the counts for each pattern instance
  - Step 3: Sum up the query results to a total for each concept.
    - Result: the statistical web fingerprint for each entity, that is, the result of aggregating, for each entity, the number of Google counts for all pattern instances conveying the relation of interest.
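
The three steps can be sketched in a few lines. Everything here is illustrative: the search engine is replaced by an injected `hit_count` callable backed by made-up numbers, only the two patterns from the slide's example are used, `pluralize` is a naive helper, and choosing the concept with the highest total is an assumption about how the fingerprint is used.

```python
from collections import Counter

def pluralize(noun):
    """Naive English plural, sufficient for the example nouns only."""
    if noun.endswith("y") and len(noun) > 1 and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"

def pattern_instances(instance, concept):
    """Step 1: instantiate each pattern for an (instance, concept) pair."""
    return [
        f"{instance} is a {concept}",
        f"{pluralize(concept)} such as {instance}",
    ]

def web_fingerprint(instance, concepts, hit_count):
    """Steps 2 and 3: query every pattern instance and sum the counts per concept."""
    return Counter({
        concept: sum(hit_count(q) for q in pattern_instances(instance, concept))
        for concept in concepts
    })

def classify(instance, concepts, hit_count):
    """Pick the concept with the largest total (assumed use of the fingerprint)."""
    concept, count = web_fingerprint(instance, concepts, hit_count).most_common(1)[0]
    return concept if count > 0 else None

# Made-up hit counts standing in for real search results.
FAKE_COUNTS = {
    "South Africa is a country": 1200,
    "countries such as South Africa": 3400,
    "South Africa is a hotel": 3,
}

def fake_hit_count(query):
    return FAKE_COUNTS.get(query, 0)

fingerprint = web_fingerprint("South Africa", ["country", "hotel"], fake_hit_count)
print(fingerprint)
print(classify("South Africa", ["country", "hotel"], fake_hit_count))  # country
```

Passing the counting function in as a parameter keeps the sketch independent of any particular search API.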

PANKOW
- Evaluation
  - Reference standards from the two annotators, subjects A and B
  - Measurement: precision, recall, and F-measure

PANKOW
- Evaluation
  - Measurement: average the results for both annotators

PANKOW Result:

C-PANKOW
- Shortcomings of PANKOW
  - Many actual instances of the pattern schema are not found
  - A large number of queries is sent to the Google Web API
  - Does not scale to larger ontologies

C-PANKOW
- C-PANKOW process
  1. The web page to be annotated is scanned for candidate instances.
  2. For each instance i discovered and for each clue-pattern pair in the pattern library P, an automatically generated query is issued to Google, and the abstracts (snippets) of the first n hits are downloaded.
  3. The similarity between the document to be annotated and each downloaded abstract is calculated.
  4. If the similarity is above a given threshold t, the pattern found in the abstract reveals a phrase that may describe the concept the instance belongs to in the context in question. A pattern matched in an abstract is considered only if this condition holds; in this way the pattern-matching process is contextualized.
  5. Finally, the instance i is annotated with the concept c that has the largest number of hits as well as the most contextually relevant ones.
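
The contextualized matching in steps 3 to 5 can be sketched as follows. This rests on several assumptions: snippets are passed in rather than fetched, similarity is bag-of-words cosine, the pattern library is reduced to two regular expressions, matched phrases are not lemmatized, and ties in hit count are broken by summed similarity. The threshold value, the function names and the example snippets are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

def bag(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def concept_candidates(instance, snippet):
    """Words that fill the concept slot of two simple Hearst-style patterns."""
    inst = re.escape(instance)
    patterns = [rf"{inst} is an? (\w+)", rf"(\w+) such as {inst}"]
    return [m.group(1).lower()
            for p in patterns
            for m in re.finditer(p, snippet, flags=re.IGNORECASE)]

def annotate(page_text, instance, snippets, threshold):
    """Count pattern matches only in snippets similar enough to the page."""
    page = bag(page_text)
    hits = Counter()
    relevance = defaultdict(float)
    for snippet in snippets:
        similarity = cosine(page, bag(snippet))
        if similarity <= threshold:
            continue  # out-of-context snippet: ignore its pattern matches
        for concept in concept_candidates(instance, snippet):
            hits[concept] += 1
            relevance[concept] += similarity
    if not hits:
        return None
    return max(hits, key=lambda c: (hits[c], relevance[c]))

# Made-up page and snippets, for illustration only.
page = ("Niger travel guide: Mopti has a busy river port "
        "and a lively market near the delta.")
snippets = [
    "Mopti is a city with a river port and a large market.",
    "Mopti is a band whose new album tops the charts.",
    "Mopti is a band from Oslo.",
]
print(annotate(page, "Mopti", snippets, threshold=0.4))  # city
print(annotate(page, "Mopti", snippets, threshold=0.0))  # band
```

With the threshold at 0.4, only the first snippet is close enough to the page, so the two off-topic "band" matches are ignored. With the threshold at 0.0, raw counts decide and "band" wins.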

C-PANKOW
- Evaluation
  - Same dataset and evaluation measures as PANKOW
  - However, C-PANKOW uses the 682 concepts of the pruned Tourism ontology as possible tags
  - Learning accuracy is added as a measure

C-PANKOW Result:
